Top Albums and Runtime with Web Scraping#

  • This project was very open ended; we had to think of a question that we could solve through web scraping data.

  • My question was related to my interests in the music world, friends of mine always said “short pop songs are all that matter” so I wanted to see if album run time varied across the hit albums of the last few years.

  • This was a fantastic exercise in actually understanding how data scientists might collect data but also how much is needed to come to a real conclusion, my analysis was good but it was clear I need another piece to this story, data that would show what the average album runtime was to compare.

  • It’s also interesting because Taylor swift dominated the charts and had relatively long albums, but that doesn’t mean we should assume she’s popular from this alone as there are obviously other factors in play.

  • A more interesting route would be to see if upcoming stars and groups tend to release longer or shorter albums.

Question: Does album run time have anything to do with being a top selling album in a year?#

import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns

This is a link to the starting wiki: albums

The bs4 object#

After first retreaving and parsing the html I explored the websit using inspect and found ‘tbody’ was holding the lists of names and links. All the links to the individual albums were wrapped in i tags, using “.a” so we isolate the individual links and names.

albums_url = 'https://en.wikipedia.org/wiki/Top_Album_Sales#2020'
albums_html = requests.get(albums_url).content
cs_albums = BeautifulSoup(albums_html, 'html.parser')
type(cs_albums)
bs4.BeautifulSoup
wiki_content = cs_albums.find_all('tbody')
wiki_content[1].find_all('i')
[<i><a class="mw-redirect" href="/wiki/1989_(Taylor_Swift_album)" title="1989 (Taylor Swift album)">1989</a></i>,
 <i><a href="/wiki/2014_Forest_Hills_Drive" title="2014 Forest Hills Drive">2014 Forest Hills Drive</a></i>]
wiki_content[1].find('i').a
<a class="mw-redirect" href="/wiki/1989_(Taylor_Swift_album)" title="1989 (Taylor Swift album)">1989</a>
wiki_content[1].find('i').a.string
'1989'

Getting consistency#

Searching trhrough tags we can scrape a single name, now we can apply this to the entire wiki_content. I define a start to the url since its not included in the href to construct working url’s. The next step was isolating the runtime and year released from the constructed link, then applying the method to everything.

album_names = [i.a.string for i in wiki_content[1].find_all('i')]
album_names
['1989', '2014 Forest Hills Drive']
album_names = [i.a.string for name in wiki_content 
               for i in name.find_all('i') 
               if i.a and i.a.string]
wiki_content[1].find('i').a
<a class="mw-redirect" href="/wiki/1989_(Taylor_Swift_album)" title="1989 (Taylor Swift album)">1989</a>
wiki_content[1].find('i').a['href']
'/wiki/1989_(Taylor_Swift_album)'
url_start = "https://en.wikipedia.org"
test_url = url_start + wiki_content[1].find('i').a['href']
test_url
'https://en.wikipedia.org/wiki/1989_(Taylor_Swift_album)'

I start cleaning my data br removing all na’s and duplicates#

albums_clean_df = albums_df.dropna()
albums_clean_df.describe()
Album Name Release Year Length Rating Run-Time(Mins) links
count 355 355 355 355 355
unique 302 14 4 79 303
top A Star Is Born 2017 Average 48 https://en.wikipedia.org/wiki/A_Star_Is_Born_(...
freq 8 47 203 18 8
albums_clean_df =albums_clean_df.drop_duplicates()
albums_clean_df
Album Name Release Year Length Rating Run-Time(Mins) links
0 1989 2014 Average 48 https://en.wikipedia.org/wiki/1989_(Taylor_Swi...
1 2014 Forest Hills Drive 2014 Long 64 https://en.wikipedia.org/wiki/2014_Forest_Hill...
3 Title 2015 Short 32 https://en.wikipedia.org/wiki/Title_(album)
6 Now 53 2015 Long 74 https://en.wikipedia.org/wiki/Now_That%27s_Wha...
7 If You're Reading This It's Too Late 2015 Long 68 https://en.wikipedia.org/wiki/If_You%27re_Read...
... ... ... ... ... ...
428 F-1 Trillion 2024 Average 57 https://en.wikipedia.org/wiki/F-1_Trillion
429 Days Before Rodeo 2014 Average 50 https://en.wikipedia.org/wiki/Days_Before_Rodeo
430 Crazy 2024 Very Short 14 https://en.wikipedia.org/wiki/Crazy_(Le_Sseraf...
431 Luck and Strange 2024 Average 43 https://en.wikipedia.org/wiki/Luck_and_Strange
433 The Rise and Fall of a Midwest Princess 2023 Average 49 https://en.wikipedia.org/wiki/The_Rise_and_Fal...

304 rows × 5 columns

albums_clean_df.describe()
Album Name Release Year Length Rating Run-Time(Mins) links
count 304 304 304 304 304
unique 302 14 4 79 303
top The Album 2017 Average 41 https://en.wikipedia.org/wiki/Face_the_Sun
freq 2 42 179 16 2

Removing irrelevant Data#

I noticed we really only get relevant data from starting from 2015, theres not much to comapir from 2014 since only 4 albums are listed and the years below that seem to be dummy variables.

albums_clean_df['Release Year'].value_counts()
Release Year
2017    42
2016    34
2020    34
2019    31
2021    30
2023    30
2018    29
2015    28
2022    20
2024    19
2014     4
2001     1
1967     1
1969     1
Name: count, dtype: int64
albums_clean_df = albums_clean_df[albums_clean_df['Release Year'] >= 2015]
albums_clean_df['Release Year'].value_counts()
Release Year
2017    42
2016    34
2020    34
2019    31
2021    30
2023    30
2018    29
2015    28
2022    20
2024    19
Name: count, dtype: int64
albums_clean_df['Length Rating'].value_counts()
Length Rating
Average       175
Long           69
Short          38
Very Short     15
Name: count, dtype: int64
albums_clean_df['Run-Time(Mins)'].mean()
50.51851851851852
albums_clean_df.groupby('Length Rating')['Run-Time(Mins)'].mean()
Length Rating
Average           45.56
Long          83.057971
Short         28.052632
Very Short         15.6
Name: Run-Time(Mins), dtype: object
sns.lineplot(data=albums_clean_df,x='Release Year', y='Run-Time(Mins)')
<Axes: xlabel='Release Year', ylabel='Run-Time(Mins)'>

png

sns.catplot(data=albums_clean_df,x='Release Year', y='Run-Time(Mins)',kind='bar', hue='Length Rating')
<seaborn.axisgrid.FacetGrid at 0x1a0ea0538f0>

png

Summary#

Overall, top chart albums stay around 45 minutes. If we combine counts of small ,very small, and long albums we come to 122 which is still lower than average at 175. I would say this leads to inconclusive data, if anything this may say more about the amount of work the music indusry puts into projects. Removing duplicates was useful but It would be more useful to observe months in the year to see if theres any dominating artists that write longer or shorter albums.

albums_clean_df.to_csv('Album_Runtime.csv', index=False)