Top Albums and Runtime with Web Scraping

Top Albums and Runtime with Web Scraping#

This project was very open ended; we had to think of a question that we could solve through web scraping data.
My question was related to my interests in the music world, friends of mine always said “short pop songs are all that matter” so I wanted to see if album run time varied across the hit albums of the last few years.
This was a fantastic exercise in actually understanding how data scientists might collect data but also how much is needed to come to a real conclusion, my analysis was good but it was clear I need another piece to this story, data that would show what the average album runtime was to compare.
It’s also interesting because Taylor swift dominated the charts and had relatively long albums, but that doesn’t mean we should assume she’s popular from this alone as there are obviously other factors in play.
A more interesting route would be to see if upcoming stars and groups tend to release longer or shorter albums.

Question: Does album run time have anything to do with being a top selling album in a year?#

import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns

This is a link to the starting wiki: albums

The bs4 object#

After first retreaving and parsing the html I explored the websit using inspect and found ‘tbody’ was holding the lists of names and links. All the links to the individual albums were wrapped in i tags, using “.a” so we isolate the individual links and names.

albums_url = 'https://en.wikipedia.org/wiki/Top_Album_Sales#2020'
albums_html = requests.get(albums_url).content
cs_albums = BeautifulSoup(albums_html, 'html.parser')
type(cs_albums)

bs4.BeautifulSoup

wiki_content = cs_albums.find_all('tbody')

wiki_content[1].find_all('i')

[<i><a class="mw-redirect" href="/wiki/1989_(Taylor_Swift_album)" title="1989 (Taylor Swift album)">1989</a></i>,
 <i><a href="/wiki/2014_Forest_Hills_Drive" title="2014 Forest Hills Drive">2014 Forest Hills Drive</a></i>]

wiki_content[1].find('i').a

<a class="mw-redirect" href="/wiki/1989_(Taylor_Swift_album)" title="1989 (Taylor Swift album)">1989</a>

wiki_content[1].find('i').a.string

'1989'

Getting consistency#

Searching trhrough tags we can scrape a single name, now we can apply this to the entire wiki_content. I define a start to the url since its not included in the href to construct working url’s. The next step was isolating the runtime and year released from the constructed link, then applying the method to everything.

album_names = [i.a.string for i in wiki_content[1].find_all('i')]

album_names

['1989', '2014 Forest Hills Drive']

album_names = [i.a.string for name in wiki_content 
               for i in name.find_all('i') 
               if i.a and i.a.string]

wiki_content[1].find('i').a

<a class="mw-redirect" href="/wiki/1989_(Taylor_Swift_album)" title="1989 (Taylor Swift album)">1989</a>

wiki_content[1].find('i').a['href']

'/wiki/1989_(Taylor_Swift_album)'

url_start = "https://en.wikipedia.org"

test_url = url_start + wiki_content[1].find('i').a['href']
test_url

'https://en.wikipedia.org/wiki/1989_(Taylor_Swift_album)'

Getting the runtime and year released from the constructed link#

test_html = requests.get(test_url).content
test_info = BeautifulSoup(test_html,'html.parser')

test_info.find('span', class_="min")

<span class="min">48</span>

test_min = test_info.find('span', class_="min").string
test_min = int(test_min.strip())
test_min

test_info.find('span', class_='bday dtstart published updated itvstart').string

'2014-10-27'

test_date = test_info.find('span', class_='bday dtstart published updated itvstart').string
test_date

'2014-10-27'

test_year = test_date.split('-')[0]
test_year

'2014'

Method to construct Links#

album_links = [url_start + i.a['href'] 
               for link in wiki_content 
               for i in link.find_all('i') 
               if i.a and 'href' in i.a.attrs]

album_links[1]

'https://en.wikipedia.org/wiki/2014_Forest_Hills_Drive'

After finding out how to extract the run time I use a loop to do requests through all associated links#

album_years = []
run_time = []
time_size = []

I also added a time_size catagory to help group by times

for test_url in album_links:
    
    test_html = requests.get(test_url).content
    test_info = BeautifulSoup(test_html,'html.parser')
    try: 
        test_min = test_info.find('span', class_="min").string
        test_date = test_info.find('span', class_='bday dtstart published updated itvstart').string
        test_year = test_date.split('-')[0]
        
        test_year = int(test_year.strip())
        album_years.append(test_year)
        
        test_min = int(test_min.strip())
        run_time.append(test_min)
        
        
        if test_min < 20:
            time_size.append('Very Short')
        elif test_min >= 20 and test_min < 35:
            time_size.append('Short')
        elif test_min >= 35 and test_min < 60:
            time_size.append('Average')
        elif test_min >= 60:
            time_size.append('Long')
        else:
            time_size.append(pd.NA)
    except:
        run_time.append(pd.NA)
        time_size.append(pd.NA)
        album_years.append(pd.NA)

albums_df = pd.DataFrame({'Album Name':album_names, 'Release Year':album_years, 'Length Rating':time_size, 'Run-Time(Mins)':run_time,'links':album_links})

albums_df

	Album Name	Release Year	Length Rating	Run-Time(Mins)	links
0	1989	2014	Average	48	https://en.wikipedia.org/wiki/1989_(Taylor_Swi...
1	2014 Forest Hills Drive	2014	Long	64	https://en.wikipedia.org/wiki/2014_Forest_Hill...
2	1989	2014	Average	48	https://en.wikipedia.org/wiki/1989_(Taylor_Swi...
3	Title	2015	Short	32	https://en.wikipedia.org/wiki/Title_(album)
4	American Beauty/American Psycho	<NA>	<NA>	<NA>	https://en.wikipedia.org/wiki/American_Beauty/...
...	...	...	...	...	...
444	Billboard Philippines	<NA>	<NA>	<NA>	https://en.wikipedia.org/wiki/Billboard_Philip...
445	Billboard Türkiye	<NA>	<NA>	<NA>	https://en.wikipedia.org/wiki/Billboard_T%C3%B...
446	Billboard Việt Nam	<NA>	<NA>	<NA>	https://en.wikipedia.org/wiki/Billboard_Vi%E1%...
447	Music & Media	<NA>	<NA>	<NA>	https://en.wikipedia.org/wiki/Music_%26_Media
448	Billboard Radio Monitor	<NA>	<NA>	<NA>	https://en.wikipedia.org/wiki/Billboard_Radio_...

449 rows × 5 columns

I start cleaning my data br removing all na’s and duplicates#

albums_clean_df = albums_df.dropna()

albums_clean_df.describe()

	Album Name	Release Year	Length Rating	Run-Time(Mins)	links
count	355	355	355	355	355
unique	302	14	4	79	303
top	A Star Is Born	2017	Average	48	https://en.wikipedia.org/wiki/A_Star_Is_Born_(...
freq	8	47	203	18	8

albums_clean_df =albums_clean_df.drop_duplicates()

albums_clean_df

	Album Name	Release Year	Length Rating	Run-Time(Mins)	links
0	1989	2014	Average	48	https://en.wikipedia.org/wiki/1989_(Taylor_Swi...
1	2014 Forest Hills Drive	2014	Long	64	https://en.wikipedia.org/wiki/2014_Forest_Hill...
3	Title	2015	Short	32	https://en.wikipedia.org/wiki/Title_(album)
6	Now 53	2015	Long	74	https://en.wikipedia.org/wiki/Now_That%27s_Wha...
7	If You're Reading This It's Too Late	2015	Long	68	https://en.wikipedia.org/wiki/If_You%27re_Read...
...	...	...	...	...	...
428	F-1 Trillion	2024	Average	57	https://en.wikipedia.org/wiki/F-1_Trillion
429	Days Before Rodeo	2014	Average	50	https://en.wikipedia.org/wiki/Days_Before_Rodeo
430	Crazy	2024	Very Short	14	https://en.wikipedia.org/wiki/Crazy_(Le_Sseraf...
431	Luck and Strange	2024	Average	43	https://en.wikipedia.org/wiki/Luck_and_Strange
433	The Rise and Fall of a Midwest Princess	2023	Average	49	https://en.wikipedia.org/wiki/The_Rise_and_Fal...

304 rows × 5 columns

albums_clean_df.describe()

	Album Name	Release Year	Length Rating	Run-Time(Mins)	links
count	304	304	304	304	304
unique	302	14	4	79	303
top	The Album	2017	Average	41	https://en.wikipedia.org/wiki/Face_the_Sun
freq	2	42	179	16	2

Removing irrelevant Data#

I noticed we really only get relevant data from starting from 2015, theres not much to comapir from 2014 since only 4 albums are listed and the years below that seem to be dummy variables.

albums_clean_df['Release Year'].value_counts()

Release Year
  42
  34
  34
  31
  30
  30
  29
  28
  20
  19
   4
   1
   1
   1
Name: count, dtype: int64

albums_clean_df = albums_clean_df[albums_clean_df['Release Year'] >= 2015]

albums_clean_df['Release Year'].value_counts()

Release Year
  42
  34
  34
  31
  30
  30
  29
  28
  20
  19
Name: count, dtype: int64

albums_clean_df['Length Rating'].value_counts()

Length Rating
Average       175
Long           69
Short          38
Very Short     15
Name: count, dtype: int64

albums_clean_df['Run-Time(Mins)'].mean()

50.51851851851852

albums_clean_df.groupby('Length Rating')['Run-Time(Mins)'].mean()

Length Rating
Average           45.56
Long          83.057971
Short         28.052632
Very Short         15.6
Name: Run-Time(Mins), dtype: object

sns.lineplot(data=albums_clean_df,x='Release Year', y='Run-Time(Mins)')

<Axes: xlabel='Release Year', ylabel='Run-Time(Mins)'>

png

sns.catplot(data=albums_clean_df,x='Release Year', y='Run-Time(Mins)',kind='bar', hue='Length Rating')

<seaborn.axisgrid.FacetGrid at 0x1a0ea0538f0>

png

Summary#

Overall, top chart albums stay around 45 minutes. If we combine counts of small ,very small, and long albums we come to 122 which is still lower than average at 175. I would say this leads to inconclusive data, if anything this may say more about the amount of work the music indusry puts into projects. Removing duplicates was useful but It would be more useful to observe months in the year to see if theres any dominating artists that write longer or shorter albums.

albums_clean_df.to_csv('Album_Runtime.csv', index=False)