CSCI 5622 - MACHINE LEARNING (SPRING 2023)
UNIVERSITY OF COLORADO BOULDER
BY: SOPHIA KALTSOUNI MEHDIZADEH
DATA PREPARATION & EXPLORATION
Data Sources
This project uses a subset of the Million Song Dataset (MSD) called the Taste Profile subset. This dataset includes over 48 million observations of user-song-playcount data. It is an interesting dataset for exploring musical preferences: it includes over 1 million unique users and almost 400,000 unique songs, and "play counts" may be used as a naturalistic analogue for a preference measure. The data file can be downloaded from the Taste Profile page (linked above) as a .txt file (under "Getting the dataset" > "TRIPLETS FOR 1M USERS"). A small preview of this raw data frame is shown to the right (Figure 3).
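For illustration, the triplets file can be loaded with pandas along these lines. This is a minimal sketch: it assumes the downloaded file keeps its default name, and since the raw file has no header row, the column names below are chosen for readability rather than taken from the file.

```python
import pandas as pd

# Load the Taste Profile triplets: tab-separated user ID, song ID, play count.
taste = pd.read_csv(
    "train_triplets.txt",
    sep="\t",
    names=["userID", "songID", "playcount"],
)
print(taste.head())  # cf. Figure 3
```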
Some additional files from the MSD will also be needed for this analysis. The Taste Profile data frame includes only a song ID, with no readable song or artist name. This information can be acquired from the MSD Additional Files (under "Additional Files" > "#1. List of all track Echo Nest ID"). A small preview of this raw data frame is shown to the right (Figure 4).
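This file is also plain text, but it uses the string "<SEP>" as its field separator. A minimal loading sketch (again, the column names are chosen for readability, and the file is assumed to keep its default name):

```python
# Load the MSD track list: each line is
# trackID<SEP>songID<SEP>artist name<SEP>song title.
tracks = pd.read_csv(
    "unique_tracks.txt",
    sep="<SEP>",
    engine="python",  # multi-character separators require the python engine
    names=["trackID", "songID", "artist", "title"],
)
print(tracks.head())  # cf. Figure 4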
Finally, song and artist metadata are acquired using the Spotify Web API. Song and artist names from the MSD subset are used to submit a search query to Spotify's API, which returns Spotify's IDs for the requested song and artist (among other things). These IDs are then used to request track and artist information from the API, such as genre and audio features, which we will need for our analysis. An example of this API call is shown in Figure 5. The Spotify Web API offers a large and descriptive variety of audio features for each track, which will be helpful when transforming the data out of the genre space later on.
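The code used for this project is shown in Figure 5. As a rough sketch of the same idea using the spotipy client library (this is not necessarily the project's implementation, and it assumes Spotify client credentials are configured in the environment), a single search query might look like:

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Authenticate with the client-credentials flow (reads SPOTIPY_CLIENT_ID and
# SPOTIPY_CLIENT_SECRET from the environment).
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def search_spotify_ids(artist, title):
    """Return (Spotify track ID, Spotify artist ID) for a song, or None if not found."""
    result = sp.search(q=f"track:{title} artist:{artist}", type="track", limit=1)
    items = result["tracks"]["items"]
    if not items:
        return None
    track = items[0]
    return track["id"], track["artists"][0]["id"]
```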
Data collection, data frame assembly, exploration, and cleaning were done in Python. The full code is available here.
Figure 3: First five rows of the raw "Taste Profile" dataset. This dataset includes over 48 million points of user-song-playcount data. As seen in the snippet, users and songs are encoded with dataset-specific IDs. Readable song and artist information is stored elsewhere and will need to be merged in.
Figure 4: First five rows of the raw track information. This file contains the mappings between song IDs and the names of the 1 million songs in the MSD. An important note about this file is that "track" refers to the unique instance/recording of a "song," and thus one song may have multiple tracks. This can be seen if you search this data frame for songID duplicates.
Figure 5: Code snippet for a Spotify Web API call. This block of code iterates through all of the unique songs in the Taste Profile dataset and uses the song and artist name to get a search result from the Spotify API. If successful, the result will contain the Spotify IDs for both the song and the artist, which are needed for future API calls.
Figure 6: Overview of the merged Taste Profile and track information data frames. Readable artist and title (song) names are now included for each observation. Only the 65 rows with missing song titles have been removed.
Figure 7: Visualization of number of data points per column/attribute in the merged data frame. The six column names seen in Figure 6 are shown on the x-axis (bottom), while the bars and corresponding values at the top indicate the number of existing data points. Only the last column (title) is missing data.
Cleaning & Preparation
Find and remove duplicate songs from the track information data frame. In this data frame (Figure 4), "track" refers to a unique instance/recording of a "song," so one song may have multiple tracks; these appear as songID duplicates. For the purposes of this analysis, only unique songs are needed, so the duplicates are dropped.
Merge the Taste Profile data frame with the cleaned track information data frame. This adds the corresponding readable song and artist names to each row of the Taste Profile data frame. A small preview of this combined data frame is shown to the left (Figure 6).
Check for missing values. Checking the resulting data frame reveals a small number of rows (65) with missing song titles; see Figure 7 for a visualization of data points per attribute. Looking more closely, the problem stems from just 4 unique songs by 2 unique artists. Since this is a very small amount of data compared to the overall size of the data frame, these rows are removed.
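A minimal pandas sketch of these three steps, continuing with the illustrative variable and column names from the loading snippets above:

```python
# Step 1: keep one track per song (drop duplicate songIDs).
tracks_unique = tracks.drop_duplicates(subset="songID")

# Step 2: attach readable artist and title to each user-song-playcount triplet.
merged = taste.merge(tracks_unique, on="songID", how="left")

# Step 3: check missing values per column (cf. Figure 7), then drop the
# small number of rows with missing titles.
print(merged.isna().sum())
merged = merged.dropna(subset=["title"])
```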
Figures 10 & 11: Visualizations of the playcount variable within each unique user. Top: the distribution of users' average playcounts. There is a heavy skew toward the bottom end of the range, with the median of all users' averages at around 2 plays. Bottom: the distribution of the variance of users' playcounts. Again there is a large skew in the distribution, with many users having a playcount variance of zero, meaning that all of their songs have the same number of plays. This may be problematic if we intend to use playcount as an analogue for preference.
Figure 12: Distribution of data points (number of songs) per user. All users in the dataset have at least ten data points. As with the other variables, there also seems to be a very large yet skewed range here. Depending on the number of data points needed per user for the analysis, users with fewer data points may need to be removed.
Exploration
Before data from the Spotify API are added, the collected MSD data are explored.
Figures 8 & 9: Top: the distribution of playcounts in the dataset is heavily skewed toward the minimum of the range (by several orders of magnitude). Bottom: zooming in reveals more detail about the distribution of this variable, including a single data point all the way at the maximum.
Figure 13: Do all observations/songs in the dataset come from unique artists, or do some users have multiple songs by the same artist? The histogram suggests that the overwhelming majority of users have just one song per unique artist in their data. Some users do have multiple songs by the same artist, however, with one user having 97 songs by a single artist.
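The per-user summaries behind Figures 10-13 can be computed along these lines (a sketch using the same illustrative names as above):

```python
# Average playcount, playcount variance, and number of songs per user
# (Figures 10-12).
per_user = merged.groupby("userID").agg(
    meanPlays=("playcount", "mean"),
    varPlays=("playcount", "var"),
    nSongs=("songID", "nunique"),
)
print(per_user.describe())

# Maximum number of songs by a single artist within each user's data (Figure 13).
songs_per_artist = (
    merged.groupby(["userID", "artist"])["songID"]
    .nunique()
    .groupby("userID")
    .max()
)
```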
More data cleaning & preparation
PLAYCOUNT VARIANCE. First, let's remove users whose playcount variance is zero. For example, if someone's data consist entirely of playcounts of 1 (i.e., a set of songs they each listened to just once), it is unclear whether these songs are really preferred or not. More generally, if the variance of a user's playcounts is very low, we won't be able to use that variable to estimate a degree of preference (more vs. less preferred) among their reported songs.
PLAYCOUNT MAXIMUM. Next, let's remove users whose maximum playcount is less than three. This removes some more users with unclear preference levels in their data, without discarding too much of the dataset.
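Both filters can be expressed as a per-user aggregation followed by a row filter, for example:

```python
# Per-user playcount variance and maximum.
stats = merged.groupby("userID")["playcount"].agg(["var", "max"])

# Keep users with nonzero playcount variance and a maximum playcount of
# at least three.
keep = stats[(stats["var"] > 0) & (stats["max"] >= 3)].index
merged = merged[merged["userID"].isin(keep)]
```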
The resulting data frame overview is shown to the right (Figure 14A). Approximately 95% of the original data still remains.
Remove songs tagged with matching errors. In the documentation of the MSD and Taste Profile subset, some song IDs were identified as unreliable (either incorrect or unverifiable). More information on this error is available here, along with a list of the unreliable songs ("List of Song - Track Mismatches"). These songs can be compared against the song IDs in the data we have so far (Figure 14A) and removed, as recommended by the MSD documentation. The resulting data frame is summarized in Figure 14B.
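A sketch of this filtering step: it assumes the mismatch list has been downloaded as sid_mismatches.txt and that each flagged line embeds the song ID (MSD song IDs begin with "SO") inside an angle-bracketed pair. The parsing pattern here is an assumption and should be checked against the file's actual format.

```python
import re

# Collect the song IDs flagged as unreliable (assumed line format:
# "ERROR: <songID trackID> ...").
bad_songs = set()
with open("sid_mismatches.txt") as f:
    for line in f:
        match = re.search(r"<(SO\w+)", line)
        if match:
            bad_songs.add(match.group(1))

# Remove all observations for flagged songs (Figure 14A -> Figure 14B).
merged = merged[~merged["songID"].isin(bad_songs)]
```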
Figure 15: Code snippet for a Spotify Web API call. This block of code iterates through the unique Spotify artist IDs obtained in Figure 5 and requests metadata for 50 artists at a time. Returned data include associated genre labels and artist popularity. A similar process is used for retrieving track metadata (not pictured here).
Figure 14A: Overview of the merged data frame from Figure 6 after more data cleaning has been performed based on the playcount variable. Approximately 95% of the original amount of data still remains.
Figure 14B: Overview of the merged data frame from Figure 14A after removing unreliable songs as identified by the MSD documentation.
ADDING METADATA FROM SPOTIFY. Using the "Get Several Artists," "Get Several Tracks," and "Get Tracks' Audio Features" API calls, artist and song metadata are added to the MSD Taste Profile data shown in Figure 14 above. The attributes that are added are:
- artistGenres: genre labels associated with the artist.
- artistPop: artist popularity score (Spotify calculation).
- trackPop: track popularity score (Spotify calculation).
- trackAcoustic: track "acousticness" score.
- trackDanceable: track "danceability" score.
- trackDurMS: track duration in milliseconds.
- trackEnergy: track energy score.
- trackInstrum: track "instrumentalness" score.
- trackKey: key the track is in.
- trackLoud: track overall loudness in dB.
- trackMode: track mode (major/1 or minor/0).
- trackSpeech: track "speechiness" score.
- trackTempo: track overall tempo in BPM.
- trackVal: track overall valence measure.
- albumYear: release date of the track's album.
More information on how Spotify calculates these attributes can be found in the corresponding API documentation.
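As a sketch of the batched retrieval described in Figure 15, using the spotipy client from the earlier snippet (the Spotify ID columns here are hypothetical names for the IDs collected via the search calls):

```python
# Artist metadata: "Get Several Artists" accepts up to 50 IDs per request.
artist_ids = merged["spotifyArtistID"].dropna().unique().tolist()
artist_meta = {}
for i in range(0, len(artist_ids), 50):
    for artist in sp.artists(artist_ids[i : i + 50])["artists"]:
        if artist is not None:
            artist_meta[artist["id"]] = {
                "artistGenres": artist["genres"],
                "artistPop": artist["popularity"],
            }

# Audio features: "Get Tracks' Audio Features" accepts up to 100 IDs per request.
track_ids = merged["spotifyTrackID"].dropna().unique().tolist()
track_feats = {}
for i in range(0, len(track_ids), 100):
    for feat in sp.audio_features(track_ids[i : i + 100]):
        if feat is not None:
            track_feats[feat["id"]] = feat
```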
Figure 16: Four rows of merged MSD Taste Profile and Spotify API data.
More data exploration
Figure 17: Spotify's API documentation describes features such as tempo and loudness as contributing to the perception (and resulting score) of energy. This is reflected visually here, with faster and louder tracks having higher energy scores than slower and quieter ones.
Figure 18: Two relationships can be seen in this figure. First, tempo and energy appear to be correlated, which we also saw in Figure 17. Second, tempo and "danceability" seem to have a Gaussian-like (inverted-U) relationship, suggesting that tempi around 120 BPM are the most "danceable." Interestingly, this corresponds to a typical human walking pace.
Figure 19: This plot focuses specifically on the lower end of the playcount range (less than 100), since that is where the data are concentrated. Surprisingly, playcounts in this dataset seem to be inversely correlated with Spotify's track popularity metric.
Figure 20: When separating track popularity between tracks in a minor key vs. those in a major key, it appears that of the most popular songs, more are in a minor key.
Figure 21: Even though minor keys are commonly described as "sad" and major keys as "happy," there appears to be no clear relationship in this dataset between track mode and the valence estimate.