
CLUSTERING

Overview

Clustering is a type of unsupervised learning: the algorithm finds underlying groups (clusters) in unlabeled data. Generally, the goal of clustering is to group similar data vectors based on their "distance" from one another. This distance can be calculated in a number of ways depending on the context of the analysis, such as Euclidean distance (L2 norm), Manhattan distance (L1 norm), cosine distance/similarity, Jaccard distance, edit distance, and more. The algorithm attempts to minimize the distances between vectors assigned to the same cluster and maximize the distances between vectors in different clusters.
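For example, a few of these distance measures can be computed directly with SciPy. This is a minimal sketch; the two vectors below are made-up placeholders, not actual song data from this project:

```python
# Three common distance measures between two feature vectors
# (e.g., rows of standardized song attributes).
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([0.2, -1.3, 0.8, 0.0, 1.1, -0.4])
b = np.array([0.5, -0.9, 1.0, 0.3, 0.7, -0.2])

print(euclidean(a, b))  # L2 norm of (a - b)
print(cityblock(a, b))  # L1 (Manhattan) distance
print(cosine(a, b))     # 1 - cosine similarity
```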


There are two main types of clustering algorithms: partitional and hierarchical. Partitional clustering divides the data vectors into non-overlapping clusters (Figure 22). This includes K-means, which requires the desired number of clusters to be specified up front, and density-based methods such as DBSCAN, which instead infer the number of clusters from the density of the data. Hierarchical clustering does not require the number of clusters to be specified. Instead, a hierarchical algorithm may start with every vector as its own cluster and merge the closest pairs until only one cluster remains (agglomerative/bottom-up), or it may start with one all-inclusive cluster and divide it until each cluster contains only one vector (divisive/top-down). In either case, the result is a tree-like hierarchy of clusters (visualized by a dendrogram), which can reveal further details about the structure within each higher-level cluster (Figure 23).
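As a toy illustration of the difference (on random placeholder data, not the project data): scikit-learn's KMeans must be told the number of clusters up front, while SciPy's agglomerative linkage builds the full merge tree first, which can then be cut at any height afterwards:

```python
# Partitional vs. hierarchical clustering on random placeholder data.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X_toy = np.random.rand(100, 6)  # 100 samples, 6 features

# Partitional: the number of clusters (k) must be specified.
kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_toy)

# Hierarchical (agglomerative): build the whole merge tree first...
Z = linkage(X_toy, method="ward")
# ...then cut it afterwards into however many clusters we want.
hier_labels = fcluster(Z, t=3, criterion="maxclust")
```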


For this project, we will see whether we can discover clusters in our collected song data based on the features given by the Spotify API. In both the Val-Aro-Dep and MUSIC models discussed on the Introduction page, the model components were found by rating music from various genres on descriptive sound-related (instrumental, loud, electric, ...) or psychological (intense, relaxing, sad, ...) attributes. We were able to get similar attribute scores from the Spotify Web API. Our goal is to see whether partitional and hierarchical clustering methods can reveal underlying structure in our data that goes beyond categorization by genre.

[Image: partitional_clust.png]

Figure 22: Illustration of what partitional clustering (with 3 clusters) might reveal for the music data in this project.

[Image: hierarch_clust.png]

Figure 23: Illustration of what hierarchical clustering might reveal for the music data in this project.

[Image: clust_stand.png]

Figure 24: The distribution of the six selected attributes after standardization. The plot shows many outliers in the data, which can cause problems for many clustering algorithms.

[Image: trackInstrum_hist.png]

Figure 25: Histogram of the trackInstrum attribute. The distribution is extremely skewed towards the lower end of the range. Because this may affect how the data are clustered, we correct for it by resampling to create a uniform distribution in this attribute.

[Image: clust_dist_final.png]

Figure 26: Distribution of the six selected attributes after outlier removal and resampling.

[Image: genre_terms_hist.png]

Figure 27: Frequency analysis of the words present in the genre labels returned from the Spotify API. The most frequent words are saved as labels for our data, which we will use when inspecting the clusters later.

Data Preparation

LINK TO SAMPLE OF THE DATA

LINK TO CODE


Of the song attributes available from the Spotify API, the closest matches to those used in the prior literature are:

  • trackAcoustic (sound-related; could be the inverse analogue of the "electric" attribute in the MUSIC paper)

  • trackInstrum (sound-related; instrumental is an attribute used in the MUSIC paper)

  • trackLoud (sound-related; loud is an attribute used in the MUSIC paper)

  • trackTempo (sound-related; fast is an attribute used in the MUSIC paper)

  • trackVal (psychological; sad is an attribute used in the MUSIC and V-A-D papers)

  • trackDanceable (psychological; danceable is an attribute used in the V-A-D paper)


To prepare the data for clustering, we first extract these six attributes from our dataset.


Standardization. Because each attribute has a different range, all attributes are first standardized using the StandardScaler() function from the scikit-learn (sklearn) Python library (Figure 24).
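A minimal sketch of this step, assuming the collected data sits in a pandas DataFrame named df with the six attribute columns listed above:

```python
# Standardize the six attributes to zero mean and unit variance.
# `df` is an assumed DataFrame holding the collected song data.
import pandas as pd
from sklearn.preprocessing import StandardScaler

ATTRS = ["trackAcoustic", "trackInstrum", "trackLoud",
         "trackTempo", "trackVal", "trackDanceable"]

X = StandardScaler().fit_transform(df[ATTRS])
df_std = pd.DataFrame(X, columns=ATTRS, index=df.index)
```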


Outlier removal and resampling. Plotting the distribution of each attribute after standardization reveals outliers in the data. All outliers (points farther than 1.5 times the interquartile range (IQR) beyond the edges of the box) are removed from the trackTempo, trackLoud, and trackDanceable attributes. The trackInstrum attribute, however, is heavily skewed, which outlier removal won't correct (Figure 25). Instead, we divide this attribute into 50 bins and resample the data to create a uniform distribution (3,800 points per bin, for a total of 190,000 points). This still leaves a sufficient amount of data for clustering. The final data distribution is shown in Figure 26.
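The exact implementation is in the linked code; a hedged sketch of both steps (continuing from the df_std DataFrame above, with the bin and sample counts taken from the text) might look like:

```python
# Drop 1.5*IQR outliers from three attributes, then resample
# trackInstrum toward a uniform distribution (50 bins x 3,800 rows).
import pandas as pd

def drop_iqr_outliers(df, col):
    """Remove rows of `df` lying beyond 1.5 * IQR from the quartiles of `col`."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

for col in ["trackTempo", "trackLoud", "trackDanceable"]:
    df_std = drop_iqr_outliers(df_std, col)

# Uniform resampling over 50 bins of trackInstrum. Sampling with
# replacement when a bin falls short of 3,800 rows is an assumption here.
bins = pd.cut(df_std["trackInstrum"], bins=50)
df_final = (
    df_std.groupby(bins, observed=True)
          .apply(lambda g: g.sample(n=3800, replace=len(g) < 3800, random_state=0))
          .reset_index(drop=True)
)
```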


Save data labels separately (as index labels). Since clustering methods are unsupervised, our data must be unlabeled. However, it is helpful to save the labels separately so that we can use them to inspect the clusters that are created. The "labels" we use are the genre labels pulled from Spotify. Since Spotify returns many genre labels per song, we condense these by keeping only the most "umbrella" genre terms (e.g., Figure 27). Any data points found to be missing genre information are removed at this stage.
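A hypothetical sketch of this labeling step (the genres column and the helper function are illustrative assumptions; the actual logic is in the linked code):

```python
# Condense each track's list of Spotify genre strings into one
# "umbrella" term, keeping only the most frequent words overall.
from collections import Counter

term_counts = Counter(
    word
    for genre_list in df_final["genres"]   # e.g. ["indie rock", "dream pop"]
    for genre in genre_list
    for word in genre.split()
)
top_terms = {t for t, _ in term_counts.most_common(30)}  # cf. Figure 27

def umbrella_label(genre_list):
    """Return the most frequent umbrella term found in a track's genres."""
    words = [w for g in genre_list for w in g.split() if w in top_terms]
    return max(words, key=term_counts.__getitem__) if words else None

df_final["label"] = df_final["genres"].apply(umbrella_label)
df_final = df_final.dropna(subset=["label"])  # drop tracks missing genre info
```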


A preview of the final clustering data is given in Figure 28 and is also available here. The code for preparing the data for clustering can be found here.

[Image: clustering_dataset.png]

Figure 28: Preview of the data prepared for clustering analysis.

[Images: elbow1.png, silhouette.png, gapstat.png]

Figure 29: Results from using the Elbow, Silhouette, and Gap Statistic methods to determine the optimal k (number of clusters) for the data. K = 3 is supported by both the Elbow and Silhouette methods. It is also the first local maximum of the Gap statistic, even though that method automatically selected K = 2 as the better fit.

[Image: k234.png]

Figure 30: Side-by-side comparison of clustering results for k = 2, 3, and 4. In the two dimensions plotted, the clusters appear to overlap to some extent in all three solutions. K = 3 was the solution suggested by the measures in Figure 29.

Results

LINK TO CODE


Partitional clustering (k-means). Before applying the k-means clustering algorithm, we use the elbow, silhouette, and gap statistic methods to help determine the optimal number of clusters for our data (the value of k). Figure 29 shows the results from all three methods, with k = 3 appearing to be the most promising. Figure 30 shows side-by-side clustering results for k values of 2, 3, and 4. In the two dimensions plotted, the clusters overlap to some extent in all three solutions, which makes sense given the number of different genres being examined and the continuous/fluid nature of the attributes being used. Figures 31-33 show each clustering solution in more detail, with a summary of each cluster in the figure caption. Although every cluster contains a mixture of individual genres, each appears to represent a unique combination of qualities from the attributes in the dataset.
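The elbow and silhouette measures, for example, can be computed by fitting KMeans over a range of k values. In this sketch, X is assumed to be the prepared attribute matrix; the gap statistic would need a separate implementation or package:

```python
# Fit KMeans for k = 2..8 and record inertia (elbow method)
# and the mean silhouette score for each solution.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                    # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)   # k = 3 in Figure 29
```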


Hierarchical clustering (hclust with cosine similarity). Hierarchical clustering reveals many more relationships and subclusters in the data. The 3-cluster split occurs at a very large height relative to the rest of the tree; the dendrogram (Figure 34) instead suggests that the data cluster into at least 5-8 large groups. At the lowest level of the tree, we can see pairs of tracks that were grouped together by the data attributes. Interestingly, the genres assigned to these tracks are not always associated with each other, which suggests that the attributes we use to describe the tracks capture similarities that are not evident from genre labels alone.
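hclust is R's hierarchical clustering routine; an equivalent sketch in Python with SciPy is below (the average linkage is an assumption, as is reusing the attribute matrix X from above):

```python
# Agglomerative clustering over pairwise cosine distances, plus a
# cut of the tree into 8 clusters as suggested by the dendrogram.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

D = pdist(X, metric="cosine")       # condensed cosine-distance matrix
Z = linkage(D, method="average")    # merge history (assumed linkage method)
dendrogram(Z, truncate_mode="level", p=5)
plt.show()

labels_8 = fcluster(Z, t=8, criterion="maxclust")  # cf. Figure 35
```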

[Image: k2.png]

Figure 31: K=2 cluster solution. The clusters are fairly even, with 92 points in cluster 1 and 108 points in cluster 2. Cluster 1 is characterized by moderately acoustic, quieter, faster, negative-valence, and not-danceable tracks, whereas cluster 2 is characterized by more electronic, louder, positive-valence, and danceable tracks.

[Image: k3.png]

Figure 32: K=3 cluster solution. The clusters are less evenly split, with 31 points in cluster 1, 64 points in cluster 2, and 105 points in cluster 3. Clusters 1 and 2 are both characterized by negative-valence, not-danceable tracks, but cluster 1 is highly acoustic, quiet, and slower, whereas cluster 2 is more electronic, louder, and faster. Cluster 3, on the other hand, is characterized by danceable, positive-valence tracks that are moderately electronic.

[Image: k4.png]

Figure 33: K=4 cluster solution. The clusters are fairly even, with 54 points in cluster 1, 58 points in cluster 2, 32 points in cluster 3, and 56 points in cluster 4. Cluster 1 is characterized by electronic, vocal, louder, positive-valence, and danceable tracks. Cluster 2 is characterized by electronic, instrumental, louder, fast, negative-valence, and not-danceable tracks. Cluster 3 is characterized by acoustic, instrumental, quiet, slow, negative-valence, and not-danceable tracks. Cluster 4 is characterized by electronic, instrumental, quieter, danceable tracks.

[Image: hclust.png]

Figure 34: Hierarchical clustering dendrogram. The dendrogram suggests that the data cluster into at least 5-8 large groups, unlike the 3 clusters suggested by the k-means (partitional) clustering metrics. Figure 35 explores what an 8-cluster solution looks like.

[Image: k8.png]

Figure 35: K-means with 8 clusters (as informed by the dendrogram in Figure 34). In just two dimensions we see a lot of overlap between the clusters (though the variance percentages in the axis labels indicate that the data is higher-dimensional). Each cluster contains a variety of genre labels, yet qualitatively each seems to have its own character. For example, cluster 5 (light blue) appears to consist of "heavier" music (metal, hardcore, grunge, ...) compared to cluster 1 (red: indie, ambient, dance, electro, ...). Differences between the clusters are more easily visualized in Figure 36.

[Images: 8-clust-centers.png, 8-clust-terms.png]

Figure 36: Comparison of the 8 clusters shown in Figure 35. Top: relative comparison of the cluster centers/means on the six attributes for all eight clusters. Bottom: comparison of the most frequent terms present in each cluster (as a percentage of the total number of terms in the cluster). Each cluster appears to capture a distinct combination of characteristics in the data. This clustering solution may be used in future parts of the project to create alternative labels for the data (as opposed to the genre labels).

Conclusions

In this section, we clustered song data according to their ratings on six attributes retrieved from the Spotify API: Acousticness, Instrumentalness, Loudness, Tempo, Valence, and Danceability. We saw how songs can be grouped by similarities and differences in these high-level descriptive attributes, and that these groupings go beyond the typical genre labels used to describe the songs. This suggests that music may be affectively described by a set of high-level characteristics that are more objective than the frequently used, subjective genre labels, which supports the alternative models described in the introduction to this project.
