CSCI 5622 - MACHINE LEARNING (SPRING 2023)
UNIVERSITY OF COLORADO BOULDER
​
BY: SOPHIA KALTSOUNI MEHDIZADEH
DECISION TREES
Overview
Decision trees are a form of supervised learning. They learn from labeled data how the data space can be segmented using the data attributes. The "rules" which the model uses for this segmentation can be summarized in a tree structure (Figure 43). Decision trees can be used for regression (in the case of a quantitative output variable) or classification (in the case of a qualitative output variable). The data attributes may be continuous, ordinal, or nominal. Once the model has been trained on a subset of data, it can then be used to similarly segment unlabeled data.
​
The decision tree model must determine which attributes to use (and in which order) to optimally segment the data space, as well as the corresponding threshold attribute values for each split. For this reason, there is effectively an infinite number of possible trees for the same dataset (Figure 44). Decision tree algorithms take a "greedy" approach to this problem, making a series of choices that locally maximize the "information gain." Information gain quantifies how much a split in the tree succeeds in unmixing the data. It can be calculated either as the difference between the entropy before the split and the weighted average entropy after the split, or as the difference between the impurity (using the Gini index) before the split and the weighted average impurity after the split. Equations are given to the right, and an example calculation of "goodness of split" using these measures is illustrated in Figure 45.
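To make these measures concrete, below is a minimal R sketch (not the project code) of entropy, Gini impurity, and information gain computed as the parent node's impurity minus the weighted average impurity of its children; the node counts in the example are made up for illustration.

```r
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions in the node
  p <- p[p > 0]                        # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Information gain of splitting a parent node into child nodes:
# parent impurity minus the weighted average impurity of the children.
info_gain <- function(parent, children, impurity = entropy) {
  weights <- sapply(children, length) / length(parent)
  impurity(parent) - sum(weights * sapply(children, impurity))
}

# Example: a parent node with 6 preferred and 6 not-preferred songs,
# split into two children of 8 and 4 observations.
parent <- c(rep("preferred", 6), rep("not preferred", 6))
child1 <- c(rep("preferred", 5), rep("not preferred", 3))
child2 <- c(rep("preferred", 1), rep("not preferred", 3))
info_gain(parent, list(child1, child2))        # entropy-based gain
info_gain(parent, list(child1, child2), gini)  # Gini-based gain
```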
Figure 43: On the left is an example of a decision tree used to classify preferred vs. not preferred songs using the attributes "release year" and "artist popularity." The highest conditional node in the tree is the "root node," terminal nodes are "leaf nodes," and intermediate conditional nodes are "internal nodes." Data attributes are organized in order of their importance to the segmentation problem, with the root node attribute being the most important and the internal nodes decreasing in importance the further they are from the root. An illustration of the segments of the data space created by this decision tree is on the right.
Figure 44: An example illustrating how it is possible to make an infinite number of trees for a given dataset, by switching out or rearranging attributes, and modulating splitting thresholds.
Figure 45: Example calculation of "goodness of split" using information gain and entropy. Entropy is calculated for the node on the left side (parent node) and each of the two children (right side). Entropy values are then used to calculate the information gained from the split.
Data Preparation
​
For a preview image of the raw data/starting point, please see FIGURE 16 on the Data Prep & Exploration tab of this website.
​
For this project, we will be using decision trees to attempt to classify whether a song is "preferred" or "not preferred" by a user, given a set of attributes from our acquired MSD and Spotify data.
​
Creating the data labels/classes. The outcome variable for this classification problem will be called "Preferred" and will have a value of 1 for preferred and 0 for not preferred. We determine whether or not a song is preferred using the playcount information in our dataset. For each user, their mean playcount (across all of their listened-to songs) is calculated (Figure 46). For a given song that a user listened to, if the song's playcount is above the user's mean playcount, the song is labeled "preferred" (1); otherwise, it is labeled "not preferred" (0). See Figure 47.
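A minimal R sketch of this labeling step is shown below; the dataframe and column names (`songs`, `userID`, `playcount`) are placeholders and may differ from the project's actual data.

```r
# Per-user mean playcount across all of that user's songs (as in Figure 46)
songs$mean_playcount <- ave(songs$playcount, songs$userID, FUN = mean)

# Label a song "preferred" (1) if its playcount exceeds the user's mean, else 0
songs$Preferred <- as.integer(songs$playcount > songs$mean_playcount)
```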
​
Balancing the classes in the dataset. After labeling our data, we check the prevalence of each class in the dataset. In doing this, we find that there are about half as many "preferred" observations as "not preferred" observations. This imbalance should be corrected for, as a biased dataset will likely also bias our classifier. The data are therefore sampled such that there are an equal number of observations from both classes (see the sketch below).
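One way this balancing could be done in R is sketched below, downsampling each class to the size of the smaller class; the `songs` dataframe and `Preferred` column are carried over from the illustrative sketch above, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed for reproducibility

n_minority <- min(table(songs$Preferred))            # size of the smaller class
balanced <- do.call(rbind, lapply(split(songs, songs$Preferred),
                                  function(d) d[sample(nrow(d), n_minority), ]))

table(balanced$Preferred)  # both classes now have n_minority observations
```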
​
Creating a "preference distance" attribute. Before attempting classification, we create one more attribute which may assist with the preference classification task. Please refer back to the Introduction tab for an overview of how this quantitative metric was initially developed and utilized in prior works. In this project, we will be calculating "preference distance" a little differently. In prior works this metric was calculated within the context of the MUSIC five-factor model space. Due to time constraints with this project and the time-consuming nature of accurately translating our data into the MUSIC space, we will do an analogous calculation using some of the Spotify acoustic attributes instead of the MUSIC factors. The following attributes will be used from our dataset:
- trackAcoustic: track "acousticness" score.
- trackDanceable: track "danceability" score.
- trackEnergy: track energy score.
- trackInstrum: track "instrumentalness" score.
- trackSpeech: track "speechiness" score.
- trackVal: track overall valence measure.
​
We intentionally selected these high-level attributes to more closely parallel the high-level, descriptive nature of the MUSIC factors, and did not include the loudness or tempo attributes. Some of the other dataset attributes not mentioned here will still be used for the classification task (see below).
​
For each user, their mean for each of the attributes listed above is calculated (across all of their listened-to songs). This gives us a 6-dimensional centroid for each user, which can be taken to quantitatively represent their average listening habits across these six musical attributes (Figure 46). From there, a "preference distance" is calculated for each of a user's songs by taking the cosine similarity between the song's attribute values and the user's centroid. This metric represents how similar a given song is, on these six musical attributes, to what that user typically listens to. See Figure 47.
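Below is a minimal R sketch of the centroid and cosine-similarity calculation. The attribute column names come from the list above; the `songs` dataframe, `userID` column, and the simple row-wise loop are illustrative assumptions, not the project's actual code.

```r
feats <- c("trackAcoustic", "trackDanceable", "trackEnergy",
           "trackInstrum", "trackSpeech", "trackVal")

# Per-user centroid: the mean of each attribute over all of that user's songs
centroids <- aggregate(songs[, feats], by = list(userID = songs$userID), FUN = mean)

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Cosine similarity between each song's attribute vector and its user's centroid
songs$pref_dist <- sapply(seq_len(nrow(songs)), function(i) {
  u <- unlist(centroids[centroids$userID == songs$userID[i], feats])
  cosine_sim(unlist(songs[i, feats]), u)
})
```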
​
Other attributes for training the model. So far, the variables we have discussed in the context of our decision tree task are the outcome variable ("Preferred") and the preference distance input variable. We will also use the following attributes directly from our original dataframe as inputs to the model:
- artistPop: artist popularity score
- trackPop: track popularity score
- trackDurMS: track duration in msec.
- trackKey: key the track is in
- trackMode: track mode (major/1 or minor/0)
- albumYear: year of release
​
A preview of the final dataframe is shown in Figure 48.
Figure 46: A temporary dataframe created to store mean attribute values for each user in our dataset. The mean_playcount variable is calculated from the playcounts for all of a user's songs. Mean playcount will be used later to create our dataset labels. Similarly, the six right-most columns of the dataframe represent the mean acoustic attribute values for that user's data. These mean/center values for each user represent their average listening habits (in terms of musical characteristics) and will be used later to calculate the preference distance (similarity) metric.
Figure 47: This dataframe was created by merging the raw dataframe/starting point with the dataframe from Figure 46. The outcome variable "Preferred" can be seen on the right side. The dataframe was sampled such that both classes equally make up the data. The additional preference distance attribute has also been added to the dataframe (right side).
Figure 48: Preview of the final dataframe.
Split data into train / test subsets. Because decision trees learn from labeled data, we need to train the model on a subset of labeled data and then test its performance on a separate, previously unseen subset. We will use the common split proportion of 80% of our data for training and 20% for testing. The rows of the dataframe (the one in Figure 48) are shuffled/randomized; the first 80% of the rows are then saved as the training set and the last 20% as the testing set. This gives us two disjoint sets/data files; sets that are not disjoint would misleadingly inflate the model's measured performance. The full dataset, as well as the split train/test files, can be found here. Figure 49 below shows a preview of the train and test sets.
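A minimal sketch of this shuffle-and-split step in R (the dataframe name `df`, the seed, and the output file names are illustrative):

```r
set.seed(42)
df <- df[sample(nrow(df)), ]               # shuffle the rows

n_train <- floor(0.8 * nrow(df))
train_set <- df[1:n_train, ]               # first 80% of shuffled rows
test_set  <- df[(n_train + 1):nrow(df), ]  # remaining 20%

write.csv(train_set, "train_set.csv", row.names = FALSE)
write.csv(test_set,  "test_set.csv",  row.names = FALSE)
```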
Figure 49: Training set (top) and testing set (bottom) previews (first five rows of each).
Results
All of the following decision trees are created using the ctree function in R. The control parameters used to limit the growth of the trees are listed below (a minimal sketch of the model-fitting call follows the list):
- mincriterion = 0.999
- minsplit = 10,000
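The following is a hedged sketch of how such a tree could be fit and evaluated in R, assuming the partykit implementation of ctree and the train/test objects from the split above; the column names follow the attribute lists above and may differ slightly from those shown in the figure captions. Trees 2 and 3 would simply add albumYear and pref_dist to the formula.

```r
library(partykit)

# Control parameters from the list above
ctrl <- ctree_control(mincriterion = 0.999, minsplit = 10000)

# Tree 1: popularity scores plus low-level track features
tree1 <- ctree(factor(Preferred) ~ artistPop + trackPop + trackDurMS + trackKey + trackMode,
               data = train_set, control = ctrl)
plot(tree1)

# Confusion matrix and overall accuracy on the held-out test set
pred <- predict(tree1, newdata = test_set)
cm   <- table(predicted = pred, actual = test_set$Preferred)
sum(diag(cm)) / sum(cm)
```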
​
Higher resolution images of the trees can be downloaded here.
​
Tree 1. First, we try decision tree classification using only the artist and track popularity scores and the low-level track features (duration, key, and mode). The result is shown in Figure 50. As expected, the artist and track popularity metrics dominate the top of the tree, indicating that these attributes are the most important factors when classifying preference. This makes sense, since "popularity" signifies a general preference across a large population. Overall classification accuracy for this tree is poor, which was also expected, since we are using non-user-specific attributes to classify individual preference.
​
Tree 2. Next, we incorporate the year of release as an additional attribute for the decision tree, to see if this aids in the classification of user-preferred songs. The result is shown in Figure 51. The popularity metrics remain the most significant splitting attributes. Year of release does not make many significant structural changes to the tree, and the classification accuracy does not improve. This suggests that there is no overall correlation between year of release and user preference in our data. However, it is still possible that year of release plays a significant role within each user's data (e.g., perhaps user 1 generally prefers music from the '80s and user 2 prefers music from the 2010s). This possibility will be further explored in future work.
​
Tree 3. Lastly, we incorporate the preference distance attribute into our model. The result is shown in Figure 52. As expected, preference distance is now the most important splitting attribute for our dataset. This makes sense, since it is the only user-specific attribute we included. Below preference distance in this decision tree are the popularity metrics we observed earlier. Classification accuracy for this tree is slightly improved, but still poor overall. Nevertheless, this result suggests that the addition of more user-specific attributes may aid in the classification of user music preference. We will continue to explore this in future work.
​
Figure 50: Classification decision tree for Preference ~ artistPop + trackPop + trackDur + trackKey + trackMode. Confusion matrix and overall accuracy are shown to the right.
Figure 51: Classification decision tree for Preference ~ albumYear + artistPop + trackPop + trackDur + trackKey + trackMode. Confusion matrix and overall accuracy are shown to the left.
Figure 52: Classification decision tree for Preference ~ pref_dist + albumYear + artistPop + trackPop + trackDur + trackKey + trackMode. Confusion matrix and overall accuracy are shown to the right.
Conclusions
In this section, we used decision trees to attempt to classify user-preferred and not-preferred songs. We experimented with a variety of features for this task, including low-level track characteristics (like duration, mode, and key), popularity measures, year of release, and a calculated, individualized "preference distance" metric, which captured how similar a given song was to what that user typically listened to. While the popularity measures were important model attributes, the model that incorporated the individualized preference distance had the highest performance, and the preference distance metric was deemed the most important attribute for the classification model. This is promising and suggests that it is possible to quantitatively estimate listener preference for a song using such metrics. Our results also suggest favoring more individualized metrics over "raw" metrics when modeling user preference. For example, future work may adapt the year-of-release variable to account for the general time period of music that a user tends to listen to.