CSCI 5622 - MACHINE LEARNING (SPRING 2023)
UNIVERSITY OF COLORADO BOULDER
BY: SOPHIA KALTSOUNI MEHDIZADEH
NAIVE BAYES
Overview
Naive Bayes is a form of supervised learning. The algorithm learns from labeled data how to classify observations using Bayes' rule and the conditional probabilities of attribute values in the dataset. The algorithm is considered "naive" because it makes the strong assumption that the attributes in the dataset are independent of one another (which in reality may not be completely true). Naive Bayes is best suited to categorical data (nominal or ordinal); however, it may also be applied to numeric data after some adjustments (e.g., discretization). In the case of numeric data, the algorithm additionally assumes that the attributes are normally distributed. Once the model has been trained on a subset of labeled data, it can then be used to similarly classify unlabeled data.
Given an input data vector x, the Naive Bayes algorithm predicts the class (outcome) c that maximizes the posterior probability P(c | x) -- i.e., the probability of class c given vector x. Bayes' rule states that this is equivalent to maximizing P(x | c) * P(c) / P(x), where P(x | c) denotes the likelihood of vector x given class c. If x consists of independent attribute values A1, A2, ..., An, then this likelihood can be expanded to the product P(A1 | c) * P(A2 | c) * ... * P(An | c). The implication is that if any one of the conditional probabilities in the expansion is zero (in other words, that class and attribute-value combination was never seen during training), then the entire product becomes zero, wiping out the information contributed by all of the other attributes. To prevent this, a smoothing step is important and often required for Naive Bayes models.
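To make the role of smoothing concrete, below is a minimal R sketch of an add-one (Laplace) smoothed conditional probability estimate; the function and all variable names are illustrative only, not from the project code:

    # Add-one (Laplace) smoothed estimate of P(A = a | c).
    # 'x' is a categorical attribute column; 'y' holds the class labels.
    smoothed_cond_prob <- function(x, y, a, c, laplace = 1) {
      x_in_class <- x[y == c]           # attribute values within class c
      n_levels   <- length(unique(x))   # number of distinct attribute values
      # The '+ laplace' terms keep the estimate from ever being exactly zero,
      # so one unseen attribute/class pair cannot zero out the whole product.
      (sum(x_in_class == a) + laplace) / (length(x_in_class) + laplace * n_levels)
    }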
Multinomial Naive Bayes specifically refers to an implementation of the Naive Bayes algorithm for multinomially distributed data -- i.e., datasets in which the attributes are counts or frequencies over two or more discrete values. This form of the Naive Bayes algorithm uses the frequencies of attribute values in the dataset for the prediction. Alternatively, Bernoulli Naive Bayes refers to an implementation of the Naive Bayes algorithm for Bernoulli-distributed data -- i.e., a discrete probability distribution over Boolean/binary values. This form of Naive Bayes uses Boolean/binary values indicating the presence or absence of an attribute value in the dataset for the prediction.
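As a toy illustration of the distinction, and assuming a recent version of the R naivebayes package (which, to my knowledge, provides specialized multinomial_naive_bayes() and bernoulli_naive_bayes() fitting functions), the two variants could be fit as follows; the data here is randomly generated purely for illustration:

    library(naivebayes)

    set.seed(1)
    # Count-valued attributes for the multinomial variant.
    X_counts <- matrix(rpois(40, lambda = 2), nrow = 10,
                       dimnames = list(NULL, paste0("A", 1:4)))
    # 0/1 presence indicators for the Bernoulli variant.
    X_bin <- matrix(rbinom(40, size = 1, prob = 0.5), nrow = 10,
                    dimnames = list(NULL, paste0("A", 1:4)))
    y <- factor(rep(c(0, 1), each = 5))   # class labels

    mnb <- multinomial_naive_bayes(x = X_counts, y = y, laplace = 1)
    bnb <- bernoulli_naive_bayes(x = X_bin, y = y, laplace = 1)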
Data Preparation
For a preview image of the raw data/starting point, please see FIGURE 16 on the Data Prep & Exploration tab of this website.
For this project, we will be using multinomial Naive Bayes to attempt to classify whether a song is "preferred" or "not preferred" by a user given a set of attributes from our acquired MSD and Spotify data.
Creating the data labels/classes. The outcome variable for this classification problem will be called "Preferred" and will have a value of 1 for preferred and 0 for not preferred. We will determine whether or not a song is preferred using the playcount information in our dataset. For each user, their mean playcount (from all of their listened-to songs) is calculated (Figure 53). For a given song that a user listened to, if that song's playcount is above the user's mean playcount, the song is labeled "preferred" (1); otherwise, it is labeled "not preferred" (0). See Figure 54.
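A minimal sketch of this labeling step in R, assuming a dataframe df with user and playcount columns (these names are illustrative, not the project's actual ones):

    # Per-user mean playcount across all of that user's songs (Figure 53).
    df$mean_playcount <- ave(df$playcount, df$user, FUN = mean)
    # Label: 1 = preferred (above the user's mean), 0 = not preferred.
    df$Preferred <- as.integer(df$playcount > df$mean_playcount)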
Balancing the classes in the dataset. After labeling our data, we check the prevalence of each class in the dataset. In doing this, we find that there are about half as many "preferred" observations as there are "not preferred" observations. This imbalance should be corrected, as the bias in our dataset will likely also bias our classifier. The data are therefore sampled such that there are an equal number of observations from both classes.
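One way to carry out this balancing in R is to downsample the majority class to the size of the minority class; a sketch, assuming the labeled dataframe df from the previous step:

    set.seed(42)                         # arbitrary seed, for reproducibility
    n_min <- min(table(df$Preferred))    # size of the smaller class
    # Sample n_min rows from each class and recombine.
    df_bal <- do.call(rbind, lapply(split(df, df$Preferred),
                                    function(cls) cls[sample(nrow(cls), n_min), ]))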
Creating a "preference distance" attribute. Before attempting classification, we create one more attribute that may assist with the preference classification task. Please refer back to the Introduction tab for an overview of how this quantitative metric was initially developed and utilized in prior works. In this project, we will be calculating "preference distance" a little differently. In prior works, this metric was calculated within the context of the MUSIC five-factor model space. Due to time constraints with this project and the time-consuming nature of accurately translating our data into the MUSIC space, we will perform an analogous calculation using some of the Spotify acoustic attributes instead of the MUSIC factors. The following attributes from our dataset will be used:
- trackAcoustic: track "acousticness" score.
- trackDanceable: track "danceability" score.
- trackEnergy: track energy score.
- trackInstrum: track "instrumentalness" score.
- trackSpeech: track "speechiness" score.
- trackVal: track overall valence measure.
I intentionally selected these high-level attributes from our dataset to more closely parallel the high-level descriptive nature of the MUSIC factors, and did not include the loudness or tempo attributes. Some of the other dataset attributes not mentioned here will still be used for the classification task (see below).
For each user, their mean for each of the attributes listed above is calculated (from all of their listened-to songs). This gives us a 6-dimensional centroid for each user, which can be taken to quantitatively represent their average listening habits across these six musical attributes (Figure 53). From there, a "preference distance" is calculated for each of a user's songs by taking the cosine similarity between the song's corresponding attribute values and the user's centroid. This metric represents how similar a given song is, on these six musical attributes, to what that user has typically reported listening to. See Figure 54.
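A sketch of the centroid and cosine similarity computation in R, continuing from the balanced dataframe df_bal above (the helper and new column names are illustrative):

    feats <- c("trackAcoustic", "trackDanceable", "trackEnergy",
               "trackInstrum", "trackSpeech", "trackVal")

    # 6-dimensional centroid per user (Figure 53).
    centroids <- aggregate(df_bal[feats], by = list(user = df_bal$user), FUN = mean)

    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    # Similarity of each song to its listener's centroid (Figure 54).
    df_bal$prefDistance <- sapply(seq_len(nrow(df_bal)), function(i) {
      song <- unlist(df_bal[i, feats])                                   # this song
      u    <- unlist(centroids[centroids$user == df_bal$user[i], feats]) # its user's centroid
      cosine_sim(song, u)
    })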
Other attributes for training the model. So far, the variables we have discussed in the context of our classification task are the outcome variable ("Preferred") and the preference distance input variable. We will also use the following attributes from our original dataframe as inputs to the model:
- artistPop: artist popularity score.
- trackPop: track popularity score.
- trackKey: key the track is in.
- trackMode: track mode (major/1 or minor/0).
- albumYear: year of release.
Learning from the decision tree results on the previous page, for this classification task we adapt the year of release attribute so that it is normalized within each participant. This is done by taking the absolute difference between a song's year of release and that user's mean year of release (computed from all of their listening data; Figure 53). This creates an attribute representing the chronological similarity of a song to what the user usually listens to. Outliers for this attribute are removed from the data. See Figure 54.
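A sketch of this chronological similarity attribute in R; note that the text above does not restate the exact outlier rule, so the 1.5 * IQR cutoff below is an assumption for illustration:

    # Per-user mean release year, then absolute distance from it (Figure 53).
    df_bal$mean_albumYear <- ave(df_bal$albumYear, df_bal$user, FUN = mean)
    df_bal$yearSim <- abs(df_bal$albumYear - df_bal$mean_albumYear)

    # Drop outliers; the 1.5 * IQR cutoff here is an illustrative assumption.
    q <- quantile(df_bal$yearSim, c(0.25, 0.75))
    df_bal <- df_bal[df_bal$yearSim <= q[2] + 1.5 * diff(q), ]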
Figure 53: A temporary dataframe created to store mean attribute values for each user in our dataset. The mean_playcount variable is calculated from the playcounts for all of a user's songs. Mean playcount will be used later to create our dataset labels. Similarly, the mean_albumYear variable is calculated from the album release years for all of a user's songs. The other six columns of the dataframe represent the mean acoustic attribute values for that user's data. These mean/center values for each user represent their average listening habits (in terms of musical characteristics) and will be used later to calculate the preference distance (similarity) metric.
Figure 54: This dataframe was created by merging the raw dataframe/starting point with the dataframe from Figure 53. The outcome variable "Preferred" can be seen on the right side. The dataframe was sampled such that both classes equally make up the data. The additional preference distance and year similarity attributes have also been added to the dataframe (right side).
Numerical attribute discretization. Since Naive Bayes performs best on categorical data (and we have many numerical attributes in our dataset), we discretize our data. Popularity metrics (ranging from 0-100) are discretized into bins of 10. Year similarity (ranging from 0-20 after outlier removal) is discretized into bins of 5 years. Preference distance is split into three levels/bins using the decision tree results and the splitting thresholds returned by that model -- "1": 0.998-1 (most preferred/similar), "2": 0.842-0.998 (intermediately preferred/similar), "3": 0-0.842 (least preferred/similar).
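A sketch of this binning in R with cut(); the bin edges follow the text above, while the column names remain illustrative:

    # Popularity scores (0-100) into ten bins, labeled 0 (least) to 9 (most).
    df_bal$artistPop <- cut(df_bal$artistPop, breaks = seq(0, 100, by = 10),
                            labels = 0:9, include.lowest = TRUE)
    df_bal$trackPop  <- cut(df_bal$trackPop, breaks = seq(0, 100, by = 10),
                            labels = 0:9, include.lowest = TRUE)
    # Year similarity (0-20 after outlier removal) into bins of 5 years.
    df_bal$yearSim <- cut(df_bal$yearSim, breaks = seq(0, 20, by = 5),
                          include.lowest = TRUE)
    # Preference distance into the three decision-tree-derived levels:
    # "1" = 0.998-1 (most similar), "2" = 0.842-0.998, "3" = 0-0.842.
    df_bal$prefDistance <- cut(df_bal$prefDistance,
                               breaks = c(0, 0.842, 0.998, 1),
                               labels = c("3", "2", "1"), include.lowest = TRUE)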
A preview of the final dataframe is shown in Figure 55.
Split data into train/test subsets. Because Naive Bayes classifiers learn from labeled data, we need to train the model on a subset of labeled data and then test its performance on a separate, previously unseen subset of the data. We will use the common split proportion of 80% of our data for training and 20% for testing. The rows of the dataframe (the one in Figure 55) are shuffled/randomized; then, the first 80% of the rows are saved as the training set, and the last 20% are saved as the testing set. This gives us two disjoint sets/data files. Sets that are not disjoint would misleadingly inflate the measured performance of the model. The full dataset, as well as the split train/test files, can be found here. Figure 56 below shows a preview of the train and test sets.
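A sketch of the shuffle and 80/20 split in R, continuing from the prepared dataframe (df_final standing in for the dataframe previewed in Figure 55):

    set.seed(42)                                  # arbitrary seed
    df_final <- df_bal[sample(nrow(df_bal)), ]    # shuffle the rows
    n_train  <- floor(0.8 * nrow(df_final))
    train <- df_final[seq_len(n_train), ]         # first 80% of rows
    test  <- df_final[-seq_len(n_train), ]        # remaining, disjoint 20%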
Figure 55: Preview of final dataframe.
Figure 56: Training set (top) and testing set (bottom) previews (first five rows of each).
Figure 57: Visualizations of the conditional probabilities (in training data) for three of the model attributes and the two classes (preferred-1 vs. not preferred-0). The preference distance attribute (top) appears to have the most distinction between the two classes. As a reminder, we discretized preference distance such that a 1 represents the "nearest distance" / most similarity, and a 3 the "farthest distance" / least similarity. Track popularity (middle) seems to have some minute differences between the two classes. As a reminder, 0 is least popular and 9 is most popular. Despite using a "year similarity" metric (bottom) as opposed to the raw year of release information, this does not appear to have contributed towards the classification.
Results
The Naive Bayes model was run in R using the naivebayes library.
- formula = Preferred ~ .
- laplace smoothing = 1
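A minimal sketch of that call (the library, formula, and laplace setting come from the settings above; the object names and accessor usage are illustrative):

    library(naivebayes)

    train$Preferred <- factor(train$Preferred)  # the class must be a factor
    nb_model <- naive_bayes(Preferred ~ ., data = train, laplace = 1)

    nb_model$prior    # a priori class probabilities (reported below)
    tables(nb_model)  # per-attribute conditional probability tables (cf. Figure 57)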
The a priori probabilities from the training dataset are:
- Preferred [1] = 0.49989
- Not preferred [0] = 0.50011
The conditional probabilities for three of the (training) dataset attributes and the two outcomes/classes are shown in Figure 57. Preference distance appears to show the most distinct differences between the two classes, with a higher probability of "category 1" observations (most preferred/similar) and a lower probability of "category 3" observations (least preferred/similar) in the preferred class. This adds to the validation of the preference distance metric so far, although it may be improved in future work to separate the two classes further. Some small differences between the two classes can also be observed for the track popularity attribute (e.g., a smaller probability of category 1 and 2 / low-popularity observations and a higher probability of category 5+ / mid-to-high-popularity observations in the preferred class). Despite transforming the year of release attribute to account for individual differences and chronological trends in what users listen to, it does not seem to have contributed towards the classification of user-preferred vs. not-preferred songs. The other attributes from the data, which are not pictured, showed no notable differences between the two classes.
Model performance on the testing data (confusion matrix + accuracy score) is shown below. The model performs only marginally above chance for two classes (50%).
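A sketch of how this evaluation could be produced in R, reusing nb_model and test from the earlier sketches:

    # Hold the class column out of the predictors when predicting.
    preds <- predict(nb_model, newdata = test[names(test) != "Preferred"])
    conf_mat <- table(Predicted = preds, Actual = test$Preferred)
    accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
    conf_mat
    accuracy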
Conclusions
In this section, we used multinomial Naive Bayes to attempt to classify user-preferred and not-preferred songs. We used a variety of features for this task, including low-level track characteristics (mode and key), popularity measures, a chronological similarity measure, which captured how close in time a given song's release was to the era of music a user typically listened to, and an individualized "preference distance" metric, which captured how similar in style a given song was to what that user typically listened to. The preference distance metric continues to be the most significant contributor to the classification task. Despite using chronological similarity instead of the raw year of release value, this did not improve classification compared to the results from the decision tree model (previous page). This suggests that there is perhaps no relationship in our data between when a song was released and a user's preference. Future work will focus on improving and iterating on the preference distance measure, as stylistic similarity appears to be the most correlated with user preference.