
ASSOCIATION RULE MINING (ARM)

Overview

ARM is a type of unsupervised learning that finds inherent or hidden relationships/patterns ("rules") within unlabeled data; the Apriori algorithm is the classic method for mining these rules. The analysis is performed on transaction-type data, in which each observation ("transaction") in the dataset consists of a list of "items" associated with it (Figure 37). Generally, the goal of ARM is to uncover items/sequences/structures that frequently occur together or are strongly correlated. The rules (or patterns) are articulated in the form X --> Y ("X implies Y"), where X represents the item(s) that form the premise or start of the pattern, and Y represents the item(s) that form the conclusion of the pattern (correlated with X).


Rules found within a dataset can be evaluated using three quantitative measures: support, confidence, and lift. The (relative) support for the rule X-->Y is the fraction of transactions in the dataset that contain both X and Y. The confidence for the rule X-->Y is the conditional probability that a transaction containing X also contains Y (equivalently, the support of X and Y together divided by the support of X). The lift for the rule X-->Y is the probability of X and Y together divided by the probability of X times the probability of Y. If the lift of the rule is 1, then X and Y are independent. If the lift is less than 1, then X and Y are negatively correlated (substitutes). If the lift is greater than 1, then X and Y are positively correlated (complements). These three measures can be used to set thresholds when evaluating rules, surfacing the most significant or frequent rules in the dataset.
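As a concrete illustration of these three measures, here is a small pure-Python sketch computed over a handful of made-up genre "baskets" (the transactions below are illustrative stand-ins, not drawn from the Million Song Dataset):

```python
# Toy "transactions": each user's set of genres (hypothetical data).
transactions = [
    {"rock", "pop", "indie"},
    {"rock", "pop"},
    {"rock", "metal"},
    {"pop", "dance"},
    {"rock", "pop", "dance"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """P(Y | X) = supp(X and Y) / supp(X)."""
    return support(set(X) | set(Y)) / support(X)

def lift(X, Y):
    """supp(X and Y) / (supp(X) * supp(Y)); 1 means independent."""
    return support(set(X) | set(Y)) / (support(X) * support(Y))

print(support({"rock", "pop"}))       # 3/5 = 0.6
print(confidence({"rock"}, {"pop"}))  # 0.6 / 0.8 = 0.75
print(lift({"rock"}, {"pop"}))        # 0.6 / (0.8 * 0.8) = 0.9375
```

In this toy data the lift of rock --> pop is slightly below 1, so the two would be (weakly) substitutes.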


For this project, we will see if we can discover rules in our user listening data from the Million Song Dataset. Our goal will be to see if the Apriori algorithm can reveal some underlying structure, or networks, in our data based on the different genres that users listened to (Figure 38). For instance, does listening to certain genres imply listening to certain others?

transaction-data.png

Figure 37: Illustration of what user listening data may look like in "transaction" format.

Figure 38: Illustration of a simple network of rules/patterns that ARM might reveal in our music listening data.

Data Preparation

LINK TO CODE

LINK TO SAMPLE OF THE DATA


Getting genres as "items." First we need to transform the data into a transaction format as illustrated in Figure 37, where the user ID is analogous to the "transaction ID" and the "items" are the types of music the user listened to. We extract the userID and artistGenres columns from our dataset. As we've seen from working with the data so far, the artistGenres column (returned from the Spotify API call) lists multiple terms per song. To make this analysis more feasible, we need to reduce this information to a single term per song per user. This is done by keeping umbrella genre terms only (see the methods of the Clustering section) and selecting the most frequently used term per observation. The result is shown in Figure 39.
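This reduction step might be sketched as follows. The userID and artistGenres column names come from our dataset, but the umbrella vocabulary and sample rows below are illustrative stand-ins, not the actual list from the Clustering section:

```python
from collections import Counter

import pandas as pd

# Hypothetical umbrella genre vocabulary (the real list comes from the
# Clustering section's methods).
UMBRELLA = {"rock", "pop", "jazz", "metal", "folk", "dance", "house"}

# Toy stand-in for the userID/artistGenres extract.
df = pd.DataFrame({
    "userID": ["u1", "u1", "u2"],
    "artistGenres": ["indie rock, garage rock, punk",
                     "dance pop, electro house, pop",
                     "cool jazz, jazz fusion"],
})

def top_umbrella_term(genres):
    """Keep only umbrella terms; return the most frequent one (or None)."""
    words = genres.replace(",", " ").split()
    counts = Counter(w for w in words if w in UMBRELLA)
    return counts.most_common(1)[0][0] if counts else None

df["item"] = df["artistGenres"].apply(top_umbrella_term)
print(df[["userID", "item"]])  # one umbrella term per user-song row
```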


Collecting user datapoints into a single "transaction." In Figure 39 we can see that each row represents one song for one user (i.e., user IDs are repeated according to how many of their songs are in the data). We will transform the data so that each row represents one user, with all of their song information listed together. Any duplicate information is removed (for example, if a user listened to 3 "rock" songs, "rock" will only be listed once). The result is shown in Figure 40.
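The consolidation step can be sketched with a pandas groupby; the rows below are toy stand-ins for the data in Figure 39:

```python
import pandas as pd

# One row per user-song pair, as in Figure 39 (toy stand-in data).
df = pd.DataFrame({
    "userID": ["u1", "u1", "u1", "u2", "u2"],
    "item":   ["rock", "rock", "pop", "jazz", "jazz"],
})

# Collapse to one row per user; dict.fromkeys de-duplicates the items
# while preserving first-seen order.
baskets = (
    df.groupby("userID")["item"]
      .apply(lambda s: list(dict.fromkeys(s)))
      .reset_index()
)
print(baskets)  # one row per user, each with a de-duplicated item list
```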


Saving/exporting in transaction format. Finally, we prepare the data to be read properly in R. As each row is now a unique user/"transaction," the lengthy userID is no longer necessary and is dropped from the dataframe. The "items" column is reformatted as a single string (not a list) of comma-separated terms, and the result is saved as a .txt file (without the index). A preview can be seen in Figure 41, and the data is available here.
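A sketch of this export step (the filename and sample rows here are hypothetical; note that writing the lines directly avoids the quoting a CSV writer would add around fields that contain commas):

```python
import pandas as pd

# Consolidated baskets, as in Figure 40 (toy stand-in data).
baskets = pd.DataFrame({
    "userID": ["u1", "u2"],
    "item":   [["rock", "pop"], ["jazz"]],
})

# Drop the lengthy userID; flatten each list of items into a single
# comma-separated string -- one "transaction" per line, no index.
lines = baskets["item"].apply(",".join)
text = "\n".join(lines) + "\n"

# Hypothetical output filename for the basket-format .txt file.
with open("arm_items.txt", "w") as f:
    f.write(text)
print(text)
```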

arm_items_example.png

Figure 39: Section of dataset showing "items" created from the "artistGenres" attribute using word/term frequency and filtering for "umbrella" terms.

arm_consol.png

Figure 40: Section of dataset showing "items" consolidated by user ID. Each row is a unique user with the "items" column containing a list of their unique "items" (song types).

arm_txt.png

Figure 41: Preview of data prepared for ARM (transaction/basket format saved as .txt)

itemfreq.png

Figure 42: Relative item frequency within our transaction data. Most frequent items (genres) are shown towards the bottom. The red vertical line represents the minimum support threshold we will apply, which is equal to ~1/# unique items in the data.

Confidence top 15: The resulting top 15 rules for confidence are shown below. We can see the RHS consists only of the "rock" item, which makes sense, as it is one of the broadest and most all-encompassing genres and consequently the most frequently occurring item in the dataset. The network for these rules is also shown below, where we can easily see "rock" at the center of the network.

top-confidence.png
conf-network.png

Lift top 15: The resulting top 15 rules for lift are shown below. All the lift values are greater than one, meaning that the LHS and RHS of these rules are positively correlated (complementary). While most of these rules are quite intuitive (ex. it makes sense that someone who listens to dance, pop, rock, and trance also likes house music), others seem more surprising (ex. rule #15). One possible explanation is that such rules represent individuals with very widespread music tastes/listening habits (people who listen to "a little bit of everything").

top-lift.png

Results

LINK TO CODE


Selecting support and confidence thresholds. First we need to determine appropriate minsup and minconf parameters to provide to the Apriori algorithm. We start by setting the confidence threshold to 50%. To inform the support threshold, we first plot the relative item frequencies in our transaction data (Figure 42). This gives us a sense of how many rules will be produced, and for which items, at a given support value. For example, setting minsup to 33% produces only 4 rules, involving only the pop and rock items; that threshold is too high. We have 72 different items in this dataset; if each item were equally likely, each would have a support of 1/72 ≈ 0.014. Therefore, we start by setting the support threshold to 1% (where the vertical red line is drawn in Figure 42). The resulting Apriori algorithm function call is as follows:
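The actual call is made in R (shown in the screenshot). For intuition about what that call does, here is a minimal pure-Python sketch of the Apriori level-wise search using the same thresholds (minsup = 0.01, minconf = 0.5); the transactions are toy stand-ins, not our real data:

```python
from itertools import combinations

# Toy transactions (hypothetical; our real data has 72 genre "items").
transactions = [frozenset(t) for t in (
    {"rock", "pop"}, {"rock", "pop", "indie"}, {"rock", "metal"},
    {"pop", "dance"}, {"rock", "pop", "dance"}, {"jazz"},
)]
MINSUP, MINCONF = 0.01, 0.5  # thresholds chosen above
N = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / N

# Level-wise search: candidate k-itemsets are built only from frequent
# (k-1)-itemsets (the "Apriori property").
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= MINSUP}]
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    frequent.append({c for c in candidates if support(c) >= MINSUP})

# Derive rules X --> Y with confidence = supp(X and Y) / supp(X) >= MINCONF.
rules = []
for level in frequent[1:]:
    for itemset in level:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = support(itemset) / support(lhs)
                if conf >= MINCONF:
                    rules.append((set(lhs), set(itemset - lhs), conf))

print(f"{len(rules)} rules found")
```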


Support top 15: The resulting top 15 rules for support are shown below. Unsurprisingly, these rules consist of the most common items/genres in our dataset. The RHS consists only of the top two items (rock & pop), which makes sense given how much more frequently these occur in the data compared to the other items. The top two rules have an empty LHS, which suggests that these items/genres "stand on their own" (a significant number of transactions include only that item). The network for these rules is also shown below.

apriori.png
top-support.png
support-network.png

Lift bottom 15: Out of curiosity, the bottom 15 rules for lift are also shown below. All of these lift values are close to 1.0, meaning that these rules describe relationships in which the LHS and RHS are nearly independent of each other. Interestingly, no rule was found with a lift value of less than one (which would imply a substitution effect).

bottom-lift.png

Conclusions

In this section, we used ARM and the Apriori algorithm to explore users' music listening behaviors. We analyzed the different styles of music (genres) that users listened to, and were able to derive frequent patterns of style preferences/music tastes that occurred across users in our data. One limitation of this initial analysis is its reliance on genre labels. By nature, certain labels are broader than others, resulting in their frequent and inconsistent use throughout the dataset (ex. two instances of "pop" music could be wildly different from one another). The result was a set of patterns heavily skewed towards these broader labels. In a future iteration of this analysis, songs could instead be labeled using the results from the clustering portion of this project. Cluster labels would provide a more precise, objective label for each song based on its acoustic attributes. The genre-labeling limitation observed here provides further support for alternative, genre-free forms of musical preference modeling.
