Introduction

Association rule mining is a data mining technique that identifies patterns or relationships between variables in a dataset. It involves finding frequent itemsets or combinations of items that occur together in a dataset. Association rule mining is widely used in various fields such as marketing, finance, healthcare, and e-commerce.

The insights provided by association rule mining include identifying the most frequently occurring itemsets, correlations between variables, and the association between products, customers, and buying patterns. Association rule mining helps in making informed decisions and developing targeted marketing campaigns.

Fig 1. Image from DataCamp showing a transactional dataset, a frequently occurring itemset of {Diapers and Beer} and a rule {Diapers} -> {Beer}

Common applications of association rule mining include market basket analysis (illustrated by the oversimplified example in the figure above), cross-selling and upselling, and customer segmentation. In market basket analysis, it is used to analyze the relationship between different products and customers’ buying patterns. In cross-selling and upselling, it is used to suggest complementary or higher-priced products to customers based on their buying patterns. In customer segmentation, it is used to group customers based on their preferences and buying patterns.

Data Format and metrics in ARM

The data format required for association rule mining is transactional data, which contains a list of items bought by a customer in a single transaction. The data can be represented in different formats such as binary format, transactional format, and market basket format. The binary format represents the presence or absence of an item in a transaction, the transactional format represents a list of items bought by a customer in a transaction, and the market basket format represents a list of items bought by multiple customers.
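As a rough illustration of the difference between the transactional and binary formats, the sketch below converts a small list of transactions into a presence/absence matrix. It is written in Python purely for illustration (the analysis in this project was done in R), and the item names and the helper to_binary are made up for this example.

```python
# Hypothetical toy data in transactional format: one list of items per transaction.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "eggs"],
    ["butter", "milk"],
]


def to_binary(transactions):
    """Convert transactional format to binary format.

    Returns the sorted item names (the columns) and one row of 0/1 flags
    per transaction, marking whether each item is present.
    """
    items = sorted({item for t in transactions for item in t})
    rows = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, rows


items, rows = to_binary(transactions)
print(items)
for row in rows:
    print(row)
```

Each column corresponds to one item and each row to one transaction, which is the shape most rule mining tools expect internally.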

The metrics used to judge the strength of a rule are summarized in the figure below and explained after it.

Fig 2. How to measure support, confidence and lift. Source: ResearchGate

The support of a rule in association rule mining refers to the frequency with which its itemset appears in a dataset. It is calculated by dividing the number of transactions that contain the itemset by the total number of transactions in the dataset. For example, if a dataset contains 1,000 transactions, and the itemset {A, B} appears in 100 transactions, then the support of {A, B} is 0.1 or 10%.

The confidence of a rule in association rule mining refers to the conditional probability of the consequent given the antecedent. It is calculated by dividing the number of transactions that contain both the antecedent and the consequent by the number of transactions that contain the antecedent. For example, if a dataset contains 1,000 transactions, and the antecedent {A} appears in 200 transactions, and the itemset {A, B} appears in 100 transactions, then the confidence of the rule {A} => {B} is 0.5 or 50%.
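The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project; the function names support and confidence and the synthetic transactions are assumptions, chosen to reproduce the counts in the examples above (1,000 transactions, {A} in 200 of them, {A, B} in 100).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent).

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Synthetic data matching the counts in the text: 100 transactions with {A, B},
# 100 more with only A, and 800 with an unrelated item C.
transactions = [["A", "B"]] * 100 + [["A"]] * 100 + [["C"]] * 800

print(support(transactions, {"A", "B"}))   # the 10% support from the text
print(confidence(transactions, {"A"}, {"B"}))  # the 50% confidence from the text
```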

Lift is a measure of the strength of association between two items in a dataset. It is a ratio of the observed support of both items occurring together to the expected support if the two items were independent of each other. A lift value of greater than 1 indicates a positive association between the two items, a value of 1 indicates no association, and a value less than 1 indicates a negative association.

The formula for lift is as follows:

lift(X, Y) = support(X ∪ Y) / (support(X) × support(Y))

where X and Y are two items (or, more generally, two itemsets), and support(X ∪ Y) is the fraction of transactions that contain all the items of both X and Y.
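A minimal Python sketch of the lift formula follows. The helper names and the five toy transactions are hypothetical and exist only to show the arithmetic.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def lift(transactions, x, y):
    """lift(X, Y) = support(X ∪ Y) / (support(X) × support(Y)).

    Assumes both X and Y occur at least once in the data.
    """
    joint = support(transactions, set(x) | set(y))
    return joint / (support(transactions, x) * support(transactions, y))


# Hypothetical toy data: a and b each occur in 3 of 5 transactions, together in 2.
transactions = [["a", "b"], ["a", "b"], ["a"], ["b"], ["c"]]

value = lift(transactions, {"a"}, {"b"})
print(round(value, 2))
```

Here the observed joint support (0.4) is slightly higher than the 0.36 expected under independence, so the lift comes out a little above 1.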

To calculate the lift for a rule, we first need to calculate the support and confidence for the rule. Let’s say we have a dataset of transactions that contains the following items: bread, butter, milk, and eggs. We want to calculate the lift for the rule “bread and butter” implies “milk”.

To calculate the support for the rule, we need to find the number of transactions that contain both “bread” and “butter”, as well as “milk”. Let’s say that 30 out of 100 transactions contain both “bread” and “butter”, and out of those 30 transactions, 20 also contain “milk”. The support for the rule is then:

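support({bread, butter, milk}) = 20/100 = 0.2

Next, we need the support of the antecedent {bread, butter} and of the consequent {milk}. The antecedent appears in 30 of the 100 transactions. Let’s also say that “milk” appears in 80 of the 100 transactions:

support({bread, butter}) = 30/100 = 0.3
support({milk}) = 80/100 = 0.8

The individual supports of “bread” (70/100 = 0.7) and “butter” (60/100 = 0.6) are not needed here, because lift compares the antecedent itemset as a whole with the consequent.

The confidence of the rule is then:

confidence({bread, butter} -> {milk}) = support({bread, butter, milk}) / support({bread, butter}) = 0.2 / 0.3 = 0.67

Finally, we can calculate the lift of the rule:

lift({bread, butter} -> {milk}) = support({bread, butter, milk}) / (support({bread, butter}) × support({milk})) = 0.2 / (0.3 × 0.8) = 0.83

Since the lift value is less than 1, we can conclude that there is a negative association between “bread and butter” and “milk”. In other words, customers who buy bread and butter are less likely to buy milk than customers overall (67% versus 80%).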

In conclusion, association rule mining is a valuable data mining technique used in various fields to discover relationships and patterns in large datasets. It requires transactional data and can provide insights into customer behavior, product relationships, and market trends. Support, confidence, and lift are important metrics used to measure the strength and significance of the association between different items or variables in a dataset.

Apriori Algorithm

The Apriori algorithm prunes the search for rules: any superset of an itemset that fails the minimum support threshold is discarded. In R, this threshold is specified in the parameters of the apriori function. A quick example to understand the Apriori algorithm is below:

Let’s assume there is a rule:

{beer} ==> {soap} with a support of 0.50.

This means that beer and soap are together in 50% of transactions. Now let’s look at another rule:

{beer} ==> {soap, tomatoes} (elements in this rule are a superset of the elements in the previous rule)

The rule above will have a support equal to or lower than 0.50. It is impossible for a superset rule to have a higher support than the subset rule, because every transaction that contains beer, soap, and tomatoes also contains beer and soap.

Apriori trims the results to remove any superset rules that don’t meet the desired support criteria.
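The pruning idea can be sketched as a level-wise search for frequent itemsets. The Python below is a simplified illustration, not the R apriori function used in this project: the function name apriori_frequent_itemsets and the four toy transactions are assumptions, and generating rules from the frequent itemsets is left out.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting `min_support`."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: keep only the single items that are frequent.
    current = {}
    for item in {i for t in transactions for i in t}:
        single = frozenset([item])
        value = sup(single)
        if value >= min_support:
            current[single] = value

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        current = {}
        for c in candidates:
            value = sup(c)
            if value >= min_support:
                current[c] = value
    return frequent


# Hypothetical toy data echoing the beer/soap/tomatoes example.
transactions = [
    ["beer", "soap"],
    ["beer", "soap", "tomatoes"],
    ["beer"],
    ["soap"],
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.5)
for itemset, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

With a minimum support of 0.5, tomatoes is dropped at the first level, so {beer, soap, tomatoes} is never generated as a candidate, let alone counted.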

Application in Electric vehicle sales prediction project

ARM will be used for discovery in this project, to find the strongest relationships between the item ‘electric vehicle’ (antecedent) and its consequents. The data used for this technique was extracted from the newsapi.org API, specifically the articles related to electric vehicles. This information will be used to brainstorm additional features that can help determine what impacts the sale of electric vehicles. ARM will also be used to find whether electric vehicles are associated with state names in these articles.

A further application of ARM in this project will be to use research papers about electric vehicles and find the characteristics that relate most closely to them.

Since this is a discovery method, the results of Association Rule mining will help determine strong relationships and advance the project in the right direction.

1. Data Prep for Association Rule Mining in R:

The data that is used for ARM in this project is sourced from newsapi.org using their API.

The data was in JSON format and was converted into a list of lists. The details of the data collection and cleaning process can be found on the Data Collection and Cleaning tab. However, stopwords were not removed from the final data found on that tab. To ensure appropriate and relevant rules, the stopwords need to be removed.
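The stopword step can be pictured with the short Python sketch below. It is not the project's R code: the tiny stopword list, the helper to_transaction, and the two sample sentences are all invented for illustration, and a real run would use a full stopword list and the actual article descriptions.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "for", "on"}


def to_transaction(text, stopwords=STOPWORDS):
    """Lowercase a description, split it into words, and drop stopwords.

    Each remaining word is kept once, so one description becomes one
    transaction of distinct items.
    """
    items = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in stopwords and token not in items:
            items.append(token)
    return items


# Made-up example sentences, not taken from the collected articles.
descriptions = [
    "The sale of electric vehicles is rising in the US",
    "Tesla cuts the price of an electric car",
]
transactions = [to_transaction(d) for d in descriptions]
for t in transactions:
    print(t)
```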

The data used for this analysis can be found here.

2. Process and Analysis:

Fig 3. Item frequency plot of the descriptions of 100 articles related to Electric vehicles

An item frequency plot can give insight into the contents of the data. Fig 3 shows the item frequency plot of the descriptions of articles about electric vehicles.

The words that are most frequent in the document are electric, vehicle, companies, tesla, car, vehicles, ev, company, tuesday (?), and cars. Although these are mostly expected words, this plot does not provide the information to be discovered. The purpose is to find associations and additional features that could be used for the analysis.
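The numbers behind an item frequency plot are simply the share of transactions in which each item appears. A minimal Python sketch, assuming a hypothetical helper item_frequencies and three invented transactions (the project's plot was produced in R from the real article descriptions):

```python
from collections import Counter


def item_frequencies(transactions):
    """Return (item, relative frequency) pairs, most frequent first.

    An item is counted at most once per transaction.
    """
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return [(item, count / n) for item, count in counts.most_common()]


# Invented toy transactions, for illustration only.
transactions = [
    ["electric", "vehicle", "tesla"],
    ["electric", "car"],
    ["electric", "vehicle"],
]
freqs = item_frequencies(transactions)
for item, share in freqs:
    print(item, round(share, 2))
```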

Fig 4. Top 15 rules sorted by highest support
Fig 5. Top 15 rules sorted by highest confidence
Fig 6. Top 15 rules sorted by highest lift

The above images do not provide a conclusive result. The highest support in the rules obtained from the dataset is 0.16. This means that electric and vehicle show up together in 16 of the 100 articles. The low support values and inconclusive results could be due to the small number of articles collected or the low quality of the articles from newsapi.org.
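Ranking rules by support, confidence, or lift, as in Figs 4 to 6, can be sketched as follows. This Python example is a simplified stand-in for the R analysis: it only builds one-item-to-one-item rules, and the helpers pair_rules and top_rules as well as the toy transactions are assumptions made for illustration.

```python
from itertools import permutations


def pair_rules(transactions):
    """Build every rule {a} -> {b} between co-occurring items, with its metrics."""
    sets = [set(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))

    def sup(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    rules = []
    for a, b in permutations(items, 2):
        joint = sup({a, b})
        if joint == 0:
            continue  # the two items never occur together
        rules.append({
            "lhs": a,
            "rhs": b,
            "support": joint,
            "confidence": joint / sup({a}),
            "lift": joint / (sup({a}) * sup({b})),
        })
    return rules


def top_rules(rules, by, n=15):
    """Return the `n` rules with the highest value of the metric `by`."""
    return sorted(rules, key=lambda r: r[by], reverse=True)[:n]


# Invented toy transactions, not the collected articles.
transactions = [
    ["elon", "musk", "tesla"],
    ["elon", "musk"],
    ["tesla", "ev"],
    ["ev"],
]
rules = pair_rules(transactions)
for metric in ("support", "confidence", "lift"):
    best = top_rules(rules, by=metric, n=1)[0]
    print(metric, best["lhs"], "->", best["rhs"], best[metric])
```

Sorting the same rule set by different metrics can surface different rules, which is why the three figures are shown separately.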

3. Code for above analysis:

The code for above analysis can be found here.

Results and conclusion

From the above three images, it can be concluded that the rules are vague and inconclusive. The expectation in applying this model to the news articles was to find factors that affect electric vehicle sales. The data shows that elon and musk are highly associated, electric and vehicle are highly associated, hyundai and rena have a high association and are prominent in the data, and japanese and rena, along with japanese and hyundai, have a high association. There were also rules pertaining to Joe Biden, which sheds light on governmental involvement in accelerating the adoption of electric vehicles. There is one rule that associates Warren with Buffett; this could reflect stock price implications of electric vehicles.

Overall, this approach to using ARM for the project was not successful. Perhaps the data needs to be discretized before the relationships between variables are examined for rules. Another approach could be to use research articles about electric vehicles and their growth instead of news articles about electric vehicles.