Association rule learning

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.

In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics.[3]

Following the original definition by Agrawal, Imieliński and Swami,[2] the problem of association rule mining is defined as: Let I = {i₁, i₂, …, iₙ} be a set of n binary attributes called items, and let D = {t₁, t₂, …, tₘ} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.

Every rule is composed of two different sets of items, also known as itemsets, X and Y, where X is called the antecedent or left-hand side (LHS) and Y the consequent or right-hand side (RHS).

Association rules are found by searching the data for frequent if-then patterns and using the criteria of support and confidence to identify the most important relationships.

Support indicates how frequently an itemset appears in the given data, while confidence indicates how often the if-then statement has been found to be true.
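As a minimal illustration (the transactions below are invented, and the helper functions are ours rather than from any particular library), both measures can be computed directly from a list of transactions:

```python
# Invented example data: each transaction is a set of purchased items.
transactions = [
    {"milk", "bread", "butter"},
    {"beer", "bread"},
    {"milk", "bread", "butter"},
    {"beer", "diapers"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the rule antecedent => consequent holds when the
    antecedent is present: supp(X u Y) / supp(X)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"milk", "bread"}, transactions))       # 0.6
print(confidence({"milk"}, {"bread"}, transactions))  # 1.0
```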

With the use of association rules, doctors can determine the conditional probability of an illness by comparing symptom relationships from past cases.

Other examples of where support can be used are in finding groups of genetic mutations that work collectively to cause a disease, investigating the number of subscribers that respond to upgrade offers, and discovering which products in a drug store are never bought together.

For larger datasets, a minimum threshold, or a percentage cutoff, for the confidence can be useful for determining item relationships.

Setting thresholds emphasizes the items that co-occur most often, allowing the strongest associations between items to emerge as the data is explored further.

Overall, using confidence in association rule mining is a good way to bring awareness to relations in the data. However, confidence alone does not account for how common the consequent is, so a rule can look strong simply because its consequent appears in many transactions.

This is why it is important to look at other viewpoints, such as support × confidence, instead of relying solely on a single measure to define the relationships.

If the lift is greater than 1, the antecedent and consequent occur together more often than expected under independence, which makes such rules potentially useful for predicting the consequent in future data sets.
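As an illustration (the numbers below are invented), lift compares the rule's confidence with the baseline frequency of the consequent, so it equals 1 exactly when X and Y are independent:

```latex
\mathrm{lift}(X \Rightarrow Y)
  = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}
  = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
```

For example, with supp(X) = 0.4, supp(Y) = 0.5, and supp(X ∪ Y) = 0.3, the lift is 0.3 / (0.4 × 0.5) = 1.5 > 1; under independence, supp(X ∪ Y) would equal supp(X) · supp(Y) and the lift would be exactly 1.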

The concept of association rules was popularized particularly due to the 1993 article of Agrawal et al.,[2] which has acquired more than 23,790 citations according to Google Scholar, as of April 2021, and is thus one of the most cited papers in the Data Mining field.

If we apply a statistical test for independence with a significance level of 0.05, it means there is only a 5% chance of accepting a rule if there is no association.

Apriori was introduced by R. Agrawal and R. Srikant in 1994 for frequent itemset mining and association rule learning.

Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently.

Example: Assume that each row is a cancer sample with a certain combination of mutations labeled by a character in the alphabet.
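A minimal sketch of the level-wise Apriori idea under these assumptions (in-memory transactions given as sets of single-letter mutation labels as in the example above, support measured as an absolute count); this is an illustration, not the paper's exact pseudocode, and it omits the hash-tree optimization:

```python
# Level-wise Apriori sketch: generate k-item candidates from frequent
# (k-1)-itemsets, prune with the downward-closure property, then count.
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support count is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support}
        frequent |= current
        k += 1
    return frequent

# Rows as mutation sets labeled by letters, as in the example above.
samples = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(samples, min_support=3))
```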

FP-growth often outperforms Apriori and Eclat because it performs no candidate generation or testing, uses a compact data structure, and requires only two scans of the database.

ECLAT (Equivalence Class Transformation) is a backtracking algorithm that traverses the frequent itemset lattice in a depth-first search (DFS) fashion.

For example, a DFS may check the nodes in the frequent itemset lattice in the order {a} → {a, b} → {a, b, c}; once {a, b, c} is found to satisfy the support constraint, it is known by the downward-closure property that {b}, {c}, {a, c}, and {b, c} all satisfy it as well.
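A compact sketch of the ECLAT idea under simple assumptions (each item carries the set of transaction IDs containing it, and support is an absolute count); the recursion mirrors the DFS traversal just described, with tid-set intersection giving the support of each extended itemset:

```python
# ECLAT-style DFS over the itemset lattice using tid-set intersections.
def eclat(prefix, items, min_support, frequent):
    """items: list of (item, tidset) pairs extendable from `prefix`."""
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_support:
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Descend: intersect this tid-set with each remaining item's.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            eclat(itemset, suffix, min_support, frequent)

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

frequent = {}
eclat(set(), sorted(tidsets.items()), min_support=2, frequent=frequent)
print(frequent)  # each frequent itemset with its support count
```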

Items in each transaction have to be sorted in descending order of their frequency in the dataset before being inserted, so that the tree can be processed quickly.

If many transactions share the most frequent items, the FP-tree provides high compression close to the tree root.
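A simplified sketch of this construction (illustrative only; a full FP-growth implementation would also maintain a header table of node links for the mining phase):

```python
# Two-pass FP-tree construction: count item frequencies, then insert each
# transaction with items in descending frequency order so that shared
# prefixes collapse into common paths near the root.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First scan: count item frequencies and keep only frequent items.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = FPNode(None, None)
    # Second scan: insert transactions sorted by descending frequency
    # (ties broken alphabetically for a deterministic tree).
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

tree = build_fp_tree([{"a", "b"}, {"a", "b", "c"}, {"a"}], min_support=2)
```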

Once the recursive process has completed, all frequent item sets will have been found, and association rule creation begins.
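As a hedged sketch of this final step (the function name and the dict-of-counts input are our own conventions, e.g. the `frequent` mapping produced by the ECLAT sketch above), rules can be generated by splitting each frequent itemset into an antecedent and a consequent and keeping those that meet a confidence threshold:

```python
# Rule generation from frequent itemsets. By downward closure, every
# subset of a frequent itemset is also frequent, so its count is available.
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """frequent: dict mapping frozenset itemsets to support counts."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules
```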

The ASSOC procedure[32] is a GUHA method which mines for generalized association rules using fast bit-string operations.

The association rules mined by this method are more general than those output by Apriori; for example, "items" can be connected with both conjunction and disjunction, and the relation between the antecedent and consequent of the rule is not restricted to minimum support and confidence as in Apriori: an arbitrary combination of supported interest measures can be used.

OPUS is an efficient algorithm for rule discovery that, in contrast to most alternatives, does not require either monotone or anti-monotone constraints such as minimum support.

The well-known "beer and diapers" anecdote, in which supermarket shoppers who bought diapers were purportedly also likely to buy beer, became popular as an example of how unexpected association rules might be found from everyday data.

Daniel Powers says:[36] In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores.

Some variants of association rule mining include:

- Generalized Association Rules: hierarchical taxonomy (concept hierarchy)
- Quantitative Association Rules: categorical and quantitative data
- Interval Data Association Rules: e.g. partitioning age into 5-year-increment ranges
- Sequential pattern mining: discovers subsequences that are common to more than minsup (minimum support threshold) sequences in a sequence database, where minsup is set by the user.

A Venn diagram showing the associations between itemsets X and Y of a dataset. Transactions that contain item X are located in the white, left portion of the circle, while those containing Y are colored red and on the right. Any transaction containing both X and Y is located in the middle and colored pink. Multiple concepts can be read from this diagram. For example, if one divides the number of transactions in the pink section by the total number of transactions in the dataset, the result is the support of X ∪ Y. To obtain the confidence, one divides the transactions in the middle (pink) by all transactions that contain Y (red and pink); in this case, Y is the antecedent and X is the consequent.
Frequent itemset lattice, where the color of the box indicates how many transactions contain the combination of items. Note that an itemset lower in the lattice can appear in at most as many transactions as the least frequent of its parents; e.g. {a, c} can occur in at most min(support({a}), support({c})) transactions. This is called the downward-closure property.[2]
The control flow diagram for the Apriori algorithm