Apriori algorithm

Apriori[1] is an algorithm for frequent item set mining and association rule learning over relational databases.

The frequent item sets found by Apriori can be used to derive association rules, which highlight general trends in the database; this has applications in domains such as market basket analysis.
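As a rough illustration of how rules might be derived from frequent item sets, the following Python sketch splits each item set into an antecedent and a consequent and keeps the splits whose confidence is high enough. The confidence measure (support of the whole set divided by support of the antecedent), the `min_confidence` threshold, the function name `rules_from_itemsets`, and the support counts are all assumptions made for the example; none of them is specified in this article.

```python
from itertools import combinations


def rules_from_itemsets(support, min_confidence):
    """Derive rules (antecedent, consequent, confidence) from a support table.

    `support` maps frozenset item sets to their support counts and is assumed
    to contain every non-empty subset of each item set it lists.
    """
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                antecedent = frozenset(left)
                confidence = count / support[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical support counts, invented for this example.
support = {
    frozenset({"bread"}): 6,
    frozenset({"butter"}): 3,
    frozenset({"bread", "butter"}): 3,
}
rules = rules_from_itemsets(support, min_confidence=0.8)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

With these invented counts, only the rule from "butter" to "bread" passes the threshold, since every transaction containing butter also contains bread, while only half of the bread transactions contain butter.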

Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or records of visits to a website or of IP addresses[2]).

Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing).

Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.

Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently.

Once the candidates for a given length have been generated, it scans the transaction database to determine which of them are frequent.

At each step, the algorithm is assumed to generate the candidate sets from the large (that is, frequent) item sets of the preceding level, heeding the downward closure lemma: every subset of a frequent item set must itself be frequent.
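The candidate generation step can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `generate_candidates`, the use of `frozenset` to represent item sets, and the sample data are assumptions made for the example. Two frequent sets of size k-1 are joined when their union has exactly k items, and the result is kept only if all of its subsets of size k-1 are frequent, as downward closure requires.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Build size-k candidates from the frequent item sets of size k-1."""
    frequent_prev = set(frequent_prev)
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) != k:
                continue  # the two sets do not differ by exactly one item
            # Downward closure: drop the candidate if any (k-1)-subset
            # is not among the frequent sets of the preceding level.
            if all(frozenset(sub) in frequent_prev
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# Hypothetical frequent pairs, invented for this example.
level2 = {frozenset(p) for p in [(1, 2), (1, 4), (2, 3), (2, 4), (3, 4)]}
level3 = generate_candidates(level2, 3)
print(sorted(sorted(c) for c in level3))  # [[1, 2, 4], [2, 3, 4]]
```

Here {1, 2, 3} is never counted against the data because its subset {1, 3} is not frequent, which is the saving that downward closure provides.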

The count of a candidate set is accessed as a field of the data structure that represents that candidate set.

Many details are omitted below; usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.
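One possible rendering of the level-wise procedure in Python is given here as a sketch under stated assumptions. It is not the original pseudocode, and it deliberately replaces the hash tree with a plain dictionary of counts, which is simpler but slower. The function name `apriori`, its parameters, and the sample transactions are illustrative choices.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent item set to its support count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every individual item in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Candidate generation: join sets of size k-1, then prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One scan of the database counts all candidates of this level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate is contained in the transaction
                    counts[c] += 1
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions over numeric item identifiers.
database = [{1, 2, 3, 4}, {1, 2, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {3, 4}, {2, 4}]
result = apriori(database, min_support=3)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The inner loop that tests every candidate against every transaction is the part a hash tree is meant to speed up; the dictionary version is kept here only for readability.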

Assume that a large supermarket tracks sales data by stock-keeping unit (SKU) for each item: each item, such as "butter" or "bread", is identified by a numerical SKU.

To determine the frequent item sets of such a database, we will say that an item set is frequent if it appears in at least 3 transactions of the database: the value 3 is the support threshold.

The first step of Apriori is to count up the number of occurrences, called the support, of each member item separately.

By scanning the database for the first time, we obtain the support of each individual item. All the item sets of size 1 have a support of at least 3, so they are all frequent.
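The first pass can be sketched in Python as a single count over the database. The seven transactions here are a hypothetical stand-in chosen so that every item meets the threshold of 3; they are not claimed to be the article's own data, and the names `transactions`, `item_support`, and `frequent_items` are illustrative.

```python
from collections import Counter

# Hypothetical transactions; each integer stands for an item's SKU.
transactions = [
    {1, 2, 3, 4},
    {1, 2, 4},
    {1, 2},
    {2, 3, 4},
    {2, 3},
    {3, 4},
    {2, 4},
]
SUPPORT_THRESHOLD = 3

# One scan: the support of an item is the number of transactions containing it.
item_support = Counter(item for t in transactions for item in t)

# Keep the items whose support reaches the threshold.
frequent_items = {item for item, n in item_support.items()
                  if n >= SUPPORT_THRESHOLD}

print(dict(sorted(item_support.items())))  # {1: 3, 2: 6, 3: 4, 4: 5}
print(sorted(frequent_items))              # [1, 2, 3, 4]
```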

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms.

Candidate generation produces large numbers of subsets (the algorithm attempts to load the candidate set with as many subsets as possible before each scan of the database).

In the worst case, the time and space complexity of the algorithm is exponential in the horizontal width (the total number of items) present in the database.

Later algorithms such as Max-Miner[3] try to identify the maximal frequent item sets without enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-up approach.