A C++ Frequent Itemset Mining Template Library

Frequent Itemset Mining (FIM) is the most researched field of frequent pattern mining. Over one hundred FIM algorithms were proposed - the majority claiming to be the most efficient. To clarify this chaos and the contradictions, two FIMI competitions were organized. It not only resulted in some excellent implementations but the community also get better understanding (limits, drawbacks, advantages) of the algorithms. Although FIMI'03 and FIMI'04 raised higher the level of publication, it has not solved many problem and we still suffer from the lack of understanding. For more information about the curse of FIM click here.

The separate work of different research groups, however, cannot be so efficient than tightly cooperating groups that interchange their knowledge. The goal of this FIM environment is to support this tight cooperation by providing an environment that includes base library functions, like IO, coding, decoding, ... and some high level description of the most important algorithms, such as apriori, eclat, FP-growth together with some very efficient implementations of them. The environment also contains many test classes to test the solutions for specific problems.

Flexibility and plugability are achieved by a template-based solution, that does not prevent the compiler form code optimization (like virtual function-based solution), which results in codes that are competitive with the best solutions.

If you use this library in a research paper then a citation to either of the following papers is welcome.

Balázs Rácz, Ferenc Bodon, Lars Schmidt-Thieme, Benchmarking Frequent Itemset Mining Algorithms: from Measurement to Analysis, ACM SIGKDD Workshop on Open Source Data Mining Workshop (OSDM'05), in Bart Goethals and Siegfried Nijssen and Mohammed J. Zaki editors, pages 36 - 45, Chicago, IL, USA. 2005.
[PDF] [PS] [Slides] [BibTex]
Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economic, 2006,
[PDF] [PS],[BibTex]

Programs included

The library includes an some optimized input/output and coding/decoding classes, allocators, many well designed data structures (trie, Patricia-tree, ) database cachers, some very efficient Apriori, Eclat and FP-growth algorithms, an Apriori algorithm that finds frequent sequences of items and an association rule miner that uses an Apriori to find frequent itemsets. Also, there exist many test classes that aims to test the efficiency of differrent techniques, like transaction caching, dead-end pruning, equisupport pruning, candidate generation technique, etc.

Major programs included in the Library

Name Description arguments

compile command

main source file

fim This program includes the most important FIM algorithms, i.e. Apriori, Eclat, FP-growth There are 4 mandatory parameters:

the type of algorithm to run, i.e. apriori, apriori-lowmem, fp-growth

the file that contain the transactions in market-basket format,

absolute support threshold.

the output file the frequent itemsets are sent to,

There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine.
Example:
fim apriori ../data/kosarak.dat 1500 output.txt 8

make ../fim^*

main.cpp

assoc_rule This is an association rule miner, that uses Apriori to find the frequent itemsets There are 5 mandatory parameters:

The file that contain the transactions in market-basket format,

absolute support threshold.

confidence threshold

lift threshold

the output file the frequent itemsets are sent to,

Example:
assoc_rule ../data/kosarak.dat 1500 0.2 3 output.txt

make assoc_rule^*

main-assoc-rule.cpp

apriori_simple A simplified Apriori implementation that omits many speed-up techniques. We intend this program for students or researchers, who would like to have an efficient but still easy-to-understand and easy-to-modify Apriori implementation. If you don't find the speed of this Apriori satisfactory, you can check the implementation in main.cpp. There are 3 mandatory parameters:

the file that contain the transactions in market-basket format,

absolute support threshold.

the output file the frequent itemsets are sent to,

There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine.
Example:
./apriori-simple ../data/kosarak.dat 1500 output.txt 8

make apriori_simple^*

main-apriori-simple.cpp

fsm An Apriori-based frequent item sequence miner that uses a trie to store the candidates. More information is found in the paper:

Ferenc Bodon, Trie-based APRIORI Implementation for Mining Frequent Itemsequences, ACM SIGKDD Workshop on Open Source Data Mining Workshop (OSDM'05), in Bart Goethals and Siegfried Nijssen and Mohammed J. Zaki editors, pages 56 - 65, Chicago, IL, USA. 2005.
[PDF] [PS][Slides][BibTex] There are 4 mandatory parameters:

the type of algorithm to run, i.e. apriori, apriori-noprune, apriori-intersectprune

the file that contain the transactions in market-basket format,

absolute support threshold.

the output file the frequent itemsets are sent to,

There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine.
Example:
fsm apriori ../data/seq/kosarak2_10_4.dat 1500 output.txt 5

make ../fsm^*

main-fsm.cpp

^*Before compiling any program please run the cd src;make dep command.

Major programs included in the Library
Name	Description	arguments
fim	This program includes the most important FIM algorithms, i.e. Apriori, Eclat, FP-growth	There are 4 mandatory parameters: the type of algorithm to run, i.e. apriori, apriori-lowmem, fp-growth the file that contain the transactions in market-basket format, absolute support threshold. the output file the frequent itemsets are sent to, There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine. Example: `fim apriori ../data/kosarak.dat 1500 output.txt 8`
make ../fim^*
main.cpp
assoc_rule	This is an association rule miner, that uses Apriori to find the frequent itemsets	There are 5 mandatory parameters: The file that contain the transactions in market-basket format, absolute support threshold. confidence threshold lift threshold the output file the frequent itemsets are sent to, Example: `assoc_rule ../data/kosarak.dat 1500 0.2 3 output.txt`
make assoc_rule^*
main-assoc-rule.cpp
apriori_simple	A simplified Apriori implementation that omits many speed-up techniques. We intend this program for students or researchers, who would like to have an efficient but still easy-to-understand and easy-to-modify Apriori implementation. If you don't find the speed of this Apriori satisfactory, you can check the implementation in `main.cpp`.	There are 3 mandatory parameters: the file that contain the transactions in market-basket format, absolute support threshold. the output file the frequent itemsets are sent to, There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine. Example: `./apriori-simple ../data/kosarak.dat 1500 output.txt 8`
make apriori_simple^*
main-apriori-simple.cpp
fsm	An Apriori-based frequent item sequence miner that uses a trie to store the candidates. More information is found in the paper: Ferenc Bodon, Trie-based APRIORI Implementation for Mining Frequent Itemsequences, ACM SIGKDD Workshop on Open Source Data Mining Workshop (OSDM'05), in Bart Goethals and Siegfried Nijssen and Mohammed J. Zaki editors, pages 56 - 65, Chicago, IL, USA. 2005. [PDF] [PS][Slides][BibTex]	There are 4 mandatory parameters: the type of algorithm to run, i.e. apriori, apriori-noprune, apriori-intersectprune the file that contain the transactions in market-basket format, absolute support threshold. the output file the frequent itemsets are sent to, There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine. Example: `fsm apriori ../data/seq/kosarak2_10_4.dat 1500 output.txt 5`
make ../fsm^*
main-fsm.cpp

Download

binaries:

For

linux (i386, gcc 3.4.5),
windows Note! To run windows binaries you have to have cygwin1.dll in your path. The simplest solution is that you can copy the cygwin1.dll that I have used to the folder where exe files are located. Nevertheless, I propose you to download the latest dll from the website of cygwin

source:

fim_env.tar.bz2: Source code and documentation (date of issue: 17. September 2006.)

Before compiling please run the cd src;make dep command. The compile commands are given in the table above.

To get (update) your local copy of the documentation run make doc in the src directory. The documentation will be found under doc/html. Please note, that you need doxygen and graphviz installed to be able to generate the documentation.

Documentation

On-line documentation
some theoretical remars are found in the following papers (please go over these works before writting us emails):
1. Balázs Rácz, Ferenc Bodon, Lars Schmidt-Thieme, Benchmarking Frequent Itemset Mining Algorithms: from Measurement to Analysis, ACM SIGKDD Workshop on Open Source Data Mining Workshop (OSDM'05), in Bart Goethals and Siegfried Nijssen and Mohammed J. Zaki editors, pages 36 - 45, Chicago, IL, USA. 2005.
  [PDF] [PS] [Slides] [BibTex]
2. Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economic, 2006,
  [PDF] [PS],[BibTex]

Test environment

Main developers

Ferenc Bodon (apriori, I/O, trie)
Balázs Rácz (fp-growth, allocators, trie, patricia-tree, scripts)
Lars Schmidt-Thieme (eclat)

Useful links

Similar template library, called DMTL, was proposed by Hasan et al. The purpose of DMTL differs from the purpose of our library. The DTML is more pattern and application oriented, we concentrate on algorithms and data structures. DMTL is an excellent tool if we would like to mine a new type of patterns using the existing algorithms, and our library comes at hand if we would like to change a part of an existing algorithm. In DTML the algorithms are the building blocks, while in our library we disassemble the methods as much as it makes sense.

Todo list

Note, that the library is in developement phase. It will be extended by many new classes and functionalities, some interfaces will change, the documentation will be improved and a restructuring is also expected.

A more detailed todo list is found here.

Name	Description	arguments
compile command
main source file
fim	This program includes the most important FIM algorithms, i.e. Apriori, Eclat, FP-growth	There are 4 mandatory parameters: the type of algorithm to run, i.e. apriori, apriori-lowmem, fp-growth the file that contain the transactions in market-basket format, absolute support threshold. the output file the frequent itemsets are sent to, There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine. Example: `fim apriori ../data/kosarak.dat 1500 output.txt 8`
make ../fim^*
main.cpp
assoc_rule	This is an association rule miner, that uses Apriori to find the frequent itemsets	There are 5 mandatory parameters: The file that contain the transactions in market-basket format, absolute support threshold. confidence threshold lift threshold the output file the frequent itemsets are sent to, Example: `assoc_rule ../data/kosarak.dat 1500 0.2 3 output.txt`
make assoc_rule^*
main-assoc-rule.cpp
apriori_simple	A simplified Apriori implementation that omits many speed-up techniques. We intend this program for students or researchers, who would like to have an efficient but still easy-to-understand and easy-to-modify Apriori implementation. If you don't find the speed of this Apriori satisfactory, you can check the implementation in `main.cpp`.	There are 3 mandatory parameters: the file that contain the transactions in market-basket format, absolute support threshold. the output file the frequent itemsets are sent to, There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine. Example: `./apriori-simple ../data/kosarak.dat 1500 output.txt 8`
make apriori_simple^*
main-apriori-simple.cpp
fsm	An Apriori-based frequent item sequence miner that uses a trie to store the candidates. More information is found in the paper: Ferenc Bodon, Trie-based APRIORI Implementation for Mining Frequent Itemsequences, ACM SIGKDD Workshop on Open Source Data Mining Workshop (OSDM'05), in Bart Goethals and Siegfried Nijssen and Mohammed J. Zaki editors, pages 56 - 65, Chicago, IL, USA. 2005. [PDF] [PS][Slides][BibTex]	There are 4 mandatory parameters: the type of algorithm to run, i.e. apriori, apriori-noprune, apriori-intersectprune the file that contain the transactions in market-basket format, absolute support threshold. the output file the frequent itemsets are sent to, There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine. Example: `fsm apriori ../data/seq/kosarak2_10_4.dat 1500 output.txt 5`
make ../fsm^*
main-fsm.cpp