
Essay: Problem Definition – finding frequent itemsets in large transactional datasets

  • Published: 27 February 2016
  • Last Modified: 2 September 2024
The main problem associated with finding frequent itemsets in large transactional datasets is establishing association rules among those itemsets. Many algorithms have been introduced, and several have been enhanced to suit particular datasets and situations. This thesis discusses horizontal-layout algorithms (e.g. Apriori, AprioriTID, Apriori Hybrid) and projected-layout algorithms (e.g. FP-growth, PrePost, PrePost+, FIN). These algorithms have different strengths and weaknesses depending on the dataset or situation in which they are used. As a measure of performance, the memory consumption and average execution time of each algorithm were determined and compared.
3.1 Problem Statement
Let I = {i1, i2, i3, …, in} be a set of items, where n is considered the dimensionality of the problem. Let DB be a transactional database in which each transaction T carries a unique identifier TID and a set of items such that T ⊆ I. A transaction T is said to contain an itemset X, called a pattern, if X ⊆ T ⊆ I.
The problem is to find all itemsets X in DB whose support count is greater than or equal to a given minimum support threshold.
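As a concrete illustration, the support test described above can be sketched in Java as follows. The transaction database and itemset here are toy values for illustration only, not drawn from the datasets used in this thesis:

```java
import java.util.*;

public class SupportCount {
    // Support count of itemset X: the number of transactions T in DB
    // such that X ⊆ T.
    static int support(List<Set<String>> db, Set<String> x) {
        int count = 0;
        for (Set<String> t : db) {
            if (t.containsAll(x)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy transactional database.
        List<Set<String>> db = List.of(
            Set.of("a", "b", "c"),
            Set.of("a", "c"),
            Set.of("a", "d"),
            Set.of("b", "e"));
        Set<String> x = Set.of("a", "c");
        int minSup = 2;
        // {a, c} occurs in 2 of 4 transactions, so it meets the threshold.
        System.out.println(support(db, x) >= minSup); // prints true
    }
}
```

A frequent-itemset miner must find every such X efficiently rather than testing each candidate against the whole database, which is exactly what the algorithms compared in this thesis aim to do.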
3.2 Objective
The main objective of this thesis is to solve the above-mentioned problem by:
1) Applying the different algorithms to different datasets to find the frequent itemsets.
2) Comparing the algorithms on average execution time and memory usage to determine the best among them.
3.3 Motivation
Frequent itemset mining is a major field of study in data mining. Its applications range from association rule mining and correlation analysis to graph pattern determination and various other data mining tasks. The existence of several algorithms, both classical and recently developed, makes it interesting to determine the most suitable and efficient one to use. The major challenge in frequent itemset mining is the generation of a large result set, which is accompanied by huge memory consumption: if the threshold is set relatively low, an exponentially large number of itemsets is generated, and some algorithms consume much memory and take a long time to produce the frequent itemsets. The main motivation behind this thesis is to compare the classical algorithms, which provide a base for mining frequent itemsets, with the newly proposed state-of-the-art algorithms to see whether there has been a relative improvement in the field.
3.4 Methodology Used
To implement the proposed solution to the problem addressed in this thesis, the following methodology is used:
  • Implement the existing classical algorithms and identify their strengths and weaknesses through a literature survey.
  • Compare the existing classical algorithms with the newly proposed algorithms.
  • Develop a program for the problem and implement the techniques.
  • Validate the program using real-life datasets.
3.4.1 Datasets
Implementing this methodology requires collecting the data to be used. In this thesis, for the purpose of testing the developed program, data is collected from the standard UCI repository. Collecting suitable data is challenging because of dataset characteristics such as the number of records and the sparseness of the data (each record contains only a small portion of the items). For this experiment, we chose datasets with different properties to expose the strengths and weaknesses of the algorithms. Table 3.1 shows the datasets and their characteristics.
Table 3.1: Datasets and their characteristics

Dataset    No. of items   No. of transactions   Type     Size (KB)
Auto       26             205                   Dense    30
Colic      24             368                   Dense    63
Hepatitis  20             155                   Sparse   17
Zoo        18             101                   Sparse   15
3.4.2 Technique
To carry out the above experiment, a standalone GUI application was developed in Java. To better compare the strengths and weaknesses of the algorithms, the datasets obtained from the UCI repository are imported into the program, which executes each algorithm and produces the frequent itemsets (as an output file), the time taken to produce them, the memory consumption, the total number of itemsets produced, and the itemset size at which the algorithm terminates.
The data produced by the program is collected for each algorithm in a separate table at different levels of the support threshold (5–20%). After collecting the data for each algorithm, the results of execution on each dataset are plotted to determine the most effective algorithm.
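The time and memory measurements described above can be sketched in Java roughly as follows. This is a minimal sketch: `measure` and the placeholder workload are illustrative names, not the actual classes of the developed program, and JVM heap readings are approximate by nature:

```java
import java.util.function.Supplier;

public class MiningBenchmark {
    // Measures wall-clock time (ms) and approximate extra heap usage (KB)
    // of one mining run. The Supplier stands in for any algorithm under test.
    static long[] measure(Supplier<?> algorithm) {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // encourage a collection so the "before" reading is meaningful
        long memBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        algorithm.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long usedKb = (rt.totalMemory() - rt.freeMemory() - memBefore) / 1024;
        return new long[] { elapsedMs, Math.max(usedKb, 0) };
    }

    public static void main(String[] args) {
        long[] r = measure(() -> {
            // Placeholder workload standing in for a frequent-itemset run.
            java.util.List<Integer> sink = new java.util.ArrayList<>();
            for (int i = 0; i < 100_000; i++) sink.add(i);
            return sink.size();
        });
        System.out.println("time(ms)=" + r[0] + " mem(KB)=" + r[1]);
    }
}
```

In practice each algorithm would be run at every support threshold in the 5–20% range and the two measurements recorded per run before plotting.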
3.5 Comparison
Table 3.2: Literature-based comparison of the algorithms

Algorithm | Technique | Storage structure | Memory utilization | Time
Apriori | Uses the Apriori property with a join-and-prune method | Array based | Takes a large amount of memory due to large candidate set generation | Takes much time, again due to large candidate set generation
AprioriTID | Uses the apriori_gen function for candidate generation but prunes using the support of previous candidate subsets | Array based | Takes significantly less memory than standard Apriori since the database is scanned only once | Takes relatively less time than Apriori
FP-growth | Constructs a conditional frequent pattern tree and conditional pattern base from the portion of the database that satisfies the threshold | Tree based | Requires less memory due to its compact structure and the absence of candidate generation | Execution time depends on the dataset, with some overhead from building the compact structure
PrePost+ | Constructs a PPC-tree, determines the N-lists of frequent patterns, and uses superset equivalence for pruning | Tree based | Requires less memory than Apriori because no candidate set is generated, but more than FP-growth | Relatively faster than classical Apriori and roughly on par with FP-growth, sometimes faster
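For reference, the join-and-prune step that drives Apriori's candidate generation (the source of the large candidate sets noted in the table) can be sketched as follows. This is a minimal illustration of the textbook step, not the implementation used in this thesis; itemsets are represented as sorted lists:

```java
import java.util.*;

public class AprioriStep {
    // Join step: merge two frequent (k-1)-itemsets sharing their first k-2
    // items. Prune step: drop any candidate with an infrequent (k-1)-subset
    // (the Apriori property).
    static List<List<String>> candidates(List<List<String>> frequentKminus1) {
        Set<List<String>> prev = new HashSet<>(frequentKminus1);
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < frequentKminus1.size(); i++) {
            for (int j = i + 1; j < frequentKminus1.size(); j++) {
                List<String> a = frequentKminus1.get(i), b = frequentKminus1.get(j);
                int k1 = a.size();
                // Join only if the first k-2 items agree.
                if (!a.subList(0, k1 - 1).equals(b.subList(0, k1 - 1))) continue;
                List<String> cand = new ArrayList<>(a);
                cand.add(b.get(k1 - 1));
                Collections.sort(cand);
                // Prune: every (k-1)-subset must itself be frequent.
                boolean keep = true;
                for (int drop = 0; drop < cand.size(); drop++) {
                    List<String> sub = new ArrayList<>(cand);
                    sub.remove(drop);
                    if (!prev.contains(sub)) { keep = false; break; }
                }
                if (keep) out.add(cand);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Frequent 2-itemsets; {a,b,c} survives because {a,b}, {a,c}, {b,c}
        // are all frequent, while {b,c,d} is pruned because {c,d} is not.
        List<List<String>> f2 = List.of(
            List.of("a", "b"), List.of("a", "c"),
            List.of("b", "c"), List.of("b", "d"));
        System.out.println(candidates(f2)); // prints [[a, b, c]]
    }
}
```

Every candidate produced here must still be counted against the whole database, which is why this family of algorithms pays heavily in both memory and time, and why the tree-based methods in the table avoid candidate generation altogether.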
Thus, from the literal comparison above, PrePost+ and FP-growth appear more effective than the standard Apriori algorithm: the issue of large candidate set generation has been eliminated entirely.
This comparison is justified more rigorously in the next chapter, where it is carried out on numerical data and presented in graphical form.
