Abstract
In the last decade, the amount of collected data, in various computer science applications, has grown considerably. These large volumes of data need to be analysed in order to extract useful hidden knowledge. This work focuses on association rule extraction. This technique is one of the most popular in data mining. Nevertheless, the number of extracted association rules is often very high, and many of them are redundant. In this paper, we propose an algorithm, for mining closed itemsets, with the construction of an it-tree. This algorithm is compared with the DCI (direct counting & intersect) algorithm based on min support and computing time. CHARM is not memery-efficient. It needs to store all closed itemsets in the memory. The lower min-sup is, the more frequent closed itemsets there are so that the amounts of memory used by CHARM are increasing.
Author Contributions
Copyright© 2020
Fakir Youssef, et al.
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Competing interests The authors have declared that no competing interests exist.
Funding Interests:
Citation:
Introduction
The field of data mining appeared with the promise of providing the tools and techniques to discover useful and previously unknown knowledge in these data fields. Data mining has been adopted for researches dealing with the automatic discovery of implicit information or knowledge within the databases The issue of mining frequent itemsets emerges first as a sub problem of mining association rules. Frequent itemsets play a vital role in many data mining tasks that try to find compelling patterns from databases such as association rules, classifiers, correlations, clusters, sequences, and many more of which the mining of association rules is one of the most famous problems The original reason for searching association rules came from the need to analyse supermarket transaction data, that is, to examine client’s behaviour in terms of the purchased products. Association rules describe how often items are bought together. For example, an association rule: {milk-oil (76%)} assert that 6 out of 7 clients that bought milk also bought oil. Such rules can be practical for decisions about product pricing, promotions, store arrangement and many others. The extraction of association rules is primarily based on two fundamental notions: support and confidence. To determine these rules, support and confidence need be at least equal to minimum thresholds of minimum confidence and minimum support described via the user. For this, the search for association rules has been oriented towards these two objectives: a. Determine the set of frequent itemsets that appear in the database with support greater than or identical to minsup. The problem of the extraction of frequent itemsets is of exponential complexity in the size m of the set of items as the number of potential frequent itemsets is 2m. Then, the phase of searching for these items and frequent is the costliest extraction of association rules. b. Generate the set of associative rules, from these frequent itemsets, with a confidence measure greater than or identical to minconf. Indeed, the time of this phase is very small compared to the cost of extracting frequent itemsets because the generation of association rules is a problem that depends exponentially on the size of set in a frequent itemsets. Once all frequent itemsets and their support are known, the association rule generation is straightforward. Hence, the problem of mining association rules is reduced to the problem of determining frequent itemsets and their support. In this paper, we show that it is not requisite to mine all frequent itemsets to guarantee that all non-redundant association rules will be found. Therefor we are going to discuss two approaches: a. Approach based on the discovery of "closed" itemsets, coming from the theory of formal concepts propose to generate only a compact and generic subset of associative rules. This subset is much smaller than the size of the set of all rules. We show that it is sufficient to consider only the closed frequent itemsets. Moreover, all non-redundant rules are found by only considering rules among the closed frequent itemsets. The set of closed frequent itemsets is much smaller than the set of all frequent itemsets. This approach proposes to reduce the cost of extracting frequent itemsets based on the fact that the set of frequent closed itemsets is a generating set of the set of frequent itemsets. This approach makes it possible to decrease the number of extracted rules by keeping only the interesting ones to give the possibility to better visualize them and exploit them. b. Approach that uses maximal frequent itemsets: A maximal set of elements is a frequent set of elements that is not included in an appropriate superset that is a common set of elements. The set of frequent maximal items is therefore a subset of the set of frequent closed items, which is a subset of frequent itemsets. That makes the set of frequent maximum items usually a lot smaller than the set of frequent items and smaller than the set of frequent closed items. This paper also gives the comparison of algorithms based on execution time and support value. The paper is organized as follows. Section 2 deals with the association Rule Mining Algorithms (CHARM algorithm). Section 3 present the experimental results. Conclusion is given in section 4. After developing the main ideas behind closed association rule mining, we now present CHARM Developed by Zaki and al CHARM simultaneously explores the item space and the transaction space, above a new IT-tree CHARM uses a highly efficient hybrid search method that ignores multiple levels of the computer tree to quickly identify frequent closed-element sets, instead of having to enumerate many possible subsets. It uses a hash-based fast approach to remove non-closed items when checking for under-consumption. CHARM also uses a new vertical representation of data called diffset Enumeration of closed sets using a double tree of itemset-tidset (itemset -transaction identification set) search. Using the technique called diffsets to reduce the memory footprint of intermediate calculations. Finally, uses a hash-based fast approach to remove all "unclosed" sets found during the calculation. The pseudo algorithm of CHARM is shown in CHARM begins by initializing the class of prefixes P of the nodes to be examined by the frequent 1-itemsets and their associated tidsets (transaction identification set). The two generic steps are instantiated as follows: This step is implemented via the CHARM-PROPERTY procedure ( This stage is implemented via the CHARM-EXTEND procedure. It combines the IT-pairs, which appear in the class of prefixes P. For each IT pair Xi × (Xi)J, it combines it with other IT pairs Xj × (Xj)J following it in lexicographic order. Each Xi will generate a new class of prefixes Pi, which would initially be empty. The two IT-pairs combined will produce a new pair X × Y, where X = Xi ∪Xj and Y= (Xi) ∩ J (Xj)J. Finally, the algorithm gives in its output FC (The Set of Frequent Closed Itemsets) that we seek. The algorithm is programmed on Java. We illustrate the CHARM algorithm on the following example that describes purchased products in an electronics store ( In Let Itemset X, t (X) be the set of all tidset that contains X. CHARM searches for frequent closed sets on a new search space in the IT-tree where each node is a pair X × t (X), for example: AT × 135. All children in node X share the same X prefix and belong to the same equivalence classes. According to these, we can set our It-tree as illustrated in Initially we have five branches, corresponding to the five items and their tidset from our example database. To generate the children of item When we try to extend We now start processing the Finally, we remove
Scanner
PC
Notebook
Laser
Printer
A
C
D
T
W
Transaction Id
Purchased items
1
A C T W
2
C D W
3
ACTW
4
ACDW
5
A C D T W
6
C D T
A
C
D
T
W
TID
A
C
D
T
W
1
1
2
1
1
1
1
1
1
1
3
2
4
3
2
2
1
1
1
4
3
5
5
3
3
1
1
1
1
5
4
6
6
4
4
1
1
1
1
5
5
5
1
1
1
1
1
6
6
1
1
1
Purchased items
Apparition in
count
Purchased items
Apparition in
Count
A
1345
4
DW
245
3
C
123456
6
TW
135
3
D
2456
4
ACD
45
2
T
1356
4
ACT
135
3
W
12345
5
ACW
1345
4
AC
1345
4
CDT
56
2
AD
45
2
CDW
245
3
AT
135
3
CTW
135
3
AW
1345
4
DTW
5
1
CD
2456
4
ACDT
5
1
CT
1356
4
ACDW
45
2
CW
12345
5
CDTW
5
1
DT
56
2
ACDTW
5
1
Results
The Algorithms was coded in Java using the dataset collected from electronics stors. For the comparison we used DCI-closed, a famous algorithm for mining frequent closed itemsets to be compared with CHARM Algorithm. For performance comparison, we used the original source or object code for DCI-closed provided to us by
Conclusion
In this paper, an efficient algorithm (called CHARM) for mining closed frequent itemsets is presented. Using an new IT-Tree framework, this algorithm explores simultaneously the itemset space and tidset space The IT-Tree skips many level to identify quickly the closed frequent itemsets. According to the experiment CHARM perform better on higher minsup compared to the algorithm DCI-Closed for mining closed patterns. CHARM faces a memory-inefficient challenge since it needs to maintain all closet itemsets in the memory to check if an itemset is closed or not. For this reason, an improvement of CHARM is necessary. As future work were are going to optimise CHARM.in order to solve the memory-inefficient.