Mining Non Redundant Association Rules


CHAPTER 3

Introduction

A Non-Taxonomy dataset (sometimes known as a single level/flat dataset) is one where there is no concept hierarchy involved in the dataset. All items in the dataset are unrelated. None of the items is a super-topic or sub-topic of another. In general, the attributes are independent of each other.

Association rule mining is a very important task in the process of data mining. The general objective is to find frequent co-occurrences of items within a set of transactions. These co-occurrences are called associations. The idea of discovering such rules derives from market basket analysis, where the goal is to mine patterns describing customers' purchase behavior. A simple association rule can look as follows:

computer ⇒ antivirus_software [support = 10%, confidence = 80%]

This rule expresses the relationship between computer and antivirus_software. The support measure states that computer and antivirus_software appeared together in 10% of all transactions. The confidence measure describes the chance that antivirus_software appears in a transaction that also contains computer: in the above rule, 80% of all transactions involving computer also involved antivirus_software. Association rules are considered interesting if they satisfy user-specified thresholds, the minimum support threshold (minsupp) and the minimum confidence threshold (minconf). An association rule is strong if it satisfies both minsupp and minconf. When the user-specified thresholds are very low, a large number of association rules can be extracted from a dataset; in such cases the number of discovered rules may be huge and many of them may be redundant.
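To make the two measures concrete, the following sketch computes support and confidence over a toy transaction list. The transactions and item names here are illustrative only; they are not taken from any dataset in this chapter.

```python
# Toy transactions; contents are made up for illustration.
transactions = [
    {"computer", "antivirus_software", "printer"},
    {"computer", "antivirus_software"},
    {"computer", "mouse"},
    {"printer", "mouse"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A => B) = supp(A u B) / supp(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"computer", "antivirus_software"}, transactions))       # 0.5
print(confidence({"computer"}, {"antivirus_software"}, transactions))  # ~0.667
```

Here the rule computer ⇒ antivirus_software holds in 2 of 4 transactions (support 50%) and in 2 of the 3 transactions containing computer (confidence about 67%).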

In this chapter, the proposed work determines and removes these redundant association rules using the concept lattices of Formal Concept Analysis and Min-Max approximate rules. A brief description is given of related work on non-redundant association rules, Formal Concept Analysis, and concise representations of extracted rules and their quality. Finally, the algorithm for mining non-redundant association rules from single-level datasets using concept analysis is presented.

RELATED WORK

Various strategies have been proposed for mining non-redundant frequent itemset patterns. This has led to work on problems such as mining Closed Itemsets [44], Maximal Itemsets [16], Non-Derivable Itemsets [55], and pattern profiles [70].

Zaki et al. [58] define a closed itemset as an itemset whose support is not equal to the support of any of its proper supersets. Closed itemsets with support greater than the user-specified support threshold (minsupp) are frequent closed itemsets [73]. An itemset is a maximal itemset if none of its supersets is frequent [25]. A frequent itemset is derivable if its support can be exactly inferred from the supports of its subsets based on the inclusion-exclusion principle; otherwise it is known as a non-derivable itemset [56]. These itemsets are used for non-redundant association rule mining.

Closed Itemsets [44] and Non-Derivable Itemsets (NDI) [55] are lossless forms of compressing frequent itemset patterns: the frequent itemsets and their associated frequency counts can be derived from the compressed representation. Calders and Goethals [55] found that for some datasets and support thresholds the number of NDI is less than the number of closed itemsets, while for other datasets and thresholds the number of closed itemsets is less than the number of NDI [55]. Maximal Itemsets allow greater compression than Closed Itemsets, but the representation is lossy: the frequent itemsets themselves can be computed exactly, but the exact frequency counts associated with them cannot be determined. There are other lossy representations besides maximal itemsets. The top-k patterns approach by Han et al. [27] presents the k most frequent closed itemsets to the end user. Xin et al. [66] extended this work to extract redundancy-aware top-k frequent itemset patterns. Error-tolerant patterns by Yang et al. [71] and Pei et al. [48] allow a certain amount of fluctuation in evaluating the supports of itemset patterns. An approach by Afrati et al. [10] uses k itemsets to recover a collection of frequent itemsets; however, it is difficult to recover the support information with their approach. Yan et al. [70] demonstrated that the pattern-profile approach can effectively summarize itemsets for non-redundancy; however, from an efficiency perspective it is not clear how well this approach scales to large datasets, and there is no clear way to improve this. Moreover, pattern profiles are not itemset patterns themselves, and it remains an open question how to use them for data analysis.

For these reasons, the proposed work discovers non-redundant association rules for Non-Taxonomy datasets using Formal Concept Analysis.

Formal Concept Analysis

R. Wille [11, 62, 63] introduced Formal Concept Analysis (FCA). FCA has been applied in many fields such as psychology, sociology, anthropology, medicine, biology, linguistics, computer science, mathematics, and industrial engineering.

Formal concept analysis (FCA) is a method of data analysis used across various domains. FCA analyzes data that describe a relationship between a particular set of transactions and a particular set of items. A distinguishing feature of FCA is its inherent integration of three components of conceptual processing of data and knowledge, namely:

The discovery and reasoning with concepts

Discovery of and reasoning with dependencies in data

Visualization of data concepts

Integration of these components makes FCA a powerful tool in data and knowledge engineering applied to various problems [62].

A concept is a unit of thought consisting of two parts, the extension and the intension. The extension covers all objects belonging to the concept, and the intension comprises all attributes valid for all those objects. Hence, objects and attributes play a prominent role, together with several relations such as the hierarchical "subconcept-superconcept" relation between concepts, the implication between attributes, and the incidence relation "an object has an attribute" [11, 70].

R. Wille [62] introduced Formal Concept Analysis in connection with partial orders and lattice theory. A relation R on a set A is a partial order on A if R is reflexive, antisymmetric, and transitive on A. This partial order is a lattice if every two-element subset of A has a least upper bound (join) and a greatest lower bound (meet). This order and lattice theory is connected with Galois connections induced by relations. The theory is based on a set-theoretical model for conceptual hierarchies. This model describes a concept as a unit of thought consisting of two parts, the extension and the intension (comprehension).

Formal Context

Radim et al. [5] define FCA as a method for the analysis of object-attribute data tables. An object-attribute data table describing which objects have which attributes can be identified with a triplet (T, I, R), where T is a non-empty set of objects in a dataset, I is a non-empty set of attributes (items), and R ⊆ T × I is an object-attribute relation. Objects and attributes correspond to table rows and columns, respectively; thus objects and attributes can be treated as transactions and items in a dataset (a dataset is also a table representation). The element (t, i) ∈ R indicates that object (transaction) t has attribute i. The table entry corresponding to row t and column i contains 1 if (t, i) ∈ R; otherwise it contains 0. In terms of FCA, a triplet (T, I, R) is called a formal context. In formal contexts, the relation (t, i) ∈ R (also written tRi) is read as follows: the object or transaction t is in relation R to the attribute i [5, 11, 62, 63].

In the basic setting, the input data to FCA is a table called a cross-table, which describes a relationship between objects or transactions (represented by table rows) and attributes or items (represented by table columns). An example of such a table is Table 3.1, where rows represent transactions and columns represent items bought in a store. A table entry containing 1 indicates that the corresponding transaction contains the corresponding item; an entry containing 0 (or left blank) indicates that it does not.

TID | A | B | C | D | E
1   | 1 | 1 | 0 | 1 | 1
2   | 0 | 1 | 1 | 0 | 1
3   | 1 | 1 | 0 | 1 | 1
4   | 1 | 1 | 1 | 0 | 1
5   | 1 | 1 | 1 | 1 | 1
6   | 0 | 1 | 1 | 1 | 0

Table 3.1. Cross-table.

For a formal context, an object t from T is a transaction and an element i from I is an attribute (item); (t, i) ∈ R indicates that transaction t contains item i. For a given cross-table with n rows and m columns, the corresponding formal context (T, I, R) consists of a set T = {t1, t2, …, tn}, a set I = {i1, i2, …, im}, and a relation R defined by (ti, ij) ∈ R if and only if the table entry in row i and column j contains the value 1.
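As a small sketch, the cross-table of Table 3.1 can be turned into the formal-context triplet (T, I, R) directly; the 0/1 rows below simply transcribe the table:

```python
# Rows of Table 3.1 as 0/1 vectors over the item columns A..E.
rows = {1: [1, 1, 0, 1, 1], 2: [0, 1, 1, 0, 1], 3: [1, 1, 0, 1, 1],
        4: [1, 1, 1, 0, 1], 5: [1, 1, 1, 1, 1], 6: [0, 1, 1, 1, 0]}
items = ["A", "B", "C", "D", "E"]

T = set(rows)                                   # objects (transactions)
I = set(items)                                  # attributes (items)
R = {(t, items[j]) for t, row in rows.items()   # incidence relation
     for j, v in enumerate(row) if v == 1}

print((1, "A") in R, (2, "A") in R)  # True False
```

The membership test (t, i) in R is exactly the reading "transaction t is in relation R to item i".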

Concept - Forming Operators

The formal context induces a pair of operators called concept-forming operators [5]. For a formal context (T, I, R), the concept-forming operators α: 2^I → 2^T and β: 2^T → 2^I are defined for every X ⊆ I and Y ⊆ T as

Xα = {t ∈ T | (t, i) ∈ R for every i ∈ X} (1)

Yβ = {i ∈ I | (t, i) ∈ R for every t ∈ Y} (2)

Here X is a subset of I and Y is a subset of T: Xα is the set of all transactions from T that contain every item of X, and Yβ is the set of all items from I shared by every transaction of Y.

For example, considering Table 3.1, the concept-forming operators give

{A}α = {1, 3, 4, 5}, {B, E}α = {1, 2, 3, 4, 5}

Similarly, {1}β = {A, B, D, E}, {1, 2, 3, 4}β = {B, E}
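These operator applications can be checked with a minimal Python sketch; the dictionary below encodes which items each transaction of Table 3.1 contains:

```python
# Context of Table 3.1: transaction -> items it contains.
context = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
           4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def alpha(X):
    """X^alpha: all transactions containing every item of X."""
    return {t for t, its in context.items() if set(X) <= its}

def beta(Y):
    """Y^beta: all items shared by every transaction of Y."""
    common = set("ABCDE")
    for t in Y:
        common &= context[t]
    return common

print(alpha({"A"}))        # {1, 3, 4, 5}
print(beta({1, 2, 3, 4}))  # {'B', 'E'}
```

The pair (alpha, beta) is exactly the pair of concept-forming operators of equations (1) and (2) for this context.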

The notion of a formal concept is fundamental in FCA. Formal concepts are particular clusters in cross-tables, defined by means of attribute sharing. [5, 62, 63 and 70]

Formal concept:

A formal concept in (T, I, R) is a pair (X, Y) with X ⊆ I and Y ⊆ T such that Xα = Y and Yβ = X [11, 70].

For a formal concept (X, Y) in (T, I, R), X and Y are called the extent and intent of (X, Y). (X, Y) is a formal concept if and only if X contains items sharing all transactions from Y and Y contains transactions shared by all items from X [5, 11 and 70].

For example, {A, B, E}α = {1, 3, 4, 5} and {1, 3, 4, 5}β = {A, B, E}. This can be represented as the formal concept (X1, Y1) = ({A, B, E}, {1, 3, 4, 5}).

Subconcept – Superconcept ordering:

For formal concepts (X1, Y1) and (X2, Y2) of (T, I, R), (X1, Y1) ≤ (X2, Y2) if X1 ⊆ X2 (equivalently, Y2 ⊆ Y1) [11, 70].

≤ represents the subconcept-superconcept ordering.

(X1, Y1) ≤ (X2, Y2) means that (X1, Y1) is the subconcept (a join in the lattice) and (X2, Y2) is the superconcept (a meet in the lattice).

The collection of all formal concepts of a given formal context is a concept lattice of FCA [11, 70].

Concept Lattice

Denoted by (T, I, R) the collection of all formal concepts of (T, I, R) i.e.

(T, I, R) = {(X, Y)  2I x 2T | Xα = Y, Yβ = X} (3)

(T, I, R), equipped with the subconcept - superconcept ordering  is called a concept lattice of (T, I, R) and such a subconcept is called join lattice and superconcept is called meet lattice [11, 62, 63, and 70]
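For a context this small, the whole concept lattice can be enumerated by brute force, closing every subset of items. This sketch is exponential in |I| and meant for illustration only; it recovers the 10 concepts of the running example:

```python
from itertools import combinations

# Context of Table 3.1: transaction -> items it contains.
context = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
           4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}
items = "ABCDE"

def alpha(X):
    return frozenset(t for t, its in context.items() if set(X) <= its)

def beta(Y):
    common = set(items)
    for t in Y:
        common &= context[t]
    return frozenset(common)

# (X, Y) is a formal concept iff X^alpha = Y and Y^beta = X; every
# concept arises as (beta(alpha(S)), alpha(S)) for some itemset S.
concepts = {(beta(alpha(set(S))), alpha(set(S)))
            for r in range(len(items) + 1)
            for S in combinations(items, r)}

print(len(concepts))  # 10
```

The count 10 matches the Galois lattice discussed later (Figure 3.1).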

Galois Connection

With the help of the above concept-forming operators, the Galois connection is defined as follows.

A Galois connection between sets T and I is a pair (α, β), where α: 2^I → 2^T and β: 2^T → 2^I, satisfying for X, X1, X2 ⊆ I and Y, Y1, Y2 ⊆ T:

Monotonicity: X1 ⊆ X2 ⇒ X2α ⊆ X1α (4)

Monotonicity: Y1 ⊆ Y2 ⇒ Y2β ⊆ Y1β (5)

X ⊆ (Xα)β (6)

Y ⊆ (Yβ)α (7)

Closure Operator

A closure operator on a set T is a mapping c: 2^T → 2^T satisfying, for each Y, Y1, Y2 ⊆ T:

Extension: Y ⊆ c(Y) (8)

Monotonicity: Y1 ⊆ Y2 ⇒ c(Y1) ⊆ c(Y2) (9)

Idempotency: c(Y) = c(c(Y)) (10)

In particular, the composition Y ↦ (Yβ)α is a closure operator on T, and dually X ↦ (Xα)β is a closure operator on I.

The FCA operators and Galois connection defined above can be applied to generate association rules [11, 70].

Present Work on Mining Non-Redundant Association Rules using Concept Lattices

In the proposed work, the association rules are defined and derived using the concept lattices of Formal Concept Analysis. First, closed itemsets, association rules, and their measures support and confidence are defined using the concept lattices of FCA. Then, using FCA and closed itemsets, the non-redundant association rules are derived on a sample dataset.

The proposed approach is first applied on a sample dataset, and later on the Census and Mushroom datasets taken from the UCI KDD Repository [1].

Association Rules

Let I = {I1, I2, …, Im} be a set of m unique items, let t be a transaction containing a set of items from I (so that t ⊆ I), and let T be a database (dataset) containing a set of identifiable transactions t. As mentioned, an association rule is an implication of the form A ⇒ B, where A and B are itemsets and A ∩ B = ∅. If an item i appears in a transaction t, then i and t stand in a binary relation, denoted (t, i) ∈ R. The following mappings define the Galois connection of this binary relation, for X ⊆ I and Y ⊆ T [73]:

Xα = {t ∈ T | (t, i) ∈ R for every i ∈ X} (11)

Yβ = {i ∈ I | (t, i) ∈ R for every t ∈ Y} (12)

Here Xα is known as the transaction mapping of X, while Yβ is known as the item mapping of Y.

The Galois connection, defined in section 3.1.5, is used to define the measures of association rules.

Support: The support of an itemset A, denoted supp(A), is the percentage of the transactions that contain A. Therefore [73]

supp(A) = |Aα| / |T| (13)

where |Aα| is the number of transactions that contain A and |T| is the total number of transactions. This is similar to the probability of A, P(A).

Confidence: The confidence of an association rule of the form A ⇒ B, denoted conf(A ⇒ B), where A, B ⊆ I, is the percentage of the transactions containing A that also contain B. Therefore [73]

conf(A ⇒ B) = supp(A ∪ B) / supp(A) (14)

Frequent itemset: Let A be a set of items, T the transaction database, and minsupp the user-specified minimum support. An itemset X in A (i.e., X ⊆ A) is said to be a frequent itemset in T with respect to minsupp if supp(X) ≥ minsupp [73].

The problem of mining association rules decomposes into two sub-problems:

Find all itemsets whose support is greater than the user-specified minimum support, minsupp. Such itemsets are called frequent itemsets.

Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and AB are frequent itemsets, then we can determine whether the rule AB ⇒ CD holds by checking the following inequality:

supp(ABCD) / supp(AB) ≥ minconf, in which case the rule holds with confidence at least minconf.

From formal concept analysis, the set of all frequent itemsets forms only a meet semilattice: for any two frequent itemsets, only their meet is guaranteed to be frequent, while their join may or may not be frequent. This follows from the basic principle of association rule mining: if an itemset is frequent, then all its subsets are also frequent [11, 70].

To illustrate the process of mining association rules using formal concept analysis, the following sample of the Mushroom dataset is considered, represented in Table 3.2. The Mushroom dataset contains 8125 rows and 23 attributes. Each row is identified with a transaction id (TID); 5 attributes and 6 rows are considered here to illustrate the proposed concepts. The 5 attributes are named A = "gill-size", B = "gill-spacing", C = "ring-number", D = "ring-type", E = "gill-color".

TID | Items
1   | ABDE
2   | BCE
3   | ABDE
4   | ABCE
5   | ABCDE
6   | BCD

Table 3.2. Mushroom Sample Dataset

There are 2^5 (= 32) itemsets in total. {A}, {B}, {C}, {D}, and {E} are 1-itemsets, {AC} is a 2-itemset, and so on. All frequent itemsets with minimum support = 50% are shown in Table 3.3.

Itemsets                              | Support
B                                     | 100%
E, BE                                 | 83%
A, C, D, AB, AE, BC, BD, ABE          | 66%
AD, CE, DE, ABD, ADE, BDE, BCE, ABDE  | 50%

Table 3.3. Frequent Itemsets for Mushroom Sample Dataset

ABDE and BCE are maximal-by-inclusion frequent itemsets, i.e., they are not subsets of any other frequent itemset.
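The 19 frequent itemsets of Table 3.3 can be reproduced with a brute-force pass over all 31 non-empty itemsets. This is practical only at this toy scale; on real data an Apriori-style pruning of the search space would be used instead:

```python
from itertools import combinations

# Transactions of Table 3.2.
transactions = [set("ABDE"), set("BCE"), set("ABDE"),
                set("ABCE"), set("ABCDE"), set("BCD")]
minsupp = 0.5

def support(X):
    """Fraction of transactions containing every item of X."""
    return sum(1 for t in transactions if set(X) <= t) / len(transactions)

frequent = [frozenset(X) for r in range(1, 6)
            for X in combinations("ABCDE", r) if support(X) >= minsupp]

print(len(frequent))                 # 19
print(support("B"), support("DE"))   # 1.0 0.5
```

The counts agree with Table 3.3: B has support 100%, and the maximal itemsets ABDE and BCE sit at exactly the 50% threshold.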

Generating Confident Rules

This step is relatively straightforward: rules of the form X ⇒ Y are generated, where X and Y are built from the generated frequent itemsets and the confidence p ≥ minconf. The following table shows the generated confident rules.

Association Rules                       | Confidence
A→B, A→E, A→BE, C→B, D→B, E→B           | 100%
AB→E, AD→B, AD→E, AE→B, CE→B            | 100%
DE→A, DE→B, AD→BE, DE→AB, ABD→E         | 100%
ADE→B, BDE→A                            | 100%
B→E                                     | 83.33%
E→AB, BE→A, E→A                         | 80%
B→AE                                    | 66.67%

Table 3.4: Association Rules for Mushroom Sample Dataset

From the generated frequent itemset ABE, 6 possible rules can be generated: A ⇒ BE, B ⇒ AE, E ⇒ AB, AB ⇒ E, AE ⇒ B, and BE ⇒ A.

Now, in the proposed work, the closed itemset is defined with the help of the Galois connection defined in section 3.3.6.

3.4.3 Closed Itemset: If X is a subset of I, then X is a closed itemset if and only if (Xα)β = X.

A pair (X, Xα) is written as X × Xα, and a pair (Y, Yβ) as Y × Yβ. The mapping Xα gives the set of transactions shared by all items in X; similarly, Yβ gives the set of items shared by all transactions in Y. For example, {A, B, E}α = {1, 3, 4, 5} and {2, 4, 5}β = {B, C, E}. In terms of individual elements, Xα and Yβ are represented as

Xα = ⋂ i∈X {i}α (15)

Yβ = ⋂ t∈Y {t}β (16)

For example, {A, B, E}α = {A}α ∩ {B}α ∩ {E}α = {1, 3, 4, 5} ∩ {1, 2, 3, 4, 5, 6} ∩ {1, 2, 3, 4, 5} = {1, 3, 4, 5}. Similarly, {1}β ∩ {2}β = {A, B, D, E} ∩ {B, C, E} = {B, E}.

Let X ⊆ I and Y ⊆ T. Let (i, t)(X) denote the composition of the two mappings, (i, t)(X) = (Xα)β. Dually, let (t, i)(Y) = (Yβ)α. Then (i, t): 2^I → 2^I and (t, i): 2^T → 2^T are both closure operators, on itemsets and on transaction sets respectively.

A closed itemset is an itemset X that is the same as its closure, i.e., X = (i, t)(X); for example, the itemset {A, B, E} is closed. A closed transaction set is a transaction set with Y = (t, i)(Y); for example, the transaction set {1, 3, 4, 5} is closed. The mappings (i, t) and (t, i) are closure operators. For example, let X = {A, B}; then X is a proper subset of its closure, since (i, t)({A, B}) = ({A, B}α)β = {1, 3, 4, 5}β = {A, B, E}. Since (i, t)({A, B}) ≠ {A, B}, the set {A, B} is not closed. The idempotency property says that the closure of a closed itemset is the itemset itself: mapping an itemset to its transaction set and that transaction set back to an itemset returns the same itemset for a closed itemset. For example, consider the itemset {A, B, E}: (i, t)({A, B, E}) = ({A, B, E}α)β = {1, 3, 4, 5}β = {A, B, E}. Hence {A, B, E} is a closed itemset.
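A sketch of the itemset closure (i, t) on the same toy context reproduces the {A, B} versus {A, B, E} check above:

```python
# Context of Table 3.2: transaction -> items it contains.
context = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
           4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def alpha(X):
    return {t for t, its in context.items() if set(X) <= its}

def beta(Y):
    common = set("ABCDE")
    for t in Y:
        common &= context[t]
    return common

def closure(X):
    """(i, t)(X) = (X^alpha)^beta, the itemset closure."""
    return beta(alpha(X))

print(closure({"A", "B"}))                          # {'A', 'B', 'E'}: not closed
print(closure({"A", "B", "E"}) == {"A", "B", "E"})  # True: closed
print(closure(closure({"A", "B"})) == closure({"A", "B"}))  # True: idempotent
```

The last line checks idempotency (equation (10)): applying the closure twice changes nothing.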

A concept X1 × Y1 is a subconcept of X2 × Y2, denoted X1 × Y1 ≤ X2 × Y2, iff X1 ⊆ X2 (or iff Y2 ⊆ Y1). Let C denote the set of all possible concepts in the database. Then the ordered set (C, ≤) is a complete lattice, called the Galois lattice. For example, Figure 3.1 shows the Galois lattice for the given example database, which has 10 concepts. The least element is the concept {B} × {1, 2, 3, 4, 5, 6} ({B}α = {1, 2, 3, 4, 5, 6} and {1, 2, 3, 4, 5, 6}β = {B}), and the greatest element is the concept {A, B, C, D, E} × {5}. Notice that the mappings between the closed pairs of itemsets and transaction sets are anti-isomorphic, i.e., concepts with large-cardinality itemsets have small transaction sets, and vice versa.

Figure 3.1: Galois Lattice for Sample Mushroom Dataset

The concept generated by a single item x ∈ I is called an item concept and is given as i(x) = (i, t)({x}) × {x}α. Similarly, the concept generated by a single transaction y ∈ T is called a tid concept and is given as t(y) = {y}β × (t, i)({y}). For example, the item concept i({A}) = ({A}α)β × {A}α = {1, 3, 4, 5}β × {1, 3, 4, 5} = {A, B, E} × {1, 3, 4, 5}. Further, the tid concept t({2}) = {2}β × ({2}β)α = {B, C, E} × {B, C, E}α = {B, C, E} × {2, 4, 5}.

In Figure 3.1, each concept is labeled with its item concept or tid concept. This is equivalent to a minimal labeling of the lattice with item or tid labels; these labels are shown in bold in the figure.

It is easy to reconstruct the concepts from the minimal labeling. For example, consider the tid concept t({2}) = {2}β × ({2}β)α. To obtain the closed itemset X, the proposed work appends all item labels reachable below it; conversely, to obtain the closed transaction set Y, it appends all tid labels reachable above t({2}). Here E, C, and B are the item labels reachable by a path below it, so X = {B, C, E} forms the closed itemset; 4 and 5 are the tid labels reachable above t({2}), so Y = {2, 4, 5}. This gives the concept {B, C, E} × {2, 4, 5}, which matches the concept shown in Figure 3.2.

Figure 3.2. Meet Semi Lattice of Frequent Itemset for Sample Mushroom Dataset

For any itemset X, its support is equal to the support of its closure, i.e., supp(X) = supp ((i, t)(X)).

This states that all frequent itemsets are determined by the frequent closed itemsets (frequent concepts). Furthermore, the set of frequent closed itemsets is bounded above in size by the set of frequent itemsets, and is typically much smaller, especially for dense datasets; for very sparse datasets, the two sets may be equal. To illustrate the benefits of closed itemset mining, contrast Figure 3.2, showing the set of all frequent itemsets, with the set of all frequent closed itemsets shown in Table 3.5. There are only seven frequent closed itemsets, out of 19 frequent itemsets. This example clearly illustrates the benefits of mining the closed frequent itemsets.

Tidset          | Frequent Concepts
{2,4,5}         | {B,C,E}
{1,3,5}         | {A,B,D,E}
{1,3,4,5}       | {A,B,E}
{1,2,3,4,5}     | {B,E}
{1,3,5,6}       | {B,D}
{2,4,5,6}       | {B,C}
{1,2,3,4,5,6}   | {B}

Table 3.5: Frequent Concepts for Mushroom Sample Dataset
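Combining the brute-force frequent-itemset pass with the closure operator collapses the 19 frequent itemsets to the 7 frequent concepts of Table 3.5. Again, this is a toy-scale sketch, not the mining algorithm one would run on a full dataset:

```python
from itertools import combinations

# Context of Table 3.2: transaction -> items it contains.
context = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
           4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def alpha(X):
    return {t for t, its in context.items() if set(X) <= its}

def beta(Y):
    common = set("ABCDE")
    for t in Y:
        common &= context[t]
    return common

def support(X):
    return len(alpha(X)) / len(context)

frequent = [X for r in range(1, 6) for X in combinations("ABCDE", r)
            if support(X) >= 0.5]
# Distinct closures of the frequent itemsets = frequent closed itemsets.
closed_frequent = {frozenset(beta(alpha(X))) for X in frequent}

print(len(frequent), len(closed_frequent))  # 19 7
```

Since supp(X) = supp((i, t)(X)), no support information is lost in this compression.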

Association Rule Generation Using Concept Lattice

In the last section, we showed that the support of an itemset X equals the support of its closure (i, t)(X). Thus, it suffices to consider rules only among the frequent concepts: the rule X1 ⇒ X2 with confidence p is the same as the rule (i, t)(X1) ⇒ (i, t)(X2) with the same confidence p.

From the concept lattice, it is sufficient to consider rules among adjacent concepts, since other rules can be inferred by transitivity, that is:

Transitivity: Let X1, X2, X3 be frequent closed itemsets with X1 ⊆ X2 ⊆ X3. If X1 ⇒ X2 holds with confidence p and X2 ⇒ X3 with confidence q, then X1 ⇒ X3 holds with confidence pq.
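The transitivity property can be checked numerically on the toy context. For the chain of closed itemsets B ⊆ BE ⊆ ABE, the confidence of the composite rule is the product of the two step confidences:

```python
# Context of Table 3.2: transaction -> items it contains.
context = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
           4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def alpha(X):
    return {t for t, its in context.items() if set(X) <= its}

def conf(X1, X2):
    """Confidence of the rule X1 => X2."""
    return len(alpha(set(X1) | set(X2))) / len(alpha(X1))

p = conf("B", "BE")    # 5/6
q = conf("BE", "ABE")  # 4/5
print(abs(conf("B", "ABE") - p * q) < 1e-9)  # True: confidences multiply
```

This works because with X1 ⊆ X2 ⊆ X3 the supports telescope: supp(X3)/supp(X1) = (supp(X2)/supp(X1)) · (supp(X3)/supp(X2)).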

The proposed work considers two cases of association rules, those with 100% confidence, i.e., with p = 1.0 (Exact rules), and those with p < 1.0 (Approximate Rules).

Exact Rules with 100% confidence

An association rule X1 ⇒ X2 has confidence p = 1.0 if and only if X1α ⊆ X2α, i.e., all 100% confidence rules are those directed from a super-concept (X1 × X1α) to a sub-concept (X2 × X2α), because X2α ⊇ X1α (since X2 ⊆ X1 for the corresponding closed itemsets). For example, consider the item concepts i({E}) = {B, E} × {1, 2, 3, 4, 5} and i({B}) = {B} × {1, 2, 3, 4, 5, 6}. The rule E ⇒ B is a 100% confidence rule. Note that if we take the itemset closure on both sides of the rule, we obtain BE ⇒ B, i.e., a rule between closed itemsets; but since the antecedent and consequent are not disjoint in this case, we prefer to write the rule as E ⇒ B, although both rules are exactly the same.

Association Rules        | Confidence
DE→A, DE→AB, BDE→A       | 100%
A→E, A→BE, AB→E          | 100%
E→B                      | 100%
D→B                      | 100%
C→B                      | 100%

Table 3.6: Rules with 100% confidence for Mushroom Sample Dataset

In Table 3.6, the rules shown are the most general ones. For example, consider the rules DE ⇒ A, DE ⇒ AB, and BDE ⇒ A. We prefer the rule DE ⇒ A, since the latter two are obtained by adding one (or more) items to either the antecedent or the consequent of DE ⇒ A. In other words, DE ⇒ A is more general than the latter two rules. In fact, the addition of B to either the antecedent or the consequent has no effect on the support or confidence of the rule. In this case, the other two rules are redundant.

Let Ri stand for a 100% confidence rule Xi ⇒ Yi, and let R = {R1, R2, …, Rn} be a set of rules such that I1 = (i, t)(Xi ∪ Yi) and I2 = (i, t)(Yi) for all rules Ri. Then all the rules in R are equivalent, corresponding to the same edge of the lattice between the concepts with itemsets I1 and I2; hence all but the most general rule are redundant.

The first rule that (i, t)({D, E}{A}) = (i, t)({A, D, E}) = {A, B, D, E}. Similarly for the other two rules (i, t)({D, E}{A, B}) = (i, t)({A, B, D, E}) = {A, B, D, E}, and (i, t)({B, D, E}{A}) = (i, t)({A, B, D, E}) = {A, B, D, E}. Thus for these three rules, the closed itemset is I1 = {A, B, D, E}, similarly I2 = {A, B, E}. All three rules correspond to the edge between the tid concept t({1, 3}) and the item concept i({A}). Finally is the most general rule and so other are redundant.

A set of such general rules constitutes a generating set, i.e., a rule set from which all other 100% confidence rules can be inferred. Note that this proposal does not address the question of eliminating self-redundancy within the generating set, i.e., there may still exist rules in the generating set that can be derived from other rules in the set. In other words, nothing is claimed about the minimality of the generating set.
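The redundancy test above can be sketched as a grouping step: 100% confidence rules sharing the closures I1 (of antecedent ∪ consequent) and I2 (of consequent) fall into one class, from which only the shortest, most general rule is kept. The three rules of the example collapse to DE ⇒ A:

```python
# Context of Table 3.2: transaction -> items it contains.
context = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
           4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def alpha(X):
    return {t for t, its in context.items() if set(X) <= its}

def beta(Y):
    common = set("ABCDE")
    for t in Y:
        common &= context[t]
    return common

def closure(X):
    return frozenset(beta(alpha(X)))

rules = [(set("DE"), set("A")), (set("DE"), set("AB")), (set("BDE"), set("A"))]
groups = {}
for ant, con in rules:
    key = (closure(ant | con), closure(con))  # (I1, I2)
    groups.setdefault(key, []).append((ant, con))

# Keep the shortest (most general) rule of each equivalence class.
kept = [min(rs, key=lambda r: len(r[0]) + len(r[1])) for rs in groups.values()]
print(kept)  # [({'D', 'E'}, {'A'})]
```

The length-based tie-break is one simple way to pick the most general representative; it is an implementation choice, not prescribed by the theory.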

Association Rules | Confidence
DE→A              | 100%
A→E               | 100%
E→B               | 100%
D→B               | 100%
C→B               | 100%

Table 3.7: Generating set with 100% confidence

Table 3.7 shows the generating set, which includes the 5 most general rules: DE ⇒ A, A ⇒ E, E ⇒ B, D ⇒ B, and C ⇒ B. All other 100% confidence rules can be derived from this generating set by application of simple inference rules. For example, the rule A ⇒ B follows by transitivity from the two rules A ⇒ E and E ⇒ B, and the rule A ⇒ BE can be obtained by combining the two rules A ⇒ E and A ⇒ B. It can easily be verified that all the 100% confidence rules produced using frequent itemsets, as shown in Table 3.4, can be generated from this set of 5 rules, produced using the closed frequent itemsets.

Approximate Rules with confidence less than 100%

We now turn to the problem of finding a generating set for the rules with confidence less than 100%. In this case, the rules go from sub-concepts to super-concepts.

Let Ri stand for a rule Xi ⇒ Yi with confidence p < 1.0, and let R = {R1, R2, …, Rn} be a set of rules such that I1 = (i, t)(Xi) and I2 = (i, t)(Xi ∪ Yi) for all rules Ri. Then all the rules in R are equivalent, and all but the most general rule are redundant.

The three rules E→A, E→AB and BE→A can be applied to the above theorem. We get I1 = (i, t)({E}) = ({E}α)β = {1, 2, 3, 4, 5}β = {B, E}, and I2 = (i, t)({E} ∪ {A}) = (i, t)({B, E} ∪ {A}) = ({A, B, E}α)β = {A, B, E}. The support of the rule is |(I1 ∪ I2)α| = |{A, B, E}α| = 4, and the confidence is given as |(I1 ∪ I2)α| / |I1α| = 4/5 = 0.8. Thus E ⇒ A is the most general rule with less than 100% confidence in this group, and the other two are redundant; similarly, B ⇒ E is the most general rule of its group.
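The worked computation for E ⇒ A can be replayed in the same sketch, recovering the support count 4 and the confidence 0.8:

```python
# Context of Table 3.2: transaction -> items it contains.
context = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
           4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def alpha(X):
    return {t for t, its in context.items() if set(X) <= its}

def beta(Y):
    common = set("ABCDE")
    for t in Y:
        common &= context[t]
    return common

def closure(X):
    return beta(alpha(X))

I1 = closure({"E"})                 # {'B', 'E'}
I2 = closure({"E", "A"})            # {'A', 'B', 'E'}
supp_count = len(alpha(I1 | I2))    # 4
conf = supp_count / len(alpha(I1))  # 0.8
print(I1, I2, supp_count, conf)
```

Since support and confidence depend only on closures, any of the three equivalent rules yields the same numbers.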

Association Rules | Confidence
E→A               | 80%
B→E               | 83.33%

Table 3.8: Generating set with <100% confidence

By combining the generating set for rules with p = 1.0, shown in Table 3.7, with the generating set for rules with 1.0 > p ≥ 0.8, shown in Table 3.8, we obtain a generating set for all association rules with minsupp = 50% and minconf = 80%: {DE ⇒ A, A ⇒ E, E ⇒ B, D ⇒ B, C ⇒ B, E ⇒ A, B ⇒ E}.

It can easily be verified that all the association rules shown in Table 3.4 for the example database of Table 3.2 can be derived from this set. Using the closed itemset approach, seven rules are generated, whereas 22 rules are generated by traditional association mining. To see the contrast further, consider the set of all possible association rules that can be mined: with minsupp = 50%, the least possible value of confidence is 50%, and there are 60 possible association rules versus only 13 in the generating set.

Experiments

All experiments were performed on an Intel Core Duo 1.66 GHz PC with 1 GB of memory, running Windows XP and XLMiner. The experiments were applied to two datasets, Census-Income and Mushroom, obtained from the UCI KDD Machine Learning Repository. The Mushroom dataset contains 8125 rows and 23 attributes, and Census-Income contains 32562 rows and 15 attributes. The proposed model was applied to these datasets to identify the number of frequent itemsets, the number of frequent closed itemsets, and the number of non-redundant frequent itemsets.

The following association rules were generated with minsupp = 80% and minconf = 80%.

Association Rules                                                    | Supp  | Conf
gill-size = c and gill-spacing = f → ring-number = w                 | 81.26 | 100
ring-type = o and ring-number = w → gill-spacing = f                 | 89.71 | 100
gill-spacing = f → ring-number = w                                   | 97.41 | 99.90
ring-type = o and gill-spacing = f → ring-number = w                 | 89.80 | 99.89
ring-number = w → gill-spacing = f                                   | 97.54 | 99.77
gill-size = c and ring-number = w → gill-spacing = f                 | 81.49 | 99.73
ring-type = o → gill-spacing = f                                     | 92.17 | 97.44
ring-type = o → ring-number = w                                      | 92.17 | 97.33
gill-size = c → ring-number = w                                      | 83.85 | 97.18
gill-size = c → gill-spacing = f                                     | 83.85 | 96.92
gill-size = c and gill-spacing = f → ring-type = o                   | 81.27 | 95.00
gill-size = c and gill-spacing = f and ring-number = w → ring-type = o | 81.27 | 95.00
gill-size = c → ring-type = o                                        | 83.85 | 94.89
gill-size = c and ring-number = w → ring-type = o                    | 81.49 | 94.74
gill-spacing = f → ring-type = o                                     | 97.42 | 92.19
gill-spacing = f and ring-number = w → ring-type = o                 | 97.32 | 92.18
ring-number = w → ring-type = o                                      | 97.54 | 91.98
ring-type = o → gill-size = c                                        | 92.17 | 86.33
ring-type = o and ring-number = w → gill-size = c                    | 89.71 | 86.06
ring-type = o and gill-spacing = f and ring-number = w → gill-size = c | 89.71 | 86.06
ring-type = o and gill-spacing = f → gill-size = c                   | 89.81 | 85.97
ring-number = w → gill-size = c                                      | 97.54 | 83.54
gill-spacing = f and ring-number = w → gill-size = c                 | 97.32 | 83.51
gill-spacing = f → gill-size = c                                     | 97.42 | 83.42

Table 3.9: Association Rules for Mushroom Dataset using Frequent Itemsets.

Association Rules                                                    | Supp  | Conf
gill-size = c and gill-spacing = f → ring-number = w                 | 81.27 | 100
ring-type = o and ring-number = w → gill-spacing = f                 | 89.71 | 100
ring-type = o and gill-spacing = f → ring-number = w                 | 89.81 | 99.89
ring-number = w → gill-spacing = f                                   | 97.54 | 99.70
gill-size = c and ring-number = w → gill-spacing = f                 | 81.49 | 99.70
gill-size = c and gill-spacing = f and ring-number = w → ring-type = o | 81.27 | 95.00
ring-type = o → gill-size = c                                        | 92.17 | 86.00
ring-type = o and gill-spacing = f and ring-number = w → gill-size = c | 89.7  | 86.00

Table 3.10: Association Rules for Mushroom Dataset using Frequent Closed Itemsets

Association Rules                                                    | Supp  | Conf
gill-size = c and gill-spacing = f → ring-number = w                 | 81.27 | 100
ring-type = o and ring-number = w → gill-spacing = f                 | 89.71 | 100
ring-type = o and gill-spacing = f → ring-number = w                 | 89.81 | 99.89
gill-size = c and ring-number = w → gill-spacing = f                 | 81.49 | 99.7
gill-size = c and gill-spacing = f and ring-number = w → ring-type = o | 81.27 | 95.00
ring-type = o and gill-spacing = f and ring-number = w → gill-size = c | 89.71 | 86

Table 3.11: Association Rules for Mushroom Dataset using Non-Redundant Frequent Closed Itemsets

The number of rules generated for Mushroom and Census-Income datasets on different minsupp values 70% and 80 % are described in Table 3.12.

Dataset       | Minsupp=80%: F | FCI | NR | Minsupp=70%: F | FCI | NR
Mushroom      | 25             | 8   | 6  | 29             | 10  | 7
Census-income | 12             | 10  | 8  | 24             | 18  | 15

Table 3.12: Number of Frequent (F), Frequent Closed (FCI), and Non-Redundant (NR) Rules

Figure 3.3 Rules for Frequent, Closed, Redundant Rules for Mushroom & Census-Income Datasets

With the help of the above figure, we can say that:

the number of redundant rules gradually reduces across the different support values;

minsupp = 80% generates far fewer rules than minsupp = 70%;

the numbers of frequent closed itemsets and non-redundant itemsets are almost the same for both datasets, which indicates that as the number of closed itemsets increases, the number of redundant itemsets also increases;

for the Mushroom dataset, a larger number of redundant rules is removed than for the Census-Income dataset.

The above experiments were performed with different minsupp values; the proposed work also considers experiments with different minconf values, which are described in Table 3.13.

minConf | Census-Income: FCI | Census-Income: Non-Redundant | Mushroom: FCI | Mushroom: Non-Redundant
0.1 | 191 | 38 | 1709 | 427
0.5 | 32 | 12 | 331 | 89
0.9 | 4 | 3 | 56 | 24

Table 3.13: The Generated Non-Redundant Rules

Figure 3.4: The Generated Non-Redundant Rules for the Mushroom and Census-Income Datasets

The above table and figure give the analysis performed on the Census-Income and Mushroom datasets at different minconf values; here minsupp is taken as 0.70.

Data analysis using Census-Income Dataset

At minconf = 0.1, there are more closed itemsets and non-redundant rules than at minconf = 0.9.

At minconf = 0.9, there are very few closed itemsets and non-redundant rules, even though the dataset contains a large number of records. Thus, at minconf = 0.9 the proposed work derives very few rules, owing to the loss of information.

As minconf increases from 0.1 to 0.5, the number of frequent closed itemsets decreases.

From confidence 0.5 to 0.9, the numbers of frequent closed itemsets and non-redundant rules remain almost constant.

The confidence value 0.5 is a midpoint: below minconf = 0.5 the numbers of itemsets and rules are higher, and above 0.5 the numbers of itemsets and non-redundant rules are lower.

Data analysis using Mushroom Dataset

At minconf = 0.1, this dataset generates a large number of frequent closed itemsets, but far fewer non-redundant rules at the same confidence level. Thus, at this confidence there are many redundant rules, and these are eliminated by the proposed FCA-based work.

The ratio of closed itemsets to non-redundant rules from minconf = 0.1 to 0.5 is 0.92; similarly, from minconf = 0.5 to 0.9 it is 0.62. Thus the range minconf = 0.1 to 0.5 generates 30% more closed itemsets and non-redundant rules than the range 0.5 to 0.9, and therefore also more redundant rules, which are eliminated with the concept of FCA.

For both datasets, minconf = 0.5 is the midpoint. Below this midpoint, both datasets generate more redundant rules, which are eliminated by the proposed work; above the midpoint there are very few redundant rules, and these too are eliminated by the proposed work, owing to the loss of information.

Summary

Traditional association rule mining produces too many rules, most of which are redundant. In the proposed work, an FCA framework based on closed frequent itemsets is developed to reduce the rule set and to obtain strong non-redundant association rules from the concept lattice. The quality of the generated non-redundant rules is assessed with the Certainty Factor. Since the generated rules involve a loss of information, the work is extended with the Min-Max Representation to obtain more accurate rules.
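The Summary refers to assessing rule quality with the Certainty Factor. A widely used formulation (the chapter's exact variant may differ) scores a rule X → Y by how much its confidence raises or lowers the prior belief in Y; the numeric inputs below are illustrative.

```python
# Certainty Factor of a rule X -> Y (a common formulation; the
# chapter's exact variant may differ). Values lie in [-1, 1]:
# positive when the rule raises belief in Y, negative when it lowers it.

def certainty_factor(conf_xy, supp_y):
    if conf_xy > supp_y:
        return (conf_xy - supp_y) / (1 - supp_y)
    if conf_xy < supp_y:
        return (conf_xy - supp_y) / supp_y
    return 0.0

# Illustrative values (not from the tables): conf = 0.95, supp(Y) = 0.80
print(round(certainty_factor(0.95, 0.80), 2))  # -> 0.75
```

A rule whose confidence merely equals the marginal support of its consequent gets a Certainty Factor of 0, which is why this measure is useful for filtering rules that look strong on confidence alone.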

Note: The following research papers related to this chapter have been published:

R. Vijaya Prakash, Dr. A Govardhan, Prof. SSVN. Sarma, "Mining Non-Redundant Frequent Pattern in Taxonomy Datasets using Concept Lattices", International Journal of Applied Information Systems (IJAIS), Foundation of Computer Science FCS, New York, USA, Volume 3, No. 9, August 2012, ISSN: 2249-0868, www.ijais.org.

R. Vijaya Prakash, Dr. A Govardhan, Prof. SSVN. Sarma, "Eliminating Redundant Frequent Patterns in Non-Taxonomy Datasets", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 2, Issue 2, ISSN: 2277-128X.


