Improved Algorithms For Mining Sequential Patterns


02 Nov 2017


College of Computer and Information Systems

Jazan University, Kingdom of Saudi Arabia


ABSTRACT

Sequential pattern mining is the process of applying data mining techniques to a sequential database in order to discover the correlations that exist among an ordered list of events. The discovered patterns are useful in the retail industry, for example in attached mailing, add-on sales and customer satisfaction analysis. In this paper, we present a fast and efficient algorithm called AprioriAllSID for mining sequential patterns, which is fundamentally different from known algorithms such as AprioriAll. The algorithm has been implemented on an experimental basis and its performance studied. The performance study shows that the proposed algorithm outperforms AprioriAll on all the data sets tested.

Keywords: Data Mining, Sequential Pattern Mining, AprioriAllSID algorithm, GSPSID algorithm, Data Sequence.

1 INTRODUCTION

Data mining, also known as knowledge discovery in databases, has attracted a lot of attention. Because of progress in data collection tools, large amounts of transaction data have been generated, but such data are often archived and not used efficiently [4]. Data mining is the discovery of useful information, such as rules and previously unknown patterns among data items embedded in large databases, which allows more effective use of existing data.

The problem of mining sequential patterns in a large database of customer transactions was introduced in [2]. A transaction record typically consists of a customer ID, a transaction ID, a transaction time and the items bought in that transaction. By analyzing such customer transaction data, we can extract sequential patterns such as "10% of customers who buy both A and B also buy C in the next transaction".

Several algorithms have been proposed to find sequential patterns [2] [22]. An algorithm for finding all sequential patterns, named AprioriAll, was presented in [2]. First, AprioriAll discovers all sets of items (itemsets) that meet a user-defined minimum support (large itemsets), where the support is the percentage of customer transactions that contain the itemset. Second, the database is transformed by replacing the itemsets in each transaction with the set of all large itemsets contained in that transaction. Last, it finds the sequential patterns. Transforming the database is costly. In [25], a graph-based algorithm called DSG (Direct Sequential Patterns Generation) was presented. DSG scans the database once to construct an association graph that records the associations between items, and then generates the sequential patterns by traversing the graph. Although the disk I/O cost of DSG is very low, the related information may not fit in memory when the database is large.

In [22], the GSP (Generalized Sequential Patterns) algorithm, which discovers generalized sequential patterns, was proposed. GSP finds all the frequent sequences without transforming the database. In addition, some generalized definitions of sequential patterns are introduced in [2] [3]. First, time constraints are introduced: users often want to specify a maximum or minimum time period between adjacent elements. Second, a flexible definition of a customer transaction is introduced, which allows a user-defined window size within which the items can be present. Third, given a user-defined taxonomy (is-a hierarchy) over the data items, generalized sequential patterns, which include items spanning different levels of the taxonomy, are introduced. All the previous algorithms for discovering sequential patterns are serial algorithms. Finding sequential patterns involves handling a large amount of customer transaction data and requires multiple passes over the database, which leads to long computation times. We therefore introduce an efficient algorithm for discovering sequential patterns in a large collection of sequenced data.

In this paper, we consider a new algorithm for mining sequential patterns in a sequential environment. All the earlier algorithms make multiple passes over the data, whereas the proposed algorithm reads the original database only in the first pass and introduces a new temporary database D' for the subsequent iterations. After completing the first iteration, we find the candidate sequences of size 2 using the temporary database D'. We then find the candidate sequences of size k until either the set of candidate sequences or the temporary database is empty. At each stage, both the size of the temporary database and the number of candidate sequences are reduced, which lowers the time needed to find the sequential patterns. The proposed method is therefore more efficient than methods such as AprioriAll and Generalized Sequential Patterns (GSP).

The rest of this paper is organized as follows. Section 2 describes the problem of mining sequential patterns and related work. In Section 3, we propose an efficient algorithm, AprioriAllSID, for discovering sequential patterns. A relative performance study is given in Section 4. Section 5 concludes the paper.

2 SEQUENTIAL PATTERN MINING

The problem of sequential pattern mining is one of several that have received particular attention from the data mining community. The sequential nature of the problem appears when the data to be mined is naturally embedded in a one-dimensional space, i.e., when one of the relevant pieces of information can be viewed as an ordered set of elements. This variable can be time or some other dimension, as is common in other areas such as bioinformatics. Sequential pattern mining is defined as the process of discovering all subsequences that appear frequently in a given sequence database. The challenge lies in deciding which sequences to try and then efficiently determining which of them are frequent [Srikant 1996].

One obvious application is modeling the behavior of some entity over time. For instance, given a database of the transactions performed by customers at each instant, it is desirable to predict a customer's next transaction based on his or her past transactions. Examples of such tasks are the prediction of financial time series and the monitoring of patients' health.

2.1 Statement of the Problem

Sequential pattern mining algorithms address the problem of discovering the maximal frequent sequences that exist in a given database. Algorithms for this problem are relevant when the data to be mined has a sequential nature, i.e., when each piece of data is an ordered set of elements, such as events in the case of temporal information, or nucleotide and amino-acid sequences in bioinformatics.

The problem was first introduced by Agrawal and Srikant, who established the basic concepts involved in pattern detection [Agrawal 1995]. In the last two decades, several sequential pattern mining algorithms have been proposed, but not all of them assume the same conditions.

Some basic definitions are needed, in order to formally introduce the problem:

Definition 1: An itemset is a non-empty subset of a set C, the item collection, whose elements are called items.

An itemset, also called a basket, represents a set of items that occur together. If the data is time dependent, an itemset corresponds to the set of items transacted at a particular instant by a particular entity. The itemset composed of items a and b is denoted by (ab).

Definition 2: A sequence is an ordered list of itemsets. A sequence is maximal if it is not contained in any other sequence.

A sequence with k items is called a k-sequence. The number of elements (itemsets) in a sequence s is the length of the sequence and is denoted by |s|. The ith itemset in the sequence is denoted by $s_i$, and <> denotes the empty sequence. The concatenation of two sequences x and y is a new sequence denoted by xy. The set of sequences under consideration is usually called the database (DB), and the number of sequences it contains is the database size (|DB|).

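Definition 3: A sequence $a = \langle a_1, a_2, \dots, a_n \rangle$ is contained in another sequence $b = \langle b_1, b_2, \dots, b_m \rangle$, or a is a subsequence of b, if there exist integers $1 \le i_1 < i_2 < \dots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \dots, a_n \subseteq b_{i_n}$.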

A subsequence s' of s is denoted by $s' \subseteq s$, and by $s' \subset s$ if s' is a proper subsequence of s, i.e., if s' is a subsequence of s but is not equal to s. It is usual to assume that the items in an itemset of a sequence are in lexicographic order.
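The containment test of Definition 3 can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `is_subsequence` and the representation of a sequence as a list of sets are assumptions made here.

```python
def is_subsequence(a, b):
    """Return True if sequence a is contained in sequence b (Definition 3).

    Both arguments are lists of sets (itemsets). Each itemset of a must be
    a subset of a distinct itemset of b, with the order preserved.
    """
    pos = 0  # index of the next itemset of a that still has to be matched
    for itemset in b:
        # Greedy matching: taking the earliest possible match never hurts.
        if pos < len(a) and a[pos] <= itemset:
            pos += 1
    return pos == len(a)


# Hypothetical example data.
b = [{7}, {3, 8}, {9}, {4, 5, 6}]
print(is_subsequence([{3}, {4, 5}], b))   # True
print(is_subsequence([{3}, {5}], [{3, 5}]))  # False: 3 and 5 share one itemset
```

The second call shows why order matters: two items bought in the same transaction do not form a two-element sequence.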

This assumption simplifies the design of sequential pattern mining algorithms, since it avoids repeating some operations (such as generating the same sequence more than once). Under this assumption, prefixes and suffixes have specific meanings. They are special cases of subsequences: the prefix of a sequence is the sequence without the last element of its last itemset, and the suffix is the sequence without the first element of its first itemset. Finally, the sequential pattern mining problem can be stated in full.

Definition 4: Given a database D of sequences, a user-specified minimum support threshold $\sigma$ and a constraint c, a sequence is frequent if it is contained in at least $\sigma$ sequences in the database and satisfies the constraint c. A sequential pattern is a maximal sequence that is frequent.

The problem of finding sequential patterns can be decomposed into two parts:

i) Generate all combinations of customer sequences with fractional sequence support (i.e., $\mathrm{support}_D(C) / |D|$) above a certain threshold called the minimum support m.

ii) Use the frequent sequences to generate sequential patterns.

The second subproblem is straightforward. However, discovering frequent sequences is a non-trivial issue, and the efficiency of an algorithm depends strongly on the size of the set of candidate sequences.
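The first subproblem amounts to counting, for each candidate, how many database sequences contain it and comparing the fraction with the minimum support m. The sketch below illustrates this under stated assumptions: the names `contains`, `support` and `frequent_sequences` and the toy database are hypothetical, and sequences are represented as lists of sets.

```python
def contains(sequence, candidate):
    """True if candidate is a subsequence of sequence (lists of sets)."""
    pos = 0
    for itemset in sequence:
        if pos < len(candidate) and candidate[pos] <= itemset:
            pos += 1
    return pos == len(candidate)


def support(candidate, database):
    """Number of database sequences that contain the candidate."""
    return sum(contains(seq, candidate) for seq in database)


def frequent_sequences(candidates, database, min_support):
    """Keep candidates whose fractional support is at least min_support."""
    size = len(database)
    return [c for c in candidates if support(c, database) / size >= min_support]


# Hypothetical toy database of four customer sequences.
database = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"c"}, {"d"}],
    [{"a"}, {"d"}],
]
candidates = [[{"a"}, {"c"}], [{"b", "c"}]]
print(frequent_sequences(candidates, database, 0.5))
```

Every call to `support` scans the whole database, which is why the number of candidates dominates the running time.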

2.2 Related works

There are two main approaches to the sequential pattern mining problem: apriori-based and pattern-growth methods, with GSP [Srikant 1996] and PrefixSpan [Pei 2001] as their best-known implementations, respectively. Although there are several implementations of apriori-based methods, most of them assume specific conditions (for example, that the entire dataset fits in memory; see [Ayres 2002]), which allow for significant improvements. Although GSP considers time constraints and uses taxonomies, if no taxonomy is provided and no time constraints are set, both algorithms have the same goal: to discover sequential patterns without considering any constraints and without any restriction on database size.

GSP [Srikant 1996] follows the candidate generation-and-test philosophy. It begins by discovering the frequent 1-sequences, and then generates the set of potentially frequent (k+1)-sequences (usually called candidates) from the set of frequent k-sequences. Because the generation of k-candidates uses only the frequent (k-1)-sequences discovered in the previous step, the number of sequences to consider at each step may be reduced significantly. Note that to decide whether a sequence s is frequent, it is necessary to scan the entire database, checking whether s is contained in each sequence in the database.

To reduce its processing time, GSP adopts three optimizations. First, it maintains all candidates in a hash tree, so that the database is scanned only once per iteration. Second, it creates a new k-candidate only when there are two frequent (k-1)-sequences such that the prefix of one is equal to the suffix of the other. Third, it eliminates all candidates that have some non-frequent maximal subsequence. With these strategies, GSP reduces the time spent scanning the database, which improves its overall performance.

In general, apriori-based methods can be seen as breadth-first traversal algorithms, since they construct all k-patterns simultaneously. Consider a database composed of sequences with items belonging to the set $\Sigma = \{a, b\}$. If all possible arrangements of these two elements are frequent, GSP would proceed as follows. At the end of the first iteration, GSP would have found a and b as frequent 1-sequences. At the end of the second iteration, it would have found {aa, (ab), ab, ba, bb}. Finally, at the end of the third iteration, GSP would have found all arrangements of the 2 items taken 3 at a time, including the basket (ab): {aaa, a(ab), aab, (ab)a, (ab)b, aba, abb, baa, b(ab), bab, bba, bbb}.
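The levels of this breadth-first search can be checked by enumerating every k-sequence over a small alphabet. The sketch below is only an enumeration of the search space, not GSP itself; the helper names `k_sequences` and `show` are assumptions introduced here.

```python
from itertools import combinations


def k_sequences(alphabet, k):
    """All sequences with exactly k items over the alphabet.

    A sequence is a tuple of itemsets; an itemset is a sorted tuple of
    distinct items.
    """
    if k == 0:
        return [()]
    items = sorted(alphabet)
    result = []
    # Choose the first itemset, then recurse on the remaining item count.
    for size in range(1, min(k, len(items)) + 1):
        for first in combinations(items, size):
            for rest in k_sequences(alphabet, k - size):
                result.append((first,) + rest)
    return result


def show(sequence):
    """Render a sequence in the paper's notation, e.g. a(ab)."""
    return "".join(
        e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in sequence
    )


for k in (1, 2, 3):
    print(k, sorted(show(s) for s in k_sequences("ab", k)))
```

For the alphabet {a, b} this yields 2, 5 and 12 sequences at levels 1, 2 and 3, matching the sets listed above.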

Pattern-growth methods are a more recent approach to sequential pattern mining problems. The key idea is to avoid the candidate generation step altogether and to focus the search on a restricted portion of the initial database.

PrefixSpan [Pei 2001] is the most promising of the pattern-growth methods. It recursively constructs the patterns while restricting the search to projected databases. An $\alpha$-projected database is the set of subsequences in the database that are suffixes of the sequences having prefix $\alpha$. In each step, the algorithm looks for the frequent sequences with prefix $\alpha$ in the corresponding projected database. In this way, the search space is reduced at each step, which allows for better performance at small support thresholds. In the presence of gap constraints, the algorithm has to be adapted, and this claim no longer holds. However, as has been shown, the adapted version of PrefixSpan remains faster than GSP [Antunes 2003].

In general, pattern-growth methods can be seen as depth-first traversal algorithms, since they construct each pattern separately, in a recursive way. For the same database as before, PrefixSpan would proceed as follows:

{ } → { a, b } → { aa, (ab), ab } → { aaa, a(ab), aab }
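The projection idea can be sketched as follows. This is a deliberately simplified illustration, assuming every element of a sequence is a single item (so baskets such as (ab) are not handled); the function name `prefixspan` and the toy data are hypothetical and not taken from [Pei 2001].

```python
from collections import Counter


def prefixspan(db, min_support, prefix=()):
    """Yield (pattern, support) pairs by depth-first pattern growth.

    db is a list of tuples of single items. Simplification: itemsets with
    more than one item are not supported in this sketch.
    """
    counts = Counter()
    for seq in db:
        counts.update(set(seq))  # count each sequence at most once per item
    for item, support in sorted(counts.items()):
        if support < min_support:
            continue
        pattern = prefix + (item,)
        yield pattern, support
        # Projected database: the suffix after the first occurrence of item.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        yield from prefixspan(projected, min_support, pattern)


# Hypothetical toy database.
db = [("a", "b", "c"), ("a", "c"), ("b", "c")]
for pattern, support in prefixspan(db, 2):
    print("".join(pattern), support)
```

Each recursive call works on a smaller database than its parent, which is the source of the reduced search space described above.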

The main drawbacks of pattern mining in general, and of sequential pattern mining in particular, are the large number of discovered patterns, the inability to use background knowledge, and the lack of focus on user expectations. The use of constraints was proposed to address these problems.

A constraint is a predicate on the power set of the set of items I, i.e., $C: 2^I \to \{\text{true}, \text{false}\}$. Given a transaction database, a support threshold and a constraint C, the problem of constrained frequent pattern mining is to find the complete set of frequent itemsets satisfying C [Pei 2002a].
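In code, a constraint of this kind is simply a function from itemsets to booleans that is applied alongside the support test. The sketch below assumes the supports are already known; the names `constrained_frequent` and `at_most_two_items` and the support values are hypothetical.

```python
def constrained_frequent(supports, min_support, constraint):
    """Keep the itemsets that are frequent and satisfy the constraint."""
    return {
        itemset: count
        for itemset, count in supports.items()
        if count >= min_support and constraint(itemset)
    }


def at_most_two_items(itemset):
    """Example constraint C: accept only itemsets with at most two items."""
    return len(itemset) <= 2


# Hypothetical support counts for a few itemsets.
supports = {
    frozenset({"a"}): 5,
    frozenset({"a", "b"}): 3,
    frozenset({"b", "c"}): 1,
    frozenset({"a", "b", "c"}): 2,
}
result = constrained_frequent(supports, 2, at_most_two_items)
print(sorted(sorted(s) for s in result))
```

Applying the predicate before counting support, where the constraint allows it, is what shrinks the search space.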

This approach has been widely accepted by the data mining community, since it allows the user to control the mining process, either by introducing background knowledge deeply into the process or by narrowing the scope of the discovered patterns. The use of constraints also reduces the search space, which contributes significantly to better performance and scalability ([Srikant 1996], [Pei 2002a] and [Garofalakis 1999]).

3 AprioriAllSID

In this section, we describe the AprioriAllSID algorithm, which is based on [2].

3.1 Description

The AprioriAllSID algorithm is shown in Figure 1. Its distinguishing feature is that the given customer transaction database D is not used for counting support after the first pass. Instead, the set C'k-1, which is available before pass k begins, is used to determine the candidate sequences contained in each customer sequence. Each member of the set C'k is of the form <SID, {Sk}>, where each Sk is a potentially frequent k-sequence present in the sequence with identifier SID. For k = 1, C'1 corresponds to the database D, although conceptually each item i is replaced by the sequence {i}. For k > 1, the member of C'k corresponding to customer sequence s is <s.SID, {c ∈ Ck | c contained in s}>. If a customer sequence does not contain any candidate k-sequence, then C'k will not have an entry for this customer sequence.

Thus, the number of entries in C'k may be smaller than the number of sequences in the database, especially for large values of k. In addition, for large values of k, each entry may be smaller than the corresponding sequence, because very few candidate sequences may be contained in the sequence. However, for small values of k, each entry may be larger than the corresponding sequence, because an entry in C'k includes all candidate k-sequences contained in the sequence.

3.2 Algorithm AprioriAllSID

In Figure 1, we present an efficient algorithm called AprioriAllSID, which discovers all sequential patterns in a large customer database.

Algorithm AprioriAllSID

1.  L1 = {large size-1 sequences}; // result of L-itemset phase
2.  C'1 = database D;
3.  for (k = 2; Lk-1 ≠ ∅; k++) do begin
4.      Ck = new candidate sequences generated from Lk-1;
5.      C'k = ∅;
6.      for all entries s ∈ C'k-1 do begin
7.          // determine candidate sequences in Ck contained in the sequence with identifier s.SID
            Ct = {c ∈ Ck | (c - c[k]) ∈ s.set-of-sequences ∧ (c - c[k-1]) ∈ s.set-of-sequences};
8.          for all candidate sequences c ∈ Ct do
9.              increment the count of c;
10.         if (Ct ≠ ∅) then C'k = C'k + <s.SID, Ct>;
11.     end;
12.     Lk = candidate sequences in Ck with minimum support;
13. end;
14. Answer = ∪k Lk;

Figure 1: Algorithm AprioriAllSID
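The control flow of Figure 1 can be sketched in Python as follows. This is an illustrative sketch under a strong simplifying assumption, not the authors' implementation: each candidate is a sorted tuple of items and each customer sequence is reduced to the set of items it contains, following the itemset-style join of Figure 2, so the order of items within a customer sequence is ignored. The names `apriori_all_sid`, `candidate_gen` and `c_bar` (standing for C'k) and the toy database are hypothetical.

```python
from itertools import combinations


def candidate_gen(frequent_prev):
    """Join (k-1)-tuples that share their first k-2 items, then prune."""
    out = []
    for l1 in sorted(frequent_prev):
        for l2 in sorted(frequent_prev):
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if all(s in frequent_prev for s in combinations(c, len(c) - 1)):
                    out.append(c)
    return out


def apriori_all_sid(database, min_support):
    """database maps SID -> iterable of items; returns {candidate: support}."""
    # Steps 1-2: the only pass over the original database builds C'1 and L1.
    c_bar = {sid: {(i,) for i in items} for sid, items in database.items()}
    counts = {}
    for present in c_bar.values():
        for c in present:
            counts[c] = counts.get(c, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_support}
    answer = dict(frequent)
    while frequent:  # step 3
        candidates = candidate_gen(set(frequent))  # step 4
        counts = dict.fromkeys(candidates, 0)
        next_c_bar = {}  # step 5
        for sid, present in c_bar.items():  # step 6
            # Step 7: c is contained if both generating (k-1)-tuples are.
            c_t = [c for c in candidates
                   if c[:-1] in present and c[:-2] + c[-1:] in present]
            for c in c_t:  # steps 8-9
                counts[c] += 1
            if c_t:  # step 10
                next_c_bar[sid] = set(c_t)
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        answer.update(frequent)  # steps 12 and 14
        c_bar = next_c_bar
    return answer


# Hypothetical toy database.
toy = {1: [1, 2, 3], 2: [1, 2], 3: [1, 2, 3], 4: [3]}
print(apriori_all_sid(toy, 2))
```

After the first pass the loop reads only `c_bar`, which loses an entry whenever a customer sequence contains no candidate (SID 4 in the toy data disappears after the second pass).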

Procedure Candidate-gen (Lk-1: frequent (k-1)-itemsets; min-sup: minimum support)

1.  for each itemset l1 ∈ Lk-1
2.      for each itemset l2 ∈ Lk-1
3.          if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then
4.              c = l1 join l2; // join step: generate candidate sets
5.              if has-infrequent-subset (c, Lk-1) then
6.                  delete c; // prune step: remove infrequent candidate sets
7.              else add c to Ck;
8.          end if;
9.  return Ck;

Figure 2: Procedure Candidate-gen

Procedure Has-infrequent-subset (c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)

1.  for each (k-1)-subset s of c
2.      if s ∉ Lk-1 then
3.          return true;
4.  return false;

Figure 3: Procedure Has-infrequent-subset
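The join and prune steps of Figures 2 and 3 can be sketched in Python as below. The sketch assumes candidates are represented as sorted tuples of items; the function names mirror the procedures but the code is an illustration, not the authors' implementation. The input is the L2 column of the example in Figure 4.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the candidate is not frequent."""
    k = len(candidate)
    return any(s not in frequent_prev for s in combinations(candidate, k - 1))


def candidate_gen(frequent_prev):
    """Generate Ck from Lk-1 (a set of sorted (k-1)-tuples)."""
    candidates = []
    for l1 in sorted(frequent_prev):
        for l2 in sorted(frequent_prev):
            # Join step: same first k-2 items, last item of l1 smaller.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune step: drop c if any (k-1)-subset is infrequent.
                if not has_infrequent_subset(c, frequent_prev):
                    candidates.append(c)
    return candidates


# L2 as listed in Figure 4.
l2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
print(candidate_gen(l2))
```

The join of (1, 2) and (1, 5) produces (1, 2, 5), which is pruned because (2, 5) is not in L2; the seven surviving candidates are the C3 column of Figure 4.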

Example: Consider the database in Figure 4 and assume that the minimum support is 2 customer sequences. Applying the Candidate-gen procedure of Figure 2 to the frequent sequences of size 1 gives the candidate sequences in C2. Iterating over the entries in C'1 then generates C'2 in steps 6 to 11 of Figure 1. The first entry in C'1 is {(1) (5)} {2} {3} {4}, corresponding to customer sequence 10. The set Ct at step 7 corresponding to this entry s consists of the candidates in C2 whose two generating 1-sequences are both members of s.set-of-sequences.

Applying Candidate-gen to L2 gives C3. Making a pass over C'2 with C3 generates C'3. This process is repeated until no candidate sequences remain or C'k is empty.

Customer Database

TID   Sequence
10    {1 5} {2} {3} {4}
20    {1} {3} {4} {3 5}
30    {1} {2} {3} {4}
40    {1} {3} {5}
50    {4} {5}

C'1

TID   Set-of-sequences
10    {(1) (5)} {2} {3} {4}
20    {1} {3} {4} {(3) (5)}
30    {1} {2} {3} {4}
40    {1} {3} {5}
50    {4} {5}

L1

Sequence   Support
{1}        4
{2}        2
{3}        4
{4}        4
{5}        4

C2

Itemset
{1 2} {1 3} {1 4} {1 5} {2 3} {2 4} {2 5} {3 4} {3 5} {4 5}

C'2

TID   Set-of-sequences
10    {1 2} {1 3} {1 4} {1 5} {2 3} {2 4} {2 5} {3 4} {3 5} {4 5}
20    {1 3} {1 4} {1 5} {(3 4) (3 5) (4 5)}
30    {1 2} {1 4} {2 3} {2 4} {3 4}
40    {1 3} {1 5} {3 5}
50    {4 5}

L2

Sequence   Support
{1 2}      2
{1 3}      4
{1 4}      3
{1 5}      3
{2 3}      2
{2 4}      2
{3 4}      3
{3 5}      2
{4 5}      2

C3

Itemset
{1 2 3} {1 2 4} {1 3 4} {1 3 5} {1 4 5} {2 3 4} {3 4 5}

C'3

TID   Set-of-sequences
10    {1 2 3} {1 2 4} {1 3 5} {1 4 5} {2 3 4}
20    {1 3 4} {1 3 5} {1 4 5} {3 4 5}
30    {1 2 3} {1 2 4} {1 3 4} {2 3 4}
40    {1 3 5}

L3

Sequence   Support
{1 2 3}    2
{1 2 4}    2
{1 3 4}    3
{1 3 5}    3
{1 4 5}    2
{2 3 4}    2

C4

Itemset
{1 2 3 4}

C'4

TID   Set-of-sequences
10    {1 2 3 4} {1 2 3 5}
30    {1 2 3 4}

L4

Sequence    Support
{1 2 3 4}   2

Figure 4: Example

Lemma 1: For all k > 1, if C'k-1 (the set of candidate (k-1)-sequences kept together with the SIDs of the customer sequences that generate them) is correct and complete, and the set Lk-1 of frequent (k-1)-sequences is correct, then the set Ct generated in step 7 in the kth pass is the same as the set of candidate k-sequences in Ck contained in the customer sequence with identifier s.SID.

A candidate sequence c = c[1] c[2] ... c[k] is present in the customer sequence s.SID if and only if both c1 = (c - c[k]) and c2 = (c - c[k-1]) are in the customer sequence s.SID. Since Ck was obtained by calling Candidate-gen (Lk-1), all subsequences of c ∈ Ck must be frequent. Hence, c1 and c2 must be frequent sequences. Thus, if a candidate sequence c ∈ Ck is contained in the customer sequence s.SID, c1 and c2 must be members of s.set-of-sequences, since C'k-1 is complete, and c will be a member of Ct. Conversely, since C'k-1 is correct, if c1 or c2 is not present in the customer sequence s.SID, it is not contained in s.set-of-sequences. Hence, if c ∈ Ck is not contained in the customer sequence s.SID, c will not be a member of Ct.

4 PERFORMANCE EVALUATION

In this section, we describe the experiments and the performance results of the AprioriAllSID algorithm, and compare its performance with that of the AprioriAll algorithm. The experiments were run on a Pentium IV 2.8 GHz computer with 1 GB of RAM running Windows 7, and the algorithms were implemented in Java 2.0. The data sets were kept in main memory while the algorithms ran, avoiding hard disk accesses.

Using a data set generator, we simulated the data and tested the two algorithms, AprioriAll and AprioriAllSID, on it. The data sets are intended to simulate customer buying patterns in a retail environment, as used in [20].

In the performance comparison, we used five different data sets. Table 1 shows the execution times of AprioriAll and AprioriAllSID for minimum supports from 1% to 5% and for different volumes of data. Although the two algorithms may appear close on these data sets, we expect AprioriAllSID to perform far better than AprioriAll on massive volumes of data.

DB      AprioriAll (Execution Time in Seconds)   AprioriAllSID (Execution Time in Seconds)
Size    1%     2%     3%     4%     5%           1%     2%     3%     4%     5%
100K    187    199    211    228    245          98     121    148    164    183
200K    325    339    351    367    384          174    192    221    246    265
300K    428    447    465    489    510          269    281    301    324    356
400K    559    587    611    638    669          346    371    392    415    458
500K    678    691    726    758    793          489    514    561    592    636

Table 1: Performance evaluation between AprioriAll and AprioriAllSID algorithms

The running times of both AprioriAllSID and AprioriAll increase as the minimum support is decreased, because the total number of candidate sequences increases. The AprioriAll algorithm in [2] makes multiple passes over the data, so its execution time grows with the number of customer transactions in the database. From Table 1, AprioriAllSID is almost twice as fast as AprioriAll on the smallest data set (98 s versus 187 s for 100K at 1% minimum support) and about 1.4 times as fast on the largest (489 s versus 678 s for 500K at 1% minimum support). For data sets ranging from gigabytes to terabytes, we expect the proposed algorithm to remain much faster than AprioriAll. We therefore conclude that the proposed algorithm is well suited to massive databases.
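The speedup of AprioriAllSID over AprioriAll can be computed directly from Table 1. The short script below is a convenience sketch: the execution times are copied from Table 1, while the variable and function names are assumptions made here.

```python
# Execution times in seconds from Table 1, for minimum support 1% to 5%.
APRIORI_ALL = {
    "100K": [187, 199, 211, 228, 245],
    "200K": [325, 339, 351, 367, 384],
    "300K": [428, 447, 465, 489, 510],
    "400K": [559, 587, 611, 638, 669],
    "500K": [678, 691, 726, 758, 793],
}
APRIORI_ALL_SID = {
    "100K": [98, 121, 148, 164, 183],
    "200K": [174, 192, 221, 246, 265],
    "300K": [269, 281, 301, 324, 356],
    "400K": [346, 371, 392, 415, 458],
    "500K": [489, 514, 561, 592, 636],
}


def speedups():
    """Ratio AprioriAll time / AprioriAllSID time for every table cell."""
    return {
        size: [a / b for a, b in zip(APRIORI_ALL[size], APRIORI_ALL_SID[size])]
        for size in APRIORI_ALL
    }


for size, row in speedups().items():
    print(size, " ".join(f"{x:.2f}" for x in row))
```

Every ratio is above 1, so AprioriAllSID is faster in each of the 25 configurations reported in Table 1.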

5 CONCLUSIONS

Sequential pattern mining is one of the mining techniques applied to predict customer behavior in areas such as basket analysis or log monitoring. We have presented an efficient algorithm, AprioriAllSID, for discovering sequential patterns in massive transaction databases. We have also compared the performance of AprioriAllSID and AprioriAll. The experimental results show that the proposed algorithm outperforms AprioriAll in every configuration tested. Hence, the proposed algorithm is well suited to massive databases.

ACKNOWLEDGEMENT

The authors express their gratitude to Dr. Omar Sayed Al-Mushayt, College Dean, and Dr. Saeed Q Al-Khalidi, Vice Dean, College of Computer Science and Information Systems, JAZAN University, Kingdom of Saudi Arabia, for their continuous encouragement to complete this research. Special thanks are also due to the University President, JAZAN University, Kingdom of Saudi Arabia, for inspiration and persistent support, direct and indirect, toward the completion of this research.


