Association Analysis of Sequence Data
A sequence is an ordered list of elements (transactions). For example, purchase history of a given customer, history of events generated by a given sensor, browsing activity of a particular Web visitor, and so on.
A sequence $s$ is defined as
\[s = <e_1~e_2~e_3~...>\]Where $e_i$ is the $i^{th}$ element, containing a collection of events (items) and attributed to a specific time or location.
Subsequence and Sequencial Pattern
We can say that a sequence $<a_1~a_2~…~a_n>$ is contained in another sequence $<b_1~b_2~…~b_m>$ $(m ≥ n)$ if
\[\exists i_1,i_2, ... \in N, i_1<i_2<i_3<...<i_n~~such~that \\ a_1 \subset b_{i_1}~,~a_2 \subset b_{i_2}~,...,~a_n \subset b_{i_n}~\]Defining sequence $<a_1~a_2~…~a_n>$ as a subsequence of sequence $<b_1~b_2~…~b_m>$, A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
Sequential Pattern Mining
Given a database of sequences and a user-specified minimum support threshold($minsup$), the goal id to find all subsequences with $support ≥ minsup$.
Generalized Sequential Pattern (GSP)
Generalized Sequential Pattern (GSP) is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the a priori (level-wise) algorithm. The algorithm follows the following steps:
- Make the first pass over the sequence database to yield all the $1-element$ frequent sequences.
- Candidate Generation - Merge pairs of frequent subsequences found in the $(k-1)^{th}$ pass to generate candidate $k-sequences$.
- Candidate Pruning - Prune candidate $k-sequences$ that contain infrequent $(k-1)-subsequences$
- Support Counting - Make a new pass over the sequence database to find the support for these candidate sequences
- Candidate Elimination - Eliminate candidate $k-sequences$ whose actual support is less than $minsup$
- If any frequent $k-sequences$ is found, increment $k$ and go back to step 2.
Note that we can merge sequences $w_1=<(1) (2~3) (4)>$ and $w_2 =<(2~3) (4) (5)>$ into $<(1) (2~3) (4) (5)>$. However, it is unneccesary to merge sequences $w_1=<(1) (2~6) (4)>$ and $w_2 =<(1) (2) (4~5)>$ into $<(1) (2~6) (4~5)>$ since $<(1) (2~6) (4~5)>$ is a viable candidate as it is still a $5-sequence$.
Mining Sequential Patterns with Timing Constraints
Other than preprocessing the sequence database or postprocessing the discovered patterns after mining sequential patterns without timing constraints, we can also modify GSP to directly prune candidates that violate timing constraints.
1. We need to modify Candidate Pruning step if there is max-gap constraint.
Because of max-gap constraint, Apriori principle might not hold. So Candidate Pruning step should be modified from
Prune candidate $k-sequences$ that contain $(k-1)-subsequences$ that are infrequent.
into
Prune candidate $k-sequences$ that contain contigious $(k-1)-subsequences$ that are infrequent.
$s$ is a contiguous subsequence of $w = <e_1~ e_2~…~e_k>$ if any of the following conditions holds:
- $s$ is obtained from $w$ by deleting an item from either $e_1$ or $e_k$
- $s$ is obtained from $w$ by deleting an item from any element $e_i$ that contains more than 2 items
- $s$ is a contiguous subsequence of $s’$ and $s’$ is a contiguous subsequence of $w$.
2. We need to modify Support Counting step if there is window-size constraint.
Because of window-size constraint, different events within window-size can be seen as the same event. So Support Counting step should be modified from
Given a candidate pattern $<(a~c)>$, any data sequence that contains $<…,(a~c),…>$ will contribute to the support count of candidate pattern.
into
Given a candidate pattern $<(a~c)>$, any data sequence that contains $<…,(a~c),…>$ $<…,(a),…,(c),…>$ where time({c}) – time({a}) ≤ ws $<…,(c),…,(a),…>$ where time({a}) – time({c}) ≤ ws will contribute to the support count of candidate pattern.