Data Mining - Preprocessing for Association Analysis
Before we can apply association analysis, we need to handle the following three issues:
- Categorical Attributes
- Continuous Attributes
- Multi-Level Concept Hierarchy
Handling Categorical Attributes
To apply the association analysis formulation, we need asymmetric binary variables. For example, the record
\[(Gender=Male, Browser=IE, Buy=No)\]
can be seen as 3 items in a transaction:
\[(A, B, C)\]
That is why we need to transform each categorical attribute into asymmetric binary variables by introducing a new "item" for each distinct attribute-value pair. For example, Browser=IE should be a different item than Browser=Chrome and Browser=Mozilla. So we should transform
| Browser | Buy |
|---|---|
| IE | Yes |
| Chrome | Yes |
| Mozilla | No |
into
| Browser=IE | Browser=Chrome | Browser=Mozilla | Buy |
|---|---|---|---|
| Yes | No | No | Yes |
| No | Yes | No | Yes |
| No | No | Yes | No |
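A minimal Python sketch of this transformation is shown below; the record values come from the table above, and the attribute=value strings are just one convenient way to name the new items.

```python
# Turn each distinct (attribute, value) pair into its own asymmetric binary item,
# so that every record becomes a transaction of items such as "Browser=IE".
records = [
    {"Browser": "IE", "Buy": "Yes"},
    {"Browser": "Chrome", "Buy": "Yes"},
    {"Browser": "Mozilla", "Buy": "No"},
]

transactions = [
    {f"{attr}={value}" for attr, value in record.items()}
    for record in records
]

for t in transactions:
    print(sorted(t))  # e.g. ['Browser=IE', 'Buy=Yes']
```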
Handling Continuous Attributes
When a dataset contains numerical attributes, the problem is to discretize each numerical attribute into interesting intervals, where each interval is represented as a Boolean attribute.
The size of the discretized intervals affects support and confidence.
- If the intervals are too small
  - the itemset may not have enough support
  - e.g. {Refund = No, (Income = 51,250)} $\rightarrow$ {Cheat = No}
- If the intervals are too large
  - the rule may not have enough confidence
  - e.g. {Refund = No, (0K $\leq$ Income $\leq$ 1B)} $\rightarrow$ {Cheat = No}
Discretization-based method
The equi-sized (equi-width) approach simply partitions the continuous domain into intervals of equal length. The equi-depth approach partitions the data values, in sorted order, into intervals that each contain roughly the same number of values. For more information, please refer to Handling Continuous Attributes with Discretization-based Methods.
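As a rough illustration of the two schemes (the income values and the choice of $k = 3$ intervals below are made up for the example):

```python
# Sketch of equi-width vs. equi-depth discretization on a toy value list.
incomes = [18, 22, 25, 30, 41, 52, 58, 90, 120, 250]
k = 3  # number of intervals

# Equi-width (equi-sized): split the value range into k intervals of equal length.
lo, hi = min(incomes), max(incomes)
width = (hi - lo) / k
equi_width_bins = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

# Equi-depth: sort the values and put roughly the same number into each interval.
ordered = sorted(incomes)
step = len(ordered) // k
equi_depth_bins = [
    ordered[i * step:(i + 1) * step] if i < k - 1 else ordered[i * step:]
    for i in range(k)
]

print(equi_width_bins)  # wide intervals may be sparsely populated at the high end
print(equi_depth_bins)  # each interval holds about the same number of values
```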
Statistics-based method
Assume $\mu$ is the mean of the target attribute over the segment of the population covered by rule $R$, and $\mu'$ is the mean over the segment not covered by the rule. We can say that rule $R$ is interesting if
\[\mu' > \mu + \Delta\]
where $\Delta$ is a difference large enough to be considered interesting.
Statistical Hypothesis Testing
The research/alternative hypothesis ($H_1$) is
\[\mu' > \mu + \Delta\]
so the null hypothesis ($H_0$) is
\[\mu' \leq \mu + \Delta\]
which, taking the boundary case used for the test, can be written as
\[\mu' = \mu + \Delta, \quad \text{i.e.} \quad \mu' - \mu - \Delta = 0\]
As a result, we only need to show that
\[Z = \frac{\mu'-\mu-\Delta}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} > 1.64\]
where $s_1, s_2$ are the standard deviations and $n_1, n_2$ the sizes of the two segments, and $1.64$ is the critical value at the 95% confidence level, so we can reject the null hypothesis in favor of the alternative.
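A small sketch of how the test statistic would be computed; the segment statistics ($\mu$, $\mu'$, $s_1$, $s_2$, $n_1$, $n_2$) and $\Delta$ below are invented numbers, only meant to show the mechanics:

```python
import math

# Made-up statistics for the two segments (covered / not covered by rule R).
mu, s1, n1 = 30.0, 6.5, 500      # mean, std dev, size of the segment covered by R
mu_p, s2, n2 = 38.0, 8.2, 1200   # mean, std dev, size of the segment not covered
delta = 5.0                      # minimum difference considered interesting

z = (mu_p - mu - delta) / math.sqrt(s1**2 / n1 + s2**2 / n2)
print(z, z > 1.64)  # True => reject H0 at the 95% confidence level
```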
Non-discretization-based method
Min-Apriori is an algorithm for finding association rules in data with continuous attributes. It counts the support of itemsets without discretization. For example, suppose there are 3 documents:
| | $W_1$ | $W_2$ | $W_3$ |
|---|---|---|---|
| $D_1$ | 2 | 2 | 0 |
| $D_2$ | 0 | 0 | 1 |
| $D_3$ | 2 | 3 | 0 |
With Min-Apriori, we do not need to discretize the word counts. Instead, we normalize each word vector so that its values sum to 1 across the documents, and then use the new definition of support
\[support(C) = \sum_{i \in T}{\min_{j \in C}D(i,j)}\]
where $C$ is a set of words, $T$ is the set of documents, and $D(i,j)$ is the normalized count of word $j$ in document $i$.
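A small sketch of this support count on the document table above; normalizing each word's counts so they sum to 1 over the documents follows the usual Min-Apriori presentation, so treat that detail as an assumption:

```python
# Raw word counts per document (columns of the table above).
counts = {
    "W1": [2, 0, 2],   # counts in D1, D2, D3
    "W2": [2, 0, 3],
    "W3": [0, 1, 0],
}
num_docs = 3

# Normalize each word vector so that its values sum to 1 across the documents.
D = {w: [v / sum(col) for v in col] for w, col in counts.items()}

def support(C):
    """support(C) = sum over documents i of min over words j in C of D(i, j)."""
    return sum(min(D[w][i] for w in C) for i in range(num_docs))

print(round(support({"W1", "W2"}), 3))  # 0.9
```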
Multi-level Association Rule
The concept hierarchy raises the following issues:
- Rules at lower levels may not have enough support to appear in any frequent itemsets.
- Rules at lower levels of the hierarchy are overly specific.
For example, $\{\text{skim milk}\} \rightarrow \{\text{white bread}\}$, $\{\text{2\% milk}\} \rightarrow \{\text{wheat bread}\}$, and $\{\text{skim milk}\} \rightarrow \{\text{wheat bread}\}$ are all indicative of $\{\text{milk}\} \rightarrow \{\text{bread}\}$.
To deal with these issues, two approaches are introduced:
Augment transactions with higher level items
Before doing association analysis, extend the current association rule formulation by augmenting each transaction with the higher-level ancestors of its items.
However, this approach increases the dimensionality of the data, and too many frequent patterns involving items from the higher levels will be found, since items that reside at higher levels have much higher support counts.
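A minimal sketch of the augmentation step, assuming the concept hierarchy is given as a simple child-to-parent mapping (the items echo the milk/bread example above):

```python
# Child -> parent links of an illustrative concept hierarchy.
hierarchy = {
    "skim milk": "milk",
    "2% milk": "milk",
    "white bread": "bread",
    "wheat bread": "bread",
}

def augment(transaction):
    """Add every higher-level ancestor of each item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        parent = hierarchy.get(item)
        while parent is not None:   # walk up the hierarchy as far as it goes
            extended.add(parent)
            parent = hierarchy.get(parent)
    return extended

print(sorted(augment({"skim milk", "wheat bread"})))
# ['bread', 'milk', 'skim milk', 'wheat bread']
```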
Generate frequent patterns from highest level to lowest level
Generate frequent patterns at the highest level first. Then generate frequent patterns at the next level down, and so on.
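A rough sketch of this top-down strategy, assuming the transactions have already been re-expressed at each level of the hierarchy and that mine_frequent_itemsets is a hypothetical stand-in for any frequent-itemset miner (e.g. Apriori), not something defined in this note:

```python
def mine_top_down(transactions_by_level, minsup, mine_frequent_itemsets):
    """Mine frequent patterns one hierarchy level at a time, highest level first.

    transactions_by_level: list of (level, transactions) pairs ordered from the
    highest level to the lowest; mine_frequent_itemsets is a hypothetical miner.
    """
    frequent = {}
    for level, transactions in transactions_by_level:
        # Every level costs another full pass over the data (see the I/O note below).
        frequent[level] = mine_frequent_itemsets(transactions, minsup)
    return frequent
```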
The disadvantages of this approach are:
- More passes over the data lead to dramatically increased I/O requirements.
- It may miss some potentially interesting cross-level association patterns.