Categorical Data Clustering
This document introduces two methods for clustering categorical data: k-modes and ROCK.
k-modes
k-modes is a clustering method for categorical data. It works just like k-means, except that k-modes uses the most frequent value (the mode) in each dimension rather than the average.
For example, suppose we need to partition {$\overrightarrow{x_1}…\overrightarrow{x_5}$} into $k=2$ clusters given
\[\overrightarrow{x_1} = (\alpha, \text{large})\\ \overrightarrow{x_2} = (\beta, \text{small})\\ \overrightarrow{x_3} = (\beta, \text{medium})\\ \overrightarrow{x_4} = \overrightarrow{x_1} \\ \overrightarrow{x_5} = \overrightarrow{x_3}\]Initially, we randomly choose 2 modes $(\alpha, \text{large})$ & $(\beta, \text{small})$.
- the center of Cluster A : $y_A = (\alpha, \text{large})$
- the center of Cluster B : $y_B = (\beta, \text{small})$
So in the first iteration, we reallocate the cluster members with
- Cluster A : $x_1, x_4$
- Cluster B : $x_2, x_3, x_5$
and then update the modes
- the center of Cluster A : $y_A = (\alpha, \text{large})$
- the center of Cluster B : $y_B = (\beta, \text{medium})$
Afterwards, in the second iteration, we again reallocate the cluster members
- Cluster A : $x_1, x_4$
- Cluster B : $x_2, x_3, x_5$
that is, the modes are still
- the center of Cluster A : $(\alpha, \text{large})$
- the center of Cluster B : $(\beta, \text{medium})$
so we can stop updating since the modes are not changing.
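To make the procedure concrete, here is a minimal k-modes sketch in plain Python (the function and variable names are my own, distances are simple matching, and ties in the mode update are broken arbitrarily by `Counter`); it reproduces the toy example above.

```python
from collections import Counter

def hamming(x, mode):
    # simple-matching distance: number of attributes that differ
    return sum(xi != mi for xi, mi in zip(x, mode))

def k_modes(points, modes, n_iter=10):
    """Assign each point to its nearest mode, then recompute each mode
    as the most frequent value in every dimension; stop when modes settle."""
    for _ in range(n_iter):
        clusters = [[] for _ in modes]
        for x in points:
            j = min(range(len(modes)), key=lambda j: hamming(x, modes[j]))
            clusters[j].append(x)
        new_modes = []
        for j, members in enumerate(clusters):
            if not members:                      # keep the old mode of an empty cluster
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:                   # modes stopped changing
            break
        modes = new_modes
    return modes, clusters

points = [("alpha", "large"), ("beta", "small"), ("beta", "medium"),
          ("alpha", "large"), ("beta", "medium")]
modes, clusters = k_modes(points, [("alpha", "large"), ("beta", "small")])
print(modes)   # [('alpha', 'large'), ('beta', 'medium')]
```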
Dissimilarity Measure (2007)
If there is a data point $x$ and a cluster $C$, then the dissimilarity between them is
\[d(x, y_C) = \sum_{i=1}^D \frac{q_i}{p_i+q_i}\]where $D$ is the dimension of each data point, $p_i$ is the number of members in $C$ that share the same value as $x$ in dimension $i$, and $q_i$ is the number of members in $C$ that do not share the same value as $x$ in dimension $i$.
For example, imagine there is a situation that the current 2 modes are
- the center of Cluster A : $y_A = (1,1,天)$
- the center of Cluster B : $y_B = (1,1,黃)$
and the members are assigned as
- Cluster A :
- $x_1=(1,1,天)$
- $x_2=(1,1,地)$
- $x_3=(1,1,玄)$
- Cluster B :
- $x_4=(1,1,黃)$
- $x_5=(2,1,玄)$
- $x_6=(1,2,天)$
then which cluster should we assign $\overrightarrow{x_7} = (1,1,宇)$ to?
In this case, the “Dissimilarity Measure” would say that Cluster A is the answer because
\[d(x_7, y_A) = \frac{0}{3+0}+\frac{0}{3+0}+1 = 1 \\ d(x_7, y_B) = \frac{1}{2+1}+\frac{1}{2+1} + 1 \approx 1.67\]The value counts in each cluster are:

| | dimension 1 | dimension 2 | dimension 3 |
|---|---|---|---|
| Cluster A | $1$: 3 times, $2$: 0 times | $1$: 3 times, $2$: 0 times | 天: 1 time, 地: 1 time, 玄: 1 time, 黃: 0 times, 宇: 0 times |
| Cluster B | $1$: 2 times, $2$: 1 time | $1$: 2 times, $2$: 1 time | 天: 1 time, 地: 0 times, 玄: 1 time, 黃: 1 time, 宇: 0 times |
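As a sanity check, here is a small Python sketch of this dissimilarity (the function name is my own; a cluster is represented simply as the list of its members), reproducing the two values above.

```python
def dissimilarity(x, cluster):
    """d(x, C) = sum over dimensions i of q_i / (p_i + q_i), where p_i counts
    the members of C sharing x's value in dimension i and q_i counts the rest."""
    total = 0.0
    for i, value in enumerate(x):
        p = sum(1 for member in cluster if member[i] == value)
        q = len(cluster) - p
        total += q / (p + q)
    return total

cluster_A = [("1", "1", "天"), ("1", "1", "地"), ("1", "1", "玄")]
cluster_B = [("1", "1", "黃"), ("2", "1", "玄"), ("1", "2", "天")]
x7 = ("1", "1", "宇")
print(dissimilarity(x7, cluster_A))   # 1.0
print(dissimilarity(x7, cluster_B))   # ~1.67, so x7 is assigned to Cluster A
```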
Mode Initialization
To perform mode initialization, first we introduce 2 terms:
- Cluster Modes (C.M.), denoted $CM$ below.
- Finer Modes (F.M.), denoted $FM$ below.
Assume we want to cluster $n$ points into $k$ clusters. From the data we randomly draw sub-samples to obtain $J$ subsets $S_1, S_2, …, S_J$, where $J$ satisfies
\[0.1 \frac{n}{k} < J < 0.5\frac{n}{k}\]
Step 1. Sub-Sampling
- Let $CMU = \emptyset$
- For $i=1, 2, …, J$ do
- For set $S_i$, let ${CM}_i$ be the $k$ modes after running k-modes algorithm on $S_i$ using random initials.
- $CMU \leftarrow CMU \cup {CM}_i$
Step 2. Refine
- For $i=1, 2, …, J$ do
- Use ${CM}_i$ as initials to run k-modes on $CMU$.
- Let ${FM}_i$ be the $k$ modes after running k-modes algorithm on $CMU$.
Step 3. Selection
- Find the ${FM}_i$ (say, ${FM}_p=${$y_1,y_2,…,y_k$}) that minimizes the total distortion, and use ${FM}_p$ as the initial modes when running k-modes on the full data set, as sketched below.
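A sketch of the three steps in Python is given below; it reuses the hypothetical `hamming` and `k_modes` helpers from the k-modes sketch earlier, and the subset size, the choice of random initial modes per subset, and measuring the distortion over $CMU$ are my assumptions rather than details stated in the text.

```python
import random

def distortion(points, modes):
    # total distortion: each point's simple-matching distance to its nearest mode
    return sum(min(hamming(x, m) for m in modes) for x in points)

def initialize_modes(points, k, J, subset_size):
    # the text suggests choosing J with 0.1*n/k < J < 0.5*n/k
    # Step 1. Sub-Sampling: run k-modes on J random subsets and pool the modes
    CM, CMU = [], []
    for _ in range(J):
        S = random.sample(points, subset_size)
        modes_i, _ = k_modes(S, random.sample(S, k))   # random initial modes
        CM.append(modes_i)
        CMU.extend(modes_i)
    # Step 2. Refine: run k-modes on CMU once with every CM_i as the initials
    FM = [k_modes(CMU, CM_i)[0] for CM_i in CM]
    # Step 3. Selection: keep the FM_i with the smallest total distortion
    return min(FM, key=lambda modes: distortion(CMU, modes))
```

The returned modes then serve as the initial modes for a final k-modes run on the whole data set.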
In the bean experiment, if we run k-modes 20 times, the distributions of accuracies are quite different between random initialization and the mode-initialization method described here.
| accuracy | mode initialization | random initials |
|---|---|---|
| 98% | 14 | 5 |
| 94% | 0 | 2 |
| 89% | 0 | 2 |
| 77% | 0 | 3 |
| 70% | 5 | 0 |
| … | … | … |
ROCK-based Algorithm
ROCK is a robust hierarchical clustering algorithm that employs links instead of distances when merging clusters. It naturally extends to non-metric similarity measures that are relevant in situations where a domain expert or a similarity table is the only source of knowledge.
Definitions
Given transactions $\{T_i\}_{i=1}^N$, where $T_i = \{a_1, a_2, …\}$, define $D_{k}$ as the set of possible values of the $k^{\text{th}}$ attribute. The terms used by the ROCK algorithm are defined below.
- Weighted Similarity
- $Sim(T_i, T_j) = \frac{m}{m + 2\sum_{k \in M} \frac{1}{|D_k|}}$, where $m$ is the number of attributes on which $T_i$ and $T_j$ agree and $M$ is the set of attributes on which they disagree (this is the form consistent with the worked example below).
- Neighbor
- If $Sim(T_i, T_j) \ge \theta$, then we say $T_i$ and $T_j$ are neighbors.
- Link
- $Link(T_i, T_j)$ = number of common neighbors of $T_i$ and $T_j$.
- $Link(C_i, C_j)$ is the sum of $Link(T_p, T_q)$ over all pairs with $T_p \in C_i$ and $T_q \in C_j$.
So in ROCK algorithm, the goodness for merging 2 clusters $C_i$ & $C_j$ is defined as
\[g(C_i, C_j) = \frac{link(C_i, C_j)}{(\|C_i\| + \|C_j\|)^{R(\theta)} - \|C_i\|^{R(\theta)} - \|C_j\|^{R(\theta)}}\]where $3 < R(\theta) = \frac{3+\theta}{1-\theta} < \infty$. (e.g., $R(\theta) = 7$ if $\theta = 0.5$)
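As a quick numeric check of $R(\theta)$ and the goodness denominator, here is a small Python sketch (the function names are my own):

```python
def R(theta):
    # exponent used in the goodness denominator
    return (3 + theta) / (1 - theta)

def goodness(links, n_i, n_j, theta):
    # g = link(C_i, C_j) / ((n_i + n_j)^R - n_i^R - n_j^R), with n_i, n_j the cluster sizes
    r = R(theta)
    return links / ((n_i + n_j) ** r - n_i ** r - n_j ** r)

print(R(0.5))                   # 7.0
print(goodness(2, 1, 1, 0.5))   # 2 / 126 ≈ 0.0159
```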
Algorithm
The ROCK algorithm consists of 3 steps:
- Initially, each data point is a cluster.
- Merge the pair of clusters with the highest goodness.
- Repeat Step 2. until only $k$ clusters remain or until all links between clusters are 0.
Using a market-basket dataset as an example, suppose we have 4 transactions
\[T_1 = \{1,2,3,5\} = [1~1~1~0~1~0] \\ T_2 = \{2,3,4,5\} = [0~1~1~1~1~0] \\ T_3 = \{1,4\} = [1~0~0~1~0~0] \\ T_4 = \{6\} = [0~0~0~0~0~1]\]First, we compute the similarity of each pair of points to obtain the neighbor list of each data point (given $\theta = 0.5$).
\[Sim(T_1, T_2) = \frac{4}{4+2(1/2 + 1/2)} \\ Sim(T_1, T_3) = \frac{2}{2+2(1/2 + 1/2 + 1/2 + 1/2)} \\ ~ \\ \;\;\;\vdots \\ Sim(T_3, T_4) = \frac{3}{3+2(1/2 + 1/2 + 1/2)} \\ \bigg\downarrow \\ ~ \\ neighbor(T_1) = \{T_1, T_2\} \\ neighbor(T_2) = \{T_1, T_2\} \\ neighbor(T_3) = \{T_3, T_4\} \\ neighbor(T_4) = \{T_3, T_4\}\]Next, letting $C_i = \{T_i\}~~\forall i$, we compute the goodness between each pair of clusters ($R(\theta) = 7$ since $\theta = 0.5$).
\[Link(C_1, C_2) = 2 \\ Link(C_1, C_3) = 0 \\ Link(C_1, C_4) = 0 \\ \;\;\;\vdots \\ ~ \\ \bigg\downarrow \\ ~ \\ g(C_1, C_2) = \frac{Link(C_1, C_2)}{(1+1)^7-1^7-1^7} = \frac{2}{126} \approx 0.0159 \\ g(C_1, C_3) = 0 \\ g(C_1, C_4) = 0 \\ \;\;\;\vdots\]And then we merge the pair with the maximum goodness. That is, we merge $C_1$ and $C_2$, and call the new cluster $C_{1,2}=$ { $T_1, T_2$ }.
After that, we iteratively compute the goodness between each pair of clusters and merge clusters until we obtain $k$ clusters or until $Link(C_i, C_j)$ becomes $0$ for any cluster pair $(C_i, C_j)$.
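The round just described can be sketched in Python as follows (the helper names are my own; the weighted similarity is written in the form given in the Definitions section, with every item treated as a binary attribute so that $1/|D_k| = 1/2$):

```python
from itertools import combinations

THETA = 0.5
ITEMS = range(1, 7)                       # the six possible items
transactions = {1: {1, 2, 3, 5}, 2: {2, 3, 4, 5}, 3: {1, 4}, 4: {6}}

def sim(a, b):
    # weighted similarity: matches / (matches + 2 * sum of 1/|D_k| over mismatches);
    # every attribute is binary here, so each mismatch contributes 1/2 to the sum
    matches = sum((i in a) == (i in b) for i in ITEMS)
    mismatches = len(ITEMS) - matches
    return matches / (matches + 2 * (mismatches / 2))

# neighbor lists; a point is its own neighbor since sim(T, T) = 1 >= theta
neighbors = {i: {j for j in transactions if sim(transactions[i], transactions[j]) >= THETA}
             for i in transactions}

def link(ci, cj):
    # sum of common-neighbor counts over all cross-cluster pairs
    return sum(len(neighbors[p] & neighbors[q]) for p in ci for q in cj)

def goodness(ci, cj, theta=THETA):
    r = (3 + theta) / (1 - theta)
    return link(ci, cj) / ((len(ci) + len(cj)) ** r - len(ci) ** r - len(cj) ** r)

clusters = [{i} for i in transactions]    # each transaction starts as its own cluster
best = max(combinations(clusters, 2), key=lambda pair: goodness(*pair))
print(best, goodness(*best))              # ({1}, {2}) ~0.0159 -> merge C1 and C2
```

Note that $(C_3, C_4)$ attains the same goodness in this tiny example; `max` simply returns the first maximal pair, which matches the merge order used above.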
Experiment and Discussion
If we apply the ROCK algorithm to the mushroom data set, which contains 8124 mushrooms with 22 attributes, we can see three kinds of results given $\theta = 0.8$, $k=20$:
- In every cluster, the mushrooms are either all edible or all poisonous.
- Some clusters are well separated by the odor attribute, e.g., {none, anise, almond} for edible mushrooms and {foul, fishy, spicy} for poisonous ones.
- Some clusters are not so well separated.

Note that the final clusters have no links between them; they are in fact the connected components of the link graph. Also, the parameter $\theta$ determines the density of the link graph and thus affects the final clusters.