This document introduces clustering methods for categorical data, including k-modes and ROCK.

k-modes

k-modes is a clustering method for categorical data. It works just like k-means, except that k-modes uses the “most frequent value” of each attribute (the mode) as a cluster center rather than the “average”.

For example, suppose we need to partition {$\overrightarrow{x_1}…\overrightarrow{x_5}$} into $k=2$ clusters given

\[\overrightarrow{x_1} = (\alpha, \text{large})\\ \overrightarrow{x_2} = (\beta, \text{small})\\ \overrightarrow{x_3} = (\beta, \text{medium})\\ \overrightarrow{x_4} = \overrightarrow{x_1} \\ \overrightarrow{x_5} = \overrightarrow{x_3}\]

Initially, we randomly choose 2 modes $(\alpha, \text{large})$ & $(\beta, \text{small})$.

  • the center of Cluster A : $y_A = (\alpha, \text{large})$
  • the center of Cluster B : $y_B = (\beta, \text{small})$

So in the first iteration, we reallocate the cluster members with

  • Cluster A : $x_1, x_4$
  • Cluster B : $x_2, x_3, x_5$

and then update the modes

  • the center of Cluster A : $y_A = (\alpha, \text{large})$
  • the center of Cluster B : $y_B = (\beta, \text{medium})$

Afterwards, in the second iteration, we again reallocate the cluster members

  • Cluster A : $x_1, x_4$
  • Cluster B : $x_2, x_3, x_5$

that is, the modes are still

  • the center of Cluster A : $(\alpha, \text{large})$
  • the center of Cluster B : $(\beta, \text{medium})$

so we can stop updating since the modes are not changing.
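The whole procedure fits in a few lines of Python. The sketch below is only illustrative (the names `matching_distance` and `k_modes` are made up here): the distance is taken to be the number of mismatched attributes, and ties in the per-attribute vote are broken arbitrarily.

```python
from collections import Counter
import random

def matching_distance(a, b):
    # number of attributes on which two categorical records disagree
    return sum(ai != bi for ai, bi in zip(a, b))

def k_modes(points, k, init=None, n_iter=20):
    # plain k-modes: assign each point to the nearest mode, then take the
    # per-attribute majority value of each cluster as its new mode
    modes = list(init) if init is not None else random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: matching_distance(p, modes[c]))
            clusters[j].append(p)
        new_modes = []
        for j, members in enumerate(clusters):
            if not members:                      # keep the old mode of an empty cluster
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:                   # stop once the modes no longer change
            break
        modes = new_modes
    return modes, clusters

# the toy data above, with the same two initial modes
X = [("alpha", "large"), ("beta", "small"), ("beta", "medium"),
     ("alpha", "large"), ("beta", "medium")]
modes, clusters = k_modes(X, k=2, init=[("alpha", "large"), ("beta", "small")])
# modes -> [('alpha', 'large'), ('beta', 'medium')]
```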

Dissimilarity Measure (2007)

Given a data point $x$ and a cluster $C$ with mode $y_C$, the dissimilarity between $x$ and $C$ is

\[d(x, y_C) = \sum_{i=1}^D \frac{q_i}{p_i+q_i}\]

where $D$ is the dimension of each data point, $p_i$ is the number of members in $C$ that share the same value as $x$ in dimension $i$, and $q_i$ is the number of members in $C$ that do not share the same value as $x$ in dimension $i$.

For example, suppose the current 2 modes are

  • the center of Cluster A : $y_A = (1,1,天)$
  • the center of Cluster B : $y_B = (1,1,黃)$

and the members are assigned as

  • Cluster A :
    • $x_1=(1,1,天)$
    • $x_2=(1,1,地)$
    • $x_3=(1,1,玄)$
  • Cluster B :
    • $x_4=(1,1,黃)$
    • $x_5=(2,1,玄)$
    • $x_6=(1,2,天)$

then which cluster should we assign $\overrightarrow{x_7} = (1,1,宇)$ to?

In this case, the “Dissimilarity Measure” would say that Cluster A is the answer because

\[d(x_7, y_A) = \frac{0}{3+0}+\frac{0}{3+0}+1 = 1 \\ d(x_7, y_B) = \frac{1}{2+1}+\frac{1}{2+1} + 1 \approx 1.67\]
The per-dimension value counts behind this calculation are:

  • Cluster A :
    • dimension 1 : $1$ appears 3 times, $2$ appears 0 times
    • dimension 2 : $1$ appears 3 times, $2$ appears 0 times
    • dimension 3 : 天 appears 1 time, 地 appears 1 time, 玄 appears 1 time, 黃 and 宇 appear 0 times
  • Cluster B :
    • dimension 1 : $1$ appears 2 times, $2$ appears 1 time
    • dimension 2 : $1$ appears 2 times, $2$ appears 1 time
    • dimension 3 : 天 appears 1 time, 玄 appears 1 time, 黃 appears 1 time, 地 and 宇 appear 0 times
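The same bookkeeping can be done in code. The sketch below (with hypothetical names `dissimilarity`, `members_A`, `members_B`) simply counts $p_i$ and $q_i$ per dimension and reproduces $d(x_7, y_A) = 1$ and $d(x_7, y_B) \approx 1.67$.

```python
def dissimilarity(x, members):
    # d(x, C) = sum over dimensions i of q_i / (p_i + q_i), where p_i is the number
    # of members of C agreeing with x in dimension i and q_i the number disagreeing
    d = 0.0
    for i, xi in enumerate(x):
        p = sum(1 for m in members if m[i] == xi)
        q = len(members) - p
        d += q / (p + q)
    return d

members_A = [(1, 1, "天"), (1, 1, "地"), (1, 1, "玄")]
members_B = [(1, 1, "黃"), (2, 1, "玄"), (1, 2, "天")]
x7 = (1, 1, "宇")

print(dissimilarity(x7, members_A))  # 1.0  -> assign x7 to Cluster A
print(dissimilarity(x7, members_B))  # 1.666...
```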

Mode Initialization

To perform mode initialization, we first introduce 2 terms:

  • Cluster Modes (C.M.), denoted $CM$ below.
  • Finer Modes (F.M.), denoted $FM$ below.

Assume we want to cluster $n$ points into $k$ clusters. We randomly sub-sample the data to obtain $J$ subsets $S_1, S_2, …, S_J$, where $J$ is chosen so that

\[0.1 \frac{n}{k} < J < 0.5\frac{n}{k}\]

(for example, with $n = 1000$ and $k = 4$, $J$ lies between 25 and 125).

Step 1. Sub-Sampling

  1. Let $CMU = \emptyset$
  2. For $i=1, 2, …, J$ do
    • For each subset $S_i$, let ${CM}_i$ be the $k$ modes obtained by running the k-modes algorithm on $S_i$ with random initial modes.
    • $CMU = CMU \cup {CM}_i$

Step 2. Refine

  1. For $i=1, 2, …, J$ do
    • Use ${CM}_i$ as initial modes to run k-modes on $CMU$.
    • Let ${FM}_i$ be the $k$ modes obtained from this run.

Step 3. Select the Best Modes

  1. Find a ${FM}_i$ (say, ${FM}_p=${$y_1,y_2,…,y_k$}) that minimizes the total sum of distortion.
\[\text{total sum of distortion} = \\ \sum_{i=1}^N \min\bigg(\text{Distortion}(x_i, y_1), \text{Distortion}(x_i, y_2), ..., \text{Distortion}(x_i, y_k)\bigg)\]
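A rough sketch of Steps 1–3, reusing `matching_distance` and `k_modes` from the earlier sketch. The sub-sample size and the use of the matching distance as the distortion are assumptions made here for illustration; the notes above only constrain the number of subsets $J$.

```python
import random

def total_distortion(points, modes):
    # each point contributes its distance to the nearest of the k modes
    return sum(min(matching_distance(p, m) for m in modes) for p in points)

def refined_initial_modes(points, k, J, subset_size, seed=0):
    random.seed(seed)
    # Step 1: sub-sampling -- run k-modes on each random subset and pool the modes
    CM, CMU = [], []
    for _ in range(J):
        S = random.sample(points, subset_size)
        cm, _ = k_modes(S, k)                       # random initials on the subset
        CM.append(cm)
        CMU.extend(cm)
    # Step 2: refine -- run k-modes on the pooled modes CMU, once per CM_i as initials
    FM = [k_modes(CMU, k, init=cm)[0] for cm in CM]
    # Step 3: keep the FM_i with the smallest total distortion over the full data
    return min(FM, key=lambda fm: total_distortion(points, fm))
```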

In the bean experiment, if we run k-modes 20 times, the distribution of accuracies differs substantially between random initialization and the mode-initialization method described here.

  accuracy    mode initialization    random initials
  98%         14                     5
  94%         0                      2
  89%         0                      2
  77%         0                      3
  70%         5                      0

ROCK-based Algorithm

ROCK is a robust hierarchical clustering algorithm that employs links instead of distances when merging clusters. It naturally extends to non-metric similarity measures that are relevant in situations where a domain expert or a similarity table is the only source of knowledge.

Definitions

Given transactions $\{T_i\}_{i=1}^N$, where $T_i = \{ a_1, a_2, … \}$, define $D_{k}$ as the set of possible values of the $k^{th}$ attribute. The terms used by the ROCK algorithm are defined below.

  1. Weighted Similarity
\[Sim(T_i, T_j) = \frac{\|T_i \cap T_j\|}{\|T_i \cap T_j\| + 2 \sum_{a_k \notin T_i\cap T_j} \frac{1}{\|D_k\|}}\]
  2. Neighbor
    • If $Sim(T_i, T_j) \geq \theta$, then we say $T_i$ and $T_j$ are neighbors.
  3. Link
    • $Link(T_i, T_j)$ = number of common neighbors of $T_i$ and $T_j$.
    • $Link(C_i, C_j)$ is the sum of all point-wise links between the two clusters:
\[Link(C_i, C_j) = \sum_{T_p \in C_i,~T_q \in C_j} Link(T_p, T_q)\]

So in the ROCK algorithm, the goodness of merging 2 clusters $C_i$ and $C_j$ is defined as

\[g(C_i, C_j) = \frac{Link(C_i, C_j)}{(\|C_i\| + \|C_j\|)^{R(\theta)} - \|C_i\|^{R(\theta)} - \|C_j\|^{R(\theta)}}\]

where $3 < R(\theta) = \frac{3+\theta}{1-\theta} < \infty$. (e.g., $R(\theta) = 7$ if $\theta = 0.5$)
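A small Python sketch of these quantities (the helper names `weighted_sim`, `neighbor_matrix`, `link`, `cluster_link`, and `goodness` are made up here). Transactions are assumed to be fixed-length attribute vectors; following the worked market example below, $\|T_i \cap T_j\|$ is read as the number of attributes on which the two transactions agree, and the penalty sum runs over the attributes on which they disagree.

```python
def weighted_sim(t1, t2, domain_sizes):
    # agreeing attributes count toward the "intersection";
    # each disagreeing attribute k is penalized by 2 / |D_k|
    agree = sum(a == b for a, b in zip(t1, t2))
    penalty = 2 * sum(1.0 / domain_sizes[k]
                      for k, (a, b) in enumerate(zip(t1, t2)) if a != b)
    return agree / (agree + penalty)

def neighbor_matrix(transactions, theta, domain_sizes):
    # N[i][j] is True when Sim(T_i, T_j) >= theta (a point is its own neighbor)
    n = len(transactions)
    return [[weighted_sim(transactions[i], transactions[j], domain_sizes) >= theta
             for j in range(n)] for i in range(n)]

def link(nbr, i, j):
    # number of common neighbors of T_i and T_j
    return sum(a and b for a, b in zip(nbr[i], nbr[j]))

def cluster_link(nbr, Ci, Cj):
    # sum of point-wise links across two clusters (clusters are lists of point indices)
    return sum(link(nbr, i, j) for i in Ci for j in Cj)

def goodness(nbr, Ci, Cj, theta):
    # goodness of merging Ci and Cj, with exponent R(theta) = (3 + theta) / (1 - theta)
    R = (3 + theta) / (1 - theta)
    denom = (len(Ci) + len(Cj)) ** R - len(Ci) ** R - len(Cj) ** R
    return cluster_link(nbr, Ci, Cj) / denom
```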

Algorithm

The ROCK algorithm consists of 3 steps:

  1. Initially, each data point is its own cluster.
  2. Merge the pair of clusters with the highest goodness.
  3. Repeat Step 2 until only $k$ clusters remain or until all links between clusters are 0.
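These three steps translate into a short greedy loop, sketched below using the helpers from the previous block. It rescans all cluster pairs at every merge, which is fine for a toy example; an efficient implementation would maintain heaps of goodness values instead.

```python
def rock(transactions, k, theta, domain_sizes):
    nbr = neighbor_matrix(transactions, theta, domain_sizes)
    # Step 1: every data point starts as its own cluster (stored as index lists)
    clusters = [[i] for i in range(len(transactions))]
    while len(clusters) > k:
        # Step 2: find the pair of clusters with the highest goodness
        best_g, best_pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                g = goodness(nbr, clusters[a], clusters[b], theta)
                if g > best_g:
                    best_g, best_pair = g, (a, b)
        if best_pair is None:        # Step 3: stop early when all remaining links are 0
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```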

Using a market-basket dataset as an example, suppose we have 4 transactions

\[T_1 = \{1,2,3,5\} = [1~1~1~0~1~0] \\ T_2 = \{2,3,4,5\} = [0~1~1~1~1~0] \\ T_3 = \{1,4\} = [1~0~0~1~0~0] \\ T_4 = \{6\} = [0~0~0~0~0~1]\]

First, we compute the similarity of each pair of points to obtain the neighbor list of each data point (given $\theta = 0.5$). Here each transaction is treated as a binary vector over the 6 items, so every attribute domain has $\|D_k\| = 2$, and $\|T_i \cap T_j\|$ counts the attributes on which the two transactions agree (including agreement on absence), which is how the numerators below are obtained.

\[Sim(T_1, T_2) = \frac{4}{4+2(1/2 + 1/2)} \\ Sim(T_1, T_3) = \frac{2}{2+2(1/2 + 1/2 + 1/2 + 1/2)} \\ ~ \\ \;\;\;\vdots \\ Sim(T_3, T_4) = \frac{3}{3+2(1/2 + 1/2 + 1/2)} \\ \bigg\downarrow \\ ~ \\ neighbor(T_1) = \{T_1, T_2\} \\ neighbor(T_2) = \{T_1, T_2\} \\ neighbor(T_3) = \{T_3, T_4\} \\ neighbor(T_4) = \{T_3, T_4\}\]

Next, letting $C_i = \{T_i\}~~\forall i$, we compute the goodness between each pair of clusters ($R(\theta) = 7$ since $\theta = 0.5$).

\[Link(C_1, C_2) = 2 \\ Link(C_1, C_3) = 0 \\ Link(C_1, C_4) = 0 \\ \;\;\;\vdots \\ ~ \\ \bigg\downarrow \\ ~ \\ g(C_1, C_2) = \frac{Link(C_1, C_2)}{(1+1)^7-1^7-1^7} = \frac{2}{126} \approx 0.0159 \\ g(C_1, C_3) = 0 \\ g(C_1, C_4) = 0 \\ \;\;\;\vdots\]

And then we merge the pair that has the max goodness. That is, we merge $C_1$ and $C_2$, and we call the new cluster $C_{1,2}=$ { $T_1, T_2$ }.

After that, we iteratively compute the goodness between each pair of clusters and merge clusters until we obtain $k$ clusters or until $Link(C_i, C_j)$ becomes $0$ for any cluster pair $(C_i, C_j)$.
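Running the earlier sketch on these four transactions reproduces the hand computation: the neighbor lists, $g(C_1, C_2) = 2/126 \approx 0.0159$, and the final two clusters.

```python
# the four market-basket transactions above, as binary vectors over 6 items
T = [(1, 1, 1, 0, 1, 0),   # T_1 = {1, 2, 3, 5}
     (0, 1, 1, 1, 1, 0),   # T_2 = {2, 3, 4, 5}
     (1, 0, 0, 1, 0, 0),   # T_3 = {1, 4}
     (0, 0, 0, 0, 0, 1)]   # T_4 = {6}
domains = [2] * 6           # every item attribute is binary: present / absent

print(weighted_sim(T[0], T[1], domains))              # 4 / 6 = 0.666... >= 0.5, so neighbors
print(rock(T, k=2, theta=0.5, domain_sizes=domains))  # [[0, 1], [2, 3]]
```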

Experiment and Discussion

If we apply the ROCK algorithm to the mushroom data, which contains 8124 mushrooms with 22 attributes, we can see three kinds of results given $\theta = 0.8$, $k=20$:

  • Clusters in which the mushrooms are either all edible or all poisonous.
  • Clusters well characterized by the odor attribute, e.g., {none, anise, almond} for edible mushrooms and {foul, fishy, spicy} for poisonous ones.
  • Clusters that are not so well separated.

Note that the final clusters have zero links between them; they are exactly the connected components of the link graph. The parameter $\theta$ determines the density of the link graph and therefore affects the final clusters.

References