Analytical Data Clustering
Analytical clustering is a quick, automatic way to cluster data by preserving certain statistical features of the input. The method is analytical, deterministic, unsupervised, automatic, and non-iterative.
1-Dimensional 2-Class Clustering
For example, if we want to find 2 clusters among 3000 data points, the goal is to find
- the 2 representatives $X_A, X_B$
- the 2 percentages $P_A, P_B$ ($P_A+P_B = 100\%$)
- cluster $A$ will include $3000\times P_A$ points
- cluster $B$ will include $3000\times P_B$ points
Note that we will first obtain the cluster representatives ($X_A, X_B$) and then obtain their percentages ($P_A, P_B$).
Clustering can be viewed as a form of quantization, that is, representing all of the data with a much smaller amount of information.
Seen as quantization, all we need to do is solve for $4$ unknown variables ($X_A, X_B, P_A, P_B$). So we can do clustering with the following $4$ constraints:
- $P_A+P_B = 100\%$
- $P_AX_A +P_BX_B = \overline{X} = \frac{1}{3000}(X_1+X_2+…+X_{3000})$
- $P_AX_A^{2} +P_BX_B^{2} = \overline{X^{2}} = \frac{1}{3000}(X_1^{2}+X_2^{2}+…+X_{3000}^{2})$
- $P_AX_A^{3} +P_BX_B^{3} = \overline{X^{3}} = \frac{1}{3000}(X_1^{3}+X_2^{3}+…+X_{3000}^{3})$
which can be solved for the $4$ variables.
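These four equations have a closed-form solution: $X_A$ and $X_B$ turn out to be the two roots of a quadratic whose coefficients are determined by the sample moments (the same algebra used in moment-preserving thresholding). Below is a minimal NumPy sketch; the function name and the synthetic test data are illustrative assumptions, not taken from the referenced papers.

```python
import numpy as np

def two_class_1d(x):
    """Solve the 4 moment-preserving equations in closed form.

    X_A and X_B are the roots of z^2 + c1*z + c0 = 0, where c0 and c1
    are chosen so that the first three sample moments are preserved.
    """
    m1, m2, m3 = x.mean(), (x**2).mean(), (x**3).mean()
    cd = m2 - m1**2                     # sample variance (> 0 for non-constant data)
    c0 = (m1*m3 - m2**2) / cd
    c1 = (m1*m2 - m3) / cd
    disc = np.sqrt(c1**2 - 4*c0)
    X_A = (-c1 - disc) / 2              # smaller representative
    X_B = (-c1 + disc) / 2              # larger representative
    P_A = (X_B - m1) / (X_B - X_A)      # from P_A*X_A + P_B*X_B = mean, P_A + P_B = 1
    return X_A, X_B, P_A, 1 - P_A

# Two well-separated 1-D blobs, 3000 points in total.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.3, 1800), rng.normal(3, 0.3, 1200)])
print(two_class_1d(x))                  # roughly (-2, 3, 0.6, 0.4)
```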
2-Dimensional 2-Class Clustering
Now we need to find
- the 2 representatives $(X_A, Y_A), (X_B, Y_B)$
- the 2 percentages $P_A, P_B$ ($P_A+P_B = 100\%$)
- cluster $A$ will include $3000\times P_A$ points
- cluster $B$ will include $3000\times P_B$ points
Assume that
\[X_i = r_i\cos\theta_i \\ Y_i = r_i\sin\theta_i \\ where~i \in \{A,B\}\]the following $6$ formulas will be used to solve these variables:
- $P_A+P_B = 100\%$
- $P_AX_A+P_BX_B=\overline{X}$
- $P_AY_A+P_BY_B=\overline{Y}$
- $P_Ar_A+P_Br_B=\overline{r}$
- $P_A\theta_A+P_B\theta_B=\overline{\theta}$
- Principal axis of $\{(X_A, Y_A), (X_B, Y_B)\}$ = principal axis of the data
To simplify the computation, we first translate the data so that $\overline{X}=0$ and $\overline{Y}=0$. Note that $\overline{xy}, \overline{x^2}, \overline{y^2}, \overline{r}, \overline{\theta}$ are all known values computed from the data.
Because
- $P_AX_A+P_BX_B=\overline{X}=0$
- $P_AY_A+P_BY_B=\overline{Y}=0$
therefore
- $X_A=\frac{-P_B}{P_A}X_B$
- $Y_A=\frac{-P_B}{P_A}Y_B$
indicating that
\[r_A = \sqrt{X_A^2+Y_A^2}=\frac{P_B}{P_A}\sqrt{X_B^2+Y_B^2}=\frac{P_B}{P_A}r_B \\ \because \overline{r}=P_Ar_A+P_Br_B=P_A\bigg(\frac{P_B}{P_A}r_B\bigg)+P_Br_B = 2P_Br_B \\ \therefore r_B = \frac{\overline{r}}{2P_B}\]And the angle of the principal axis will be
\[\theta_A = \frac{1}{2}\tan^{-1}\bigg(\frac{2\overline{xy}}{\overline{x^2}- \overline{y^2}}\bigg) \\ \theta_B = \theta_A+\pi ~(\because~the~principal~axis~is~preserved) \\ \because \overline{\theta} = P_A\theta_A+P_B\theta_B = (1-P_B)\theta_A+P_B(\theta_A + \pi) \\ \therefore P_B=(\overline{\theta}-\theta_A)/\pi\]The principal axis is preserved before and after clustering. It is the line that minimizes the sum of squared perpendicular distances from the data points to the axis.
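The derivation above translates almost directly into code. The NumPy sketch below assumes the data have already been translated to zero mean, follows the formulas literally, and does not guard against degenerate cases (for example, angle wrap-around when averaging $\theta$, or a non-positive $P_B$); the function name is an assumption.

```python
import numpy as np

def two_class_2d(pts):
    """2-D analytical 2-class clustering (pts: (N, 2) array with zero mean)."""
    x, y = pts[:, 0], pts[:, 1]
    r_bar = np.hypot(x, y).mean()                 # mean radius
    th_bar = np.arctan2(y, x).mean()              # naive mean angle (no wrap handling)
    # Angle of the principal axis; arctan2 is used instead of tan^-1
    # to keep the correct quadrant.
    th_A = 0.5 * np.arctan2(2 * (x * y).mean(), (x**2).mean() - (y**2).mean())
    th_B = th_A + np.pi                           # B sits on the opposite end of the axis
    P_B = (th_bar - th_A) / np.pi                 # from mean-angle preservation
    P_A = 1 - P_B
    r_B = r_bar / (2 * P_B)                       # from mean-radius preservation
    r_A = (P_B / P_A) * r_B                       # from X- and Y-mean preservation
    A = r_A * np.array([np.cos(th_A), np.sin(th_A)])
    B = r_B * np.array([np.cos(th_B), np.sin(th_B)])
    return A, B, P_A, P_B
```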
3-Dimensional 2-Class Clustering
Analytical 2-class clustering of 3-D data solves for $8$ unknown variables ($P_A, P_B, X_A, X_B, Y_A, Y_B, Z_A, Z_B$) using one of the following methods, each of which preserves a different set of statistics before and after clustering (lowercase $x, y, z$ denote the data after the zero-mean translation, as in the 2-D case).
Method 1. Preserve $\{P;~\overline{x}, \overline{y}, \overline{z}, \overline{|xy|}, \overline{|xz|}, \overline{|yz|}, \overline{|xyz|}\}$ before and after clustering.
\[P_A = \frac{1}{2} \pm \frac{1}{2}\sqrt{\frac{\sqrt{\delta^2+8\delta}-\delta-2}{2}} \\ where~\delta=\frac{\overline{|xyz|}^2}{\overline{|xy|}~\overline{|xz|}~\overline{|yz|}} \\ x_A = \pm\sqrt{\frac{P_B}{P_A}\frac{\overline{|xy|}~\overline{|xz|}}{\overline{|yz|}}} \\ y_A = \pm\sqrt{\frac{P_B}{P_A}\frac{\overline{|xy|}~\overline{|yz|}}{\overline{|xz|}}} \\ z_A = \pm\sqrt{\frac{P_B}{P_A}\frac{\overline{|xz|}~\overline{|yz|}}{\overline{|xy|}}}\]Method 2. Preserve $\{P;~\overline{x}, \overline{y}, \overline{z}, \overline{xy}, \overline{xz}, \overline{yz}, \overline{xyz}\}$ before and after clustering.
\[P_A = \frac{1}{2} \pm \frac{1}{2}\sqrt{\frac{(\overline{xyz})^2}{(\overline{xyz})^2+4\overline{xy}~\overline{xz}~\overline{yz}}} \\ x_A = \pm\sqrt{\frac{P_B}{P_A}\frac{\overline{xy}~\overline{xz}}{\overline{yz}}} \\ y_A = \pm\sqrt{\frac{P_B}{P_A}\frac{\overline{xy}~\overline{yz}}{\overline{xz}}} \\ z_A = \pm\sqrt{\frac{P_B}{P_A}\frac{\overline{xz}~\overline{yz}}{\overline{xy}}}\]Method 3. Preserve $\{P;~\overline{x}, \overline{y}, \overline{z}, \overline{x^2}, \overline{y^2}, \overline{z^2}, \overline{r}\}$ before and after clustering, where $\overline{r}=\frac{1}{N}\sum_{k=1}^{N}\sqrt{x_k^2+y_k^2+z_k^2}$.
\[P_A = \frac{1}{2} \pm \frac{1}{2}\sqrt{1-\frac{(\overline{r})^2}{\overline{x^2}+\overline{y^2}+\overline{z^2}}} \\ x_A = \pm\sqrt{\frac{P_B}{P_A}\overline{x^2}} \\ y_A = \pm\sqrt{\frac{P_B}{P_A}\overline{y^2}} \\ z_A = \pm\sqrt{\frac{P_B}{P_A}\overline{z^2}}\]Method 4. Preserve $\{P;~\overline{x}, \overline{y}, \overline{z}, \overline{x^2}, \overline{y^2}, \overline{z^2}, \overline{xyz}\}$ before and after clustering.
\[P_A = \frac{1}{2} \pm \frac{1}{2}\sqrt{\frac{(\overline{xyz})^2}{(\overline{xyz})^2+4\overline{x^2}~\overline{y^2}~\overline{z^2}}} \\ x_A = \pm\sqrt{\frac{P_B}{P_A}\overline{x^2}} \\ y_A = \pm\sqrt{\frac{P_B}{P_A}\overline{y^2}} \\ z_A = \pm\sqrt{\frac{P_B}{P_A}\overline{z^2}}\]
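As an illustration, here is a minimal NumPy sketch of Method 3, the variant that needs only the second moments and the mean radius. The data are assumed to have been translated to zero mean, one fixed sign is chosen among the $\pm$ options, and the function name is an assumption.

```python
import numpy as np

def two_class_3d_method3(pts):
    """Method 3: preserve the means, second moments, and mean radius.

    pts: (N, 3) array, assumed already translated to zero mean.
    """
    x2, y2, z2 = (pts**2).mean(axis=0)            # second moments
    r_bar = np.linalg.norm(pts, axis=1).mean()    # mean radius
    P_A = 0.5 + 0.5 * np.sqrt(1 - r_bar**2 / (x2 + y2 + z2))  # '+' branch of the +-
    P_B = 1 - P_A
    A = np.sqrt(P_B / P_A) * np.sqrt([x2, y2, z2])  # all-positive sign choice
    B = -(P_A / P_B) * A          # ensures P_A*A + P_B*B = 0 (zero mean preserved)
    return A, B, P_A, P_B
```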
d-Dimensional 2-Class Clustering
Use the lookup table in Lin and Lin (1996), "Analytical two-class clustering formulas for high dimensional data," to solve the variables.
Multi-class Clustering
Step 1. Do analytical clustering for $k$ = 2.
Step 2. Draw the perpendicular bisector of the segment joining the two cluster centers, and split the data along it.
Step 3. For each part of the split data, apply Step 1 and Step 2 recursively.
Step 4. Merge two nearby clusters that share the same boundary cut.
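Below is a compact sketch of the recursive procedure (Steps 1 to 3; the merging of Step 4 is omitted). As a simplification, the 2-class step here projects the data onto their principal axis and applies the 1-D closed-form formulas from the first section, rather than the full 2-D/3-D formulas; `max_depth` and the helper names are assumptions.

```python
import numpy as np

def two_class_1d(x):
    """Closed-form 1-D 2-class step (same equations as in the 1-D section)."""
    m1, m2, m3 = x.mean(), (x**2).mean(), (x**3).mean()
    cd = m2 - m1**2
    c0, c1 = (m1*m3 - m2**2) / cd, (m1*m2 - m3) / cd
    disc = np.sqrt(max(c1**2 - 4*c0, 0.0))
    return (-c1 - disc) / 2, (-c1 + disc) / 2

def recursive_clusters(pts, max_depth=3):
    """Steps 1-3: 2-class split, perpendicular-bisector cut, then recurse."""
    if max_depth == 0 or len(pts) < 2:
        return [pts]
    centered = pts - pts.mean(axis=0)
    # Principal axis = leading eigenvector of the covariance matrix.
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    proj = centered @ vecs[:, -1]                 # 1-D projection onto the axis
    x_A, x_B = two_class_1d(proj)
    cut = 0.5 * (x_A + x_B)                       # bisector between the two centers
    left, right = pts[proj <= cut], pts[proj > cut]
    if len(left) == 0 or len(right) == 0:         # could not split; stop here
        return [pts]
    return (recursive_clusters(left, max_depth - 1)
            + recursive_clusters(right, max_depth - 1))
```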
Applications
The applications in 3 dimensions are
- Color Image Sharpening
- Color Image Compression
Also, for $d$ dimensions, the applications are
- Texture Analysis ($d$ = 6 to 8)
- Politics Investigation (say, $d$ = 10)
- Vector Quantization (say, $d$ = 16)
Strengths
- Automatic and fast
- Less sensitive to noise
- No need to do cluster center initialization
- Non-iterative
- Results similar to those of k-means
- Can be used as the cluster-center initialization for k-means (a form of CCIA)
Weaknesses
- Complicated and tedious
- When $k \geq 5$ ($k$ is the number of classes), no closed-form solution exists; in this case clustering can only be done iteratively
References
- Lin, J. C., & Tsai, W. H. (1994). Feature-preserving clustering of 2-D data for two-class problems using analytical formulas: An automatic and fast approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 554-560.
- Lin, J. C. (1996). Multi-class clustering by analytical two-class formulas. International Journal of Pattern Recognition and Artificial Intelligence, 10(4), 307-323.
- Lin, J. C., & Lin, W. J. (1996). Analytical two-class clustering formulas for high dimensional data. Pattern Recognition, 29(11), 1919-1930.