Agglomerative Method for Hierarchical Clustering
Agglomerative clustering starts with each point as an individual cluster. At each step, it merges the closest pair of clusters until only one cluster (or k clusters) remains.
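The merge loop described above can be sketched in a few lines of pure Python. This is a minimal illustration, not a production implementation: it assumes 2-D points, Euclidean distance, and single-link (minimum) distance between clusters.

```python
import math

def agglomerative(points, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]          # start: every point is its own cluster

    def dist(a, b):                           # single-link distance between clusters
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # find the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(points, 2))               # two well-separated groups
```

Other cluster-distance definitions (complete link, group average) only change the `dist` function; the merge loop stays the same.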
In this article, I will introduce a regression algorithm (linear regression), classical classifiers such as decision trees, naïve Bayes, and support vector machines, an unsupervised clustering algorithm (k-means), and a reinforcement learning technique (the cross-entropy method), to give a small glimpse of the variety of machine learning techniques that exist. We will end this list by introducing neural networks.
It is a well-known fact, one we have already mentioned, that 1-layer neural networks cannot represent the XOR function: 1-layer neural nets can only classify linearly separable sets. However, as we have seen, the Universal Approximation Theorem states that a 2-layer network can approximate any function, given a sufficiently complex architecture.
We will now create a neural network with two neurons in the hidden layer and show how it can model the XOR function. However, we will write code that the reader can easily modify to allow any number of layers and any number of neurons per layer, so that different scenarios can be simulated. We will use the hyperbolic tangent as the activation function for this network. To train the network, we will implement the back-propagation algorithm discussed earlier.
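A compact NumPy version of such a network looks as follows. This is a sketch under my own assumptions, not the post's more general code: I use four hidden units rather than two purely for training robustness, targets in {-1, +1} so that tanh outputs can match them, and arbitrary choices of seed, learning rate, and epoch count.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[-1], [1], [1], [-1]], dtype=float)   # XOR, encoded as -1 / +1

W1 = rng.uniform(-1, 1, (2, 4)); b1 = np.zeros(4)   # input  -> hidden
W2 = rng.uniform(-1, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 0.1

for _ in range(20000):
    # forward pass with tanh activations
    h = np.tanh(X @ W1 + b1)
    out = np.tanh(h @ W2 + b2)
    # backward pass: gradients of the squared error, tanh'(z) = 1 - tanh(z)^2
    d_out = (out - y) * (1 - out ** 2)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(np.sign(out).ravel())
```

After training, the signs of the outputs reproduce the XOR targets, which a single-layer network cannot do.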
In addition, if you are interested in the mathematical derivation of this implementation, please see my other post .
Most real-world data are imbalanced in nature. An imbalanced class distribution is a scenario where the number of observations belonging to one class is significantly lower than the number belonging to the other classes. This is a problem because machine learning algorithms are usually designed to improve accuracy by reducing error, so they do not take the class distribution or proportion of classes into account.
Accuracy Paradox
The accuracy paradox is the case where your accuracy measure tells you that you have excellent accuracy (such as 90%), but the accuracy only reflects the underlying class distribution.
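A toy example makes the paradox concrete. The numbers below are made up for illustration: on a 90/10 class split, a classifier that always predicts the majority class scores 90% accuracy while never detecting a single minority-class example.

```python
# 90% negatives, 10% positives -- an imbalanced dataset
y_true = [0] * 90 + [1] * 10
# a "classifier" that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_pos = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / 10

print(accuracy)     # 0.9 -- looks excellent
print(recall_pos)   # 0.0 -- but no positive is ever caught
```

This is why metrics such as precision, recall, and F1 are preferred over raw accuracy for imbalanced problems.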
A sequence is an ordered list of elements (transactions): for example, the purchase history of a given customer, the history of events generated by a given sensor, or the browsing activity of a particular Web visitor.
A sequence $s$ is defined as
\[s = \langle e_1~e_2~e_3~\dots \rangle\]
where $e_i$ is the $i^{th}$ element, containing a collection of events (items) and attributed to a specific time or location.
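A basic operation on such sequences is the containment test used in sequential pattern mining: a sequence $t$ is contained in $s$ if each element of $t$ is a subset of some element of $s$, with order preserved. The sketch below represents elements as Python sets; the example data is invented.

```python
def is_subsequence(t, s):
    """Return True if sequence t is contained in sequence s."""
    i = 0
    for element in s:
        # match the next element of t against this element of s (set containment)
        if i < len(t) and t[i] <= element:
            i += 1
    return i == len(t)

# a customer's transaction history: four elements (itemsets) in time order
s = [{"a"}, {"b", "c"}, {"d"}, {"c", "e"}]
print(is_subsequence([{"a"}, {"c"}], s))   # True: "a" then "c" appear in order
print(is_subsequence([{"c"}, {"a"}], s))   # False: no "a" after any "c"
```

This greedy left-to-right scan is the standard containment check; algorithms such as GSP apply it repeatedly when counting the support of candidate sequences.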
Before we do association analysis, we need to handle the following 2 issues:
- When there is a numerical attribute, it must be discretized into interesting intervals, with each interval represented as a Boolean attribute.
- The size of the discretized intervals affects support & confidence.
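The discretization step can be sketched with equal-width binning, where each bin becomes a Boolean attribute. The bin count of 3 and the sample ages are arbitrary choices of mine; as noted above, a different interval size would change the support and confidence of the resulting rules.

```python
def discretize(values, n_bins):
    """Turn a numeric attribute into n_bins Boolean interval attributes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    rows = []
    for v in values:
        # clamp so the maximum value falls into the last bin
        b = min(int((v - lo) / width), n_bins - 1)
        rows.append({f"in_bin_{i}": (i == b) for i in range(n_bins)})
    return rows

ages = [18, 25, 33, 47, 61]        # hypothetical numeric attribute
for row in discretize(ages, 3):
    print(row)
```

Equal-width binning is the simplest scheme; equal-frequency or clustering-based binning only changes how the interval boundaries are chosen.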
WhaleCharger is a budgeting app designed for students. It does away with complicated income-and-expense bookkeeping and is instead driven by the "budget" you set: based on your remaining budget, it calculates how much you can still spend per day on average. Beyond expense tracking, WhaleCharger also offers scheduled reminders and item search. The whale on the home page is interactive, and its mood depends on how frequently you record your expenses.
EmailPro is a computer-assisted English email writing system that improves the efficiency and quality of English emails you write in Gmail. Based on what you have written so far, EmailPro predicts your next words and suggests multiple appropriate patterns and collocations, helping you keep writing fluently while frequently using correct English collocations and avoiding common grammar mistakes.
Types of queries one wants to answer on a data stream: