The goal of this document is to practice Spark programming on the Hadoop platform with the following problems.

  1. In the text file (Youvegottofindwhatyoulove.txt), show the top 30 most frequently occurring words and their average occurrences per sentence (a PySpark sketch follows this list). According to the result, what are the characteristics of these words?
  2. Implement a program to calculate the average amount paid in credit card trips for each passenger count from one to four, using the September 2017 NYC Yellow Taxi trip data. In the NYC Taxi data, “Passenger_count” is a driver-entered value, so also explain how you deal with the data loss issue.
  3. For each of tasks 1 and 2 above, compare the execution time on a local worker and on a YARN cluster, and discuss your observations.
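
As a minimal illustrative sketch of task 1 only (not a full solution): assuming the text file is reachable by Spark at the path below and that sentences are split naively on `.`, `?`, and `!`, a PySpark version might look like this.

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

# The path is an assumption; point it at the file's actual HDFS or local location.
lines = sc.textFile("Youvegottofindwhatyoulove.txt")

# Naive sentence split on ., ?, !; a real tokenizer would handle abbreviations etc.
sentences = lines.flatMap(lambda l: re.split(r"[.?!]", l)).filter(lambda s: s.strip())
num_sentences = sentences.count()

words = sentences.flatMap(lambda s: re.findall(r"[a-z']+", s.lower()))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Top 30 words by frequency, with average occurrences per sentence.
for word, cnt in counts.takeOrdered(30, key=lambda kv: -kv[1]):
    print(word, cnt, cnt / num_sentences)
```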

Read More

In this document, we will walk through the exploration of the Airline On-Time Performance dataset with Apache Pig as our exploration tool. We will analyze eight years of flight data to answer the following three analytic questions.

  1. Find the maximum delays (considering both ArrDelay and DepDelay) for each month of 2008 (a rough pandas cross-check follows this list).
  2. How many flight delays were caused by weather between 2000 and 2005? Please show the count for each year.
  3. List the top 5 airports with the most delays in 2007 (please show the IATA airport codes).
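
The post itself works in Pig Latin; as a rough cross-check only, question 1 might look like this in pandas. The column names follow the public Airline On-Time Performance schema, and the file name `2008.csv` is an assumption.

```python
import pandas as pd

# Month, ArrDelay, DepDelay follow the public dataset's schema; the path is an assumption.
df = pd.read_csv("2008.csv", usecols=["Month", "ArrDelay", "DepDelay"])

# Maximum delay per month, considering both arrival and departure delays.
df["MaxDelay"] = df[["ArrDelay", "DepDelay"]].max(axis=1)
print(df.groupby("Month")["MaxDelay"].max())
```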

Read More

A number of cluster validity measures have been proposed to help us not only validate our clustering results but also select the number of clusters.

For fuzzy clustering, we can assess our clustering results with validity measures such as the Partition Coefficient, Partition Entropy, XB index, and Overlap Separation measure.
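
For example, for a fuzzy membership matrix $U = [u_{ik}]$ over $n$ points and $c$ clusters, the Partition Coefficient and Partition Entropy are

$$
\mathrm{PC}(U) = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{2},
\qquad
\mathrm{PE}(U) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}\log u_{ik},
$$

where a PC close to 1 and a small PE indicate a crisp, well-separated partition.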

For hard clustering, we can use measures such as the Davies-Bouldin (DB) index and the Dunn index.
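
As a small illustration, a brute-force Dunn index (smallest between-cluster distance divided by the largest within-cluster diameter) can be sketched as follows; treat it as a rough reference, not an optimized implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest diameter over all clusters.
    max_diam = max(cdist(c, c).max() for c in clusters)
    # Smallest single-linkage distance between points of different clusters.
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam
```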

And for hierarchical clustering, we can select the best $k$ by stopping at the “Big Jump” in merge distances while performing agglomerative clustering.
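
A short sketch of that heuristic with SciPy's agglomerative linkage (Ward linkage here is an arbitrary choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def big_jump_k(X):
    """Pick k at the largest jump in the agglomerative merge distances."""
    Z = linkage(X, method="ward")   # merge distances are stored in Z[:, 2]
    jumps = np.diff(Z[:, 2])        # gaps between consecutive merge distances
    i = int(np.argmax(jumps))       # index of the biggest jump
    # Stopping before that merge leaves n - 1 - i clusters.
    return X.shape[0] - 1 - i
```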

Read More

Probabilistic D-Clustering is a new iterative method for probabilistic clustering of data. Given clusters, their centers and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster. This assumption is the working principle of Probabilistic D-Clustering.
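
In particular, if the membership probabilities at each point sum to one, this principle leads to (with $d_k(x)$ the distance of $x$ from the $k$-th center)

$$
p_k(x) = \frac{\prod_{j \neq k} d_j(x)}{\sum_{i=1}^{K} \prod_{j \neq i} d_j(x)},
$$

so that, for example, with two clusters $p_1(x) = d_2(x) / \bigl(d_1(x) + d_2(x)\bigr)$.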

At each iteration, the distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for all data points, and the centers are updated as convex combinations of these points, with weights determined by the above principle.
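
A compact NumPy sketch of this iteration under Euclidean distances follows; the center-update weights $p^2/d$ are one natural choice consistent with the principle above, but treat the whole snippet as an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def pd_cluster(X, K, iters=100, eps=1e-9):
    """Probabilistic d-clustering sketch with Euclidean distances."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Distances of every point to every center, shape (n, K).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        # Membership probabilities inversely proportional to distance.
        inv = 1.0 / d
        p = inv / inv.sum(axis=1, keepdims=True)
        # Centers as convex combinations of the points; the p**2 / d
        # weighting is an assumed choice, not necessarily the authors' exact rule.
        w = p**2 / d
        centers = (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
    return centers, p
```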

Progress is monitored by the joint distance function, a measure of distance from all cluster centers, that evolves during the iterations, and captures the data in its low contours.
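
One way to write this joint distance function at a point $x$ is

$$
D(x) = \frac{\prod_{k=1}^{K} d_k(x)}{\sum_{i=1}^{K} \prod_{j \neq i} d_j(x)},
$$

which for two clusters reduces to $d_1(x)\,d_2(x) / \bigl(d_1(x) + d_2(x)\bigr)$.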

This method is simple, fast (requiring a small number of cheap iterations) and insensitive to outliers.

Read More