The goal of this document is to practice Spark programming on the Hadoop platform with the following problems.

  1. In the text file (Youvegottofindwhatyoulove.txt), show the top 30 most frequently occurring words and their average occurrences per sentence (a PySpark sketch follows this list). According to the result, what are the characteristics of these words?
  2. Implement a program to calculate the average amount paid in credit card trips for each passenger count from one to four, using the September 2017 NYC Yellow Taxi trip data. In the NYC Taxi data, “Passenger_count” is a driver-entered value, so also explain how you deal with the data loss issue.
  3. For each of tasks 1 and 2 above, compare the execution time on a local worker and on a YARN cluster, and discuss your observations.
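
As a minimal illustrative sketch of task 1 only (not a full solution): assuming the text file is reachable by Spark at the path below and that sentences are split naively on `.`, `?`, and `!`, a PySpark version might look like this.

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

# The path is an assumption; point it at the file's actual HDFS or local location.
lines = sc.textFile("Youvegottofindwhatyoulove.txt")

# Naive sentence split on ., ?, !; a real tokenizer would handle abbreviations etc.
sentences = lines.flatMap(lambda l: re.split(r"[.?!]", l)).filter(lambda s: s.strip())
num_sentences = sentences.count()

words = sentences.flatMap(lambda s: re.findall(r"[a-z']+", s.lower()))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Top 30 words by frequency, with average occurrences per sentence.
for word, cnt in counts.takeOrdered(30, key=lambda kv: -kv[1]):
    print(word, cnt, cnt / num_sentences)
```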

Read More

In this document, we will walk through the exploration of the Airline On-Time Performance dataset with Apache Pig as our exploration tool. We will analyze eight years of flight data to answer the following three analytic questions.

  1. Find the maximum delays (considering both ArrDelay and DepDelay) for each month of 2008 (a rough pandas cross-check follows this list).
  2. How many flight delays were caused by weather between 2000 and 2005? Please show the count for each year.
  3. List the top 5 airports with the most delays in 2007 (please show the IATA airport codes).
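
The post itself works in Pig Latin; as a rough cross-check only, question 1 might look like this in pandas. The column names follow the public Airline On-Time Performance schema, and the file name `2008.csv` is an assumption.

```python
import pandas as pd

# Month, ArrDelay, DepDelay follow the public dataset's schema; the path is an assumption.
df = pd.read_csv("2008.csv", usecols=["Month", "ArrDelay", "DepDelay"])

# Maximum delay per month, considering both arrival and departure delays.
df["MaxDelay"] = df[["ArrDelay", "DepDelay"]].max(axis=1)
print(df.groupby("Month")["MaxDelay"].max())
```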

Read More

A number of cluster validity measures have been proposed to help us not only validate our clustering results but also select the number of clusters.

For fuzzy clustering, we can assess our clustering results with validity measures such as the Partition Coefficient, Partition Entropy, XB index, and Overlap Separation measure.
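
For example, for a fuzzy membership matrix $U = [u_{ik}]$ over $n$ points and $c$ clusters, the Partition Coefficient and Partition Entropy are

$$
\mathrm{PC}(U) = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{2},
\qquad
\mathrm{PE}(U) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}\log u_{ik},
$$

where a PC close to 1 and a small PE indicate a crisp, well-separated partition.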

For hard clustering, we can use measures such as the Davies-Bouldin (DB) index and the Dunn index.
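
As a small illustration, a brute-force Dunn index (smallest between-cluster distance divided by the largest within-cluster diameter) can be sketched as follows; treat it as a rough reference, not an optimized implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest diameter over all clusters.
    max_diam = max(cdist(c, c).max() for c in clusters)
    # Smallest single-linkage distance between points of different clusters.
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam
```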

And for hierarchical clustering, we can select the best $k$ by stopping at the “Big Jump” in merge distances while performing agglomerative clustering.
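
A short sketch of that heuristic with SciPy's agglomerative linkage (Ward linkage here is an arbitrary choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def big_jump_k(X):
    """Pick k at the largest jump in the agglomerative merge distances."""
    Z = linkage(X, method="ward")   # merge distances are stored in Z[:, 2]
    jumps = np.diff(Z[:, 2])        # gaps between consecutive merge distances
    i = int(np.argmax(jumps))       # index of the biggest jump
    # Stopping before that merge leaves n - 1 - i clusters.
    return X.shape[0] - 1 - i
```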

Read More

Probabilistic D-Clustering is a new iterative method for probabilistic clustering of data. Given clusters, their centers and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster. This assumption is the working principle of Probabilistic D-Clustering.
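
In particular, if the membership probabilities at each point sum to one, this principle leads to (with $d_k(x)$ the distance of $x$ from the $k$-th center)

$$
p_k(x) = \frac{\prod_{j \neq k} d_j(x)}{\sum_{i=1}^{K} \prod_{j \neq i} d_j(x)},
$$

so that, for example, with two clusters $p_1(x) = d_2(x) / \bigl(d_1(x) + d_2(x)\bigr)$.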

At each iteration, the distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for all data points, and the centers are updated as convex combinations of these points, with weights determined by the above principle.
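
A compact NumPy sketch of this iteration under Euclidean distances follows; the center-update weights $p^2/d$ are one natural choice consistent with the principle above, but treat the whole snippet as an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def pd_cluster(X, K, iters=100, eps=1e-9):
    """Probabilistic d-clustering sketch with Euclidean distances."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Distances of every point to every center, shape (n, K).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        # Membership probabilities inversely proportional to distance.
        inv = 1.0 / d
        p = inv / inv.sum(axis=1, keepdims=True)
        # Centers as convex combinations of the points; the p**2 / d
        # weighting is an assumed choice, not necessarily the authors' exact rule.
        w = p**2 / d
        centers = (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
    return centers, p
```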

Progress is monitored by the joint distance function, a measure of distance from all cluster centers, that evolves during the iterations, and captures the data in its low contours.
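
One way to write this joint distance function at a point $x$ is

$$
D(x) = \frac{\prod_{k=1}^{K} d_k(x)}{\sum_{i=1}^{K} \prod_{j \neq i} d_j(x)},
$$

which for two clusters reduces to $d_1(x)\,d_2(x) / \bigl(d_1(x) + d_2(x)\bigr)$.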

This method is simple, fast (requiring a small number of cheap iterations) and insensitive to outliers.

Read More