Classification is one of the most common tasks in machine learning, and scikit-learn provides a comprehensive toolkit that is easy to use. Here I will share some common classification models and show how to apply them to a dataset with scikit-learn. The classification process will cover:

  • training and testing
  • cross-validation and grid search
  • display of classification performance
  • plots of area under curves
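
The four steps above could look roughly like the following with scikit-learn. This is a minimal sketch, not the code from the post: the synthetic dataset from `make_classification`, the choice of `LogisticRegression`, and the grid over `C` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Training and testing: hold out a test set that is never used for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Cross-validation and grid search: 5-fold CV over the regularization strength C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Performance display: evaluate the refitted best model on the held-out set.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_score)

print("best params:", search.best_params_)
print(classification_report(y_test, y_pred))
print("test AUC: %.3f" % test_auc)
```

The sketch only computes the area under the ROC curve; the curve itself could be obtained from `sklearn.metrics.roc_curve` and drawn with any plotting library.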


In this document, I will build a framework that predicts whether each flight in 2006 will be cancelled, using the data from 2000 to 2005 as training data.

Items to be delivered in this document include:

  1. Show the predictive framework you designed. What features do you extract? What algorithms do you use in the framework?
  2. Explain the validation method you use.
  3. Explain the evaluation metric you use and show the effectiveness of your framework (e.g., with a confusion matrix).
  4. Show the validation results and give a summary of results.
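
As an illustration of item 3, a confusion matrix and the metrics derived from it can be computed in a few lines. This is a sketch under stated assumptions: the label vectors below are made up (1 = cancelled, 0 = not cancelled) and the helper names are hypothetical, not taken from the post.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Metrics for the positive class; 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels for ten flights: 1 = cancelled, 0 = not cancelled.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)

print("confusion matrix [[tn, fp], [fn, tp]]:", [[tn, fp], [fn, tp]])
print("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f"
      % (accuracy, precision, recall, f1))
```

When the positive class is rare, accuracy alone can look good for a model that never predicts a cancellation, which is why precision and recall for the cancelled class are reported alongside it here.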


The Wisconsin Diabetes Registry Study targeted all individuals $<30$ years of age diagnosed with Type I diabetes in southern Wisconsin, USA. Participants were requested to submit blood samples and were sent a questionnaire inquiring about hospitalizations and other events. The blood samples were used to determine glycosylated hemoglobin (GHb), an important indicator of glycemic control.

The data set diabetes.txt, which can be downloaded from here, contains the data from the Wisconsin Diabetes Registry Study. The data items are:

| Variable name | Meaning |
|---|---|
| ID | a unique identification number |
| HEALTH | self-reported health status: 1=excellent, 2=good, 3=fair, 4=poor |
| BH | dichotomized health status: 1=excellent health, 0=worse than excellent health |
| GENDER | sex code: 1=female, 0=male |
| GHb | overall mean glycosylated hemoglobin value in study |
| AGE | age at diagnosis (years) |

  • In the section “Relationship of GHb with age”, we are interested in the relationship of GHb with age at diagnosis and/or self-reported health status
  • In the section “Modeling of dichotomized health status”, we use BH as the dependent (response) variable and use models such as logistic regression to show the explanatory power of the other independent variables.
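
To make the second bullet concrete, here is a sketch of logistic regression with BH as the response, fitted by Newton-Raphson in NumPy. Everything numeric in it is an assumption: the data are simulated rather than read from diabetes.txt, and the coefficients in `true_beta` are hypothetical values chosen only to generate an example.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.

    X must already contain an intercept column of ones.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)                      # per-observation weights
        grad = X.T @ (y - p)                   # score vector
        hess = X.T @ (X * w[:, None])          # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


# Simulated stand-in for the registry data (NOT the real study values).
rng = np.random.default_rng(0)
n = 2000
ghb = rng.normal(11.0, 2.0, n)        # hypothetical GHb distribution
age = rng.uniform(1.0, 30.0, n)       # age at diagnosis, under 30
gender = rng.integers(0, 2, n)        # 1=female, 0=male

X = np.column_stack([np.ones(n), ghb, age, gender])
true_beta = np.array([2.0, -0.3, 0.02, 0.4])   # hypothetical coefficients
prob_excellent = 1.0 / (1.0 + np.exp(-X @ true_beta))
bh = rng.binomial(1, prob_excellent)  # 1=excellent health, 0=worse

beta_hat = fit_logistic(X, bh)
odds_ratios = np.exp(beta_hat[1:])

for name, b, oratio in zip(["GHb", "AGE", "GENDER"], beta_hat[1:], odds_ratios):
    print("%-6s coef=%+.3f odds ratio=%.3f" % (name, b, oratio))
```

Each exponentiated coefficient is the multiplicative change in the odds of reporting excellent health for a one-unit increase in that variable, holding the others fixed.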


This practice is based on Hans Rosling's talks New Insights on Poverty and The Best Stats You’ve Ever Seen.

The assignment uses data to answer specific questions about global health and economics. The data contradict commonly held preconceived notions. For example, Hans Rosling starts his talk by asking: “for each of the six pairs of countries below, which country do you think had the highest child mortality in 2015?”

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa

Most people get them wrong. Why is this? In part it is due to our preconceived notion that the world is divided into two groups: the Western world versus the third world, characterized by “long life, small family” and “short life, large family” respectively. In this homework we will use data visualization to gain insights on this topic.


The Universal Approximation Theorem states that a 2-layer network (one hidden layer plus an output layer) can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units. That’s why we will create a neural network with two neurons in the hidden layer and later show how it can model the XOR function.

In this experiment, we will need to understand and write a simple neural network with backpropagation for “XOR” using only NumPy and the Python standard library.

The code here will allow the user to specify any number of layers and any number of neurons in each layer. In addition, we are going to use the logistic function as the activation function for this network.
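
A network of this kind could be sketched as follows. This is not the post's implementation: the function names, the squared-error loss, the learning rate, and the weight initialization are all assumptions. Whether training actually reaches the XOR solution depends on the random initialization, so the printed predictions should be inspected rather than taken for granted.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(layer_sizes, seed=0):
    """Random weights and zero biases for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 1.0, size=(m, n))
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(n) for n in layer_sizes[1:]]
    return weights, biases


def forward(X, weights, biases):
    """Return the activations of every layer, input included."""
    activations = [X]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations


def loss(X, y, weights, biases):
    """Half the sum of squared errors (an assumed loss)."""
    return 0.5 * np.sum((forward(X, weights, biases)[-1] - y) ** 2)


def backward(y, activations, weights):
    """Backpropagate the squared-error loss through logistic layers."""
    grads_W, grads_b = [], []
    out = activations[-1]
    delta = (out - y) * out * (1.0 - out)
    for i in reversed(range(len(weights))):
        grads_W.insert(0, activations[i].T @ delta)
        grads_b.insert(0, delta.sum(axis=0))
        if i > 0:
            a = activations[i]
            delta = (delta @ weights[i].T) * a * (1.0 - a)
    return grads_W, grads_b


def train(X, y, layer_sizes, lr=0.5, epochs=10000, seed=0):
    """Full-batch gradient descent."""
    weights, biases = init_params(layer_sizes, seed)
    for _ in range(epochs):
        activations = forward(X, weights, biases)
        grads_W, grads_b = backward(y, activations, weights)
        for i in range(len(weights)):
            weights[i] -= lr * grads_W[i]
            biases[i] -= lr * grads_b[i]
    return weights, biases


# XOR truth table: two inputs, one output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two inputs, two hidden neurons, one output; any list of sizes works.
weights, biases = train(X, y, layer_sizes=[2, 2, 1])
print(forward(X, weights, biases)[-1].round(3))
```

Passing a longer list such as `[2, 4, 4, 1]` as `layer_sizes` adds hidden layers without any other change, since both the forward and backward passes loop over the weight list.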


The goal of this document is to practice Spark programming on the Hadoop platform with the following problems.

  1. In the text file (Youvegottofindwhatyoulove.txt), show the top 30 most frequently occurring words and their average number of occurrences per sentence. According to the result, what are the characteristics of these words?
  2. Implement a program to calculate the average amount paid in credit card trips for different numbers of passengers (from one to four) in the 2017.09 NYC Yellow Taxi trip data. In the NYC Taxi data, “Passenger_count” is a driver-entered value. Also explain how you deal with the data loss issue.
  3. For each of tasks 1 and 2 above, compare the execution time on a local worker and on a YARN cluster. Also, discuss your observations.
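
The computation behind task 1 can be sketched in plain Python before moving it to Spark. This is a stand-in, not a Spark job: the rule that a sentence ends at `.`, `!` or `?`, the word pattern, and the inline sample string (which is not the content of the actual file) are all assumptions.

```python
import re
from collections import Counter


def top_words(text, n=30):
    """Return (word, count, average occurrences per sentence) for the top n words."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter()
    for sentence in sentences:
        # Assumption: a word is a run of letters or apostrophes, case-insensitive.
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    total = len(sentences)
    return [(word, count, count / total)
            for word, count in counts.most_common(n)]


# Tiny inline sample for illustration only.
sample = "Stay hungry. Stay foolish. You have got to find what you love."
result = top_words(sample, n=3)
for word, count, avg in result:
    print(word, count, round(avg, 2))
```

In a Spark version, the per-sentence word extraction would typically become a `flatMap` over sentences and the counting a `reduceByKey`, with the sentence total computed once and used as the divisor.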


In this document, we will walk through a data exploration of the Airline On-Time Performance dataset, with Apache Pig as our exploration tool. We will analyze 8 years of flight data to answer the following 3 analytic questions.

  1. Find the maximal delays (you should consider both ArrDelay and DepDelay) for each month of 2008.
  2. How many flights were delayed by weather between 2000 and 2005? Please show the count for each year.
  3. List the top 5 airports with the most delays in 2007. (Please show the IATA airport code.)


A number of cluster validity measures have been proposed to help us not only validate our clustering results but also select the number of clusters.

For fuzzy clustering, we can optimize our clustering results with validity measures such as the Partition Coefficient, Partition Entropy, XB-index, and Overlaps Separation Measure.

For hard clustering, we can use measures such as DB index and Dunn index.

And for hierarchical clustering, we can select the best $k$ by stopping at the “Big Jump” of distances while performing agglomerative clustering.
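
Two of the ideas above are simple enough to sketch in NumPy: the Partition Coefficient and Partition Entropy of a fuzzy membership matrix, and the “Big Jump” rule for choosing $k$. The function names, the toy membership matrix, and the toy merge distances are assumptions made for illustration.

```python
import numpy as np


def partition_coefficient(U):
    """Mean of squared memberships; U has one row per point, rows sum to 1.

    Ranges from 1/c (maximally fuzzy) to 1 (crisp) for c clusters.
    """
    return np.sum(U ** 2) / U.shape[0]


def partition_entropy(U):
    """Mean membership entropy; 0 for a crisp partition, larger when fuzzier."""
    safe = np.clip(U, 1e-12, None)  # avoid log(0); those terms contribute 0
    return -np.sum(U * np.log(safe)) / U.shape[0]


def k_from_big_jump(merge_distances):
    """Pick k by stopping before the largest jump between consecutive merges.

    merge_distances: the n-1 merge distances of agglomerative clustering
    on n points, in the order the merges happen.
    """
    d = np.asarray(merge_distances, dtype=float)
    n_points = len(d) + 1
    merges_done = int(np.argmax(np.diff(d))) + 1
    return n_points - merges_done


# Toy membership matrix: two crisp points and one split evenly.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
pc = partition_coefficient(U)
pe = partition_entropy(U)

# Toy merge distances for 5 points: the last merge is far more costly.
k = k_from_big_jump([1.0, 1.2, 1.5, 6.0])

print("PC=%.4f PE=%.4f k=%d" % (pc, pe, k))
```

A higher Partition Coefficient and a lower Partition Entropy both indicate a less fuzzy partition, so one common use is to compute them for several candidate cluster counts and compare.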
