In mathematics, random graph is the general term for a probability distribution over graphs. Random graphs may be described simply by a probability distribution, or by a random process which generates them (Bollobás 2001). From a mathematical perspective, random graphs are used to model the diverse types of complex networks encountered in different areas.

Read More

$G(n, p)$, the Erdős–Rényi random graph model, defines a family of graphs: we start with $n$ isolated nodes and place an edge between each distinct pair of nodes independently with probability $p$. In the $G(n, p)$ model, the probability of obtaining any one particular graph with $m$ edges is $p^{m}(1-p)^{N-m}$, where $N=\binom{n}{2}$. As a result, $G(n, p)$ covers a bigger family than $G(n, m)$: $n$ and $p$ do not fix the number of edges, so for $0 < p < 1$ any of the $2^{N}$ graphs on $n$ labelled nodes can occur, whereas $G(n, m)$ only produces graphs with exactly $m$ edges.
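The sampling rule and the probability formula can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `sample_gnp` and `graph_probability` are my own and do not come from any package.

```python
import itertools
import math
import random


def sample_gnp(n, p, seed=None):
    """Sample one graph from G(n, p) as a set of undirected edges (i, j), i < j."""
    rng = random.Random(seed)
    # Visit each of the C(n, 2) node pairs once and keep it with probability p.
    return {pair for pair in itertools.combinations(range(n), 2)
            if rng.random() < p}


def graph_probability(n, p, m):
    """Probability of one specific labelled graph with m edges under G(n, p)."""
    big_n = math.comb(n, 2)  # N = C(n, 2) possible edges
    return p ** m * (1 - p) ** (big_n - m)


if __name__ == "__main__":
    edges = sample_gnp(6, 0.3, seed=1)
    print(len(edges), "edges:", sorted(edges))
    print(graph_probability(6, 0.3, len(edges)))
```

Summing `math.comb(N, m) * graph_probability(n, p, m)` over all $m$ from $0$ to $N$ gives 1, which is a quick sanity check that the formula describes a probability distribution over all graphs on $n$ nodes.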

Read More

Suppose we are given a graph $G = (N, E)$ with $|N| = n$ nodes and $|E| = m$ edges. One big question we often ask is: “Which vertices are important?” Intuitively, the center of a “star” (the central node that connects to many other nodes) is an obviously important node, while the nodes in a “circle” are all equally important. In general, we can use centrality measures to rank the nodes in a graph. There are many centrality measures, and PageRank is currently the most prominent approach for directed graphs.
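To make the star and circle intuition concrete, here is a minimal PageRank sketch based on power iteration. It is an illustration rather than a reference implementation: the damping factor 0.85, the fixed iteration count, and the uniform handling of nodes without out-links are common conventions that I am assuming here, and the two example graphs are made up.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    out_links maps every node to the list of nodes it points to;
    each node must appear as a key, even if its list is empty.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# A directed "star": every leaf points to the hub.
star = {"hub": [], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
# A directed "circle": each node points to the next one.
circle = {0: [1], 1: [2], 2: [3], 3: [0]}

if __name__ == "__main__":
    print(pagerank(star))
    print(pagerank(circle))
```

On the star, the hub receives the largest score; on the circle, every node ends up with the same score, matching the intuition above.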

Read More

Data that reflects our real world can often be represented as networks. With network data, we can ask questions such as

  • What are the patterns and statistical properties of network data? Why are networks the way they are? Can we find the underlying rules that build these networks?
  • Can we model the networks? Can we predict behavior? Why and how do things go viral?
  • How does the network structure evolve over time?

To answer these questions, we need to understand network properties (e.g., diameter, scale-free / power-law networks, small-world behavior), network models that fit our observations (e.g., Erdős–Rényi random graphs, Kleinberg’s model, etc.), and algorithms that can unfold on our networks (e.g., PageRank, decentralized search, label propagation, link prediction, community detection, etc.).
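As one small example of a network property, the diameter (the longest shortest-path distance between any two nodes) can be computed with breadth-first search. The sketch below assumes a connected, undirected, unweighted graph given as an adjacency dictionary; the helper names are my own.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Longest shortest-path length; assumes the graph is connected."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(diameter(path))
```

Running one BFS per node costs $O(n(n + m))$ time, which is fine for small graphs; large networks usually call for sampling or approximation.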

Read More

Classification is a very common task in machine learning, and scikit-learn provides a comprehensive toolkit that is easy to use. Here I will share some common classification models and show how to apply them to a dataset using this toolkit. The classification process will cover

  • training and testing
  • cross validation and the grid search process
  • classification performance display
  • plots of area under curves
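The steps listed above can be sketched end to end with scikit-learn. This is only one possible setup: the bundled breast cancer dataset, the logistic regression model, and the grid of `C` values are my own choices for illustration and are not necessarily what the full post uses.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_demo(random_state=0):
    # Training and testing: hold out 25% of the data for the final evaluation.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state)

    # Cross validation and grid search: 5-fold CV over the regularisation strength.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
    search = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)

    # Performance display: per-class precision, recall and F1 on the test set.
    report = classification_report(y_test, search.predict(X_test))

    # Area under the ROC curve, computed from predicted probabilities.
    scores = search.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, scores)
    return search.best_params_, report, auc


if __name__ == "__main__":
    best_params, report, auc = run_demo()
    print(best_params)
    print(report)
    print("test AUC:", auc)
```

To draw the curve itself, `sklearn.metrics.roc_curve(y_test, scores)` returns the false positive rates and true positive rates that can be passed to any plotting library.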

Read More

In this document, I will build a predictive framework to predict whether each flight in 2006 will be cancelled, using the data from 2000 to 2005 as training data.

Items to be delivered in this document include:

  1. Show the predictive framework you designed. What features do you extract? What algorithms do you use in the framework?
  2. Explain the validation method you use.
  3. Explain the evaluation metric you use and show the effectiveness of your framework (i.e., using a confusion matrix).
  4. Show the validation results and give a summary of results.
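A minimal sketch of the two mechanical pieces mentioned above, the split by year and the confusion matrix, is shown below in plain Python. The record layout (a dictionary with a `year` field) and the label vectors are hypothetical and only serve to illustrate the computation; they are not taken from the flight data.

```python
def split_by_year(records, test_year=2006):
    """Train on every year before test_year and test on test_year itself."""
    train = [r for r in records if r["year"] < test_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test


def confusion_matrix(y_true, y_pred):
    """Counts for a binary task where 1 means 'cancelled'."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def precision_recall(cm):
    """Precision and recall for the positive (cancelled) class."""
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    precision = cm["tp"] / predicted_pos if predicted_pos else 0.0
    recall = cm["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    cm = confusion_matrix([1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0])
    print(cm, precision_recall(cm))
```

Splitting by year rather than at random keeps the evaluation honest, because the model never sees 2006 flights during training. If cancellations turn out to be rare, precision and recall on the cancelled class are more informative than plain accuracy.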

Read More

The Wisconsin Diabetes Registry Study targeted all individuals $<30$ years of age diagnosed with Type I diabetes in southern Wisconsin, USA. Participants were requested to submit blood samples and were sent a questionnaire inquiring about hospitalizations and other events. The blood samples were used to determine glycosylated hemoglobin (GHb), an important indicator of glycemic control.

The data set diabetes.txt, which can be downloaded from here, contains the data from the Wisconsin Diabetes Registry Study. The data items are:

| Variable name | Meaning |
| --- | --- |
| ID | a unique identification number |
| HEALTH | self-reported health status: 1=excellent, 2=good, 3=fair, 4=poor |
| BH | dichotomized health status: 1=excellent health, 0=worse than excellent health |
| GENDER | sex code: 1=female, 0=male |
| GHb | overall mean glycosylated hemoglobin value in study |
| AGE | age at diagnosis (years) |

  • In the section “Relationship of GHb with age”, we are interested in the relationship of GHb with age at diagnosis and/or self-reported health status.
  • In the section “Modeling of dichotomized health status”, we use BH as the dependent variable (response variable) and use models such as logistic regression to show the explanatory power of the other independent variables.
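As a rough sketch of what modeling BH with logistic regression involves, the code below fits a one-predictor logistic model by gradient descent using only the standard library. The eight (GHb, BH) pairs are toy values invented for illustration and are not from diabetes.txt; a real analysis would load the file and would more likely use a statistics package.

```python
import math
import statistics


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, lr=0.5, steps=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * z), where z is x standardised."""
    mean, std = statistics.mean(x), statistics.pstdev(x)
    z = [(v - mean) / std for v in x]
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        p = [sigmoid(b0 + b1 * zi) for zi in z]
        # Gradient of the average negative log-likelihood.
        g0 = sum(pi - yi for pi, yi in zip(p, y)) / n
        g1 = sum((pi - yi) * zi for pi, yi, zi in zip(p, y, z)) / n
        b0 -= lr * g0
        b1 -= lr * g1

    def predict(v):
        return sigmoid(b0 + b1 * (v - mean) / std)

    return b0, b1, predict


# Toy values, NOT taken from the registry data.
GHB = [8, 9, 10, 11, 12, 13, 14, 15]
BH = [1, 1, 1, 0, 1, 0, 0, 0]

if __name__ == "__main__":
    b0, b1, predict = fit_logistic(GHB, BH)
    print("slope on standardised GHb:", b1)
    print("P(excellent health | GHb=9):", predict(9))
```

With these toy values the fitted slope is negative, meaning a higher GHb goes with a lower estimated probability of reporting excellent health. Whether the real data shows the same pattern is exactly what the section is meant to examine.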

Read More

This practice is based on Hans Rosling’s talks New Insights on Poverty and The Best Stats You’ve Ever Seen.

The assignment uses data to answer specific questions about global health and economics. The data contradicts commonly held preconceived notions. For example, Hans Rosling starts his talk by asking: “for each of the six pairs of countries below, which country do you think had the highest child mortality in 2015?”

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa

Most people get them wrong. Why is this? In part it is due to our preconceived notion that the world is divided into two groups: the Western world versus the third world, characterized by “long life, small family” and “short life, large family” respectively. In this homework we will use data visualization to gain insights on this topic.

Read More

The Universal Approximation Theorem states that a feedforward network with a single hidden layer can approximate any continuous function on a compact domain arbitrarily well, given enough hidden units. That’s why we will create a neural network with two neurons in the hidden layer, and we will later show how it can model the XOR function.
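To see that two hidden neurons with a logistic activation are enough to represent XOR, here is a small forward-pass sketch with hand-picked weights. The weights are chosen by me for illustration and are not the result of training: the first hidden unit behaves roughly like OR, the second like NAND, and the output unit like AND of the two.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hand-picked (not trained) parameters: ((w1, w2), bias) per neuron.
HIDDEN = [
    ((20.0, 20.0), -10.0),   # approximately OR(x1, x2)
    ((-20.0, -20.0), 30.0),  # approximately NAND(x1, x2)
]
OUTPUT = ((20.0, 20.0), -30.0)  # approximately AND(h1, h2)


def forward(x1, x2):
    """Forward pass of the 2-2-1 network; returns a value in (0, 1)."""
    hidden = [sigmoid(w[0] * x1 + w[1] * x2 + b) for w, b in HIDDEN]
    w, b = OUTPUT
    return sigmoid(w[0] * hidden[0] + w[1] * hidden[1] + b)


def xor_table():
    """Rounded network outputs for all four binary inputs."""
    return {(a, b): round(forward(a, b)) for a in (0, 1) for b in (0, 1)}


if __name__ == "__main__":
    for inputs, output in xor_table().items():
        print(inputs, "->", output)
```

Backpropagation has to find weights of this kind by itself, starting from a random initialisation.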

In this experiment, we will need to understand and write a simple neural network with backpropagation for “XOR” using only numpy and the Python standard library.

The code here will allow the user to specify any number of layers and any number of neurons in each layer. In addition, we are going to use the logistic function as the activation function for this network.

Read More