Spark Machine Learning Modules

2017-04-11

This project aims to make machine learning on Spark simple to access and use.

Create an Excel VBA Add-In

2017-04-06

An Excel Add-In is a file (usually with an .xla or .xll extension) that Excel can load when it starts up. The file contains code (VBA in the case of an .xla Add-In) that adds additional functionality to Excel, usually in the form of new functions.

Add-Ins provide an excellent way of increasing the power of Excel and they are the ideal vehicle for distributing your custom functions. Excel is shipped with a variety of Add-Ins ready for you to load and start using, and many third-party Add-Ins are available.

Unit Testing in Python

By Chih-Ling Hsu

2017-03-30

If you want to be able to change or rewrite your code and know you didn’t break anything, proper unit testing is imperative.

The unittest test framework is Python’s xUnit-style framework. It is a standard module that you already have if you’ve got Python version 2.1 or greater. The unittest module used to be called PyUnit, due to its legacy as an xUnit-style framework.
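As a minimal sketch of the xUnit style described above: a test case is a class derived from unittest.TestCase, and each method whose name starts with "test" is one test. The function add and the helper run_tests are hypothetical names used only for illustration.

```python
import unittest


def add(a, b):
    """Hypothetical function under test."""
    return a + b


class TestAdd(unittest.TestCase):
    """Each method whose name starts with 'test' is collected as one test."""

    def test_adds_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_rejects_mixed_types(self):
        # Adding an int and a str raises TypeError in Python.
        with self.assertRaises(TypeError):
            add(1, "x")


def run_tests():
    """Load and run TestAdd, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAdd)
    return unittest.TextTestRunner(verbosity=1).run(suite)
```

If the code is saved as a module, the same tests can also be run from the command line with `python -m unittest` followed by the module name.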

Spark SQL Using Python

By Chih-Ling Hsu

2017-03-28

For SQL users, Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore. With features introduced in Apache Spark 1.1.0, Spark SQL beats Shark in TPC-DS performance by almost an order of magnitude.

How to Change Schema of a Spark SQL DataFrame?

By Chih-Ling Hsu

2017-03-28

Because I want to insert rows selected from one table (df_rows) into another table, I need to make sure that

the schema of the selected rows is the same as the schema of the target table.

This is necessary because the function pyspark.sql.DataFrameWriter.insertInto, which inserts the content of the DataFrame into the specified table, requires the schema of the DataFrame to be the same as the schema of the table.
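One possible way to meet this requirement is to select the columns of df_rows in the target table's order and cast each one to the target type before calling insertInto. The sketch below is an assumption about how this could be done, not necessarily the approach the full post takes. The helper name align_to_schema and the table name target_table are hypothetical; the function only relies on the DataFrame's columns attribute, column indexing, cast, alias and select.

```python
def align_to_schema(df, target_schema):
    """Return df with its columns reordered and cast to match target_schema.

    df is expected to be a pyspark.sql.DataFrame and target_schema a
    StructType, for example the schema of the table being inserted into.
    """
    missing = [f.name for f in target_schema.fields if f.name not in df.columns]
    if missing:
        raise ValueError("source DataFrame lacks columns: %s" % missing)
    # insertInto matches columns by position, so the order matters as
    # much as the types do.
    return df.select(
        [df[f.name].cast(f.dataType).alias(f.name) for f in target_schema.fields]
    )


# Hypothetical usage, assuming an active SparkSession named `spark` and an
# existing table called "target_table":
#
#   target_schema = spark.table("target_table").schema
#   aligned = align_to_schema(df_rows, target_schema)
#   aligned.write.insertInto("target_table")
```

Columns of df_rows that do not exist in the target table are dropped by the select, and a column missing from df_rows raises an error instead of being silently filled.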

Brief Introduction of Spark Usage

By Chih-Ling Hsu

2017-03-28

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Generate a Candidate Hash Tree

By Chih-Ling Hsu

2017-03-25

To generate a candidate hash tree, the following are required:

Hash function
Max leaf size - if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node
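The two requirements above can be sketched as follows. This is a minimal illustration under stated assumptions, not the post's own implementation: items are integers, the hash function is item % 3, itemsets are stored sorted, and a leaf at depth k is allowed to overflow because no item is left to hash on. The names HashNode, HashTree and CANDIDATES are hypothetical.

```python
class HashNode:
    def __init__(self):
        self.is_leaf = True
        self.itemsets = []   # candidate itemsets stored in a leaf
        self.children = {}   # bucket -> HashNode, used by interior nodes


class HashTree:
    def __init__(self, k, max_leaf_size=3, hash_fn=lambda item: item % 3):
        self.k = k                          # size of every candidate itemset
        self.max_leaf_size = max_leaf_size
        self.hash_fn = hash_fn              # assumed hash function
        self.root = HashNode()

    def insert(self, itemset):
        itemset = tuple(sorted(itemset))
        if len(itemset) != self.k:
            raise ValueError("expected an itemset of size %d" % self.k)
        self._insert(self.root, itemset, 0)

    def _insert(self, node, itemset, depth):
        if not node.is_leaf:
            # An interior node at depth d hashes on the d-th item.
            bucket = self.hash_fn(itemset[depth])
            child = node.children.setdefault(bucket, HashNode())
            self._insert(child, itemset, depth + 1)
            return
        node.itemsets.append(itemset)
        # Split an overflowing leaf while there is still an item to hash on.
        if len(node.itemsets) > self.max_leaf_size and depth < self.k:
            pending, node.itemsets = node.itemsets, []
            node.is_leaf = False
            for pending_itemset in pending:
                self._insert(node, pending_itemset, depth)

    def leaves(self, node=None, depth=0):
        """Yield (depth, itemsets) for every leaf, in bucket order."""
        if node is None:
            node = self.root
        if node.is_leaf:
            yield depth, list(node.itemsets)
        else:
            for bucket in sorted(node.children):
                for leaf in self.leaves(node.children[bucket], depth + 1):
                    yield leaf


# Illustrative candidate 3-itemsets (not taken from the post).
CANDIDATES = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
```

Inserting every tuple in CANDIDATES into HashTree(k=3) and iterating over leaves() shows how the candidates are distributed across the buckets.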

Frequent Itemset Generation Using FP-Growth

By Chih-Ling Hsu

2017-03-25

FP-Growth uses FP-tree (Frequent Pattern Tree), a compressed representation of the database. Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets.
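The sketch below illustrates the idea in a compact form: it builds an FP-tree from weighted transactions, then mines it recursively by turning the prefix paths of each item into a conditional pattern base. It is a simplified teaching version under my own assumptions (support is an absolute count, items are ordered by descending support with ties broken by name), and the names FPNode, build_tree and fp_growth are hypothetical.

```python
from collections import defaultdict


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_support):
    """Build an FP-tree from (items, count) pairs.

    Returns (root, header, frequent): header maps each frequent item to
    the tree nodes holding it, frequent maps each item to its support.
    """
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)
    for items, weight in weighted_transactions:
        # Keep frequent items only, most frequent first, so that
        # transactions sharing a prefix share a path in the tree.
        ordered = sorted(
            (i for i in set(items) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return root, header, frequent


def fp_growth(weighted_transactions, min_support, suffix=frozenset()):
    """Yield (itemset, support) for every frequent itemset."""
    _, header, frequent = build_tree(weighted_transactions, min_support)
    for item, support in frequent.items():
        itemset = suffix | {item}
        yield itemset, support
        # Conditional pattern base: the prefix path of every node that
        # holds `item`, weighted by that node's count.
        conditional = []
        for node in header[item]:
            path = []
            parent = node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        # Divide and conquer: mine the smaller conditional database.
        for result in fp_growth(conditional, min_support, itemset):
            yield result
```

A plain list of transactions can be mined with, for example, dict(fp_growth([(t, 1) for t in transactions], min_support)), where each transaction gets an initial weight of 1.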

Frequent Itemset Generation Using Apriori Algorithm

By Chih-Ling Hsu

2017-03-25

The Apriori Principle:

If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if a subset is infrequent, then all of its supersets must be infrequent, too.

The key idea of the Apriori Principle is monotonicity. By the anti-monotone property of support, we can perform support-based pruning:

\[\forall X,Y: (X \subset Y) \rightarrow s(X) \geq s(Y)\]
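Support-based pruning can be sketched as a level-wise search: candidates of size k are built only from frequent itemsets of size k - 1, and any candidate with an infrequent subset is discarded before its support is counted. The code below is a minimal illustration assuming support is an absolute count; the function name apriori and the sample transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1 candidates: every single item.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        counted = {c: support(c) for c in current}
        level = {c: s for c, s in counted.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: by the anti-monotone property, a candidate with an
        # infrequent (k-1)-subset cannot be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Illustrative market basket transactions.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

frequent_itemsets = apriori(TRANSACTIONS, min_support=3)
```

With min_support=3 on this data, four single items and four pairs are frequent, and no 3-itemset reaches the threshold.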

Data Mining - Association Analysis

By Chih-Ling Hsu

2017-03-25

Association analysis is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of association rules or sets of frequent items.

For example, given a table of market basket transactions

TID	Items
1	{Bread, Milk}
2	{Bread, Diapers, Beer, Eggs}
3	{Milk, Diapers, Beer, Cola}
4	{Bread, Milk, Diapers, Beer}
5	{Bread, Milk, Diapers, Cola}

The following rule can be extracted from the table:

\[\{Milk, Diapers\} \rightarrow \{Beer\}\]
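How strong such a rule is can be quantified with the standard support and confidence measures, which this excerpt does not define, so treat the definitions in the comments as assumptions. The sketch below computes both for the rule above on the transaction table; support_count and rule_metrics are hypothetical helper names.

```python
# The market basket transactions from the table above.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = count(antecedent and consequent) / number of transactions
    confidence = count(antecedent and consequent) / count(antecedent)
    Assumes the antecedent occurs in at least one transaction.
    """
    both = support_count(set(antecedent) | set(consequent), transactions)
    support = both / len(transactions)
    confidence = both / support_count(antecedent, transactions)
    return support, confidence


support, confidence = rule_metrics({"Milk", "Diapers"}, {"Beer"}, TRANSACTIONS)
```

Here {Milk, Diapers, Beer} appears in 2 of the 5 transactions and {Milk, Diapers} in 3, so the support is 2/5 = 0.4 and the confidence is 2/3.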

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:

1. Frequent Itemset Generation

Find all the itemsets that satisfy the minsup threshold.

2. Rule Generation

An Explorer of Things

Hello, my name is Chihling :)
