An Excel Add-In is a file (usually with an .xla or .xll extension) that Excel can load when it starts up. The file contains code (VBA in the case of an .xla Add-In) that adds additional functionality to Excel, usually in the form of new functions.

Add-Ins provide an excellent way of increasing the power of Excel and they are the ideal vehicle for distributing your custom functions. Excel is shipped with a variety of Add-Ins ready for you to load and start using, and many third-party Add-Ins are available.

Read More

If you want to be able to change or rewrite your code and know you didn’t break anything, proper unit testing is imperative.

The unittest test framework is Python’s xUnit-style framework. It is a standard module that you already have if you’ve got Python version 2.1 or greater. The unittest module used to be called PyUnit, due to its legacy as an xUnit-style framework.
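As a minimal sketch of what such a test looks like, the block below defines a small test case and runs it programmatically. The function `add` and the names `AddTests` and `run_tests` are hypothetical, chosen only for illustration.

```python
import unittest


def add(a, b):
    """Hypothetical function under test."""
    return a + b


class AddTests(unittest.TestCase):
    """Each method whose name starts with 'test' is run as a separate test."""

    def test_adds_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_concatenates_strings(self):
        self.assertEqual(add("py", "unit"), "pyunit")


def run_tests():
    """Load the test case, run it, and return the TestResult object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` returns a result object whose `wasSuccessful()` method reports whether every test passed. If the code is saved in a module, `python -m unittest` can also discover and run it from the command line.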

Read More

For SQL users, Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore. With features that will be introduced in Apache Spark 1.1.0, Spark SQL beats Shark in TPC-DS performance by almost an order of magnitude.

Read More

Because I want to insert rows selected from one table (df_rows) into another table, I need to make sure that

The schema of the selected rows is the same as the schema of the table

This is because the function pyspark.sql.DataFrameWriter.insertInto, which inserts the content of the DataFrame into the specified table, requires that the schema of the DataFrame is the same as the schema of the table.
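One way to check this before calling insertInto is to compare the two schemas column by column. The sketch below is an assumption-laden illustration, not the post's own method: it works on plain lists of (column name, type string) pairs, which is the shape returned by the `dtypes` attribute of a PySpark DataFrame, so it runs without a Spark session. The helper name `schema_mismatches` and the sample schemas are hypothetical.

```python
def schema_mismatches(source_dtypes, target_dtypes):
    """Compare two schemas given as lists of (column_name, type_string) pairs.

    Returns a list of human-readable differences. An empty list means the
    schemas agree in column count, order, names, and types.
    """
    problems = []
    if len(source_dtypes) != len(target_dtypes):
        problems.append(
            "column count differs: %d vs %d"
            % (len(source_dtypes), len(target_dtypes))
        )
    # Compare position by position, so a reordered column is reported too.
    for position, (src, dst) in enumerate(zip(source_dtypes, target_dtypes)):
        if src != dst:
            problems.append("position %d: %r vs %r" % (position, src, dst))
    return problems


# Illustrative schemas, shaped like the output of DataFrame.dtypes.
rows_schema = [("id", "bigint"), ("name", "string")]
table_schema = [("id", "bigint"), ("name", "string")]
reordered_schema = [("name", "string"), ("id", "bigint")]
```

With real data, the two lists would presumably come from `df_rows.dtypes` and from the `dtypes` of a DataFrame read from the target table. The comparison is positional on purpose: a DataFrame with the right columns in the wrong order should be flagged rather than silently accepted.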

Read More

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Read More

The Apriori Principle:

If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent, then all of its supersets must be infrequent, too.

The key idea of the Apriori Principle is monotonicity: the support of an itemset never exceeds the support of any of its subsets. By this anti-monotone property of support, we can perform support-based pruning:

\[\forall X,Y: (X \subset Y) \rightarrow s(X) \geq s(Y)\]
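Support-based pruning can be sketched as a level-wise search: a candidate of size k+1 is only counted if all of its size-k subsets were found frequent. The code below is a minimal, unoptimized illustration under that assumption. The function names and the toy transactions are hypothetical and not taken from the original post.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori(transactions, minsup):
    """Return {frozenset: support} for all itemsets with support >= minsup."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= minsup:
                level[candidate] = s
        frequent.update(level)
        # Build candidates of size k+1 by joining frequent k-itemsets.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Pruning step: drop a candidate if any (k-1)-subset is infrequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


# Illustrative data only.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
frequent = apriori(transactions, minsup=0.5)
```

On this toy data every single item and every pair reaches the threshold, while the triple {a, b, c} appears in only one of four transactions and is rejected. In every case, a frequent itemset has support no larger than any of its subsets, as the inequality above requires.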

Read More

Association analysis is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of association rules or sets of frequent items.

For example, given a table of market basket transactions

TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}

The following rule can be extracted from the table:

\[\{Milk, Diapers\} \rightarrow \{Beer\}\]
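To make the rule concrete, the sketch below computes its support and confidence on the five transactions above, using the item names from the table. It assumes the standard definitions: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The function names are hypothetical.

```python
# The market basket transactions from the table, one set per TID.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, transactions):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support_count(antecedent | consequent, transactions)
    support = both / len(transactions)
    confidence = both / support_count(antecedent, transactions)
    return support, confidence


support, confidence = rule_metrics({"Milk", "Diapers"}, {"Beer"}, transactions)
```

Milk, Diapers, and Beer occur together in transactions 3 and 4, so the support is 2/5. Milk and Diapers occur together in transactions 3, 4, and 5, so the confidence is 2/3.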

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:

1. Frequent Itemset Generation

Find all the itemsets that satisfy the minsup threshold.

2. Rule Generation

Read More