Spark Machine Learning Modules
This project aims to provide simple access to, and usage of, machine learning on Spark.
An Excel Add-In is a file (usually with an .xla or .xll extension) that Excel can load when it starts up. The file contains code (VBA in the case of an .xla Add-In) that adds additional functionality to Excel, usually in the form of new functions.
Add-Ins provide an excellent way of increasing the power of Excel and they are the ideal vehicle for distributing your custom functions. Excel is shipped with a variety of Add-Ins ready for you to load and start using, and many third-party Add-Ins are available.
If you want to be able to change or rewrite your code and know you didn’t break anything, proper unit testing is imperative.
The unittest framework is Python’s xUnit-style framework.
It is a standard module that you already have if you are running Python 2.1 or greater. The unittest
module used to be called PyUnit, owing to its legacy as an xUnit-style framework.
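As a minimal sketch of the framework's style, a test case subclasses `unittest.TestCase` and names its test methods with a `test_` prefix; the `add` function here is a made-up example, not project code:

```python
import unittest


def add(x, y):
    """Toy function under test (a made-up example, not project code)."""
    return x + y


class TestAdd(unittest.TestCase):
    """Each test_* method is discovered and run by the framework."""

    def test_add_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_add_raises_on_mixed_types(self):
        with self.assertRaises(TypeError):
            add(2, "three")


if __name__ == "__main__":
    # argv/exit arguments keep this runnable inside other harnesses too.
    unittest.main(argv=["example"], exit=False)
```

Running the file executes every `test_*` method and reports failures individually, which is what makes refactoring safe: any regression shows up as a named, failing test.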
For SQL users, Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore. With features that will be introduced in Apache Spark 1.1.0, Spark SQL beats Shark in TPC-DS performance by almost an order of magnitude.
Because I want to insert rows selected from one table
(df_rows) into another table, I need to make sure that
the schema of the selected rows is the same as the schema of the target table,
since the function pyspark.sql.DataFrameWriter.insertInto,
which inserts the content of the DataFrame into the specified table, requires that the schema of the DataFrame be the same as the schema of the table.
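Since `insertInto` resolves columns by position rather than by name, one safeguard is to compare the two schemas before inserting. The helper below is a plain-Python sketch of that check, using ordered `(name, type)` tuples of the kind that `DataFrame.dtypes` returns in PySpark; `schemas_match` is a hypothetical name, not a PySpark API:

```python
def schemas_match(source_schema, target_schema):
    """Return True when two schemas, given as ordered lists of
    (column_name, data_type) tuples, are identical."""
    return list(source_schema) == list(target_schema)


# Stand-ins for df_rows.dtypes and the target table's dtypes,
# which yield (name, type) pairs in PySpark.
df_rows_schema = [("id", "bigint"), ("name", "string")]
table_schema = [("id", "bigint"), ("name", "string")]
assert schemas_match(df_rows_schema, table_schema)

# A column-order mismatch must be rejected: insertInto matches columns
# by position, so a reordered DataFrame would silently corrupt the data.
reordered = [("name", "string"), ("id", "bigint")]
assert not schemas_match(df_rows_schema, reordered)
```

In practice, a mismatch in column order can be repaired with a `select` over the target table's column names before calling `insertInto`.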
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
To generate a candidate hash tree, the following are required.
FP-Growth uses FP-tree (Frequent Pattern Tree), a compressed representation of the database. Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets.
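The construction of the compressed tree can be sketched in a few lines. This is a minimal illustration of the compression idea only, under the usual two-pass scheme (the recursive mining step is omitted); `FPNode` and `build_fp_tree` are made-up names, not a library API:

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and child branches."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item occurrences and discard infrequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support_count}
    # Pass 2: insert each transaction with its frequent items sorted by
    # descending count, so transactions with common prefixes share a branch.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

root = build_fp_tree(transactions, min_support_count=3)
# The four transactions containing Bread collapse onto a single branch.
print(root.children["Bread"].count)  # -> 4
```

The sorting step is what gives the compression: frequent items sit near the root, so many transactions share the same initial path instead of each occupying its own branch.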
The Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent, then all of its supersets must be infrequent, too.
The key idea behind the Apriori Principle is monotonicity. By the anti-monotone property of support, we can perform support-based pruning:
\[\forall X,Y: (X \subset Y) \rightarrow s(X) \geq s(Y)\]
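The anti-monotone property can be verified directly on a small transaction set (the market-basket table used in the example below); the `support` helper here is illustrative, not a library function:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support(itemset):
    """Fraction of transactions containing every item in itemset: s(X)."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


# Support can only shrink (or stay equal) as an itemset grows:
# s(X) >= s(Y) whenever X is a proper subset of Y.
y = {"Milk", "Diapers", "Beer"}
for r in range(1, len(y)):
    for x in combinations(y, r):
        assert support(x) >= support(y)
```

This is exactly why Apriori can prune: once a candidate's support falls below minsup, no superset of it needs to be counted.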
Association analysis is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of association rules or sets of frequent items.
For example, given a table of market basket transactions:

| TID | Items |
| --- | --- |
| 1 | {Bread, Milk} |
| 2 | {Bread, Diapers, Beer, Eggs} |
| 3 | {Milk, Diapers, Beer, Cola} |
| 4 | {Bread, Milk, Diapers, Beer} |
| 5 | {Bread, Milk, Diapers, Cola} |
The following rule can be extracted from the table:
\[\{Milk, Diapers\} \rightarrow \{Beer\}\]
A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation
Find all the itemsets that satisfy the minsup threshold.
2. Rule Generation
Extract all the high-confidence rules from the frequent itemsets found in the previous step.
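For the rule above, both measures can be computed directly from the five transactions in the table; the helper and variable names here are illustrative, using the standard definitions of support and confidence:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)


# Rule: {Milk, Diapers} -> {Beer}
antecedent = {"Milk", "Diapers"}
consequent = {"Beer"}

# support = s(X u Y); confidence = s(X u Y) / s(X)
support = support_count(antecedent | consequent) / len(transactions)
confidence = support_count(antecedent | consequent) / support_count(antecedent)

print(f"support = {support:.2f}")        # 2 of 5 transactions -> 0.40
print(f"confidence = {confidence:.2f}")  # 2 of the 3 with {Milk, Diapers} -> 0.67
```

A frequent-itemset pass finds that {Milk, Diapers, Beer} meets minsup; rule generation then checks each way of splitting that itemset into antecedent and consequent against a minimum confidence threshold.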