Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

Spark SQL and DataFrame

  • DataFrames
    • Creating DataFrames
    • DataFrame Operations
    • Running SQL Queries Programmatically
  • Data Sources
    • Saving to Persistent Tables
    • Partition Discovery
    • Schema Merging
    • Hive metastore Parquet table conversion
    • JSON Datasets
    • Hive Tables
    • Interacting with Different Versions of Hive Metastore
    • JDBC To Other Databases
  • Performance Tuning
    • Caching Data In Memory
  • Distributed SQL Engine
    • Running the Thrift JDBC/ODBC server
    • Running the Spark SQL CLI

For more detail, please refer to the Spark SQL and DataFrame Guide.

Machine Learning Library

spark-mllib: data types, algorithms, and utilities

  • Basic statistics
    • summary statistics
    • correlations
    • stratified sampling
    • hypothesis testing
    • random data generation
  • Classification and regression
    • linear models (SVMs, logistic regression, linear regression)
    • naive Bayes
    • decision trees
    • ensembles of trees (Random Forests and Gradient-Boosted Trees)
    • isotonic regression
  • Collaborative filtering
    • alternating least squares (ALS)
  • Clustering
    • k-means
    • Gaussian mixture
    • power iteration clustering (PIC)
    • latent Dirichlet allocation (LDA)
    • streaming k-means
  • Dimensionality reduction
    • singular value decomposition (SVD)
    • principal component analysis (PCA)
  • Feature extraction and transformation
  • Frequent pattern mining
    • FP-growth
    • association rules
    • PrefixSpan
  • Evaluation metrics
  • PMML model export
  • Optimization (developer)
    • stochastic gradient descent
    • limited-memory BFGS (L-BFGS)
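
The k-means entry under Clustering can be illustrated without a cluster. The following is a minimal plain-Python sketch of Lloyd's algorithm, the iterative scheme usually meant by "k-means". It is not the MLlib API and says nothing about how MLlib initializes or distributes the computation. The function name `kmeans`, its parameters, and the toy points are assumptions made for illustration.

```python
# Plain-Python sketch of Lloyd's algorithm for k-means; NOT the MLlib API.
from math import dist  # available in Python 3.8+


def kmeans(points, centers, iterations=10):
    """Refine `centers` by alternating assignment and mean-update steps."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: group each point under its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                new_centers.append(mean)
            else:
                new_centers.append(center)  # keep a center that attracted no points
        if new_centers == centers:
            break  # assignments no longer change the centers
        centers = new_centers
    return centers


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Starting centers are passed in explicitly so the result is deterministic; a real run would normally pick them by some initialization strategy.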

For more detail, please refer to the Machine Learning Library (MLlib) Guide.

spark-ml: high-level APIs for ML pipelines

The spark-ml programming guide provides an overview of the Pipelines API and its major concepts. It also contains sections on using algorithms within the Pipelines API, for example:

  • Feature extraction, transformation, and selection
  • Decision trees for classification and regression
  • Ensembles
  • Linear methods with elastic net regularization
  • Multilayer perceptron classifier
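
To make the pipeline idea more concrete, here is a conceptual plain-Python sketch of chaining stages, where some stages only transform data and others must first be fitted to produce a model. It does not use spark-ml, and every class and function name in it (`Tokenizer`, `VocabularyIndexer`, `VocabularyModel`, `Pipeline`, `run_fitted`) is a hypothetical stand-in rather than the real API.

```python
# Conceptual sketch of the pipeline idea in plain Python; NOT the spark-ml API.
# All class and function names here are hypothetical stand-ins.


class Tokenizer:
    """Transformer-like stage: stateless, only transforms rows."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class VocabularyIndexer:
    """Estimator-like stage: fit() learns state and returns a fitted model."""

    def fit(self, rows):
        vocab = sorted({w for row in rows for w in row["words"]})
        return VocabularyModel({w: i for i, w in enumerate(vocab)})


class VocabularyModel:
    """Fitted model: a transformer that carries the learned vocabulary."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        return [
            {**row, "ids": [self.index[w] for w in row["words"] if w in self.index]}
            for row in rows
        ]


class Pipeline:
    """Chains stages; fit() returns the fitted transformers in stage order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimator-like stages are fitted first; transformers pass through.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # output feeds the next stage
            fitted.append(model)
        return fitted


def run_fitted(fitted, rows):
    """Run already-fitted stages over new rows."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


training = [{"text": "spark is fast"}, {"text": "spark ml pipelines"}]
fitted = Pipeline([Tokenizer(), VocabularyIndexer()]).fit(training)
result = run_fitted(fitted, [{"text": "Fast pipelines"}])
print(result[0]["ids"])
```

The point of the design is that fitting happens once on training data, and the resulting list of fitted stages can then be applied unchanged to new data.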

For more detail, please refer to the Spark ML Programming Guide.

Graph Processing (GraphX)

  • The Property Graph
  • Graph Operators
    • Summary List of Operators
    • Property Operators
    • Structural Operators
    • Join Operators
    • Neighborhood Aggregation
      • Aggregate Messages (aggregateMessages)
      • Map Reduce Triplets Transition Guide (Legacy)
      • Computing Degree Information
      • Collecting Neighbors
  • Pregel API
  • Optimized Representation
  • Graph Algorithms
    • PageRank
    • Connected Components
    • Triangle Counting
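
As a rough illustration of what the PageRank entry computes, the sketch below runs PageRank by power iteration in plain Python. It does not use GraphX, and GraphX's own implementation may differ in normalization, handling of dangling vertices, and stopping criteria. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy graph are assumptions chosen for illustration.

```python
# Plain-Python PageRank by power iteration; NOT the GraphX API.


def pagerank(edges, damping=0.85, iterations=50):
    """Return {vertex: rank} for a directed graph given as (src, dst) pairs."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex starts from the "random jump" share.
        new_ranks = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_links[v] or vertices  # dangling vertex: spread evenly
            share = damping * ranks[v] / len(targets)
            for t in targets:
                new_ranks[t] += share
        ranks = new_ranks
    return ranks


# Toy graph (made up): a -> b, a -> c, b -> c, c -> a
ranks = pagerank([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
for vertex, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(vertex, round(rank, 3))
```

In this formulation the ranks form a probability distribution, so they sum to 1 across all vertices.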

For more detail, please refer to the GraphX Programming Guide.

SparkR (R on Spark)

  • Creating DataFrames
    • From local data frames
    • From Data Sources
    • From Hive tables
  • DataFrame Operations
    • Selecting rows, columns
    • Grouping, Aggregation
    • Operating on Columns
  • Running SQL Queries from SparkR

For more detail, please refer to the SparkR (R on Spark) Programming Guide.

Spark Streaming

  • Basics
    • Linking
    • Initializing StreamingContext
    • Discretized Streams (DStreams)
    • Input DStreams and Receivers
    • Transformations on DStreams
    • Output Operations on DStreams
    • DataFrame and SQL Operations
    • MLlib Operations
    • Caching / Persistence
    • Checkpointing
    • Deploying Applications
    • Monitoring Applications
  • Performance Tuning
    • Reducing the Batch Processing Times
    • Setting the Right Batch Interval
    • Memory Tuning
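
The DStream and batch-interval entries rest on one idea: a stream is cut into small time-based batches, and the same computation is applied to each batch. The sketch below shows that idea in plain Python with a word count per batch. It is not the Spark Streaming API; the function names `to_batches` and `word_count`, the one-second interval, and the toy events are assumptions for illustration.

```python
# Plain-Python sketch of micro-batching; NOT the Spark Streaming API.
from collections import Counter, defaultdict


def to_batches(events, batch_interval):
    """Group (timestamp_seconds, line) events into fixed-width time batches."""
    batches = defaultdict(list)
    for timestamp, line in events:
        batch_start = int(timestamp // batch_interval) * batch_interval
        batches[batch_start].append(line)
    return [(start, batches[start]) for start in sorted(batches)]


def word_count(lines):
    """Per-batch transformation: count words in one batch of lines."""
    return Counter(word for line in lines for word in line.split())


# Toy events (made up): (arrival time in seconds, text line)
events = [(0.2, "a b"), (0.9, "b"), (1.1, "a"), (2.5, "c c")]
counts_per_batch = [
    (start, word_count(lines)) for start, lines in to_batches(events, 1)
]
for start, counts in counts_per_batch:
    print(start, dict(counts))
```

A longer interval puts more events into each batch and produces results less often, which is the trade-off behind choosing a batch interval.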

For more detail, please refer to the Spark Streaming Programming Guide.