A Brief Introduction to Spark Usage
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
- Spark SQL and DataFrame
- Machine Learning Library
- Graph Processing (GraphX)
- SparkR (R on Spark)
- Spark Streaming
Spark SQL and DataFrame
- DataFrames
  - Creating DataFrames
  - DataFrame Operations
  - Running SQL Queries Programmatically
- Data Sources
  - Saving to Persistent Tables
  - Partition Discovery
  - Schema Merging
  - Hive metastore Parquet table conversion
  - JSON Datasets
  - Hive Tables
    - Interacting with Different Versions of Hive Metastore
  - JDBC To Other Databases
- Performance Tuning
  - Caching Data In Memory
- Distributed SQL Engine
  - Running the Thrift JDBC/ODBC server
  - Running the Spark SQL CLI
For more detail, refer to the Spark SQL and DataFrame Guide.
Machine Learning Library
spark-mllib: data types, algorithms, and utilities
- Basic statistics
  - summary statistics
  - correlations
  - stratified sampling
  - hypothesis testing
  - random data generation
- Classification and regression
  - linear models (SVMs, logistic regression, linear regression)
  - naive Bayes
  - decision trees
  - ensembles of trees (Random Forests and Gradient-Boosted Trees)
  - isotonic regression
- Collaborative filtering
  - alternating least squares (ALS)
- Clustering
  - k-means
  - Gaussian mixture
  - power iteration clustering (PIC)
  - latent Dirichlet allocation (LDA)
  - streaming k-means
- Dimensionality reduction
  - singular value decomposition (SVD)
  - principal component analysis (PCA)
- Feature extraction and transformation
- Frequent pattern mining
  - FP-growth
  - association rules
  - PrefixSpan
- Evaluation metrics
- PMML model export
- Optimization (developer)
  - stochastic gradient descent
  - limited-memory BFGS (L-BFGS)
For more detail, refer to the Machine Learning Library (MLlib) Guide.
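To make one of the clustering entries above concrete, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not the MLlib API and does no distributed computation; the function name `kmeans`, its parameters, and the sample points are assumptions made for illustration only.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm: returns (centroids, assignments).

    `points` and `centroids` are sequences of equal-length numeric tuples.
    This is an illustrative helper, not part of any Spark library.
    """
    centroids = list(centroids)
    assignments = []
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Two well-separated groups of sample points (made up for this sketch).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The two steps in the loop (assign, then update) are the part a cluster computing system would parallelize across partitions of the data; the sketch keeps everything in one process so the logic stays visible.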
spark-ml: high-level APIs for ML pipelines
The spark-ml programming guide provides an overview of the Pipelines API and its major concepts. It also contains sections on using algorithms within the Pipelines API, for example:
- Feature extraction, transformation, and selection
- Decision trees for classification and regression
- Ensembles
- Linear methods with elastic net regularization
- Multilayer perceptron classifier
For more detail, refer to the Spark ML Programming Guide.
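The pipeline idea can be illustrated without Spark: a pipeline chains stages, where some stages only transform data and others must first be fitted to data to produce a transformer. The toy classes below (`Tokenizer`, `VocabularyCounter`, `VocabularyModel`, `Pipeline`, `PipelineModel`) are hypothetical plain-Python stand-ins written for this sketch, operating on lists of dicts rather than DataFrames; they are not the spark-ml classes.

```python
class Tokenizer:
    """Transformer: adds a 'words' field by splitting 'text' on whitespace."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class VocabularyModel:
    """Fitted model (itself a transformer): turns words into count vectors."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [
            {**row, "features": [row["words"].count(w) for w in self.vocabulary]}
            for row in rows
        ]


class VocabularyCounter:
    """Estimator: learns a vocabulary from data and returns a fitted model."""

    def fit(self, rows):
        vocabulary = sorted({w for row in rows for w in row["words"]})
        return VocabularyModel(vocabulary)


class PipelineModel:
    """Result of fitting a pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order, fitting estimators on the data seen so far."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a fitted transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "spark is fast"}, {"text": "spark runs on clusters"}]
model = Pipeline([Tokenizer(), VocabularyCounter()]).fit(training)
result = model.transform([{"text": "Spark Spark fast"}])
```

Fitting once and reusing the fitted `model` on new rows is the point of the design: the vocabulary learned from the training data is applied unchanged to later data.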
Graph Processing (GraphX)
- The Property Graph
- Graph Operators
  - Summary List of Operators
  - Property Operators
  - Structural Operators
  - Join Operators
  - Neighborhood Aggregation
    - Aggregate Messages (aggregateMessages)
    - Map Reduce Triplets Transition Guide (Legacy)
    - Computing Degree Information
    - Collecting Neighbors
- Pregel API
- Optimized Representation
- Graph Algorithms
  - PageRank
  - Connected Components
  - Triangle Counting
For more detail, refer to the GraphX Programming Guide.
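As an illustration of the first graph algorithm listed above, the following is a small plain-Python sketch of textbook PageRank by power iteration over an edge list. It is not GraphX code, and GraphX's own implementation may normalize ranks differently; the function name `pagerank`, the damping factor 0.85, and the sample graph are assumptions chosen for this example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Textbook PageRank over directed (src, dst) edges; ranks sum to 1."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Each vertex sends its rank, split evenly, along its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # Rank held by vertices without out-edges is spread uniformly.
        dangling = sum(ranks[v] for v in vertices if out_degree[v] == 0)
        ranks = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return ranks


# Sample graph: a and b both link to c, and c links back to a.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
```

In this sample graph, `c` receives links from both other vertices and ends up with the highest rank, while `b`, which nothing links to, keeps only the baseline share.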
SparkR (R on Spark)
- Creating DataFrames
  - From local data frames
  - From Data Sources
  - From Hive tables
- DataFrame Operations
  - Selecting rows, columns
  - Grouping, Aggregation
  - Operating on Columns
- Running SQL Queries from SparkR
For more detail, refer to the SparkR (R on Spark) Programming Guide.
Spark Streaming
- Basics
  - Linking
  - Initializing StreamingContext
  - Discretized Streams (DStreams)
  - Input DStreams and Receivers
  - Transformations on DStreams
  - Output Operations on DStreams
  - DataFrame and SQL Operations
  - MLlib Operations
  - Caching / Persistence
  - Checkpointing
  - Deploying Applications
  - Monitoring Applications
- Performance Tuning
  - Reducing the Batch Processing Times
  - Setting the Right Batch Interval
  - Memory Tuning
For more detail, refer to the Spark Streaming Programming Guide.
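The notions of a discretized stream and a batch interval from the list above can be pictured with a short plain-Python sketch: incoming events are cut into consecutive fixed-length batches, and the same computation (here, a word count) runs on each batch. This does not use the Spark Streaming API; the helper names `discretize` and `word_counts`, the one-second interval, and the sample events are assumptions for illustration.

```python
from collections import Counter


def discretize(events, batch_interval):
    """Group (timestamp, line) events into consecutive fixed-length batches.

    Intervals with no events yield an empty batch, so the result has one
    entry per interval between the first and last event.
    """
    if not events:
        return []
    batches = {}
    for timestamp, line in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(line)
    first, last = min(batches), max(batches)
    return [batches.get(i, []) for i in range(first, last + 1)]


def word_counts(batch):
    """Per-batch computation: count the words in every line of the batch."""
    return Counter(word for line in batch for word in line.split())


# Sample events as (seconds since start, text line), made up for this sketch.
events = [
    (0.2, "spark streaming"),
    (0.9, "spark"),
    (1.4, "batch interval"),
    (2.1, "spark batch"),
]
per_batch = [word_counts(batch) for batch in discretize(events, batch_interval=1.0)]
```

A shorter batch interval gives fresher results but more, smaller batches to process, which is the trade-off behind choosing the batch interval during performance tuning.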