Spark and SparkR


  • Spark as the Computing Backend

    Apache Spark is an open-source data analytics cluster computing framework. Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms. [reference]

    • Resilient Distributed Datasets (RDDs)

    • DAG

      Applications create linked sequences of operations. Catalyst as the optimizer?

    • In-memory computing

      Spark stores data in memory between operations, as opposed to writing data to disk required by Hadoop. This makes iterative algorithm and interactive data analysis possible and efficient in Spark

    • Core API (Scala, Python) is general purpose

      All language functionality of Scala can be accessed by Spark. But SparkR only exposes a small subset of high-level operations.

    • Interactive programming environment

  • SparkR Architecture

    TODO: Refer to the paper "SparkR: Scaling R Programs with Spark"
    • R-JVM bridge

      Older versions use rJava to call Java/Scala code. Now has its own implementation. Refer to the paper.

    • Scala code spawns Rscript processes.

      Currently, Rscript process is launched and terminated for each operation. But it should be optimized via something like DAG.

      Scala communicates with worker process via stdin/stdout, using custom protocol.how_communicate

      Serializes data via R serialization?, simple binary serialization of integers, strings, raw bytes.serialize

    Link to examples ...

  • Limitations

    • Akward API syntax
    • Not enough support for reading data (e.g. spark-csv does not infer data type)

R-JVM_overhead. Advanced Analytics with Spark p24.
SparkR_low_level_api_slow. Reading Text file in SparkR 1.4.0
how_communicate. A Friendly Critique of SparkR p8.
serialize. A Friendly Critique of SparkR p8.