Spark is one of the most widely used big data processing frameworks, enabling developers to focus on their data-processing logic rather than on cluster management problems.

A vast amount of data is generated on the Internet every second. Processing that much information is usually considered impossible for a single computer because of memory limitations and poor real-time performance. The most feasible solution to this problem is to exploit the power of a computer cluster, an approach commonly called “big data processing”. Spark is such a big data processing framework: it was initially developed at UC Berkeley’s AMPLab and later donated to the Apache Software Foundation. Although Spark is implemented mainly in Scala and runs on the JVM, it also provides application programming interfaces for Java, Python, R, and SQL (C# support is available through the separate .NET for Apache Spark project).

Spark has the following advantages:

  • Data is distributed across the cluster, so every node can process its partition of the dataset simultaneously.
  • Resources such as memory, disk, and processors are scalable.
  • Data is processed in memory, which is much faster than disk-based processing systems such as Hadoop MapReduce.
  • Lazy evaluation lets Spark record a whole chain of transformations and optimize it before executing anything.
  • Resilient Distributed Datasets (RDDs) maintain data correctness in a fault-tolerant way.
  • Tasks are scheduled automatically across the cluster, and dynamic scaling is also supported.
  • Machine learning is supported through the built-in MLlib library.
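The lazy-evaluation point above can be made concrete with a toy sketch. The `MiniRDD` class below is a hypothetical illustration invented for this page, not the real Spark API: like Spark, it only records transformations (`map`, `filter`) and executes nothing until an action (`collect`) is called.

```python
# Toy sketch of Spark-style lazy evaluation -- NOT the real Spark API.
# Transformations only record work; an action triggers execution.

class MiniRDD:
    def __init__(self, data, transforms=None):
        self._data = data
        self._transforms = transforms or []   # recorded, not yet executed

    def map(self, fn):
        # Transformation: return a new MiniRDD with the step recorded.
        return MiniRDD(self._data, self._transforms + [("map", fn)])

    def filter(self, pred):
        # Transformation: likewise, just append to the recorded pipeline.
        return MiniRDD(self._data, self._transforms + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded pipeline actually executed.
        result = list(self._data)
        for kind, fn in self._transforms:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; collect() triggers the whole pipeline:
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the same shape appears as `sc.parallelize(range(10)).map(...).filter(...).collect()`; recording the pipeline first is what allows Spark to plan and optimize execution across the cluster.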


Yongding Tian

Master student of Computer Engineering at TU Delft

Chenxu Ma

Master student of Computer Engineering at TU Delft

Zhiyi Wang

Master student of Electrical Engineering (Signals and Systems) at TU Delft

Zhuoran Guo

Master student of Electrical Engineering at TU Delft

Spark - Distribution Analysis

Spark - Distribution Analysis Introduction to the distributed components of Spark This essay mainly discusses the distributed components of Spark. Unlike traditional distributed Internet services, distributed computing has a quite different architecture. Spark is well known for its powerful distributed computation ability. The three components critical to Spark’s distributed computation are the cluster manager, the Resilient Distributed Dataset (RDD), and the Directed Acyclic Graph (DAG). Cluster manager The cluster manager is an important part of the distributed computation of Spark.
March 20, 2021

Spark - Quality and Evolution

Spark - Quality and Evolution Spark quality control As a cornerstone of the big data computing field, Spark uses many methods to ensure code quality. Unit tests: the Scala and Python APIs are the most popular Spark interfaces, so most tests are based on these two APIs. For Scala, Spark uses Scala’s built-in unit-test framework; for Python, Spark has its own unit-test library. Style checkers: style checkers are essential for code readability and maintainability.
March 20, 2021

Spark - From Vision to Architecture

Spark - From Vision to Architecture A general introduction to Spark can be found in Essay 1. We begin our architectural analysis with a general introduction to the vision underlying Spark and its next steps. In this essay, we would like to introduce the architecture of Spark in more detail. Spark architecture As discussed in our first essay, the Spark application architecture can be summarized by the following graph. Figure: The Spark architecture From the perspective of applications, the most prominent architecture is a model-view architecture.
March 14, 2021

Spark - Product Vision and Problem Analysis

Spark - Product Vision and Problem Analysis Introduction to Spark Spark is an open-source, general-purpose computing engine for processing large-scale distributed data. It was initially developed at the University of California, Berkeley’s AMPLab by Matei Zaharia and then donated to the Apache Software Foundation, which has maintained it since. With the widespread use of the Internet, the data collected by service providers has become incredibly large. Google, the search-engine market leader and one of the biggest cloud providers, processes over 20 petabytes of data per day [1].


Add "CSV Files" page to the Data Sources documentation.


Fix issue: SPARK-34492

The Data Sources documentation in the Spark SQL Guide lacked a page describing the general methods for loading and saving CSV file data. We added the missing “CSV Files” page to the Data Sources documentation.

As requested by the Spark maintainers, we also added a Scala example, a Python example, and a Java example to the Spark tests to illustrate the CSV interfaces.
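For context, Spark’s Python CSV interface is of the form `spark.read.option("header", True).option("inferSchema", True).csv(path)`. The sketch below uses only the Python standard library to mimic what those two reader options do conceptually: treat the first row as column names and infer column types. It is a toy illustration written for this page, not Spark’s implementation.

```python
# Toy illustration (standard library only) of what Spark's CSV reader
# options "header" and "inferSchema" do conceptually. NOT the Spark API;
# in PySpark you would write something like:
#   df = spark.read.option("header", True).option("inferSchema", True).csv(path)

import csv
import io

def infer(value):
    """Best-effort type inference for a single CSV cell."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # fall back to string

def read_csv(text, header=True, infer_schema=True):
    """Parse CSV text into a list of dicts (with header) or lists (without)."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        names, rows = rows[0], rows[1:]
    if infer_schema:
        rows = [[infer(cell) for cell in row] for row in rows]
    if header:
        return [dict(zip(names, row)) for row in rows]
    return rows

sample = "name,age,score\nalice,31,92.5\nbob,27,88.0\n"
for record in read_csv(sample):
    print(record)  # e.g. {'name': 'alice', 'age': 31, 'score': 92.5}
```

With `header=False`, the first row is kept as data; with `infer_schema=False`, every cell stays a string, mirroring the defaults of Spark’s CSV reader.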

Open PR