Spark
Spark is one of the most widely used big data processing frameworks; it lets developers focus on their data-processing logic rather than on the details of managing a cluster.
A huge amount of data is generated on the Internet every second. Processing such a volume of information on a single computer is usually impractical because of memory limits and poor throughput. The most feasible solution is to exploit the power of a computer cluster, an approach commonly called "big data processing". Spark is a big data processing framework that originated at UC Berkeley and was later donated to the Apache Software Foundation. It is written primarily in Scala and runs on the JVM, and it provides application programming interfaces for Scala, Java, Python, R, and SQL (C# support is available through the separate .NET for Apache Spark project).
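As an illustration of this programming model, the sketch below shows a minimal Spark application in Scala that counts word occurrences in a text file. It is only a sketch: the application name and input path are placeholders, and `local[*]` stands in for a real cluster URL.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Entry point to Spark's Dataset/DataFrame API.
    val spark = SparkSession.builder()
      .appName("WordCount")   // hypothetical application name
      .master("local[*]")     // run locally on all cores; use a cluster URL in production
      .getOrCreate()

    import spark.implicits._

    // Read a text file as a Dataset[String] and count word occurrences.
    val counts = spark.read.textFile("input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .groupByKey(identity)
      .count()

    counts.show()
    spark.stop()
  }
}
```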
Spark has the following advantages:
- Data is distributed across the cluster, so every node can process its part of the dataset simultaneously.
- Resources such as memory, disk, and processors are scalable.
- Data is processed in memory, which makes Spark much faster than disk-based systems such as Hadoop MapReduce.
- Lazy evaluation lets Spark build an execution plan before any data is processed, which simplifies big data development (see the sketch after this list).
- Resilient Distributed Datasets (RDDs) track their lineage, so lost data can be recomputed in a fault-tolerant way.
- Tasks are scheduled automatically across the cluster, and dynamic scaling is supported.
- Machine learning is supported through the built-in MLlib library.
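The sketch below illustrates two of these points, lazy evaluation and RDD lineage. The transformations only record how the result is derived; a single action then triggers the actual computation, and if a partition is lost it can be rebuilt from the recorded lineage. All names and values are illustrative, not taken from Spark's own examples.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LazyEvaluationDemo")   // hypothetical application name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations (filter, map) are lazy: nothing is computed here,
    // Spark only records the lineage of each resulting RDD.
    val numbers = sc.parallelize(1 to 1000000)
    val evens   = numbers.filter(_ % 2 == 0)
    val squared = evens.map(n => n.toLong * n)

    // Only an action (reduce, count, collect, ...) triggers execution.
    // A lost partition is recomputed from the lineage above, which is
    // how RDDs provide fault tolerance.
    val sum = squared.reduce(_ + _)
    println(s"Sum of squared even numbers: $sum")

    spark.stop()
  }
}
```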
Spark - Distribution Analysis
Spark - Quality and Evolution
Spark - From Vision to Architecture
Spark - Product Vision and Problem Analysis
Contributions
Add "CSV Files" page to Data Sources documents.
Fix issue: SPARK-34492
The Data Sources section of the Spark SQL Guide lacked a page describing the general methods for loading and saving CSV data. We added a "CSV Files" page to fill this gap.
At the request of the Spark maintainers, we also added a Scala example, a Python example, and a Java example to the Spark test suite to illustrate the CSV interfaces.
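To give an idea of the kind of CSV interfaces the new page documents, the sketch below reads a CSV file into a DataFrame and writes it back out using Spark's DataFrame reader and writer. It is a hedged illustration rather than the exact snippet from our contribution; the paths and application name are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object CsvExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvExample")   // hypothetical application name
      .master("local[*]")
      .getOrCreate()

    // Load a CSV file into a DataFrame; the path is a placeholder.
    val df = spark.read
      .option("header", "true")       // treat the first line as column names
      .option("inferSchema", "true")  // infer column types instead of reading everything as strings
      .csv("data/people.csv")

    df.printSchema()
    df.show()

    // Save the DataFrame back out as CSV.
    df.write
      .option("header", "true")
      .csv("output/people_csv")       // placeholder output directory

    spark.stop()
  }
}
```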