Spark - Quality and Evolution
Spark quality control
As a cornerstone of the large-scale data processing field, Spark uses many methods to ensure code quality.
- Unit tests: The Scala and Python APIs are the most popular Spark interfaces, so most tests target these two APIs. For Scala, Spark uses the ScalaTest framework; for Python, Spark provides its own unit-testing utilities.
- Style checker: Style checkers are essential for code readability and maintainability. Spark uses Scalafmt to check the style of its Scala code. For other languages such as Python and Java, no style checker is applied because the bulk of the Spark code base is written in Scala. In addition, the style checker runs automatically as part of the tests for every pull request and commit.
- Continuous integration and deployment: Spark uses GitHub workflows to perform CI/CD. However, due to the heavy hardware resource consumption, the build servers are currently hosted by UC Berkeley's AMPLab.
- Independent issue tracker: Although GitHub provides a powerful issue system, Spark manages its issues on a separate website, Apache's JIRA issue tracker. Every issue there is assigned a unique number, for example SPARK-34492, and pull request titles on GitHub must start with the issue number:
`[SPARK-34492] Solving xxx issue`
- Pull request review: Like most open-source projects, Spark members review pull requests and communicate with the developers to maintain code quality. Only pull requests that the reviewers commonly agree on can be merged into Spark.
A look inside CI/CD
The Spark CI contains 14 build tasks and tests, which can be found in the [workflows directory](https://github.com/apache/spark/tree/master/.github/workflows) of the repository. The following table illustrates several of the most important Spark CI tasks; these tasks are taken from the CI results of our own contribution. All tasks run in an Ubuntu container, and every build automatically runs all unit tests in the source code.
Task name | Environment | Function |
---|---|---|
Build core modules | Java 8 + Hadoop 3.2 + Hive 2.3 | Test that the Spark core module builds |
Build catalyst | Java 8 + Hadoop 3.2 + Hive 2.3 | Catalyst is the optimizer for Spark SQL |
Build third-party modules | Java 8 + Hadoop 3.2 + Hive 2.3 | Test that the Spark streaming and integration interfaces (Kafka, MLlib, YARN, Mesos, etc.) work well |
Build Spark Hive | Java 8 + Hadoop 3.2 + Hive 2.3 | Hive is a distributed data warehouse system compatible with Spark |
Build Spark SQL | Java 8 + Hadoop 3.2 + Hive 2.3 | Build and test the Spark SQL engine |
Build PySpark core | Python 3.6 | Build and test the PySpark core |
Build PySpark SQL | Python 3.6 | Build and test PySpark, including the SQL part and the MLlib part |
Build Spark R | / | Build and test the Spark R interface |
License, dependency, code style and documentation (static analysis) | / | Generate the documentation and check licenses, dependencies, and code style |
Java 11 build with Maven | Java 11 | Build the project with Java 11 |
Scala 2.13 build with sbt | Java 11 + Scala 2.13 | Build the project with Scala 2.13 |
Hadoop 2.7 build with sbt | Java 8 + Hadoop 2.7 | Build the project with Hadoop 2.7 |
From these CI tasks, we can see that Spark applies a wide range of unit tests to cover most functions. Many tests also focus on compatibility, for example with different Python, Hadoop, Java, and Scala versions. In addition, static analysis is used to ensure code and documentation quality.
Rigorous rules of the Spark test process
The rigor of a software development project is achieved by setting rules for the test process. Everyone who opens a pull request for Spark has to observe these rules in order to avoid problems and failures that would make the project unreliable. With this rigor, Spark development can proceed smoothly without hindrance. Here are the rigorous steps of the test process [1] involved when a pull request is opened to fix a specific issue:
- Making sure the tests are self-descriptive.
- Writing the JIRA ID in the test name. Here is an example in Scala: `test("SPARK-12345: a short description of the test") { ... }` (see the sketch after this list).
- Running all tests with `./dev/run-tests` to verify that the code still compiles, passes tests, and passes style checks. The Code Style Guide can help users check the code style. This `run-tests` script also plays an important role in CI because it integrates the tests of all the languages in the project.
- Opening a pull request; the related JIRA issue will automatically be linked to this request.
- Testing the user's changes with the Jenkins automatic pull request builder.
- Jenkins posts the results of the tests with a link to the full results on Jenkins.
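To make the JIRA-ID naming convention concrete, here is a minimal, self-contained ScalaTest sketch; the suite name, ticket number, and tested function are illustrative and not taken from the Spark code base:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Illustrative only: the ticket number and the function under test are made up.
// The convention shown is the one described above: the test name starts with
// the JIRA ID followed by a short description of what the test checks.
class ExampleSuite extends AnyFunSuite {

  // The "unit" under test: a typical small function of a class.
  def wordCount(line: String): Map[String, Int] =
    line.split("\\s+").filter(_.nonEmpty).groupBy(identity).map { case (w, ws) => (w, ws.length) }

  test("SPARK-12345: wordCount counts repeated words correctly") {
    assert(wordCount("spark spark sql") == Map("spark" -> 2, "sql" -> 1))
  }
}
```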
Users can then investigate and fix failures promptly once the above procedures are finished. The detailed code style guides can be found in the following table:
Language | Code Style Guide |
---|---|
Python | PEP 8 [2] |
R | Google's R Style Guide [3] |
Java | Oracle's Java code conventions [4] and the Scala guidelines [5] |
Scala | Scala style guide [5] and Databricks Scala guide [6] |
Test coverage
Unit testing and integration testing are the two kinds of testing we should focus on in Spark. The former tests single functions; the latter tests external systems, such as Cassandra, that are tightly coupled with the Spark code. The typical functions of a class are the units of source code that should be tested.
Testing Spark applications can be more complicated than testing applications built on other frameworks, because of the need to prepare data sets and the lack of tools that allow us to automate tests. However, we can still automate our tests by using the `SharedSparkSessionHelper` trait [7], which can also be combined with code coverage libraries.
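A minimal sketch of such a helper trait, assuming ScalaTest and a local SparkSession; the body below is our own illustration, not the exact code from the cited article:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// Sketch of a shared-session helper: one local SparkSession is created lazily
// for the whole suite and stopped after all tests have run, so each test does
// not pay the session start-up cost.
trait SharedSparkSessionHelper extends BeforeAndAfterAll { self: Suite =>
  @transient lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local[2]")
    .appName("spark-unit-tests")
    .getOrCreate()

  override def afterAll(): Unit = {
    try spark.stop()
    finally super.afterAll()
  }
}
```

A test suite then simply mixes in the trait and uses `spark` directly, which keeps the data-set preparation inside the test itself and makes the suite easy to run from CI or under a coverage tool.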
Appropriate and reasonable test coverage of the Spark code is also essential. There are various criteria for measuring code coverage; the commonly used metrics are line coverage, function coverage, branch coverage, and statement coverage. Statement coverage suits Scala well because Scala code often places multiple statements on a single line. Therefore, Scoverage, which uses statement coverage as its metric, is used for Scala. Another advantage of Scoverage is its support for Jenkins and SonarQube, which makes it possible to enforce code coverage checks in CI and to continuously inspect the analysis results in SonarQube. Besides, IntelliJ IDEA supports JaCoCo, a commonly used free code coverage library for Java, for coverage measurement. Overall, JaCoCo and other code coverage tools are available to make sure the test code covers the source comprehensively.
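As an illustration of how such a coverage check could be wired into a Scala build, here is a hedged sketch using the sbt-scoverage plugin; the plugin version and the threshold are placeholders, not Spark's actual build settings:

```scala
// project/plugins.sbt -- the plugin version below is a placeholder
addSbtPlugin("org.scoverage" % "sbt-scoverage" % "2.0.9")

// build.sbt -- fail the build if total statement coverage drops below a threshold
coverageMinimumStmtTotal := 80
coverageFailOnMinimum := true
```

Running `sbt coverage test coverageReport` would then produce a statement-coverage report that a CI server such as Jenkins, or SonarQube, can pick up.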
Spark hotspot components and code quality
Since Spark is a well-established project that has lasted for over ten years, most current modifications by developers focus on small details. By analyzing past pull requests and commits, we find that the two most important hotspots are Spark SQL and Spark Core. Spark SQL attracts the most attention because all Spark programmers use it directly. Spark Core provides the most basic functions, such as networking, disk storage, and memory access. Many bugs have been found in Spark Core, which made it a hotspot component. Nevertheless, in recent years Spark Core has been cooling down because fewer new bugs are being found.
Hotspot components can easily cause code conflicts for developers because of frequent code changes. The Spark committers designed the code architecture to mitigate this problem; in other words, there are abstraction layers for each minor function. No matter how the implementation changes, the interface does not change for users, because they only access the abstraction layer, while developers inherit from these abstraction layers to implement their own logic. For example, Spark defines a common serialization interface for SQL datatypes, and developers can implement different serialization methods by inheriting from it. Another example is the common language support in Spark SQL. Different languages such as SQL, Scala, and R have different APIs, so adding a new feature to Spark SQL can easily lead to inconsistent interfaces across languages. To solve this, Spark provides a translation layer that translates all Spark SQL logic to Scala, so a new Spark SQL feature only needs to be implemented once, in Scala.
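The following sketch illustrates the abstraction-layer idea in plain Scala; the trait and class names are our own and are not Spark's actual internal API:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative abstraction layer: client code depends only on this trait.
trait DatatypeSerializer {
  def serialize(value: AnyRef): Array[Byte]
}

// Developers add new behaviour by inheriting from the abstraction layer,
// without changing the interface that users see.
class JavaDatatypeSerializer extends DatatypeSerializer {
  override def serialize(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }
}

object SerializationClient {
  // The caller never references a concrete implementation directly, so the
  // implementation can change without touching the user-facing interface.
  def persist(row: AnyRef, serializer: DatatypeSerializer): Array[Byte] =
    serializer.serialize(row)
}
```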
Besides the architecture design, the unit tests and continuous integration described above also improve the code quality of the hotspot components.
Currently, the Spark organization is focusing on adding minor new features and third-party interfaces, while community developers are focusing on filling in missing documentation and fixing minor Spark SQL problems. As a ten-year-old project, Spark has reached the end of its roadmap, and it seems that no more significant changes will occur.
Spark quality culture
Unit tests
As shown in previous sections, Spark utilizes many ways to ensure code quality. However, some bugs can hide behind the unit tests and only occur in specific situations. SPARK-15135, SPARK-31511, SPARK-534, and SPARK-5355 are such issues, which only occur in specific multi-threading environments. These issues escaped the previous unit tests because nobody had considered those specific situations before, so for this kind of issue Spark requires the bug fixer to write new unit tests to make sure future versions will pass them. An example can be found in PR: SPARK-34652. We can conclude that Spark dynamically adds new unit tests when facing new bugs. Usually, the developer who provides the patch for a bug is also responsible for providing a unit test (if necessary). Besides, new APIs always come with new unit tests, but this situation is common, so we do not discuss it here.
Documentation and examples
As mentioned before, there is a lot of missing documentation for Spark. Spark requires that new documentation come with new examples; documents may only cite example code, and these examples must be included in the Spark source files. In other words, the Spark examples must be correct and must compile. If an API changes in the future, the example will fail to compile, notifying developers that the document needs to be changed. An example can be found in PR: SPARK-34492.
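To make this concrete, example files under Spark's `examples/` module are marked so that the documentation build can pull in the exact lines, and because the file is part of the source tree it is compiled like any other code. A minimal sketch of such a file follows; the object name and snippet are illustrative, and the `$example on$`/`$example off$` markers follow the convention used in Spark's examples to the best of our knowledge:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a documentation example file: the markers delimit the region that
// the docs include, and the whole file compiles with the project, so an API
// change breaks the build instead of silently breaking the documentation.
object DocExampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DocExampleSketch").getOrCreate()
    // $example on$
    val df = spark.range(0, 10).toDF("id")
    df.filter("id % 2 = 0").show()
    // $example off$
    spark.stop()
  }
}
```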
Least modification
Most pull requests in Spark contain the smallest possible code modifications; see, for example, PR: SPARK-34581, PR: SPARK-34787, and PR: SPARK-34315. Each pull request only tries to solve one single problem, so only a few lines of code change. This makes it possible to roll back the whole system if something goes wrong in the future.
Technical debt in Spark
Technical debt commonly occurs in software development when an easy (limited) solution is chosen now instead of a better approach that would cost more time and resources. An example of technical debt in Spark is the way the client (`BroadcastManager`) chooses which factory (`BroadcastFactory`) to use. In Spark 1.0.0, the factory class was injected via configuration and dynamically loaded at runtime; there used to be a default `HttpBroadcastFactory`, but the `HttpBroadcast` code has been gone since Spark 2.0.0, so this indirection is now a legacy in the code architecture [8]. Such technical debt is handled by Spark developers through Spark's standard development procedure.
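The pattern described above, where the concrete factory is named in the configuration and loaded reflectively at runtime, can be sketched as follows; the trait, class, and default names are illustrative, not the exact Spark 1.x code:

```scala
// Illustrative sketch of config-driven factory loading: the client only knows
// the factory interface; the concrete class name comes from configuration and
// is instantiated via reflection. If one implementation is later removed (as
// HttpBroadcast was), the indirection itself lingers as technical debt.
trait BroadcastFactoryLike {
  def newBroadcast[T](value: T, id: Long): T
}

class TorrentLikeFactory extends BroadcastFactoryLike {
  override def newBroadcast[T](value: T, id: Long): T = value // stand-in logic
}

object BroadcastManagerSketch {
  def loadFactory(conf: Map[String, String]): BroadcastFactoryLike = {
    val className = conf.getOrElse("spark.broadcast.factory", "TorrentLikeFactory")
    Class.forName(className).getDeclaredConstructor().newInstance()
      .asInstanceOf[BroadcastFactoryLike]
  }
}
```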
Loosely speaking, reviewers and developers in Spark use Scalastyle as a code analysis tool integrated with GitHub. In general, the process for handling technical debt contains two main steps. First, the technical debt that needs to be fixed is identified, and an issue is created in JIRA to register it (JIRA is the platform Spark uses to keep track of issues). Second, Spark does not assign developers to solve those technical debt issues; instead, users are expected to sign up for a JIRA account and interact with other users who may be able to solve the issue.
References
1. Apache Spark. Contributing to Spark. https://spark.apache.org/contributing.html
2. Python. PEP 8, Style Guide for Python Code. https://legacy.python.org/dev/peps/pep-0008/
3. Google. Google's R Style Guide. https://google.github.io/styleguide/Rguide.html
4. Oracle. Code Conventions for the Java Programming Language. https://www.oracle.com/java/technologies/javase/codeconventions-contents.html
5. Daniel Spiewak and David Copeland. Scala Style Guide. https://docs.scala-lang.org/style/
6. Databricks. Databricks Scala Coding Style Guide. https://github.com/databricks/scala-style-guide
7. Matthew Powers. Testing Spark Applications. https://mrpowers.medium.com/testing-spark-applications-8c590d3215fa
8. John Montroy. Spark, Torrents, and Technical Debt. https://medium.com/@john.montroy/spark-torrents-and-technical-debt-9b5cdb23296c