PaddleOCR - An Elegant And Modular Architecture

This essay examines PaddleOCR from the perspective of its architecture. We first derive its architectural style - the pipe-and-filter and blackboard patterns - from its working mechanism. We then illustrate the containers and components to describe its structure, and analyze PaddleOCR from the development and run-time views. Finally, we briefly present some API design principles that PaddleOCR applies.

PaddleOCR’s Architectural Style

Figure: PaddleOCR Working Procedure

In the development of recognition software, there are two classic models - the blackboard [1] and the pipe-and-filter. The blackboard model suits open problem domains that lack deterministic solutions. The pipe-and-filter pattern extracts and processes data through a sequence of filters. Based on our research and analysis, we believe that PaddleOCR adopts both of them and integrates them well.

As you can easily imagine, when PaddleOCR receives a picture, it cannot immediately identify the language, direction, or position of the characters. The Baidu engineers therefore divide the processing of a picture into text detection, direction classification, and text recognition, which is where the pipe-and-filter model applies. They also design and coordinate diverse models, corresponding to the knowledge sources of the blackboard pattern, to reach an approximate solution with various algorithms. By combining these two models, PaddleOCR converts the original data (a picture) into higher-level structures (texts).
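As a concrete illustration of this pipeline, the short sketch below runs detection, direction classification, and recognition on one image through the paddleocr Python package. The image path is a placeholder, and the exact nesting of the returned result differs slightly across PaddleOCR versions, so treat the result handling as a hedged sketch rather than canonical usage.

```python
from paddleocr import PaddleOCR

# Each stage acts as a filter in the pipeline: text detection, direction
# classification (use_angle_cls), and text recognition.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

# The result nests one list of (box, (text, confidence)) entries per input
# image; the exact structure varies between PaddleOCR versions.
result = ocr.ocr("example.jpg", cls=True)
for box, (text, confidence) in result[0]:
    print(text, confidence)
```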

Furthermore, in PaddleOCR various algorithms are applied to solve a single partial problem such as text recognition. The mechanism of the blackboard model satisfies exactly this demand for applying multiple algorithms to one partial problem. As shown in the figure below, each participating agent in this model has its own expertise and problem-solving knowledge (knowledge source) that applies to a part of the problem. The blackboard allows multiple processes (or agents) to communicate with the global database by reading and writing requests and information. Agents communicate strictly through a common blackboard whose contents are visible to all of them. Just like a blackboard in a classroom, when a partial problem is to be solved, the candidate agents that can possibly handle it are listed. This mechanism not only enables algorithms to use each other's results but also helps PaddleOCR select the best result among many algorithms in a single step.

Figure: Architectural Features and Quality of PaddleOCR
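To make this blackboard mechanism more tangible, the following is a hypothetical, simplified sketch of the pattern itself - not PaddleOCR's actual control code. Several candidate agents read a shared blackboard, write their partial results back, and the best-scoring contribution is kept; the agents, scores, and control strategy are invented for clarity.

```python
# Hypothetical illustration of the blackboard pattern; the agents and control
# strategy are invented for clarity and do not reflect PaddleOCR's real code.
from typing import Callable, Dict, List, Tuple

Blackboard = Dict[str, object]                     # shared state visible to every agent
Agent = Callable[[Blackboard], Tuple[str, float]]  # reads the blackboard, returns (result, confidence)

def solve_partial_problem(blackboard: Blackboard, agents: List[Agent]) -> str:
    best_text, best_score = "", float("-inf")
    for agent in agents:
        text, score = agent(blackboard)              # agent applies its own expertise
        blackboard[agent.__name__] = (text, score)   # contribution is visible to other agents
        if score > best_score:
            best_text, best_score = text, score
    blackboard["best"] = best_text                   # the control component keeps the best result
    return best_text

# Two toy "recognition" agents with different expertise (purely illustrative).
def ctc_agent(bb: Blackboard) -> Tuple[str, float]:
    return "HELLO", 0.92

def attention_agent(bb: Blackboard) -> Tuple[str, float]:
    return "HELL0", 0.78

print(solve_partial_problem({"image": "cropped_text_region"}, [ctc_agent, attention_agent]))
```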

Trade-off! The Tricky Architecture

Figure: Architecture Quality and PaddleOCR’s Features

From the perspective of pattern features, the blackboard and pipe-and-filter architectures share the core concept of separating components, which endows PaddleOCR with reusability, maintainability, and changeability for further extension. The figure above summarizes our analysis of the key qualities of PaddleOCR contributed by this architectural style.

As for the trade-offs in designing the architecture, we visualize them in the figure below. The main compromise is between accuracy and speed. In the blackboard model, multiple algorithms are applied to process the data and obtain the optimal result, which guarantees the quality and robustness of the solution.

The efficiency gain that the pipe-and-filter pattern promises through parallel processing is often an illusion: not only may the data-transformation overhead between filters be high relative to the cost of the computation itself, but the slowest filter also determines the throughput of the whole pipeline.
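A toy calculation makes this concrete; the stage timings below are hypothetical numbers, not measurements of PaddleOCR. The latency of a single image is the sum of all stages, while the steady-state throughput of a pipelined system is capped by the slowest filter.

```python
# Hypothetical stage timings in milliseconds, chosen only to illustrate the point.
stage_ms = {"detection": 120.0, "classification": 15.0, "recognition": 60.0}

latency_ms = sum(stage_ms.values())            # one image still passes every filter: 195 ms
throughput = 1000.0 / max(stage_ms.values())   # at best ~8.3 images/s, set by the slowest filter
print(f"latency = {latency_ms} ms, max throughput = {throughput:.1f} images/s")
```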

All of the aforementioned architectural features bring computational overhead and reduce efficiency, which limits the response time. The Baidu developers did not simply fall back on a single algorithm to avoid this cost; instead they focused on polishing the algorithms themselves. In other words, the deficiency of the architectural model is offset by advanced algorithms and code, which demands great devotion from the developers. In paper [1], the PaddleOCR developers propose an 8.4 MB ultra-lightweight model that strikes a balance between speed and accuracy.

Furthermore, PaddleOCR can simply be viewed as a data-processing tool with pictures as input and texts as output, rather than as highly interactive software. Weak interaction is a shortcoming of the blackboard and pipe-and-filter patterns, but it does not bother PaddleOCR users, as they only need the text output.

Figure: Trade-off in Modular Architecture Style Design

Architecture Structure Design

Having described the main architectural style of the PaddleOCR system, we use part of the C4 model [4] to visualize the architectural structure from two aspects - containers and components. The C4 model reflects the main ideas of the software system.

Container: Server and Mobile

First, we introduce the container diagram of the PaddleOCR software system. PaddleOCR currently supports two deployments - server and mobile - and we drew the container diagrams based on them.

The right part of the diagram shows the model-training procedure, where users can provide their own dictionary and transform the data format with the Python scripts provided. Users may also choose the ICDAR2015 dataset to train the model. Given the training set, the model calculates the loss and passes it to the optimizer; users can adjust the optimizer's hyperparameters to control the learning rate and to regularize and optimize the loss function. The optimizer then passes the updated parameter estimates back to the training model. Once the model is fully trained, it can be deployed to the PaddleHub server or to mobile devices.
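The loss-to-optimizer loop described above can be sketched with the underlying PaddlePaddle API. The tiny model, random data, and hyperparameter values below are placeholders: PaddleOCR's real training is driven by its YAML configs and training scripts rather than a hand-written loop like this.

```python
# Conceptual sketch of the loss -> optimizer -> parameters loop; the model and
# data are stand-ins, not part of PaddleOCR.
import paddle

model = paddle.nn.Linear(32, 10)                    # stand-in for the OCR network
loss_fn = paddle.nn.CrossEntropyLoss()
optimizer = paddle.optimizer.Adam(
    learning_rate=0.001,                            # user-tunable hyperparameter
    weight_decay=paddle.regularizer.L2Decay(1e-5),  # regularizes the loss function
    parameters=model.parameters(),
)

x = paddle.randn([8, 32])
y = paddle.randint(0, 10, [8])

loss = loss_fn(model(x), y)    # the model calculates the loss
loss.backward()                # the loss (its gradients) is passed to the optimizer
optimizer.step()               # the optimizer updates the parameter estimates
optimizer.clear_grad()
```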

The left part of the diagram focuses on the mobile and server deployments. Paddle-Lite is used for the mobile-side deployment, so prediction results can be obtained directly on the device. Because the inference model still has to be transferred to the mobile device, we can either download an inference model from the model library or use a model we trained ourselves. Users then push the original pictures to the mobile device and get back only the text results.

In contrast to the ultra-lightweight mobile deployment, the server deployment is designed for faster recognition and remote computing: users send images as prediction requests and receive the results from the Hub server. The blue dotted line indicates a possible mobile application that does not include the inference model itself but instead uses the model hosted on the server.

Figure: PaddleOCR Container Diagram
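As an example of the request path to the Hub server, the sketch below posts a base64-encoded image to a PaddleHub serving endpoint. The address, port, and module route follow a typical hub-serving deployment of PaddleOCR's ocr_system module, but they should be treated as assumptions that depend on how the service is actually started.

```python
# Hedged sketch of a client-side prediction request to a PaddleHub serving
# endpoint; the address, port, and route are assumptions based on a typical
# `hub serving` deployment of the ocr_system module.
import base64
import requests

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:8868/predict/ocr_system",   # assumed serving address and module route
    json={"images": [image_b64]},
    headers={"Content-Type": "application/json"},
)
print(response.json())                            # recognized texts, scores, and boxes
```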

Component & Connector

The PaddleHub server includes the detection, angle-classification, and recognition components. Compared to the server deployment, the mobile application does not need these components.

From the component perspective, the mobile-side application has no inference model and cannot perform local text recognition, but it provides a user interface and image storage. The network port establishes the transmission channel between users and the server.

For the training model, the loss analysis returns its result to the modelling component and guides how the parameters are adjusted. A model-evaluation result is produced through the test path.

Figure: PaddleOCR Component Diagram

The software connectors perform the transfer of control and data among components [9]. The connectors between the PaddleHub server components, the mobile side, and the training-model components are well-defined software interfaces. After being processed, all information is passed to the next component through these interfaces.

Inside the self-trained model, software interfaces are the connectors between the model and its deployment. Downloading the model, like issuing a service request, is data transmission over the network.