Generally speaking, in computer graphics, "rendering" refers to the process of generating or drawing a 2D image on an output device such as a screen, given inputs like a virtual camera, 3D objects, and light sources.
As shown in the figure below, the rendering pipeline generally consists of three major stages. The Application Stage can be understood as the data-preparation stage, where the Galacean engine handles shader compilation, frustum culling, render queue sorting, shader data updates, and other initialization work; the Geometry Processing Stage runs that data through vertex shaders, projection transformations, clipping, and screen mapping, outputting vertices in the screen coordinate system; the Rasterizer Stage rasterizes the vertices from the previous stage into fragments, then assigns each pixel its final color through fragment shading, color blending, and other operations.
In dynamic scenes, the camera, 3D objects, and light sources are constantly changing, so the rendering pipeline must be executed every frame. The generally accepted minimum for smooth visuals is FPS > 30, meaning one full pass through the rendering pipeline must finish in under 33 ms (1000 / 30). In the 0.3 milestone, the Galacean engine comprehensively restructured the rendering pipeline around the material system. In the skeletal-animation demo shown below (728,300 faces, 50 draw calls, without techniques like Instance batching), rendering nearly a million faces in the browser with 6x CPU throttling, the render-related render method took only 900 microseconds (in version 0.2, rendering the same scene, the render method took 10 ms)!
Test environment: MacBook Pro, 2.7 GHz Quad-Core Intel Core i7, Intel Iris Plus Graphics 655 1536 MB, Google Chrome 90.0
This article will focus on the Application Stage, analyzing how the Galacean engine's rendering pipeline is designed and optimized from the perspectives of data structures, syntax, and architecture design.
The first stage is frustum culling.
If every active object in the scene entered the subsequent rendering pipeline even though some lie outside the camera's frustum, rendering those out-of-view objects would be wasted work — especially for models with high face counts, where the waste is significant. The operation of excluding objects outside the camera's frustum before the rendering pipeline starts is called frustum culling.
At the core level, the Galacean engine first uses a dirty flag to detect whether the frustum has changed, and updates the frustum only when necessary. It then performs an intersection test between the frustum and each object's AABB (axis-aligned bounding box); whether an object's bounding box needs updating is likewise tracked with a dirty flag:
By the way, the bounding box update uses the component-wise algorithm, which needs only two vector operations to obtain the bounding box in world space:
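The component-wise idea can be sketched as follows (a minimal illustration, not the engine's actual API; the box is assumed to be stored as center/extent, and the matrix is a column-major 4×4 array in the WebGL convention):

```typescript
type Vec3 = [number, number, number];

// Transforms a local-space AABB (center + half-extent) into world space:
// one matrix-point transform for the center, and one "absolute matrix"
// transform for the extent. |M| * extent bounds the box after rotation.
function transformAABB(
  center: Vec3,
  extent: Vec3,
  m: number[] // column-major 4x4 matrix, flattened to 16 numbers
): { center: Vec3; extent: Vec3 } {
  const worldCenter: Vec3 = [
    m[0] * center[0] + m[4] * center[1] + m[8] * center[2] + m[12],
    m[1] * center[0] + m[5] * center[1] + m[9] * center[2] + m[13],
    m[2] * center[0] + m[6] * center[1] + m[10] * center[2] + m[14],
  ];
  // Translation does not affect the extent; only the absolute values of
  // the rotation/scale part are needed.
  const worldExtent: Vec3 = [
    Math.abs(m[0]) * extent[0] + Math.abs(m[4]) * extent[1] + Math.abs(m[8]) * extent[2],
    Math.abs(m[1]) * extent[0] + Math.abs(m[5]) * extent[1] + Math.abs(m[9]) * extent[2],
    Math.abs(m[2]) * extent[0] + Math.abs(m[6]) * extent[1] + Math.abs(m[10]) * extent[2],
  ];
  return { center: worldCenter, extent: worldExtent };
}
```

Compared with transforming all eight corners of the box, this needs only the two transforms above, which is where the performance win comes from.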
Finally, only objects whose bounding boxes are contained or intersected by the frustum will enter the rendering pipeline, reducing subsequent pipeline pressure and improving performance.
The second stage is rendering queue sorting.
In the rendering pipeline, the default rendering queue for objects is unordered, meaning that factors such as the distance of objects from the camera, transparency, and other rendering states are uncontrolled. This affects both the correctness of rendering effects and pipeline performance, so the engine divides the rendering queue into three types: Opaque, AlphaTest, and Transparent.
Opaque queue: sorted near to far. Non-transparent objects are generally sorted near to far because, with depth testing enabled, fragments that fail the depth test skip color blending, and GPUs that support early-Z skip the fragment shader entirely. With the default depth function, a fragment fails the test when its depth is greater than the value already in the depth buffer — that is, when it is farther from the camera — so drawing near to far rejects the maximum number of subsequent fragments and optimizes performance. It is worth mentioning that some GPU architectures, such as PowerVR, support HSR (Hidden Surface Removal); an engine can rely on that technology to achieve early-Z-style rejection without sorting, so near-to-far sorting is not always necessary.
Transparent queue: sorted far to near. This ordering is required for correctness: color blending is order-dependent, so the final result of blending two colors depends on which one is drawn first.
Sorting algorithm optimization. The Galacean engine uses a more efficient custom sorting algorithm. The built-in Array.sort could do the job, but its implementation carries a lot of general-purpose logic that hurts performance; interested readers can discuss this in this issue. More broadly, beyond Array.sort, many built-in JS methods are designed for generality, and pursuing maximum performance requires engine-specific alternatives. For example, Array.forEach carries array bookkeeping, callback invocation overhead, and other work the engine does not need, so plain for loops are more efficient inside the engine, especially in frame-level code.
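The queue ordering described above can be sketched with two comparators (illustrative only — the engine's actual sorter and data structures differ; `distanceSq` is assumed to be a squared camera distance precomputed once per frame):

```typescript
interface RenderItem {
  distanceSq: number; // squared distance from the camera, computed per frame
}

// Opaque / AlphaTest queue: near to far, to maximize early-Z rejection.
const opaqueCompare = (a: RenderItem, b: RenderItem) => a.distanceSq - b.distanceSq;

// Transparent queue: far to near, so color blending happens in the right order.
const transparentCompare = (a: RenderItem, b: RenderItem) => b.distanceSq - a.distanceSq;

// Plain for loop instead of Array.forEach for frame-level iteration,
// avoiding the callback overhead mentioned above.
function maxDistanceSq(queue: RenderItem[]): number {
  let max = -Infinity;
  for (let i = 0, n = queue.length; i < n; i++) {
    if (queue[i].distanceSq > max) max = queue[i].distanceSq;
  }
  return max;
}
```

Using squared distance avoids a square root per comparison while preserving the ordering, which is a common trick in render-queue sorting.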
The third stage is compiling shaders.
Shaders include vertex shaders and fragment shaders and are programmable modules in the hardware rendering pipeline. By writing shader code, we can connect the CPU and GPU, controlling the final vertex positions, pixel colors, etc.
A core concept in shader compilation is the macro. When writing shaders, macro directives such as `#ifdef` decide whether a given section of shader code is compiled; apart from which macros are defined, the shader code is otherwise identical. We call a macro that can be defined or not a macro switch; a particular combination of macro switches is called a macro set; and the different shaders generated from different macro sets are called shader variants.
Automatically compiled shader variants. Usually developers only need to toggle macro switches; at runtime, the engine automatically decides whether to recompile the shader based on whether the macro set has changed.
Precompiled shader variants. If runtime smoothness is critical, we can manually call the engine's `Shader.compileVariant` method to precompile variants ahead of time, so nothing needs to be compiled during rendering and stuttering is avoided.
Bitwise operations. Each macro switch in a macro set is really a string, and macro switches need to be added, removed, toggled, and queried efficiently. A small bitwise trick handles these string combinations: treat each macro switch, such as `#ifdef test`, as one bit, where 0 means the macro is off and 1 means it is on. A 32-bit integer can then represent 32 macro switches; `|=` turns a macro on and `& ~` turns it off. Thanks to this representation, deciding whether two macro sets differ is just an integer equality check.
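A minimal sketch of this trick (names are illustrative, not the engine's real API; a production implementation would extend beyond 32 macros with multiple integers):

```typescript
// Each macro name is assigned one bit of a 32-bit integer, lazily.
const macroBits = new Map<string, number>();
let nextBit = 0;

function macroFlag(name: string): number {
  let flag = macroBits.get(name);
  if (flag === undefined) {
    flag = 1 << nextBit++; // next free bit (up to 32 in this sketch)
    macroBits.set(name, flag);
  }
  return flag;
}

// |= turns the macro's bit on.
function enableMacro(set: number, name: string): number {
  return set | macroFlag(name);
}

// & ~ turns the macro's bit off.
function disableMacro(set: number, name: string): number {
  return set & ~macroFlag(name);
}
```

With macro sets stored as plain numbers, checking whether a shader variant needs recompilation reduces to `currentSet === compiledSet`.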
The fourth stage is updating shader data.
Like programming languages such as JS and C++, shaders are written in GLSL, which also has concepts like variables and functions. The difference is that shaders run on the GPU, and we upload shader data to them via uniforms.
Shaders cache their data, so values that have not changed do not need to be re-uploaded every frame. This leads to an optimization: block-based updating and uploading.
As shown in the figure above, the Galacean engine stores shader data (`ShaderData`) in four blocks: Scene, Camera, Renderer, and Material. With this split, the engine can update shader data block by block on the CPU and then upload it block by block to the GPU. There are also some smaller optimizations in this process, discussed next.
In many designs, shader-related data lives entirely in the material (Material), so all shader data is updated whenever the renderer updates, even when nothing has changed; the complexity is O(Scene × Camera × Renderer).
After blocking, shader data updates are distributed across scene updates, camera updates, renderer updates, and material updates, reducing the complexity to O(Scene + Camera + Renderer): each block's data is processed only in its own hook. For example, a shader variable that depends only on the scene is updated only when the scene changes.
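The block-plus-dirty-flag mechanism can be sketched like this (a simplified illustration, not the engine's real `ShaderData` class; the `uploads` counter only simulates GPU communication):

```typescript
// Each block owns its data and a dirty flag; upload() touches the GPU
// only when something in the block actually changed.
class ShaderDataBlock {
  private data = new Map<string, number>();
  private dirty = false;
  uploads = 0; // counts simulated GPU uploads, for illustration

  set(name: string, value: number): void {
    if (this.data.get(name) !== value) {
      this.data.set(name, value);
      this.dirty = true; // only a real change marks the block dirty
    }
  }

  upload(): void {
    if (!this.dirty) return; // unchanged block: skip GPU communication
    this.uploads++;
    this.dirty = false;
  }
}

// The four blocks mirror the Scene / Camera / Renderer / Material split.
const blocks = {
  scene: new ShaderDataBlock(),
  camera: new ShaderDataBlock(),
  renderer: new ShaderDataBlock(),
  material: new ShaderDataBlock(),
};

function uploadAll(): void {
  blocks.scene.upload();
  blocks.camera.upload();
  blocks.renderer.upload();
  blocks.material.upload();
}
```

Note how setting a value equal to the cached one leaves the block clean, which is the same duplicate-check idea applied again at the block level.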
It is worth mentioning that although the blocking mechanism reduces CPU computation, where the data is stored still matters a great deal. For example, if a developer places a shader variable that could live at the Scene level into the Renderer instead, a value that would have been calculated once per scene is now recalculated by every renderer, performing scene × camera × renderer computations instead of one.
Similarly, if not blocked, all shader data is generally uploaded by default when finally uploading to the GPU.
Thanks to the blocking mechanism, we can choose whether to upload a block of data based on whether the content of these four blocks has changed, thereby reducing GPU communication. For example:
As shown in the figure above, if the camera for the current rendering pipeline has not changed, the camera's entire block of shader data is skipped. Even when a block is uploaded, the lowest-level upload path performs a duplicate check: if a value matches the shader's cached value, it is not sent to the GPU.
As mentioned above, we can upload shader data via Uniform, but different shader data types require different API calls:
If we looked up the right API call by data type every time we uploaded shader data, each upload would go through switch statements and loops, which is expensive in frequently called interfaces. So the engine not only creates the ShaderProgram context automatically, but also records, at creation time, the information the shader will need at runtime. When shader data is later updated, the engine simply invokes the saved hooks, avoiding the costly runtime lookup.
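The hook-caching idea can be sketched as follows (illustrative only — a real implementation would query WebGL via `getActiveUniform`/`getUniformLocation` and use actual uniform locations; here `gl` is a mock that records calls, and the type list is a small subset of GLSL):

```typescript
type UploadFn = (value: number | number[]) => void;

interface UniformInfo {
  name: string;
  type: "float" | "vec4"; // subset of GLSL types, for illustration
}

// Mock GL context: only the calls this sketch needs, recording each call.
const calls: string[] = [];
const gl = {
  uniform1f: (name: string, v: number) => calls.push(`uniform1f ${name} ${v}`),
  uniform4fv: (name: string, v: number[]) => calls.push(`uniform4fv ${name} ${v.join(",")}`),
};

// The type switch runs ONCE, at program creation; each saved hook is a
// branch-free closure that per-frame code can call directly.
function buildUploadHooks(uniforms: UniformInfo[]): Map<string, UploadFn> {
  const hooks = new Map<string, UploadFn>();
  for (const u of uniforms) {
    switch (u.type) {
      case "float":
        hooks.set(u.name, (v) => gl.uniform1f(u.name, v as number));
        break;
      case "vec4":
        hooks.set(u.name, (v) => gl.uniform4fv(u.name, v as number[]));
        break;
    }
  }
  return hooks;
}
```

At runtime, updating a uniform is just `hooks.get(name)!(value)` — no per-frame type dispatch.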
There can be more than one way to upload the same shader data. For example, given a float array `uniform float test[4]` in the shader, the user could upload the elements separately in four calls — `uniform1f(test[0], v0)`, `uniform1f(test[1], v1)`, `uniform1f(test[2], v2)`, `uniform1f(test[3], v3)` — or upload the entire array at once with `uniform1fv(test, [v0, v1, v2, v3])`.
In the example above, although multiple upload paths exist, uploading once is clearly faster. The engine therefore dropped support for uploading array elements individually, as well as for struct arrays. This means that if the shader has an array variable, the user can only upload the whole array in bulk, not element by element; and since structs cannot be uploaded in bulk and the engine does not support individual element uploads, a struct array must be split into multiple plain arrays.
For texture units, the engine pre-binds texture units to all sampler variables during shader pre-analysis. Later, when a shader texture is updated, only the texture-unit activation call is needed, eliminating repeated texture-unit binding.
Updating shader data inevitably requires frequent lookups over a large amount of shader-related data. For example, to upload a uniform variable, we must find its data on the CPU side by the uniform's name. Doing this lookup with strings is slow, so we use numeric indices to improve performance.
As shown in the figure above, when frequent data lookups are involved, prefer numeric indexing. In practice, numeric indexing in JS objects is much faster than string indexing, and the gap widens as the sample size grows and the strings get more complex. In the example, comparing numeric and string indexing over 1000 samples shows a speed difference of more than 10x:
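The pattern can be sketched like this (names are illustrative, not the engine's real property API): the string-to-index mapping is built once at initialization, and frame-level code only ever indexes a dense array by number.

```typescript
// Built once: every uniform name gets a stable numeric id.
const nameToIndex = new Map<string, number>();

function propertyIndex(name: string): number {
  let index = nameToIndex.get(name);
  if (index === undefined) {
    index = nameToIndex.size; // next free slot
    nameToIndex.set(name, index);
  }
  return index;
}

// Values live in a dense array indexed by the numeric id, so the
// per-frame hot path never does a string lookup.
const values: number[] = [];

function setValue(index: number, value: number): void {
  values[index] = value;
}

function getValue(index: number): number {
  return values[index];
}
```

Typical usage: call `const idx = propertyIndex("u_baseColor")` once when a material is set up, then use `getValue(idx)` / `setValue(idx, v)` in the frame loop.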
The fifth stage is updating the render states.
In the graphics rendering pipeline there are many parallel render states, such as depth testing, color blending modes, stencil testing, and backface culling. The engine therefore groups the numerous render states into BlendState, DepthState, StencilState, and RasterState.
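A common optimization with grouped states is to cache the last-applied value and skip redundant GL calls. A minimal sketch (illustrative only — the engine's real state classes differ, and `mockGL` stands in for the WebGL context):

```typescript
// Records every call "issued to the GPU", so we can see what gets skipped.
const issued: string[] = [];
const mockGL = {
  depthFunc: (f: string) => issued.push(`depthFunc ${f}`),
  enableBlend: (on: boolean) => issued.push(`blend ${on}`),
};

// Each state block remembers what it last applied and only issues a GL
// command when the requested value actually differs.
class DepthState {
  private appliedFunc: string | null = null;
  apply(func: string): void {
    if (this.appliedFunc !== func) {
      mockGL.depthFunc(func);
      this.appliedFunc = func;
    }
  }
}

class BlendState {
  private appliedEnabled: boolean | null = null;
  apply(enabled: boolean): void {
    if (this.appliedEnabled !== enabled) {
      mockGL.enableBlend(enabled);
      this.appliedEnabled = enabled;
    }
  }
}
```

Grouping states this way also keeps each block's caching logic independent, so adding a new state (e.g. stencil) does not touch the others.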
There are many ways to optimize a rendering pipeline. The Galacean engine's general-purpose pipeline should meet most developers' needs; for special rendering requirements, we hope the design ideas and optimization techniques described in this article provide some help.