Open Source
In 2023, Samsung joined the RISC-V Software Ecosystem (RISE) project as an official member. Since then, our company has been participating in the development of a variety of projects and porting them to the RISC-V architecture. In this article, we will shed some light on one of them: Chromium. When porting software to a new architecture, the first step is always to make it start and run on hardware; the next step is stabilization. But as Amber Huffman, chairperson of the RISE project, emphasized: "In order for RISC-V to be commercialized, it is important to secure software that has performance, security, reliability, and compatibility". For end users, performance is very important when browsing web pages, viewing streamed content, or running web applications. All of these activities have common ground: they show multimedia content, and vectorization is one way to improve its performance.
RISC-V is an open standard instruction set architecture based on established reduced instruction set computer (RISC) principles. Moreover, RISC-V is offered under royalty-free open-source licenses, and the documents defining its ISA are offered under a Creative Commons license or a BSD license. That is one of the major differences between RISC-V and commercial processor vendors such as Arm Ltd. and MIPS Technologies.
And what is an instruction set architecture? It is a part of the abstract model of a computer that defines how the CPU is controlled by software. It acts as an interface between the hardware and the software, specifying both what the processor is capable of doing and how it gets done. The ISA defines the supported data types, the registers, how the hardware manages main memory, key features (such as virtual memory), which instructions a microprocessor can execute, and the input/output model. [https://www.arm.com/glossary/isa]
One of the first commercially successful SIMD extensions was MultiMedia eXtension (MMX), introduced by Intel in 1997. MMX defines eight processor registers, named MM0 through MM7, and operations that are performed on them. Each register is 64 bits wide and can be used to hold either a 64-bit integer or multiple smaller integers in a "packed" format: one instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. MMX could take small, integer-only numbers, like those representing shades of color in a picture, and process many of them at the same time.
The successor of MMX was Streaming SIMD Extensions (SSE) introduced by Intel in 1999. SSE contains 70 new instructions (65 unique mnemonics using 70 encodings), most of which work on single precision floating-point data. SIMD instructions can greatly increase the performance when exactly the same operations are to be performed on multiple data objects. SSE could work with floating-point numbers, which are numbers with decimals, like 3.14 or 2.718. This made it perfect for 3D graphics processing and scientific calculations. SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3 and SSE4. Because it supports floating-point math, it had a wider range of applications than MMX and became more popular.
Next, in 2011 Advanced Vector Extensions (AVX) was introduced by Intel. AVX uses sixteen YMM registers to perform a single instruction on multiple pieces of data. Each YMM register can hold and do simultaneous operations (math) on: eight 32-bit single-precision floating point numbers or four 64-bit double-precision floating point numbers. It was designed to help with extremely demanding tasks like high-definition video processing, advanced gaming graphics, and heavy scientific simulations. AVX also had its own family, with AVX2 and AVX-512, each more powerful than the last.
Processors in the ARM family also implement vector extensions: the Advanced SIMD extension (also known as Neon or "MPE", Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardised acceleration for media and signal processing applications. Neon is included in all Cortex-A8 devices, but is optional in Cortex-A9 devices. Neon can accelerate signal processing algorithms and functions to speed up applications such as audio and video processing, voice and facial recognition, computer vision, and deep learning. Neon instructions allow up to: 16x8-bit, 8x16-bit, 4x32-bit, 2x64-bit integer operations and 8x16-bit, 4x32-bit, 2x64-bit floating-point operations. [https://www.arm.com/technologies/neon]
Next, ARM introduced Scalable Vector Extension (SVE) which is a vector extension for the A64 instruction set of the Armv8-A architecture. Armv9-A builds on SVE with the SVE2 extension. Unlike other SIMD architectures, SVE and SVE2 do not define the size of the vector registers, but constrain it to a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. The design of SVE and SVE2 guarantees that the same program can run on different implementations of the instruction set architecture without the need to recompile the code. [https://developer.arm.com/Architectures/Scalable Vector Extensions]
It is worth mentioning that the RISC-V vector extension (RVV) has adopted the same style as ARM SVE and SVE2: scalable vector registers whose size is unknown at compile time. This approach is meant to address the ISA fragmentation caused by the different vector sizes of the x86 extensions.
How can the "V" Vector extension be used? The RISC-V "V" Vector Extension allows CPUs to perform many operations simultaneously, making them faster and more efficient. Let's dive into this with the key components, including the registers and a selection of important instructions that are part of this extension.
The RISC-V Vector Extension (RVV) version 1.0 introduces a rich set of vector registers that can hold multiple data elements. It also provides instructions designed to perform parallel processing efficiently.
These registers are critical for performing vectorized operations.
1. Vector Registers (v0 - v31):
• There are 32 vector registers, each identified as v0 to v31.
• Each vector register can hold multiple elements, and the size of these elements can vary (e.g., 8-bit, 16-bit, 32-bit, 64-bit).
• The total number of elements that a vector register can hold is determined by the configured vector length.
2. Vector Configuration Registers:
• VLEN: The length of the vector registers in bits.
• SEW (Selected Element Width): Specifies the width of individual elements in the vector register (e.g., 8-bit, 16-bit, 32-bit, 64-bit).
• LMUL (Length Multiplier): Determines the grouping of vector registers to allow for different effective vector lengths.
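To make the interplay of VLEN, SEW, and LMUL concrete: the maximum number of elements a single vector instruction can process (VLMAX) is VLEN / SEW × LMUL. The helper below is purely illustrative, not part of any RVV API:

```c
#include <stddef.h>

/* Illustrative only: computes VLMAX = (VLEN / SEW) * LMUL, the maximum
 * number of elements one vector instruction can process for a given
 * configuration. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul) {
    return (vlen_bits / sew_bits) * lmul;
}
```

For example, a CPU with VLEN=128 processing 32-bit elements at LMUL=1 handles 4 elements per instruction; raising LMUL to 8 groups eight registers together and handles 32.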
These instructions configure the vector processing unit (VPU) to set up the vector length and element width.
These instructions are used to load data from memory into vector registers and store data from vector registers into memory.
The above operations can have multiple variants. Besides the simple load and store (vle32.v/vse32.v), it's possible to use strides (vlse32.v/vsse32.v) or segments (vlseg4e32.v/vsseg4e32.v), which makes it possible to load and store data not only in the SoA (Structure of Arrays) layout but also as AoS (Array of Structures).
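As an illustration of what a strided load does, the scalar loop below models gathering one field out of an interleaved AoS buffer; on RVV hardware, a single strided load instruction performs this gather in one step (with the stride given in bytes). The function name is made up for this sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a strided vector load: gather `count` elements that are
 * `stride` elements apart, e.g. one field of an Array-of-Structures buffer. */
static void strided_gather_u32(const uint32_t *base, size_t stride,
                               size_t count, uint32_t *dst) {
    for (size_t i = 0; i < count; ++i)
        dst[i] = base[i * stride];
}
```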
These instructions perform arithmetic operations on vector registers.
These instructions enable conditional operations on vector elements.
Adding arrays of 64-bit numbers is a laborious task with scalar code, because standard CPU registers are at most 64 bits wide, so the task can't be done in one step. Using the RISC-V "V" Vector Extension, we can break this task down and perform it more efficiently.
Here's a step-by-step guide to adding two arrays with 64-bit numbers using the RISC-V "V" Vector Extension:
1. Set up the Vector Registers: First, we need to configure the vector registers to handle arrays.
2. Load the Data: Load the arrays into the vector registers.
3. Perform the Addition: Use vector instructions to add the numbers.
4. Store the Result: Store the result back to memory.
Here is the assembly code for this:
.section .data
num1:   .quad 0x123456789ABCDEF0, 0x0FEDCBA987654321, 0x0011223344556677, 0x8899AABBCCDDEEFF
num2:   .quad 0x1122334455667788, 0x99AABBCCDDEEFF00, 0x1234567890ABCDEF, 0xFEDCBA9876543210
result: .space 32                  # Reserve 32 bytes (256 bits) for the result

.section .text
.globl _start
_start:
    # Set up vector registers
    li t0, 4                       # Set AVL (Application Vector Length) to 4 (256 bits / 64-bit elements)
    vsetvli t0, t0, e64            # Set vector length and element width to 64-bit

    # Load the arrays into vector registers
    la a0, num1
    vle64.v v0, (a0)               # Load the first array into vector register v0
    la a0, num2
    vle64.v v1, (a0)               # Load the second array into vector register v1

    # Perform the addition
    vadd.vv v2, v0, v1             # Add vector registers v0 and v1, store the result in v2

    # Store the result back to memory
    la a0, result
    vse64.v v2, (a0)               # Store the result from v2 to the memory location result

    # Exit (for demonstration purposes, assuming an environment that supports this)
    li a0, 0
    li a7, 93                      # ECALL number for exit
    ecall
1. Data Section: We define two 256-bit numbers (num1 and num2) as arrays of four 64-bit values. We also reserve space for the result.
2. Set up Vector Registers: We set the vector length to handle 64-bit elements and configure the vector registers accordingly.
3. Load Data: The vle64.v instruction loads 64-bit elements from memory into vector registers v0 and v1.
4. Perform Addition: The vadd.vv instruction adds the elements in vector registers v0 and v1 and stores the result in v2.
5. Store Result: The vse64.v instruction stores the 64-bit elements from vector register v2 back into memory at the location result.
To truly appreciate the power of the RISC-V "V" Vector Extension, let's compare the assembly code for adding two arrays with (example above) and without (example below) the usage of vector extensions.
When we don't use vector extensions, we need to handle each 64-bit part of the array individually. This requires multiple instructions and more steps to achieve the same result. Here is how we can write the assembly code without the usage of vector extensions:
.section .data
num1:   .quad 0x123456789ABCDEF0, 0x0FEDCBA987654321, 0x0011223344556677, 0x8899AABBCCDDEEFF
num2:   .quad 0x1122334455667788, 0x99AABBCCDDEEFF00, 0x1234567890ABCDEF, 0xFEDCBA9876543210
result: .space 32                  # Reserve 32 bytes (256 bits) for the result

.section .text
.globl _start
_start:
    # Load the address of the first array
    lui a1, %hi(num1)
    addi a1, a1, %lo(num1)
    # Load the address of the second array
    lui a2, %hi(num2)
    addi a2, a2, %lo(num2)
    # Load the address of the result buffer
    lui a3, %hi(result)
    addi a3, a3, %lo(result)

    # Add the first 64-bit parts
    ld t1, 0(a1)
    ld t2, 0(a2)
    add t3, t2, t1
    # Store the result
    sd t3, 0(a3)

    # Do the same for every remaining 64-bit word
    ld t1, 8(a1)
    ld t2, 8(a2)
    add t3, t2, t1
    sd t3, 8(a3)

    ld t1, 16(a1)
    ld t2, 16(a2)
    add t3, t2, t1
    sd t3, 16(a3)

    ld t1, 24(a1)
    ld t2, 24(a2)
    add t3, t2, t1
    sd t3, 24(a3)

    # Exit (for demonstration purposes, assuming an environment that supports this)
    li a0, 0
    li a7, 93                      # ECALL number for exit
    ecall
Let's now compare the two presented approaches:
1. Number of Instructions:
• Without Vector Extensions: The code is much longer because each 64-bit part of the arrays is handled separately. For each part, we need to load the numbers, add them, and store the result, repeating this four times.
• With Vector Extensions: The code is more compact. We only need to set up the vector registers once and then use a single vector addition instruction to perform the addition of all parts simultaneously.
2. Simplicity:
• Without Vector Extensions: The code is more complex and repetitive, making it harder to read and maintain.
• With Vector Extensions: The code is simpler and more elegant. It abstracts away the repetitive tasks, making it easier to understand and maintain.
3. Performance:
• Without Vector Extensions: The CPU has to execute more instructions, which can slow down the process, especially for large data sets.
• With Vector Extensions: The CPU can process multiple data elements in parallel, significantly speeding up the computation.
To turn our assembly code into a program that the CPU can run, we need to use some special tools. These tools are riscv64-unknown-elf-as and riscv64-unknown-elf-ld.
Those commands correspond to the toolchain binaries used for cross-compiling for a different architecture (RISC-V in this case) on x64/x86 machines. If you are compiling on a RISC-V machine, you can just use as and ld instead.
To install the toolchain, follow the manual at https://github.com/riscv-collab/riscv-gnu-toolchain, but don't forget to add --with-arch=rv64gcv wherever you can. Take a look also at Appendix A.
Let's see how we can use them to build our assembly code for adding 256-bit numbers.
1. Write the Assembly Code:
• First, write the assembly code in a file. Let's name it add_256bit.s.
2. Assemble the Code:
• Use the riscv64-unknown-elf-as utility to assemble the code. This will convert the assembly code into an object file.

riscv64-unknown-elf-as -march=rv64gcv -o add_256bit.o add_256bit.s

• This command will create an object file named add_256bit.o.
3. Link the Object File:
• Use the riscv64-unknown-elf-ld utility to link the object file and create an executable.

riscv64-unknown-elf-ld -o add_256bit add_256bit.o

• This command will create an executable file named add_256bit.
1. Assembler (riscv64-unknown-elf-as):
• The assembler reads the assembly code from add_256bit.s and converts it into machine code, generating an object file (add_256bit.o). The object file contains the binary representation of the instructions, but it is not yet ready to be executed on its own.
2. Linker (riscv64-unknown-elf-ld):
• The linker takes the object file (add_256bit.o) and links it with any necessary libraries or other object files to produce an executable (add_256bit). The linker resolves references to external symbols and assigns final memory addresses to the program's instructions and data.
To run the executable on an RV64 processor or an emulator, you would typically use a simulator like Spike, QEMU, or run it directly on RISC-V hardware. Here is the example with Spike, a functional RISC-V ISA simulator:
spike pk add_256bit
The RISC-V Vector Extension (RVV) provides a rich set of vector functions in GCC that allow developers to leverage the power of vector processing directly in C/C++ code. These functions are typically available through intrinsic functions, which are special functions provided by the compiler to generate specific machine instructions.
Vector intrinsics in GCC for RISC-V are designed to map directly to RVV instructions, providing a way to write vectorized code without resorting to assembly language. These intrinsics follow a naming convention that helps in understanding their functionality.
The naming convention for RISC-V vector intrinsics is generally as follows:
__riscv_<operation>_<operand format>_<element type><LMUL>

For example, __riscv_vadd_vv_u32m1 is a vector-vector (vv) addition on unsigned 32-bit elements (u32) with a length multiplier of 1 (m1).
Let's briefly describe some common RISC-V vector intrinsics with examples:
Vector Load/Store Instructions
These sample instructions are used to load data from memory into vector registers and store data from vector registers into memory.
Vector Arithmetic Instructions
These sample instructions perform arithmetic operations on vector registers.
Vector Masking Instructions
These sample instructions enable conditional operations on vector elements.
1. base (const uint8_t *base): Pointer to the base address in memory for load/store operations.
2. vl (size_t vl): Vector length, specifying the number of elements to process.
3. op1, op2, acc: Vector registers or scalars involved in the operation.
4. mask (vbool4_t mask): Optional mask register for conditional operations.
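To show what the mask parameter does, the scalar loop below models a masked vector add under the "undisturbed" policy: elements whose mask bit is zero keep their previous destination value. This is only a behavioral sketch with made-up names, not RVV code:

```c
#include <stddef.h>
#include <stdint.h>

/* Behavioral model of a masked vector add: where mask[i] is non-zero,
 * dst[i] = a[i] + b[i]; elsewhere dst[i] keeps its previous value. */
static void masked_add_u32(const uint32_t *a, const uint32_t *b,
                           const uint8_t *mask, uint32_t *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (mask[i])
            dst[i] = a[i] + b[i];
}
```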
Now, it’s time to write a C program that adds two 256-bit numbers using RISC-V vector intrinsics. Since 256-bit numbers are not natively supported by standard data types in C, we will treat them as arrays of smaller elements (e.g., 8-bit, 16-bit, or 32-bit integers). For this example, we will use 32-bit integers to represent the 256-bit numbers.
Step-by-Step Implementation
1. Include the necessary headers:
• riscv_vector.h for vector intrinsics.
• stdio.h for input/output functions.
2. Define the 256-bit numbers as arrays of 32-bit integers:
• Each 256-bit number will be represented by an array of 8 uint32_t elements.
3. Use vector intrinsics to load the numbers into vector registers:
• Use __riscv_vle32_v_u32m1 to load the 32-bit elements into vector registers.
4. Perform the addition using vector intrinsics:
• Use __riscv_vadd_vv_u32m1 to add the elements of the two vector registers.
5. Store the result back into an array:
• Use __riscv_vse32_v_u32m1 to store the result from the vector register back into memory.
Complete Code
#include <riscv_vector.h>
#include <stdint.h>
#include <stdio.h>

#define VECTOR_LENGTH 8  // Number of 32-bit elements in a 256-bit number

void add_256bit_numbers(const uint32_t *num1, const uint32_t *num2, uint32_t *result) {
    size_t processed = 0;
    while (processed < VECTOR_LENGTH) {
        // Set the vector length for 32-bit elements with a length multiplier of 1
        size_t vl = __riscv_vsetvl_e32m1(VECTOR_LENGTH - processed);
        // Load the 32-bit elements of num1 and num2 into vector registers
        vuint32m1_t vec_num1 = __riscv_vle32_v_u32m1(num1 + processed, vl);
        vuint32m1_t vec_num2 = __riscv_vle32_v_u32m1(num2 + processed, vl);
        // Add the elements of the two vector registers
        vuint32m1_t vec_result = __riscv_vadd_vv_u32m1(vec_num1, vec_num2, vl);
        // Store the result back into the result array
        __riscv_vse32_v_u32m1(result + processed, vec_result, vl);
        processed += vl;
    }
}

int main(void) {
    uint32_t num1[VECTOR_LENGTH] = {0x9ABCDEF0, 0x12345678, 0x87654321, 0x0FEDCBA9,
                                    0x44556677, 0x00112233, 0xCCDDEEFF, 0x8899AABB};
    uint32_t num2[VECTOR_LENGTH] = {0x55667788, 0x11223344, 0xDDEEFF00, 0x99AABBCC,
                                    0x90ABCDEF, 0x12345678, 0x76543210, 0xFEDCBA98};
    uint32_t result[VECTOR_LENGTH];

    add_256bit_numbers(num1, num2, result);

    printf("Result:");
    for (int i = VECTOR_LENGTH - 1; i >= 0; i--)
        printf(" %08x", result[i]);
    printf("\n");
    return 0;
}
Explanation
1. Define Constants:
• VECTOR_LENGTH is defined as 8 to represent the number of 32-bit elements in a 256-bit number.
2. add_256bit_numbers Function:
• This function takes two 256-bit numbers (num1 and num2) and stores their sum in result.
3. Set Vector Length:
• __riscv_vsetvl_e32m1 sets the vector length to handle 32-bit elements, with a length multiplier of 1.
4. Load Vector Registers:
• __riscv_vle32_v_u32m1 loads the 32-bit elements of num1 and num2 into vector registers vec_num1 and vec_num2.
5. Vector Addition:
• __riscv_vadd_vv_u32m1 adds the elements of vec_num1 and vec_num2 and stores the result in vec_result.
6. Store Result:
• __riscv_vse32_v_u32m1 stores the result from vec_result back into the result array.
7. Main Function:
• Defines two 256-bit numbers and an array to store the result.
• Calls add_256bit_numbers to add the two numbers.
• Prints the result.
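Since the intrinsics above only compile for RISC-V targets, a plain scalar reference implementation is handy for checking results on any host. The sketch below performs the same element-wise addition; note that, like the vector code, it does not propagate carries between the 32-bit elements:

```c
#include <stddef.h>
#include <stdint.h>

#define VECTOR_LENGTH 8  /* 256 bits / 32-bit elements */

/* Scalar reference for the vectorized addition: element-wise 32-bit adds,
 * with no carry propagation between elements. */
static void add_256bit_numbers_ref(const uint32_t *num1, const uint32_t *num2,
                                   uint32_t *result) {
    for (size_t i = 0; i < VECTOR_LENGTH; ++i)
        result[i] = num1[i] + num2[i];
}
```

Running both versions on the same inputs and comparing the output arrays is a simple way to validate the vectorized code on real hardware or in an emulator.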
Compilation To compile this code with RISC-V vector extension support, use the following command:
riscv64-unknown-elf-gcc -march=rv64gcv -mabi=lp64d -o add_256bit add_256bit.c
It is worth mentioning that this command in fact performs cross-compilation on an x64/x86 machine and produces a binary for the RISC-V architecture. However, you do not necessarily need to do that if you have a system running on a RISC-V capable platform like the Banana Pi or VisionFive 2. Then, the build command will not differ from any other compiler invocation, and you can use regular gcc or clang.
It is also worth mentioning that GCC vector extensions (also supported by Clang) can target RVV, so the same result can be achieved in a much simpler way with portable code and compiler optimizations:
#include <stdint.h>

// A 256-bit number represented as a GCC/Clang vector of eight 32-bit elements
typedef uint32_t u32x8 __attribute__((vector_size(32)));

void add_256bit_numbers(const u32x8 *num1, const u32x8 *num2, u32x8 *result) {
    // Element-wise addition; built with -march=rv64gcv, the compiler emits RVV instructions
    *result = *num1 + *num2;
}
SRPOL is currently working on porting the Chromium project to run on the RISC-V architecture. This year we have been focusing on enabling full features for the web/JavaScript engine, compatibility with other architectures, stability, and performance optimizations because, as we mentioned before, performance is very important to end users when browsing web pages, viewing streamed content, or running web applications. We used ARM as a reference and looked for NEON code across the whole Tizen Web Components code base and its dependencies, as Chromium uses various libraries to run and to present multimedia. As a result, we have a long list of modules and libraries into which vector-extension optimizations for RISC-V can be introduced – please see the examples below.
The above list includes just some examples; the full list is much longer. This shows the complexity and the amount of work which needs to be done on performance optimization, not only in Chromium itself but also in the many libraries on which it depends.
In the near future, we need Chromium to work with RVV code. The project depends on dozens of Open Source libraries which are crucial for excellent multimedia processing in the browser and the operating system. Gaining a performance increase requires modifications in various libraries, like ffmpeg, v8, libpng, pixman, or libjpeg-turbo. SRPOL already has some achievements in pixman, a library used in Chromium for composing and overlaying different layers, mixing them using the alpha channel, etc.: some algorithms have already been ported to RVV, e.g. the RGB565 to RGB888 conversion algorithm, which gained roughly a 10x speedup for the vector version compared to the scalar one. However, the engineers working on the Chromium project cannot do all the work by themselves. You can also be a part of this initiative, so please check the guide on how to optimize for RISC-V at https://gitlab.com/riseproject/riscv-optimization-guide/-/blob/main/riscv-optimization-guide.adoc, as this is a huge job for the whole industry. That is the reason why Samsung joined the RISC-V Software Ecosystem (RISE) project as a premier member in 2023.
Contributing to RISC-V vector-extension optimizations is not an easy job. First of all, Chromium and its dependencies pose many challenges: they are Open Source projects supporting many different architectures, each with its own path of development, where maintainers decide whether or not to merge submitted patch sets. On the other hand, there are still few RISC-V boards on the market with CPUs that support RVV, so it is harder for maintainers to test those patches.
As we are responsible for Chromium, we plan to keep an eye on RVV support in the libraries that are most important to us. Our plans for the near future are to port the reported ARM NEON usages in Chromium and its dependencies to the corresponding RISC-V RVV instructions, and to measure the performance gain in those specific regions. We keep our fingers crossed that actual numbers proving the performance increase, together with tests made on our side, will convince maintainers to merge our changes. If this is not possible, the other option could be to branch out from the main repository, or to postpone merges in some libraries. We will see what the future of RISC-V brings.