Open Source

RISC-V and Vectorization

By Bartłomiej Kobierzyński, Samsung R&D Institute Poland

Introduction

In 2023, Samsung joined “RISE (RISC-V Software Ecosystem)” and became an official member of the project. Since then, our company has been participating in the development of a variety of projects and porting them to the RISC-V architecture. In this article, we will throw some light on one of them: Chromium. When porting software to a new architecture, the first step is always to make it start and run on the hardware, and the next step is stabilization, but as Amber Huffman, chairperson of the RISE project, emphasized: “In order for RISC-V to be commercialized, it is important to secure software that has performance, security, reliability, and compatibility”. For end users, performance is very important when browsing web pages, viewing streamed content, or running web applications. All of these activities have common ground: they present multimedia content, and vectorization is one of the ways to improve that performance.

RISC-V and Instruction Set Architecture (ISA)

RISC-V is an open standard instruction set architecture based on established reduced instruction set computer (RISC) principles. Moreover, RISC-V is offered under royalty-free open-source licenses, and the documents defining its ISA are offered under a Creative Commons license or a BSD license. That is one of the major differences between RISC-V and commercial processor vendors such as Arm Ltd. and MIPS Technologies.

And what is an instruction set architecture? It is a part of the abstract model of a computer that defines how the CPU is controlled by the software and acts as an interface between the hardware and the software, specifying both what the processor is capable of doing as well as how it gets done. The ISA defines the supported data types, the registers, how the hardware manages main memory, key features (such as virtual memory), which instructions a microprocessor can execute, and the input/output model of multiple ISA implementations. The ISA can be extended by adding instructions or other capabilities, or by adding support for larger addresses and data values. [https://www.arm.com/glossary/isa]

A short history of vector extensions

One of the first widely adopted vector (SIMD) extensions in consumer CPUs was MultiMedia eXtension (MMX), introduced by Intel in 1997. MMX defines eight processor registers, named MM0 through MM7, and operations that are performed on them. Each register is 64 bits wide and can be used to hold either a 64-bit integer or multiple smaller integers in a "packed" format: one instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. MMX handled only small integer values, like those representing shades of color in a picture, but it could process many of them at the same time.

The successor of MMX was Streaming SIMD Extensions (SSE) introduced by Intel in 1999. SSE contains 70 new instructions (65 unique mnemonics using 70 encodings), most of which work on single precision floating-point data. SIMD instructions can greatly increase the performance when exactly the same operations are to be performed on multiple data objects. SSE could work with floating-point numbers, which are numbers with decimals, like 3.14 or 2.718. This made it perfect for 3D graphics processing and scientific calculations. SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3 and SSE4. Because it supports floating-point math, it had a wider range of applications than MMX and became more popular.

Next, in 2011, Advanced Vector Extensions (AVX) were introduced by Intel. AVX uses sixteen 256-bit YMM registers to perform a single instruction on multiple pieces of data. Each YMM register can hold, and operate simultaneously on, eight 32-bit single-precision floating-point numbers or four 64-bit double-precision floating-point numbers. It was designed to help with extremely demanding tasks like high-definition video processing, advanced gaming graphics, and heavy scientific simulations. AVX also grew into its own family, with AVX2 and AVX-512, each more powerful than the last.

The Arm processor family also has a vector extension: the Advanced SIMD extension (also known as Neon or "MPE", Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardised acceleration for media and signal processing applications. Neon is included in all Cortex-A8 devices but is optional in Cortex-A9 devices. Neon can accelerate signal processing algorithms and functions to speed up applications such as audio and video processing, voice and facial recognition, computer vision, and deep learning. Neon instructions allow up to 16x8-bit, 8x16-bit, 4x32-bit, or 2x64-bit integer operations and 8x16-bit, 4x32-bit, or 2x64-bit floating-point operations. [https://www.arm.com/technologies/neon]

Next, Arm introduced the Scalable Vector Extension (SVE), a vector extension for the A64 instruction set of the Armv8-A architecture; Armv9-A builds on SVE with the SVE2 extension. Unlike other SIMD architectures, SVE and SVE2 do not define the size of the vector registers, but constrain it to a range of possible values, from a minimum of 128 bits up to a maximum of 2048 bits, in 128-bit increments. Therefore, any CPU vendor can implement the extension by choosing the vector register size that best suits the workloads the CPU is targeting. The design of SVE and SVE2 guarantees that the same program can run on different implementations of the instruction set architecture without the need to recompile the code. [https://developer.arm.com/Architectures/Scalable Vector Extensions]

It is worth mentioning that the RISC-V vector extension (RVV) has adopted the same style as Arm SVE and SVE2: scalable vector registers whose size is unknown at compile time. This approach is meant to avoid the ISA fragmentation caused by the different vector sizes of the x86 extensions.

Vector extensions in RISC-V

How can the "V" Vector extension be used? The RISC-V "V" Vector Extension allows CPUs to perform many operations simultaneously, making them faster and more efficient. Let's dive into its key components: the registers and a selection of the important instructions that are part of this extension.

Vector Registers

The RISC-V Vector Extension (RVV) version 1.0 introduces a set of vector registers that can each hold multiple data elements, together with instructions designed to process those elements in parallel efficiently.

These registers are critical for performing vectorized operations.

1. Vector Registers (v0 - v31):
• There are 32 vector registers, each identified as v0 to v31.
• Each vector register can hold multiple elements, and the size of these elements can vary (e.g., 8-bit, 16-bit, 32-bit, 64-bit).
• The total number of elements that a vector register can hold is determined by the register width (VLEN) and the selected element width (SEW).

2. Vector Configuration Registers and Parameters:
• VLEN: The width of each vector register in bits, fixed by the hardware implementation.
• SEW (Selected Element Width): Specifies the width of the individual elements in the vector register (e.g., 8-bit, 16-bit, 32-bit, 64-bit).
• LMUL (Length Multiplier): Determines how many vector registers are grouped together and operated on as one longer vector.
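For example, on an implementation whose vector registers are 128 bits wide (VLEN = 128), selecting SEW = 32 and LMUL = 2 groups two registers together, so a single vector instruction operates on 128 × 2 / 32 = 8 elements.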

Vector Configuration Instructions

These instructions configure the vector processing unit (VPU), setting up the vector length and the element width. The main ones are vsetvli, vsetivli, and vsetvl: they write the requested element width (SEW) and register grouping (LMUL) and return the number of elements that the following vector instructions will process.
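As a minimal illustrative sketch of what this configuration step looks like from C (using the GCC/Clang intrinsics covered later in this article; the function name below is made up for the example), a single intrinsic call compiles down to one vsetvli instruction:

#include <riscv_vector.h>
#include <stddef.h>

// Ask the hardware how many 32-bit elements (SEW = 32, LMUL = 1) it will
// process per vector instruction when we want to handle n elements in total.
size_t configure_vector_unit(size_t n) {
    size_t vl = __riscv_vsetvl_e32m1(n); // emitted as a single vsetvli
    return vl;
}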

Vector Load/Store Instructions

These instructions are used to load data from memory into vector registers and store data from vector registers into memory.

These operations come in multiple variants. Besides the simple unit-stride loads and stores (vle32.v/vse32.v), it is possible to use strided accesses (vlse32.v/vsse32.v) or segment accesses (vlseg4e32.v/vsseg4e32.v), which make it possible to load and store data not only in an SoA (Structure of Arrays) layout but also as AoS (Array of Structures).
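For instance, a strided load can gather every second 32-bit element of an array into one vector register, which is exactly what is needed to pull a single field out of an AoS layout. A minimal sketch using the C intrinsics introduced later in this article (the strided-load intrinsic name and its byte-stride argument are assumed from the RVV intrinsics specification):

#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// Copy the even-indexed 32-bit elements of src into dst (up to one vector's worth).
// The strided load reads one element every 2 * sizeof(uint32_t) = 8 bytes.
void copy_even_elements(const uint32_t *src, uint32_t *dst, size_t n) {
    size_t vl = __riscv_vsetvl_e32m1(n);
    vuint32m1_t v = __riscv_vlse32_v_u32m1(src, 2 * sizeof(uint32_t), vl);
    __riscv_vse32_v_u32m1(dst, v, vl);
}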

Vector Arithmetic Instructions

These instructions perform arithmetic operations on vector registers, for example vadd.vv, vsub.vv, and vmul.vv for element-wise addition, subtraction, and multiplication, or vmacc.vv for multiply-accumulate.

Vector Masking Instructions

These instructions enable conditional operations on vector elements. Comparison instructions such as vmseq.vv and vmslt.vx produce a mask in a vector register, and most vector instructions can then be executed under that mask (written as v0.t) so that only the selected elements are touched.

Example: Adding Multiple 64-bit Numbers

Adding two arrays of 64-bit numbers is normally a repetitive task: a scalar register holds only one 64-bit value at a time, so every pair of elements needs its own load, add, and store. Using the RISC-V "V" Vector Extension, we can break this task down and perform it far more efficiently.

Here's a step-by-step guide to adding two arrays with 64-bit numbers using the RISC-V "V" Vector Extension:

1. Set up the Vector Registers: First, we need to configure the vector registers to handle arrays.
2. Load the Data: Load the arrays into the vector registers.
3. Perform the Addition: Use vector instructions to add the numbers.
4. Store the Result: Store the result back to memory.

Here is the assembly code for this:

.section .data
num1: .quad 0x123456789ABCDEF0, 0x0FEDCBA987654321, 0x0011223344556677, 0x8899AABBCCDDEEFF
num2: .quad 0x1122334455667788, 0x99AABBCCDDEEFF00, 0x1234567890ABCDEF, 0xFEDCBA9876543210
result: .space 32 # Reserve 32 bytes (256 bits) for the result

.section .text
.globl _start
_start:
# Set up vector registers
li t0, 4 # Set AVL (Application Vector Length) to 4 (256 bits / 64-bit elements)
vsetvli t0, t0, e64, m1, ta, ma # Configure 64-bit elements, LMUL=1; t0 receives the granted length
# (assumes vector registers of at least 256 bits, so that all four elements fit in one register)

la a0, num1
# Load 256-bit numbers into vector registers
vle64.v v0, (a0) # Load the first array into vector register v0
la a0, num2
vle64.v v1, (a0) # Load the second array into vector register v1

# Perform the addition
vadd.vv v2, v0, v1 # Add vector register v0 and v1, store the result in vector register v2
# Store the result back to memory
la a0, result
vse64.v v2, (a0) # Store the result from vector register v2 to the memory location result

# Exit (for demonstration purposes, assuming an environment that supports this)
li a0, 0 # Exit code 0
li a7, 93 # ECALL number for exit
ecall

Explanation

1. Data Section: We define two 256-bit numbers (num1 and num2) as arrays of four 64-bit values. We also reserve space for the result.
2. Set up Vector Registers: We set the vector length to handle 64-bit elements and configure the vector registers accordingly.
3. Load Data: The vle64.v instruction loads 64-bit elements from memory into vector registers v0 and v1.
4. Perform Addition: The vadd.vv instruction adds the elements in vector registers v0 and v1 and stores the result in v2. Note that the addition is element-wise: carries are not propagated between the 64-bit parts.
5. Store Result: The vse64.v instruction stores the 64-bit elements from vector register v2 back into memory at the location result.

Comparing assembly code with and without Vector Extensions

To truly appreciate the power of the RISC-V "V" Vector Extension, let's compare the assembly code for adding two arrays with (example above) and without (example below) vector extensions.

When we don't use vector extensions, we need to handle each 64-bit part of the array individually. This requires multiple instructions and more steps to achieve the same result. Here is how the assembly code looks without vector extensions:

.section .data
num1: .quad 0x123456789ABCDEF0, 0x0FEDCBA987654321, 0x0011223344556677, 0x8899AABBCCDDEEFF
num2: .quad 0x1122334455667788, 0x99AABBCCDDEEFF00, 0x1234567890ABCDEF, 0xFEDCBA9876543210
result: .space 32 # Reserve 32 bytes (256 bits) for the result

.section .text
.globl _start
_start:

# Load the address of the first array
lui a1, %hi(num1)
addi a1, a1, %lo(num1)
# Load the address of the second array
lui a2, %hi(num2)
addi a2, a2, %lo(num2)

# Load the address of the result buffer
lui a3, %hi(result)
addi a3, a3, %lo(result)

# Add the first 64-bit parts
ld t1, 0(a1)
ld t2, 0(a2)
add t3, t2, t1

# Store the result
sd t3, 0(a3)

# Repeat for the remaining 64-bit words
ld t1, 8(a1)
ld t2, 8(a2)
add t3, t2, t1
sd t3, 8(a3)

ld t1, 16(a1)
ld t2, 16(a2)
add t3, t2, t1
sd t3, 16(a3)

ld t1, 24(a1)
ld t2, 24(a2)
add t3, t2, t1
sd t3, 24(a3)

# Exit (for demonstration purposes, assuming an environment that supports this)
li a0, 0
li a7, 93 # ECALL number for exit
ecall

Let's now compare the two presented approaches:

1. Number of Instructions:
Without Vector Extensions: The code is much longer because each 64-bit part of the arrays is handled separately. For each part, we need to load the numbers, add them, and store the result, repeating this four times.
With Vector Extensions: The code is more compact. We only need to set up the vector registers once and then use a single vector addition instruction to perform the addition of all parts simultaneously.
2. Simplicity:
Without Vector Extensions: The code is more complex and repetitive, making it harder to read and maintain.
With Vector Extensions: The code is simpler and more elegant. It abstracts away the repetitive tasks, making it easier to understand and maintain.
3. Performance:
Without Vector Extensions: The CPU has to execute more instructions, which can slow down the process, especially for large data sets.
With Vector Extensions: The CPU can process multiple data elements in parallel, significantly speeding up the computation.

Building the Assembly Code with RISC-V Tools

To turn our assembly code into a program that the CPU can run, we need to use some special tools. These tools are riscv64-unknown-elf-as and riscv64-unknown-elf-ld.

These are the toolchain binaries used when cross-compiling for a different architecture (RISC-V in this case) on an x86-64 machine. If you are compiling on a RISC-V machine, you can simply use as and ld instead.

To install the toolchain, follow the manual at https://github.com/riscv-collab/riscv-gnu-toolchain, but don't forget to configure it with --with-arch=rv64gcv so that the vector extension is enabled. Take a look also at Appendix A.

Let's see how we can use them to build our assembly code for adding 256-bit numbers.

Step-by-Step Guide

1. Write the Assembly Code:
• First, write the assembly code in a file. Let's name it add_256bit.s.
2. Assemble the Code:
• Use the riscv64-unknown-elf-as utility to assemble the code. This will convert the assembly code into an object file.
riscv64-unknown-elf-as -march=rv64gcv -o add_256bit.o add_256bit.s
• This command will create an object file named add_256bit.o.
3. Link the Object File:
• Use the riscv64-unknown-elf-ld utility to link the object file and create an executable.
riscv64-unknown-elf-ld -o add_256bit add_256bit.o
• This command will create an executable file named add_256bit.

Detailed Explanation

1. Assembler (riscv64-unknown-elf-as):
• The assembler reads the assembly code from add_256bit.s and converts it into machine code, generating an object file (add_256bit.o). The object file contains the binary representation of the instructions, but it is not yet ready to be executed on its own.
2. Linker (riscv64-unknown-elf-ld):
• The linker takes the object file (add_256bit.o) and links it with any necessary libraries or other object files to produce an executable (add_256bit). The linker resolves references to external symbols and assigns final memory addresses to the program's instructions and data.

Running the Executable

To run the executable, you would typically use a simulator such as Spike or QEMU, or run it directly on RISC-V hardware. Here is an example with Spike, a functional RISC-V ISA simulator, run together with the pk proxy kernel and with the vector extension enabled explicitly:

spike --isa=rv64gcv pk add_256bit
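Alternatively, a sufficiently recent QEMU can run the binary in user-mode emulation. The exact CPU properties vary between QEMU versions, so treat the following invocation as an assumption to adapt rather than a fixed recipe:

qemu-riscv64 -cpu rv64,v=true,vlen=128 ./add_256bit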

RISC-V Vector Functions in GCC

The RISC-V Vector Extension (RVV) provides a rich set of vector functions in GCC that allow developers to leverage the power of vector processing directly in C/C++ code. These functions are typically available through intrinsic functions, which are special functions provided by the compiler to generate specific machine instructions.

RISC-V Vector Intrinsics

Vector intrinsics in GCC for RISC-V are designed to map directly to RVV instructions, providing a way to write vectorized code without resorting to assembly language. These intrinsics follow a naming convention that helps in understanding their functionality.

Naming Convention

The naming convention for RISC-V vector intrinsics is generally as follows:

__riscv_<operation>_<operand types>_<element type><LMUL>[_<mask>]
1. <operation>: Describes the vector operation (e.g., vmseq for vector mask equal).
2. <operand types>: Indicates the types of the arguments (e.g., v for vector, f for float, x for scalar integer).
3. <element type>: Indicates the type of the elements (e.g., u8 for unsigned 8-bit integers).
4. <LMUL>: Specifies the vector register group multiplier (e.g., m2).
5. <mask>: Optionally indicates that the operation produces or uses a mask (e.g., b4 for a mask with a SEW/LMUL ratio of 4).

For example, __riscv_vmseq_vx_u8m2_b4 compares each unsigned 8-bit element of a register group (LMUL = 2) with a scalar value and produces a mask.

Intrinsics Examples

Let's briefly describe some common RISC-V vector intrinsics with examples:

Vector Load/Store Instructions
These intrinsics load data from memory into vector registers and store data from vector registers into memory, for example __riscv_vle32_v_u32m1 (unit-stride load) and __riscv_vse32_v_u32m1 (unit-stride store).

Vector Arithmetic Instructions
These intrinsics perform arithmetic operations on vector registers, for example __riscv_vadd_vv_u32m1 for element-wise addition of two vectors.

Vector Masking Instructions
These intrinsics enable conditional operations on vector elements: compare intrinsics such as the vmseq family produce a mask, and most operations have masked (_m) variants that act only on the selected elements.
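As a brief sketch of how masking is used in practice (the intrinsic names below follow the naming convention described above and are assumed from the RVV intrinsics specification, not taken from the examples in this article): a compare intrinsic produces a mask, and the masked (_m) variant of an operation then computes only the selected elements.

#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// In the lanes where a[i] equals key, compute a[i] + b[i]; what ends up in the
// unselected lanes is not defined by this basic (non-policy) variant of the intrinsic.
vuint32m1_t add_where_equal(vuint32m1_t a, vuint32m1_t b, uint32_t key, size_t vl) {
    // Compare each element of a with the scalar key; the result is a mask
    vbool32_t mask = __riscv_vmseq_vx_u32m1_b32(a, key, vl);
    // Masked addition: only the elements selected by the mask are added
    return __riscv_vadd_vv_u32m1_m(mask, a, b, vl);
}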

Understanding the Arguments

1. base (const uint8_t *base): Pointer to the base address in memory for load/store operations.
2. vl (size_t vl): Vector length, specifying the number of elements to process.
3. op1, op2, acc: Vector registers or scalars involved in the operation.
4. mask (vbool4_t mask): Optional mask register for conditional operations.

Example: Adding Two 256-bit Numbers Using RISC-V Vector Intrinsics

Now, it’s time to write a C program that adds two 256-bit numbers using RISC-V vector intrinsics. Since 256-bit numbers are not natively supported by standard data types in C, we will treat them as arrays of smaller elements (e.g., 8-bit, 16-bit, or 32-bit integers). For this example, we will use 32-bit integers to represent the 256-bit numbers.

Step-by-Step Implementation
1. Include the necessary headers:
• riscv_vector.h for the vector intrinsics.
• stdint.h for the fixed-width integer types.
• stdio.h for input/output functions.
2. Define the 256-bit numbers as arrays of 32-bit integers:
• Each 256-bit number will be represented by an array of 8 uint32_t elements.
3. Use vector intrinsics to load the numbers into vector registers:
• Use __riscv_vle32_v_u32m1 to load the 32-bit elements into vector registers.
4. Perform the addition using vector intrinsics:
• Use __riscv_vadd_vv_u32m1 to add the elements of the two vector registers.
5. Store the result back into an array:
• Use __riscv_vse32_v_u32m1 to store the result from the vector register back into memory.

Complete Code
#include <riscv_vector.h>
#include <stdio.h>
#include <stdint.h>

#define VECTOR_LENGTH 8 // Number of 32-bit elements to represent a 256-bit number

void add_256bit_numbers(const uint32_t *num1, const uint32_t *num2, uint32_t *result) {
    // Set the vector length and element width
    // (note: vl may come back smaller than VECTOR_LENGTH on hardware with
    // narrow vector registers; a strip-mining loop, shown later, handles that)
    size_t vl = __riscv_vsetvl_e32m1(VECTOR_LENGTH);
    // Load the 256-bit numbers into vector registers
    vuint32m1_t vec_num1 = __riscv_vle32_v_u32m1(num1, vl);
    vuint32m1_t vec_num2 = __riscv_vle32_v_u32m1(num2, vl);
    // Perform the addition
    vuint32m1_t vec_result = __riscv_vadd_vv_u32m1(vec_num1, vec_num2, vl);
    // Store the result back to memory
    __riscv_vse32_v_u32m1(result, vec_result, vl);
}

int main() {
    // Define two 256-bit numbers as arrays of 8 32-bit integers
    uint32_t num1[VECTOR_LENGTH] = {0x12345678, 0x9ABCDEF0, 0xFEDCBA98, 0x76543210, 0x0F1E2D3C, 0x4B5A6978, 0x11223344, 0x55667788};
    uint32_t num2[VECTOR_LENGTH] = {0x87654321, 0x0FEDCBA9, 0x12345678, 0x9ABCDEF0, 0x89ABCDEF, 0x12345678, 0x90ABCDEF, 0x12345678};
    uint32_t result[VECTOR_LENGTH] = {0};

    // Add the 256-bit numbers
    add_256bit_numbers(num1, num2, result);

    // Print the result
    printf("Result: ");
    for (int i = 0; i < VECTOR_LENGTH; i++) {
        printf("%08x ", result[i]);
    }
    printf("\n");
    return 0;
}

Explanation
1. Define Constants:
VECTOR_LENGTH is defined as 8 to represent the number of 32-bit elements in a 256-bit number.
2. add_256bit_numbers Function:
• This function takes two 256-bit numbers (num1 and num2) and stores their sum in result.
3. Set Vector Length:
vsetvl_e32m1 sets the vector length to handle 32-bit elements, with a length multiplier of 1.
4. Load Vector Registers:
vle32_v_u32m1 loads the 32-bit elements of num1 and num2 into vector registers vec_num1 and vec_num2.
5. Vector Addition:
vadd_vv_u32m1 adds the elements of vec_num1 and vec_num2 and stores the result in vec_result.
6. Store Result:
vse32_v_u32m1 stores the result from vec_result back into the result array.
7. Main Function:
• Defines two 256-bit numbers and an array to store the result.
• Calls add_256bit_numbers to add the two numbers.
• Prints the result.
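One caveat about the example above: a single call to __riscv_vsetvl_e32m1(VECTOR_LENGTH) covers all eight elements only on hardware whose vector registers are at least 256 bits wide. The usual portable pattern is a strip-mining loop, in which the value returned by the vsetvl intrinsic decides how many elements each iteration processes. A minimal sketch (the function name is illustrative):

#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// Element-wise addition of two uint32_t arrays of arbitrary length n.
// Each iteration processes as many elements as the hardware grants (vl).
void add_u32_arrays(const uint32_t *num1, const uint32_t *num2, uint32_t *result, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n); // elements handled in this iteration
        vuint32m1_t a = __riscv_vle32_v_u32m1(num1, vl);
        vuint32m1_t b = __riscv_vle32_v_u32m1(num2, vl);
        __riscv_vse32_v_u32m1(result, __riscv_vadd_vv_u32m1(a, b, vl), vl);
        num1 += vl;
        num2 += vl;
        result += vl;
        n -= vl;
    }
}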

Compilation
To compile this code with RISC-V vector extension support, use the following command:

riscv64-unknown-elf-gcc -march=rv64gcv -mabi=lp64d -o add_256bit add_256bit.c

It is worth mentioning that this command performs cross compilation on an x86-64 machine and produces a binary for the RISC-V architecture. However, you do not need to cross compile at all if you are working directly on RISC-V hardware such as a Banana Pi or a VisionFive 2; in that case the build command does not differ from any other compiler invocation, and you can use the regular gcc or clang.

Example: Adding two 256-bit numbers using GCC and clang support for RVV

It is also worth mentioning that GCC's generic vector extensions (also supported by Clang) can target RVV, so the same addition can be written in a much simpler, portable way and left to the compiler to vectorize:

#include <stdio.h>
#include <stdint.h>

#define VECTOR_LENGTH 8 // Number of 32-bit elements to represent a 256-bit number

typedef uint32_t uint32x8_t __attribute__ ((vector_size (sizeof(uint32_t) * VECTOR_LENGTH)));

int main() {
    uint32x8_t num1 = {0x12345678, 0x9ABCDEF0, 0xFEDCBA98, 0x76543210, 0x0F1E2D3C, 0x4B5A6978, 0x11223344, 0x55667788};
    uint32x8_t num2 = {0x87654321, 0x0FEDCBA9, 0x12345678, 0x9ABCDEF0, 0x89ABCDEF, 0x12345678, 0x90ABCDEF, 0x12345678};

    // Add the 256-bit numbers
    uint32x8_t result = num1 + num2;

    // Print the result
    printf("Result: ");
    for (int i = 0; i < VECTOR_LENGTH; i++) {
        printf("%08x ", result[i]);
    }
    printf("\n");

    return 0;
}
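With -march=rv64gcv and optimizations enabled, the compiler is free to lower this generic vector addition to the same kind of vsetvli / vle32.v / vadd.vv / vse32.v sequence written by hand earlier, so for simple element-wise operations neither intrinsics nor assembly are strictly necessary.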

Working on Chromium

SRPOL is currently working on porting the Chromium project so that it can run on the RISC-V architecture. This year we have been focusing on enabling full features for the web/JavaScript engine, compatibility with other architectures, stability, and performance optimizations, because, as mentioned before, performance is very important for end users when browsing web pages, viewing streamed content, or running web applications. We used ARM as a reference and searched for NEON code across the whole Tizen Web Components code base and its dependencies, as Chromium uses a wide range of libraries to run and to present multimedia. As a result, we have compiled a long list of modules and libraries in which vector-extension optimizations for RISC-V can be introduced.

That list contains far more entries than the examples mentioned in this article. It shows the complexity, and the amount of work, involved in performance optimization not only in Chromium itself but also in the many libraries on which it depends.

Contributions and further plans

In the near future, we need Chromium to work with RVV code. The project depends on dozens of Open Source libraries that are crucial for excellent multimedia processing in the browser and the operating system. Gaining a performance increase requires many modifications in various libraries, such as ffmpeg, v8, libpng, pixman, or libjpeg-turbo. SRPOL already has some achievements in pixman, a library used in Chromium for composing and overlaying different layers and blending them using the alpha channel: some algorithms have already been ported to RVV, e.g. the RGB565-to-RGB888 conversion algorithm, where the vector version gained roughly a 10x speedup over the scalar one. However, the engineers working on the Chromium project cannot do all the work by themselves. You can also be a part of this initiative, so please check the guide on how to optimize for RISC-V at https://gitlab.com/riseproject/riscv-optimization-guide/-/blob/main/riscv-optimization-guide.adoc, as this is a huge job for the whole industry. This is also the reason why Samsung joined the RISC-V Software Ecosystem (RISE) project as a premier member in 2023.

Contributing RISC-V vector-extension optimizations is not an easy job. First of all, Chromium and its dependencies raise a lot of issues: they are Open Source projects that support many different architectures and have their own development paths, in which the maintainers decide whether or not to merge the submitted patch sets. On the other hand, there are still not many RISC-V boards with a CPU that supports RVV on the market, so it is harder for maintainers to test those patches.

As we are responsible for Chromium, we plan to keep an eye on RVV support in the libraries that matter most to us. Our plans for the near future are to port the identified ARM NEON usage in Chromium and its dependencies to the corresponding RISC-V RVV instructions and to measure the performance gain in those specific regions. We keep our fingers crossed that the actual numbers proving the performance increase, together with the tests made on our side, will convince maintainers to merge our changes. If that turns out not to be possible, the other options are to branch out from the main repositories or to postpone merges in some libraries. We will see what the future of RISC-V brings.