
Bringing RVV to Life: Overcoming Hardware Gaps in RISC-V Development

By Marek Pikuła, Samsung R&D Institute Poland

Introduction & Motivation

As one of the premier members of the RISC-V Software Ecosystem (RISE) project [1], Samsung participates in efforts to improve the quality and availability of software on the RISC-V platform. Samsung has demonstrated its commitment to RISC-V by making it a first-class platform for Tizen OS [2]. This effort includes providing open-source support for multiple RISC-V boards, as well as maintaining the compatible Linux kernels and system packages required to deploy Tizen OS on RISC-V-based products.

One of the goals of the System Libraries Work Group within RISE is to extend support for the RISC-V Vector (RVV) extension throughout the Linux software ecosystem, providing significant performance and power efficiency gains for selected computation-heavy libraries. These include audio and video codecs, graphical workloads, and vector-based computations such as AI inference. As the first porting effort, our team selected the pixman project, which is used in Cairo and Chromium. Pixman has well-compartmentalized SIMD code, with existing implementations for multiple well-known backends, including x86 and ARM64 SIMD extensions.

At the beginning of the project, the main issue was hardware availability. Even though RISC-V International ratified the RVV1.0 extension back in November 2021, no off-the-shelf RVV1.0-capable board was available by the start of 2024. The only option for developing software for this platform was QEMU emulation, which is helpful for checking the correctness of a given algorithm against its scalar variant. However, it doesn't provide any performance insights, as QEMU doesn't implement any microarchitectural model for the emulated vector instructions.

This motivated us to research alternative RVV-capable targets and to explore integrating RVV checks into the CI pipelines of the system libraries under our care. Both topics are discussed in the following sections.

FPGA-based RVV Target

Due to the lack of available hardware, our team researched alternative ways to support performance-focused RVV development. RISC-V is known for its vibrant open-source IP core ecosystem, with many companies and universities publishing full-featured RISC-V cores under permissive licenses. One such project is PULP Ara [5], an RVV1.0 coprocessor for CVA6 [6], a well-established, high-quality application-class RISC-V core.

The project had the following requirements:

1. Full RVV1.0 support, to enable software engineers to fully utilize the capabilities of the extension in the ported software.
2. Linux and MMU support, since the ported software is Linux-based.
3. Capability to run on an FPGA, to meet performance requirements and perform meaningful benchmarks within finite time.
4. Ease of use and deployment, as the goal is to provide software developers with a ready-made development and testing environment.
5. Option to adjust the microarchitecture, to benchmark code on low- to high-end configurations (e.g., adjustable VLEN, lane count, etc.).

Selecting the Hardware Platform

For the hardware platform, we selected AWS EC2 F1 cloud instances featuring powerful Xilinx VU9P FPGAs. Having a cloud-based hardware platform made it much easier to create a reproducible and easy-to-deploy environment for software engineers, who may not necessarily have expertise in digital design. It also allows for easy integration into CI pipelines in the future.

AWS provides a complete HDK (Hardware Development Kit) to integrate Custom Logic (CL) on the FPGA, but for this project, it was much easier to use FireSim [8], along with the Chipyard project. These frameworks provide an easy-to-use, "batteries included" environment for simulating arbitrary hardware designs (in this case, a complete SoC) at near-FPGA-prototype speeds in the cloud or on-premises. FireSim supports virtual network interfaces and block devices and includes the generic FireMarshal software configuration suite, all of which simplify the process of bootstrapping a new design by reducing the degrees of freedom to the simulated design itself.

Figure 1. System diagram for PULP Ara running within FireSim on AWS EC2 F1 instance

PULP Ara Integration

We used the existing CVA6 wrapper in Chipyard as a starting point for the PULP Ara integration. We configured Ara with two lanes and a VLEN of 2048 for the initial tests. This configuration synthesized at an 80 MHz core clock frequency, allowing us to boot a complete Linux Buildroot environment in under a minute. The complete design, including the AWS wrapper and FireSim logic, occupies about 30% of the logic resources available on the VU9P [11]. This resource headroom enables us to experiment with larger configurations and potentially multicore setups in the future.

Figure 2. Logic utilization diagram: yellow – Ara+CVA6, green – FireSim, orange – AWS SH

The porting effort involved adjusting the build scripts for Ara sources, creating an appropriate wrapper to connect Ara and CVA6 to the Chipyard system bus, and applying several patches to Ara to enable MMU support and resolve some outstanding hardware bugs. We plan to move these changes upstream in the coming months. For now, they reside in a collective project repository [12].

Benchmarking Results

To assess the RVV compatibility of the new target, we ran the rvv-bench test suite. Due to some hardware bugs in Ara, a few tests were unsuccessful (145/188 instruction tests passed), but this was enough to assess the performance of a couple of algorithms our team developed. One output from the RVV instruction test is a table showing the cycle count required to run a given vector instruction for different VLENs (vector register lengths in bits), SEWs (selected element widths), and LMULs (vector register group multipliers).
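The three parameters that index this table are tied together by the RVV 1.0 definition VLMAX = LMUL × VLEN / SEW, the maximum number of elements a single vector instruction can process. The following is a minimal Python sketch of that relation; the function name `vlmax` is ours for illustration and is not part of rvv-bench.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul=1):
    """Maximum elements per vector instruction: VLMAX = LMUL * VLEN / SEW."""
    if sew_bits not in (8, 16, 32, 64):
        raise ValueError("SEW must be 8, 16, 32, or 64 bits")
    # LMUL may be fractional (1/8, 1/4, 1/2) or integral (1, 2, 4, 8),
    # so the arithmetic is done with exact fractions.
    return int(Fraction(lmul) * vlen_bits / sew_bits)


# VLEN = 2048 is the Ara configuration described in this article;
# VLEN = 128 is included as a small configuration for contrast.
for vlen in (128, 2048):
    for sew in (8, 16, 32):
        print(f"VLEN={vlen} SEW={sew} LMUL=1 -> VLMAX={vlmax(vlen, sew)}")
```

With the same SEW and LMUL, a VLEN of 2048 holds 16 times as many elements per instruction as a VLEN of 128, which is one reason cycle counts have to be read per configuration.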

Table 1. Excerpt from cycle count statistics table for PULP Ara

This data allowed us to improve the performance of a couple of algorithms in pixman. For instance, we reduced the instruction count of an RGB565 to RGB888 conversion algorithm by 25% (resulting in a 21% improvement in cycle count), an exercise that also deepened our understanding of the platform. Overall, we observed an average 10× performance improvement of the vector implementation over its scalar counterpart on the PULP Ara platform. For comparison, the CanMV-K230 board, which was released later and has a single RVV core with a VLEN of 128, provides about a 2.5× speedup, highlighting the importance of VLEN size and system bus width.

Figure 3. Performance difference between scalar and vector implementation
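For readers unfamiliar with the RGB565 to RGB888 conversion mentioned above, the following is a scalar reference sketch in Python. It assumes the common bit-replication expansion, in which the top bits of each channel are copied into the vacated low bits. It is not the pixman RVV code, and the function name is hypothetical.

```python
def rgb565_to_rgb888(pixel):
    """Expand one 16-bit RGB565 pixel to an 8-bit-per-channel (r, g, b) tuple."""
    r5 = (pixel >> 11) & 0x1F  # top 5 bits
    g6 = (pixel >> 5) & 0x3F   # middle 6 bits
    b5 = pixel & 0x1F          # low 5 bits
    # Replicate the high bits into the low bits so that a full-scale input
    # (0x1F or 0x3F) maps to 0xFF rather than 0xF8 or 0xFC.
    r8 = (r5 << 3) | (r5 >> 2)
    g8 = (g6 << 2) | (g6 >> 4)
    b8 = (b5 << 3) | (b5 >> 2)
    return r8, g8, b8


print(rgb565_to_rgb888(0xFFFF))  # (255, 255, 255)
```

A vector implementation applies the same shifts and masks to many pixels per instruction, so every operation removed from this sequence is saved across the whole vector.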

Pixman CI Pipeline

In addition to RVV support, we were interested in improving the pixman project's CI pipeline. Previously, it tested only a single x86 backend, which was far from sufficient for a project of this complexity. As mentioned earlier, pixman supports multiple platforms and SIMD extensions (MMX, SSE2, SSSE3 for x86, ARMv6 SIMD, ARMv7/v8 Neon, Loongson MMI, PowerPC VMX, MIPS DSPr2, and the upcoming RISC-V RVV). As part of our RVV contribution, we developed a vastly improved CI workflow that benefits all supported platforms.

Our plan was to create a comprehensive pipeline to test all architectures and backends with both GNU and LLVM toolchains, including a collective coverage report. Since not all architectures have upstream support in Debian, our distro of choice for the pipeline, we conceived two types of targets:

1. "Code coverage" native target – generates a code coverage summary. It uses QEMU to run native Docker images with all optional dependencies, making it much easier to build and test projects with external dependencies than using a cross-build environment. This also allowed us to easily generate code coverage reports for those targets.
2. "Platform coverage" cross target – basic "smoke test" targets that show whether tests pass on platforms without official distro support. Here, pixman is cross-compiled, and tests are run under QEMU or Wine on an x86 Linux host.
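The two target types can be pictured as a job matrix over architectures and toolchains. The sketch below is illustrative only: the architecture lists, toolchain labels, and job-naming scheme are hypothetical placeholders and do not reproduce pixman's actual CI configuration.

```python
from itertools import product

# Hypothetical example lists; the real pipeline defines its own targets.
NATIVE_ARCHES = ["amd64", "arm64", "riscv64"]  # native images run via QEMU
CROSS_ARCHES = ["mips", "windows-x86"]         # cross-compiled smoke tests
TOOLCHAINS = ["gnu", "llvm"]


def build_jobs():
    """Return one job description per (architecture, toolchain) pair."""
    jobs = []
    for arch, toolchain in product(NATIVE_ARCHES, TOOLCHAINS):
        # "Code coverage" targets: full test run plus a coverage report.
        jobs.append({"name": f"coverage:{arch}:{toolchain}",
                     "kind": "native", "coverage": True})
    for arch, toolchain in product(CROSS_ARCHES, TOOLCHAINS):
        # "Platform coverage" targets: pass/fail smoke test only.
        jobs.append({"name": f"smoke:{arch}:{toolchain}",
                     "kind": "cross", "coverage": False})
    return jobs


for job in build_jobs():
    print(job["name"])
```

Enumerating every architecture with both toolchains is what makes toolchain-specific failures visible, since each backend is built and tested twice.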

Figure 4. Pixman CI pipeline flow

Our CI overhaul yielded significant insights for the project. We discovered that some targets fail to build or pass tests when pixman is built using the LLVM toolchain. By upstreaming the CI workflow [14], our team was able to iterate faster on the RVV implementation [15], as all checks are now performed automatically.

Summary

Adding software support for new RISC-V ISA extensions is a complex challenge that requires a deep understanding of the hardware-software ecosystem. As demonstrated by our work on the RISC-V Vector (RVV) extension, the lack of available hardware severely hampers the development process, particularly for performance-critical extensions. While emulation tools like QEMU can help verify the correctness of software implementations, they are inadequate for performance benchmarking and tuning, which are essential for fully exploiting ISA extensions like RVV.

To address this, we explored open-source RISC-V cores such as PULP Ara to create an FPGA-based RVV target, allowing us to perform meaningful performance benchmarks. By using a cloud-based FPGA solution on AWS EC2 F1 instances and leveraging frameworks like FireSim and Chipyard, we were able to create a reproducible development environment and perform initial benchmarking tests. These tests provided valuable insights into the performance impact of RVV on computation-heavy algorithms, such as those used in pixman, demonstrating up to a 10× improvement in performance over scalar implementations.

Furthermore, this project highlighted the importance of integrating new platforms into CI pipelines. Our overhaul of pixman's CI pipeline improved testing across multiple architectures and platforms, ensuring broader compatibility and speeding up the development cycle. In the future, the availability of reference hardware upon the ratification of ISA extensions will be critical for software developers to fully exploit the potential of emerging ISA extensions. Our experience underscores the importance of collaboration within the open-source ecosystem to bridge these hardware gaps and ensure software readiness as new extensions are ratified.

These topics were presented at the RISC-V Summit Europe 2024 in Munich and at ORConf 2024 in Gothenburg, Sweden. Here are the links to the recordings and slides:

1. Accelerating software development for emerging ISA extensions with cloud-based FPGAs: RVV case study – recording, slides, more information.
2. CI setup for multi-platform software project – recording, slides.

References

[1] https://research.samsung.com/news/Samsung-Electronics-Participates-in-RISE-an-Open-Source-Project-for-Enabling-Software-Ecosystem-of-RISC-V

[2] https://lf-rise.atlassian.net/wiki/spaces/HOME/pages/8589332/Tizen+RISC-V+Status

[3] https://github.com/riscv/riscv-v-spec/releases/tag/v1.0

[4] https://pixman.org

[5] https://github.com/pulp-platform/ara

[6] https://docs.openhwgroup.org/projects/cva6-user-manual/index.html

[7] https://aws.amazon.com/ec2/instance-types/f1

[8] https://fires.im

[9] https://chipyard.readthedocs.io/en/latest

[10] https://firemarshal.readthedocs.io/en/stable

[11] Total logic utilization: 31% LUT, 12% FF, 19% RAMB36, 5% URAM, 2% DSP.

[12] https://github.com/MarekPikula/RISC-V-Summit-Europe-ORConf-2024

[13] https://github.com/camel-cdr/rvv-bench

[14] https://gitlab.freedesktop.org/pixman/pixman/-/tree/master/.gitlab-ci.d

[15] https://gitlab.freedesktop.org/pixman/pixman/-/merge_requests/102