3.5. Gemmini

The Gemmini project is developing a systolic-array based matrix multiplication unit generator for the investigation of software/hardware implications of such integrated SoC accelerators. It is inspired by recent trends in machine learning accelerators for edge and mobile SoCs.

Gemmini is implemented as a RoCC accelerator with non-standard RISC-V custom instructions. The Gemmini unit uses the RoCC port of a Rocket or BOOM tile, and by default connects to the memory system through the System Bus (i.e., directly to the L2 cache).

To add a Gemmini unit to an SoC, you should add the gemmini.DefaultGemminiConfig config fragment to the SoC configurations. To change the configuration of the Gemmini accelerator unit, you can write a custom configuration to replace the DefaultGemminiConfig, which you can view under generators/gemmini/src/main/scala/configs.scala to see the possible configuration parameters.

The example Chipyard config includes the following example SoC configuration which includes Gemmini:

class GemminiRocketConfig extends Config(
  new chipyard.iobinders.WithUARTAdapter ++
  new chipyard.iobinders.WithTieOffInterrupts ++
  new chipyard.iobinders.WithBlackBoxSimMem ++
  new chipyard.iobinders.WithTiedOffDebug ++
  new chipyard.iobinders.WithSimSerial ++
  new testchipip.WithTSI ++
  new chipyard.config.WithBootROM ++
  new chipyard.config.WithUART ++
  new chipyard.config.WithL2TLBs(1024) ++
  new gemmini.DefaultGemminiConfig ++                        // use Gemmini systolic array GEMM accelerator
  new freechips.rocketchip.subsystem.WithNoMMIOPort ++
  new freechips.rocketchip.subsystem.WithNoSlavePort ++
  new freechips.rocketchip.subsystem.WithInclusiveCache ++
  new freechips.rocketchip.subsystem.WithNExtTopInterrupts(0) ++
  new freechips.rocketchip.subsystem.WithNBigCores(1) ++
  new freechips.rocketchip.subsystem.WithCoherentBusTopology ++
  new freechips.rocketchip.system.BaseConfig)

To build a simulation of this example Chipyard config, run the following commands:

cd sims/verilator # or "cd sims/vcs"
make CONFIG=GemminiRocketConfig

3.5.1. Generator Parameters

Major parameters of interest include:

  • Systolic array dimensions (tileRows, tileColumns, meshRows, meshColumns): The systolic array is composed of a 2-level hierarchy, in which each tile is fully combinational, while a mesh of tiles has pipeline registers between each tile.
  • Dataflow parameters (dataflow): Determine whether the systolic array in Gemmini is output-stationary or weight-stationary, or whether it supports both dataflows so that programmers may choose between them at runtime.
  • Scratchpad and accumulator memory parameters (sp_banks, sp_capacity, acc_capacity): Determine the properties of the Gemmini scratchpad memory: overall capacity of the scratchpad or accumulators (in KiB), and the number of banks the scratchpad is divided into.
  • Type parameters (inputType, outputType, accType): Determine the data-types flowing through different parts of a Gemmini accelerator. For example, inputType may be an 8-bit fixed-point number, while accType, which determines the type of partial accumulations in a matrix multiplication, may be a 32-bit integer. outputType only determines the type of the data passed between two processing elements (PEs); for example, an 8-bit multiplication may produce a 16-bit result which must be shared between PEs in a systolic array. If your datatype is a floating-point number, then you might also want to change the pe_latency parameter, which specifies how many shift registers to add inside the PEs. This might be necessary if your datatype cannot complete a multiply-accumulate operation within a single cycle.
  • Access-execute queue parameters (ld_queue_length, st_queue_length, ex_queue_length, rob_entries): To implement access-execute decoupling, a Gemmini accelerator has a load instruction queue, a store instruction queue, and an execute instruction queue. The relative sizes of these queue determine the level of access-execute decoupling. Gemmini also implements a reorder buffer (ROB) - the number of entries in the ROB determines possible dependency management limitations.
  • DMA parameters (dma_maxbytes, dma_buswidth, mem_pipeline): Gemmini implements a DMA to move data from main memory to the Gemmini scratchpad, and from the Gemmini accumulators to main memory. The size of these DMA transactions is determined by the DMA parameters. These DMA parameters are tightly coupled with Rocket Chip SoC system parameters: in particular dma_buswidth is associated with the SystemBusKey beatBytes parameter, and dma_maxbytes is associated with cacheblockbytes Rocket Chip parameters.

There are also optional features, which can be either enabled or left out of Gemmini at elaboration-time. For example:

Scaling during “move-in” operations (mvin_scale_args, mvin_scale_acc_args): When data is being moved in from DRAM or main memory into Gemmini’s local scratchpad memory, it can optionally be multiplied by a scaling factor. These parameters specify what the datatype of the scaling factor is, and how the scaling is actually done. If these are set to None, then this optional feature will be disabled at elaboration time. If both the scratchpad inputs are accumulator inputs are to be scaled in the same say, then the mvin_scale_shared parameter can be set to true so that the multipliers and functional units are shared.

3.5.2. Gemmini Software

The Gemmini non-standard ISA extension is specified in the Gemmini repository. The ISA includes configuration instructions, data movement instructions (from main memory to the Gemmini scratchpad, and from the Gemmini accumulators to main memory), and matrix multiplication execution instructions.

Since Gemmini instructions are not exposed through the GNU binutils assembler, several C macros are provided in order to construct the instruction encodings to call these instructions.

The Gemmini generator includes a C matrix multiplication library which wraps the calls to the custom Gemmini instructions. The software directory of the generator (within the generator repository in generators/gemmini/software) includes the aforementioned library and macros, as well as bare-metal tests, and some FireMarshal workloads to run the tests in a Linux environment. In particular, the matrix multiplication C library can be found in the generators/gemmini/software/gemmini-rocc-tests/include/gemmini.h file.

The Gemmini generator generates a C header file based on the generator parameters. This header files gets compiled together with the matrix multiplication library to tune library performance. The generated header file can be found under generators/gemmini/software/gemmini-rocc-tests/include/gemmini_params.h

Gemmini can also be used to run ONNX-specified neural-networks through a port of Microsoft’s ONNX-Runtime framework. The port is included as the onnxruntime-riscv repository submoduled in the software directory. The port is under development, and usage documentation can be found within its repository. Build and Run Gemmini Tests

To build Gemmini tests:

cd generators/gemmini/software/gemmini-rocc-tests/

Afterwards, the test binaries will be found in generators/gemmini/software/gemmini-rocc-tests/build. Binaries whose names end in -baremetal are meant to be run in a bare-metal environment, while binaries whose names end in -linux are meant to run in a Linux environment. You can run the tests either on a cycle-accurate RTL simulator, or on a (much faster) functional ISA simulator called Spike.

The Gemmini generator implements a custom non-standard version of Spike. This implementation is found within the esp-tools Spike implementation, together with the Hwacha vector accelerator non-standard ISA-extension. In order to use this version of Spike, please make sure to build the esp-tools software toolchain, as described in Building a Toolchain.

In order to run Spike with the gemmini functional model, you will need to use the --extension=gemmini flag. For example:

spike --extension=gemmini <some/gemmini/baremetal/test>

Spike is built by default without a commit log. However, if you would like to add detailed functional log of gemmini operation to the spike model, you can rebuild spike manually (based on the instructions in the esp-tools/riscv-isa-sim/README file), with the --enable-gemminicommitlog option added to the configure step.

3.5.3. Alternative SoC Configs

The Gemmini generator includes additional alternative SoC configs (configs that are not in the Chipyard example project). If you would like to build one of these alternative SoC configurations which are defined in within the Gemmini project repository, you can run the following commands. These commands are similar to the one required when building a simulation from the example project, but they specify that the location of the configs are in the Gemmini subproject, as opposed to the Chipyard example project:

cd sims/verilator # or "cd sims/vcs"
make CONFIG=GemminiAcceleratorConfig CONFIG_PACKAGE=gemmini MODEL_PACKAGE=freechips.rocketchip.system GENERATOR_PACKAGE=freechips.rocketchip.system TOP=ExampleRocketSystem