A RISC-V emulator in Zig, Part 1: Instruction Decoding
Introduction
About a week ago, I took it upon myself to implement an RV64IM (64-bit RISC-V with the multiplication and division extension) emulator in Zig, with the goal of eventually running no-MMU Linux. Just as in a physical CPU, the first job of any CPU emulator is the tricky one of decoding instructions. Thankfully, Zig made this much easier than it would have been had I written the emulator in C.
In this post, I will not be going over the entire RISC-V specification, nor describing the instruction set in great detail. If you want to learn about that, I recommend reading the specification itself. It’s free.
Instruction Formats
RISC-V has six main instruction formats. There are others, but we won’t worry about them right now. They are the R, I, S, B, U, and J types. R-type instructions have three register fields (rd, rs1, rs2) for one destination and two source registers, along with two fields (funct3 and funct7) that specify “subfunctions” of the opcode. I-type instructions have two register fields (rd and rs1) for the destination and source registers, along with an immediate field and a funct3 field to specify the subfunction. I won’t describe them all here, but check out this excellent reference to quickly see how they’re laid out.
Parsing
Thankfully, in the base specification, all RISC-V instructions are 32 bits wide, and the opcode is always contained in the lower 7 bits. In addition, the lowest two bits of every 32-bit instruction are always 11 (other values mark 16-bit compressed instructions), so they carry no opcode information. In Zig, it’s trivial to snip off the lower two bits using the >> operator and then pick out the remaining 5 bits with a simple &. Then we can use switch to match opcodes with functions that decode the rest of the instruction.
    pub fn decode32(raw: u32) anyerror!Instruction {
        // Bits 1:0 are skipped; bits 6:2 identify the opcode.
        const opcode = (raw >> 2) & 0b11111;
        return switch (opcode) {
            0b01100 => decodeAluOp(raw),
            0b00100 => decodeAluImmOp(raw),
            0b00000 => decodeLoad(raw),
            0b01000 => decodeStore(raw),
            0b11000 => decodeBranch(raw),
            0b11011 => init(.jal, JType.decode(raw)),
            ...
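As a sanity check on the bit arithmetic, here is a small sketch that runs two real encodings through the same expression. The helper name opcodeBits is made up for this example and isn't part of the emulator; 0x002081B3 is the encoding of add x3, x1, x2 and 0x00500093 is addi x1, x0, 5.

```zig
const std = @import("std");

// Hypothetical helper: the same expression decode32 uses, pulled out on its own.
fn opcodeBits(raw: u32) u32 {
    // Drop the two low bits, then keep the next five.
    return (raw >> 2) & 0b11111;
}

test "opcode bits select the right decoder arm" {
    const add_raw: u32 = 0x002081B3; // add x3, x1, x2 (7-bit opcode 0b0110011)
    const addi_raw: u32 = 0x00500093; // addi x1, x0, 5 (7-bit opcode 0b0010011)

    // The low two bits of a 32-bit instruction are both set.
    try std.testing.expect((add_raw & 0b11) == 0b11);

    try std.testing.expect(opcodeBits(add_raw) == 0b01100); // decodeAluOp arm
    try std.testing.expect(opcodeBits(addi_raw) == 0b00100); // decodeAluImmOp arm
}
```

Running this with zig test should pass, which confirms that the 5-bit values in the switch are the upper five bits of the usual 7-bit opcodes.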
Of course, getting the opcode is just the first step in parsing instructions. One of the first things you’ll realize is that in RISC-V, each opcode corresponds to exactly one instruction format. This greatly simplifies implementation.
R-Type Instructions: ADD, SUB, etc.
We’ll take a brief look at the R-type instruction format with ADD, SUB, XOR, and similar instructions.
    bits = 32
    f3 = funct3

     7      5    5    3  5    7
    +--------------------------------+
    |funct7|rs2 |rs1 |f3|rd  |opcode |
    +--------------------------------+
Here’s how we’ll decode those instructions. The actual function of the instruction is determined by the funct3 and sometimes funct7 fields.
    fn decodeAluOp(raw: u32) !Instruction {
        // Combine funct7 (bits 31:25) and funct3 (bits 14:12) into one key.
        const funct7and3 = tup(raw >> 25, (raw >> 12) & 0b111);
        return switch (funct7and3) {
            tup(0b0000000, 0b000) => init(.add, RType.decode(raw)),
            tup(0b0100000, 0b000) => init(.sub, RType.decode(raw)),
            tup(0b0000000, 0b001) => init(.sll, RType.decode(raw)),
            tup(0b0000000, 0b010) => init(.slt, RType.decode(raw)),
            ...
You’ll probably notice the strange tup function. Zig doesn’t have Rust-style pattern matching, so in order to keep the code simple, we need a workaround. Simply put, it takes two u32s and creates a single u64 that can be compared. The result is an easy way to match funct7 and funct3 at the same time. The init function creates an instance of Instruction with the appropriate tag set to the decoded parameters.
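The post doesn't show tup itself, so the following is only a guess at one way to write it that matches the description: the first value goes in the upper 32 bits of a u64 and the second in the lower 32. The classify function and its string results are invented purely to exercise the key in a switch.

```zig
const std = @import("std");

// Hypothetical stand-in for tup: pack two u32 values into one u64 key.
fn tup(a: u32, b: u32) u64 {
    return (@as(u64, a) << 32) | b;
}

// Illustration only: name an R-type ALU instruction from its funct7/funct3 pair.
fn classify(raw: u32) []const u8 {
    const key = tup(raw >> 25, (raw >> 12) & 0b111);
    return switch (key) {
        // Switch cases must be comptime-known, so these calls run at compile time.
        tup(0b0000000, 0b000) => "add",
        tup(0b0100000, 0b000) => "sub",
        else => "other",
    };
}

test "add and sub differ only in funct7" {
    try std.testing.expect(std.mem.eql(u8, classify(0x002081B3), "add")); // add x3, x1, x2
    try std.testing.expect(std.mem.eql(u8, classify(0x402081B3), "sub")); // sub x3, x1, x2
}
```

Because both halves of the key are kept in separate bit ranges, two different (funct7, funct3) pairs can never collide, which is what makes a plain integer switch safe here.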
Let’s explore the RType struct now. In Zig, a structure representing the parameters of an R-type instruction is laid out as follows:
    pub const RType = struct {
        rs2: u5,
        rs1: u5,
        rd: u5,

        pub fn decode(raw: u32) RType {
            return .{
                .rs2 = @intCast(u5, (raw >> 20) & 0x1f),
                .rs1 = @intCast(u5, (raw >> 15) & 0x1f),
                .rd = @intCast(u5, (raw >> 7) & 0x1f),
            };
        }
    };
The decode function uses the usual bit shifts and AND operations to pick out the bits corresponding to each parameter. Since the other fields (opcode, funct3, and funct7) were already processed, we don’t bother to include them here.
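To see the shifts and masks at work, the sketch below pulls the three register numbers out of add x3, x1, x2 (0x002081B3). It is not the emulator's RType: the Fields struct and rTypeFields are hypothetical, and the values are kept as u32 so the example needs no narrowing casts.

```zig
const std = @import("std");

// Hypothetical struct for this example; the real RType stores u5 fields.
const Fields = struct { rd: u32, rs1: u32, rs2: u32 };

fn rTypeFields(raw: u32) Fields {
    return Fields{
        .rd = (raw >> 7) & 0x1f, // bits 11:7
        .rs1 = (raw >> 15) & 0x1f, // bits 19:15
        .rs2 = (raw >> 20) & 0x1f, // bits 24:20
    };
}

test "register fields of add x3, x1, x2" {
    const f = rTypeFields(0x002081B3);
    try std.testing.expect(f.rd == 3);
    try std.testing.expect(f.rs1 == 1);
    try std.testing.expect(f.rs2 == 2);
}
```

Each mask is 0x1f because a register number is five bits wide, enough to address the 32 integer registers.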
I-Type Instructions: ADDI and Friends
    bits = 32
    f3 = funct3

     12          5    3  5    7
    +--------------------------------+
    |imm        |rs1 |f3|rd  |opcode |
    +--------------------------------+
I-type instructions carry a 12-bit immediate in place of funct7 and the second source register. Now we’ll take a look at how one of these instructions is decoded.
    fn decodeAluImmOp(raw: u32) !Instruction {
        // funct3 lives in bits 14:12.
        const funct3 = (raw >> 12) & 0b111;
        return switch (funct3) {
            0b000 => init(.addi, IType.decode(raw)),
            ...
Because there’s no funct7 field, we don’t need to use tup. We switch directly on funct3, which determines which function is to be carried out. In this case, a value of 0 means ADDI, and we decode the parameters into an IType struct. What does that struct look like?
    pub const IType = struct {
        imm: u12,
        rs1: u5,
        rd: u5,

        pub fn decode(raw: u32) IType {
            return .{
                .imm = @intCast(u12, (raw >> 20)),
                .rs1 = @intCast(u5, (raw >> 15) & 0x1f),
                .rd = @intCast(u5, (raw >> 7) & 0x1f),
            };
        }

        pub inline fn immSigned(self: IType) i12 {
            return @bitCast(i12, self.imm);
        }
    };
It’s simply more of the usual bit operations. You’ll notice that the upper twelve bits are taken for the immediate: raw >> 20 discards the lower 20 of the 32 bits, leaving only 12. You should also notice the immSigned function. This function makes sign extension easier for us to do during execution. If you aren’t familiar with it, sign extension is the practice of taking the most significant bit of a smaller signed integer and repeating it to fill the unused bits of a larger signed integer. Sign extension of immediates is required by the RISC-V specification, and the resulting signed integer from immSigned can be sign extended with a simple @intCast.
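To make the effect of sign extension concrete, here is a small sketch that does it with plain arithmetic instead of the immSigned and @intCast route described above. signExtend12 is a hypothetical helper, not part of the emulator; it only exists to show which 64-bit value each 12-bit pattern should turn into.

```zig
const std = @import("std");

// Hypothetical helper: sign-extend a 12-bit immediate to 64 bits.
fn signExtend12(imm: u12) i64 {
    const wide: i64 = imm; // zero-extended first: 0..4095
    // If bit 11 (the sign bit) is set, the value represents imm - 4096.
    return if ((imm & 0x800) != 0) wide - 0x1000 else wide;
}

test "12-bit immediates become signed 64-bit values" {
    try std.testing.expect(signExtend12(0x005) == 5); // addi x1, x0, 5
    try std.testing.expect(signExtend12(0x7FF) == 2047); // largest positive
    try std.testing.expect(signExtend12(0xFFF) == -1); // addi x1, x0, -1
    try std.testing.expect(signExtend12(0x800) == -2048); // most negative
}
```

Whatever route the emulator takes, it should agree with these values: an immediate of 0xFFF in an ADDI must add -1, not 4095.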
Anatomy of a Decoded Instruction
The Instruction type was briefly mentioned above. It’s a tagged union, which allows for easy switching based on which instruction it contains.
    pub const Instruction = union(enum(u8)) {
        add: RType,
        sub: RType,
        xor: RType,
        @"or": RType,
        @"and": RType,
        ...
In this union, a specific instruction is paired with the structure corresponding to its format. That means later on, we can access the parameters of an instruction using something like decoded_instruction.add.rs1, which gets the first source register of the instruction, assuming the instruction decoded is ADD. In a future post, we’ll take a look at how these instructions are actually executed.
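The payload access described above can be sketched with a cut-down union. Everything here is a simplified stand-in for illustration: the two-member Instruction, the bare RType and IType structs, and the destination function are assumptions, not the emulator's actual definitions.

```zig
const std = @import("std");

// Simplified stand-ins for the post's types, without their decode functions.
const RType = struct { rs2: u5, rs1: u5, rd: u5 };
const IType = struct { imm: u12, rs1: u5, rd: u5 };

const Instruction = union(enum) {
    add: RType,
    addi: IType,
};

// Hypothetical example: return the destination register of any instruction.
fn destination(inst: Instruction) u5 {
    return switch (inst) {
        // Each arm captures the payload that belongs to its tag.
        .add => |params| params.rd,
        .addi => |params| params.rd,
    };
}

test "accessing the payload of a tagged union" {
    const inst = Instruction{ .add = .{ .rs2 = 2, .rs1 = 1, .rd = 3 } };

    // Direct field access is only valid when the active tag really is .add.
    try std.testing.expect(inst.add.rs1 == 1);
    try std.testing.expect(destination(inst) == 3);
}
```

Switching with a capture is the safer of the two styles, since the compiler checks that every tag is handled and that each payload is used with the right type.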
Code
The source code for the full instruction decoder is available on GitHub.