While I have done a lot of programming, including assembly language, I have always regarded CPUs as mysterious black boxes. This project delves into how they actually work.
ASICs (application specific ICs) and FPGAs (field programmable gate arrays) are intended to be an alternative to CPUs and microcontrollers, i.e., dedicated hardware that might do something better or faster than can be done by a CPU. But for hobbyists (like me) and academics, the ultimate FPGA project is building a CPU.
Designing a CPU is itself a big project, before you even consider how to construct it in an FPGA. What will the instruction set be? The data path width? The number of internal registers? How much RAM? What is the instruction width, and is it fixed or variable? How many clock cycles will an instruction take? How will program code be stored? And so on.
In this project, I am going to design and build a fairly simple CPU. I am building it on Alchitry's Au FPGA board. My CPU is roughly modeled after a CPU designed by Alchitry founder Justin Rajewski. Justin's CPU is 8 bits wide with a simple 4 bit op code and a limited instruction set. Mine is a little more complex, with a 16 bit data path and an expanded 5 bit op code. But I have retained much of the structure of Justin's design: a fixed width instruction (his is 16 bits, mine 24), the program stored in ROM rather than RAM, and all instructions kept simple and executable in a single clock cycle.
Design
There are many decisions to make in designing a CPU and they all seem to involve compromise. Every decision seems to have benefits but also downsides.
Let's start with the data path width. I am going with 16 bits. For internal registers, I decided to have eight. I call them D1 (destination), S1, S2 (sources), x1, x2, x3, x4, x5 (indexes). However, they are each a general purpose 16 bit register, completely interchangeable. So I need 3 bits in my instruction to specify a register.
I want to add several instructions that don't fit in a 4 bit op code, so I opted for a 5 bit op code. A generic instruction will involve an op code, a register, and a 16 bit number. That takes my instruction width up to 24 bits. Because my instruction width is fixed (desirable if we are storing instructions in ROM), whenever the instruction takes less than 24 bits, we need to add the extra "don't care" bits to bring the total to 24.
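To make the 24 bit arithmetic concrete, here is a small Python sketch of packing and unpacking one instruction word. The field widths (5 + 3 + 16) come from the text; the field order (op code in the top bits, then the register, then the 16 bit number) and the function names are assumptions for illustration, not taken from the project's Lucid source.

```python
OPCODE_BITS = 5    # 5 bit op code (from the text)
REG_BITS = 3       # 3 bits select one of the eight registers (from the text)
VALUE_BITS = 16    # 16 bit number (from the text)
WIDTH = OPCODE_BITS + REG_BITS + VALUE_BITS  # 24 bits in total


def encode(opcode, reg=0, value=0):
    """Pack the fields into one 24 bit word.

    Assumed layout: [opcode | reg | value], most significant bits first.
    Fields an instruction does not use are left at 0, which plays the
    role of the "don't care" padding bits.
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 5 bits")
    if not 0 <= reg < (1 << REG_BITS):
        raise ValueError("register does not fit in 3 bits")
    if not 0 <= value < (1 << VALUE_BITS):
        raise ValueError("value does not fit in 16 bits")
    return (opcode << (REG_BITS + VALUE_BITS)) | (reg << VALUE_BITS) | value


def decode(word):
    """Split a 24 bit word back into (opcode, reg, value)."""
    opcode = (word >> (REG_BITS + VALUE_BITS)) & ((1 << OPCODE_BITS) - 1)
    reg = (word >> VALUE_BITS) & ((1 << REG_BITS) - 1)
    value = word & ((1 << VALUE_BITS) - 1)
    return opcode, reg, value


word = encode(0b00011, 0b010, 0x1234)  # hypothetical op code 3, register 2
print(f"{word:06X}")                   # prints 1A1234
```

An instruction that only needs an op code still occupies the full 24 bits in this sketch; the register and value fields are simply zero.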
For RAM, I am going with 4096 16 bit numbers. I'd rather have it organized as 8 bit bytes, but doing so would require multiple instructions to get a 16 bit number loaded into a register. Using 16 bit numbers, it's much easier, though it still can't be done in a single clock cycle (more on this subject coming up shortly). I already have my addressing set up to handle 16 bit addresses, so I could have a lot more RAM if necessary, but 4096 numbers is plenty, as nothing I might do with this simple CPU will use much RAM.
My program counter is also 16 bits. I don't need that much, as anything I do here will be pretty basic. But one way that a big program counter may help is that it allows me to write subroutines and park them at some high address, where they could reside in different programs without having to move or modify them.
Design Evolution
Once I had the CPU designed, I started creating simple projects with it. With this experience, both the hardware and the instruction set evolved some. Here are some of the things I learned and changed as we progressed.
Input/Output
When I started, the only input and output was a single 8 bit IN port and a single 8 bit OUT port. I quickly realized I needed something more. I wanted serial communication, and that all by itself would use up the original ports. But it's easy to add more, because the Alchitry IO Element board already has 24 outputs to LEDs (3 banks of 8) and 24 inputs connected to DIP switches (3 banks of 8). So I decided to use those to create 3 additional 8 bit input ports and 3 additional 8 bit output ports.
We end up with the following defined I/O ports:
IN0, IN1, IN2, SER_IN
OUT0, OUT1, OUT2, SER_OUT
Program Counter Control
I started out with just 4 jump instructions: JMP unconditional, JPZ jump if zero, JPE jump if equal, JPL jump if less than. While you can accomplish almost anything with these, it's not always easy. I found myself constantly wanting JNE jump not equal. I also prefer the term Branch for conditional jumps.
So I now have BRZ, BNZ, BNE, and BLT for if 0, if not 0, if not equal, and if less than, respectively.
I also needed to make it easy to handle subroutines. So I added a small stack (only 8 addresses deep) and an instruction PSHI (push immediate) to push an address onto the stack. I then added an RTN (return from subroutine) instruction that pulls the address off the stack and jumps to that address.
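The subroutine mechanism can be sketched in Python as a small last-in, first-out stack. The depth of 8 and the PSHI/RTN behaviour come from the text; the class name and what happens on overflow or underflow (raising an error here) are assumptions, since the text does not say how the hardware behaves in those cases.

```python
class ReturnStack:
    """Software model of the 8 deep return address stack."""

    DEPTH = 8  # "only 8 addresses deep" (from the text)

    def __init__(self):
        self._slots = []

    def pshi(self, address):
        """PSHI: push an immediate 16 bit address onto the stack."""
        if len(self._slots) >= self.DEPTH:
            # Assumption: the real hardware may wrap or overwrite instead.
            raise OverflowError("return stack is full")
        self._slots.append(address & 0xFFFF)

    def rtn(self):
        """RTN: pull the most recent address; the CPU would jump to it."""
        if not self._slots:
            raise IndexError("return stack is empty")
        return self._slots.pop()


stack = ReturnStack()
stack.pshi(12)        # address to come back to after the subroutine
next_pc = stack.rtn() # next_pc is 12 again
```

Because the caller pushes the return address explicitly before jumping, the same subroutine can be reached from several places, and nested calls return in the reverse order of the pushes.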
Handling RAM
The Simple RAM module I have employed for random access memory requires that the address be loaded into the module one clock cycle before trying to read the content of that address. This turns out to be a major complication in this single clock cycle instruction system. It is further complicated because I don't normally want to be concerned with how instructions are sequenced. Usually I am not worried about speed and may space instructions out every 10 program counter steps, placing many clock cycles between instructions.
So I solved this issue by placing the RAM address in a register, where it stays until explicitly changed. I can then use a read or write to RAM instruction in any subsequent clock cycle. But that requires one more clock cycle between loading the address and reading its content, in order to latch the address into the register. So to read RAM we need three clocks: clock 1 loads the address, clock 2 latches the address, and clock 3 reads the content. The clock 2 step can be a NOP or just a skipped step in the program counter. This is not a solution I am totally happy with, but at least it works!
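The three clock read sequence can be illustrated with a small Python model. This is one plausible reading of the timing described above, not a translation of the simple_ram module: it assumes the RAM samples whatever address the register held before the clock edge, so the data only becomes valid one edge after the register itself has been updated. The class and attribute names are hypothetical.

```python
class RamModel:
    """Toy model: address register feeding a RAM with a one clock read delay."""

    def __init__(self, size=4096):
        self.mem = [0] * size   # 4096 words, as in the text
        self.addr_reg = 0       # CPU side register holding the RAM address
        self.read_data = 0      # RAM output register

    def clock(self, new_addr=None):
        """Advance one clock edge.

        The RAM samples the address the register held *before* the edge,
        then the register takes its new value (if an instruction set one).
        """
        self.read_data = self.mem[self.addr_reg]
        if new_addr is not None:
            self.addr_reg = new_addr % len(self.mem)


ram = RamModel()
ram.mem[5] = 99

ram.clock(new_addr=5)  # clock 1: instruction loads the address register
stale = ram.read_data  # still the content of the old address (0 here)
ram.clock()            # clock 2: NOP while the RAM sees the new address
fresh = ram.read_data  # clock 3: a read instruction can now use this (99)
```

In this model a read issued immediately after loading the address would pick up `stale`, which is why the NOP or skipped program counter step is needed.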
Serial Interface
It took me some time to decide what I wanted to do here. FPGAs have component modules for serial data communication, but they aren't designed to be run from my CPU. And they need to interrupt the CPU for many clock cycles to transmit or receive even one character!
An alternative would be to design serial transmit and receive hardware that could be controlled directly by my CPU, but that would involve a lot of extra work!
The solution I came up with is slightly unconventional. I have added a 5 bit operand to the NOP instruction. If that operand is 0, it is treated like a plain NOP (no operation) and doesn't do anything. But if it is 1-31, the processor stops and waits for that operand to be cleared. A peripheral device can then recognize that operand as its signal to do something. When it's finished, it clears the operand and the processor continues. This is not a very elegant solution, but it does allow me to use existing component hardware modules.
So I am adding serial receive and transmit modules and hooking them up to one set of my CPU's 8 bit input and output ports. To make this process of calling the serial modules a little more user friendly, I have added Call.SER_IN and Call.SER_OUT to the global constants table. They are used following a NOP instruction.
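The stall-on-NOP handshake described above can be modelled in a few lines of Python. The rule (operand 0 does nothing, 1-31 halts the processor until a peripheral clears it) is from the text; the class, the method names, and the peripheral code 3 used in the example are hypothetical, and the real values behind Call.SER_IN and Call.SER_OUT are not shown in the text.

```python
class StallModel:
    """Toy model of the NOP operand used as a peripheral request."""

    def __init__(self):
        self.pc = 0
        self.pending = 0  # 0 means no peripheral request is outstanding

    def issue_nop(self, operand):
        """Execute NOP with its 5 bit operand (0 = plain NOP)."""
        self.pending = operand & 0x1F

    def tick(self):
        """One clock: the program counter advances only when not stalled."""
        if self.pending == 0:
            self.pc += 1

    def peripheral_clear(self):
        """Called by the peripheral when it has finished its work."""
        self.pending = 0


cpu = StallModel()
cpu.issue_nop(0)        # plain NOP
cpu.tick()              # pc advances to 1
cpu.issue_nop(3)        # hypothetical peripheral code
cpu.tick()              # stalled, pc stays at 1
cpu.tick()              # still stalled
cpu.peripheral_clear()  # peripheral signals completion
cpu.tick()              # pc advances to 2
```

The point of the design is that the CPU needs no knowledge of how long the peripheral takes; it simply waits for as many clocks as the serial module needs.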
Handling Text
Once I started thinking about serial communication, I realized I had no efficient way to deal with text I might want to transmit. A ROM dedicated to storing text seemed like a reasonable addition. It only required one new instruction (LDR) to load characters from ROM.
The Hardware
Here is the final hardware configuration. With RAM, ROM, serial communication, and several additional I/O ports, I actually have a primitive 16 bit microcontroller!
Below is the final version of the instruction set, as implemented in the final two projects. (Earlier projects use very similar instructions, but they did evolve some as I learned from actually using them.) Note that xx = unused bits in the 24 bit instruction.
I am pretty happy with this instruction set. Given that all operations are single clock cycle, I am pleasantly surprised by what I can do in a single instruction. As stated earlier, the one exception is RAM: a read or write needs up to 3 clock cycles and two instructions.
How Does the CPU Work?
To really understand the inner workings of the CPU, you probably need to study the hardware description file "cpu.luc". But this will give you a general idea of what is going on.
The logic code above defines the inputs and outputs of the CPU. Most of them are the input and output ports.
The registers are defined by the code above. There are other registers as well, but these are the main CPU registers.
As shown in the code above, the 24 bit instruction is parsed into its various components: the opcode itself, the registers selected, a program counter address, other 16 bit values following a register, and the perfcode, which directs processing to peripheral hardware.
Then, as shown in the code below for the first few instructions, a case structure actually executes the specific op code.
HDLs (hardware description languages) are very powerful software tools. You mostly tell the software what you want the hardware to do and it builds the hardware for you, rather than making you literally design all the hardware. Nowhere is this more evident than in a CPU, which is an incredibly complex piece of hardware. Yet I can completely build my CPU in less than 150 lines of code!
Build
I am doing everything in the Lucid HDL. Lucid is a variation of the more common Verilog HDL. If you like Verilog better, look in the Work folder of any project and you can find all the Lucid code translated into Verilog. (Alchitry Labs does this conversion before it calls Vivado to actually program the FPGA.)
We are using Alchitry's Au FPGA board to build our project. To get set up to build this project, start with my Beginner's Guide to FPGAs, where in Part 1, we set up both hardware and software for the Alchitry Au FPGA board. We will also be using the Alchitry IO Element board that comes in the Alchitry Au FPGA Kit.
All of the code for the FPGA is available in the attached ZIP file. The complete CPU and peripherals are contained in each of the projects. As the CPU is programmable, the main thing that is different between the various projects is the code in the Instruction ROM module.
Here is a brief description of the various modules that make up each project. For additional information about each module, look at the module itself and the comments in each file:
au_top - this is where the various modules are all tied together and connected to the outside world. By outside world, I mean LEDs, a serial TTY terminal, buttons, i/o pins, etc.
cpu - as the name implies, this is the CPU itself, where registers are defined and the instruction set is executed.
global_constants - this is where we define our mnemonics, the short-form names we assign to instructions like LDI for load immediate and X1 for the first index register.
instRom - this is the ROM where we store our list of instructions or program.
textRom - this is a ROM where text can be stored for serial output.
simple_ram - a component module that configures our RAM memory.
uart_rx - a component module that creates a serial data receiver.
uart_tx - a component module that creates a serial data transmitter.
reset_conditioner - a component module that creates clean resets.
Coding / Creating Programs
Creating programs to run on this CPU is obviously a challenge, as we don't even have an assembler for it, let alone a high level language. The programs we do write will need to be fairly simple, and are all basically designed to test the functionality of our hardware.
If you look at the instruction ROM, you can see that the syntax is a little tedious with a lot of repetition. This can be overcome for the most part with cut and paste, so it's not a big problem. Our global constants table provides easy-to-remember names (mnemonics) for instructions, registers, ports, and calls to peripherals, so the coding is pretty much like assembly language.
The one exception is the program counter. An assembler lets you assign names to program counter locations and allows relative addressing. Here jumps and branches are explicit and are done manually - very tedious and prone to errors.
Another pitfall in this coding scheme is the need to add unused bits to bring each instruction to exactly 24 bits. They are not optional: the instruction will fail if it is not exactly 24 bits wide.
Writing programs is actually surprisingly easy with this scheme. Debugging programs can be difficult. Branching and the 24 bit fixed instruction length are two common sources of error, but another problem with these programs is that we are also still debugging the CPU's hardware. When something doesn't work, I frequently don't know if it is my program, or some new previously unnoticed hardware design flaw!
So each of the programs listed in the projects below involved some time to tweak and debug, in some cases finding a problem in the hardware instead of the program code.
The Projects
All the projects listed here are contained in the attached ZIP file. Each project is a complete stand alone system. In theory the only difference between the projects is the InstRom module that contains the program, but as discussed previously, as I progressed through these projects, I made some additions to the hardware and modifications to the instruction set. Only the last two projects on this list contain the final instruction set and all the hardware.
All projects were built using Alchitry Labs V1.2.7, and using the IO Element board in conjunction with the Alchitry Au FPGA board.
CPU16_simple_counter
This project was the first, designed to see if we had a working CPU. At this point, we only had single 8 bit input and output ports.
The main loop increments a register and puts the lower 8 bits of the result in the output port. I connected the IO board's LED Bank0 to the output port, so that the counter result shows up on the LEDs.
A subroutine contains a 1 second timer and is called after each increment of the counter, so our binary counter is counting seconds. It doesn't do anything else, but does demonstrate that we have a working CPU!
CPU16_timer
This project was also done as part of early testing. It is similar to simple_counter, but connects the input port to Bank0 of the DIP switches on the IO board. The counter is loaded with the binary number set on the switches and counts down to 0, forming a programmable seconds timer. All the LEDs are turned on when the timer "times out"!
CPU16_serial_test
This project is where we added serial send and receive modules and incorporated the peripheral call system. The program itself is trivial, simply getting a character from the serial input and echoing it back at the serial output. But it demonstrates a working serial interface!
To use the serial input and output, we need a host computer with a TTY terminal. Alchitry Labs has a built-in TTY terminal which I used, but a stand alone TTY terminal program would work as well. I have the serial interface set at 9600 baud, but we could change it to any other data rate.
CPU16_serial_greeting
This project implements a classic CPU demo using the serial interface. The CPU asks you your name, you input your Name, and the CPU responds with "Hello Name". In addition to testing the serial interface, it also needs to store the name in RAM, and read it back. Again, baud rate is 9600.
This program took a while to get working correctly, because it was the first time I actually tried writing and reading RAM. I needed to implement some hardware changes to get it working correctly.
Because our instructions are all single clock cycle, I needed to add an instruction to pre-load the RAM address. Also, as you can see in the program at pc 47, I had to add a NOP to give the CPU one cycle to latch the RAM address before trying to read the content. (This could also have been accomplished by just skipping one program counter address, though I wanted to expressly show the NOP here.)
CPU16_expanded_IO
This project is where we expanded the IO from single input and output ports to 4 of each, with one dedicated to serial communication and the others available for anything, but for now, connected to the DIP switches (input) and LEDs (output) on the Alchitry IO board.
For an initial test of this new IO structure, I just created another counter, this time sending the 8 low order bits through Out port 0 to LED bank0 and the upper 8 bits through Out port 1 to LED bank1. That gives us a counter that goes up to 65,535, and takes a while to get there. I changed the 1 second delay routine to 0.1 sec.
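Splitting the 16 bit count across two 8 bit ports is a mask and a shift. A minimal Python sketch of that step follows; the function name is hypothetical, and on the CPU itself this would be done with whatever shift and mask instructions the instruction set provides.

```python
def split_ports(counter):
    """Return (low_byte, high_byte) for a 16 bit counter value.

    low_byte would go to Out port 0 (LED bank0),
    high_byte to Out port 1 (LED bank1).
    """
    counter &= 0xFFFF              # the register is 16 bits, so it wraps at 65,535
    low_byte = counter & 0xFF      # 8 low order bits
    high_byte = (counter >> 8) & 0xFF  # upper 8 bits
    return low_byte, high_byte


low, high = split_ports(0x1234)    # low is 0x34, high is 0x12
```

At 0.1 seconds per step, running through all 65,536 values takes roughly 6,554 seconds, a little under two hours.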
CPU16_blink
This project is similar to the last one, just playing around with the new I/O ports. We are using three output ports to light the LEDs, this time flashing them in order from left to right, moving to the next LED every 0.25 seconds and repeating.
This is the first program to incorporate the final version of the hardware and instruction set.
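One way to think about the blink pattern is as a single lit bit stepping through 24 positions spread over three 8 bit ports. The Python sketch below shows that idea; the mapping of position 0 to bit 0 of the first port is an assumption, and which physical direction that corresponds to on the IO board is not stated in the text.

```python
def led_pattern(position):
    """Return (out0, out1, out2) with exactly one of 24 LEDs lit.

    Assumed mapping: positions 0-7 are on the first port,
    8-15 on the second, 16-23 on the third.
    """
    position %= 24                 # wrap around and repeat
    one_hot = 1 << position        # single set bit in a 24 bit value
    out0 = one_hot & 0xFF
    out1 = (one_hot >> 8) & 0xFF
    out2 = (one_hot >> 16) & 0xFF
    return out0, out1, out2


# One full sweep; on the hardware each step would be followed by
# the 0.25 second delay subroutine.
sweep = [led_pattern(step) for step in range(24)]
```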
CPU16_switch_count
Our final project is just another test of our expanded IO. This time we put all the DIP switches on the IO board to work as an input. The program is simple - we count how many switches are turned on and display the number in binary on the first LED bank. We repeat this process many times per second, so the count updates in real time as we open and close switches.
This program is only slightly more complicated than it sounds. We need to look at each of the 24 switches one at a time and add it to the count if it's turned on. This requires examining each bit in each of the three input ports. Even so, it's a very simple program.
This program also incorporates the final version of the hardware and instruction set.
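The switch counting logic, examining each bit of each of the three input ports, can be sketched in Python as follows. This mirrors the bit-by-bit approach described above rather than the actual instruction ROM contents; the function and parameter names are hypothetical.

```python
def count_switches(in0, in1, in2):
    """Count how many of the 24 switches are on, one bit at a time."""
    count = 0
    for port in (in0, in1, in2):       # the three 8 bit input ports
        for bit in range(8):           # examine each bit in the port
            if (port >> bit) & 1:      # switch is on
                count += 1
    return count


total = count_switches(0b00000101, 0x80, 0x01)  # four switches on
```

The result is at most 24, so it always fits in the 8 bits of the first LED bank.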
Where to Go from Here
I learned a lot about CPUs and computer hardware in general during this extended series of projects. If you actually read all this, I hope you did too! There are many potential directions I could take this from here. I am not sure what is next, or if I will do anything more.
One simple addition to the existing system would be to bring the I/O ports out on Alchitry's Br Element board, which has about 50 general purpose I/O pins. I could then create several more simple programs to interface with external hardware.
Another possibility is to build a new CPU with a variable length instruction where the program code resides in RAM. The only ROM based program would be a simple loader that loads an external program into RAM perhaps through the serial I/O. This would be a major redesign, but could still use the same instruction set. I would want to change the organization of the RAM to bytes, and instructions would be 1, 2, or 3 bytes long.
A system like that, with variable length instructions and RAM based program storage, would also make it possible to take a generic assembler, such as TASM, and configure it to assemble code for this new CPU.
If I wanted to tackle a really big project, I could attempt to create a C-like compiler for this second generation CPU. It would be a very limited subset of C, however!