Foreword

17 years ago, I picked up my first Lego brick. 14 years ago, I built my first K’nex tower. 10 years ago, I wrote my first line of code and blinked my first LED. 6 years ago, I flashed my first FPGA.

These events happened years apart but are somehow part of the same journey; each one a critical piece in shaping how I became an engineer. In many ways, going back in time is like peeling back the layers of abstraction in a computer system, digging deeper and deeper until all that remains is the very thing that makes things tick.

For the longest time, I’ve enjoyed teaching others and passing on the little wisdom I have. Few have spoken better on the value of teaching than Richard Feynman, who neatly encapsulated the motivation for this course (and many others) when he said: “If you want to master something, teach it.” It is in this spirit that I decided to apply my understanding of ASIC design to a very topical matter and assemble this course. My hope is that, by addressing AI, we can open yet another avenue for hardware engineers to enter the field.

I’m deeply grateful for all the support and mentorship I’ve received from everyone along my engineering journey: teachers, family, friends and work colleagues included. Without them, I never would’ve stepped into the incredibly rewarding world of chip design.

About the author

Matias Wang Silva is a hardware engineer working on custom silicon at Raspberry Pi. He holds a BA and MEng from the University of Cambridge in Electrical and Information Engineering.

He is passionate about teaching curious young minds about engineering topics and has a long history of experience in tutoring.

Introduction

Hardware design is in vogue. If the 2010s were about bringing us great software and the cloud, then the 2020s are firmly about the lower layers of the stack: the bare metal. Semiconductor companies like NVIDIA, Intel, and AMD continue to push the limits of technological innovation in the world of AI, spurred on by massive memory and compute demands.

Make no mistake, these are not software companies. According to some, software engineers will evolve to become AI supervisors, merely inspecting and validating the output of AI agents. Hardware engineers, on the other hand, must understand physics, computer architecture and electronics in order to create; this skill set can never be replaced, only augmented.

Course objective

The objective of this course is to use neural network inference as a vehicle for learning about chip design. The end goal is a fully functional FPGA-based inference engine that accelerates, in essence, matrix multiplication. This should set you on a good path to understanding the fundamental hardware requirements and constraints of artificial intelligence computations.

Learning outcomes

  • Understand the tradeoffs involved in hardware design
  • Understand the interaction between the layers of abstraction in computer architecture
  • Improve skills in Python, C and SystemVerilog
  • Gain hands-on experience with FPGAs, HDLs and Python-based verification
  • Gain intuition about machine learning computations and the mathematical operations underpinning them

Deliverable

You will build an inference accelerator on an FPGA. Inference is the process of extracting useful output from a pre-trained neural network, given a set of inputs. This process can be slow and wasteful of compute power when run on a CPU. Using FPGAs, we can build hardware accelerators that are custom-tailored to this particular type of computation.

Your neural network will be a digit classifier, the ‘Hello World’ of the AI world. A digit classifier takes an image as input and outputs 10 numbers, each representing the probability that the image corresponds to a particular digit.

You’ll also write a report that conforms to the CREST Gold guidelines.

Prerequisites

This course will not teach you machine learning, nor will it teach you how to program. The more you already know about the topics below, the less confused you will be and the more you will get out of the course.

  • Math: linear algebra, in particular matrix multiplication, number systems including binary counting, functions
  • Computing: computer architecture, two’s complement, boolean algebra, memory, bit manipulation
  • Programming: conditionals and control flow, loops, variables, data types, compilation and linking, logical operators
  • Electronics: voltage and current, logic gates, transistors, circuits
  • Machine learning: feed-forward fully connected networks, convolutional networks, backpropagation, gradient descent, weights, activation functions, cost functions
  • Software: command line familiarity, basic operating systems knowledge, Git

No machine learning or SystemVerilog knowledge is required; learning material will be provided.

Housekeeping

Work area

All teaching work will take place on a shared server, which students can access at any time. This shared server will host all the tools required to complete the course, including:

  • cocotb for Python-based RTL verification
  • Icarus Verilog simulator
  • Verilator simulator
  • yosys for synthesis
  • Gowin (FPGA vendor) tools

Editor

We will use the VSCode Editor with the Remote SSH plugin to work on files hosted on the shared server.

Course structure

There are 12 contact hours allocated to this project. To meet the CREST Gold guidelines, students are expected to complete a further 58 hours of independent work.

The material has been portioned into 12 sessions, but note that the last few sessions will place emphasis on building and problem solving. In broad strokes, we will learn about:

  • training neural networks with PyTorch
  • digital design basics
  • FPGAs
  • machine learning computations

Schedule

We will meet once every weekend, with occasional breaks given for longer term work. Your supervisor will let you know in advance when this happens.

There will be homework set on an ad-hoc basis.

Date  | Duration (hrs) | Time (GMT+0)
------|----------------|-------------
21/12 | 1              | 8:30
27/12 | 1              | 9:30
3/1   | 1.5            | 9:30
18/1  | 1.5            | 9:00
25/1  | 1              | 9:00
1/2   | 1.5            | 9:00
8/2   | 1              | 9:00
14/2  | 1.5            | 9:30
21/2  | 1              | 9:30
28/2  | 1              | 9:30

Teaching style

While I try to approach all topics from the most intuitive, ‘first-principles’ perspective, that approach might not always work for you. I therefore highly encourage you to ask questions throughout the course and to supplement the course material with some of the resources linked on this site. Remember, this course is not self-contained; it’s merely an introduction to the weird and wonderful world(s) of digital design and machine learning. I encourage you to go off-piste!

A few more tips:

  • Do your homework! I’ve deliberately crafted homework to be direct and short. The more you put in, the more you will get out of it.
  • Do more beyond your homework! You’ll find that the real learning happens when you stretch beyond the scope of the problems I’ve given you. Try something new, break something, extend something, and then tell me about it and share it with the class!
  • The classroom is your friend. Imagine sitting in a room with a group of like-minded people who share your interests. Sounds great, right? Well, that’s exactly what our classroom is! Please share ideas, thoughts, questions and more with the group. Nothing is worth feeling embarrassed about.
  • Embrace discomfort. I will introduce ideas you may never have heard of, and it will all seem quite complicated at first. This is fine and intentional. I find that learning works best when you’re thrown in the deep end, made to swim in the ocean as it were.

And lastly, I will repeat: do not be afraid to ask questions! Interrupt me, please. You may have heard this several times before, but I’ll say it again: if you’ve got a question, chances are someone else has the same one. Do us all a favor and ask it!

This website

This website is meant to be a digital companion for the course, not a replacement. If you are a beginner to digital design, it is highly unlikely you will be able to complete the course relying solely on this website.

My explanations are not comprehensive; you should always refer to a textbook for a full description of each topic. The information I give you is motivated by:

  1. Helping you get to those “aha!” moments
  2. Filling in the intuition gaps that textbooks miss
  3. Relevance to our course and what we’re trying to build

I’ve organized the topics and information on each page in accordance with the above guidelines. The goal is that each heading topic motivates the next, in a cascade effect.

A glossary of terms is included at the end of each session to help you in report writing and to avoid any use of imprecise language.

Session 1: Understanding chip design

Have you ever wondered what powers the millions of electronic devices around you? Which electrical component kickstarted the digital revolution? Starting with transistors, then gates, logic blocks and finally entire chips, the semiconductor industry is an amazing feat of humanity. Chips, or Application Specific Integrated Circuits (ASICs), are hidden everywhere: in Huawei cell towers, in Samsung RAM memory controllers, and in Apple’s M-series chips. They are ubiquitous, yet nearly always forgotten and overlooked.

The basics

So, how’s it all done then? First, let’s get some terminology sorted. Chip design is also known as IC design, IC being short for integrated circuit. You’ll see ASIC thrown around as well, which means the same thing and expands to application-specific integrated circuit.

The key piece of intuition here is to grasp the layers of abstraction involved. Very roughly, the layers are:

  1. Electrons
  2. Transistors (pn junctions)
  3. Logic gates
  4. Flops and logic primitives
  5. Functional blocks, e.g. adders
  6. Processors/cores
  7. Systems-on-chip (SoCs)
  8. Firmware / bare metal code
  9. Operating Systems
  10. Userland applications
  11. Web applications

The second piece of intuition is that it’s all just circuits. We’re designing circuits with logic gates that carry out a particular logic function. This could be anything: for a CPU, it might be an instruction decoder that generates control signals for the rest of the pipeline; for a hardware accelerator like ours, it’s a robust matrix multiplier (matmul).

This brings us nicely to FPGAs. FPGAs are extremely useful in hardware design. They allow us to, among other things, execute these logic functions at a much higher speed than in simulation. While ASICs and FPGAs provide us with two different ways of executing these functions, underneath it all, they are still the same logic functions.

The waterfall model

Tools

Session 2: Digital logic primer

Now that we understand what we’re building and roughly how we’re going to do it, let’s dig into some of the fundamentals of digital design.

Binary arithmetic

The binary counting system, or base 2, is much like the decimal counting system we use every day. As humans, we are no strangers to non-decimal counting systems: we count our minutes and hours in the sexagesimal (base 60) system, a notion that originated with the ancient Sumerians. A counting system simply determines how we represent a number, and it is the binary representation that is particularly suitable for computer operations. That’s all there is to it!

The key ideas in binary arithmetic applicable to digital design are:

  1. There are two states: 1 and 0
  2. All numbers have an associated bit width: a 4-bit number holds 4 bits and can therefore represent 2⁴ = 16 distinct values. The same is true in decimal, but we conveniently ignore it in basic arithmetic since the limitation rarely matters.
  3. When a number overflows (e.g. the result of 1111 + 1), we roll back to 0. You can think of this as truncated addition; see below.
  4. 1 + 1 = 0 (with a carry out of 1)!
       2³  2²  2¹  2⁰
     ┌───┬───┬───┬───┐
       1   1   1   1    (15)
    +  0   0   0   1    (+1)
     ├───┼───┼───┼───┤
   1 │ 0 │ 0 │ 0 │ 0 │  (0, overflow!)
     └─┬─┴───┴───┴───┘
       │
       └─ Overflow bit (truncated)
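If you want to convince yourself of this roll-over behavior, a quick way is to model truncated 4-bit addition in Python (just a sketch; real hardware does this with carry logic rather than a mask):

    BITS = 4
    MASK = (1 << BITS) - 1  # 0b1111

    def add_4bit(a: int, b: int) -> int:
        """Add two 4-bit numbers and truncate the result back to 4 bits."""
        return (a + b) & MASK

    print(format(add_4bit(0b1111, 0b0001), "04b"))  # 0000 -> 15 + 1 rolls back to 0
    print(format(add_4bit(0b1010, 0b0111), "04b"))  # 0001 -> 10 + 7 = 17 wraps to 1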

Boolean functions

Boolean algebra is a close sister of binary counting, mainly because the 1 and 0 states map nicely onto true and false. True is commonly associated with 1 and false with 0. In boolean algebra, we define a few fundamental operations:

  1. AND: a.b
  2. OR: a+b
  3. XOR: a^b
  4. NOT: !a

There are a few more, but these are just negations of the above, like NAND !(a.b). I’ve been a bit careless with my syntax above, since I’ve tried to stick to the mathematical notation used for those operations. In software and HDLs, we use a slightly different mixture of symbols to represent the above.

For each operation, we can define a truth table. These enumerate the outputs for all the possible inputs. You’ll notice all operations save for NOT have 2 operands (inputs). Here’s the truth table for AND as an example:

A | B | A AND B
--|---|--------
0 | 0 |   0
0 | 1 |   0
1 | 0 |   0
1 | 1 |   1

Tip

Try writing out the truth tables for all of the boolean operations above.
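If you’d like to check your answers, you can enumerate the tables in Python. Note that the operator symbols differ from the mathematical notation above; this is the ‘different mixture of symbols’ used in software:

    # Bitwise operators on single bits: & is AND, | is OR, ^ is XOR.
    # NOT is written here as (1 - a) so the result stays in {0, 1}.
    ops = {
        "AND": lambda a, b: a & b,
        "OR":  lambda a, b: a | b,
        "XOR": lambda a, b: a ^ b,
    }

    for name, op in ops.items():
        print(f"A | B | A {name} B")
        for a in (0, 1):
            for b in (0, 1):
                print(f"{a} | {b} |    {op(a, b)}")
        print()

    print("A | NOT A")
    for a in (0, 1):
        print(f"{a} |   {1 - a}")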

Signed numbers

As computers started to proliferate, we needed a way to represent negative numbers. Remember, all we’ve got is 1s and 0s; you can’t just “add a minus sign” to the start of the number. Eventually, we standardized on a method called two’s complement. There are three ideas we need in order to apply two’s complement in our circuits:

  1. The most significant bit (ie. the leftmost bit) is the sign bit.
  2. If the sign bit is set, we subtract the number associated with that power of two from the remaining bits.
  3. To negate a number, we invert all the bits and add 1.

Let’s take an example:

  2³  2²  2¹  2⁰
┌───┬───┬───┬───┐
│ 1 │ 0 │ 0 │ 1 │
└───┴───┴───┴───┘

Interpreting this as an unsigned binary number, we’ve got 9 (8 + 1). If we interpret it as signed, this becomes -8 + 1 = -7.

For negation, we have:

 2³  2²  2¹  2⁰
 0   1   1   1   Original: 7
 1   0   0   0   Inverted
 1   0   0   1   +1 = -7
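The same three rules translate directly into a few lines of Python (a sketch for 4-bit values; Python integers have unlimited width, so we apply the bit width ourselves):

    BITS = 4

    def to_signed(x: int) -> int:
        """Interpret a 4-bit pattern as a two's complement number (rules 1 and 2)."""
        if x & (1 << (BITS - 1)):        # sign bit set?
            return x - (1 << BITS)       # e.g. 0b1001 -> 9 - 16 = -7
        return x

    def negate(x: int) -> int:
        """Negate by inverting all the bits and adding 1, truncated to 4 bits (rule 3)."""
        return (~x + 1) & ((1 << BITS) - 1)

    print(to_signed(0b1001))              # -7
    print(format(negate(0b0111), "04b"))  # 1001, i.e. -7 again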

What is digital logic?

Digital logic is an abstraction built upon fundamental units called logic gates. These gates are themselves made up of switching transistors (and sometimes resistors) arranged in special patterns to achieve the respective boolean functions. There is no single strictly correct implementation of a gate; the “best” one varies according to the specific power, area and performance requirements of the design. Here, we have a generic NOR and AND gate implementation:

The notion of digital is itself part of this transistor-level abstraction. Digital implies binary: 0 or 1, HIGH or LOW, 0V or 5V. However, MOSFETs and BJTs¹ can take any allowed voltage within their rated values, which is inherently analog. So, to recap: gates are digital logic components that are themselves made of “analog” transistors. Of course, the transistors inside gates are chosen for properties that make them particularly suitable for this role, like their voltage transfer characteristics.

Digital logic won over analog logic for various reasons that mostly stem from the fact that having only two allowed states makes circuits easier to design and understand, and therefore lets us turn up the level of complexity of our circuits.

All logic gates implement one boolean function. These are circuits, after all, so it’s useful to have symbols for them:

You might imagine a simple logic function like (A AND B) OR NOT C can be represented as such:

Tip

Feel free to mess about with the circuit above!

Combinational logic

Combinational logic is formed from boolean functions whose outputs are fully determined by their current inputs. You can achieve quite complex operations using just combinational logic and clever tricks to manipulate bits. These operations execute in ‘zero time’, i.e. in under one clock cycle.

At the lowest level, combinational circuits are built from simple primitives such as NAND gates (in ASIC designs) or LUTs (in FPGA designs). We’ll get to these later; for now, you just need an awareness that how your logic functions are implemented will vary depending on the platform (ASIC or FPGA).

Note

Logic here means the collection of boolean functions that our circuit will implement. It can also refer to the gates, to the different digital components that you might build out of gates, to the HDL code you write and so on. It’s a general term for the thing that your circuit will do.

Sequential logic

So far, we’ve seen that we can evaluate arbitrary boolean functions using logic gates. However, any meaningful and useful computing system needs a bit more; in particular, it needs to store previous values so it can use them in future computations.

The notions of ‘previous’ and ‘current’ require a concept of time. This is where synchronous logic comes in: we say that our circuit is synchronous to a clock, i.e. it responds to the clock, most commonly to its rising (positive) edge. A clock signal is a simple square wave with a 50% duty cycle: it is on for half of the period and off for the other half.

Tying this all together, we arrive at sequential logic circuits, where the outputs are a function of both current and previous inputs. Flip-flops, also called registers, are used to accomplish this. The simplest flop is the D flip-flop, as shown.

The component above has 3 inputs (clock, d and reset) and one output, q. A D flip-flop captures its input on the clock edge and holds that value at its output for one clock cycle. D flip-flops are so common that we simply refer to them as ‘flops’.

Note

Sometimes, you get an additional output that is the inverted version of q, since it falls naturally out of the transistor implementation of a D flop.
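Here’s a small software model of that behavior, just to build intuition (a Python sketch, not how you’d describe a flop in SystemVerilog):

    class DFlipFlop:
        """Minimal model: on every rising clock edge, q captures the value of d."""
        def __init__(self):
            self.q = 0

        def rising_edge(self, d: int, reset: int = 0) -> int:
            self.q = 0 if reset else d   # reset clears the flop, otherwise sample d
            return self.q

    flop = DFlipFlop()
    flop.rising_edge(d=1)
    print(flop.q)   # 1 -> the value of d is now held...
    print(flop.q)   # ...and keeps being held until the next rising edge
    flop.rising_edge(d=0)
    print(flop.q)   # 0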

Waveforms

One way to visualize sequential logic circuits is with waveform diagrams, as below.

On the 2nd positive edge of the clock, the input to the flop goes high. The output, however, remains unchanged. One clock cycle later, the flop’s output matches its previous input. Aha, we’ve got one cycle’s worth of “memory”!

This diagram exposes a key concept in synchronous logic: launching and sampling (capturing). We say that d is launched on the 2nd clock edge but only sampled by the flop on the 3rd positive edge. The diagram makes this a bit clearer by drawing the 0->1 transition of d slightly after the positive edge. In reality, things are more complicated.
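This launch/sample behavior is exactly the kind of thing we’ll later check with cocotb. As a taster, here’s a minimal test sketch; the port names clk, rst, d and q are assumptions about the DUT, and your flop may differ:

    import cocotb
    from cocotb.clock import Clock
    from cocotb.triggers import RisingEdge, FallingEdge

    @cocotb.test()
    async def dff_launch_and_sample(dut):
        # Drive a 10 ns clock and apply a short reset (assumed active high)
        cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())
        dut.rst.value = 1
        dut.d.value = 0
        await RisingEdge(dut.clk)
        dut.rst.value = 0
        await RisingEdge(dut.clk)

        dut.d.value = 1              # launch d just after this rising edge
        await FallingEdge(dut.clk)
        assert dut.q.value == 0      # the flop hasn't sampled the new value yet
        await RisingEdge(dut.clk)    # d is sampled here
        await FallingEdge(dut.clk)
        assert dut.q.value == 1      # one cycle of "memory"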

Glossary

Digital
Describes a system in which the values are constrained to 1 or 0
Analog
Describes a system in which values can span a continuous range
Logic gate
An electronic component that implements a logic function
(Boolean) logic
A system of reasoning with two values, true and false, that uses operations like AND and OR to combine and manipulate statements
Binary
Two. A counting system in which only two values are allowed: 1 and 0.
2’s complement
A way of expressing negative numbers in the binary counting system

  1. Metal Oxide Semiconductor Field Effect Transistor, Bipolar Junction Transistor

Session 3: Overview

Session 4: Overview

Session 5: FPGAs

FPGAs consist of a sea of look-up tables (LUTs), each implementing some logic function like ‘x OR y AND z’, plus programmable routing between them. Using a hardware description language (HDL) like SystemVerilog, we describe the behavior we want our FPGA to execute, and the synthesis tool converts this into thousands of such logic functions, which we can then program onto the FPGA.
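To make that concrete, here’s a tiny software model of a 3-input LUT (a sketch for intuition only; real FPGA LUTs are small memories feeding a multiplexer, and the details vary by vendor). The ‘configuration’ is just the function’s 8-entry truth table, and evaluating the function is an indexed read:

    def build_lut(func):
        """Store the truth table of a 3-input boolean function as 8 entries."""
        return [func((i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(8)]

    def lut_read(lut, x, y, z):
        """'Evaluate' the function by indexing into the stored table."""
        return lut[(x << 2) | (y << 1) | z]

    # Configure the LUT with x OR (y AND z)
    lut = build_lut(lambda x, y, z: x | (y & z))

    print(lut_read(lut, 0, 1, 1))  # 1
    print(lut_read(lut, 0, 1, 0))  # 0
    print(lut_read(lut, 1, 0, 0))  # 1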

Session 6: Neural Networks

This course is interesting because it brings together two incredibly complex and rapidly advancing topics: digital design and machine learning. Now that we’ve wrapped our heads around a few core concepts in the former space, we’re ready to see how to apply our skills to a novel scenario.

Let’s first remind ourselves of our goal: building an inference accelerator on an FPGA. But what exactly are we going to accelerate?

Deep learning

Deep learning is a subcategory of machine learning, which is itself a subcategory of the field of artificial intelligence. It gains its name from the fact that information travels several layers deep through an arrangement of so-called neurons. It is just one of many approaches to AI, but it is the one that has received tremendous attention in the past decade, in part due to its excellent accuracy.


*Figure taken from https://www.deeplearningbook.org/contents/intro.html

Network topologies

The quintessential example of a deep learning model is the feedforward deep network, or multilayer perceptron (MLP). This is also called a fully connected network.


*Figure taken from http://neuralnetworksanddeeplearning.com

A few observations:

  • Information travels from left to right
  • All neurons in one layer are connected to all neurons in the next
  • There is one input layer and one output layer, as well as several intermediate “hidden” layers

Another kind of neural network is a convolutional network. We’re going to be building one of these. They’re known for their high accuracy with image inputs due to their awareness of spatial features.


*Figure taken from http://neuralnetworksanddeeplearning.com

The basic structure is as follows:

  1. Information resident in the 28x28 grid gets gradually reduced to the last output layer with 10 neurons.
  2. The middle layer is called the convolutional layer and is formed of several feature maps.
  3. Each feature map is then downsampled into a smaller map by the pooling layer.
  4. Finally, all neurons in the pooling layer are connected to all 10 neurons in the output layer, reminiscent of our fully-connected network.

The ‘convolution’ in the name comes from the fact that the initial operation applied to the image to obtain the feature maps is known as a convolution.
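To make the shape of this pipeline concrete, here’s a minimal PyTorch sketch of such a network. The layer sizes (4 feature maps, 5x5 kernels) are illustrative assumptions, not the architecture we’ll settle on in the course:

    import torch
    import torch.nn as nn

    class DigitClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(1, 4, kernel_size=5)   # 28x28 -> 4 feature maps of 24x24
            self.pool = nn.MaxPool2d(2)                  # 24x24 -> 12x12 per feature map
            self.fc = nn.Linear(4 * 12 * 12, 10)         # fully connected to 10 outputs

        def forward(self, x):
            x = torch.relu(self.conv(x))
            x = self.pool(x)
            x = x.flatten(start_dim=1)
            return self.fc(x)

    # One forward pass on a dummy 28x28 greyscale image gives 10 scores, one per digit
    logits = DigitClassifier()(torch.rand(1, 1, 28, 28))
    print(logits.shape)  # torch.Size([1, 10])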

Note

We could, of course, train an MLP and use it for inference. For our relatively well-scoped problem of digit recognition on a 28x28 grid of greyscale pixels, both approaches would work well. That said, MLPs are considered ‘old technology’, having been around for decades before CNNs, and they are rarely used for image recognition today. CNNs themselves were introduced in a seminal paper by LeCun and others back in 1998!

Training and inference

Machine learning is a two-step process: first we train the network, then we use the trained network for inference.

Tip

I’ve found that ‘neural network’ can be somewhat of a misnomer. You can draw an equivalence between human neurons firing and being interconnected in a complex web, but that’s pretty much where the analogy stops. I’d therefore encourage you to think more in terms of ‘nodes’ or ‘units’, and of the whole system as just a function.

Session 7: Floating point

You may have heard the expression “we’re only as good as our weakest link”. It turns out we can draw a direct parallel between this age-old adage and computer architecture; more precisely, the slowest or least efficient part of a system places an upper bound on its performance. It thus follows that if we’re going to optimize something, we ought to optimize the slowest thing.

Motivation

One of these “slow things” in ASICs is floating point multiplication. This is an interesting example of a type of computation that’s relatively easy for us humans to calculate but quite involved for computers. Most of this is due to the limitations imposed by the need to agree on a common standard for representing decimal fractions in binary (IEEE-754). What we gain in standardization we potentially lose in optimization.

Floating point multipliers are logic-heavy, requiring a very large number of adders, and slow, potentially taking multiple cycles to complete one operation. They’re such a common target for speedup that practically every CPU has a floating point coprocessor, and the gains to be had are significant! Since designing such a multiplier is potentially a months-long endeavor, we’re going to sidestep it and instead perform all our multiplications with fixed-point numbers.

Primer on floating point

Floating point is a standard way of representing numbers with fractional parts in binary. It is defined by an international standard, IEEE-754, which is easy to find on the internet.

The key idea behind floating point is the same as scientific notation: a number is expressed as some number of significant digits (the mantissa) multiplied by a base raised to some power (the exponent); in binary floating point the base is 2. In a fixed 32- or 16-bit number, the widths of the mantissa and the exponent are in conflict: more exponent bits let us represent bigger (and smaller) numbers, while more mantissa bits give us greater precision. IEEE-754 defines the following bit arrangement:

Single Precision (32-bit):
┌─┬──────────┬───────────────────────────────────────────────┐
│S│ Exponent │                  Mantissa                     │
│ │  (8 bit) │                  (23 bit)                     │
└─┴──────────┴───────────────────────────────────────────────┘
 1     8                        23
 bit   bits                     bits

Programmers first encounter floating point numbers when they either a) quite literally see the word “float” in their C programs or b) spend hours debugging why 0.1 + 0.2 != 0.3.
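To see the three fields for yourself, you can pull a 32-bit float apart in Python with the standard struct module (a quick sketch):

    import struct

    def float_bits(x: float):
        """Decompose a 32-bit float into its IEEE-754 sign, exponent and mantissa fields."""
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        sign     = bits >> 31
        exponent = (bits >> 23) & 0xFF        # 8 bits, biased by 127
        mantissa = bits & 0x7FFFFF            # 23 bits, with an implicit leading 1
        return sign, exponent, mantissa

    # -6.5 = -1.625 * 2^2  ->  sign 1, exponent 2 + 127 = 129, mantissa 0.625 * 2^23
    print(float_bits(-6.5))  # (1, 129, 5242880)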

Fixed point
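The core idea of fixed point is to store a scaled integer and keep track of where the binary point sits; multiplication is then just an integer multiply followed by a shift. Here’s a minimal Python sketch (the Q8.8-style format and helper names are illustrative assumptions, not necessarily what we’ll use in the accelerator):

    FRAC_BITS = 8  # assumed Q8.8-style format: 8 fractional bits

    def to_fixed(x: float) -> int:
        """Encode a real number as a scaled integer."""
        return int(round(x * (1 << FRAC_BITS)))

    def fixed_mul(a: int, b: int) -> int:
        """Multiply two fixed-point numbers: integer multiply, then shift back."""
        return (a * b) >> FRAC_BITS

    def to_float(x: int) -> float:
        return x / (1 << FRAC_BITS)

    a, b = to_fixed(1.5), to_fixed(-2.25)
    print(to_float(fixed_mul(a, b)))  # -3.375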

*Flowchart: train the neural network using TensorFlow/PyTorch until it converges with accuracy > 95%.

Load the data, define the model, run the training loop, and evaluate. You’ll also learn practical details like batch sizes, learning rates, and overfitting.

Further resource: Stanford CS231N

Pipelining convolutions

Memory management

Convolution

Session 8: Architecture and Datapath

In Session 6 we saw how a convolutional network is structured and the kinds of computation involved during training and inference. In this session, we’re going to break down the exact data operations required to achieve one successful round of inference. This will allow us to write the specification for what we’re going to build.
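As a preview of the kind of data operation we’ll be specifying, here’s a NumPy sketch of a fully connected layer written as explicit multiply-accumulate loops (the layer sizes are assumptions; it’s the loop structure, one multiply-accumulate per weight, that maps onto hardware):

    import numpy as np

    def fully_connected(x, weights, biases):
        """y = ReLU(W @ x + b), written out as explicit multiply-accumulate loops."""
        out = np.zeros(weights.shape[0])
        for row in range(weights.shape[0]):        # one output neuron per row of W
            acc = biases[row]
            for col in range(weights.shape[1]):    # multiply-accumulate across the inputs
                acc += weights[row, col] * x[col]
            out[row] = acc
        return np.maximum(out, 0)                  # ReLU activation

    x = np.random.rand(576)                        # e.g. flattened pooling-layer outputs
    w = np.random.rand(10, 576)
    b = np.random.rand(10)
    print(fully_connected(x, w, b).shape)          # (10,)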

Session 9: Overview

Session 10: Overview

Session 11: Overview

Session 12: Overview

Extensions

There are a few extensions available for this project:

  • Creating a digital twin in Python: a computer-simulated version of your accelerator. This will allow you to quickly prototype different architecture designs.
  • Improving speed and logic resource utilization: are there any optimizations you can make to your design?
  • Using HLS on a more advanced FPGA. High-Level Synthesis allows you to program your FPGA in a higher-level language, which can speed up design cycles at the cost of losing control over how the design is synthesized.