As alluded to in a few of my other posts, I’m working on developing an open-source FPGA-accelerated vision platform. This post is a detailed overview of the project’s architecture and general development methodology. Future (and past) posts will elaborate on specific pieces of the project, as they’re implemented.
Stereo-vision is the main objective for the project – but once the general framework is in place, an obvious next-step would be the offloading of additional vision algorithms onto an FPGA.
(Update: 2011-06-10: I’ve now released the first version of my Open-source FPGA Stereo Vision core!)
This post is rather long, even by my standards. Here’s an index:
- FPGA Logic
Above is an approximate diagram of how the initial implementation will be organized. Key hardware components include:
- Multiple image sensors (provided by my MT9V032 LVDS camera boards)
- FPGA with PCIe interface and external high-speed memory (provided by a Xilinx SP605 Spartan-6 FPGA development board)
- Interface board to connect the image sensors to the FPGA board (provided by my FMC-LPC to SATA adapter board)
- Host computer (a generic PCIe-capable x86/AMD64 PC running Linux)
Eventually, much of that hardware will be combined onto a single PCB. As a precursor to that, I designed my Spartan-6 BGA test board as a means by which to characterize some of the more critical board design issues.
I’ve talked about my reasoning behind selecting the MT9V032 image sensors before, as I’ve also talked about the merits of Spartan-6 FPGAs. So, in the interest of keeping this post somewhat concise, I’ll refrain from repeating that content here.
PCIe is the bus of choice for connecting the FPGA to the host computer. This is largely made possible by the integrated PCIe endpoint present in all Spartan-6 LXT devices.
PCIe offers a high-bandwidth and low-latency link, with very low overhead (on both the FPGA and the PC side). On the FPGA, since the PCIe controller is largely a hard-IP block, it consumes few logic resources (though some non-trivial amount of logic is still required to bridge the controller’s streaming interface to a standard on-chip DMA bus). All bandwidth-intensive operations can be handled by having the FPGA DMA to/from the host’s memory, freeing the host CPU for other tasks. With a zero-copy driver, CPU overhead would be practically zero.
While the project is being developed on a very specific hardware platform, the ultimate goal is to create something that is portable across many different hardware systems. As such, the actual stereo processing blocks will be implemented so as to not be dependent on any platform-specific hardware (e.g. the LVDS cameras, the PCIe bus, or the DRAM controller).
ROS – the Robot Operating System – is just such a system. Originally developed in Stanford’s AI Lab and by Willow Garage, ROS is an open-source platform that’s gained rapid acceptance amongst all levels of robot developers and researchers. It provides a growing library of inter-operable packages that foster sharing and reuse (rather than reinvention) of robot control software. It’s a natural choice for much of my planned future robotics development. Thus, ROS compatibility is a goal of this project.
The software to achieve that comprises two major pieces: a Linux device driver to provide a low-level interface to the PCIe endpoint in the FPGA, and a user-space application that implements a ROS node.
The ROS node will export a generic set of topics and services that allow other nodes to control and consume vision data. This generic interface would be designed so as to abstract away any hardware-specific details. Thus, if one were (for example) to transition from a PCIe based implementation to an Ethernet based implementation, the change would be entirely transparent to other nodes. The existing stereo_image_proc ROS node already defines much of the required interfaces, so compatibility with it is a major goal.
The Linux device driver is necessarily hardware-dependent – and in some cases may not even be required (e.g. if using an Ethernet interface). It’s still desirable, however, to present a generic interface to any user-space applications (like the ROS node) – especially in instances where a user may want to directly access the vision hardware without going through ROS. Time permitting, I’d like to make the driver export a standard video interface (e.g. Video4Linux), so that existing applications can make use of the hardware without modification. This isn’t, however, a priority for me.
Each one of the blocks in this diagram could easily warrant a whole ten-thousand character post. This post is intended as an overview, however, so I’ll try to restrain myself to a couple of paragraphs each:
Mapping all control logic to a bunch of Verilog state-machines isn’t always a great idea (from a maintenance and resource usage perspective). A lightweight soft-core CPU is provided to implement non-compute-intensive tasks. The exact CPU architecture isn’t critical, so long as it supports some form of higher-level language – the ZPU is a probable choice, as it is one of the simplest and smallest 32-bit CPU architectures to have GCC support.
If using a host interface other than PCIe – especially one that isn’t DMA-oriented and has significant protocol overhead (e.g. Ethernet with UDP, or even more so with TCP) – the CPU would be responsible for implementing the protocol (though the actual packet payload transfer would still be performed by a dedicated DMA block in the FPGA).
Most FPGAs lack enough built-in SRAM blocks to hold a whole VGA image frame, much less multiple frames (the relatively large XC6SLX45T on the SP605 has on the order of 232KB of block RAMs; the more modest XC6SLX16 has a mere 64KB). Buffering a whole 768×512 pixel frame (at 1 byte-per-pixel) for two cameras would take 768KB. You’d need more for the rectification look-up-tables and other system functions. Buffering only sixteen 768 pixel rows for two cameras would require about 24KB of RAM.
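As a quick sanity check on those numbers, here is a small Python sketch of the buffer arithmetic. The variable names are mine, and the block RAM capacities are simply the approximate figures quoted above.

```python
# Buffer sizes quoted in the text, at 1 byte per pixel.
WIDTH, HEIGHT, CAMERAS = 768, 512, 2

# Approximate block RAM capacities from the text, in KB.
BLOCK_RAM_KB = {"XC6SLX45T": 232, "XC6SLX16": 64}

def kib(n_bytes):
    """Convert a byte count to KB (1 KB = 1024 bytes)."""
    return n_bytes / 1024

full_frames = WIDTH * HEIGHT * CAMERAS   # two complete frames
row_buffers = WIDTH * 16 * CAMERAS       # sixteen rows per camera

print(kib(full_frames))  # 768.0
print(kib(row_buffers))  # 24.0

# Which buffers fit entirely in on-chip RAM (ignoring all other uses)?
for part, capacity in BLOCK_RAM_KB.items():
    print(part, kib(full_frames) <= capacity, kib(row_buffers) <= capacity)
```

Neither part can hold complete frames, while both can hold the sixteen-row buffers with room to spare.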
It should theoretically be possible to perform basic stereo-vision in an entirely streaming fashion, provided that the camera distortions are sufficiently well-controlled that the image-rectification step doesn’t have to examine more than a handful of input rows for each output row. This requirement should hold for well-aligned cameras/lenses in the normal-to-telephoto range, but would tend to break down for misaligned or wide-angle lenses. Any sort of more complex global/multi-pass algorithm would likely require complete frames. A high-bandwidth external memory part is therefore a reasonable requirement.
Xilinx provides their Memory Interface Generator (MIG) tool for creating DRAM interface logic. On their Spartan-6 FPGAs, this instantiates a hard-IP block within the FPGA. A small wrapper/gasket is needed to convert the native MIG interface to whatever bus fabric the rest of the design is using (if using an AXI bus and a Spartan-6 or Virtex-6 device, the MIG can create an AXI wrapper for you).
Point-to-point streaming interfaces can be used for the bulk of the high-bandwidth communication between image processing blocks – but there’s still a need for a general-purpose register bus for configuration, and a higher-performance DMA bus for accessing external memory and PCIe.
The Wishbone bus is an obvious choice for the register bus, as many open-source logic cores already employ it (e.g. many of the cores on OpenCores). There’s even a decent Wishbone slave generator tool available for quickly creating well-documented register spaces.
The DMA bus is much trickier. Due to the presence of high-latency burst-oriented bus slaves, like PCIe (and, to a lesser extent, DRAM), you really need to have a bus fabric that can represent multiple outstanding multi-word transactions – otherwise, performance would be substantially impacted. ARM’s AXI bus is basically the bus for these purposes in the ASIC industry. It hasn’t, however, seen much acceptance amongst open-source hardware developers.
Wishbone was, for a while, completely unsuitable for this task. The recently introduced Wishbone B4 specification, however, adds a pipelined variant of the bus. It’s still lacking relative to AXI (e.g. it still has no provision for indicating how long a burst is ahead of time – though you can now partially infer it by accepting multiple pipelined commands), but it’s at least “good enough” for most purposes.
Since both Wishbone and AXI buses could be of interest to others, I’ll be designing my cores such that they can be easily bridged to either bus.
The integrated PCIe endpoint in Spartan-6 LXT devices offloads a lot of work to a hard-IP block – but it mostly only handles lower-level protocol functions. User-logic is still required to parse and generate PCIe-compliant TLP packets. This is a relatively trivial task for slow single-word-at-a-time PIO transactions (as Xilinx’s bundled example application implements), but it’s rather more complicated for large DMA transactions.
Some work can be saved by relying on a restricted programming model – that is, only implementing a subset of the full PCIe spec, and mandating that any driver for the device abide by those restrictions (e.g. mandating that all DMA transactions be 128 bytes in length and aligned to a 128 byte boundary). This wouldn’t be hard to require, as the driver is being developed specifically for this device. Still, my goal for the PCIe controller (which is really a whole separate project in-and-of-itself) is to create a general-purpose interface core, so arbitrarily restricting use-models is not my preference.
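To illustrate what a general-purpose controller has to do that a restricted one doesn't, here is a minimal Python sketch that splits an arbitrary transfer into pieces that never cross a 128-byte boundary. The function name and the chunk size are illustrative only (the 128-byte figure is just the example above); the real PCIe rules, such as the negotiated maximum payload size, are not modeled here.

```python
def split_dma(addr, length, chunk=128):
    """Split [addr, addr + length) into (address, size) pieces that never
    cross a `chunk`-byte boundary. Purely illustrative."""
    pieces = []
    end = addr + length
    while addr < end:
        boundary = (addr // chunk + 1) * chunk   # next chunk boundary
        size = min(boundary, end) - addr
        pieces.append((addr, size))
        addr += size
    return pieces

# An unaligned 300-byte transfer becomes a short head, a full chunk, and a tail.
print(split_dma(0x1040, 300))
```

Under the restricted programming model, the driver would only ever hand the hardware transfers that already look like the middle piece, and this splitting logic would not be needed in the FPGA.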
The PCIe controller will be designed so that it can eventually interface with both Wishbone and AXI DMA buses. Both master and slave ports will be present – allowing for the host PC to access the FPGA’s memory, and allowing the FPGA to access memory in the host. In most high-bandwidth use-cases, the FPGA will be doing the bulk of the DMA transactions, so that path will receive the most scrutiny.
MT9V032 LVDS camera interface
This block allows for actually receiving image data from multiple MT9V032 LVDS camera boards (seen above). The MT9V032 image sensors output a simple 320 Mbps serialized 10 bit-per-pixel video stream with embedded clock and horizontal/vertical sync codes.
The Spartan-6’s integrated SerDes blocks are used to efficiently deserialize this stream. Since the SerDes blocks are unable to perform embedded-clock recovery, they instead sample the incoming data using an internal clock generated by one of the Spartan-6’s PLLs. In order for this to work, the MT9V032 sensors have to be generating their LVDS data off of the same base clock – thus, the FPGA supplies the same ~27 MHz reference clock to each sensor (and also uses this reference clock to generate the internal SerDes sampling clock).
While the sensors and the FPGA are using the same frequency clock for data transmission/reception, the phase relationship isn’t guaranteed. Thus, to meet the setup/hold timing requirements on the Spartan-6’s deserializer blocks, extra steps are needed to align the data and clock. The Spartan-6’s IDELAY (input delay) blocks can be used for this task (to a certain extent). They include a phase-detector mode that allows for mostly-automatic data/clock alignment (it examines edges on the incoming data, and tries to position them half of a clock cycle away from clock edges).
The cheaper IDELAY blocks used in the Spartan-6 have limitations, however: they can’t be used to delay a signal by more than 1 clock cycle (and, in practice, you shouldn’t go much beyond half of a cycle). Since this can’t be guaranteed, yet another measure is needed. By using a serializer block running off of a 3x (960 MHz) clock to generate the 26.67 MHz reference clock to each image sensor, we can then apply a coarse phase-shift to the reference clock. The resolution of that phase-shift is equal to the bit-time of the serializer: 1/960 MHz – or 1/3rd of a 320 Mbps bit-time. Now that the data/clock relationship can be controlled to within ~1/3rd of a bit-time, the IDELAY elements can then be used for fine-tuning the rest of the way.
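The coarse phase-shift is easiest to picture as a bit pattern. Below is a rough Python model, not the actual logic: it assumes the reference clock is produced as a 36-bit repeating pattern (960 MHz / 36 is about 26.67 MHz), and that a phase shift is applied by rotating that pattern. The names are hypothetical.

```python
SERIALIZER_HZ = 960e6      # 3x the 320 Mbps LVDS bit rate
BITS_PER_PERIOD = 36       # 960 MHz / 36 = ~26.67 MHz reference clock

def ref_clock_pattern(shift):
    """One reference-clock period as serializer bits, delayed by `shift` bits."""
    base = [1] * (BITS_PER_PERIOD // 2) + [0] * (BITS_PER_PERIOD // 2)
    shift %= BITS_PER_PERIOD
    return base[BITS_PER_PERIOD - shift:] + base[:BITS_PER_PERIOD - shift]

step_ns = 1e9 / SERIALIZER_HZ   # coarse phase step, about 1.04 ns
bit_time_ns = 1e9 / 320e6       # LVDS bit-time, 3.125 ns

print(round(step_ns / bit_time_ns, 3))   # 0.333
print(ref_clock_pattern(1)[:4])          # [0, 1, 1, 1]
```

Each one-bit rotation moves the reference clock edge by a third of an LVDS bit-time, which leaves the remaining fraction for the IDELAY elements to absorb.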
None of this would be practical without the new SerDes, IDELAY, and PLL blocks introduced in the Spartan-6 (or already found in higher-end Virtex devices). A more typical system with parallel-interfaced image sensors would, of course, not require any of this.
Image rectification is a crucial first-step in many image processing tasks – and especially in stereo-vision. Most stereo-vision algorithms depend on the input images conforming to a simplified epipolar geometry with coplanar image planes. This allows for the assumption that a given point in one image can be found in the same row of the other image (provided that the point isn’t occluded, of course) – thus dramatically reducing the search space.
You can’t easily build a system with distortion-free cameras/lenses and perfect alignment, so the image rectification step is required to take real-world image data and turn it into something resembling the ideal case. A calibration process is run on the un-corrected stereo image data to determine what sort of transformation the rectification step has to perform.
The most straight-forward way for an FPGA to implement this rectification step is by using look-up-tables: for each rectified output pixel, you have a table entry that indicates the source pixel. A naive implementation that allows each table entry to reference anywhere in the entire source image would be very memory-intensive; sub-pixel resolution would only compound matters.
A better implementation might, for example, encode coordinate differences between adjacent pixels (under the perfectly-reasonable assumption that the source coordinate for a particular output pixel will be very similar to the source coordinate for its neighbor). An 8-bit value could encode both a 4-bit X and Y difference, which could themselves be signed fixed-point fractions (e.g. with two fractional bits, giving a range of +1.75 to -2.00 in steps of 0.25).
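Here is a hedged Python sketch of one possible delta encoding. It assumes each 4-bit field is a two's-complement value with two fractional bits (steps of 0.25, covering -2.00 to +1.75), with the X difference in the high nibble. That layout is my own choice for illustration; nothing here is a finalized table format.

```python
def pack_delta(dx, dy):
    """Pack two coordinate deltas (multiples of 0.25 in [-2.0, 1.75]) into a byte."""
    def to_nibble(value):
        n = round(value * 4)            # two fractional bits
        if not -8 <= n <= 7:
            raise ValueError("delta out of range")
        return n & 0xF                  # 4-bit two's complement
    return (to_nibble(dx) << 4) | to_nibble(dy)

def unpack_delta(byte):
    """Inverse of pack_delta: returns (dx, dy) as floats."""
    def from_nibble(n):
        return (n - 16 if n & 0x8 else n) / 4
    return from_nibble(byte >> 4), from_nibble(byte & 0xF)

def decode_row(start, deltas):
    """Rebuild absolute source coordinates from a start point and packed deltas."""
    x, y = start
    coords = [(x, y)]
    for byte in deltas:
        dx, dy = unpack_delta(byte)
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords

print(hex(pack_delta(1.0, -0.25)))              # 0x4f
print(decode_row((10.0, 5.0), [0x4F, 0x40]))
```

One byte per output pixel is far cheaper than storing two full sub-pixel coordinates, at the cost of needing an absolute starting coordinate for each row (or each run of pixels).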
Alternatively (or possibly in conjunction with), one could use a lower resolution look-up-table with simple linear interpolation between entries. The size of that lower resolution table would depend on the severity of the distortion being corrected, and the amount of error that is tolerable (relative to an ideal full-resolution table).
With coordinates in hand, the FPGA can then use a simple sampling algorithm (e.g. bilinear interpolation) to generate the output pixels. For all but nearest-neighbor interpolation, the sampling algorithm would need to read multiple source pixels for each output pixel, so a cache would be needed to reduce external memory accesses.
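For reference, bilinear sampling can be sketched in a few lines of Python. This is a plain software model (floating point, image as a list of rows, edges clamped), not a description of how the FPGA datapath would be built.

```python
def bilinear_sample(image, x, y):
    """Sample a grayscale image (list of rows) at a fractional, non-negative (x, y)."""
    x0, y0 = int(x), int(y)                     # top-left source pixel
    x1 = min(x0 + 1, len(image[0]) - 1)         # clamp at the right edge
    y1 = min(y0 + 1, len(image) - 1)            # clamp at the bottom edge
    fx, fy = x - x0, y - y0                     # fractional parts
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

image = [[0, 100],
         [100, 200]]
print(bilinear_sample(image, 0.5, 0.5))   # 100.0
```

Each output pixel reads a 2×2 neighborhood of source pixels, which is why some form of cache or row buffer is needed in front of the sampler.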
For well-controlled distortion, it would be possible to perform rectification in a streaming fashion without any reliance on external memory. Each image sensor would write directly into an internal memory that is large enough to hold several rows worth of image data. Then, each output row could be generated entirely from this input buffer. Within each output row, the range of source Y-coordinates would have to span less than the height of the input buffer.
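Whether a given calibration is "well-controlled" enough could be checked offline. The Python sketch below assumes you already have, for each output row, the list of source Y-coordinates from the rectification map, and that sampling is bilinear (two rows per sample). The function names and the 16-row default are hypothetical.

```python
import math

def rows_needed(src_ys, taps=2):
    """Input rows touched by one output row; taps=2 models bilinear sampling."""
    first = math.floor(min(src_ys))
    last = math.floor(max(src_ys)) + (taps - 1)
    return last - first + 1

def fits_in_buffer(row_maps, buffer_rows=16):
    """True if every output row can be generated from `buffer_rows` input rows."""
    return all(rows_needed(ys) <= buffer_rows for ys in row_maps)

print(rows_needed([10.2, 12.7, 11.0]))        # 4
print(fits_in_buffer([[0.0, 20.5]], 16))      # False
```

This only checks the per-row span described above; it says nothing about scheduling when each output row can be produced relative to the incoming sensor data.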
If using a color image sensor, the de-Bayer/de-mosaicing step could also be rolled into the sampling algorithm.
Finally: the module for which the rest of the project has been created. It is intended to implement something compatible with OpenCV’s block-matching algorithm, which is invoked by the aforementioned stereo_image_proc ROS node.
The underlying algorithm is quite simple: for each pixel in the left image, find the closest match (along a horizontal line) in the right image. The distance you had to shift the left image to match the right image is the disparity. A larger disparity indicates that the pixel is physically closer to the camera (the relationship between disparity and distance is not linear; there is considerably more depth-resolution near the camera than far away).
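The non-linearity follows from the standard relationship for rectified pinhole cameras: depth = focal length × baseline / disparity. A small Python illustration follows; the focal length (in pixels) and baseline are made-up values, not measurements of my hardware.

```python
def depth_m(disparity_px, focal_px=800.0, baseline_m=0.125):
    """Depth of a point for rectified pinhole cameras: Z = f * B / d.
    focal_px and baseline_m are hypothetical example values."""
    return focal_px * baseline_m / disparity_px

for d in (64, 63, 2, 1):
    print(d, depth_m(d))
```

With these example numbers, stepping from a disparity of 64 to 63 changes the depth by about 2.5 cm, while stepping from 2 to 1 changes it by 50 m.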
Comparing individual pixels isn’t very effective, so larger windows are used (OpenCV efficiently supports 5×5 to 21×21 windows). To quickly compute how closely two windows match, the sum-of-absolute-differences (SAD) algorithm is used. SAD maps very well to an FPGA, since it can be implemented with heavily-pipelined adder trees.
Here’s an (un-implemented, un-verified) example 5×5 SAD pipeline:
The yellow blocks are RAMs, while the purple ones are pipeline registers. The rest is logic. The diagram omits most control signals (it’s data-path only). Each level of logic is kept very shallow to allow it to run as fast as possible (that is, at the same speed as the block RAMs in the FPGA – e.g. 280 MHz in a Spartan-6 speed-grade-3 part). Scaling it to larger window sizes merely requires enlarging the adder tree, extending the SAD delay-line, and adding extra RAMs to buffer the extra rows of image data.
The pipeline accepts a new column of pixels from each image every cycle. After initially filling (~10 cycles), it produces a new SAD value every cycle (by adding the SAD value for the newest column and subtracting the SAD value of the old column that just fell outside of the window). The pipeline is designed to sequentially calculate the SAD value for every pixel in a row at a particular disparity (changing the disparity requires flushing the pipeline). Each row has to be run through the pipeline multiple times – one time for each disparity of interest (e.g., if you’re searching up to a maximum disparity of 64 pixels, without skipping any intermediate values, you’d have to run each row through 64 times).
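The add-newest/subtract-oldest trick is easy to model in software. This Python sketch is a behavioral model only: it assumes left pixel x is compared against right pixel x - d (a common convention, but an assumption here), and checks the running sum against a brute-force window SAD on random test data.

```python
import random

def column_sad(left, right, x, d, y0, win):
    """SAD of one window column: left pixel x vs right pixel x - d, over `win` rows."""
    return sum(abs(left[y][x] - right[y][x - d]) for y in range(y0, y0 + win))

def row_sad_sliding(left, right, d, y0, win=5):
    """Window SAD for each window position in a row, using a running sum."""
    width = len(left[0])
    cols = [column_sad(left, right, x, d, y0, win) for x in range(d, width)]
    acc = sum(cols[:win])               # initial pipeline fill
    out = [acc]
    for i in range(win, len(cols)):
        acc += cols[i] - cols[i - win]  # add newest column, drop oldest
        out.append(acc)
    return out

def row_sad_brute(left, right, d, y0, win=5):
    """Same result, recomputing the full win x win window every time."""
    width = len(left[0])
    return [sum(abs(left[y][x] - right[y][x - d])
                for y in range(y0, y0 + win) for x in range(xs, xs + win))
            for xs in range(d, width - win + 1)]

rng = random.Random(0)
left = [[rng.randrange(256) for _ in range(12)] for _ in range(5)]
right = [[rng.randrange(256) for _ in range(12)] for _ in range(5)]
print(row_sad_sliding(left, right, 2, 0) == row_sad_brute(left, right, 2, 0))  # True
```

The running sum needs one column SAD per step regardless of window width, which is what lets the hardware produce one result per cycle.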
The ‘disparity’ RAM stores the current best-match for each pixel across subsequent runs of the pipeline. On the final pipeline run (for a given row), the value written to the disparity RAM is the best match. This is the value that would be sent down the stereo processing pipeline (to the post-processing stage).
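The role of the disparity RAM across passes can be sketched as follows. In this Python model I assume the RAM holds both the best SAD seen so far and the disparity that produced it, and that ties keep the earlier (smaller) disparity; both are my assumptions rather than settled design decisions.

```python
def best_disparities(sad_per_pass):
    """sad_per_pass[d][x] is the SAD for pixel x at disparity d (one list per
    pipeline pass). Returns the best-matching disparity for each pixel."""
    width = len(sad_per_pass[0])
    best_sad = [float("inf")] * width   # SAD half of the 'disparity' RAM
    best_d = [0] * width                # disparity half of the 'disparity' RAM
    for d, sads in enumerate(sad_per_pass):
        for x, sad in enumerate(sads):
            if sad < best_sad[x]:       # strictly better: overwrite the entry
                best_sad[x], best_d[x] = sad, d
    return best_d

print(best_disparities([[5, 9, 3],
                        [4, 9, 7],
                        [6, 1, 3]]))   # [1, 2, 0]
```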
Performance is a concern; let’s run some example numbers: Assuming a ~30 MHz pixel clock (MT9V032 at 60 FPS) and a disparity-search-space of 64 pixels, we would need to run the above pipeline at ~1920 MHz. That’s not going to happen any time soon. If we run the pipeline at an actually-achievable speed of 240 MHz, we need to find an 8x speed-up somewhere else.
If the logic were fixed, we would be limited to reducing the search-space or the pixel clock:
- Examine a lot fewer disparity levels: Only 8 levels instead of 64? That’s not an option.
- Lower the frame-rate: 7.5 FPS? No thanks.
- Lower the resolution: 1/8th VGA? Ugh.
- Some combination of the above: more realistic, but still not great.
But the logic isn’t fixed, and FPGAs are at their best when running algorithms in parallel. So, we could:
- Process more than 1 pixel per clock-cycle (in a particular row): not my preferred option, as it requires significant modification to the pipeline.
- Process multiple adjacent rows simultaneously: this is a good option, provided you have enough block RAMs to buffer additional rows of input and disparity data. Large portions of the adder-tree could be shared across rows.
- Process multiple adjacent disparity levels per pass/cycle: a good compromise; it requires no additional RAM, but does duplicate the entire adder-tree (and requires extra pipelined comparators).
None of these options require a lot of extra logic. I’d be inclined to use a combination of the latter two: e.g., process 2 rows simultaneously, and evaluate 4 disparity levels per pass. That would easily achieve the desired performance target.
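The throughput arithmetic is simple enough to put in a few lines of Python. This idealized sketch ignores pipeline flushes and fill time between passes, so the real requirement would be somewhat higher.

```python
def required_clock_mhz(pixel_clock_mhz, disparities, rows_parallel=1, disp_parallel=1):
    """Pipeline clock needed to keep up with the sensor (idealized: no flush overhead)."""
    return pixel_clock_mhz * disparities / (rows_parallel * disp_parallel)

print(required_clock_mhz(30, 64))         # 1920.0
print(required_clock_mhz(30, 64, 2, 4))   # 240.0
```

Two rows in parallel with four disparity levels per pass gives exactly the 8x needed to bring the pipeline clock down to 240 MHz.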
The trade-offs are more complicated if you have very limited block RAM (e.g. not even enough RAM to buffer multiple rows of image data). That’s not likely to occur with a VGA image sensor – but a 5 megapixel sensor (with rows over 2500 pixels wide) could easily exceed the block RAM capacity of a smaller FPGA (buffering 16 3KB rows across 2 cameras would require 96 KB). In that case, you might want to improve the locality of memory accesses (since we now have to access relatively low-bandwidth external memory); processing the image in blocks, rather than just rows, could help with this.
The goal for this project is to create synthesizable Verilog for this module that is parameterized to support arbitrary window sizes and combinations of row-parallel/disparity-parallel processing. These parameters would be set at synthesis time. It would be theoretically possible to provide run-time options for reducing window size (below what the hardware is configured for), but it’s unclear that there would be much benefit for the additional logical complexity and development effort. If this were targeting an ASIC, rather than an FPGA, it would be a Good Idea.
To achieve the highest level of performance, device-specific implementations could be made (e.g. using adder trees composed of 3-input adders to take full advantage of the Spartan-6’s 6-input LUTs).
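To get a feel for what 3-input adders buy, this small Python sketch counts the levels in a balanced adder tree for a given number of inputs. It only counts levels; it says nothing about the actual delay or LUT usage of each adder, which would have to come from synthesis results.

```python
def adder_tree_levels(inputs, radix):
    """Number of adder levels needed to reduce `inputs` values to one sum,
    using adders that each accept up to `radix` operands."""
    levels = 0
    while inputs > 1:
        inputs = -(-inputs // radix)   # ceiling division
        levels += 1
    return levels

for n in (5, 21):
    print(n, adder_tree_levels(n, 2), adder_tree_levels(n, 3))
```

For a 21-input column sum, the tree shrinks from 5 levels of 2-input adders to 3 levels of 3-input adders, which means fewer pipeline registers and a shorter fill latency.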
In the future, after implementing this basic block-matching algorithm, I plan on investigating FPGA targeted implementations of more sophisticated algorithms. Some dynamic programming methods, for example, perform a per-scanline global optimization that can lead to much better results than simple block-matching (see the paper “A Hierarchical Symmetric Stereo Algorithm using Dynamic Programming” for a good description of one such algorithm). The basic building blocks of these algorithms, when suitably optimized, are simple enough to lend themselves to an FPGA implementation (see “Real Time Rectification for Stereo Correspondence” for a paper about one such FPGA implementation).
The raw disparity data from the stereo-correspondence block isn’t always entirely suitable for direct consumption. The basic block-matching algorithm will try to assign disparities to all pixels, even if the match is poor or non-existent. Post-processing steps can be applied to filter out low-quality disparity data, and possibly even to fill in missing data through interpolation.
Depending on the nature of the post-processing, it may be necessary to add additional features to the stereo-correspondence block. For example: recording additional statistical data on the range and distribution of computed SAD values in order to assess the confidence of a particular match.
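As one example of such a confidence measure, the Python sketch below rejects a match unless the best SAD beats the best non-adjacent alternative by some margin. This is loosely in the spirit of a uniqueness-ratio test; the exact rule and the 15% default are illustrative assumptions, not a reproduction of any particular library's behavior.

```python
def is_confident(sads, uniqueness_pct=15):
    """sads[d] is the SAD for one pixel at disparity d. Accept the best match
    only if every disparity more than 1 away is at least uniqueness_pct worse."""
    best_d = min(range(len(sads)), key=sads.__getitem__)
    best = sads[best_d]
    rivals = [s for d, s in enumerate(sads) if abs(d - best_d) > 1]
    if not rivals:
        return True
    # Integer-only comparison, as hardware would likely prefer.
    return min(rivals) * 100 > best * (100 + uniqueness_pct)

print(is_confident([50, 10, 48, 60]))   # True
print(is_confident([50, 10, 48, 11]))   # False
```

Supporting a test like this in hardware would require the stereo-correspondence block to track a second-best SAD alongside the best one, which is the kind of additional statistic mentioned above.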
Verification will be a huge task for this project, as one can imagine – large enough to warrant its own overview post. I’ll just touch on a few of the high-level bits here, and save the rest for later.
There exist a surprising number of open-source tools for verifying Verilog FPGA and ASIC designs. Some especially noteworthy ones are:
- Verilator – a Verilog compiler/synthesizer that converts synthesizable Verilog code into cycle-accurate C++ or SystemC models.
- Icarus Verilog – a fantastic fully-featured Verilog simulator, suitable for verifying constructs that Verilator is unable to tackle (asynchronous, intra-cycle behavior). It is, however, quite a bit slower than Verilator.
- SystemC – a templated C++ class/macro library that grafts HDL capabilities onto C++. It’s not good for hardware description tasks (leave that to Verilog), but it’s great for higher level verification and system modeling (especially in conjunction with the SystemC transaction-level modeling (TLM) library).
My preferred approach to hardware development emphasizes simple, well-defined interfaces between all modules. Unit testing is then used to thoroughly validate individual modules as they are created.
All of the above tools will play a role in this. Icarus works well for verifying the lowest level components using simple Verilog test-benches. Verilator and SystemC are well-suited to verifying higher level components, especially when existing C reference models are available (e.g. OpenCV’s stereo correspondence implementation).
Modeling and verification of the complete system is enabled by SystemC and the TLM. The TLM provides a standard method for creating abstract interfaces between components, allowing individual components in a system to be represented with models at varying levels of abstraction. This enables the functionality of the complete system to be evaluated even before all of the Verilog has been written. At the lowest level of abstraction, you represent the system completely with synthesizable Verilog blocks – allowing the proposed hardware implementation to be verified.
This is a big project, and it’s going to take quite a while to complete (especially seeing as I’m only working on it in my spare time). Since the project is inherently modular, I expect to be able to release pieces of it as development progresses.
Needless to say, as the project progresses, you can look forward to many future posts about its individual components. I wouldn’t, however, count on any of those posts being especially concise.