(have a look at my previous FPGA Stereo Vision Project post for some more context)
It’s written purely in synthesizable Verilog, and uses device-agnostic inference for all FPGA primitives (though the current implementation is more optimized for Xilinx devices). I’m releasing it under a standard 3-clause BSD license.
The design is heavily pipelined. Under realistic conditions (in a highly-utilized, slowest-speed-grade part without any floor-planning), it can run around 150 MHz in Spartan-3E and Spartan-6 parts, and around 300 MHz in Virtex-6 parts. Much higher speeds (50+%) are possible under unrealistic (ideal) conditions.
The design is fully parameterized and highly scalable; some example implementations include:
- 320×240 @ 120 FPS in Spartan-3E 250
- 640×480 @ 30 FPS in Spartan-3E 500
- 800×480 @ 60 FPS in Spartan-6 LX25
- 800×480 @ 60 FPS in Cyclone IV EP4CE22
- 1920×1080 @ 30 FPS in Spartan-6 LX75
The core has been verified in simulation using Verilator with SystemC testbenches. Post-synthesis results (from Xilinx’s XST tool) have been verified using a simplified Verilog testbench and Xilinx’s own ISim simulator.
Now, before everyone runs off and tries to build their own open-source Kinect, I must stress that this isn’t a complete solution just yet; here’s a block diagram of what I have implemented:
If we now refer back to the high-level block diagram that I presented before:
..we can see that this core implements all of the “Stereo Correspondence” block, some/all of the “Post-Processing” block and (had I actually included it on the original diagram) the “Pre-Filtering” block. While “Image Rectification” is the only significant missing image pipeline component, there’s still a lot of other system level infrastructure to develop (external interfaces, buses, etc.) before I can call the project “complete.”
That being said, the correspondence core easily represents the most critical, most resource-intensive and highest-performance component of the entire system. Completing it is a major milestone in the project.
- OpenCV compatibility
- Using the Stereo Correspondence Core
- Synthesis and Place & Route
- Inside the Core
- Future development
The correspondence core implements a significant subset of the functionality provided by OpenCV’s block-matching stereo correspondence algorithm (findStereoCorrespondenceBM). When using the supported subset, the core produces results that are 100% identical to those of the OpenCV algorithm (specifically: the non-SSE implementation present in OpenCV 2.2.0; there are actually subtle differences between the unoptimized and SSE versions).
Looking at the parameter/state structure for OpenCV’s algorithm (CvStereoBMState), we can enumerate how my correspondence core compares:
- Pre-Filtering:
  - preFilterType: CV_STEREO_BM_XSOBEL is fully supported; CV_STEREO_BM_NORMALIZED_RESPONSE is not supported.
  - preFilterSize: not applicable for CV_STEREO_BM_XSOBEL.
  - preFilterCap: fully supported.
- Stereo Correspondence:
  - SADWindowSize: fully supported.
  - minDisparity: not currently supported (fixed at 0).
  - numberOfDisparities: fully supported.
  - textureThreshold: fully supported.
  - uniquenessRatio: fully supported.
  - speckleRange: not currently supported.
Due to performance related implementation details, some combinations of parameters (which would otherwise be valid) are prohibited. The usage section describes the core’s parameters, and all applicable restrictions on them.
Likewise, the core supports additional tuning parameters that allow for creating non-OpenCV-compatible implementations with a smaller logic footprint.
Here are two images in the ‘Cones’ data set from the Middlebury Stereo Vision Page:
And this is the ground-truth disparity image:
We convert the two input images to gray-scale, and then run them through OpenCV’s findStereoCorrespondenceBM function with state parameters of:

    state->preFilterType       = CV_STEREO_BM_XSOBEL;
    state->preFilterCap        = 7;
    state->SADWindowSize       = 17;
    state->minDisparity        = 0;    // not supported by dlsc_stereobm
    state->numberOfDisparities = 96;
    state->textureThreshold    = 1300;
    state->uniquenessRatio     = 25;
    state->speckleRange        = 0;    // not supported by dlsc_stereobm
    state->speckleWindowSize   = 0;    // ""
This filters the input images to something like this:
(just showing the left; the right is similar)
Subsequently, it creates this unfiltered disparity map:
And finally yields the following filtered disparity map:
(normally you would only care about the final result; I’ve included the intermediate steps so you can see what the image looks like at various stages in the pipeline)
If we also run the same input images through my correspondence core (dlsc_stereobm_prefiltered, with equivalent configuration parameters), we get:
No need to break out a diff program: they’re completely identical.
(in fact, all of the intermediate steps presented above were actually produced by my reference model, while the OpenCV result was truly from executing the code present in OpenCV 2.2.0 (with SSE disabled))
Using the Stereo Correspondence Core
I’m initially releasing three versions of my correspondence core:
- dlsc_stereobm_core – the raw stereo correspondence core; doesn’t include any additional buffering, nor does it include pre-filtering.
- dlsc_stereobm_buffered – a wrapper around the core which includes extra input and output buffering (which also has the benefit of providing an asynchronous clock domain crossing, so the rest of the system can run at a slower clock).
- dlsc_stereobm_prefiltered – a thin wrapper that includes the buffered wrapper and an xsobel pre-filtering core.

I’ll primarily be talking about usage of dlsc_stereobm_prefiltered, since it’s the easiest to integrate and has the greatest support for OpenCV features.
Each wrapper level adds additional logic and non-trivial amounts of RAM, so it’s still possible that you may want to use a lower-level core in order to save on these resources. Some of the usage differences between the various versions are discussed later.
Here’s an example instantiation of the pre-filtered wrapper (this exactly matches the example OpenCV configuration seen above):
    dlsc_stereobm_prefiltered #(
        .DATA               ( 8 ),      // input image bits-per-pixel
        .DATAF              ( 4 ),      // filtered image bits-per-pixel
        .DATAF_MAX          ( 14 ),     // 2*state->preFilterCap
        .IMG_WIDTH          ( 640 ),    // image width
        .IMG_HEIGHT         ( 534 ),    // image height
        .DISP_BITS          ( 7 ),      // output disparity image is
                                        // (DISP_BITS+SUB_BITS) bits-per-pixel
        .DISPARITIES        ( 96 ),     // state->numberOfDisparities
        .SAD_WINDOW         ( 17 ),     // state->SADWindowSize
        .TEXTURE            ( 1300 ),   // state->textureThreshold
        .SUB_BITS           ( 4 ),      // must be 4 for OpenCV compatibility
        .SUB_BITS_EXTRA     ( 4 ),      // must be 4 for OpenCV compatibility
        .UNIQUE_MUL         ( 1 ),      // state->uniquenessRatio ==
        .UNIQUE_DIV         ( 4 ),      // (UNIQUE_MUL*100)/UNIQUE_DIV
        .OUT_LEFT           ( 1 ),      // enable out_left (filtered image output)
        .OUT_RIGHT          ( 1 ),      // enable out_right
        // remaining parameters are only throughput/timing related; not functional
        .MULT_D             ( 8 ),      // process MULT_D disparities per pass
        .MULT_R             ( 2 ),      // process MULT_R rows at a time
        .PIPELINE_BRAM_RD   ( 1 ),      // synthesis performance tuning
        .PIPELINE_BRAM_WR   ( 0 ),      // ""
        .PIPELINE_FANOUT    ( 0 ),      // ""
        .PIPELINE_LUT4      ( 0 )       // ""
    ) dlsc_stereobm_prefiltered_inst (
        .core_clk           ( core_clk ),       // high-speed async clock for core
        .clk                ( clk ),            // clock for interfaces
        .rst                ( rst ),            // synchronous reset
        .in_ready           ( in_ready ),       // ready/valid handshake for input
        .in_valid           ( in_valid ),       // ""
        .in_left            ( in_left ),        // left image input
        .in_right           ( in_right ),       // right image input
        .out_ready          ( out_ready ),      // ready/valid handshake for output
        .out_valid          ( out_valid ),      // ""
        .out_disp           ( out_disp ),       // disparity image output
        .out_masked         ( out_masked ),     // disparity outside of valid area
        .out_filtered       ( out_filtered ),   // disparity filtered by post-proc
        .out_left           ( out_left ),       // filtered left image output
        .out_right          ( out_right )       // filtered right image output
    );
All of the core’s configuration parameters are set when the core is instantiated via standard Verilog parameters (not a single `define required!). No provision is made for run-time adjustment of configuration (this core is designed for reconfigurable FPGAs, not general-purpose ASICs).
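If you're starting from an existing OpenCV configuration, the comments in the instantiation above hint at how the width-related parameters fall out of the CvStereoBMState values. Here's a minimal C sketch of that arithmetic. The helper names (clog2_int, params_from_opencv, uniqueness_ratio) and the struct are hypothetical, not part of the core or of OpenCV, and it assumes DATAF only needs to be wide enough to hold DATAF_MAX:

```c
#include <assert.h>

/* ceil(log2(n)): smallest b such that (1 << b) >= n. */
int clog2_int(unsigned n)
{
    int bits = 0;
    assert(n > 0 && n <= (1u << 30));
    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Hypothetical container for the derived Verilog parameter values. */
struct stereobm_params {
    int dataf_max;   /* DATAF_MAX   = 2 * preFilterCap               */
    int dataf;       /* DATAF       = bits needed for 0..DATAF_MAX   */
    int disparities; /* DISPARITIES = numberOfDisparities            */
    int disp_bits;   /* DISP_BITS   = bits needed for DISPARITIES-1  */
    int sad_window;  /* SAD_WINDOW  = SADWindowSize                  */
    int texture;     /* TEXTURE     = textureThreshold               */
};

struct stereobm_params params_from_opencv(int preFilterCap, int SADWindowSize,
                                          int numberOfDisparities,
                                          int textureThreshold)
{
    struct stereobm_params p;
    assert(preFilterCap > 0 && numberOfDisparities > 0);
    p.dataf_max   = 2 * preFilterCap;
    p.dataf       = clog2_int((unsigned)p.dataf_max + 1u);
    p.disparities = numberOfDisparities;
    p.disp_bits   = clog2_int((unsigned)numberOfDisparities);
    p.sad_window  = SADWindowSize;
    p.texture     = textureThreshold;
    return p;
}

/* uniquenessRatio implied by a UNIQUE_MUL / UNIQUE_DIV pair. */
int uniqueness_ratio(int unique_mul, int unique_div)
{
    assert(unique_div > 0);
    return (unique_mul * 100) / unique_div;
}
```

For the example configuration (preFilterCap = 7, numberOfDisparities = 96) this gives DATAF_MAX = 14, DATAF = 4 and DISP_BITS = 7, and UNIQUE_MUL = 1 with UNIQUE_DIV = 4 maps back to a uniquenessRatio of 25.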
There are two major sets of configuration parameters. The first set deals with functional details (which can generally be mapped to OpenCV terms), while the second set deals more with performance/throughput-related implementation details (which don’t have an OpenCV equivalent, but can impose additional restrictions on the functional parameters):
- Functional parameters
  - Pixel size and pre-filtering
    - DATA – width, in bits, of each input pixel.
    - DATAF – width, in bits, of each filtered pixel (must be enough to represent DATAF_MAX).
    - DATAF_MAX – maximum value for filtered pixels (set to twice OpenCV’s preFilterCap; will default to (2**DATAF)-1 if not explicitly set).
  - Image size
    - IMG_WIDTH – width, in pixels, of a whole frame.
    - IMG_HEIGHT – height, in pixels, of a whole frame.
  - Disparity search space
    - DISP_BITS – bits required to represent DISPARITIES-1. Output disparity values are (DISP_BITS + SUB_BITS) bits wide.
    - DISPARITIES – number of disparity levels to search (equivalent to OpenCV’s numberOfDisparities; will default to 2**DISP_BITS if not explicitly set).
    - SAD_WINDOW – size of the sum-of-absolute-differences window (must be odd; equivalent to OpenCV’s SADWindowSize).
    - TEXTURE – texture filtering (equivalent to OpenCV’s textureThreshold).
    - SUB_BITS – number of bits used for sub-pixel interpolation results (0 to disable; must be 4 for OpenCV compatibility).
    - SUB_BITS_EXTRA – number of extra internal bits to compute for potentially increased precision when rounding sub-pixel interpolation results (0 is recommended to save a bit of logic, but must be 4 for strict OpenCV compatibility).
    - UNIQUE_MUL, UNIQUE_DIV – these control uniqueness filtering, and are approximately equivalent to OpenCV’s uniquenessRatio. The conversion between the two is: ((UNIQUE_MUL*100)/UNIQUE_DIV) == (uniquenessRatio). UNIQUE_DIV must be a power-of-2.
- Implementation parameters
  - OUT_LEFT – enable the out_left output.
  - OUT_RIGHT – enable the out_right output.
  - MULT_D – the number of disparity levels to compute in parallel. DISPARITIES must be an integer multiple of this.
  - MULT_R – the number of image rows to process in parallel. IMG_HEIGHT must be an integer multiple of this. SAD_WINDOW/2 must also be an integer multiple of this.
  - Synthesis performance tuning
    - PIPELINE_BRAM_RD – adds extra pipelining on block RAM read paths; this is recommended for most FPGA architectures.
    - PIPELINE_BRAM_WR – adds extra pipelining on block RAM write paths; this is recommended for high-frequency targets (e.g. Virtex-6, and sometimes Spartan-6).
    - PIPELINE_FANOUT – adds extra pipelining on some high-fanout paths; this is recommended for high-frequency targets (e.g. Virtex-6).
    - PIPELINE_LUT4 – this optimizes the design for FPGA architectures with 4-input LUTs (e.g. Spartan-3). You shouldn’t typically need it on newer devices with 6-input LUTs.
All of the parameters must be integers. Boolean values should use 0 for ‘false’ and 1 for ‘true’.
Some parameters which are more amenable to run-time adjustment (e.g. UNIQUE_MUL) may, in the future, be converted to input ports on the core.
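Since several of these restrictions interact, it can be handy to sanity-check a parameter set before kicking off synthesis. The following is only a sketch of the rules listed above, written as a hypothetical C helper (check_stereobm_params is not part of the core). It assumes SAD_WINDOW/2 means integer division, as Verilog would compute it, and that DATAF must be wide enough to hold DATAF_MAX:

```c
#include <assert.h>

/* Returns 0 if the parameter set passes, otherwise a small positive code
   identifying the first violated rule. */
int check_stereobm_params(int dataf, int dataf_max,
                          int disp_bits, int disparities,
                          int sad_window, int unique_div,
                          int img_height, int mult_d, int mult_r)
{
    assert(mult_d > 0 && mult_r > 0);
    assert(dataf > 0 && dataf < 31 && disp_bits >= 0 && disp_bits < 31);

    if (dataf_max >= (1 << dataf))                   return 1; /* DATAF too narrow      */
    if (disparities > (1 << disp_bits))              return 2; /* DISP_BITS too narrow  */
    if (sad_window % 2 == 0)                         return 3; /* SAD_WINDOW must be odd */
    if (unique_div <= 0 ||
        (unique_div & (unique_div - 1)) != 0)        return 4; /* UNIQUE_DIV not 2**n   */
    if (disparities % mult_d != 0)                   return 5; /* DISPARITIES vs MULT_D */
    if (img_height % mult_r != 0)                    return 6; /* IMG_HEIGHT vs MULT_R  */
    if ((sad_window / 2) % mult_r != 0)              return 7; /* SAD_WINDOW/2 vs MULT_R */
    return 0;
}
```

The example instantiation shown earlier (DISPARITIES = 96, SAD_WINDOW = 17, IMG_HEIGHT = 534, MULT_D = 8, MULT_R = 2) passes; changing MULT_D to 10 trips rule 5, since 96 is not a multiple of 10.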
Parameters – determining throughput
To set some of the core’s performance related parameters, you need to know your required throughput.
The required throughput is based (approximately) on two things: your effective pixel clock, and the number of DISPARITIES you want to search. The core must, in general, effectively compute all DISPARITIES in a single pixel clock. To do this, it needs a combination of clock-frequency advantage and parallelization. That is:
MULT_D * MULT_R * (core_clk / pixel_clk) >= DISPARITIES
Note that if you’ve enabled TEXTURE filtering, the core requires an additional processing pass, and the above equation should be amended to:
MULT_D * (MULT_R * (core_clk / pixel_clk) - 1) >= DISPARITIES
A reasonable approximation for the effective pixel clock is to simply multiply the image size by the frame-rate; for example, for the MT9V032 image sensor: 752×480 @ 60 FPS = 21.7 MHz.
In actuality, the core only needs to process approximately (IMG_WIDTH - DISPARITIES + SAD_WINDOW) pixels per row (rather than the full IMG_WIDTH). Thus, in a well-buffered system (one that can hide the dead-time resulting from the sensor’s horizontal and vertical blanking), this approximation should yield a reasonably conservative design.
For example, with DISPARITIES = 120 and SAD_WINDOW = 17, the effective pixel clock is actually much closer to (752-120+17)*480*60 = 18.7 MHz (over a 10% margin).
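Both pixel-clock estimates above are simple products, so they're easy to script when comparing sensors. A small C sketch (the function names are mine, purely illustrative):

```c
#include <assert.h>

/* Conservative estimate: every pixel of every frame, ignoring blanking. */
long long pixel_rate_full(int img_width, int img_height, int fps)
{
    assert(img_width > 0 && img_height > 0 && fps > 0);
    return (long long)img_width * img_height * fps;
}

/* Tighter estimate: only (IMG_WIDTH - DISPARITIES + SAD_WINDOW) pixels
   per row actually need to be processed. */
long long pixel_rate_active(int img_width, int img_height, int fps,
                            int disparities, int sad_window)
{
    int active = img_width - disparities + sad_window;
    assert(active > 0 && img_height > 0 && fps > 0);
    return (long long)active * img_height * fps;
}
```

For the MT9V032 numbers used here (752×480 @ 60 FPS, 120 disparities, 17×17 window), the two functions return 21,657,600 and 18,691,200 pixels per second respectively.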
Continuing with the MT9V032 example: suppose we’re targeting a low-cost Spartan-class FPGA, and expect to run at upwards of 150 MHz. We’ll call it 132 MHz (~21.7 MHz * 6), to keep the math clean. With a 6x frequency advantage, that leaves us with a deficit of 120 / 6 = 20 to make up for with parallelization. Setting MULT_D = 10 and MULT_R = 2 would satisfy that (10 * 2 = 20).
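The two throughput inequalities can be checked mechanically. Here's a hedged C sketch (throughput_ok is a hypothetical helper, not something shipped with the core); it multiplies both sides by pixel_clk so that everything stays in integer arithmetic:

```c
#include <assert.h>

/* Nonzero when
     MULT_D * (MULT_R * (core_clk / pixel_clk) - extra) >= DISPARITIES
   holds, where extra is 1 if TEXTURE filtering is enabled and 0 otherwise.
   Clock frequencies are given in Hz. */
int throughput_ok(long long core_clk_hz, long long pixel_clk_hz,
                  int mult_d, int mult_r, int disparities,
                  int texture_enabled)
{
    long long lhs, rhs;
    assert(core_clk_hz > 0 && pixel_clk_hz > 0);
    assert(mult_d > 0 && mult_r > 0 && disparities > 0);

    lhs = mult_d * (mult_r * core_clk_hz
                    - (texture_enabled ? pixel_clk_hz : 0));
    rhs = disparities * pixel_clk_hz;
    return lhs >= rhs;
}
```

With core_clk = 132 MHz, pixel_clk = 21.7 MHz, MULT_D = 10 and MULT_R = 2, the check passes for 120 disparities without TEXTURE, but fails once the extra TEXTURE pass is counted; going to MULT_D = 12 would satisfy it again.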
Working in reverse, we can estimate how many cycles the core will take to process an image:
- The core requires (DISPARITIES/MULT_D) passes to process MULT_R rows. If TEXTURE is enabled, one additional pass is required.
- Each pass takes approximately (IMG_WIDTH - DISPARITIES + SAD_WINDOW) cycles.
- Thus, processing one whole frame requires approximately this many cycles:
(IMG_HEIGHT/MULT_R) * (DISPARITIES/MULT_D) * (IMG_WIDTH - DISPARITIES + SAD_WINDOW)
For the above MT9V032 example, this works out to:
(480/2)*(120/10)*(752-120+17) = 1869120 cycles / 132 MHz = 14.2 ms
Which is just a bit shy of the MT9V032’s default frame-valid-time of 15.23 ms (indicating that we may actually be okay without mitigating vertical blanking through buffering), and a healthy margin short of the 16.7 ms implied by 60 FPS operation.
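The frame-time estimate is easy to reproduce in code. This C sketch follows the factors visible in the MT9V032 arithmetic above (row groups × passes × cycles per pass); the function names are hypothetical, and the per-pass cycle count of (IMG_WIDTH - DISPARITIES + SAD_WINDOW) is an approximation, not an exact figure for the core:

```c
#include <assert.h>

/* Approximate core clock cycles needed to process one frame. */
long long frame_cycles(int img_width, int img_height, int disparities,
                       int sad_window, int mult_d, int mult_r,
                       int texture_enabled)
{
    long long passes, row_groups, cycles_per_pass;
    assert(mult_d > 0 && mult_r > 0);
    assert(disparities % mult_d == 0 && img_height % mult_r == 0);

    passes          = disparities / mult_d + (texture_enabled ? 1 : 0);
    row_groups      = img_height / mult_r;
    cycles_per_pass = img_width - disparities + sad_window;
    return row_groups * passes * cycles_per_pass;
}

/* Convert a cycle count to milliseconds at the given core clock. */
double frame_time_ms(long long cycles, double core_clk_hz)
{
    assert(core_clk_hz > 0.0);
    return 1000.0 * (double)cycles / core_clk_hz;
}
```

For 752×480, 120 disparities, a 17×17 window, MULT_D = 10 and MULT_R = 2, frame_cycles returns 1,869,120 without TEXTURE, which is about 14.2 ms at 132 MHz. Enabling TEXTURE adds a thirteenth pass and raises the count to 2,024,880 cycles.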
Parameters – resource impact
All of these parameters have some impact on FPGA logic usage, block RAM usage, and (to a much lesser extent) operational frequency. Performing a quick trial synthesis of the core with your desired parameters is really the best way to accurately gauge resource usage. The synthesis section includes a bunch of specific examples.
That being said, here are some vague generalities to help in making trade-offs regarding parameters:
- DATA has a small impact on logic and RAM usage, since its effects are confined to the small pre-filtering block.
- DATAF has a huge impact on logic and RAM usage, since it affects large swaths of the correspondence pipeline. You should use as small a value here as you can get away with (this will be driven by the DATAF_MAX parameter, which, in OpenCV terms, is 2*preFilterCap). With a preFilterCap of 7 (output range of [0,14]), as used in the above example, only 4 bits are required for DATAF.
- IMG_WIDTH and IMG_HEIGHT have essentially zero direct impact on logic usage (but significant throughput implications).
- IMG_WIDTH (but not IMG_HEIGHT) has a large impact on RAM usage (all of the core’s internal buffers scale with IMG_WIDTH). Power-of-2 values for IMG_WIDTH will, in general, most efficiently use RAM. Unfortunately, many image sensors are just a bit above a power-of-2. While all of the front-end’s buffers are exactly IMG_WIDTH deep, the disparity block has a large set of buffers as well, which are exactly (IMG_WIDTH - (DISPARITIES-1) - (SAD_WINDOW-1)) deep. So, even if you can’t have a power-of-2 IMG_WIDTH, you may be able to find a combination of DISPARITIES and SAD_WINDOW that yields an efficiently-sized disparity buffer.
- DISP_BITS (and by extension, DISPARITIES) have only a small impact on logic and RAM usage (but significant throughput implications).
- SAD_WINDOW has a big impact on logic usage, and a bigger impact on RAM usage. It controls the size of the SAD adder trees, and the number of image rows that must be internally buffered.
- TEXTURE has essentially zero impact on logic and RAM usage, since it re-uses the SAD pipeline to do its work (this, of course, has throughput implications). If you’re already using UNIQUE_MUL, you may find that TEXTURE is redundant (in general, untextured regions should fail the uniqueness check as well).
- The other post-processing options can be surprisingly costly in terms of logic and (especially) RAM usage. When either SUB_BITS or UNIQUE_MUL is selected, the core requires extra RAM to track additional SAD values. If you enable UNIQUE_MUL, you may as well enable SUB_BITS too, since it is only a small incremental cost (the opposite case is somewhat less apparent).
- The exact values of UNIQUE_MUL and UNIQUE_DIV, once enabled, don’t play a big role in logic usage.
- SUB_BITS_EXTRA is really only required if you want 100% OpenCV compatibility. If left at 0 (but with SUB_BITS still at 4), you can achieve results that are within ±1 LSbit (±1/16th of a disparity) of the OpenCV result. You can save a couple percent on FPGA logic by leaving SUB_BITS_EXTRA at 0.
- MULT_D has a large impact on logic usage (since it results in duplication of large parts of the stereo correspondence pipeline), and no impact on RAM usage. High MULT_D values (>>10) can lead to fanout issues on the row buffer outputs. PIPELINE_FANOUT can help here.
- MULT_R is, in theory, a relatively cheap way of gaining additional throughput. If the post-processing options are disabled, this is generally true. MULT_R is best used when the SAD pipeline is large (i.e. large DATAF and/or large SAD_WINDOW), and when post-processing is not needed. With MULT_R, a small amount of logic is needed to extend the SAD tree to an extra row, and some extra RAM is needed to buffer that row. If post-processing is enabled, however, use of MULT_R can lead to significant increases in RAM and logic usage. A MULT_R of 1 or 2 is typically recommended. (If you’re using the unbuffered dlsc_stereobm_core, MULT_R also has significant system-level implications, since it requires supplying/consuming multiple rows to/from the core in parallel.)
- OUT_LEFT and OUT_RIGHT have little impact on logic, but enabling them requires some additional RAM for buffering. If you don’t have a need for the pre-filtered image data after it goes through the core, then you should set these to 0.
- The four pipelining options have a small impact on logic usage (mostly additional registers). There is no harm in enabling all of them (and this should yield the highest frequency solution, assuming a device that isn’t highly utilized), but you may be able to save some resources by only using the ones that are required to meet timing (use my recommendations as a starting point, and then run your own synthesis tests).
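Going back to the RAM-sizing note on IMG_WIDTH above: the disparity buffer depth is a simple function of three parameters, so it's cheap to explore combinations before running synthesis. The C sketch below is only illustrative; the helper names are mine, and it assumes buffers get rounded up to the next power-of-2 depth, which is a simplification of how a real tool maps RAMs:

```c
#include <assert.h>

/* Depth of the disparity block's buffers, per the formula above. */
int disparity_buffer_depth(int img_width, int disparities, int sad_window)
{
    int depth = img_width - (disparities - 1) - (sad_window - 1);
    assert(depth > 0);
    return depth;
}

/* Smallest power of two that is >= n. */
int next_pow2(int n)
{
    int p = 1;
    assert(n > 0 && n <= (1 << 30));
    while (p < n)
        p <<= 1;
    return p;
}

/* Percentage (0..100) of a power-of-2-deep RAM that the buffer uses. */
int buffer_fill_percent(int depth)
{
    return (100 * depth) / next_pow2(depth);
}
```

For IMG_WIDTH = 640, DISPARITIES = 96 and SAD_WINDOW = 17 the depth is 529, which just spills over 512 and would use roughly half of a 1024-deep RAM. Searching 128 disparities instead brings the depth down to 497, which fits in 512 (at the cost of more throughput, of course).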
The actual port-level interface to the buffered/pre-filtered wrapper is quite simple:
- core_clk – high-speed clock for the internal stereo correspondence core. This clock can be totally asynchronous to the rest of the system.
- clk – the interface/pixel clock. All inputs and outputs from the buffered wrapper are synchronous to this clock.
- rst – synchronous reset input. This synchronously resets all of the interface and core logic. A single-cycle reset pulse is sufficient.
- in_ready – handshake output indicating that the core can accept a new pair of input pixels.
- in_valid – handshake input indicating that a valid pair of pixels is currently being supplied to the core.
- in_left – left pixel input (width: DATA)
- in_right – right pixel input (width: DATA)
- out_ready – handshake input indicating that downstream logic can accept a new set of output data from the core.
- out_valid – handshake output indicating that the core is currently supplying a valid set of output data.
- out_disp – computed disparity value (width: DISP_BITS + SUB_BITS)
- out_masked – pixel was outside usable region (disparity is invalid, and out_disp has been zeroed).
- out_filtered – disparity value failed the uniqueness ratio or texture threshold check (but it’s still presented, in case you want it).
- out_left – left pixel output (width: DATAF); will be 0 if OUT_LEFT is disabled.
- out_right – right pixel output (width: DATAF); will be 0 if OUT_RIGHT is disabled.
It’s designed to be as user-friendly as possible. All of your interface logic can run at a relatively slow speed (e.g. at the pixel clock) for ease of timing closure, while the actual stereo correspondence core runs as fast as you need it to (via the separate core_clk input). All those tricky asynchronous boundary crossings are handled within the wrapper.
A brief elaboration on ready/valid handshaking, for the uninitiated:
- ready is driven by the consumer/sink/slave block.
- valid is driven by the producer/source/master (along with all qualified data).
- When ready and valid are both asserted on a clock edge, data is transferred (this is the only time a transfer can happen).
- The consumer may deassert ready to prevent data transfer (e.g. if its internal buffers are full).
- After data is transferred, the producer may supply another piece of data and leave valid asserted, or it can deassert valid to indicate that no more data is immediately available. Once data becomes available again, the producer can supply the new data and assert valid.
ARM’s AMBA-AXI bus is a good example of a high-performance interface relying on a ready/valid handshake.
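To make the handshake rules concrete, here's a tiny cycle-by-cycle software model in C. It's only a behavioral sketch, not RTL from the core: rv_fires and rv_simulate are hypothetical names, and the model assumes the producer holds each item (with valid asserted) until the consumer accepts it:

```c
#include <assert.h>
#include <stddef.h>

/* One clock edge of a ready/valid link: data moves only when both are high. */
int rv_fires(int ready, int valid)
{
    return ready && valid;
}

/* Simulate `cycles` clock edges. The producer offers src[0..n-1] in order,
   keeping valid asserted while it still has data. The consumer is ready on
   cycle c when ready_pattern[c % pattern_len] is nonzero. Accepted items are
   written to dst; the number of completed transfers is returned. */
size_t rv_simulate(const int *src, size_t n,
                   const int *ready_pattern, size_t pattern_len,
                   size_t cycles, int *dst)
{
    size_t sent = 0;
    size_t c;
    assert(pattern_len > 0);

    for (c = 0; c < cycles; c++) {
        int valid = sent < n;
        int ready = ready_pattern[c % pattern_len];
        if (rv_fires(ready, valid)) {
            dst[sent] = src[sent];
            sent++;
        }
    }
    return sent;
}
```

With a consumer that is ready only on every other cycle, three items take six cycles to move across; nothing is lost or duplicated, because a transfer only counts on edges where both signals are high.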
The unbuffered core
You’ve already seen one of the wrapper modules (dlsc_stereobm_prefiltered). There also exists another nearly-identical wrapper which omits any pre-filtering functionality: dlsc_stereobm_buffered. If you don’t want to use xsobel pre-filtering, but would still like to benefit from buffering, you should use this module. Its parameters and port-list are nearly identical, so I won’t go into detail.
The third module isn’t a wrapper at all – it’s the actual stereo correspondence core: dlsc_stereobm_core. Its interface is a little bit different (especially on the output side); here’s an instantiation:
    dlsc_stereobm_core #(
        .DATA               ( 4 ),      // input/output image bits-per-pixel
        .DATA_MAX           ( 14 ),     // 2*state->preFilterCap
        .IMG_WIDTH          ( 640 ),    // image width
        .IMG_HEIGHT         ( 534 ),    // image height
        .DISP_BITS          ( 7 ),      // output disparity image is
                                        // (DISP_BITS+SUB_BITS) bits-per-pixel
        .DISPARITIES        ( 96 ),     // state->numberOfDisparities
        .SAD_WINDOW         ( 17 ),     // state->SADWindowSize
        .TEXTURE            ( 1300 ),   // state->textureThreshold
        .SUB_BITS           ( 4 ),      // must be 4 for OpenCV compatibility
        .SUB_BITS_EXTRA     ( 4 ),      // must be 4 for OpenCV compatibility
        .UNIQUE_MUL         ( 1 ),      // state->uniquenessRatio ==
        .UNIQUE_DIV         ( 4 ),      // (UNIQUE_MUL*100)/UNIQUE_DIV
        // remaining parameters are only throughput/timing related; not functional
        .MULT_D             ( 8 ),      // process MULT_D disparities per pass
        .MULT_R             ( 2 ),      // process MULT_R rows at a time
        .PIPELINE_BRAM_RD   ( 1 ),      // synthesis performance tuning
        .PIPELINE_BRAM_WR   ( 0 ),      // ""
        .PIPELINE_FANOUT    ( 0 ),      // ""
        .PIPELINE_LUT4      ( 0 )       // ""
    ) dlsc_stereobm_core_inst (
        .clk                ( clk ),                // clock for interfaces and core
        .rst                ( rst ),                // synchronous reset
        .in_ready           ( in_ready ),           // ready/valid handshake for input
        .in_valid           ( in_valid ),           // ""
        .in_left            ( in_left ),            // left image input
        .in_right           ( in_right ),           // right image input
        .out_busy           ( out_disp_busy ),      // feedback to core requesting it temporarily halt output
                                                    // (50~100 more values will be sent *after* busy is asserted)
        .out_disp_valid     ( out_disp_valid ),     // qualifier for out_disp_ signals
        .out_disp_data      ( out_disp_data ),      // disparity image output
        .out_disp_masked    ( out_disp_masked ),    // disparity outside of valid area
        .out_disp_filtered  ( out_disp_filtered ),  // disparity filtered by post-proc
        .out_img_valid      ( out_img_valid ),      // qualifier for out_img_ signals
        .out_img_left       ( out_img_left ),       // left image output
        .out_img_right      ( out_img_right )       // right image output
    );
There are a few major differences between the core and its various buffered wrappers:
- There’s only one clock input now. clk clocks both the core and all of the interface signals.
- The outputs no longer use a two-way ready/valid handshake (the input still does, though). Instead, the output is split into two groups of signals: the disparity output (prefixed with out_disp_) and the image output (prefixed with out_img_). Each group is qualified with a unidirectional valid signal. In operation, you’ll typically see the image output lead the disparity output by 50~100 pixels (the time required for those pixels to traverse the stereo correspondence pipeline). When the disparity output is masked, this delay will be significantly less (as masked pixels don’t traverse the pipeline).
- Lacking a ready/valid handshake, the only way to throttle the core’s output is via the out_busy feedback input. When this input is asserted, the frontend in the core will stop sending pixels down the pipeline. Pixels already in the pipeline, however, will continue to be output. Due to the pipeline’s depth, a significant number (50~100) of pixels will exit the disparity output even after out_busy is asserted; downstream logic must be able to tolerate this.
- If MULT_R > 1, then all of the core’s ports (excluding clocks, resets and handshake/qualifier signals) become wider to accommodate transfer of multiple rows in parallel.
- Without buffering, the core’s inputs and outputs are very “bursty” (brief periods of significant activity; idle the rest of the time). The core takes DISPARITIES / MULT_D passes to process each set of MULT_R rows. It only transfers data on the last pass. Any throttling of the core’s interfaces (via deasserting in_valid or asserting out_busy) will result in reduced pipeline utilization, so you must make sure your interface logic can cope with these bursty transfers.
In most cases, you’ll probably encounter fewer issues by just using one of the buffered wrappers.
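If you do drive the unbuffered core directly, the out_busy behavior described above means your output FIFO needs headroom: busy has to go high while there is still room for everything already in flight. A rough C sketch of that threshold calculation follows. The names are hypothetical, the 100-word skid is just the upper end of the 50~100 figure quoted above, and it assumes at most one word is written per cycle and ignores any extra latency in your own busy logic:

```c
#include <assert.h>

/* Fill level at which out_busy must be asserted so that `max_skid`
   further words still fit in a FIFO of `fifo_depth` words. */
int busy_threshold(int fifo_depth, int max_skid)
{
    assert(max_skid >= 0 && fifo_depth > max_skid);
    return fifo_depth - max_skid;
}

/* Value to drive onto out_busy for the current FIFO fill level. */
int out_busy(int fill, int fifo_depth, int max_skid)
{
    return fill >= busy_threshold(fifo_depth, max_skid);
}
```

For a 256-deep FIFO and a 100-word skid, busy asserts once 156 words are queued; even if nothing is read while the pipeline drains, the fill level tops out at 256.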
Synthesis and Place & Route
Synthesizing the core is straightforward. For typical designs, it achieves timing closure with minimal effort and few changes to synthesis options.
The design has been tested in Xilinx ISE 13.1 and Altera Quartus 11.0.
A complete list of files (Verilog modules and includes) needed by the pre-filtered core:
    alu/rtl/dlsc_absdiff.v
    alu/rtl/dlsc_adder_tree.v
    alu/rtl/dlsc_compex.v
    alu/rtl/dlsc_divu.v
    alu/rtl/dlsc_min_tree.v
    alu/rtl/dlsc_multu.v
    common/rtl/dlsc_clog2.vh
    common/rtl/dlsc_synthesis.vh
    mem/rtl/dlsc_fifo_shiftreg.v
    mem/rtl/dlsc_pipedelay.v
    mem/rtl/dlsc_pipedelay_clken.v
    mem/rtl/dlsc_pipedelay_rst.v
    mem/rtl/dlsc_pipedelay_valid.v
    mem/rtl/dlsc_pipereg.v
    mem/rtl/dlsc_ram_dp.v
    mem/rtl/dlsc_ram_dp_slice.v
    mem/rtl/dlsc_shiftreg.v
    rvh/rtl/dlsc_rowbuffer.v
    rvh/rtl/dlsc_rowbuffer_combiner.v
    rvh/rtl/dlsc_rowbuffer_splitter.v
    rvh/rtl/dlsc_rvh_decoupler.v
    stereo/rtl/dlsc_stereobm_backend.v
    stereo/rtl/dlsc_stereobm_buffered.v
    stereo/rtl/dlsc_stereobm_core.v
    stereo/rtl/dlsc_stereobm_disparity.v
    stereo/rtl/dlsc_stereobm_disparity_slice.v
    stereo/rtl/dlsc_stereobm_frontend.v
    stereo/rtl/dlsc_stereobm_frontend_control.v
    stereo/rtl/dlsc_stereobm_multipipe.v
    stereo/rtl/dlsc_stereobm_pipe.v
    stereo/rtl/dlsc_stereobm_pipe_accumulator.v
    stereo/rtl/dlsc_stereobm_pipe_accumulator_slice.v
    stereo/rtl/dlsc_stereobm_pipe_adder.v
    stereo/rtl/dlsc_stereobm_pipe_adder_slice.v
    stereo/rtl/dlsc_stereobm_postprocess.v
    stereo/rtl/dlsc_stereobm_postprocess_subpixel.v
    stereo/rtl/dlsc_stereobm_postprocess_uniqueness.v
    stereo/rtl/dlsc_stereobm_prefiltered.v
    stereo/rtl/dlsc_xsobel_core.v
    sync/rtl/dlsc_domaincross.v
    sync/rtl/dlsc_domaincross_slice.v
    sync/rtl/dlsc_rstsync.v
    sync/rtl/dlsc_syncflop.v
    sync/rtl/dlsc_syncflop_slice.v
Two of those files (the .vh files in common/rtl/) are `include files; you’ll need to ensure your synthesis tool’s include path can find them (most simulators use “+incdir”; XST uses the “-vlgincdir” option; I’ve yet to find documentation on what the Altera equivalent is).
You should also set two global `defines (one of which selects the vendor: XILINX or ALTERA). Those aren’t strictly required, but they will improve timing/performance. They enable additional Verilog metacomment synthesis directives embedded throughout the design (on Xilinx devices anyway; I have not performed this level of optimization for Altera devices yet, but it’s on the to-do list). All of the metacomments are defined in a central place (dlsc_synthesis.vh), and referenced via `defines (to allow for swapping compatible Altera directives for Xilinx directives; hence the vendor-specific define).
For Xilinx, you should ensure that timing-driven mapping is enabled (enabled by default for newer 6-series devices; optional for older ones). If it’s not enabled, you may run into problems with slow, high-fanout paths (e.g. on the reset net). The core contains register duplication directives to help with fanout issues, but they won’t work without timing-driven mapping.
If your synthesis tool doesn’t support constant functions, but does support $clog2, you should also define USE_CLOG2 (this controls what method is used by dlsc_clog2.vh to implement a clog2 function). Both ISE/XST and Quartus work fine with the default.
Both of the clock nets should be constrained somewhere in your design. Here’s an example UCF file I’ve been using for synthesis testing (with the wrapper as the top-level module):
    NET "clk" TNM_NET = clk;
    TIMESPEC TS_clk = PERIOD "clk" 50 MHz HIGH 50%;
    NET "core_clk" TNM_NET = core_clk;
    TIMESPEC TS_core_clk = PERIOD "core_clk" 150 MHz HIGH 50%;
The asynchronous boundary crossings inside the modules in sync/rtl/ may require special false-path constraints. For Xilinx targets, this should already be handled by the aforementioned Verilog metacomments (provided you’ve `defined XILINX). For other targets (Altera), I haven’t yet taken care of this.
For all of the examples presented here, dlsc_stereobm_prefiltered was synthesized as the top-level module. The resource utilization numbers are for a completely placed and routed design with clean timing results. Timing-driven mapping was always used (for Xilinx); other synthesis and place&route options were left at their defaults. ISE 13.1 was used for all Xilinx examples. Quartus 11.0 was used for the lone Altera example.
These examples don’t necessarily represent optimal instantiations; there may be other combinations of parameters (especially of MULT_D and MULT_R) that yield more efficient implementations. Running a trial synthesis and place&route operation is easy and relatively quick, so I'd encourage you to experiment with other configurations.
- 320x240 @ 30 FPS in Spartan-3E 250
- 320x240 @ 120 FPS in Spartan-3E 250
- 640x480 @ 30 FPS in Spartan-3E 500
- 800x480 @ 60 FPS in Spartan-6 LX25
- 800x480 @ 60 FPS in Spartan-3E 1200
- 800x480 @ 60 FPS in Cyclone IV EP4CE22
- 1920x1080 @ 30 FPS in Spartan-6 LX75
320x240 QVGA @ 30 FPS in a Spartan-3E 250
We'll start with a small example, and move up from there: Quarter-VGA (320x240) at just 30 FPS - less than a 2.5 MHz effective pixel clock.
We want to search 48 disparities (15% of the width) with a SAD window of 11x11.
2.5 MHz is slow enough that we can easily handle it without any parallelization: 48 disparities * 2.5 MHz = 120 MHz.
This may be low resolution, but we still want most of the bells-and-whistles: texture filtering, uniqueness-ratio checking and sub-pixel interpolation. We have no interest in using the xsobel filtered image data after the pipeline.
Thus, we instantiate with these parameters:
    dlsc_stereobm_prefiltered #(
        .DATA               ( 8 ),
        .DATAF              ( 4 ),
        .DATAF_MAX          ( 14 ),
        .IMG_WIDTH          ( 320 ),
        .IMG_HEIGHT         ( 240 ),
        .DISP_BITS          ( 6 ),
        .DISPARITIES        ( 48 ),
        .SAD_WINDOW         ( 11 ),
        .TEXTURE            ( 500 ),
        .SUB_BITS           ( 4 ),
        .SUB_BITS_EXTRA     ( 1 ),
        .UNIQUE_MUL         ( 1 ),
        .UNIQUE_DIV         ( 4 ),
        .OUT_LEFT           ( 0 ),
        .OUT_RIGHT          ( 0 ),
        .MULT_D             ( 1 ),
        .MULT_R             ( 1 ),
        .PIPELINE_BRAM_RD   ( 1 ),
        .PIPELINE_BRAM_WR   ( 0 ),
        .PIPELINE_FANOUT    ( 0 ),
        .PIPELINE_LUT4      ( 1 )
    )
Targeting the 2nd-smallest Spartan-3E (an XC3S250E-4VQ100), we get:
An easy fit with lots of logic to spare. RAM is a bit tight, however. On the lower end of things, the core tends to be block-RAM limited, rather than logic limited.
320x240 QVGA @ 120 FPS in a Spartan-3E 250
So, obviously, we need to pack some more logic into the design. Quadrupling the frame-rate ought to do it: 320x240 @ 120 FPS, for an effective pixel clock of around 9.2 MHz.
We'll up the core clock to 160 MHz and set MULT_D = 3 to achieve the required throughput. Remember, with TEXTURE enabled, we're trying to satisfy the equation:
MULT_D * (MULT_R * (core_clk / pixel_clk) - 1) >= DISPARITIES
..which we do:
3 * (1 * (160/9.2) - 1) = 49.1 >= 48
The only difference in the instantiation is changing MULT_D from 1 to 3.
Targeting the same Spartan-3E 250, we get:
Excellent. The core is very low latency (on the order of
SAD_WINDOW rows of delay; an image rectification front-end will contribute more), so it might make sense to take a high-framerate configuration like this and use it in a hard-realtime control system (perhaps for a quadrotor UAV..).
And there's still some logic to spare. The RAM problem, of course, persists; once we try to add some additional functions (like image rectification), we may have to upgrade to a slightly larger device (e.g. a Spartan-3E 500).
640x480 VGA @ 30 FPS in a Spartan-3E 500
We've now upgraded to a 500E; why stop at QVGA?
The source is now 640x480 @ 30 FPS (but with the same 9.2 MHz effective pixel clock). Now, however, we want to search 96 disparities (still 15%) with a larger SAD window of 15x15.
We'll set the core clock to 150 MHz (~9.2 * 16). With a 16x frequency advantage, we still need 96 / 16 = 6 parallel pipelines; set MULT_D = 6 and leave MULT_R = 1. Resultant parameters are:
dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),
    .DATAF            ( 4 ),
    .DATAF_MAX        ( 14 ),
    .IMG_WIDTH        ( 640 ),
    .IMG_HEIGHT       ( 480 ),
    .DISP_BITS        ( 7 ),
    .DISPARITIES      ( 96 ),
    .SAD_WINDOW       ( 15 ),
    .TEXTURE          ( 0 ),
    .SUB_BITS         ( 4 ),
    .SUB_BITS_EXTRA   ( 4 ),
    .UNIQUE_MUL       ( 1 ),
    .UNIQUE_DIV       ( 4 ),
    .OUT_LEFT         ( 0 ),
    .OUT_RIGHT        ( 0 ),
    .MULT_D           ( 6 ),
    .MULT_R           ( 1 ),
    .PIPELINE_BRAM_RD ( 1 ),
    .PIPELINE_BRAM_WR ( 0 ),
    .PIPELINE_FANOUT  ( 0 ),
    .PIPELINE_LUT4    ( 1 )
)
This fits nicely into a small Spartan-3E 500 (XC3S500E-4VQ100):
We probably even have enough space leftover to fit a complete system. Nice.
If we wind up too short on RAM, we can drop the resolution by 20% to 512x480; this saves a lot of memory (powers-of-2 and all that..):
800x480 @ 60 FPS in Spartan-6 LX25
Higher resolution is nice; so is frame-rate. Maybe we want both: 800x480 @ 60 FPS. (similar to the output of an MT9V032 image sensor, as seen in the previous throughput example).
That's around a 23 MHz pixel clock. We'll now be searching 120 disparities (15%) with a SAD window of 17x17.
Our pixel clock is now nearly 10x that of the original 30 FPS QVGA example, and our search space has increased by 150% (that's around 25x the overall compute requirement). Parallelization is a must. We'll target a newer Spartan-6 device, and figure on an easy-to-route 138 MHz core clock (6 * 23 MHz). That leaves us with a 20x throughput deficit (120 disparities / 6 = 20). MULT_D = 10 and MULT_R = 2 will fix that.
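The "throughput deficit" arithmetic can be written down as a small helper. This is a sketch under stated assumptions: TEXTURE is disabled (so the requirement reduces to MULT_D * MULT_R * (core_clk / pixel_clk) >= DISPARITIES), the nominal 23 MHz pixel clock from the text is used, and the function names are hypothetical.

```python
def required_parallelism(disparities, core_clk_hz, pixel_clk_hz):
    """Smallest MULT_D * MULT_R product that keeps up with the input.

    Assumes TEXTURE is disabled. Integer ceiling division is used so that
    floating-point rounding cannot tip the result either way.
    """
    work = disparities * pixel_clk_hz
    return -(-work // core_clk_hz)


def factor_pairs(product):
    """All (MULT_D, MULT_R) pairs whose product is exactly `product`."""
    return [(d, product // d) for d in range(1, product + 1) if product % d == 0]


# WVGA @ 60 FPS example with a 138 MHz core clock.
needed = required_parallelism(120, core_clk_hz=138_000_000, pixel_clk_hz=23_000_000)
print(needed)                 # 20
print(factor_pairs(needed))   # includes (10, 2)
```

Not every factor pair is usable: the parameter restrictions on MULT_D and MULT_R still apply, and the split between the two affects resource usage differently.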
Again: most of the bells-and-whistles, but we're willing to drop texture filtering (relying only on uniqueness-ratio checking to catch poor results). Resultant parameters are:
dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),
    .DATAF            ( 4 ),
    .DATAF_MAX        ( 14 ),
    .IMG_WIDTH        ( 800 ),
    .IMG_HEIGHT       ( 480 ),
    .DISP_BITS        ( 7 ),
    .DISPARITIES      ( 120 ),
    .SAD_WINDOW       ( 17 ),
    .TEXTURE          ( 0 ),
    .SUB_BITS         ( 4 ),
    .SUB_BITS_EXTRA   ( 4 ),
    .UNIQUE_MUL       ( 1 ),
    .UNIQUE_DIV       ( 4 ),
    .OUT_LEFT         ( 0 ),
    .OUT_RIGHT        ( 0 ),
    .MULT_D           ( 10 ),
    .MULT_R           ( 2 ),
    .PIPELINE_BRAM_RD ( 1 ),
    .PIPELINE_BRAM_WR ( 0 ),
    .PIPELINE_FANOUT  ( 0 ),
    .PIPELINE_LUT4    ( 0 )
)
Targeting a Spartan-6 LX25 (XC6SLX25-2FTG256), we get:
Xilinx's synthesis tool seems to have a lot of trouble estimating how much logic this design takes:
Found area constraint ratio of 100 (+ 5) on block dlsc_stereobm_prefiltered, actual ratio is 169.
Optimizing block <dlsc_stereobm_prefiltered> to meet ratio 100 (+ 5) of 3758 slices :
WARNING:Xst:2254 - Area constraint could not be met for block <dlsc_stereobm_prefiltered>, final ratio is 165.
70% slice usage is relatively high, but it's a far cry from 170%! I haven't yet tracked down the source of this estimate; my current theory is that it isn't taking into account the savings from mapping shift-registers into LUTs (the core uses a considerable number of them for pipeline delay matching). Thankfully, this warning doesn't appear to impact the final results.
The first prototype of my complete stereo vision system will eventually wind up on a Xilinx SP605 board - which houses a Spartan-6 LX45 (the next step up from the LX25). Given that this 60 FPS WVGA example represents my design goal for the system, I should have plenty of FPGA logic leftover for exploring other features (maybe a multi-baseline setup with 3 cameras..).
Going in the opposite direction, this configuration will actually fit in an even smaller Spartan-6 LX16 device (XC6SLX16-2FTG256):
But you'd have a heck of a time putting any other logic in there with it!
800x480 @ 60 FPS in Spartan-3E 1200
Here are the results for the same design (but with PIPELINE_LUT4 set) in an older Spartan-3E 1200 (XC3S1200E-4FT256):
800x480 @ 60 FPS in Cyclone IV EP4CE22
And once more, but so-as to not be too Xilinx-biased, we'll run it in an Altera Cyclone IV E part (EP4CE22F17C8):
The register usage is very similar to that of the Spartan-3E, but the logic usage is significantly higher. The Cyclone IV's LEs should be very comparable to the Spartan-3E's LUTs, so this is a bit perplexing. It's possible that this is due to a less efficient (or non-existent) shift-register implementation in the Cyclone IV LEs (Altera's low-end devices have historically lacked the ability to use LEs as small RAMs - in contrast to Xilinx's distributed RAM and SRL16s).
Again, I haven't performed any Altera-specific optimization yet. Despite this, the core makes it through Altera's implementation flow without any modifications, and achieves good performance - that's the benefit of using inference to create device-agnostic designs!
1920x1080 @ 30 FPS in Spartan-6 LX75
VGA (wide or otherwise) is a bit old-fashioned; let's try something with higher.. definition: 1920x1080 @ 30 FPS (1080p30).
We'll spec a search space of 300 disparities with a SAD window of 25x25.
1920x1080 @ 30 FPS is around a 60 MHz effective pixel clock. We're still in a Spartan-6, so we'll limit ourselves to a 150 MHz core clock (2.5 * 60). That leaves us with 300 / 2.5 = 120 disparities to process in parallel. That's quite a lot, but I think we can manage it (in fact, I know we can, since I've already written the results). MULT_D = 30 and MULT_R = 4 works out nicely.
Just to double-check some of the parameter restrictions: SAD_WINDOW/2 = 25/2 = 12 (integer division), which is a multiple of MULT_R = 4; IMG_HEIGHT = 1080 is a multiple as well. DISPARITIES = 300 is a multiple of MULT_D = 30. All is well.
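These checks are mechanical enough to script before kicking off a long synthesis run. The helper below is a hypothetical sketch that covers only the three restrictions mentioned here; the core may well impose others.

```python
def check_restrictions(sad_window, img_height, disparities, mult_d, mult_r):
    """Return a list of violated restrictions (an empty list means none found).

    Only the three rules spelled out in the text are checked.
    """
    problems = []
    if (sad_window // 2) % mult_r != 0:      # integer division, as in the text
        problems.append("SAD_WINDOW/2 is not a multiple of MULT_R")
    if img_height % mult_r != 0:
        problems.append("IMG_HEIGHT is not a multiple of MULT_R")
    if disparities % mult_d != 0:
        problems.append("DISPARITIES is not a multiple of MULT_D")
    return problems


# The 1080p30 configuration passes all three checks.
print(check_restrictions(25, 1080, 300, mult_d=30, mult_r=4))
# Changing MULT_R to 5 breaks the SAD_WINDOW/2 rule (12 is not a multiple of 5).
print(check_restrictions(25, 1080, 300, mult_d=30, mult_r=5))
```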
We expect the device to be pretty full, so we'll skip strict sub-pixel compatibility with OpenCV. We'll also omit texture filtering, since it would be a huge waste of resources (29 of those 30 parallel pipes would be idle on the final texture pass; and we'd have to up the clock frequency to compensate; this may be fixed in the future).
This yields the parameters:
dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),
    .DATAF            ( 4 ),
    .DATAF_MAX        ( 14 ),
    .IMG_WIDTH        ( 1920 ),
    .IMG_HEIGHT       ( 1080 ),
    .DISP_BITS        ( 9 ),
    .DISPARITIES      ( 300 ),
    .SAD_WINDOW       ( 25 ),
    .TEXTURE          ( 0 ),
    .SUB_BITS         ( 4 ),
    .SUB_BITS_EXTRA   ( 0 ),
    .UNIQUE_MUL       ( 1 ),
    .UNIQUE_DIV       ( 4 ),
    .OUT_LEFT         ( 0 ),
    .OUT_RIGHT        ( 0 ),
    .MULT_D           ( 30 ),
    .MULT_R           ( 4 ),
    .PIPELINE_BRAM_RD ( 1 ),
    .PIPELINE_BRAM_WR ( 0 ),
    .PIPELINE_FANOUT  ( 1 ),
    .PIPELINE_LUT4    ( 0 )
)
Targeting a Spartan-6 LX75 (XC6SLX75-3FGG484), we get:
That's pretty heavily utilized. As a result, we had to upgrade to a -3 part to achieve timing closure (thankfully, the price premium on faster Spartan-6's isn't anywhere near as hefty as with Virtex devices). It may be prudent to upgrade to a larger (but slower speed grade) device. (I'd have used an LX100 or LX150 here, but Xilinx's free ISE Webpack license precludes such things).
Interestingly, this design is around 6.8x the compute requirement of the previous 60 FPS WVGA example (2.7x for pixel rate times 2.5x for search-space), but we're managing to fit in a device that's only 3-4.5x the size. Scaling is rarely linear.
What if we, for whatever reason, decided that we didn't need any post-processing? Setting SUB_BITS = 0 and UNIQUE_MUL = 0 (TEXTURE is already disabled) and resynthesizing, we get:
That's a big savings, but not one that you're likely to want to take advantage of (un-post-processed disparity maps leave something to be desired).
I won't be presenting any specific examples targeting higher-end devices (e.g. a Virtex-6), as this core (by itself) really doesn't make good use of such a potent device (the core has no use for DSP blocks, nor for gigabit transceivers).
It's certainly possible to contrive an example that requires the resources of a large Virtex-6 (say, Quad-HD 3840x2160 @ 30 FPS - a 250 MHz pixel clock), but I wouldn't recommend using this core in that scenario; the brute-force block-matching algorithm employed for this module (and by OpenCV) is well-suited to relatively low-resolution inputs, but becomes less-and-less practical as resolutions and search-spaces are increased.
A more efficient algorithm would not need to exhaustively search every possible disparity. One very simple approach (if one didn't want to develop an entirely new algorithm) would be to apply the block-matching algorithm to a sub-sampled version of the source, and use the results of that as a starting point for an optimization algorithm that runs on the full-resolution images. That's a bit beyond the immediate scope of this project.
Now, if you're already using a Virtex-6 for other image-processing applications, and just want to add some stereo processing: I think you'll find that a ~300 MHz stereo vision core can process a lot of pixels for not a lot of logic usage!
Inside the Core
The core itself is broken into 5-7 major pieces (depending on which (if any) wrapper you're using):
dlsc_xsobel_core is an independent pipeline block included by the pre-filtered wrapper. It performs the same pre-filtering as OpenCV's prefilterXSobel function (but only if you have SSE2 disabled; OpenCV generates slightly different results with SSE2 enabled), which is invoked when you use the CV_STEREO_BM_XSOBEL pre-filtering option.
Since the xsobel filter requires 3 rows of image data to produce a single row of output, the xsobel core includes enough RAM to buffer 2 rows of incoming data. The 3rd row comes straight from the input without buffering.
The actual filtering operation is fully pipelined and can (almost) handle 1 pixel per clock cycle. Due to the filter requiring a 3x3 window around a given output pixel, actual throughput is slightly lower: the xsobel core spends IMG_WIDTH + 1 cycles on each row, and must run for IMG_HEIGHT + 1 rows to complete an entire frame.
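To put a number on "slightly lower", here is a short calculation of the per-frame cycle count and the resulting average pixels per cycle. The function names are made up for this example, and the figures ignore any input or output throttling.

```python
def xsobel_cycles_per_frame(img_width, img_height):
    """Clock cycles the xsobel core spends on one frame."""
    return (img_width + 1) * (img_height + 1)


def xsobel_pixels_per_cycle(img_width, img_height):
    """Average throughput: image pixels divided by cycles spent on the frame."""
    return (img_width * img_height) / xsobel_cycles_per_frame(img_width, img_height)


for w, h in [(320, 240), (800, 480), (1920, 1080)]:
    print(f"{w}x{h}: {xsobel_cycles_per_frame(w, h)} cycles/frame, "
          f"{xsobel_pixels_per_cycle(w, h):.4f} pixels/cycle")
```

The overhead shrinks as the image grows, since the extra column and row are amortized over more pixels.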
dlsc_stereobm_frontend is the first pipeline component in the stereo correspondence core. It's responsible for buffering enough incoming image rows to create an entire sum-of-absolute-differences window, and for sending needed pixels down the rest of the pipeline.
The front-end buffers (SAD_WINDOW + MULT_R - 1) rows each for the left and right image inputs. When the row buffers are full, the front-end begins sending pixels down the correspondence pipeline. It simultaneously sends the corresponding image data to the back-end for output.
It's designed with throughput in mind, and can keep the pipeline 100% utilized once the initial buffering phase is completed (assuming no input or output throttling, of course). Currently, the front-end is able to overlap the initial buffering phase of the next frame with the processing of the final rows for the current frame (it will only do this if the input can immediately supply data for the next frame). This can further improve pipeline utilization.
The front-end makes (DISPARITIES / MULT_D) + (TEXTURE ? 1 : 0) passes over a given set of MULT_R rows. On each pass, it sends the same left-image data down the pipeline. The right-image data is different for each pass, to enable computing the SAD values for different disparity levels.
On the final pass for a given set of rows, the front-end shuffles all of the buffered rows down and loads a new set of input rows onto the top of the buffers. This is performed in conjunction with sending the final pass' data down the pipeline; no pipeline cycles are wasted for loading/shuffling data.
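The buffer depth and pass count fall straight out of the parameters. The sketch below computes both; the order in which passes cover the disparity range (lowest disparities first) is an assumption made for illustration, and frontend_plan is a hypothetical name.

```python
def frontend_plan(sad_window, disparities, mult_d, mult_r, texture):
    """Row-buffer depth and pass schedule for one set of MULT_R rows."""
    buffered_rows = sad_window + mult_r - 1          # per image (left and right)
    disparity_passes = disparities // mult_d
    total_passes = disparity_passes + (1 if texture else 0)
    # Assumed ordering: pass p covers disparities p*MULT_D .. p*MULT_D + MULT_D - 1.
    schedule = [list(range(p * mult_d, (p + 1) * mult_d))
                for p in range(disparity_passes)]
    return buffered_rows, total_passes, schedule


# QVGA @ 120 FPS example: SAD_WINDOW = 11, DISPARITIES = 48, MULT_D = 3, TEXTURE on.
rows, passes, schedule = frontend_plan(11, 48, mult_d=3, mult_r=1, texture=True)
print(rows, passes)        # 11 17
print(schedule[0])         # [0, 1, 2]
print(schedule[-1])        # [45, 46, 47]
```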
dlsc_stereobm_multipipe is the heart of the block-matching stereo correspondence algorithm. Well, actually, dlsc_stereobm_pipe is; the multi-pipe merely instantiates MULT_D instances of the SAD pipeline, in order to enable processing multiple disparity levels in parallel.
The multi-pipe processes MULT_D adjacent disparities simultaneously from a single stream of left/right input image data. The left image data is sent to all SAD pipelines, while the right image data is cascaded through each pipeline to the adjoining pipeline (with a single cycle delay on each hop). This cascading causes each pipeline to be operating on a different disparity level (1 different than its neighbor).
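The cascading can be pictured as a shift register on the right-image stream. The following behavioural sketch (not the Verilog, and with a made-up function name) shows which left/right pixel pair each pipeline sees on each cycle; it ignores any register stages that the real design applies equally to both streams.

```python
from collections import deque


def multipipe_pairs(left, right, mult_d):
    """Yield, per cycle, the (left, right) pixel pair seen by each pipeline.

    Pipeline k sees the right-image stream delayed by k cycles. Samples from
    before the start of the row read as None.
    """
    chain = deque([None] * mult_d, maxlen=mult_d)
    for l_pix, r_pix in zip(left, right):
        chain.appendleft(r_pix)              # newest pixel enters pipeline 0
        yield [(l_pix, chain[k]) for k in range(mult_d)]


left = list(range(100, 108))    # stand-in pixel values, one row
right = list(range(200, 208))
cycles = list(multipipe_pairs(left, right, mult_d=3))
print(cycles[5])   # [(105, 205), (105, 204), (105, 203)]
```

On cycle 5, pipeline k pairs left pixel 5 with right pixel 5 - k, which is exactly a disparity offset of k.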
Each SAD pipeline comprises three note-worthy components: an absolute-differences block, an adder tree, and a window accumulator. In conjunction, these blocks enable the pipeline to implement a sum-of-absolute-differences function over a SAD_WINDOW x SAD_WINDOW window with an output rate of 1 pixel per cycle.
The absolute-differences block (as the name implies) merely computes the absolute difference between each pair of left/right input pixels.
The adder tree takes the resultant column of SAD_WINDOW absolute differences and sums them into a single value. This value is still only a single column out of the overall SAD window.
The SAD window accumulator takes SAD columns and (again, as the name implies) accumulates them into a complete SAD window. In operation, the window slides laterally along the image at 1 pixel per cycle. Rather than re-compute the entire window each cycle, the accumulator keeps track of the previous SAD_WINDOW column sums. By accumulating new column sums, and subtracting old ones that are falling outside the window, it creates the complete SAD window with minimal effort.
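Here is a software model of those three stages for a single disparity, checked against a brute-force recomputation. It is a behavioural sketch only: the function names are hypothetical, the right-image rows are assumed to be pre-shifted for the disparity under test, and no attempt is made to mirror the hardware's pipelining.

```python
def sad_row_sliding(left_rows, right_rows, window):
    """SAD for every full window position along one output row.

    left_rows / right_rows hold `window` image rows (lists of ints).
    """
    width = len(left_rows[0])
    # Absolute differences plus adder tree: one sum per image column.
    column_sums = [
        sum(abs(l[x] - r[x]) for l, r in zip(left_rows, right_rows))
        for x in range(width)
    ]
    results = []
    acc = 0
    for x, col in enumerate(column_sums):
        acc += col                              # column entering the window
        if x >= window:
            acc -= column_sums[x - window]      # column leaving the window
        if x >= window - 1:
            results.append(acc)                 # window is fully populated
    return results


def sad_row_brute_force(left_rows, right_rows, window):
    """Reference: recompute the whole window at every position."""
    width = len(left_rows[0])
    return [
        sum(abs(l[x + dx] - r[x + dx])
            for l, r in zip(left_rows, right_rows) for dx in range(window))
        for x in range(width - window + 1)
    ]


# Small deterministic test pattern: 5 rows of 12 pixels, 5x5 window.
left = [[(3 * x + 7 * y) % 16 for x in range(12)] for y in range(5)]
right = [[(5 * x + 2 * y + 1) % 16 for x in range(12)] for y in range(5)]
print(sad_row_sliding(left, right, 5) == sad_row_brute_force(left, right, 5))
```

The sliding version does one addition and one subtraction per output, regardless of window width, which is the whole point of the accumulator.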
When processing multiple rows in parallel (MULT_R > 1), the SAD pipeline is able to share significant resources between each row.
The outputs of each SAD pipeline in the multi-pipe are sent directly to the next stage in the overall pipeline, without first finding the "best" of the MULT_D SAD/disparity values.
The core of the SAD pipeline closely resembles the diagram that I presented way back in my original post:
(as it would be configured when SAD_WINDOW = 5 and MULT_R = 1)
The left and right row buffers live in the previously described front-end; the SAD tree and accumulator live in this block; and the disparity buffers live in the disparity block:
dlsc_stereobm_disparity has become somewhat of a "catch-all" for reconciling SAD pipeline outputs. Originally, before I added support for several of OpenCV's post-processing operations (texture filtering, uniqueness filtering and sub-pixel interpolation), the disparity block was only responsible for comparing the results from the current pass against the results from a previous pass and finding the best result. Now it does rather more.
The pipeline takes multiple passes to process a given row, and each pass only covers a fraction of the overall disparity values. For this reason, the disparity block is responsible for maintaining state across multiple passes for an entire row. Only on the final pass is the disparity block able to produce a complete output.
Given the MULT_D SAD/disparity pairs from the multi-pipe, and the saved "best" values from previous passes, the disparity block finds:
- The best overall SAD/disparity.
- The 2nd-best SAD/disparity (excluding ones immediately adjacent to the current best disparity); this is used for uniqueness-ratio checking.
- The SAD values for the disparities immediately adjacent to the current best disparity; this is used for sub-pixel interpolation.
The logic to do this is somewhat involved, requiring two many-input comparator trees and a bunch of control/masking logic.
The disparity block also performs the "filtering" part of the texture-filtering operation (the SAD pipeline is re-used to compute the "texture" value); this is merely a comparison operation.
On the final pass, the disparity block sends all of these results down the pipeline.
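Functionally, the three results can be described with a few lines of software. The sketch below looks at one pixel's complete list of SAD values at once, whereas the real block has to reach the same answer incrementally, MULT_D values per pass. The function name is hypothetical, and resolving ties in favour of the lower disparity is an assumption of this sketch.

```python
def reconcile(sads):
    """Pick the values the disparity block needs from one pixel's SAD list.

    sads[d] is the SAD at disparity d. Ties go to the lower disparity
    (an assumption of this sketch).
    """
    best_d = min(range(len(sads)), key=lambda d: sads[d])
    # 2nd-best, ignoring the winner and its immediate neighbours.
    others = [s for d, s in enumerate(sads) if abs(d - best_d) > 1]
    second_best = min(others) if others else None
    # Neighbouring SADs for sub-pixel interpolation (None at the range edges).
    sad_lo = sads[best_d - 1] if best_d > 0 else None
    sad_hi = sads[best_d + 1] if best_d + 1 < len(sads) else None
    return {"best_d": best_d, "best_sad": sads[best_d],
            "second_best_sad": second_best, "sad_lo": sad_lo, "sad_hi": sad_hi}


result = reconcile([90, 40, 12, 15, 60, 14, 70])
print(result)
```

In this example the SAD of 15 at disparity 3 is skipped when looking for the 2nd-best value, because it sits right next to the winner at disparity 2; the 14 at disparity 5 is used instead.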
dlsc_stereobm_postprocess implements two optional post-processing operations: sub-pixel interpolation and uniqueness-ratio filtering.
Sub-pixel interpolation uses the SAD values of the two disparities adjacent to the winning disparity in order to compute a more accurate final disparity. In the OpenCV implementation, this requires a small (8-bit result) divider; this divider is implemented in a pipelined fashion in the core (comprising 8 subtractors and some muxing logic). If you forgo strict OpenCV-compatibility and set SUB_BITS_EXTRA to 0, this shrinks to 4-bits (4 subtractors).
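For reference, here is a fixed-point sketch of the interpolation step. It follows the general shape of OpenCV's StereoBM sub-pixel formula as best I can reconstruct it; treat the exact rounding, the sign convention and the way SUB_BITS and SUB_BITS_EXTRA map onto the divider width as assumptions, not as a description of the Verilog. The function name is hypothetical.

```python
def subpixel_disparity(best_d, sad_lo, sad_best, sad_hi,
                       sub_bits=4, sub_bits_extra=4):
    """Fixed-point disparity with `sub_bits` fractional bits.

    sad_lo / sad_hi are the SADs at best_d - 1 and best_d + 1. The divider
    produces sub_bits + sub_bits_extra fractional bits (assumed), and the
    extra bits are rounded away at the end.
    """
    frac_bits = sub_bits + sub_bits_extra
    denom = sad_lo + sad_hi - 2 * sad_best + abs(sad_lo - sad_hi)
    diff = sad_lo - sad_hi        # positive: true minimum lies above best_d
    if denom == 0:
        offset = 0
    else:
        magnitude = (abs(diff) << frac_bits) // denom   # truncate toward zero
        offset = magnitude if diff >= 0 else -magnitude
    rounding = (1 << sub_bits_extra) - 1
    return ((best_d << frac_bits) + offset + rounding) >> sub_bits_extra


print(subpixel_disparity(10, 50, 20, 50) / 16)   # symmetric neighbours: 10.0
print(subpixel_disparity(10, 50, 20, 30) / 16)   # pulled toward disparity 11
print(subpixel_disparity(10, 30, 20, 50) / 16)   # pulled toward disparity 9
```

Because sad_best is the minimum, the denominator is always at least twice the numerator's magnitude, so the correction never exceeds half a disparity step in either direction.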
Uniqueness-ratio filtering is simple: if the condition (best_sad + (best_sad * UNIQUE_MUL)/UNIQUE_DIV) < 2nd_best_sad isn't satisfied, then the disparity value is filtered out. For small power-of-2 UNIQUE_MUL values, adders will typically be inferred; for larger values, hardware multipliers are likely to be inferred.
The results of the uniqueness-ratio filtering are merged (OR'ed) with the texture filtering results before being sent to the final pipeline stage.
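The condition and the final merge translate directly into code. In this sketch the function name is made up, integer division stands in for the hardware's divide, and treating UNIQUE_MUL = 0 as "uniqueness checking disabled" is inferred from the no-post-processing example earlier.

```python
def filtered_out(best_sad, second_best_sad, texture_fail=False,
                 unique_mul=1, unique_div=4):
    """True if the pixel's disparity should be marked invalid."""
    if unique_mul == 0:
        unique_fail = False                  # uniqueness check disabled (assumed)
    else:
        threshold = best_sad + (best_sad * unique_mul) // unique_div
        unique_fail = not (threshold < second_best_sad)
    # Uniqueness and texture results are OR'ed together.
    return unique_fail or texture_fail


# With UNIQUE_MUL/UNIQUE_DIV = 1/4, the runner-up must exceed 125% of the best SAD.
print(filtered_out(100, 126))                      # False: unique enough
print(filtered_out(100, 125))                      # True: too close to call
print(filtered_out(100, 200, texture_fail=True))   # True: texture filter wins
```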
dlsc_stereobm_backend used to perform a lot more, before I moved output buffering functions into an external module. These days, the back-end is primarily responsible for generating blank/invalid disparity values for pixels that fall outside the valid region (that is, pixels that will never traverse the pipeline).
So-as to not overwhelm the output, the back-end monitors image data coming from the front-end and only generates invalid disparity values for pixels that have already been sent by the front-end. Valid disparity values can only come from the pipeline; they pass through the back-end without modification.
dlsc_rowbuffer modules are responsible for implementing the asynchronous input and output buffering present in the buffered and pre-filtered wrappers.
At their core, they resemble a standard asynchronous FIFO. But there's a twist: in order to accommodate conversion between single-row I/O and multi-row I/O (MULT_R > 1), they contain additional logic.
dlsc_rowbuffer_splitter takes in a single row at a time, and outputs MULT_R rows in parallel. To accomplish this, the FIFO allows the input to write multiple times to the same entry; only on the final write (to the final row) does the FIFO consider a value "pushed." At that point, the output is able to see the whole set of rows, and can begin consuming them.
dlsc_rowbuffer_combiner works similarly; it takes in MULT_R rows in parallel, and outputs a single row at a time. The FIFO allows the output to peek at a given entry multiple times; only on the final read (from the final row) does the FIFO consider a value "popped". At that point, the input is able to overwrite the old value. The combiner also generates an almost_full feedback signal that is used to throttle the stereo pipeline's output when the buffer approaches full.
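The splitter's "multiple writes, one push" behaviour can be modelled in a few lines. This is a single-clock behavioural sketch with an invented class name; it assumes the buffer holds exactly one set of rows and leaves out popping, wrap-around and the clock-domain crossing entirely.

```python
class RowSplitterModel:
    """Rows go in one at a time; MULT_R rows come out side by side."""

    def __init__(self, width, mult_r):
        self.mult_r = mult_r
        self.entries = [[None] * mult_r for _ in range(width)]
        self.pushed = 0          # number of entries visible to the output side

    def write_row(self, row_index, pixels):
        """Write one input row; row_index counts 0 .. mult_r - 1 within a set."""
        for x, pix in enumerate(pixels):
            self.entries[x][row_index] = pix
            if row_index == self.mult_r - 1:
                self.pushed += 1             # the final write completes the entry

    def read_available(self):
        """Entries the output may consume: one tuple of MULT_R pixels each."""
        return [tuple(entry) for entry in self.entries[:self.pushed]]


fifo = RowSplitterModel(width=4, mult_r=2)
fifo.write_row(0, [1, 2, 3, 4])
after_first_row = fifo.read_available()      # nothing visible yet
fifo.write_row(1, [5, 6, 7, 8])
after_second_row = fifo.read_available()     # both rows visible together
print(after_first_row, after_second_row)
```

The combiner would be the mirror image: entries are read once per row, and only the read of the final row frees the entry for the writer.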
All of the communication between read/write ports of the rowbuffer must cross asynchronous clock domains. This is accomplished with a careful cross-domain handshake that ensures changes are propagated atomically and without risk of metastability (this is implemented in my dlsc_domaincross module). Gray-coded addresses are not used, so-as to simplify control logic and to more readily enable non-power-of-2 buffer depths.
Simulation has played an absolutely critical role in the development and verification of the stereo correspondence core. For a module of this size, it's really the only practical strategy (an iterative write/synthesize/program/check-the-blinky-LEDs-for-correctness/repeat sort of process would have taken a very long time indeed).
Verification has been somewhat of a two step process: the first step was developing a C++ reference model that was functionally similar to the OpenCV implementation (but written for algorithmic clarity, rather than performance) and confirming that it did, in fact, match OpenCV. The second step then used this reference model to verify the actual Verilog implementation.
(the C++ code for the reference model can, if you're curious, be found in
The core was originally written (though not necessarily designed) in a bottom-up fashion, with unit-tests being written in conjunction with new Verilog modules. The top-level tests now serve as the primary testbench for the core.
The tests are just as parameterized as the Verilog modules they verify, so it's easy to have verification coverage for a large number of possible core configurations (indeed: the initial addition of this bulk parameterized verification found a number of failing configuration corner-cases that were subsequently fixed).
SystemC is my testbench language of choice (it's primarily "just" a templated class/macro library that grafts HDL-like semantics onto native C++). Being C++, it's much more testbench-friendly than plain-ol'-Verilog is.
Using SystemC for the testbench (and limiting Verilog code to just synthesizable constructs) opens up the option of using Verilator as the simulator. Verilator "synthesizes" Verilog into optimized C++ code (and optionally wraps that up in a SystemC module), which can be compiled right along with your testbench code to yield a relatively high-performance simulation executable (rivaling that of some very expensive commercial simulators). It's on the order of 100x faster than the leading full-featured open-source Verilog simulator, Icarus Verilog. (Verilator is not full-featured, as it only supports the synthesizable subset of Verilog)
In the current incarnation of my verification environment, everything is handled through a system of Makefiles. To invoke, for example, the regression tests for
stereo/tb and run something like:
make -f dlsc_stereobm_prefiltered_tb.makefile -j8 sims
This launches several simulations in parallel, so the output won't be terribly meaningful until it completes. After completion, a summary can be viewed with:
make -f dlsc_stereobm_prefiltered_tb.makefile summary
(this just greps the resultant log files for the final pass/fail report).
Similarly, you can get accumulated code coverage results using:
make -f dlsc_stereobm_prefiltered_tb.makefile coverage_all
(they'll be well over 95% after a regression run). This is just block/line coverage (a necessary, but not really sufficient indicator of overall verification quality); I have not (thus far) written any functional coverage points for the core.
Most of my testbenches have a 1:1 mapping between .makefile and .sp (the actual SystemC/SystemPerl testbench) files. The top-level tests for the core are not quite this way; instead, a single testbench (dlsc_stereobm_tb.sp) is shared between them.
(there are some older testbenches around that haven't been updated in a while; if a testbench doesn't have a corresponding makefile, I probably haven't gotten around to updating the testbench yet)
Verilog and Post-synthesis Simulation
In addition to the SystemC testbenches, I've also written a simplified pure-Verilog testbench for the pre-filtered core (dlsc_stereobm_tbv.v). Currently, my environment supports Verilog testbenches using the Icarus Verilog simulator. The Makefile-driven invocation is very similar.
If your verification environment doesn't include SystemC, you may find the Verilog testbench to be a better starting point.
Since the C++ reference model can't easily be run from within the Verilog testbench, the testbench relies on an external program to generate stimulus and expected-results files in a format compatible with Verilog's
dlsc_stereobm_models_program.cpp implements an executable wrapper around the C++ model.
The main reason for developing a Verilog testbench was for running post-synthesis and post-place&route simulations. Verilator (and, it turns out, Icarus as well) are unable to handle the complex non-synthesizable constructs that Xilinx necessarily uses in their simulation primitives. Thus, a simpler Verilog testbench is needed for executing simulations through Xilinx's own ISim simulator.
I haven't yet automated the process of invoking ISim, so post-synthesis simulations are currently a very manual process. The limited simulations that I have run indicate that the core works just fine after synthesis by Xilinx's XST tool.
The reasoning behind wanting to run post-synthesis simulations is two-fold: to confirm that the synthesis tools are performing their job correctly, and to confirm that any inferred FPGA hard IP behaves the same as the inferable model (e.g. dual-port RAMs with special read/write conflict-resolution behavior).
Lacking a complete system with which to test in real hardware, post-synthesis simulation is a good way to confirm that a design will work correctly in the real world. And, while you may not be able to run as many test vectors against a slow simulation, simulation offers the added benefit of performing internal checks that hardware cannot (e.g. checking correct block RAM usage).
Like many HDL verification environments, this one ain't entirely easy to set up. Getting it to work under Windows would be extremely difficult; I personally use Ubuntu 10.10 64-bit.
The environment has a variety of esoteric (outside the world of ASICs/FPGAs) tool dependencies (all free and mostly open-source):
- Veripool - home of Wilson Snyder's wholly remarkable collection of open-source verification tools, including a bunch I rely on:
  - Verilator - required for simulating Verilog with SystemC testbenches.
  - SystemPerl - pre-processor needed to convert my SystemPerl testbenches to SystemC, and to provide code coverage and waveform tracing functionality.
  - Verilog-Perl - provides Verilog parsing functionality required by one of my utilities.
- SystemC - required if using my SystemC/SystemPerl testbenches. Somewhat of a hassle to build on 64-bit Ubuntu.
- Icarus Verilog - only required if you want to run simulations with Verilog testbenches.
- OpenCV - needed by my testbenches for loading/processing test images.
There are likely to be other smaller dependencies lurking behind these major ones.
The verification environment also relies on a few environment variables being set for it to function:
- SYSTEMC - set to the root of your SystemC kit.
- SYSTEMC_ARCH - set to the architecture your SystemC library was compiled for (e.g.
- VERILATOR_ROOT - set to the root of your Verilator installation.
- SYSTEMPERL - set to the root of your SystemPerl kit (source tree).
- DLSC_MAKEFILE_TOP - must specify an absolute path to common/mk/dlsc_common_top.makefile within your checkout of the dls_cores repository. It can't have any spaces in it (partly due to Make's inability to cope with them, and compounded by my reliance on generated absolute paths in certain portions of my makefiles).
The core, as it is, is a solid Version 0.9 - maybe a 1.0 Beta (or if you're a fan of date-based versioning: Version 20110605). But there's always more work to be done (and this isn't even my day job!).
First and foremost: a complete system - something that is complete enough to actually run on real hardware - is a major upcoming milestone. I'll be working towards that before I spend a lot more resources improving the correspondence core itself. The image-rectification block is next on my to-do list.
(again, have a look at my previous FPGA Stereo Vision Project post for more information about the overall stereo vision system)
In a somewhat-particular order, here's a list of some possible future work on the stereo correspondence core (excluding work on the rest of the system):
- Automate post-synthesis and post-place&route simulations.
- Make sure all non-trivial modules (not just the bigger ones) have up-to-date unit-tests.
- Simulate and optimize design power consumption (e.g. using Xilinx Power Analyzer).
- Add support for minDisparity (better compatibility with OpenCV).
- Make UNIQUE_MUL parameters be run-time adjustable.
- Reduce throughput cost of texture filtering (avoid wasting most of the parallel pipelines on the texture pass).
- Add support for producing a complete disparity map (get rid of the masked left-margin; this is already mostly supported by my reference model).
- Add support for multiple-baseline stereo vision setups (more than 2 cameras).
- Add support for multi-channel setups (color RGB or YUV instead of just monochrome).
- Implement "block" processing for reduced memory footprint and less bursty interfaces.
- Improve support for Altera devices (perform timing optimizations; add Altera-specific Verilog metacomments).
- Investigate more sophisticated correspondence algorithms (likely outside the scope of this particular block-matching correspondence core).
All of my open-source Verilog FPGA IP cores are available in a Mercurial repository hosted by bitbucket: dls_cores. (at the moment, "all" largely means "this stereo correspondence core"; though there are some work-in-progress blocks lurking there too). You can download a ZIP snapshot of it, as well (that snapshot won't, however, track future updates).
The entire repository (except where otherwise noted) is made freely available under a 3-clause BSD license.
Refer to the file list in the synthesis section for a list of all the relevant parts of the repository.