Eventually, when my FPGA stereo-vision project nears its terminus, I’m going to want to produce a refined sensor board that combines the image sensors and FPGA onto a single board. In preparation for that, this board is a test vehicle to investigate what it takes to design and assemble a compact PCB with multiple BGA packages using only tools and services that are within the reach of a well-equipped hobbyist.
Major features include:
- Spartan-6 FPGA in FT256 package (up to an XC6SLX25)
- 64MB 800Mbps x8 DDR2 SDRAM
- 18 high-speed LVDS pairs for FPGA expansion (across 9 “SATA” connectors)
- 16 low-speed 3.3V signals for FPGA expansion (on a Gadget Factory “Wing” style header)
- 100 MHz oscillator
- ATmega32U2 USB microcontroller (responsible for configuring the FPGA via JTAG)
- 2MB SPI flash for non-volatile bitstream and data storage
- Single 5V supply (onboard regulators for 3.3V/2.5V/1.8V/1.2V rails)
- Single JTAG port selectable between AVR and FPGA
- 85mm x 50mm 4-layer PCB (3.35″ x 1.97″)
Looking at the top two items on that list – an FPGA in a 256-ball 1.0mm BGA package, and a memory device in a 60-ball 0.8mm BGA package – one can easily imagine that assembly is going to be the trickiest part of this project.. but this post isn’t about the assembly of the board, seeing as I’ve only just sent it off to be fabricated (this time by Laen’s 4-layer PCB service). I’ll make a follow-up post once the board is back and assembled.
Instead, this post is entirely about the design and layout of the board.
The star of this board is one of Xilinx’s Spartan-6 FPGAs. It’s quite the step up from their prior Spartan-3 devices (easily justifying the 100% series-number increase). Many excellent features from the higher-end Virtex devices have trickled down to the Spartans, and at least one is entirely new. The three biggest ones in my mind are:
- 800 Mbps x16 DDR memory controllers (2-4 total controllers, depending on device and package): Spartan-3s had to make do with controllers implemented in soft-logic, and were limited to 333 Mbps. Spartan-6s are capable of over twice the performance with zero logic usage.
- 1050 Mbps SerDes blocks with variable input delays: Spartan-3s, again, had to use soft-logic for SerDes – and their LVDS I/Os were limited to lower speeds. Without run-time adjustable input delays, per-bit deskew was essentially impossible for Spartan-3s. Spartan-6s fix all of that, making 1+ Gbps source-synchronous interfaces practical.
- 6-input LUTs: older devices relied on 4-input LUTs as their basic building block. There are a lot more functions that efficiently map to 6-input LUTs, thus improving logic-usage efficiency. For example, a 4:1 mux will fit perfectly into a single 6-input LUT, but it takes three 4-input LUTs to achieve the same functionality. Adder trees can be made denser, too – as each 6-input LUT can implement one bit of a 3-input adder (a 9-input adder can be built with just 4 LUTs per bit, while it would take 8 LUTs per bit in an older device).
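The adder-tree figures in the last bullet can be checked with a quick count. The sketch below is only an illustration: the function name `adders_per_bit` is made up here, and it assumes one LUT per bit per adder (a 3-input adder per 6-input LUT, a 2-input adder per 4-input LUT), ignoring carry logic and bit growth.

```python
def adders_per_bit(operands: int, arity: int) -> int:
    """Count the adders (LUTs per bit, under the assumption above) needed to
    sum `operands` values using adders that each accept up to `arity` inputs."""
    if operands < 1 or arity < 2:
        raise ValueError("need at least one operand and an arity of 2 or more")
    count = 0
    while operands > 1:
        # Group the remaining operands into as many full-width adders as fit.
        full, leftover = divmod(operands, arity)
        if full == 0:
            # Fewer operands than one adder accepts: a single final adder.
            count += 1
            operands = 1
        else:
            count += full
            operands = full + leftover  # adder outputs plus ungrouped operands
    return count


print(adders_per_bit(9, 3))  # 6-input LUT case
print(adders_per_bit(9, 2))  # 4-input LUT case
```

For nine operands this gives 4 adders per bit with 3-input adders and 8 with 2-input adders, matching the numbers quoted above.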
In order to take advantage of the FPGA’s integrated memory controller, the board includes a single 8-bit DDR2 memory device. I had really wanted to use a x16 part, for greater bandwidth – but routing it on a low-spec 4-layer board would have been extremely difficult. x8 was tricky enough.
The 4-layer stack-up used for this board wasn’t conducive to 50-ohm routing, but 75-ohm traces were nearly ideal (0.006″ minimum-width traces work out almost exactly to 75 ohms). The DDR2 spec allows for 75-ohm traces, and both the Spartan-6 FPGA and the DDR2 part support integrated 75-ohm terminations. Even the differential clock was routed with a 150-ohm pair of traces; this isn’t strictly supported, but the 150-ohm termination resistor placed by the DDR2 part should allow it to work just fine.
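As a rough cross-check of the 75-ohm figure, here is a minimal sketch using the IPC-2141 surface-microstrip approximation. The stack-up values (7.8 mil dielectric height, 1.3 mil copper thickness, Er of 4.2) are assumptions borrowed from a reader's comment further down, not from the fab's specification, and the board is not controlled-impedance, so the results are only approximate.

```python
import math


def microstrip_z0(width_mil, height_mil, thickness_mil, er):
    """IPC-2141 surface-microstrip approximation of trace impedance, in ohms."""
    return (87.0 / math.sqrt(er + 1.41)) * math.log(
        5.98 * height_mil / (0.8 * width_mil + thickness_mil)
    )


# Assumed stack-up values (from a reader's comment, not a fab data sheet).
H_MIL, T_MIL, ER = 7.8, 1.3, 4.2

for width in (6.0, 9.449):
    z0 = microstrip_z0(width, H_MIL, T_MIL, ER)
    print(f"{width} mil trace: about {z0:.1f} ohms")
```

With these inputs the 6 mil trace comes out close to 75 ohms, while a 9.449 mil trace comes out near 61 ohms rather than 50, which is the discrepancy raised in the comments.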
The Spartan-6 memory controller user guide is quite specific about trace-length matching. All of the signals between the FPGA and the DDR2 part are matched to within 3mm of each other – hence all of the serpentine traces. It was a tedious process, since EAGLE has no concept of length-matching constraints (it does, at least, bundle a script for measuring/comparing lengths of multiple traces-of-interest).
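The bookkeeping behind that matching can be sketched in a few lines. This is not the script bundled with EAGLE; the function name and the net lengths below are hypothetical, and only the 3mm window comes from the text above.

```python
def serpentine_needed(lengths_mm, tolerance_mm=3.0):
    """For each net, return the extra length (mm) needed to bring it within
    `tolerance_mm` of the longest net in the group."""
    longest = max(lengths_mm.values())
    shortest_allowed = longest - tolerance_mm
    return {
        net: max(0.0, shortest_allowed - length)
        for net, length in lengths_mm.items()
    }


# Hypothetical measured lengths, in millimetres.
measured = {"DQ0": 21.4, "DQ1": 24.9, "DQS": 26.0, "A0": 19.5}

for net, extra in serpentine_needed(measured).items():
    print(f"{net}: add {extra:.1f} mm")
```

Nets already inside the window report zero, so only the short ones need serpentining.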
The memory controller block is really meant to be used with a much higher layer-count PCB. A non-trivial amount of PCB real-estate was used to re-arrange the signals coming from the FPGA so they matched up with the memory device’s pins. Most of the slower address and control signals are transitioned to the bottom layer, where they are re-positioned. They’re brought back up to the top layer before connecting to the DDR2 part.
The faster data signals are routed completely on the top layer (no vias). A continuous ground plane is maintained under all memory signals on both the top and bottom layers, and lots of vias are used to bridge the upper and lower ground planes in the vicinity of signal layer changes (the power plane has a ground island in it, for this purpose).
There wasn’t quite enough room around the FPGA itself to bring out the complete x16 data interface (at least, not without substantially compromising the ground and power planes). Even with just a x8 interface, compromises had to be made; I wasn’t able to connect the DM (data mask) signal to the DDR2 part, so it won’t be possible to write individual bytes to memory (the smallest writable set of data is a 4-byte word, since DDR2 typically operates with burst lengths of 4).
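One common workaround for a missing data mask is a read-modify-write: read the whole burst, change one byte, and write the burst back. The sketch below models that idea in software only. The `BurstMemory` class and `write_byte` helper are hypothetical names for illustration, and nothing here claims to be how this board's logic will actually handle it.

```python
BURST_BYTES = 4  # x8 device with a burst length of 4: 4 bytes per access


class BurstMemory:
    """Toy model of a memory that can only be accessed in whole bursts."""

    def __init__(self, size_bytes):
        if size_bytes % BURST_BYTES:
            raise ValueError("size must be a multiple of the burst size")
        self._data = bytearray(size_bytes)

    def read_burst(self, burst_index):
        start = burst_index * BURST_BYTES
        return bytes(self._data[start:start + BURST_BYTES])

    def write_burst(self, burst_index, data):
        if len(data) != BURST_BYTES:
            raise ValueError("a burst write must supply every byte")
        start = burst_index * BURST_BYTES
        self._data[start:start + BURST_BYTES] = data


def write_byte(mem, address, value):
    """Emulate a single-byte write using a read-modify-write of one burst."""
    burst_index, offset = divmod(address, BURST_BYTES)
    burst = bytearray(mem.read_burst(burst_index))  # read the whole burst
    burst[offset] = value                           # modify one byte
    mem.write_burst(burst_index, bytes(burst))      # write the burst back


mem = BurstMemory(16)
mem.write_burst(1, b"\x11\x22\x33\x44")
write_byte(mem, 6, 0xAA)
print(mem.read_burst(1).hex())
```

The cost is an extra read per partial write, which is why a connected DM signal is normally preferred.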
The memory controller monopolizes the vast majority of I/O Bank 3 on the FPGA.
High-speed LVDS is my preferred method of interconnecting high-bandwidth logic. With my FMC-LPC to SATA adapter board, I standardized on using SATA connectors and cables to provide a cost-effective way of transporting LVDS signals. Thus, SATA connectors are used here.
There are 9 SATA connectors in total, comprising 18 differential pairs (36 signals). 4 of the connectors are connected to Bank 2 of the FPGA, while the remaining 5 are connected to Bank 0 of the FPGA. These two banks are used almost exclusively for LVDS (with the exception of one lone configuration signal).
Since SATA cables are nominally 50-ohm (100-ohm differential) mechanisms, I didn’t have the luxury of using 6-mil 75/150-ohm signals on the PCB. Each LVDS pair is routed as a wider length-matched (more serpentining!) 100-ohm differential pair.
Being the potentially highest-speed signals in the design, all of the LVDS pairs are routed without layer changes (no vias, again), and they’re all referenced to an unbroken ground plane. Avoiding the use of vias keeps the other layers cleaner (especially the ground and power planes), but has the downside of making it impossible to escape-route more than about 2 rows of BGA pins. Thus, many FPGA I/O pins are left unconnected.
In addition to the high-speed LVDS expansion, the board also features 16-bits worth of [relatively] low-speed 3.3V I/O. I’m using a Gadget Factory “Wing” style expansion connector, as seen on the popular Openbench Logic Sniffer and Gadget Factory’s own Butterfly/Papilio One board.
The Wing connector exports 16 I/O lines and 3 different power rails (5V, 3.3V, and 2.5V) on an easy-to-use 0.1″ header. I’ll be using this style connector on any of my future boards that don’t have high speed requirements (as I’ve already done on my NES cartridge adapter board).
Most Wing peripherals tend to be 3.3V devices, so these low-speed lines connect to their own dedicated 3.3V I/O bank on the FPGA (Bank 1). The 45nm Spartan-6s are, however, a bit picky when it comes to 3.3V signaling (their 40nm Virtex-6 siblings don’t even support 3.3V signaling). An IDT Quickswitch is used to prevent minor voltage spikes from making it to the FPGA (and also allows for 5V peripherals to be connected directly to the board without additional translation). By supplying 4.3V to the Quickswitch, overshoots are limited to 3.3V (as outlined in IDT’s 5V/3V interfacing application note).
In addition to the FPGA, there’s also a USB-enabled Atmel AVR microcontroller (an ATmega32U2). Its primary responsibility is configuring the FPGA.
The AVR’s SPI port is connected to the FPGA’s JTAG port, giving it complete control over the configuration of the FPGA. For post-configuration communication, 5 I/Os from the FPGA are connected to the AVR’s USART (which can be used as a UART or an SPI master). The FPGA’s JTAG port could also be used for communication by instantiating a special boundary-scan primitive in the design.
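Whatever firmware ends up driving the JTAG port has to step the FPGA's TAP controller through the standard IEEE 1149.1 state machine by toggling TMS. The sketch below is not the AVR firmware (none is described here); it is a small host-side model of the 16-state TAP, with a hypothetical helper `tms_path` that finds the shortest TMS sequence between two states.

```python
from collections import deque

# state: (next state when TMS=0, next state when TMS=1)
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan": ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR": ("Shift-DR", "Exit1-DR"),
    "Shift-DR": ("Shift-DR", "Exit1-DR"),
    "Exit1-DR": ("Pause-DR", "Update-DR"),
    "Pause-DR": ("Pause-DR", "Exit2-DR"),
    "Exit2-DR": ("Shift-DR", "Update-DR"),
    "Update-DR": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan": ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR": ("Shift-IR", "Exit1-IR"),
    "Shift-IR": ("Shift-IR", "Exit1-IR"),
    "Exit1-IR": ("Pause-IR", "Update-IR"),
    "Pause-IR": ("Pause-IR", "Exit2-IR"),
    "Exit2-IR": ("Shift-IR", "Update-IR"),
    "Update-IR": ("Run-Test/Idle", "Select-DR-Scan"),
}


def walk(state, tms_bits):
    """Apply a sequence of TMS values (one per TCK) and return the end state."""
    for bit in tms_bits:
        state = TAP[state][bit]
    return state


def tms_path(start, goal):
    """Breadth-first search for the shortest TMS sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for bit in (0, 1):
            nxt = TAP[state][bit]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [bit]))
    raise ValueError(f"no path from {start} to {goal}")


print(tms_path("Test-Logic-Reset", "Shift-IR"))
print(tms_path("Test-Logic-Reset", "Shift-DR"))
```

Holding TMS high for five clocks returns the TAP to Test-Logic-Reset from any state, which is the usual way to reach a known starting point before shifting instructions or configuration data.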
Also connected to the AVR’s USART is a 2MB SPI flash device. I had originally planned to use a micro-SD card for non-volatile storage, but the physical size of the holder was prohibitive. The flash device’s main purpose is to store configuration bitstreams for the FPGA. The AVR will read the configuration from the flash and write it to the FPGA at power-up.
While the FPGA is actually capable of natively loading its configuration from a variety of SPI flashes, it requires the connection of several additional signals to the FPGA. I didn’t have any room left to route them without sacrificing some of the board’s expansion capabilities. Having the AVR perform configuration is a small price to pay for improved expandability.
Most of the FPGA’s configuration signals are on a 2.5V domain, so level translation is required for interfacing with the 3.3V AVR. The speed-requirements aren’t especially high, as the AVR is limited to running at 8 MHz. Thus, one of TI’s bidirectional direction-sensing translators was selected (a TXS0108E). Due to the presence of pull-ups on some of the JTAG lines, I had to be especially careful in my translator selection.
Many voltage-level translators (like the similar TXB0108) rely on internal weak-keeper circuits to hold the logic level on either side of the device; if there are any pull-ups/downs on the line, they can override the keepers and cause the translator to incorrectly believe that a device is trying to actively drive the bus (when, in fact, the bus is still being actively driven from the other side). This rarely ends well.
Both the AVR and the FPGA have programming/JTAG interfaces that need to be externally accessible for development purposes. The AVR has a proprietary serial programming interface (accessed by the same pins that comprise its SPI port), while the FPGA has a true JTAG interface. They both use very similar signals (a reset/mode line, a clock, a transmit line, and a receive line), and can both be connected to with a no-frills parallel-port cable.
One interesting consequence of having the AVR’s SPI/programming port connected to the FPGA’s JTAG port is that they can pretty easily share a single JTAG header on the board, which is what I’ve done here. A jumper is provided that disables the level-translator for the FPGA’s JTAG port (and simultaneously enables a driver that hooks the JTAG header’s TMS line to the AVR’s reset line), thus allowing the AVR to be initially programmed. A smaller header is provided to give access to the AVR’s debugWire port (which can only be used after setting some of the AVR’s fuse bits via the conventional programming port).
To connect to the FPGA via JTAG, the aforementioned jumper is removed. This connects the JTAG header to the FPGA via the 3.3V to 2.5V translator chip. The AVR can optionally be held in reset, to prevent contention.
Once suitable AVR firmware is developed, it should be possible to conduct most development work via the AVR’s USB port (as the AVR can both reprogram itself, and configure the FPGA). The only need for the development headers would then be in-circuit-debug (e.g. using GDB with the AVR, or ChipScope Pro with the FPGA).
This board has no fewer than 5 different voltage levels on it:
- 5V – Wing expansion connector and Quickswitch translator
- 3.3V – AVR, flash memory, main oscillator, and Bank 1 of the FPGA (for the Wing I/O)
- 2.5V – FPGA auxiliary logic, and I/O banks 0 and 2 (for LVDS expansion)
- 1.8V – DDR2 memory and Bank 3 of the FPGA (connected to the DDR2 part)
- 1.2V – FPGA core logic
The 5V rail is easy, as it’s supplied externally. Power can either be sourced from the USB port, or from a separate 2-pin power header. A jumper makes the selection.
The 3.3V rail powers relatively slow and low-power logic, so it’s furnished by a simple SOT-223 packaged linear regulator (an MCP1826S).
The 2.5V and 1.2V rails are expected to be the most power-consumptive in typical applications; the 2.5V rail powers all of the high-speed off-board I/O, while the 1.2V rail powers all of the FPGA’s core logic. Thus, both of these rails are supplied by switching regulators. I didn’t want to spend a lot of time designing and laying out a couple of SMPS units (I’ve been there before), so I turned to National Semiconductor’s LMZ10503 – a member of their “simple switcher” series. While expensive, the device integrates all of the major components for an SMPS (including the controller, power switches – and even the inductor). I merely had to supply decoupling capacitors and a feedback network.
The 1.8V rail is only responsible for powering the DDR2 memory device and interface. For this, I’m using another MCP1826S linear regulator to drop the 2.5V rail down a little to 1.8V. While the power consumption of a fully active 800 Mbps DDR memory interface isn’t something to be idly ignored, there shouldn’t be too much of an overall efficiency loss from using a linear here. If I were designing a cell-phone, where efficiency is critical for battery-life, I would reconsider the use of an SMPS for this rail.
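To put a number on that trade-off: an ideal linear regulator passes the load current straight through, so its efficiency is simply the output voltage divided by the input voltage. The sketch below ignores quiescent current, and the 0.25 A load is a made-up figure for illustration, not a measurement from this board.

```python
def linear_regulator(v_in, v_out, i_load):
    """Return (output power, power lost as heat, efficiency) for an ideal
    linear regulator, ignoring its quiescent current."""
    if v_out > v_in:
        raise ValueError("a linear regulator cannot step the voltage up")
    p_out = v_out * i_load
    p_loss = (v_in - v_out) * i_load
    efficiency = v_out / v_in
    return p_out, p_loss, efficiency


# 2.5V in, 1.8V out, with a hypothetical 0.25 A memory-interface load.
p_out, p_loss, efficiency = linear_regulator(2.5, 1.8, 0.25)
print(f"delivered {p_out:.3f} W, lost {p_loss:.3f} W, efficiency {efficiency:.0%}")
```

Dropping from 2.5V rather than from 5V is what keeps the loss modest: the regulator itself is 72% efficient regardless of load, and the dissipated power scales with the 0.7V drop.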
The AVR has control over the enable lines for the 1.2V and 2.5V switchers (and, thus, also controls the 1.8V rail). The 3.3V rail can’t be turned off, since it powers the AVR; nor can the 5V rail, since it powers everything. In order to allow the AVR to completely shut off power to the FPGA and expansion connectors, two small MOSFETs are used to provide switched versions of the 3.3V and 5V rails.
Many of the power rails are routed on the dedicated power plane (layer 3; adjacent to the bottom copper layer). The 1.8V, 2.5V and 3.3V (un-switched and switched) rails are all on the power plane. The power plane also hosts a small ground island for the DDR2 memory signals.
The 1.2V rail is routed exclusively on the bottom layer. It consists of a copper pour that extends under the FPGA, where vias connect it directly to the FPGA.
Quite a lot of attention went into making sure the power and ground planes were as contiguous as possible – especially in the vicinity of the DDR signals. The large vias required by low-cost 4-layer services tend to cause planes to be substantially broken when used in tight groupings (as one does under a BGA).
The EAGLE libraries I used to make this board are available in my eagle-lbr Mercurial repository (or directly download a ZIP file).
The EAGLE schematic and PCB layout for the Spartan-6 BGA test board are in my eagle_spartan6test repository (direct ZIP download).
For those without EAGLE, you can download PDF versions of the schematic and layout.
Again: I’ve only just sent this board out for fabrication. I should be receiving the completed boards in a week or so, but it may be a while before I actually get around to assembling them. Rest assured, however, that I’ll eventually be making a follow-up post about the assembly of the board – no matter how gruesome those BGA packages turn out to be…
I’m really enjoying your blog, which I found the other day while Googling to see if it was possible to toast an FMC connector.
I was planning to design something similar to this board as there are few inexpensive Spartan 6 development boards with high speed differential connectors available at the moment. I was holding out for TQFP Spartan 6 variants to become available, too, so I’m excited to see how you go.
I settled on the Digilent Atlys, which is considerably more expensive than the now-discontinued Avnet $49 Spartan 3A board I started with. It has a rather more obscure and expensive VHDCI connector and a bunch of peripherals I don’t use (HDMI, AC’97, Gigabit Ethernet that I would totally use if I could convince the manufacturer to give me the data sheet). Was the cost to move to 6 layers through someone like PCBCart prohibitive? I’ll be interested to hear what the total cost per board ends up being in small quantities.
My target application uses a 4-channel ADC with ten LVDS outputs and one LVDS clock input so I’d probably need to stick to a mezzanine or VHDCI-esque connector rather than trying to equalise a bunch of SATA cables.
Looking forward to hearing how it all comes together!
Thanks Joel – glad you’re finding it useful!
Yeah, it’s a shame that the selection of TQFP FPGAs is so poor these days – Altera’s Cyclone III devices actually go a lot further than Xilinx’s offerings, density wise (you can get up to 40K LEs/LUTs in a 240-pin PQFP with the EP3C40), but their I/O capabilities are pretty limited (and Altera always seems to have a lot more restrictive/arbitrary I/O banking rules – e.g., you can only do LVDS transmission on the left and right I/O banks). I’ve been tempted to make designs using one of those before.. but, now, the Spartan 6s are just too amazing to pass up.
I didn’t spend a lot of time investigating alternative PCB fabs for this particular board; Laen’s 4 layer PCB service is hard to beat for small one-off designs like this (it was just $65 total for 3 copies of this board). The specs are pretty restrictive (very hard to fully utilize BGAs with), but for what I wanted to achieve with this board (largely an assembly test, with some high-speed routing thrown in), it was adequate. For a more serious final design, I’d certainly be looking at higher-spec 6-layer services (it’d be really nice to have dense vias on 1.0mm centers without them obliterating the power/ground planes!).
Yeah; I wouldn’t want to use SATA cables for this either (plus, I’d imagine that an ADC doesn’t need to be remotely mounted, as I had wanted with my camera boards). Really, I would have preferred something with a few more signals per cable – but there aren’t a lot of great options. Most wide high-speed cables tend to be really bulky or really expensive (or both).
As am I! I’ve received the PCBs now (need to take some photos of them), but I don’t know when I’ll find the time to assemble them..
Ooh, how about Mini-SAS/Molex iPASS? They’re 36-way and Digikey sell the connectors for $5.34, and a one metre cable for $19.08. There are a few other interesting options if you search for Molex iPASS.
Another remarkable post!!
Looking forward to hearing more about it
How much time did you use to design the board, including specification reading and all?
It’s hard to say exactly how long it took to design; it was designed in my spare time over the course of a couple months. Most of the tricky high-speed layout work (FPGA, SATA/LVDS, DDR) was done over the course of about a week, while the remainder was completed more sporadically. So, maybe 2-3 weeks of evenings/weekends in total.
Very cool – did you do any SI simulation on this? What tools did you use?
Sadly, I’m not aware of any free/cheap PCB SI simulators, so I wasn’t able to simulate it. We use those sorts of tools at my place of employment, but I prefer to keep my personal projects entirely in the free/open-source (or, failing that, relatively-affordable) realm, if at all possible.
Lacking such tools, I’ve tried to adhere to Xilinx’s recommended layout practices as much as possible (within reason; many concessions had to be made in using a cheap 4-layer board). Once I get it assembled, I’ll see how much performance I can actually get out of this layout.
“1050 Mbps SerDes blocks with variable input delays: Spartan-3s, again, had to use soft-logic for SerDes – and their LVDS I/Os were limited to lower speeds. Without run-time adjustable input delays, per-bit deskew was essentially impossible for Spartan-3s. Spartan-6s fix all of that, making 1+ Gbps source-synchronous interfaces practical.”
Could you please provide a little more information about it?
Are you using 8:1 SerDes or less? What type of encoding (e.g. 6b8b) is implemented?
Did you organize stream or frame data transfer?
Can you show the source code of these blocks?
For my speed testing (with my FMC-LPC to SATA adapter board on an SP605), I kept things as simple as possible: no encoding, no frames/packets/etc. I used the SerDes blocks in an 8:1 mode.
The Spartan-6’s Phase Detector and IODELAY blocks were used to align the center of the data eye with the sampling clock’s edges.
I originally implemented a source-synchronous clocking scheme (a clock was forwarded with the data), but ran into some regional clock limitations. Rather than adjust the pins I was using, I switched to a system-synchronous clocking scheme (no forwarded clock; data sampled using internally generated clock with same frequency but unknown phase relationship).
Since the Spartan-6 IODELAY blocks require that the clock and data be reasonably well aligned (within ~half of a cycle), I had to apply some phase shifting to the receive clock to get the deserializers to lock. This was manually adjusted for testing, but it could be incorporated into the training sequence (or eliminated if using a source-synchronous clock).
At power-up, a training sequence was performed that used bit-slip to align the received data to an 8-bit boundary. The serializers transmit a known pattern, and the deserializers adjust bit-slip until they see the same known pattern. This requires some form of side-band signal for the deserializers to tell the serializers when they are done training.
After training, the serial link was tested using LFSRs to generate pseudo-random sequences (one on the transmit side, and one on the receive side to verify the received sequence).
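The training and test steps described above can be modelled in software. The following is a minimal sketch and not the unreleased test code: the training word, the LFSR seed, and the LFSR tap set (a commonly cited 16-bit maximal-length choice) are all assumptions, and the model treats bit-slip as simply moving the word boundary by one bit.

```python
WIDTH = 8                    # 8:1 SerDes word width
TRAINING_WORD = 0b00101101   # assumed pattern; its 8 rotations are all distinct


def serialize(words):
    """Flatten words into a list of bits, most significant bit first."""
    return [(w >> (WIDTH - 1 - i)) & 1 for w in words for i in range(WIDTH)]


def deserialize(bits, slips):
    """Regroup bits into words after skipping `slips` bits (models bit-slip)."""
    usable = bits[slips:]
    words = []
    for start in range(0, len(usable) - WIDTH + 1, WIDTH):
        value = 0
        for bit in usable[start:start + WIDTH]:
            value = (value << 1) | bit
        words.append(value)
    return words


def train(bits):
    """Slip one bit at a time until every received word matches the pattern."""
    for slips in range(WIDTH):
        words = deserialize(bits, slips)
        if words and all(w == TRAINING_WORD for w in words):
            return slips
    raise RuntimeError("training pattern not found")


def lfsr_bytes(count, seed=0xACE1):
    """Bytes from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state, out = seed, []
    for _ in range(count):
        for _ in range(8):
            bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
        out.append(state & 0xFF)
    return out


def count_errors(received, seed=0xACE1):
    """Compare received bytes against a locally regenerated LFSR sequence."""
    expected = lfsr_bytes(len(received), seed)
    return sum(r != e for r, e in zip(received, expected))


line = serialize([TRAINING_WORD] * 4)[3:]  # link comes up 3 bits out of phase
print("slips needed:", train(line))
print("errors on a clean link:", count_errors(lfsr_bytes(100)))
```

The receiver regenerates the same pseudo-random sequence from a shared seed, so any mismatch counts as a link error. In hardware, both ends would also need to agree on when the test sequence starts, which is where the side-band signal comes in.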
I’m not prepared to release the code for this just yet. That’ll probably have to wait until I clean it up and put together a full writeup on the process.
Everything you need to know to implement something like this can be found in Xilinx documentation – e.g. the Spartan-6 FPGA SelectIO Resources User Guide (UG381), the Clocking Resources User Guide (UG382), and the Source-Synchronous Serialization and Deserialization application note (XAPP1064).
CoreGen has some support for generating various SerDes macros, as well. The resultant macros are just Verilog, so you can see how Xilinx implements these things.
How do you plan on soldering the FPGA? Are you using stencils even for the BGA?
My plan is to rely on the solder balls already present on the BGAs (no additional solder paste). Darrell Harmon has demonstrated this method on his DSPCARD, so I’m reasonably confident that it’ll work here too. Lead-free parts may be an issue (Darrell used leaded parts).
I’ll be using stencils and solder paste for the rest of the parts on the board.
I would recommend using paste for everything. It seems to work better for me with the lead free parts. I have been using stencils from here: http://www.ohararp.com/Stencils.html
I have also been using the PCB service from Laen. I’m impressed that you were able to route that memory interface. I had quite a challenge routing 4 MGT channels and 4 8 bit SERDES interfaces on an XC6SLX45T.
very nice work!
Yet you gave up functionality for the sake of size, which is too bad (16-bit DDR, SD card init, etc.).
Great job, would be interesting to see your progress so far. I have a question about the microstrip impedances. As you said 6 mil on your board works out to about 75 ohm however the SATA lines in your design don’t compute. I end up with an impedance of 61 ohm (not 50 as you state) using the formulas from IPC2141a as well as IPC2251. I’m using the following values: T=1.3mil, H=7.8mil, W=9.449mil, Er=4.2. Can you please explain how you calculated the 50 ohm impedance?
You may well be correct, Alex! I had originally used a quick online calculator to arrive at that width/spacing (though I don’t recall exactly which one). Using a different one now, I arrive at numbers that agree with yours (..the dangers of being lazy and not double-checking calculations!).
Since the PCB isn’t controlled-dielectric (much less controlled-impedance), these calculations are only approximate, anyway (perhaps I should have qualified those figures with a few more “approximately”s). For a production run, I’d work with the PCB house to get width/spacing figures that they know work well for their process. But for a one-off test board like this, I’m not too concerned about minor mismatches (~113 ohms instead of 100)..
I still haven’t assembled the board yet; the Verilog implementation/verification bits of my stereo vision module have been taking most of my focus lately, leaving little time for tinkering with real hardware..
Is there a reason why you haven’t used terminations on the address lines of the DDR2? Like a few others, I am curious about board performance reports. Does it work as you expected? Thanks.
If you haven’t already written the AVR/JTAG firmware, you might be interested in NeroJTAG (http://www.makestuff.eu/wordpress/?page_id=1046), a no-frills cross-platform multi-device open-source JTAG-over-USB protocol. It has an AVR implementation and should work OK on your ATmega32U2 (you might need to change it to use the correct ports you have chosen for the JTAG lines).
For host-side code to load the FPGA design I suggest FPGALink (http://www.makestuff.eu/wordpress/?page_id=1400). Although it is primarily designed for use with the Cypress FX2LP, it does work (albeit with limited functionality) with the AVR NeroJTAG firmware. Everything is open source, and Windows, Linux and MacOS binaries are available for download.
Very nice job, have you built the board yet? I’m considering designing/building a Spartan 6 board using Eagle and a XC6SLX75T-FGG484 to play with the MGTs. But since finding your blog, I may wait to see your outcome. Not sure if I can do it in 4 layers…
I’ve done a 4-layer spartan6 design on Laen’s service as well and had no problem soldering it. I would not want to go bigger than FTG256 on 4 layers though, I plan to use an FGG484 packaged device in my next board and that will have to be six layers and smaller vias. Probably 5/5 mil design rules.
It is possible with Laen’s service if you don’t need all the IO. The transceiver area is tricky, but I got all 4 routed out.
Hey Dan, I just stumbled on your site. Some time has passed, but no report on the board : did you ever get it assembled ?
I’m tempted to try a board with a Spartan 6, SDRAM, and PCIe (a slimmed-down SP605) and would love to hear more about how your test board worked out.
very interesting, is your board assembled and running? is there more to come?
How did this turn out? I’m curious how the DDR2 interface turned out, as I’m about to attempt the same thing on a four-layer board, at home. Thanks!
This looks amazing! I’ve been wanting to use JTAG to program a Spartan 6 (I’ve been buying modules from Opal Kelly rather than doing the whole thing myself), so I would love to put together a simple ATmega32 and flash board to program the FPGA. Did you manage to find the code for the ATmega32 to do this, or were you planning on writing it yourself? I’m sure I can put this together myself, but if there’s code already available to do that, it would be great to find!
Hope the rest of the bring up is going well
Hi, thanks for the design tips, I’m trying to design a spartan 6 board with 4-lane PCIe and ddr controller but bga makes things slower for me. Is your card running ok?
With the “new” Artix-7 FPGAs, is there any specific reason why you stick with the Spartan-6 for this project?
I am currently looking into using the Artix-7 for a vision project and the extended DSP and logic resources will come handy. :)
I see that you ordered the PCBs from Laen as well, I did impedance measurements on a board from him and it held +/- 11%.
This is not super high speed, but could be worth thinking about. I use PCBCart for my controlled impedance boards nowadays.
Please come back with a report on how well your memory interface is working. :)