Abstract
We present a mechanism-verification study of unsupervised Spiking Neural Network (SNN) learning on a $99 Xilinx Zynq-7020 FPGA. Over 12 RTL iterations in 7 days, we scaled from 8 neurons to 200 LIF neurons processing full 28×28 MNIST digits, achieving 62% classification accuracy with pure on-chip STDP, without any labels, backpropagation, or GPU assistance. The goal was not to maximize accuracy but to prove that STDP self-organization works on real hardware and to expose the failure modes that software simulators hide. We discovered and fixed 11 hardware bugs, including the critical finding that the Diehl & Cook (2015) algorithm fails catastrophically when ported to fixed-point RTL. The project was paused at 62% once the core mechanism was validated, with a clear path to 90%+ through neuron scaling (the BRAM budget is only 28% utilized). This work establishes the verified RTL foundation for the VGO NoyronLink ASIC.
1. Introduction
The neuromorphic computing community has produced thousands of SNN papers, nearly all evaluated in software simulators. Brian2, NEST, BindsNET, and snnTorch run on GPUs with float32 precision, infinite memory, and forgiving timing. The gap between "SNN paper accuracy" and "SNN on real silicon" remains vast and largely unexplored.
Commercial neuromorphic chips like Intel's Loihi 2 ($10,000+) and IBM's TrueNorth remain inaccessible to independent researchers. We asked a simple question: can a single person, with a $99 FPGA board, verify that unsupervised SNN learning actually works on real hardware?
This work serves a dual purpose. First, it validates the fundamental mechanisms (LIF dynamics, Winner-Take-All competition, and STDP synaptic plasticity) in synthesized RTL on real FPGA fabric. Second, it establishes the verified design foundation for the VGO NoyronLink ASIC: a custom 28nm neuromorphic chip targeting millions of neurons at ~28 pJ/spike. Every RTL module verified here becomes a building block of that silicon roadmap.
1.1 Related Work and Positioning
On-chip STDP learning on FPGAs is an active research area. Ali et al. (2021) achieved 93% MNIST accuracy with 384 neurons using reward-modulated STDP on a custom 28nm ASIC, not a commodity FPGA, and with a reward signal guiding learning. The HEENS architecture (2024, Frontiers in Neuroscience) demonstrated SNN emulation with synaptic plasticity on Zynq FPGAs, reporting "comparable performance" without specifying accuracy. NeuroCoreX (Oak Ridge National Laboratory, 2025) provides an open-source FPGA SNN emulator with on-chip STDP on Artix-7, but focuses on the framework rather than benchmark results. Fan & Levy (2024) built an open-source SNN framework for low-end FPGAs achieving ~90% MNIST accuracy.
Our work differs in two fundamental ways. First, we report not just successes but complete failure documentation: 11 hardware bugs, including the discovery that Diehl & Cook's (2015) Weight-Dependent STDP fails catastrophically on fixed-point hardware (v9.0, a -35% accuracy regression). Academic papers rarely publish negative results in hardware SNN research; we document every failure. Second, this is explicitly a mechanism verification, not an accuracy competition. The project was paused at 62% once we confirmed that STDP self-organization, WTA competition, and NCA self-repair all function correctly on silicon; the BRAM budget sits at only 28%, leaving clear headroom to scale to 400+ neurons and 90%+ accuracy in future work.
Our benchmark follows Diehl & Cook (2015), who achieved 82.9% MNIST accuracy with 100 neurons and 95% with 6,400 neurons, in software. With our hardware-constrained 200 neurons and 8-bit fixed-point weights, we target the same algorithmic baseline and report what actually survives the translation to silicon.
2. System Architecture
2.1 Platform
The Xilinx Zynq-7020 (XC7Z020) combines a dual-core ARM Cortex-A9 (PS) with Artix-7 programmable logic (PL) containing 85K logic cells, 140 BRAM blocks (4.9 Mbit), and 220 DSP slices. The ARM PS runs PetaLinux and controls the SNN core via AXI4-Lite registers. The SNN core operates at 50 MHz on the PL fabric.
2.2 SNN Core RTL
The core RTL (vgo_snn_chip_v8.sv, 1,322 lines SystemVerilog) implements a complete SNN processing pipeline as a 20-state FSM:
- PH_INPUT: ARM writes 784 pixels via serial FIFO (25 AXI words)
- PH_INTEGRATE: Each neuron accumulates weighted inputs into membrane potential
- PH_FIRE: Neurons exceeding threshold emit spikes, enter refractory period
- PH_WTA: K=2 Winner-Take-All selects top-2 active neurons per timestep
- PH_LEARN: STDP updates winner synapses (LTP for active inputs, LTD for silent)
- PH_HOMEO: Homeostatic regulation adjusts firing thresholds to prevent monopolization
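The phase sequence above can be sketched as a Python behavioral model. This is an illustrative software model only, not the RTL: the leak, threshold, and STDP constants here are placeholders, and the RTL's exact update order and bit widths are not reproduced.

```python
import numpy as np

def timestep(x, w, v, theta, leak=2, k=2, ltp=15, ltd=2,
             w_min=0, w_max=255, theta_step=1):
    """One pass through the INTEGRATE / FIRE / WTA / LEARN / HOMEO phases.

    x:     binary input vector, shape (784,)
    w:     weight matrix, shape (n_neurons, 784), 8-bit-range integers
    v:     membrane potentials, shape (n_neurons,)
    theta: adaptive firing thresholds, shape (n_neurons,)
    """
    # PH_INTEGRATE: accumulate weighted input, then apply a constant leak
    v = np.maximum(v + w @ x - leak, 0)

    # PH_FIRE: spike wherever the potential crosses the adaptive threshold
    spikes = v >= theta

    # PH_WTA: keep only the top-k spiking neurons by potential
    order = np.argsort(v)[::-1]
    winners = [i for i in order if spikes[i]][:k]

    # PH_LEARN: STDP on winners only (LTP on active inputs, LTD on silent)
    for i in winners:
        w[i] = np.clip(w[i] + np.where(x > 0, ltp, -ltd), w_min, w_max)
        v[i] = 0  # reset after firing

    # PH_HOMEO: raise winners' thresholds to discourage monopolization
    for i in winners:
        theta[i] += theta_step
    return w, v, theta, winners
```

A model like this is useful as a gate test: the RTL's weight readback after one image can be diffed against the model's prediction before any long training run starts.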
2.3 4-Bank True Dual-Port BRAM
The most critical design decision, and the source of our most subtle bug, is the weight storage architecture. 200 neurons × 784 inputs = 156,800 synaptic weights, each 8 bits wide. We partition this across 4 BRAM banks (byte lanes 0-3), where each bank is a Xilinx-compliant True Dual-Port BRAM:
```
wbank0[0:39199] → inputs 0, 4, 8,  12, ... (byte lane 0)
wbank1[0:39199] → inputs 1, 5, 9,  13, ... (byte lane 1)
wbank2[0:39199] → inputs 2, 6, 10, 14, ... (byte lane 2)
wbank3[0:39199] → inputs 3, 7, 11, 15, ... (byte lane 3)

Port A: FSM read/write (STDP learning)
Port B: AXI read/write (ARM inspection / PSD loading)
```
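Assuming the interleaving shown above (the RTL's exact index arithmetic is not reproduced here), the bank/address mapping for a given (neuron, input) synapse can be sketched as:

```python
N_NEURONS = 200
N_INPUTS = 784                     # 28x28 pixels
INPUTS_PER_BANK = N_INPUTS // 4    # 196 inputs per byte lane

def weight_location(neuron, inp):
    """Map a (neuron, input) synapse to (bank, address) under the
    byte-lane interleaving: bank = input mod 4, and each neuron
    occupies a contiguous 196-entry stride within its bank."""
    bank = inp % 4
    addr = neuron * INPUTS_PER_BANK + inp // 4
    return bank, addr

# Each bank holds 200 * 196 = 39,200 bytes (addresses 0..39199), and a
# 4-byte AXI word assembled as {wbank3, wbank2, wbank1, wbank0}[addr]
# packs four consecutive inputs of one neuron.
```

This keeps every bank's Port A and Port B accesses symmetric (1 byte each), which is what lets Vivado infer true dual-port BRAM.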
3. Evolution: 12 Versions, 11 Bugs, 7 Days
Our commit log timestamps every RTL version. The entire journey from hardware bring-up to 62% accuracy took 7 days.
4. Hardware Bugs: Lessons from Silicon
The 11 bugs we encountered are not random defects; they reveal systematic patterns in SNN hardware design that no simulator can expose. We highlight the three most instructive:
Bug #5 โ Vivado BRAM Inference Failure
The FSM writes 1-byte weights via Port A; the AXI bus reads 4-byte words via Port B. Vivado could not infer a true dual-port BRAM from this asymmetric access pattern: it silently created two independent memory copies. STDP learning wrote to copy A; ARM readback read from copy B. All training appeared to produce zero effect. Fix: the 4-Bank BRAM architecture with symmetric per-bank access. This same bug recurred as Bug #11 (v9.0), when named blocks in the LEARN phase broke BRAM inference again.
Bug #7 โ Byte Order Inversion (learn_cfg)
A single endianness mistake: writing 0x45330000 instead of 0x00004533 to the LEARN_CFG register. The RTL maps bits[15:0] to {scale, tax, anti_hebb, div}. The wrong byte order set all competition parameters to zero, effectively disabling all learning. 12/200 neurons monopolized all responses. 0% accuracy. Fix: corrected byte order. A one-line change.
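The failure mode can be reproduced with a few lines of bit arithmetic. The 4-bit field widths below are illustrative assumptions; the text only states that bits[15:0] carry {scale, tax, anti_hebb, div}.

```python
def unpack_learn_cfg(word):
    """Decode the low halfword of LEARN_CFG.
    Field widths (4 bits each) are illustrative, not the RTL's spec."""
    cfg = word & 0xFFFF
    return {
        "scale":     (cfg >> 12) & 0xF,
        "tax":       (cfg >> 8)  & 0xF,
        "anti_hebb": (cfg >> 4)  & 0xF,
        "div":       cfg & 0xF,
    }

good = unpack_learn_cfg(0x00004533)  # payload lands in bits[15:0]
bad  = unpack_learn_cfg(0x45330000)  # same bytes, wrong halfword

# The byte-swapped write leaves every competition parameter at zero:
assert all(v == 0 for v in bad.values())
assert good == {"scale": 4, "tax": 5, "anti_hebb": 3, "div": 3}
```

A register-decode unit test like this on the driver side would have caught the bug before any FPGA run.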
Bug #9 โ Sparse Input LTP/LTD Imbalance
With 784 inputs and ~150 active per image, the original LTP=8/LTD=3 created a fatal asymmetry: 150×8 = 1,200 total potentiation vs. 634×3 = 1,902 total depression. Because total LTD exceeded total LTP, all weights collapsed to w_min. This is a fundamental property of sparse binary inputs that does not manifest in float32 simulators. Fix: LTP=15, LTD=2, giving 2,250 > 1,268. Per Diehl & Cook (2015): LTP >> LTD for sparse codes.
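The imbalance is pure arithmetic and can be checked in a few lines (150 active inputs per image is the figure given above):

```python
def stdp_budget(n_inputs=784, n_active=150, ltp=8, ltd=3):
    """Total potentiation vs. total depression applied to one
    winner's weight vector after a single image."""
    potentiation = n_active * ltp              # active inputs: +LTP each
    depression = (n_inputs - n_active) * ltd   # silent inputs: -LTD each
    return potentiation, depression

# Original parameters: net drift is negative, so weights sink to w_min.
assert stdp_budget(ltp=8, ltd=3) == (1200, 1902)

# Fixed parameters: LTP >> LTD restores a positive net budget.
assert stdp_budget(ltp=15, ltd=2) == (2250, 1268)
```

The general condition for sparse binary codes is simply n_active × LTP > (n_inputs - n_active) × LTD, which any candidate parameter set can be screened against before synthesis.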
5. Results
| Version | Neurons | Inputs | Storage | Accuracy | Key Achievement |
|---|---|---|---|---|---|
| v4.3 | 8 | 64 | LUT-RAM | n/a | 6/8 neurons active, 2 patterns learned |
| v5.6 | 10 | 64 | LUT-RAM | 28.6% | First MNIST accuracy measurement |
| v6.2 | 100 | 64 | 4-Bank BRAM | 21.8% | 100-neuron STDP differentiation |
| v7.2 | 200 | 784 | BRAM | 23.6% | Full MNIST, "Resurrection" recovery |
| v8.2 | 200 | 784 | BRAM | 62% | Stable. Specialized neurons. No oscillation. |
| v9.2 | 200 | 784 | BRAM | 26.7% | WD-STDP failed: N85 monopoly |
| PSD-A (GPU) | 200 | 784 | n/a | 81.2% | Gradient ceiling (same architecture, Int8) |
The STDP result achieves 76.4% of the gradient-trained ceiling (62% vs. 81.2%), remarkable given that STDP uses zero labels, zero backpropagation, and 8-bit fixed-point arithmetic hardwired in FPGA fabric. Per-digit analysis reveals specialization: digit "1" (vertical strokes) reaches 43% accuracy while digit "0" (round shapes) remains challenging at 7%, consistent with how local STDP learns edge-like features more easily than distributed circular patterns.
6. Key Engineering Findings
1. BRAM inference is fragile. Vivado's BRAM inference silently fails when always blocks contain named begin/end blocks with local reg declarations, or when Port A and Port B have asymmetric data widths. This creates invisible duplicate memories: the most dangerous class of hardware bug, because all functional tests pass.
2. Sparse inputs require LTP >> LTD. With 784 binary inputs and ~19% activation density, the inactive inputs applying LTD outnumber active inputs applying LTP by 4:1. Any balanced LTP/LTD ratio leads to weight collapse. This is a mathematical inevitability that software simulators mask with float32 precision.
3. Homeostasis is non-optional. Without adaptive threshold adjustment, Winner-Take-All degenerates into "Winner-Take-Everything": a single neuron monopolizes all responses (v9.2: neuron #85 won 30% of all test images). The homeostasis target must scale with neuron count.
4. The gap between paper and silicon is real. We implemented the same Diehl & Cook (2015) algorithm in v9.0, and it failed catastrophically on hardware: three regression bugs and 8.5 hours of wasted training. The lesson: every RTL change must pass an STDP gate test before training begins.
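Finding 3 can be illustrated with a minimal homeostasis loop. The update rule and constants below are illustrative, not the RTL's: the point is only that thresholds must track each neuron's share of wins.

```python
def update_thresholds(theta, win_counts, total_wins, n_neurons,
                      step=1.0, floor=10.0):
    """Nudge each firing threshold toward an equal share of WTA wins.

    Neurons winning more than their fair share (total_wins / n_neurons)
    get a higher threshold; under-active neurons get a lower one,
    clamped at a floor so starved neurons can recover.
    """
    target = total_wins / n_neurons
    out = []
    for th, wins in zip(theta, win_counts):
        if wins > target:
            th += step
        elif wins < target:
            th = max(floor, th - step)
        out.append(th)
    return out

# A monopolist's threshold rises while the starved neurons' fall,
# which is exactly the pressure that breaks the v9.2-style monopoly:
theta = [100.0, 100.0, 100.0, 100.0]
new = update_thresholds(theta, [30, 0, 0, 0], total_wins=30, n_neurons=4)
assert new[0] > theta[0] and all(t < 100.0 for t in new[1:])
```

Note that the target is total_wins / n_neurons, which is why (as Finding 3 states) the homeostasis target must scale with neuron count.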
7. From FPGA to VGO NoyronLink ASIC
This FPGA verification serves as the precursor to the VGO NoyronLink ASIC: a custom neuromorphic chip designed to scale from hundreds to millions of neurons. Every RTL module verified on the Zynq-7020 becomes a proven building block for the ASIC tape-out.
| Stage | Platform | Neurons | Energy/Spike | Status |
|---|---|---|---|---|
| iCE40 Proof | Lattice iCE40 HX8K | 8 | ~28 pJ | Done: bitstream generated (132 KB) |
| FPGA Full | Zynq-7020 | 200 | ~50 pJ | Done: 62% MNIST (this work) |
| FPGA Scale | Zynq-7020 | 400 | ~50 pJ | Planned: v11.0 (49% BRAM, target 90%+) |
| NoyronLink ASIC | 28nm Custom | 1M+ | <10 pJ | Planned: design phase (RTL verified) |
The ASIC design leverages VGO's proprietary AER (Address-Event Representation) spike bus, achieving 429× bandwidth compression compared to dense matrix operations. At 28nm, the NoyronLink SoC targets under 10 pJ/spike and millions of neurons, making it a potential replacement for GPU-based LLM inference at a fraction of the power budget.
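The core AER idea (independent of VGO's proprietary bus format, which is not described here) is to transmit only the addresses of neurons that spiked, so bandwidth scales with spike count rather than population size:

```python
def aer_encode(spike_frame):
    """Encode a binary spike frame as a list of active addresses."""
    return [addr for addr, s in enumerate(spike_frame) if s]

def aer_decode(events, n):
    """Rebuild the dense binary frame from its address events."""
    frame = [0] * n
    for addr in events:
        frame[addr] = 1
    return frame

# A 784-input frame with 150 active pixels becomes 150 address events
# instead of 784 dense values; for sparser traffic the ratio grows.
frame = [1 if i < 150 else 0 for i in range(784)]
events = aer_encode(frame)
assert len(events) == 150
assert aer_decode(events, 784) == frame
```

This sketch is lossless for binary frames; a real bus additionally packs timestamps and arbitrates concurrent events, which this model omits.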
The critical insight: you don't need to simulate a brain to build one. By verifying each mechanism on real hardware (LIF dynamics, STDP plasticity, WTA competition, NCA self-repair), we build confidence that the ASIC will work on first silicon. The FPGA is our risk-reduction engine.
8. Conclusion
This work asked a narrow question: does STDP self-organization actually work on real, $99 FPGA hardware? It answered that question affirmatively. The 62% MNIST accuracy with pure on-chip STDP, while below state-of-the-art software benchmarks, confirmed that LIF neurons self-specialize, WTA competition prevents collapse, and NCA self-repair maintains population health, all on 8-bit fixed-point silicon with zero labels and zero GPU involvement. Equally important, we documented what doesn't work: the Diehl & Cook (2015) WD-STDP algorithm fails catastrophically when ported to fixed-point RTL, producing a -35% accuracy regression (v9.0). This negative result, rarely reported in the literature, is itself a contribution.
The 11 hardware bugs exposed during this verification, particularly the Vivado BRAM inference failure and the sparse-input LTP/LTD imbalance, represent engineering knowledge that no software simulator can provide. These are not theoretical edge cases; they are systemic traps that any team building neuromorphic hardware will encounter.
8.1 Limitations and Future Work
We deliberately paused this project at 62% once the core mechanism was validated, shifting resources to the NoyronLink ASIC design phase. The current design uses only 28% of available BRAM, leaving clear headroom to scale from 200 to 400+ neurons โ which, following the Diehl & Cook (2015) scaling curve, should approach 90%+ accuracy with pure STDP. We also did not implement reward-modulated STDP (R-STDP), which has demonstrated 93% accuracy on custom ASICs (Ali et al., 2021). Real-time power measurement was not performed; energy figures are estimates from Vivado reports. These are engineering decisions, not fundamental limitations.
The verified RTL modules (LIF neurons, 4-Bank BRAM, WTA competition, STDP learning, NCA self-repair) form the proven core of the VGO NoyronLink ASIC. From a $99 development board to a custom neuromorphic chip: the path from software to silicon is validated, and the bugs that would have cost months on an ASIC were found for the price of a development board.