Abstract
We present a mechanism-verification study of unsupervised Spiking Neural Network (SNN) learning on a $99 Xilinx Zynq-7020 FPGA. Over 12 RTL iterations in 7 days, we scaled from 8 neurons to 200 LIF neurons processing full 28×28 MNIST digits, achieving 62% classification accuracy with pure on-chip STDP, without any labels, backpropagation, or GPU assistance. The goal was not to maximize accuracy but to prove that STDP self-organization works on real hardware and to expose the failure modes that software simulators hide. We discovered and fixed 11 hardware bugs, including the critical finding that the Diehl & Cook (2015) algorithm fails catastrophically when ported to fixed-point RTL. The project was paused at 62% once the core mechanism was validated, with a clear path to 90%+ through neuron scaling (the BRAM budget is only 28% utilized). This work establishes the verified RTL foundation for the VGO NoyronLink ASIC.
1. Introduction
The neuromorphic computing community has produced thousands of SNN papers, nearly all evaluated in software simulators. Brian2, NEST, BindsNET, and snnTorch run on GPUs with float32 precision, infinite memory, and forgiving timing. The gap between "SNN paper accuracy" and "SNN on real silicon" remains vast and largely unexplored.
Commercial neuromorphic chips like Intel's Loihi 2 ($10,000+) and IBM's TrueNorth remain inaccessible to independent researchers. We asked a simple question: can a single person, with a $99 FPGA board, verify that unsupervised SNN learning actually works on real hardware?
This work serves a dual purpose. First, it validates the fundamental mechanisms (LIF dynamics, Winner-Take-All competition, and STDP synaptic plasticity) in synthesized RTL on real FPGA fabric. Second, it establishes the verified design foundation for the VGO NoyronLink ASIC: a custom 28nm neuromorphic chip targeting millions of neurons at ~28 pJ/spike. Every RTL module verified here becomes a building block of that silicon roadmap.
1.1 Related Work and Positioning
On-chip STDP learning on FPGAs is an active research area. Ali et al. (2021) achieved 93% MNIST accuracy with 384 neurons using reward-modulated STDP on a custom 28nm ASIC, not a commodity FPGA, and with a reward signal guiding learning. The HEENS architecture (2024, Frontiers in Neuroscience) demonstrated SNN emulation with synaptic plasticity on Zynq FPGAs, reporting "comparable performance" without specifying accuracy. NeuroCoreX (Oak Ridge National Laboratory, 2025) provides an open-source FPGA SNN emulator with on-chip STDP on Artix-7, but focuses on the framework rather than benchmark results. Fan & Levy (2024) built an open-source SNN framework for low-end FPGAs achieving ~90% MNIST accuracy.
Our work differs in two fundamental ways. First, we report not just successes but complete failure documentation: 11 hardware bugs, including the discovery that Diehl & Cook's (2015) Weight-Dependent STDP fails catastrophically on fixed-point hardware (v9.0, a -35% accuracy regression). Academic papers rarely publish negative results in hardware SNN research; we document every failure. Second, this is explicitly a mechanism verification, not an accuracy competition. The project was paused at 62% once we confirmed that STDP self-organization, WTA competition, and NCA self-repair all function correctly on silicon; the BRAM budget sits at only 28%, leaving clear headroom to scale to 400+ neurons and 90%+ accuracy in future work.
Our benchmark follows Diehl & Cook (2015), who achieved 82.9% MNIST accuracy with 100 neurons and 95% with 6,400 neurons, in software. With our hardware-constrained 200 neurons and 8-bit fixed-point weights, we target the same algorithmic baseline and report what actually survives the translation to silicon.
2. System Architecture
2.1 Platform
The Xilinx Zynq-7020 (XC7Z020) combines a dual-core ARM Cortex-A9 (PS) with Artix-7 programmable logic (PL) containing 85K logic cells, 140 BRAM blocks (4.9 Mbit), and 220 DSP slices. The ARM PS runs PetaLinux and controls the SNN core via AXI4-Lite registers. The SNN core operates at 50 MHz on the PL fabric.
2.2 SNN Core RTL
The core RTL (vgo_snn_chip_v8.sv, 1,322 lines SystemVerilog) implements a complete SNN processing pipeline as a 20-state FSM:
- PH_INPUT: ARM writes 784 pixels via serial FIFO (25 AXI words)
- PH_INTEGRATE: Each neuron accumulates weighted inputs into membrane potential
- PH_FIRE: Neurons exceeding threshold emit spikes, enter refractory period
- PH_WTA: K=2 Winner-Take-All selects top-2 active neurons per timestep
- PH_LEARN: STDP updates winner synapses (LTP for active inputs, LTD for silent)
- PH_HOMEO: Homeostatic regulation adjusts firing thresholds to prevent monopolization
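The phase sequence above can be sketched as a Python behavioral model. This is an illustrative software model only, not the RTL: the leak, threshold, and STDP constants here are placeholders, and the RTL's exact update order and bit widths are not reproduced.

```python
import numpy as np

def timestep(x, w, v, theta, leak=2, k=2, ltp=15, ltd=2,
             w_min=0, w_max=255, theta_step=1):
    """One pass through the INTEGRATE / FIRE / WTA / LEARN / HOMEO phases.

    x:     binary input vector, shape (784,)
    w:     weight matrix, shape (n_neurons, 784), 8-bit-range integers
    v:     membrane potentials, shape (n_neurons,)
    theta: adaptive firing thresholds, shape (n_neurons,)
    """
    # PH_INTEGRATE: accumulate weighted input, then apply a constant leak
    v = np.maximum(v + w @ x - leak, 0)

    # PH_FIRE: spike wherever the potential crosses the adaptive threshold
    spikes = v >= theta

    # PH_WTA: keep only the top-k spiking neurons by potential
    order = np.argsort(v)[::-1]
    winners = [i for i in order if spikes[i]][:k]

    # PH_LEARN: STDP on winners only (LTP on active inputs, LTD on silent)
    for i in winners:
        w[i] = np.clip(w[i] + np.where(x > 0, ltp, -ltd), w_min, w_max)
        v[i] = 0  # reset after firing

    # PH_HOMEO: raise winners' thresholds to discourage monopolization
    for i in winners:
        theta[i] += theta_step
    return w, v, theta, winners
```

A model like this is useful as a gate test: the RTL's weight readback after one image can be diffed against the model's prediction before any long training run starts.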
2.3 4-Bank True Dual-Port BRAM
The most critical design decision, and the source of our most subtle bug, is the weight storage architecture. 200 neurons × 784 inputs = 156,800 synaptic weights, each 8 bits wide. We partition this across 4 BRAM banks (byte lanes 0-3), where each bank is a Xilinx-compliant True Dual-Port BRAM:
```
wbank0[0:39199] → inputs 0, 4, 8,  12, ... (byte lane 0)
wbank1[0:39199] → inputs 1, 5, 9,  13, ... (byte lane 1)
wbank2[0:39199] → inputs 2, 6, 10, 14, ... (byte lane 2)
wbank3[0:39199] → inputs 3, 7, 11, 15, ... (byte lane 3)

Port A: FSM read/write (STDP learning)
Port B: AXI read/write (ARM inspection / PSD loading)
```
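Assuming the interleaving shown above (the RTL's exact index arithmetic is not reproduced here), the bank/address mapping for a given (neuron, input) synapse can be sketched as:

```python
N_NEURONS = 200
N_INPUTS = 784                     # 28x28 pixels
INPUTS_PER_BANK = N_INPUTS // 4    # 196 inputs per byte lane

def weight_location(neuron, inp):
    """Map a (neuron, input) synapse to (bank, address) under the
    byte-lane interleaving: bank = input mod 4, and each neuron
    occupies a contiguous 196-entry stride within its bank."""
    bank = inp % 4
    addr = neuron * INPUTS_PER_BANK + inp // 4
    return bank, addr

# Each bank holds 200 * 196 = 39,200 bytes (addresses 0..39199), and a
# 4-byte AXI word assembled as {wbank3, wbank2, wbank1, wbank0}[addr]
# packs four consecutive inputs of one neuron.
```

This keeps every bank's Port A and Port B accesses symmetric (1 byte each), which is what lets Vivado infer true dual-port BRAM.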
3. Evolution: 12 Versions, 11 Bugs, 7 Days
Our commit log timestamps every RTL version. The entire journey from hardware bring-up to 62% accuracy took 7 days.
4. Hardware Bugs: Lessons from Silicon
The 11 bugs we encountered are not random defects; they reveal systematic patterns in SNN hardware design that no simulator can expose. We highlight the three most instructive:
Bug #5 โ Vivado BRAM Inference Failure
The FSM writes 1-byte weights via Port A; the AXI bus reads 4-byte words via Port B. Vivado could not infer a true dual-port BRAM from this asymmetric access pattern: it silently created two independent memory copies. STDP learning wrote to copy A; ARM readback read from copy B. All training appeared to produce zero effect. Fix: the 4-Bank BRAM architecture with symmetric per-bank access. This same bug recurred as Bug #11 (v9.0), when named blocks in the LEARN phase broke BRAM inference again.
Bug #7 โ Byte Order Inversion (learn_cfg)
A single endianness mistake: writing 0x45330000 instead of 0x00004533 to the LEARN_CFG register. The RTL maps bits[15:0] to {scale, tax, anti_hebb, div}. The wrong byte order set all competition parameters to zero, effectively disabling all learning. 12/200 neurons monopolized all responses. 0% accuracy. Fix: corrected byte order. A one-line change.
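The failure mode can be reproduced with a few lines of bit arithmetic. The 4-bit field widths below are illustrative assumptions; the text only states that bits[15:0] carry {scale, tax, anti_hebb, div}.

```python
def unpack_learn_cfg(word):
    """Decode the low halfword of LEARN_CFG.
    Field widths (4 bits each) are illustrative, not the RTL's spec."""
    cfg = word & 0xFFFF
    return {
        "scale":     (cfg >> 12) & 0xF,
        "tax":       (cfg >> 8)  & 0xF,
        "anti_hebb": (cfg >> 4)  & 0xF,
        "div":       cfg & 0xF,
    }

good = unpack_learn_cfg(0x00004533)  # payload lands in bits[15:0]
bad  = unpack_learn_cfg(0x45330000)  # same bytes, wrong halfword

# The byte-swapped write leaves every competition parameter at zero:
assert all(v == 0 for v in bad.values())
assert good == {"scale": 4, "tax": 5, "anti_hebb": 3, "div": 3}
```

A register-decode unit test like this on the driver side would have caught the bug before any FPGA run.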
Bug #9 โ Sparse Input LTP/LTD Imbalance
With 784 inputs and ~150 active per image, the original LTP=8/LTD=3 created a fatal asymmetry: 150×8 = 1,200 total potentiation vs. 634×3 = 1,902 total depression. Because total LTD exceeded total LTP, all weights collapsed to w_min. This is a fundamental property of sparse binary inputs that does not manifest in float32 simulators. Fix: LTP=15, LTD=2, giving 2,250 > 1,268. Per Diehl & Cook (2015): LTP >> LTD for sparse codes.
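The imbalance is pure arithmetic and can be checked in a few lines (150 active inputs per image is the figure given above):

```python
def stdp_budget(n_inputs=784, n_active=150, ltp=8, ltd=3):
    """Total potentiation vs. total depression applied to one
    winner's weight vector after a single image."""
    potentiation = n_active * ltp              # active inputs: +LTP each
    depression = (n_inputs - n_active) * ltd   # silent inputs: -LTD each
    return potentiation, depression

# Original parameters: net drift is negative, so weights sink to w_min.
assert stdp_budget(ltp=8, ltd=3) == (1200, 1902)

# Fixed parameters: LTP >> LTD restores a positive net budget.
assert stdp_budget(ltp=15, ltd=2) == (2250, 1268)
```

The general condition for sparse binary codes is simply n_active × LTP > (n_inputs - n_active) × LTD, which any candidate parameter set can be screened against before synthesis.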
5. Results
| Version | Neurons | Inputs | Storage | Accuracy | Key Achievement |
|---|---|---|---|---|---|
| v4.3 | 8 | 64 | LUT-RAM | n/a | 6/8 neurons active, 2 patterns learned |
| v5.6 | 10 | 64 | LUT-RAM | 28.6% | First MNIST accuracy measurement |
| v6.2 | 100 | 64 | 4-Bank BRAM | 21.8% | 100-neuron STDP differentiation |
| v7.2 | 200 | 784 | BRAM | 23.6% | Full MNIST, "Resurrection" recovery |
| v8.2 | 200 | 784 | BRAM | 62% | Stable. Specialized neurons. No oscillation. |
| v9.2 | 200 | 784 | BRAM | 26.7% | WD-STDP failed: N85 monopoly |
| PSD-A (GPU) | 200 | 784 | n/a | 81.2% | Gradient ceiling (same architecture, Int8) |
The STDP result achieves 76.4% of the gradient-trained ceiling (62% vs. 81.2%), remarkable given that STDP uses zero labels, zero backpropagation, and 8-bit fixed-point arithmetic hardwired in FPGA fabric. Per-digit analysis reveals specialization: digit "1" (vertical strokes) reaches 43% accuracy while digit "0" (round shapes) remains challenging at 7%, consistent with how local STDP learns edge-like features more easily than distributed circular patterns.
6. Key Engineering Findings
1. BRAM inference is fragile. Vivado's BRAM inference silently fails when always blocks contain named begin/end blocks with local reg declarations, or when Port A and Port B have asymmetric data widths. This creates invisible duplicate memories: the most dangerous class of hardware bug, because all functional tests pass.
2. Sparse inputs require LTP >> LTD. With 784 binary inputs and ~19% activation density, the inactive inputs applying LTD outnumber active inputs applying LTP by 4:1. Any balanced LTP/LTD ratio leads to weight collapse. This is a mathematical inevitability that software simulators mask with float32 precision.
3. Homeostasis is non-optional. Without adaptive threshold adjustment, Winner-Take-All degenerates into "Winner-Take-Everything": a single neuron monopolizes all responses (v9.2: neuron #85 won 30% of all test images). The homeostasis target must scale with neuron count.
4. The gap between paper and silicon is real. We implemented the same Diehl & Cook (2015) algorithm in v9.0, and it failed catastrophically on hardware: three regression bugs and 8.5 hours of wasted training. The lesson: every RTL change must pass an STDP gate test before training begins.
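Finding 3 can be illustrated with a minimal homeostasis loop. The update rule and constants below are illustrative, not the RTL's: the point is only that thresholds must track each neuron's share of wins.

```python
def update_thresholds(theta, win_counts, total_wins, n_neurons,
                      step=1.0, floor=10.0):
    """Nudge each firing threshold toward an equal share of WTA wins.

    Neurons winning more than their fair share (total_wins / n_neurons)
    get a higher threshold; under-active neurons get a lower one,
    clamped at a floor so starved neurons can recover.
    """
    target = total_wins / n_neurons
    out = []
    for th, wins in zip(theta, win_counts):
        if wins > target:
            th += step
        elif wins < target:
            th = max(floor, th - step)
        out.append(th)
    return out

# A monopolist's threshold rises while the starved neurons' fall,
# which is exactly the pressure that breaks the v9.2-style monopoly:
theta = [100.0, 100.0, 100.0, 100.0]
new = update_thresholds(theta, [30, 0, 0, 0], total_wins=30, n_neurons=4)
assert new[0] > theta[0] and all(t < 100.0 for t in new[1:])
```

Note that the target is total_wins / n_neurons, which is why (as Finding 3 states) the homeostasis target must scale with neuron count.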
7. From FPGA to VGO NoyronLink ASIC
This FPGA verification serves as the precursor to the VGO NoyronLink ASIC: a custom neuromorphic chip designed to scale from hundreds to millions of neurons. Every RTL module verified on the Zynq-7020 becomes a proven building block for the ASIC tape-out.
| Stage | Platform | Neurons | Energy/Spike | Status |
|---|---|---|---|---|
| iCE40 Proof | Lattice iCE40 HX8K | 8 | ~28 pJ | Done: bitstream generated (132 KB) |
| FPGA Full | Zynq-7020 | 200 | ~50 pJ | Done: 62% MNIST (this work) |
| FPGA Scale | Zynq-7020 | 400 | ~50 pJ | Planned: v11.0 (49% BRAM, target 90%+) |
| NoyronLink ASIC | 28nm Custom | 1M+ | <10 pJ | Planned: design phase (RTL verified) |
The ASIC design leverages VGO's proprietary AER (Address-Event Representation) spike bus, achieving 429× bandwidth compression compared to dense matrix operations. At 28nm, the NoyronLink SoC targets under 10 pJ/spike and millions of neurons, making it a potential replacement for GPU-based LLM inference at a fraction of the power budget.
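The core AER idea (independent of VGO's proprietary bus format, which is not described here) is to transmit only the addresses of neurons that spiked, so bandwidth scales with spike count rather than population size:

```python
def aer_encode(spike_frame):
    """Encode a binary spike frame as a list of active addresses."""
    return [addr for addr, s in enumerate(spike_frame) if s]

def aer_decode(events, n):
    """Rebuild the dense binary frame from its address events."""
    frame = [0] * n
    for addr in events:
        frame[addr] = 1
    return frame

# A 784-input frame with 150 active pixels becomes 150 address events
# instead of 784 dense values; for sparser traffic the ratio grows.
frame = [1 if i < 150 else 0 for i in range(784)]
events = aer_encode(frame)
assert len(events) == 150
assert aer_decode(events, 784) == frame
```

This sketch is lossless for binary frames; a real bus additionally packs timestamps and arbitrates concurrent events, which this model omits.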
The critical insight: you don't need to simulate a brain to build one. By verifying each mechanism on real hardware (LIF dynamics, STDP plasticity, WTA competition, NCA self-repair), we build confidence that the ASIC will work on first silicon. The FPGA is our risk-reduction engine.
8. Conclusion
This work asked a narrow question: does STDP self-organization actually work on real, $99 FPGA hardware? It answered that question affirmatively. The 62% MNIST accuracy with pure on-chip STDP, while below state-of-the-art software benchmarks, confirmed that LIF neurons self-specialize, WTA competition prevents collapse, and NCA self-repair maintains population health, all on 8-bit fixed-point silicon with zero labels and zero GPU involvement. Equally important, we documented what doesn't work: the Diehl & Cook (2015) WD-STDP algorithm fails catastrophically when ported to fixed-point RTL, producing a -35% accuracy regression (v9.0). This negative result, rarely reported in the literature, is itself a contribution.
The 11 hardware bugs exposed during this verification, particularly the Vivado BRAM inference failure and the sparse-input LTP/LTD imbalance, represent engineering knowledge that no software simulator can provide. These are not theoretical edge cases; they are systemic traps that any team building neuromorphic hardware will encounter.
8.1 Limitations and Future Work
We deliberately paused this project at 62% once the core mechanism was validated, shifting resources to the NoyronLink ASIC design phase. The current design uses only 28% of available BRAM, leaving clear headroom to scale from 200 to 400+ neurons โ which, following the Diehl & Cook (2015) scaling curve, should approach 90%+ accuracy with pure STDP. We also did not implement reward-modulated STDP (R-STDP), which has demonstrated 93% accuracy on custom ASICs (Ali et al., 2021). Real-time power measurement was not performed; energy figures are estimates from Vivado reports. These are engineering decisions, not fundamental limitations.
The verified RTL modules (LIF neurons, 4-Bank BRAM, WTA competition, STDP learning, NCA self-repair) form the proven core of the VGO NoyronLink ASIC. From a $99 development board to a custom neuromorphic chip: the path from software to silicon is validated, and the bugs that would have cost months on an ASIC were found for the price of a development board.