AI Research

VGO: A Multimodal Spiking Neural Network
Cognitive Architecture on a Single Consumer GPU

Mar 12, 2026 ยท 25 min read ยท by Robert Hu โ€” OPEN-Yongnian

Abstract. We present VGO, a complete multimodal cognitive system implemented entirely in Spiking Neural Networks (SNN) running on a single consumer-grade GPU (NVIDIA RTX 3090, 24 GB). VGO integrates 39 SNN modules spanning auditory perception (CER = 8.1%), visual recognition (80-class COCO object detection, 7-class emotion), speech synthesis, episodic memory (50,000 entries), a 7-stage cognitive reasoning pipeline, curiosity-driven active learning, and embodied motor control via MuJoCo simulation. The entire system occupies 535 MB of VRAM (2.2% of available memory) and achieves end-to-end inference energy of 100โ€“500 picojoules per sample โ€” approximately 1010 times more efficient than frontier LLMs. VGO was developed by a single engineer in 30 days. We argue that SNN-native multimodal cognition represents a viable, radically more efficient alternative to the Large Language Model paradigm.

Development Timeline: 30 Days, 1 Person, 1 GPU

From zero to a 39-module multimodal cognitive system โ€” every milestone on a single RTX 3090.

Feb 5, 2026 โ€” Day 0
Project Start
STDP cortex framework initialized. First neurons fire.
Feb 13 โ€” Day 8
"3-Hour Blitz"
Organ transplant from ResurrectionBrain. 72% sensory encoding done in one session.
Feb 19 โ€” Day 14
Genesis v3 Full Training
186K multimodal samples trained through STDP cortex.
Feb 22โ€“23 โ€” Day 17โ€“18 โšก
7 Breakthroughs in 24 Hours
v12.6โ†’v12.12: Auditory semantics (cos 0.39โ†’0.545), QA separation, unified brain, episodic memory, naming, spatial memory, personality core.
Feb 25 โ€” Day 20
Vocoder Breakthrough
Vocos 24kHz fine-tuned (STFT=0.8529). 3ร— faster than HiFi-GAN.
Mar 5 โ€” Day 28 ๐Ÿ†
SNN CTC v2 + Language Model
SNN-native speech recognition: CER=8.1% (6M params + 3-gram LM). Iron Rule #8 achieved.
Mar 6 โ€” Day 29
Cognitive Bug Fixes + Decoder Training
retrieve() bug fixed, language matching integrated. AudioDecoder training started (25.5M).
Mar 12 โ€” Day 35 ๐Ÿ“„
Technical Report Published
This paper. 39 modules. 870K+ training samples. 535 MB VRAM. The journey continues.
The Builder
One person with zero CS degree โ€” an artist, not an engineer
The Method
Human intuition ร— AI pair-programming โ€” every line co-created with LLM agents
The Lesson
You don't need a PhD. You need a vision and the right AI collaborator.

Table of Contents

  1. Introduction
  2. Motivation: Why SNN, Why Now
  3. System Architecture
  4. Sensory Perception Layer
  5. Cognitive Reasoning Pipeline
  6. Memory and Learning Systems
  7. Embodied Intelligence: VGOBot
  8. Performance Benchmarks
  9. Comparison with Existing Systems
  10. Limitations and Future Work
  11. Conclusion

1. Introduction

The dominant paradigm in artificial intelligence โ€” the autoregressive Transformer trained on trillions of tokens โ€” has achieved remarkable capabilities at extraordinary cost. Training a frontier LLM is estimated to require over $100 million in compute; inference consumes approximately 10,000 joules per request. This energy profile is ten billion times higher than the biological neural circuits that evolution has refined over 500 million years.

We present VGO (Vectorized Growth Optimization), a brain-inspired cognitive architecture that demonstrates an alternative path. Built entirely on Spiking Neural Networks (SNN) with Leaky Integrate-and-Fire (LIF) neurons and Spike-Timing-Dependent Plasticity (STDP), VGO is the first system to achieve multimodal cognition โ€” hearing, seeing, speaking, learning, and reasoning โ€” within a single SNN framework running on consumer hardware. Perhaps most remarkably, VGO was built by a non-engineer with no computer science degree, working entirely through AI-assisted pair programming โ€” proving that the barrier to frontier research is no longer credentials, but vision.

VGO is not a toy demonstration. It is a functioning cognitive system with:

2. Motivation: Why SNN, Why Now

2.1 The Energy Problem

The scaling laws of Transformer-based models project ever-increasing compute requirements. Epoch AI estimates that by 2030, training frontier models will require the energy output of a small power plant. This trajectory is physically unsustainable.

The human brain processes multimodal information on approximately 20 watts. A fruit fly navigates complex 3D environments with a brain consuming 10 microwatts. Both achieve this through spike-based computation: information is encoded in the timing of discrete electrical pulses, not in continuous-valued matrix multiplications.

100 pJ
VGO energy per simple inference
10,000 J
LLM energy per request (estimated)
1010ร—
efficiency advantage of VGO over LLMs

2.2 The Learning Problem

LLMs cannot learn in real-time. When you tell ChatGPT your name, it forgets it next session. VGO achieves online learning through STDP โ€” every interaction strengthens or weakens synaptic connections. Tell VGO "this is Wangcai" while showing it a dog, and it will recognize that dog by name the next time โ€” without retraining, without fine-tuning.

2.3 Recent Context: The Eon Digital Fruit Fly

In March 2026, Eon Systems demonstrated a virtual fruit fly driven by 125,000 emulated neurons. While visually impressive, the Eon fly is a static simulation: it cannot learn, cannot adapt, and cannot generate novel behavior. VGO takes the opposite approach: rather than copying biology's structure, we adopt biology's learning rules.

3. System Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚           Client Layer (Flutter App / WebSocket)     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚         Omni Server v3.0 (FastAPI, Mixin design)     โ”‚
โ”‚   /ask (text+image+audio)  /teach  /health  /ws      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                  Cognitive Core                       โ”‚
โ”‚  PARSE โ†’ ROUTE โ†’ RECALL โ†’ THINK โ†’ VERIFY โ†’           โ”‚
โ”‚  SEARCH โ†’ RE-THINK โ†’ OUTPUT                          โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚              Sensory Perception Layer                 โ”‚
โ”‚  ๐Ÿ‘‚ CochleaEncoder โ†’ AuditoryCortex (384d)          โ”‚
โ”‚  ๐Ÿ‘๏ธ RetinaEncoder โ†’ VisualCortex (4096d)           โ”‚
โ”‚  ๐Ÿ—ฃ SNN CTC v2 + 3-gram LM (CER = 8.1%)            โ”‚
โ”‚  ๐Ÿ˜Š EmotionSNN (7-class, audioร—0.6 + visualร—0.4)    โ”‚
โ”‚  ๐Ÿ“ฆ ObjectDetectorSNN (80-class COCO)                โ”‚
โ”‚  ๐Ÿ‘ค FaceDetector + VoiceprintSNN (cross-modal)       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚               Memory & Learning Layer                โ”‚
โ”‚  STDPCortex (4,096 neurons ร— 64 columns, 3.7M syn)  โ”‚
โ”‚  Hippocampus (50K entries, MiniLM 384d, QA-split)    โ”‚
โ”‚  EpisodicMemory (vision + audio + text binding)      โ”‚
โ”‚  CuriosityLoop (surprise > 0.5 โ†’ ask โ†’ learn)       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚          Single NVIDIA RTX 3090 (535 MB / 24 GB)     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

4. Sensory Perception Layer

4.1 SNN Auditory System (CER = 8.1%)

VGO's hearing pipeline is fully SNN-native. The architecture: CochleaEncoder (64-channel Gammatone filterbank) โ†’ AuditoryCortex v15.4 (Conv1D spatiotemporal distillation, cos=0.545) โ†’ SNN CTC Decoder v2 (Conv1D+BiGRU+CTC, 6.0M params, AISHELL-1 trained) โ†’ 3-gram LM (beam search, CER 13%โ†’8.1%, +37.2% relative improvement).

For comparison: Whisper Large-v3 achieves CER ~5% with 764M parameters (127ร— larger).

4.2 SNN Visual System

RetinaEncoder (224ร—224 DVS + Gabor V1) โ†’ VisualCortex (V1โ†’V2โ†’IT, 4096d, CLIP-distilled cos=0.75) โ†’ ObjectDetectorSNN (80 COCO classes, YOLO-distilled) โ†’ FaceDetectorSNN (512d face encoding) โ†’ ColorVision (128d opponent channels) โ†’ EmotionSNN (7-class, multimodal fusion).

4.3 SNN Speech Synthesis

Text โ†’ Pinyin โ†’ Mel (SNN Acoustic, 7.3M params) โ†’ Waveform (Vocos 24kHz, STFT=0.8529, 3ร— faster than HiFi-GAN).

5. Cognitive Reasoning Pipeline

Every text input passes through a 7-stage cognitive pipeline:

  1. PARSE โ€” Intent + entity + language detection
  2. ROUTE โ€” Simpleโ†’fast path; complexโ†’slow path; mathโ†’SymbolicReasoner
  3. RECALL โ€” Semantic retrieval from 50K hippocampus with language-preference filter
  4. THINK โ€” Reflection-verification-correction loop (up to 3 iterations)
  5. VERIFY โ€” Hallucination detection via entropy + surprise thresholds
  6. SEARCH โ€” WebReflex v2.1: autonomous internet search when uncertain
  7. RE-THINK โ†’ OUTPUT โ€” KnowledgeDigester integrates search results, re-runs cognition

6. Memory and Learning Systems

6.1 STDPCortex

GPU-accelerated cortex: 4,096 LIF neurons ร— 64 columns, 3,751,300 synapses. Learning occurs through STDP modulated by surprise signals from PredictiveCoder.

6.2 Curiosity-Driven Active Learning Loop

VGO's most distinctive capability โ€” the curiosity loop: Active observation (surprise > 0.5) โ†’ autonomous question โ†’ user answers โ†’ naming binding (regex extraction โ†’ tri-modal bind) โ†’ spatial memory (auto-deduplicate) โ†’ next encounter โ†’ episodic recall. This closed loop has no equivalent in any LLM or connectome-based system.

7. Embodied Intelligence: VGOBot

Dual-Brain Architecture: LLM Coach (cerebral cortex) provides high-level commands; SNN Cerebellum (61.1M params) converts to joint-level control. Training follows an 8-phase curriculum: static standing (98% success) โ†’ dynamic balance (87%) โ†’ walking โ†’ turning โ†’ squatting โ†’ running โ†’ grasping โ†’ stairs.

8. Performance Benchmarks

QueryLatencySpikesEnergyResult
"ไฝ ๅฅฝ" (Hello)~20 ms00 pJโœ… Smart reply
"1+2+3+4+5 = ?"372 ms00 pJโœ… 15 (symbolic)
"What is SNN?"~1500 ms112102.8 pJโœ… Correct
"sqrt(144)+7ยฒ"301 ms00 pJโœ… 61 (exact)
Multimodal (image+text)~2000 ms~200~300 pJโœ… Naming recall
"2024 Nobel Physics?"~3000 ms~300~500 pJโœ… WebReflex

Energy Comparison

SystemInference EnergyVGO Advantage
VGO (SNN)100โ€“500 pJ/sampleโ€”
Frontier LLM (estimated)~10,000 J/request~1010ร—
GPT-2 (measured)~0.5 J/token~106ร—
Intel Loihi 2~23 pJ/spike~same class

9. Comparison with Existing Systems

SystemTeamHardwareModalitiesSNNOnline Learning
VGO1 person1ร— RTX 3090Allโœ… Fullโœ…
SpikingBrain-1.0Large teamMetaX clusterTextโœ…โŒ
Intel Loihi 3Intel LabsCustom ASICVision+Tactileโœ…Limited
Eon Digital FlyEon SystemsGPU clusterMotorLIF simโŒ
Frontier LLMs1000+ peopleThousands H100AllโŒโŒ

10. Limitations and Future Work

11. Conclusion

VGO demonstrates that multimodal cognition does not require the brute-force matrix multiplication paradigm that defines modern AI. A single engineer, working for 30 days on a single consumer GPU, has produced a cognitive system that:

The implication is profound: intelligence is not a function of compute. It is a function of architecture. The biological brain proved this 500 million years ago. VGO is our attempt to relearn that lesson.

"They copied biology's wiring. We adopted biology's learning rules. The difference is between a photograph and a living organism."

VGO is open-source under the OPEN-Yongnian project. All code, weights, and training scripts are available at the project repository.

About the Author

Robert Hu is the sole developer of the OPEN-Yongnian ecosystem โ€” 41 sub-projects spanning SNN AI, neuromorphic computing, bionic robotics, and digital life cloning. VGO is Pillar III (AI) of this ecosystem. Follow the journey on X: @RobertHuBuild

References

  1. Maass, W. (1997). Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9).
  2. Bi, G. & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons. J. Neuroscience, 18(24).
  3. Shiu, P. et al. (2024). Whole-brain computational model of Drosophila melanogaster. Nature.
  4. Davies, M. et al. (2018). Loihi: A neuromorphic manycore processor. IEEE Micro, 38(1).
  5. Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
  6. Du, J. et al. (2024). SpikeVoice: Neural vocoder with spiking neural networks. ACL 2024.
  7. Eon Systems (2026). The First Multi-Behavior Brain Upload. Technical Demonstration.