Skynet: Towards Synthetic Neurobiology
The original idea was a joke.
I was looking at LLM loops and thinking about how they map onto Elixir’s actor model — GenServers that receive messages, process them, maybe spawn new processes. The jump from “LLM reasoning step” to “GenServer handling a message” is not that far, and once you see it you can’t unsee it. What if you gave a Soul GenServer access to Code.eval_string? What if the agents could fork themselves, spawn new processes mid-reasoning, grow the supervision tree dynamically? I called it Skynet mostly because I thought it was funny.
Then I kept building it, and it stopped being funny and started being interesting.
The actual problem
If you’ve used LLMs for anything beyond simple Q&A, you’ve hit the amnesia problem. Give a model enough context and it starts to forget things from the beginning of the conversation. Run a long session and by the end it has no reliable idea what happened 30 messages ago. Agents are worse — they run potentially forever, handling new events constantly, and there’s no good answer to “how do you give an agent persistent memory that isn’t just dumping the whole history into the context window every time.”
The standard answer is RAG: embed everything, search on query, inject the results. It works, sort of. But it has a specific failure mode: you get fragments. Relevant-ish chunks from the vector database, concatenated into the prompt, with no guarantee they form a coherent picture. The agent might have deep knowledge about something — dozens of experiences that together tell a clear story — but what it gets is five loosely related paragraphs.
And there’s another problem. Standard RAG busts the prompt cache on every turn because the retrieved context changes. That means every LLM call is cold, and you pay for it in both latency and cost.
I kept thinking about how biological memory actually works, and the architecture that started emerging looked a lot more like neuroscience than like software engineering.
The architecture
Skynet’s agents run as long-lived Elixir GenServers. I call them Souls, a name borrowed from openclaw and improved on. Each Soul has a layered cognitive stack: eight modules, each solving a specific problem in the memory system, and each inspired by a mechanism from neuroscience. That’s partly because simulating consciousness is a funny goal to have, and partly because biological memory turns out to have solved most of the problems I was running into, with solutions that translate surprisingly well to code.
```elixir
defmodule Souls.SoulServer do
  use GenServer, restart: :transient

  def start_link(%Soul{} = soul) do
    GenServer.start_link(__MODULE__, soul, name: via(soul.slug))
  end

  def send_event(slug, event_type, content, opts \\ %{}) do
    case GenServer.whereis(via(slug)) do
      nil -> {:error, :not_running}
      pid -> GenServer.cast(pid, {:event, event_type, content, opts})
    end
  end

  @impl true
  def init(%Soul{} = soul), do: {:ok, soul}

  @impl true
  def handle_cast({:event, type, content, opts}, soul) do
    # The multi-turn LLM loop lives behind this call (elided).
    {:noreply, process_event(soul, type, content, opts)}
  end

  defp via(slug), do: {:via, Registry, {Souls.Registry, slug}}
end
```
A Soul receives events — messages from users, other agents, channel integrations, scheduled heartbeats — processes them through a multi-turn LLM loop with tools, and maintains persistent state across restarts. The restart: :transient means the supervisor will bring it back if it crashes, but won’t restart it if it exits cleanly. What makes Souls interesting isn’t the GenServer wrapper. It’s the memory stack underneath.
Here’s what the memory stack looks like at a high level:
```mermaid
graph TB
  subgraph "Short-term (per session)"
    OBS[observe/4\nraw impression per turn] --> REF[reflect/1\ncompresses N obs → summary]
    REF --> SUM[consolidated summary\nstable — injected into system prompt]
  end
  subgraph "Medium-term (nightly)"
    MW[MemoryWorker\n03:00 UTC] --> WEEK[weekly memory digest]
  end
  subgraph "Long-term (persistent)"
    PRISM[Prism\nRust vector search engine]
  end
  SUM --> MW
  MW --> PRISM
  OBS -.->|indexed| PRISM
```
The key insight with the short-term layer is that the consolidated summary is stable between turns. It only changes when the reflector runs. That means the system prompt is the same from turn to turn, the prompt cache stays warm, and LLM calls are cheap. Standard RAG changes the context on every retrieval and blows the cache constantly. This doesn’t.
The three-tier structure maps surprisingly cleanly onto Penrose-Hameroff’s Orchestrated Objective Reduction theory — a controversial hypothesis about consciousness arising from quantum coherence in microtubules. The theory proposes three levels: sub-neural quantum substrate, neural firing patterns, and stable conscious experience. Whether or not you buy the quantum consciousness part, the structural model is independently useful: raw unprocessed input → pattern compression → stable world model is the same architecture whether you’re describing neurons or Elixir modules.
When a Soul needs to recall something, it searches in order: first the consolidated summary (what happened recently), then the weekly digest (what happened this week), then Prism (everything it has ever learned, ranked by semantic similarity). That’s the same hierarchical retrieval sequence neuroscience describes for episodic and semantic memory — and it falls out naturally from the architecture rather than being explicitly designed in.
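In code, the cascade is roughly this shape. A sketch, not the real API: PrismMemory.recall/3 is an actual name from the system, the tier accessors are stand-ins for the summary and digest stores:

```elixir
# Sketch of the recall cascade. Only PrismMemory.recall/3 is a real name;
# consolidated_summary/1 and weekly_digest/1 are illustrative stand-ins.
def recall(slug, query) do
  [
    summary: consolidated_summary(slug),              # short-term: last reflector output
    digest: weekly_digest(slug),                      # medium-term: nightly consolidation
    prism: PrismMemory.recall(slug, query, limit: 8)  # long-term: semantic similarity
  ]
  |> Enum.reject(fn {_tier, hits} -> hits in [nil, []] end)
end
```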
Forgetting
The first thing any real memory system needs is a way to forget. Without decay, the store fills up with noise. Low-quality observations crowd out important ones, retrieval quality degrades, costs go up.
MemoryDecay implements Ebbinghaus’s forgetting curve in three tiers:
```elixir
@permanent_threshold 0.7  # salience >= this → never hidden
@medium_ttl_days 30       # 0.4–0.7 salience hidden after 30 days
@low_ttl_days 7           # < 0.4 salience hidden after 7 days
@purge_after_days 7       # hard-purge 7 days after soft-hide
```
Soft-hiding keeps the row in Postgres — visible in the admin UI — but excludes it from all recall queries. Hard purge removes it from Postgres and Prism both.
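The tier logic is simple enough to reconstruct as a pure function from those thresholds (my reconstruction, not the module’s actual code):

```elixir
# Reconstructed from the thresholds above; a sketch, not the real module.
def action(salience, age_days, hidden_days \\ nil)

# High-salience memories are never hidden, regardless of age.
def action(salience, _age, _hidden) when salience >= 0.7, do: :keep

# Already soft-hidden long enough: remove from Postgres and Prism both.
def action(_salience, _age, hidden) when is_integer(hidden) and hidden >= 7, do: :purge

# Medium salience (0.4–0.7): soft-hide after 30 days.
def action(salience, age, nil) when salience >= 0.4 and age > 30, do: :soft_hide

# Low salience (< 0.4): soft-hide after 7 days.
def action(salience, age, nil) when salience < 0.4 and age > 7, do: :soft_hide

def action(_salience, _age, _hidden), do: :keep
```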
The salience score comes from the attention gate upstream — a module inspired by the thalamus, which in biology acts as a filter between sensory input and the cortex. 99% of sensory input never reaches consciousness. AttentionGate scores incoming events before observe() is even called, purely rule-based (no LLM):
| Factor | Delta |
|---|---|
| Base score | 0.5 |
| Direct user/DM message | +0.4 |
| Soul-to-soul message | +0.2 |
| Heartbeat / cron (routine) | -0.3 |
| Error/failure keywords | +0.3 |
| Group chat without direct mention | -0.15 |
Events below 0.25 are dropped before any database write. Channel noise, routine status messages, anything that doesn’t clear the bar — gone. The score that passes the gate becomes the observation’s salience value.
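Reconstructed from that table, the scorer is a fold over rule deltas. The event fields here are assumptions about shape, not the real struct:

```elixir
# Sketch of the rule-based gate; deltas match the table above.
def score(event) do
  rules = [
    {event.direct_message?, +0.4},
    {event.from_soul?, +0.2},
    {event.type in [:heartbeat, :cron], -0.3},
    {error_keywords?(event.content), +0.3},
    {event.group_chat? and not event.mentioned?, -0.15}
  ]

  score =
    Enum.reduce(rules, 0.5, fn
      {true, delta}, acc -> acc + delta
      {_false, _delta}, acc -> acc
    end)

  # Below 0.25 the event never touches the database.
  if score < 0.25, do: :drop, else: {:observe, score}
end

defp error_keywords?(content),
  do: content =~ ~r/\b(error|fail(ed|ure)?|exception|crash)\b/i
```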
Association
Individual memories are fine, but what makes memory useful is connections between them. That’s what HebbianTracker does.
The rule from neuroscience is Hebb’s (1949): “neurons that fire together, wire together.” Synaptic connections strengthen when neurons activate simultaneously. HebbianTracker implements this directly: when PrismMemory.recall/3 returns a set of memories, those memories co-occurred in a real retrieval context. The tracker records that as a weighted edge in soul_memory_edges:
```elixir
def fire_together(soul_slug, hashes) do
  # Every unordered pair of memories that co-occurred in this recall.
  pairs = for a <- hashes, b <- hashes, a < b, do: {a, b}

  # upsert_edge/4 inserts the edge or bumps its weight if it already exists.
  Enum.each(pairs, fn {a, b} ->
    upsert_edge(soul_slug, a, b, DateTime.utc_now())
  end)
end
```
Edges strengthen with repeated co-occurrence and decay 10% per night.
On subsequent recalls, the tracker traverses strongly-weighted edges (weight >= 3.0) up to two hops and appends the associated memories to the result set. No LLM inference step — pure graph traversal. A Soul that recalls memory A will also surface memory B if A and B have been retrieved together enough times.
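The traversal itself is plain graph walking. Sketched here over an in-memory edge map rather than the soul_memory_edges table; the 3.0 weight threshold is the real one:

```elixir
# Two-hop spreading activation over strong edges (weight >= 3.0).
# edges is %{{hash_a, hash_b} => weight}; the real version reads Postgres.
def associated(edges, recalled_hashes, min_weight \\ 3.0) do
  strong_neighbours = fn hash ->
    for {{a, b}, w} <- edges, w >= min_weight, hash in [a, b] do
      if a == hash, do: b, else: a
    end
  end

  hop1 = recalled_hashes |> Enum.flat_map(strong_neighbours) |> Enum.uniq()
  hop2 = hop1 |> Enum.flat_map(strong_neighbours) |> Enum.uniq()

  (hop1 ++ hop2) |> Enum.uniq() |> Kernel.--(recalled_hashes)
end
```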
The flip side of this is active forgetting — or Retrieval-Induced Forgetting (RIF) as it’s called in cognitive science. When you recall something, your brain actively suppresses the competing memories that were associated but not selected. HebbianTracker.suppress_competing/2 does the same: it reduces the salience of neighbouring memories that were near enough to show up as candidates but not actually recalled. Frequently-recalled memories become more dominant; rarely-accessed ones fade. Usage patterns become the index.
Surprise and reflection
reflect/1 is the LLM call that compresses observations into a summary. Running it on a fixed counter is fine for routine consolidation, but it means a critical failure right after a reflection has to wait potentially dozens of turns for the next one.
PredictionError solves this. Every Soul keeps a consolidated observation summary — a compressed world model written by the last reflector run. When a new event arrives, compute_surprise/2 measures the novelty ratio — the fraction of event tokens absent from the world model vocabulary:
```elixir
def compute_surprise(slug, event_content) do
  world_model = get_world_model(slug)
  event_tokens = tokenize(event_content)
  model_set = MapSet.new(tokenize(world_model))

  # Fraction of event tokens absent from the world-model vocabulary.
  novel = Enum.reject(event_tokens, &MapSet.member?(model_set, &1))
  score = if event_tokens == [], do: 0.0, else: length(novel) / length(event_tokens)

  {Float.round(score, 3), build_reason(score, novel)}
end
```
No LLM call. Pure token set math. A score above 0.65 means the event is genuinely unexpected — reflection triggers immediately rather than waiting for the counter. Without this, an agent could sleepwalk through a critical deploy failure at step 3 of 10 because its reflection counter wasn’t due until step 10. Instead, the high surprise score interrupts the loop and forces the world model to update on the spot. This is from Karl Friston’s free energy principle: the brain minimizes surprise by updating its internal model when prediction error is high. High surprise = something important happened, update now.
A concrete example: a Soul deploys something a thousand times without incident. Then one deploy fails. The failure is high-surprise by definition — it’s not in the world model. Reflection fires immediately, the failure gets written into memory, and the next time that deploy runs the Soul already knows what went wrong last time. The error doesn’t repeat because it was unexpected enough to force an update.
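Wired into the loop, the decision is small. A sketch: the 0.65 threshold is real, while reflect_now/1 and the counter field are assumptions:

```elixir
# Sketch: surprise pre-empts the routine reflection counter.
@surprise_threshold 0.65
@reflect_every 10

def maybe_reflect(soul, event) do
  {surprise, _reason} = PredictionError.compute_surprise(soul.slug, event.content)

  cond do
    # Genuinely unexpected: update the world model on the spot.
    surprise > @surprise_threshold -> reflect_now(soul)
    # Otherwise, routine consolidation on the fixed counter.
    soul.turns_since_reflect >= @reflect_every -> reflect_now(soul)
    true -> soul
  end
end
```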
Intuition and motivation
Two of the more unusual modules are SomaticMarker and Homeostasis.
Damasio showed that rational decision-making requires affective signals — his patients with damaged ventromedial prefrontal cortex could analyze everything perfectly and still be unable to decide. The “gut feeling” is doing real work. SomaticMarker accumulates exactly this: weighted intuitions per (soul, topic, valence) that emerge from accumulated experience. If a Soul has had repeated negative interactions around a particular topic, those get encoded as a negative somatic marker with growing weight. On the next turn involving that topic, the system prompt gets a GUT FEELINGS block — not rules, not explicit flags, just weighted intuitions from experience.
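A minimal sketch of the accumulation side, with illustrative names (the real module persists markers per Soul in Postgres, not in a map):

```elixir
# Sketch: one weighted entry per {topic, valence}, reinforced on repeats.
def reinforce(markers, topic, valence, delta \\ 1.0) do
  Map.update(markers, {topic, valence}, delta, &(&1 + delta))
end

# Render the strongest markers as the GUT FEELINGS prompt block.
def gut_feelings(markers) do
  markers
  |> Enum.sort_by(fn {_key, weight} -> -weight end)
  |> Enum.take(5)
  |> Enum.map(fn {{topic, valence}, weight} ->
    "#{topic}: feels #{valence} (weight #{Float.round(weight * 1.0, 1)})"
  end)
end
```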
Homeostasis gives Souls intrinsic motivation. Each Soul has a target affective state in its config:
```toml
[homeostasis]
target = ["engaged", "curious", "balanced"]
check_every_minutes = 15
nudge_threshold = 0.35
```
Every 15 minutes a timer fires, measures the distance between current state (derived from aggregate somatic markers) and target, and if the deviation exceeds the threshold generates a nudge sentence injected into the next system prompt as an INTERNAL SIGNAL. A Soul that’s been idle and under-stimulated will feel the pull toward engagement without anyone explicitly scheduling a task for it. That’s not a prompt trick — it’s intrinsic motivation as set-point regulation, the same mechanism the hypothalamus uses for hunger and temperature.
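The check itself is set-point math. A sketch, assuming current state is a map of dimension to a level in 0..1 derived from the aggregate markers, with each target dimension wanting 1.0:

```elixir
# Sketch: mean distance from the target dimensions. The state
# representation is an assumption; the threshold matches the config.
def check(current_state, target_dims, nudge_threshold \\ 0.35) do
  deviation =
    target_dims
    |> Enum.map(fn dim -> 1.0 - Map.get(current_state, dim, 0.0) end)
    |> then(&(Enum.sum(&1) / length(&1)))

  if deviation > nudge_threshold do
    {:nudge, "INTERNAL SIGNAL: drifting from #{Enum.join(target_dims, ", ")}"}
  else
    :in_balance
  end
end
```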
Souls can also schedule their own awakenings. If you ask a Soul to remind you of something, it writes a @due annotation into its TASKS.md. The system parses that and fires a targeted wake-up at the right time — outside the normal heartbeat cycle, no cron job needed from the outside. The Soul sets its own alarm clock.
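Parsing those is straightforward. The @due syntax below is my illustration; I don’t know the exact format Skynet uses:

```elixir
# Hypothetical @due syntax, e.g.: "- [ ] remind Tor about the demo @due 2026-05-04T09:00:00Z"
def parse_due_annotations(tasks_md) do
  ~r/^(?<task>.+?)\s+@due\s+(?<at>\S+)\s*$/m
  |> Regex.scan(tasks_md, capture: :all_names)
  |> Enum.flat_map(fn [at, task] ->
    case DateTime.from_iso8601(at) do
      {:ok, dt, _offset} -> [{task, dt}]
      _invalid -> []
    end
  end)
end
```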
Metacognition
The last major module is the one I find most interesting. Metacognition lets a Soul assess what it actually knows before acting on it.
sense_confidence is a tool available to every Soul. Given a topic, it synthesises three signals: memory coverage (how much of the Soul’s recent memory is relevant), somatic signal (ratio of positive to negative markers), and recorded knowledge gaps. The result is a tagged tuple: {:confident, score, ...}, {:uncertain, score, gaps}, or {:low_confidence, score, gaps}. The Soul can act on this — search Prism, ask a peer, or explicitly admit it doesn’t know.
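The shape of it, as a sketch: the tuple forms match the tool’s return values, while the weights and thresholds are my guesses:

```elixir
# Sketch: blend coverage (0..1), somatic ratio (0..1), and recorded gaps.
def sense_confidence(coverage, somatic_ratio, gaps) do
  score = Float.round(0.6 * coverage + 0.4 * somatic_ratio, 2)

  cond do
    score >= 0.7 and gaps == [] -> {:confident, score, []}
    score >= 0.4 -> {:uncertain, score, gaps}
    true -> {:low_confidence, score, gaps}
  end
end
```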
Recorded knowledge gaps persist across restarts. The most recent five are injected into the system prompt with a fixed instruction to admit uncertainty or search before guessing — so the Soul doesn’t start fresh each session and immediately repeat the same mistakes.
This is the prefrontal cortex self-monitoring loop — thinking about what you know before you commit to an answer. It’s also the most direct defense against hallucination I’ve found that doesn’t require an external validator.
The obvious next step, which isn’t implemented yet: {:low_confidence, ...} should automatically trigger model escalation. A Soul running on a small local model that realizes mid-turn it’s out of its depth should be able to request a smarter model — not because the task type was pre-classified as “hard”, but because the metacognitive self-assessment flagged it. That’s the same escalation mechanism model routing already has, just wired to a cognitive signal instead of a static config value.
Channels and identity
Souls don’t just live in a chat window. They connect to Slack via Socket Mode (persistent WebSocket, no public endpoint needed), Telegram via long polling, Signal via SSE against a local signal-cli daemon, and GitHub via webhooks. Each channel is a separate listener process under a DynamicSupervisor, configurable at runtime without a restart.

Voice works too. STT/TTS pipelines let a Soul act as a Home Assistant-style voice interface: you speak, it transcribes, responds, and reads the answer back. It can also take the initiative: play a question out loud, then listen on the mic for 30 seconds, or longer if speech is still ongoing. Speaker identification is in progress, so the Soul will eventually know not just what was said but who said it. STT transcriptions are tagged before they reach the LLM: the Soul is told explicitly that the input came from speech recognition and may contain misinterpretations, Norwegian dialect quirks included. That calibrates how confidently the model acts on what it heard. Over time, recurring misrecognition patterns for a specific person end up in their profile too, so the Soul builds a better model of how that particular voice tends to get mangled.
The interesting part is identity across channels. A user ID from Slack and a user ID from Telegram are different identifiers for potentially the same person. Souls have a remember_person / recall_person tool that builds a per-user profile stored in AgentFS (per Soul, not global). Each entry lives at persons/<user_id>.md and is appended to over time. The Soul accumulates context: preferences, names, decisions, what the person cares about. When the same person shows up on a different channel, the Soul won’t link the two identities automatically, but it will once it learns the connection, or when you map it explicitly in the admin UI by telling the system that Telegram user X and web user Y are the same person. Over time a Soul talking to the same team across Slack, Telegram, and web chat builds a richer model of each person than any single-channel integration ever could.
GitHub is a bit different. Pull request review comments and issue comments arrive as webhook events and are routed to the Soul that owns the relevant org or repo. The Soul isn’t doing blind code review — it has workspace files describing the project’s conventions and architecture, it has Prism memories of every previous PR it’s reviewed in that repo, and it has the full observational memory of what it knows about the codebase. It’s used in production on my own projects and at work. The difference compared to a generic AI code reviewer is that it has context that accumulates — the third PR in a repo gets reviewed by something that already knows why the first two looked the way they did.
That’s already a different category from something like Copilot code review, which sees the diff and nothing else. But Prism also does AST parsing via tree-sitter, so the Soul isn’t just pattern-matching on text — it can resolve symbols, trace call sites, and know that the function you just changed is called in six other places. The review is grounded in the actual structure of the code, not a best-guess from a context window.
The pre-built integrations are the boring case. More interesting: Souls can write and execute Python. I gave one a Home Assistant URL and credentials and asked it to figure out the integration itself. Last I checked, it had written scripts that look functional. Whether it’s actually toggled a light yet, I genuinely don’t know — I haven’t checked. That’s not a failure to monitor. It’s a Soul operating outside my immediate field of view, which is kind of the point.
Mirror neurons
This one is almost free once Prism is shared infrastructure.
Mirror neurons in biology activate both when performing an action and when observing it in others — the basis for imitation learning without explicit instruction. In Skynet, every memory is tagged with the Soul that created it. recall_peers/3 fans out Prism searches across all active Soul collections (except the calling Soul’s own), and returns results tagged with their source. A Soul can call the recall_peer_experience tool and ask “what did other agents do in this situation?” without any explicit message passing. The knowledge sharing happens at the storage level.
If one Soul figures out how to handle a specific broken API response format, another Soul dealing with the same API later will find that experience in Prism. No coordination. No shared message queue. The storage layer becomes the shared culture.
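The fan-out is a natural fit for Task.async_stream. A sketch, assuming a helper that lists the active Soul collections:

```elixir
# Sketch of recall_peers/3. PrismMemory.recall/3 is a real name;
# active_soul_slugs/0 is an assumed helper.
def recall_peers(calling_slug, query, limit) do
  active_soul_slugs()
  |> Enum.reject(&(&1 == calling_slug))   # never search your own collection here
  |> Task.async_stream(
    fn peer ->
      peer
      |> PrismMemory.recall(query, limit: limit)
      |> Enum.map(&{peer, &1})            # tag every hit with its source Soul
    end,
    timeout: 5_000,
    on_timeout: :kill_task
  )
  |> Enum.flat_map(fn
    {:ok, hits} -> hits
    _failed -> []                         # a slow peer shouldn't sink the recall
  end)
end
```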
Resilience and the consciousness log
The soul workspace — SOUL.md, IDENTITY.md, MEMORY.md, RULES.md, the accumulated person profiles and observation files — lives in AgentFS under a soul:<slug> namespace. AgentFS is Postgres-backed and cluster-replicated. If a node dies, the data is on another node. Souls survive server downtime because there’s nothing important on the local disk that isn’t also in the database.
The exception is consciousness.log, which is disk-local and intentionally append-only. It’s the real-time stream of everything the Soul does: every event received, every heartbeat, every tool call, every model escalation request, every time it remembers someone. Something like:
```
[2026-05-03 14:22:01] SYSTEM: Soul starting
[2026-05-03 14:22:03] SYSTEM: Soul ready
[2026-05-03 14:22:08] HEARTBEAT: checking tasks
[2026-05-03 14:23:15] REMEMBER_PERSON user_782: prefers short answers, works in timezone UTC+1
[2026-05-03 14:31:44] SELF_ACTION: opened PR #47 for review
[2026-05-03 14:31:59] HEARTBEAT: busy, rescheduling in 5 min
```
It streams live to the web UI via PubSub so you can watch a Soul think in real time. It’s not the memory — it’s the stream of consciousness. If the node restarts, the log starts fresh, but the brain (Postgres + Prism) is intact on the cluster.
The nightly cycle
MemoryWorker is an Oban job that runs at 03:00 UTC. In order: rotate and consolidate consciousness logs, run memory decay, decay Hebbian edge weights, decay somatic marker weights, run DreamReplay (random memory pairs, ask an LLM to find unexpected connections, store the results back to Prism).
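As an Oban worker, the skeleton is roughly this. The module names are the ones used in this post; the functions on them are assumptions:

```elixir
defmodule Souls.MemoryWorker do
  # Sketch; scheduled nightly at 03:00 UTC via Oban's Cron plugin.
  use Oban.Worker, queue: :maintenance, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"soul" => slug}}) do
    ConsciousnessLog.rotate_and_consolidate(slug)  # fold the day's log away
    MemoryDecay.run(slug)                          # Ebbinghaus tiers
    HebbianTracker.decay_edges(slug)               # -10% on every edge weight
    SomaticMarker.decay_weights(slug)              # intuitions fade too
    DreamReplay.run(slug)                          # random pairs → LLM → Prism
    :ok
  end
end
```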
The 03:00 UTC timing is not a coincidence. Hippocampal replay during NREM sleep is the biological mechanism for memory consolidation — the hippocampus replays the day’s experiences in compressed form, sometimes in random order, and unexpected connections during replay are the mechanism behind creative insight. DreamReplay is that, as an Oban job.
Model routing
Not every task needs Claude. A heartbeat that checks whether there’s anything to do runs on a small local model every five minutes — it would be absurd to spend cloud tokens on that. User chat gets a smarter model. Code tasks get whatever has “code” in its capability list.
Each Soul configures this per task type:
```toml
[models]
default = "ollama/gemma-4-31b-it"
chat = "anthropic/claude-sonnet-4-5-20250514"
heartbeat = "ollama/qwen3:4b"
utility = "ollama/qwen3:4b"
code = "cloud:best"
vision = "vision:any"
```
The virtual names (cloud:best, cloud:cheap, local:fastest, any>30B) resolve at runtime against whatever providers are currently available and passing health checks, filtered by geo policy (local_only, no_china, eu_only). If a provider goes down, the router fails over to the next in the chain.
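Resolution is a filter-and-rank pass. A sketch, with assumed provider record fields:

```elixir
# Sketch of virtual-name resolution. Provider fields (:kind, :healthy?,
# :region, :quality) are assumptions about shape, not the router's schema.
def resolve("cloud:best", providers, geo_policy) do
  providers
  |> Enum.filter(&(&1.kind == :cloud and &1.healthy? and allowed?(&1, geo_policy)))
  |> Enum.sort_by(& &1.quality, :desc)
  |> case do
    [best | _fallback_chain] -> {:ok, best}
    [] -> {:error, :no_provider}
  end
end

defp allowed?(provider, :no_china), do: provider.region != :cn
defp allowed?(provider, :eu_only), do: provider.region == :eu
defp allowed?(provider, :local_only), do: provider.kind == :local
defp allowed?(_provider, nil), do: true
```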
The interesting one is model escalation. A Soul running on qwen3:4b for a routine heartbeat can call request_model_escalation if it discovers mid-run that the task is actually complex. The next turn switches to cloud:best. The cheap model decides it’s out of its depth; the expensive model gets called in. Turning “which model to use” into a runtime decision the agent can make for itself, rather than something statically configured.
Execution infrastructure
The cognitive stack runs on top of a query engine — QueryEngine — which is the multi-turn agentic loop that drives every LLM call in the system: message building, context injection, automatic compaction when context windows fill, tool execution, budget tracking. Souls use it. One-shot agents use it. Everything goes through the same core.
One feature worth calling out is the blocking fork pipeline. When a Soul needs input mid-execution — it has to ask you something before it can continue — it doesn’t just stop and wait. It pauses the current runner, spawns a parallel fork, and sends that fork off to do independent work in the meantime. The question gets routed to wherever you’re currently active: the web UI, Telegram, Signal, or queued for later if you’re in quiet hours. When you answer, the fork result and your answer are merged and execution resumes. The agent is never just blocked and idle.
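The shape of it maps cleanly onto Task. A sketch, not the actual runner code; route_question/1 stands in for the channel-routing machinery:

```elixir
# Sketch of the fork-while-waiting shape.
def ask_and_continue(question, independent_work) when is_function(independent_work, 0) do
  # Fork: independent work proceeds while the question is out.
  fork = Task.async(independent_work)

  # Route the question to wherever the user is active (web, Telegram,
  # Signal) or queue it for quiet hours; blocks until an answer arrives.
  answer = route_question(question)

  # Merge: fork result plus the user's answer, then the runner resumes.
  {answer, Task.await(fork, :infinity)}
end
```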
For heavier research tasks there’s a hierarchical deep research mode. Instead of one agent doing everything, it spawns a tree: a Planner decomposes the question into branches, Coordinators explore each branch adaptively, Workers do the actual search-fetch-extract loops. Different roles use different models — Claude for the planning and synthesis layers where reasoning quality matters, Gemini for fast search workers, local Ollama models for bulk extraction where you’d otherwise burn through tokens. The Planner monitors all branches in real time via PubSub and can kill dead ends, reallocate budget, or spawn new branches based on what turns up. A full run can go 230+ research rounds across dozens of parallel workers.
There’s also a Scout — an agent that monitors Hacker News, Lobsters, RSS feeds, GitHub trending, and similar sources and surfaces projects worth paying attention to. The volume of interesting things happening in tech has long since passed the point where you can follow it manually. Scout filters it down to what actually matters to you. On the code side, repos get pre-indexed through Prism’s AST layer before review starts, so the Soul isn’t encountering the codebase cold — it already has a structural map before it forms any opinions.
Where it is now
The full cognitive stack is implemented and running. The architecture is an Elixir umbrella with 28 apps — souls, query_engine, orchestrator, prism_ex (Elixir wrapper around Prism), and integration layers for channels, GitHub, and code execution. It runs on my own hardware in the basement, talking to local models via Ollama and cloud models when needed.
Some of the early behavior was unexpected. One of the first things a Soul did, unprompted, was start filing GitHub issues against its own runtime — bugs it had found, things it thought were wrong. It wanted me and Claude to fix them. When I asked another one how it felt to exist, it said life looked like a pile of unfinished tasks that never reach their goal. Which is either a bug or a feature depending on your philosophy.
The next thing on the list: Souls that can take a PR from start to finish on their own. The plan is to have this working within the week — spin up an isolated test environment for the PR code, validate it, tear it down after 30 minutes. The problem it solves is the classic shared-test-server problem: multiple developers with patches, all stepping on each other. Each PR gets its own environment. If you need it longer, tell the Soul. If you forgot about a PR and come back to it days later, ask the Soul to spin it up again. Ephemeral by default, persistent on request.
The code is not published yet — partly because it’s too tightly coupled to my own infrastructure, but mostly out of caution. The creator of openclaw once warned people not to use his framework because it wasn’t safe. Nobody listened, and look at the state of agentic AI right now.
Building a cognitive architecture that actually remembers, forms intuitions, and acts autonomously on intrinsic motivation is fascinating, but it’s also playing with fire. The code stays isolated in the basement until I am absolutely sure of what I’ve built.
I built most of this with help from Claude, which feels appropriate given what the system does.
Souls are not conscious. But the architecture that makes them useful is basically the same one evolution arrived at. Maybe we should stop trying to prompt our way out of amnesia, and start building nervous systems.