Architecture Overview
Design philosophy
Substrate's architecture is built around one principle: the graph should always reflect reality. Every node and edge is derived from actual source-code analysis — never from manually maintained diagrams.
The two graph layers (future vision)
Intended Graph (G_I)
What should exist. Architectural intent is captured from:
- Policies (Rego rules): planned
- ADRs (Architecture Decision Records): planned
- Approved topology and golden paths: planned
- Declared infrastructure: planned
Observed Graph (G_R)
What actually exists. Runtime reality is captured from:
- Live code dependencies (GitHub, tree-sitter AST): implemented
- Running services (Kubernetes API): planned
- Deployed infrastructure (Terraform state): planned
Current implementation
Today, Substrate builds G_R from GitHub repositories. The ingestion pipeline:
- Shallow-clones the target repository
- Discovers and classifies every file
- Parses imports via per-language `substrate-graph-builder` plugins (tree-sitter)
- Chunks file contents (AST-aware or semantic) and embeds each chunk with a contextual breadcrumb
- Writes nodes and edges into PostgreSQL + Apache AGE
- Serves the merged graph through a read-only REST API behind the Gateway
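The classification step can be pictured with a minimal sketch. The extension table, the `classify` name, and the `"other"` fallback are assumptions made for illustration; the real classifier's rules are not described in this document.

```python
from pathlib import PurePosixPath

# Hypothetical extension table: illustrative only, not Substrate's actual rules.
_BY_SUFFIX = {
    ".py": "source", ".ts": "source", ".go": "source",
    ".yaml": "config", ".yml": "config", ".toml": "config", ".json": "config",
    ".md": "doc", ".rst": "doc",
    ".sh": "script",
    ".png": "asset", ".svg": "asset",
}


def classify(path: str) -> str:
    """Return a coarse category for a repository-relative path."""
    suffix = PurePosixPath(path).suffix.lower()
    # "other" is an assumed fallback for anything the table does not cover.
    return _BY_SUFFIX.get(suffix, "other")
```

A production classifier would likely also look at file names and directory conventions, which a suffix lookup cannot capture.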
Service boundaries
Gateway service
Single ingress point for all API traffic after nginx.
- JWT validation via Keycloak JWKS (cached with 5-minute TTL, background refresh)
- Request routing to downstream services (`/api` → graph for reads, ingestion for sync mutations)
- SSE fan-out at `GET /api/events`; no WebSockets
- CORS configuration driven by `.env.<mode>`

The gateway shares one `httpx.AsyncClient` for connection pooling and retries idempotent methods at the application level when a connection error occurs.
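As a rough sketch of the retry rule above, the helper below retries only idempotent methods and only on connection failures. The function name, attempt count, backoff, and the exact method set are assumptions; the sketch catches the built-in `ConnectionError` to stay dependency-free, whereas a gateway built on httpx would presumably catch `httpx.ConnectError`.

```python
import asyncio
from typing import Awaitable, Callable

# Idempotent methods per HTTP semantics; which ones the gateway retries is an assumption.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


async def send_with_retry(
    method: str,
    send: Callable[[], Awaitable[object]],
    attempts: int = 3,
    base_delay: float = 0.05,
) -> object:
    """Call send(); retry connection failures only when the method is idempotent."""
    max_attempts = max(1, attempts) if method.upper() in IDEMPOTENT_METHODS else 1
    for attempt in range(1, max_attempts + 1):
        try:
            return await send()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Exponential backoff before the next attempt.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Non-idempotent methods such as `POST` get exactly one attempt, so a sync mutation is never silently replayed.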
Ingestion service
Sync orchestrator that transforms source code into graph data.
| Capability | Status |
|---|---|
| GitHub connector (shallow clone) | Implemented |
| File classification (source / config / doc / script / asset / …) | Implemented |
| Tree-sitter import extraction (15 languages via `substrate-graph-builder`) | Implemented |
| AST-aware chunking + semantic (markdown/text) + line-greedy fallback | Implemented |
| Embedding pipeline (file summaries + chunks, 896-dim) | Implemented |
| Sync scheduling (`sync_schedules`, poller) | Implemented |
Ingestion writes to the same substrate_graph database that the graph service reads. There is no separate substrate_ingestion database and no message bus.
Graph service
Read-only query layer over the code-knowledge graph.
- Serves merged graph snapshots across multiple syncs
- Semantic search via pgvector cosine distance over file-level embeddings
- Enriched summary pipeline: full file reconstruction + top-K edge neighbors → dense LLM → cached in `file_embeddings.description`
- Source metadata CRUD (for connected repositories)
- Sync history + schedule reads (writes live in ingestion)
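The "full file reconstruction" step can be illustrated with a small sketch of line-overlap deduplication. The `Chunk` fields (1-based inclusive line ranges) are an assumed schema, and the byte cap mirrors the 5 MB limit this document gives for enriched summaries; the actual `content_chunks` layout is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start_line: int  # 1-based, inclusive (assumed schema)
    end_line: int    # inclusive
    text: str


def reconstruct(chunks: list[Chunk], max_bytes: int = 5 * 1024 * 1024) -> str:
    """Stitch chunks in line order, dropping lines an earlier chunk already covered."""
    lines: list[str] = []
    covered_through = 0  # highest line number emitted so far
    for chunk in sorted(chunks, key=lambda c: (c.start_line, c.end_line)):
        for offset, line in enumerate(chunk.text.splitlines()):
            line_no = chunk.start_line + offset
            if line_no > covered_through:
                lines.append(line)
                covered_through = line_no
    text = "\n".join(lines)
    # Cap the result by bytes; a character split by the cut is dropped.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
```

Sorting by start line first means overlapping chunks contribute only their unseen tail, regardless of the order rows come back from the database.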
Frontend
React dashboard for graph exploration and source management.
- Cytoscape.js canvas (WASM engine planned)
- OIDC via `react-oidc-context`
- Server state: TanStack Query; client state: Zustand
- Nginx inside the container proxies `/api`, `/auth`, `/ingest` to the gateway on the `substrate_internal` bridge
Request flow
flowchart TB
subgraph Browser["Browser"]
UI[React Frontend]
end
subgraph Edge["Frontend container: nginx"]
PROX[Static assets + proxy]
end
subgraph Gateway["Gateway :8080"]
AUTH[JWT validation]
FAN[SSE fan-out]
PROXY[HTTP proxy]
end
subgraph Services["Services"]
ING[Ingestion :8081]
GRAPH[Graph :8082]
end
subgraph Data["PostgreSQL"]
REL[(Relational + pgvector)]
AGE[(AGE graph 'substrate')]
SSE[(sse_events + LISTEN/NOTIFY)]
end
UI -->|HTTP + SSE| PROX
PROX -->|/api, /auth, /ingest| Gateway
AUTH --> PROXY
PROXY -->|/api/graph/*, /api/sources/* GET| GRAPH
PROXY -->|/api/syncs POST/DELETE, /api/schedules writes, /ingest/*| ING
FAN -->|LISTEN substrate_sse| SSE
FAN -->|replay| SSE
ING -->|Write| REL
ING -->|Write| AGE
ING -->|NOTIFY| SSE
GRAPH -->|Read| REL
GRAPH -->|Read| AGE
Data flow: GitHub repository → graph
flowchart LR
GH[GitHub Repo] -->|git clone --depth 1| ING
ING -->|discover + classify| DISC[Files]
DISC -->|tree-sitter plugin| PARSE[Imports + symbols]
DISC -->|AST/semantic chunker| CHK[Chunks with breadcrumbs]
ING -->|file_embeddings row| REL[file_embeddings]
ING -->|content_chunks rows| REL2[content_chunks]
ING -->|:File nodes + edges| AGE[AGE graph]
ING -->|embed summaries| EMB[Embedding LLM]
EMB -->|backfill| REL
ING -->|embed chunks| EMB
EMB -->|backfill| REL2
Summary pipelines (two, clearly separate)
Ingestion-side file summary (cheap, embedded once at sync time):
- Template: `"path: <p>\ntype: <t>\nlanguage: <l>\n\n<first 100 lines>"`
- Prefixed with `search_document:`, truncated to 1400 chars, sent to the embedding LLM
- 896-dim vector stored in `file_embeddings.embedding`
- Never stored as text
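The template above can be sketched as a small builder. Two details are assumptions rather than facts from this document: that the `search_document:` prefix is followed by a single space, and that the 1400-character limit applies to the whole string after prefixing. The function and constant names are hypothetical.

```python
PREFIX = "search_document: "  # trailing space is an assumption
MAX_CHARS = 1400
MAX_LINES = 100


def build_summary_input(path: str, file_type: str, language: str, content: str) -> str:
    """Render the file-summary template and truncate it for the embedding model."""
    head = "\n".join(content.splitlines()[:MAX_LINES])
    body = f"path: {path}\ntype: {file_type}\nlanguage: {language}\n\n{head}"
    # Truncation is applied after prefixing (assumed order).
    return (PREFIX + body)[:MAX_CHARS]
```

For example, `build_summary_input("src/a.py", "source", "python", text)` yields a string starting with `search_document: path: src/a.py`, which would then be sent to the embedding endpoint.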
Graph-side enriched summary (on demand via `GET /api/graph/nodes/{id}/summary`):
- Full file reconstructed from `content_chunks` (line-overlap dedup, cap 5 MB)
- Top-K edge neighbors (`summary_edge_neighbors=10`) ranked by cosine similarity of file embeddings
- Each neighbor's first 8 lines + cached description attached
- Total prompt budget: 100 000 chars (88 % file, 10 % neighbors); context-overflow retries at 0.5× then 0.25× budget
- Dense LLM call (`temperature=0.2`, `max_tokens=400`, `enable_thinking=false`)
- Result cached in `file_embeddings.description` + `description_generated_at`
- Never embedded: this is English text for humans
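The budget arithmetic and the shrinking retries can be sketched as follows. `ContextOverflow`, `summarize`, and the `call_llm` callable are hypothetical names; how the service actually detects overflow and assembles the prompt is not described here, and the remaining 2 % of the budget is simply left unallocated in this sketch.

```python
from typing import Callable, Iterator

TOTAL_BUDGET = 100_000   # characters
FILE_PERCENT = 88
NEIGHBOR_PERCENT = 10
RETRY_DIVISORS = (1, 2, 4)  # full budget, then 0.5x, then 0.25x


class ContextOverflow(Exception):
    """Hypothetical error raised when the LLM rejects an over-long prompt."""


def budgets() -> Iterator[tuple[int, int, int]]:
    """Yield (total, file_chars, neighbor_chars) for each attempt."""
    for divisor in RETRY_DIVISORS:
        total = TOTAL_BUDGET // divisor
        yield total, total * FILE_PERCENT // 100, total * NEIGHBOR_PERCENT // 100


def summarize(file_text: str, neighbor_text: str, call_llm: Callable[[str], str]) -> str:
    """Try the full budget first, shrinking the prompt after each overflow."""
    last_error = None
    for _, file_chars, neighbor_chars in budgets():
        prompt = file_text[:file_chars] + "\n\n" + neighbor_text[:neighbor_chars]
        try:
            return call_llm(prompt)
        except ContextOverflow as exc:
            last_error = exc
    raise last_error
```

With these numbers the three attempts allow 88 000, 44 000, and 22 000 characters of file content respectively.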
Security architecture
Authentication
- Keycloak OIDC with PKCE for the SPA
- JWT access tokens (RS256) validated by the Gateway
- JWKS fetched with 5-minute TTL cache and stale-while-revalidate refresh
- `verify_aud=False` at the gateway (audience not enforced today)
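The JWKS caching behaviour (5-minute TTL, stale-while-revalidate) might look roughly like the sketch below. The `JwksCache` class, its injected `fetch` coroutine, and the injectable clock are all hypothetical; this is one way to implement the pattern, not the gateway's actual code.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class JwksCache:
    """Serve cached keys; once the TTL expires, refresh in the background."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[dict]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Optional[dict] = None
        self._fetched_at = 0.0
        self._refresh: Optional[asyncio.Task] = None

    async def _load(self) -> None:
        self._keys = await self._fetch()
        self._fetched_at = self._clock()

    async def _refresh_quietly(self) -> None:
        try:
            await self._load()
        except Exception:
            pass  # keep serving the stale keys; a later call retries

    async def get(self) -> dict:
        if self._keys is None:
            await self._load()  # cold start: nothing to serve yet, so block
        elif self._clock() - self._fetched_at >= self._ttl:
            # Stale: start at most one background refresh and return the old keys now.
            if self._refresh is None or self._refresh.done():
                self._refresh = asyncio.create_task(self._refresh_quietly())
        return self._keys
```

The trade-off is that a request arriving just after expiry is validated against the previous key set, which keeps latency flat at the cost of a short window before rotated keys are picked up.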
Authorization
- Currently: authentication only. No RBAC.
- Fine-grained RBAC and OPA policy evaluation are planned.
Data protection
- All source analysis happens locally on the host
- No repository data leaves the infrastructure
- Embeddings and dense summaries generated by local llama.cpp servers
- Prod: TLS handled upstream by home-stack NPM (Let's Encrypt)
Scalability
Current scaling characteristics
| Component | Approach |
|---|---|
| Gateway | Stateless; can run multiple instances behind a load balancer |
| Ingestion | Single scheduler + runner; syncs processed sequentially per source |
| Graph | Stateless; horizontally scalable |
| PostgreSQL | Vertical; read replicas possible |
Performance targets (current)
| Metric | Target | Notes |
|---|---|---|
| Graph query | < 500 ms | Depends on snapshot size + AGE query complexity |
| Sync completion | Minutes | Varies with repo size |
| Search | < 1 s | Embed query → pgvector cosine search |
| Enriched summary | 5-60 s | Dense LLM call; file + neighbors prefill dominates |
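To make the search row concrete, here is a pure-Python sketch of ranking by cosine distance, the metric pgvector computes with its `<=>` operator. In the service this ranking would run inside PostgreSQL; the `top_k` helper and the in-memory `rows` mapping are illustrative assumptions.

```python
import math
from typing import Mapping, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def top_k(query: Sequence[float], rows: Mapping[str, Sequence[float]], k: int = 10) -> list[str]:
    """Return the ids of the k embeddings closest to the query."""
    scored = sorted((cosine_distance(query, vec), file_id) for file_id, vec in rows.items())
    return [file_id for _, file_id in scored[:k]]
```

A linear scan like this is fine for illustration; at scale the database index does the equivalent work without loading every vector.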
Monitoring and observability
Structured logging
All services emit JSON via structlog:
{
"timestamp": "2026-04-20T14:23:01Z",
"level": "info",
"service": "graph",
"event": "snapshot_query",
"sync_count": 2,
"node_count": 150,
"duration_ms": 45
}
4xx responses log at info; 5xx at error.
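The level rule can be expressed as a tiny helper. This sketch uses only the standard library rather than structlog, and it assumes that responses below 400 also log at info, which the document does not state; `level_for_status` and `render_event` are hypothetical names.

```python
import json
from datetime import datetime, timezone


def level_for_status(status_code: int) -> str:
    """5xx logs at error; everything else (including 4xx) logs at info."""
    return "error" if status_code >= 500 else "info"


def render_event(service: str, event: str, status_code: int, **fields: object) -> str:
    """Render one JSON log line in the shape shown above."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level_for_status(status_code),
        "service": service,
        "event": event,
        **fields,
    }
    return json.dumps(record)
```

Keeping 4xx at info avoids paging on client mistakes while still leaving a searchable record.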
Health checks
Every service exposes GET /health → {"status":"ok"}.
make doctor
Probes each layer (Postgres, AGE, Keycloak, pgadmin, LLM endpoints, and each service's /health endpoint) and prints PASS/FAIL per probe. The current set contains 15 probes.
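A probe runner of this kind can be sketched in a few lines. The probe names and the report format below are assumptions for illustration; the actual `make doctor` target and its output are not reproduced here.

```python
from typing import Callable


def run_probes(probes: dict[str, Callable[[], bool]]) -> list[tuple[str, bool]]:
    """Run each probe; an exception counts as a failure rather than aborting the run."""
    results = []
    for name, probe in probes.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        results.append((name, ok))
    return results


def format_report(results: list[tuple[str, bool]]) -> str:
    """Render one PASS/FAIL line per probe plus a summary line."""
    lines = [f"{'PASS' if ok else 'FAIL'}  {name}" for name, ok in results]
    passed = sum(ok for _, ok in results)
    lines.append(f"{passed}/{len(results)} probes passed")
    return "\n".join(lines)
```

Treating exceptions as failures keeps one unreachable dependency from hiding the status of the remaining layers.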