RAG Architecture
Hybrid retrieval, vector storage, conversation memory, citations, and rerank — the whole RAG pipeline in @inferagraph/core 0.8.0.
Overview
Core 0.8.0 ships a complete RAG pipeline as library API. Hosts wire three pieces — an EmbeddingStore, an InferredEdgeStore, and (optionally) a ConversationStore — and the rest is automatic: indexing via GraphIndexer, hybrid retrieval, cross-encoder rerank, multi-turn memory with pronoun resolution, citation rendering hooks, and a first-class diagnostic surface for ops visibility.
- Library-grade — every consumer of @inferagraph/core gets the same RAG infrastructure. No host-specific orchestration in the library.
- Provider-agnostic — EmbeddingStore, InferredEdgeStore, and ConversationStore are interfaces. In-memory defaults ship in core; production impls ship in datasource / cache packages.
- Domain-blind — content keys, embedding model, dimensions, distance function are all configuration. The library has zero knowledge of your domain schema.
Embedding the graph: GraphIndexer
The GraphIndexer walks the in-memory GraphStore, computes a content-priority text per node via embeddingText(node, { contentKeys }), hashes it, and persists vectors via the configured EmbeddingStore. Idempotent — the indexer skips nodes whose embeddingHash matches the stored value, so re-running is free when source data hasn't changed.
```typescript
import { GraphIndexer } from '@inferagraph/core';
import {
  cosmosVectorEmbeddingStore,
  cosmosInferredEdgeStore,
} from '@inferagraph/cosmosdb';

// One call after data load handles the whole RAG pipeline.
// Storage factories own SDK construction: pass endpoint + key directly.
const indexer = new GraphIndexer({
  store: graphStore, // loaded from any DataAdapter
  provider: llmProvider, // any LLMProvider with embed()
  embeddingStore: cosmosVectorEmbeddingStore({
    endpoint: process.env.COSMOS_ENDPOINT,
    key: process.env.COSMOS_KEY,
    database: 'inferagraph',
    container: 'units',
  }),
  inferredEdgeStore: cosmosInferredEdgeStore({
    endpoint: process.env.COSMOS_ENDPOINT,
    key: process.env.COSMOS_KEY,
    database: 'inferagraph',
  }),
  contentKeys: ['content'], // drives the embedding body
  embeddingModel: 'text-embedding-3-large',
  embeddingDimensions: 3072,
});

await indexer.embedAll({
  onProgress: (stage, done, total) => console.log(stage, done, total),
});
await indexer.computeInferredEdges();
await indexer.reconcile(); // optional: drops orphans
```
embedAll(opts?)
Walks every node and sends only the changed texts to the provider, in batches of 16, to amortize round-trips. The optional onProgress(stage, done, total) callback reports progress for UI or logs.
computeInferredEdges(opts?)
For each pair of nodes whose cosine similarity exceeds inferredEdgeThreshold (default 0.75) and that have no explicit edge, the indexer generates a description via provider.complete(), embeds it, and persists it. A per-node cap (default 5) bounds how many inferred edges each node can receive. inferredEdgeBatchSize (default 1, 0.9.0+) bundles K candidate pairs into one LLM completion that returns K descriptions; the indexer falls back to per-pair calls if the batch JSON is malformed. Raising it trades a slightly larger parsing surface for fewer LLM round-trips on large graphs.
recomputeInferredEdgesFor(nodeId)
Targeted re-run for one node — useful after a single document edit so you don't reindex the whole corpus.
reconcile()
Drops embeddings whose source nodes no longer exist and inferred edges whose source / target nodes were removed. Run after upstream deletions.
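The candidate selection described for computeInferredEdges can be pictured with a small sketch. It is illustrative only: `cosine`, `candidatePairs`, and the `Candidate` shape are hypothetical names, not exports of @inferagraph/core, and the greedy way the per-node cap is applied here (strongest pairs first) is an assumption about one reasonable policy.

```typescript
// Hypothetical sketch: none of these names come from @inferagraph/core.
type Vec = number[];

interface Candidate {
  source: string;
  target: string;
  score: number;
}

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

function candidatePairs(
  vectors: Record<string, Vec>,
  explicitEdges: Array<[string, string]>,
  threshold = 0.75, // mirrors the documented inferredEdgeThreshold default
  perNodeCap = 5, // mirrors the documented per-node cap default
): Candidate[] {
  const ids = Object.keys(vectors).sort();
  // Order-independent key so (a, b) and (b, a) count as the same edge.
  const key = (a: string, b: string) => JSON.stringify(a < b ? [a, b] : [b, a]);
  const explicit: Record<string, boolean> = {};
  explicitEdges.forEach(([a, b]) => {
    explicit[key(a, b)] = true;
  });

  const all: Candidate[] = [];
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      if (explicit[key(ids[i], ids[j])]) continue; // already explicitly linked
      const score = cosine(vectors[ids[i]], vectors[ids[j]]);
      if (score > threshold) all.push({ source: ids[i], target: ids[j], score });
    }
  }

  // Assumed policy: keep the strongest pairs first, then enforce the cap.
  all.sort((x, y) => y.score - x.score);
  const used: Record<string, number> = {};
  return all.filter((c) => {
    const s = used[c.source] || 0;
    const t = used[c.target] || 0;
    if (s >= perNodeCap || t >= perNodeCap) return false;
    used[c.source] = s + 1;
    used[c.target] = t + 1;
    return true;
  });
}
```

The all-pairs loop is quadratic in the node count, which is fine for a sketch but is the kind of cost a vector-native store would normally avoid with an index.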
contentKeys defaults to ['content', 'description', 'body', 'summary'] — the indexer inlines the first matching attribute as the embedding's body, with the rest of the node's attributes as a header. Override per host (for example, ['content'] only) to keep embeddings tight.
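A minimal sketch of how a content-priority text and a hash-based skip could fit together, assuming string-valued attributes. `sketchEmbeddingText`, `fnv1a`, and `needsEmbedding` are hypothetical helpers; the library's own embeddingText() may lay out the header and body differently, and FNV-1a is only a stand-in for whatever hash the indexer actually uses.

```typescript
// Hypothetical sketch; not the library's embeddingText() implementation.
type Attrs = Record<string, string>;

const DEFAULT_KEYS = ['content', 'description', 'body', 'summary'];

function sketchEmbeddingText(attrs: Attrs, contentKeys: string[] = DEFAULT_KEYS): string {
  // The first content key with a non-empty value becomes the body.
  let bodyKey: string | undefined;
  for (const k of contentKeys) {
    if (attrs[k]) {
      bodyKey = k;
      break;
    }
  }
  // Every other attribute goes into a "key: value" header, sorted for stability.
  const header = Object.keys(attrs)
    .filter((k) => k !== bodyKey)
    .sort()
    .map((k) => `${k}: ${attrs[k]}`)
    .join('\n');
  const body = bodyKey ? attrs[bodyKey] : '';
  return header && body ? `${header}\n\n${body}` : header || body;
}

// 32-bit FNV-1a, used here only as a dependency-free stand-in hash.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

// A node needs (re-)embedding when its text hash differs from the stored one.
function needsEmbedding(text: string, storedHash: string | undefined): boolean {
  return fnv1a(text) !== storedHash;
}
```

Sorting the header keys keeps the text, and therefore the hash, stable across runs even if attribute order changes, which is what makes a re-run cheap when nothing has changed.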
Storage abstractions
Three interfaces, each with an in-memory default in core and a production impl in a separate package. Swap implementations without touching engine code.
EmbeddingStore
lookup + set by (nodeId, model, modelVersion, hash); optional searchVector(query, top, container?) for vector-native stores.
Default: inMemoryEmbeddingStore() in core.
Production: cosmosVectorEmbeddingStore in @inferagraph/cosmosdb, sqlVectorEmbeddingStore in @inferagraph/sql, or redisVectorEmbeddingStore in @inferagraph/redis.
InferredEdgeStore
set / get / delete + searchInferredEdges(query, top) for retrieval-time matching of inferred relationships.
Default: in-memory impl in core.
Production: cosmosInferredEdgeStore in @inferagraph/cosmosdb, sqlInferredEdgeStore in @inferagraph/sql, or redisInferredEdgeStore in @inferagraph/redis.
ConversationStore
getTurns(id, limit) / appendTurn(id, turn) / clear(id). Each turn carries retrievedNodeIds[] for pronoun resolution.
Default: inMemoryConversationStore() in core.
Production: redisConversationStore in @inferagraph/redis, cosmosConversationStore in @inferagraph/cosmosdb, or sqlConversationStore in @inferagraph/sql.
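To make the interface shape concrete, here is a rough in-memory sketch of the ConversationStore contract as described above. It is not the library's inMemoryConversationStore(): the `Turn` fields other than retrievedNodeIds are assumptions, and the methods are synchronous for brevity, whereas a store backed by Redis, Cosmos, or SQL would need to be asynchronous.

```typescript
// Hypothetical shapes; only retrievedNodeIds is named in the docs above.
interface Turn {
  role: 'user' | 'assistant';
  text: string;
  retrievedNodeIds: string[];
}

interface SketchConversationStore {
  getTurns(id: string, limit: number): Turn[];
  appendTurn(id: string, turn: Turn): void;
  clear(id: string): void;
}

function sketchInMemoryConversationStore(): SketchConversationStore {
  const turns = new Map<string, Turn[]>();
  return {
    // Returns the most recent `limit` turns, oldest first.
    getTurns(id, limit) {
      const all = turns.get(id) || [];
      return limit > 0 ? all.slice(-limit) : [];
    },
    appendTurn(id, turn) {
      const all = turns.get(id) || [];
      all.push(turn);
      turns.set(id, all);
    },
    clear(id) {
      turns.delete(id);
    },
  };
}
```

Because the engine only depends on these three methods, swapping the backing store does not touch retrieval or prompt-building code.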
Hybrid retrieval + rerank
At chat time the engine runs three retrievers in parallel and merges them with a weighted-sum ranker. Top-N candidates then go through a cross-encoder rerank pass before reaching the LLM.
```typescript
import { AIEngine } from '@inferagraph/core';

const engine = new AIEngine({
  store: graphStore,
  provider: llmProvider,
  embeddingStore, // same instance the indexer wrote to
  inferredEdgeStore,
  embeddingContentKeys: ['content'],
  chatRerankEnabled: true,
  chatRerankCandidates: 20, // hybrid retrieval top-N
  chatRerankTopK: 8, // kept after rerank
  priorTurnLimit: 8, // multi-turn memory window
});

for await (const ev of engine.chat('Tell me about Cain', {
  conversationId: 'session-1',
})) {
  if (ev.type === 'text') process.stdout.write(ev.delta);
  if (ev.type === 'debug') console.debug('[chat]', ev.phase, ev.counters);
}
```
Semantic
embeddingStore.searchVector(queryEmbedding, top). Cosine-similarity ranking against pre-computed node embeddings. Weight 0.6 in the merged score.
Keyword
The built-in SearchEngine matches the query against node attributes (title, tags, etc.). Weight 0.3.
Graph expansion
1-hop neighbors of the top semantic seeds. Captures relational follow-ups ("who lived with Eve?") that pure vector search misses. Weight 0.3.
The debug stream reports merge sizes (vector-search) and rerank counters (rerank) so hosts can render badges and ops dashboards.
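The weighted-sum merge can be sketched as below, using the three weights stated above. `mergeWeighted` and `Scored` are hypothetical names, and the sketch assumes each retriever's scores are already normalized to a comparable range (for example 0 to 1); how the engine normalizes scores is not specified here.

```typescript
// Hypothetical sketch of a weighted-sum ranker; not the engine's internal code.
interface Scored {
  id: string;
  score: number;
}

// Weights as documented for the semantic, keyword, and graph-expansion retrievers.
const WEIGHTS = { semantic: 0.6, keyword: 0.3, graph: 0.3 };

function mergeWeighted(
  lists: { semantic: Scored[]; keyword: Scored[]; graph: Scored[] },
  topN: number,
): Scored[] {
  const totals = new Map<string, number>();
  (Object.keys(WEIGHTS) as Array<keyof typeof WEIGHTS>).forEach((name) => {
    lists[name].forEach(({ id, score }) => {
      // A node found by several retrievers accumulates each weighted score.
      totals.set(id, (totals.get(id) || 0) + WEIGHTS[name] * score);
    });
  });

  const merged: Scored[] = [];
  totals.forEach((score, id) => merged.push({ id, score }));
  // Highest merged score first; ties broken by id for a stable order.
  merged.sort((a, b) => b.score - a.score || (a.id < b.id ? -1 : 1));
  return merged.slice(0, topN);
}
```

With chatRerankCandidates at its default, `topN` would be 20, and the rerank pass would then cut that list down to chatRerankTopK entries.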
Citations
The engine handles citations server-side as of @inferagraph/core@^0.12.0. The model writes naturally; after the stream completes, the engine scans the assistant text against every node in the store and rewrites each title occurrence into [[token|matched-text]]. The host renders the answer through <ChatText>, supplying a renderCitation(token, matchedText) callback to wire each citation to a clickable link. The library is URL-agnostic — the host owns slug resolution and routing.
```tsx
// 0.12.0: the engine injects citations server-side as [[token|matched-text]].
// Use <ChatText> (library, @inferagraph/core/react) to render the answer; it
// parses the wire and dispatches each citation through renderCitation(token,
// matchedText). The host owns slug resolution and routing.
import { ChatText } from '@inferagraph/core/react';

<ChatText
  text={message.text}
  renderCitation={(token, matchedText) => {
    const resolved = slugResolver.resolve(token);
    if (!resolved) return <span className="warn">{matchedText}</span>; // unknown citation
    return <a href={`/${resolved.type}/${resolved.slug}`}>{matchedText}</a>; // model casing wins
  }}
/>
```
Filter the token through your slug resolver so a hallucinated token renders the matched text as plain prose plus an "unknown citation" warning rather than a broken link. Citations resolve against the whole store, not just the per-turn rerank top-K: entities outside that turn's relevant set still cite when their title appears in the response.
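For hosts that cannot use the React component, the [[token|matched-text]] wire format is simple enough to split by hand. The following is a sketch under stated assumptions, not the parser <ChatText> uses: `parseCitations` and `Segment` are hypothetical names, and the pattern assumes tokens contain no `|` or square brackets and matched text contains no square brackets.

```typescript
// Hypothetical parser for the [[token|matched-text]] wire format.
type Segment =
  | { kind: 'text'; text: string }
  | { kind: 'citation'; token: string; matchedText: string };

function parseCitations(wire: string): Segment[] {
  // Assumption: no '|' or brackets inside the token, no brackets in the matched text.
  const pattern = /\[\[([^\[\]|]+)\|([^\[\]]+)\]\]/g;
  const out: Segment[] = [];
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(wire)) !== null) {
    // Plain prose between the previous citation and this one.
    if (m.index > last) out.push({ kind: 'text', text: wire.slice(last, m.index) });
    out.push({ kind: 'citation', token: m[1], matchedText: m[2] });
    last = m.index + m[0].length;
  }
  if (last < wire.length) out.push({ kind: 'text', text: wire.slice(last) });
  return out;
}
```

Each citation segment can then go through the same slug-resolver check as in the renderCitation callback, falling back to the matched text when the token is unknown.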
Slug-routed hosts: citationKey (0.12.0)
On hosts whose node.id is a UUID but whose URLs use a slug, set AIEngineConfig.citationKey: 'slug' (or whichever attribute holds the citation token). The engine emits [[slug|matched-text]] tokens after the model stream completes; highlight() / focus() tool calls still take the canonical id. Full pattern in Chat API → citationKey.
Hallucinated ids: onToolCallOutcome (0.9.3+)
When the model emits a highlight() or focus() tool call referencing an id the renderer hasn't seen, SceneController now returns {appliedIds, unknownIds} and the chat hook flows the partition into onToolCallOutcome. Render an "N highlighted, M unknown" chip beside the answer to surface dropped ids — see Visualization → Diagnostics.
Conversation memory
Pass conversationId on each chat() call. The engine fetches the prior priorTurnLimit turns from the configured ConversationStore, builds the messages array as [system, ...priorTurns, user], and after the stream closes appends the user turn + assistant turn. Each turn records retrievedNodeIds[] so a follow-up "tell me more about him" can resolve "him" against the prior turn's retrieved set without re-running an ambiguous retrieval.
```typescript
// In-memory store ships in core for dev / tests / single-process.
import { inMemoryConversationStore } from '@inferagraph/core/data';
engine.setConversationStore(inMemoryConversationStore());

// Production: Redis-backed store survives restarts and is shareable across
// processes. Same ConversationStore interface, so it is a one-line swap. Cosmos and
// SQL equivalents (cosmosConversationStore, sqlConversationStore) ship in
// @inferagraph/cosmosdb and @inferagraph/sql respectively.
import { redisConversationStore } from '@inferagraph/redis';
engine.setConversationStore(redisConversationStore({
  url: process.env.REDIS_URL,
  keyPrefix: 'inferagraph:conversation', // default
  ttlSeconds: 86_400, // 24h, refreshed on every appendTurn
}));
```
See Chat API → Multi-turn memory for the request-side wiring and Caching for the Redis impl's storage layout, TTL, and defensive parsing.
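The message assembly and the pronoun fallback described in this section can be illustrated roughly as follows. `buildMessages`, `pronounCandidates`, and both interfaces are hypothetical; the sketch also assumes priorTurnLimit counts individual turns rather than user/assistant pairs, which the text above does not state.

```typescript
// Hypothetical sketch of prompt assembly; not the engine's internal code.
interface StoredTurn {
  role: 'user' | 'assistant';
  text: string;
  retrievedNodeIds: string[];
}

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Builds [system, ...priorTurns, user] from the most recent stored turns.
function buildMessages(
  system: string,
  priorTurns: StoredTurn[],
  userText: string,
  priorTurnLimit = 8, // assumed to count individual turns
): Message[] {
  const recent = priorTurnLimit > 0 ? priorTurns.slice(-priorTurnLimit) : [];
  return [
    { role: 'system', content: system },
    ...recent.map((t) => ({ role: t.role, content: t.text })),
    { role: 'user', content: userText },
  ];
}

// Node ids a follow-up pronoun could refer to: the retrieved set of the most
// recent turn that recorded any.
function pronounCandidates(priorTurns: StoredTurn[]): string[] {
  for (let i = priorTurns.length - 1; i >= 0; i--) {
    if (priorTurns[i].retrievedNodeIds.length > 0) return priorTurns[i].retrievedNodeIds;
  }
  return [];
}
```

For a follow-up such as "tell me more about him", the candidate ids from the prior turn narrow the search to entities already in play instead of running a fresh retrieval on an ambiguous query.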
Diagnostic surface
Every phase of the RAG pipeline emits a debug ChatEvent the host can render. Phases include vector-search, rerank, pronoun-resolve, retrieval-empty, substitution-fired, engine-empty, and warmup-failed (0.9.0+, fires on the next chat() call after a background warmup error). Use the onDiagnostic callback prop on <InferaGraph> to surface them as collapsible badges beneath chat bubbles.
Full event shape and an example consumer rendering badges live in Visualization → Diagnostics.
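As a rough illustration of turning a debug event into a badge label, consider the sketch below. It assumes only the fields visible in the chat loop earlier on this page (type, phase, counters); `DebugEvent`, `badgeLabel`, and the counter names in the usage note are hypothetical, and the full event shape lives in Visualization → Diagnostics.

```typescript
// Hypothetical minimal event shape; the real ChatEvent may carry more fields.
interface DebugEvent {
  type: 'debug';
  phase: string;
  counters?: Record<string, number>;
}

// Formats "phase (k1=v1, k2=v2)", or just the phase when there are no counters.
function badgeLabel(ev: DebugEvent): string {
  const counters = ev.counters || {};
  const parts = Object.keys(counters)
    .sort()
    .map((k) => `${k}=${counters[k]}`);
  return parts.length > 0 ? `${ev.phase} (${parts.join(', ')})` : ev.phase;
}
```

For example, a rerank event with hypothetical counters `{ candidates: 20, kept: 8 }` would produce the label `rerank (candidates=20, kept=8)`.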
Provider configuration
The new RAG knobs on AIEngineConfig — all additive, all with sensible defaults.
| Option | Default | When to tune |
|---|---|---|
| embeddingContentKeys | ['content', 'description', 'body', 'summary'] | Set to ['content'] when only one attribute holds the document body and the rest are short metadata. |
| chatRerankEnabled | true | Disable on slow networks where the extra N parallel completions add too much latency. |
| chatRerankCandidates | 20 | Top-N from hybrid retrieval that get re-ranked. Increase for higher recall on broad queries. |
| chatRerankTopK | 8 | Top-K kept after rerank, fed to the chat-completion call. Balance prompt size vs. recall. |
| priorTurnLimit | 8 | Multi-turn memory window. Higher values increase prompt cost; lower values lose context faster. |
For per-package setup details (vector index policy, container provisioning, Redis layout) see Datasources → CosmosDB and Caching.