RAG Architecture

Hybrid retrieval, vector storage, conversation memory, citations, and rerank — the whole RAG pipeline in @inferagraph/core 0.8.0.

Overview

Core 0.8.0 ships a complete RAG pipeline as library API. Hosts wire three pieces — an EmbeddingStore, an InferredEdgeStore, and (optionally) a ConversationStore — and the rest is automatic: indexing via GraphIndexer, hybrid retrieval, cross-encoder rerank, multi-turn memory with pronoun resolution, citation rendering hooks, and a first-class diagnostic surface for ops visibility.

  • Library-grade — every consumer of @inferagraph/core gets the same RAG infrastructure. No host-specific orchestration in the library.
  • Provider-agnostic — EmbeddingStore, InferredEdgeStore, and ConversationStore are interfaces. In-memory defaults ship in core; production implementations ship in datasource / cache packages.
  • Domain-blind — content keys, embedding model, dimensions, distance function are all configuration. The library has zero knowledge of your domain schema.

Embedding the graph: GraphIndexer

The GraphIndexer walks the in-memory GraphStore, computes a content-priority text per node via embeddingText(node, { contentKeys }), hashes it, and persists vectors via the configured EmbeddingStore. Idempotent — the indexer skips nodes whose embeddingHash matches the stored value, so re-running is free when source data hasn't changed.

import { GraphIndexer } from '@inferagraph/core';
import {
  cosmosVectorEmbeddingStore,
  cosmosInferredEdgeStore,
} from '@inferagraph/cosmosdb';

// One indexer, constructed after data load, drives the whole indexing pipeline.
// Storage factories own SDK construction — pass endpoint + key directly.
const indexer = new GraphIndexer({
  store: graphStore,                          // loaded from any DataAdapter
  provider: llmProvider,                       // any LLMProvider with embed()
  embeddingStore: cosmosVectorEmbeddingStore({
    endpoint: process.env.COSMOS_ENDPOINT,
    key: process.env.COSMOS_KEY,
    database: 'inferagraph',
    container: 'units',
  }),
  inferredEdgeStore: cosmosInferredEdgeStore({
    endpoint: process.env.COSMOS_ENDPOINT,
    key: process.env.COSMOS_KEY,
    database: 'inferagraph',
  }),
  contentKeys: ['content'],                   // drives the embedding body
  embeddingModel: 'text-embedding-3-large',
  embeddingDimensions: 3072,
});

await indexer.embedAll({
  onProgress: (stage, done, total) => console.log(stage, done, total),
});
await indexer.computeInferredEdges();
await indexer.reconcile();   // optional — drops orphans

embedAll(opts?)

Walks every node and batches the changed texts in chunks of 16 to amortize provider round-trips. Accepts an optional onProgress(stage, done, total) callback for UI updates or logs.
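The skip-then-batch behavior can be sketched in isolation. This is a minimal illustration, not the indexer's real code: the NodeText shape, the selectChanged and chunk helpers, and the FNV-1a style hash are all assumptions made for the example, and the library may hash and batch differently.

```typescript
// Hypothetical shape for illustration; not the library's actual type.
type NodeText = { nodeId: string; text: string };

// FNV-1a style 32-bit hash, standing in for whatever hash the indexer really uses.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ('00000000' + h.toString(16)).slice(-8);
}

// Keep only nodes whose current hash differs from the stored one.
function selectChanged(nodes: NodeText[], storedHashes: Map<string, string>): NodeText[] {
  return nodes.filter((n) => storedHashes.get(n.nodeId) !== fnv1a(n.text));
}

// Split the changed set into fixed-size batches (16, per the description above).
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError('size must be >= 1');
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

const nodes: NodeText[] = Array.from({ length: 40 }, (_, i) => ({
  nodeId: `n${i}`,
  text: `body of node ${i}`,
}));

// Pretend the first 5 nodes were embedded on a previous run.
const stored = new Map<string, string>(
  nodes.slice(0, 5).map((n): [string, string] => [n.nodeId, fnv1a(n.text)]),
);

const batches = chunk(selectChanged(nodes, stored), 16);
console.log(batches.map((b) => b.length)); // 35 changed nodes split as 16, 16, 3
```

With every hash already stored, selectChanged returns an empty list and no provider calls are needed, which is the sense in which a re-run is free.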

computeInferredEdges(opts?)

For each pair of nodes whose pairwise cosine similarity exceeds inferredEdgeThreshold (default 0.75) and that have no explicit edge, generates a description via provider.complete(), embeds it, and persists it. A per-node cap (default 5) bounds the number of inferred edges per node. inferredEdgeBatchSize (default 1, 0.9.0+) bundles K candidate pairs into one LLM completion that returns K descriptions; the indexer falls back to per-pair calls if the batch JSON is malformed. Batching trades a slightly larger parsing surface for fewer LLM round-trips on large graphs.
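The candidate-selection step (before any LLM call) might look roughly like the sketch below. The Embedded and CandidatePair types, the pairKey helper, and the greedy strongest-first way of applying the per-node cap are assumptions for illustration; the source does not say how the indexer orders or caps pairs.

```typescript
// Hypothetical shapes; vectors would come from the EmbeddingStore in practice.
type Embedded = { id: string; vector: number[] };
type CandidatePair = { a: string; b: string; similarity: number };

function cosine(u: number[], v: number[]): number {
  let dot = 0;
  let nu = 0;
  let nv = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    nu += u[i] * u[i];
    nv += v[i] * v[i];
  }
  return nu === 0 || nv === 0 ? 0 : dot / Math.sqrt(nu * nv);
}

// Order-independent key for an undirected pair.
const pairKey = (a: string, b: string): string => (a < b ? `${a}|${b}` : `${b}|${a}`);

function inferredEdgeCandidates(
  items: Embedded[],
  explicitEdges: Set<string>, // keys built with pairKey()
  threshold = 0.75,
  perNodeCap = 5,
): CandidatePair[] {
  const pairs: CandidatePair[] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const similarity = cosine(items[i].vector, items[j].vector);
      const hasExplicit = explicitEdges.has(pairKey(items[i].id, items[j].id));
      if (similarity > threshold && !hasExplicit) {
        pairs.push({ a: items[i].id, b: items[j].id, similarity });
      }
    }
  }
  // Assumption: strongest pairs first, then enforce the per-node cap greedily.
  pairs.sort((x, y) => y.similarity - x.similarity);
  const used = new Map<string, number>();
  return pairs.filter((p) => {
    const ca = used.get(p.a) ?? 0;
    const cb = used.get(p.b) ?? 0;
    if (ca >= perNodeCap || cb >= perNodeCap) return false;
    used.set(p.a, ca + 1);
    used.set(p.b, cb + 1);
    return true;
  });
}

const sample: Embedded[] = [
  { id: 'a', vector: [1, 0] },
  { id: 'b', vector: [0.9, 0.1] },
  { id: 'c', vector: [0, 1] },
  { id: 'd', vector: [1, 0.05] },
];
// a and b already share an explicit edge, so only a-d and b-d qualify.
const found = inferredEdgeCandidates(sample, new Set([pairKey('a', 'b')]));
console.log(found.map((p) => `${p.a}-${p.b}`));
```

The all-pairs loop is quadratic in node count, so a vector-native store would presumably replace it with a nearest-neighbor query per node on large graphs.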

recomputeInferredEdgesFor(nodeId)

Targeted re-run for one node — useful after a single document edit so you don't reindex the whole corpus.

reconcile()

Drops embeddings whose source nodes no longer exist and inferred edges whose source / target nodes were removed. Run after upstream deletions.
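As a rough picture of what reconciliation has to decide, the sketch below computes which stored records are orphaned given the set of live node ids. The findOrphans helper and the InferredEdge shape are hypothetical; the real method also performs the deletions against the configured stores.

```typescript
// Hypothetical edge shape for illustration.
type InferredEdge = { source: string; target: string };

function findOrphans(
  liveNodeIds: Set<string>,
  embeddedNodeIds: string[],
  inferredEdges: InferredEdge[],
): { orphanEmbeddings: string[]; orphanEdges: InferredEdge[] } {
  return {
    // Embeddings whose source node no longer exists.
    orphanEmbeddings: embeddedNodeIds.filter((id) => !liveNodeIds.has(id)),
    // Inferred edges with at least one missing endpoint.
    orphanEdges: inferredEdges.filter(
      (e) => !liveNodeIds.has(e.source) || !liveNodeIds.has(e.target),
    ),
  };
}

const live = new Set(['adam', 'eve']);
const report = findOrphans(
  live,
  ['adam', 'eve', 'abel'],
  [
    { source: 'adam', target: 'eve' },
    { source: 'eve', target: 'abel' },
  ],
);
console.log(report.orphanEmbeddings, report.orphanEdges.length);
```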

contentKeys defaults to ['content', 'description', 'body', 'summary'] — the indexer inlines the first matching attribute as the embedding's body, with the rest of the node's attributes as a header. Override per host (for example, ['content'] only) to keep embeddings tight.
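To make the body-plus-header layout concrete, here is one way the text assembly could work. This is a sketch under assumptions: sketchEmbeddingText, the GraphNode shape, and the "key: value" header format are invented for the example, and the library's embeddingText may format its output differently.

```typescript
// Hypothetical node shape; the real GraphStore node type may differ.
type GraphNode = { id: string; attributes: Record<string, unknown> };

const DEFAULT_CONTENT_KEYS = ['content', 'description', 'body', 'summary'];

function sketchEmbeddingText(
  node: GraphNode,
  contentKeys: string[] = DEFAULT_CONTENT_KEYS,
): string {
  // The first content key holding a non-empty string becomes the body.
  const bodyKey = contentKeys.find((k) => {
    const value = node.attributes[k];
    return typeof value === 'string' && value !== '';
  });
  // Every other attribute is rendered as a "key: value" header line (assumed format).
  const header = Object.keys(node.attributes)
    .filter((k) => k !== bodyKey)
    .map((k) => `${k}: ${String(node.attributes[k])}`)
    .join('\n');
  const body = bodyKey !== undefined ? String(node.attributes[bodyKey]) : '';
  return [header, body].filter((s) => s.length > 0).join('\n\n');
}

const cain: GraphNode = {
  id: 'cain',
  attributes: { title: 'Cain', type: 'person', content: 'Firstborn son of Adam and Eve.' },
};
console.log(sketchEmbeddingText(cain, ['content']));
```

Because the hash is computed over this text, changing contentKeys changes the text and would therefore cause affected nodes to be re-embedded on the next run.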

Storage abstractions

Three interfaces, each with an in-memory default in core and a production impl in a separate package. Swap implementations without touching engine code.

EmbeddingStore

lookup + set by (nodeId, model, modelVersion, hash); optional searchVector(query, top, container?) for vector-native stores.

Default: inMemoryEmbeddingStore() in core.
Production: cosmosVectorEmbeddingStore in @inferagraph/cosmosdb, sqlVectorEmbeddingStore in @inferagraph/sql, or redisVectorEmbeddingStore in @inferagraph/redis.

InferredEdgeStore

set / get / delete + searchInferredEdges(query, top) for retrieval-time matching of inferred relationships.

Default: in-memory impl in core.
Production: cosmosInferredEdgeStore in @inferagraph/cosmosdb, sqlInferredEdgeStore in @inferagraph/sql, or redisInferredEdgeStore in @inferagraph/redis.

ConversationStore

getTurns(id, limit) / appendTurn(id, turn) / clear(id). Each turn carries retrievedNodeIds[] for pronoun resolution.

Default: inMemoryConversationStore() in core.
Production: redisConversationStore in @inferagraph/redis, cosmosConversationStore in @inferagraph/cosmosdb, or sqlConversationStore in @inferagraph/sql.
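A Map-backed store with the three methods named above can be sketched in a few lines. The Turn shape beyond retrievedNodeIds, the class name, and the synchronous signatures are assumptions for brevity; a real implementation backed by Redis, Cosmos, or SQL would presumably return promises.

```typescript
// Hypothetical turn shape; only retrievedNodeIds is named in the text above.
type Turn = { role: 'user' | 'assistant'; text: string; retrievedNodeIds: string[] };

class MapConversationStore {
  private readonly turns = new Map<string, Turn[]>();

  // Return the most recent `limit` turns, oldest first.
  getTurns(id: string, limit: number): Turn[] {
    const all = this.turns.get(id) ?? [];
    return limit > 0 ? all.slice(-limit) : [];
  }

  appendTurn(id: string, turn: Turn): void {
    const all = this.turns.get(id) ?? [];
    all.push(turn);
    this.turns.set(id, all);
  }

  clear(id: string): void {
    this.turns.delete(id);
  }
}

const store = new MapConversationStore();
for (let i = 0; i < 3; i++) {
  store.appendTurn('session-1', { role: 'user', text: `turn ${i}`, retrievedNodeIds: [] });
}
console.log(store.getTurns('session-1', 2).map((t) => t.text)); // the two latest turns
```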

Hybrid retrieval + rerank

At chat time the engine runs three retrievers in parallel and merges them with a weighted-sum ranker. Top-N candidates then go through a cross-encoder rerank pass before reaching the LLM.

import { AIEngine } from '@inferagraph/core';

const engine = new AIEngine({
  store: graphStore,
  provider: llmProvider,
  embeddingStore,                              // same instance the indexer wrote to
  inferredEdgeStore,
  embeddingContentKeys: ['content'],
  chatRerankEnabled: true,
  chatRerankCandidates: 20,                  // hybrid retrieval top-N
  chatRerankTopK: 8,                          // kept after rerank
  priorTurnLimit: 8,                          // multi-turn memory window
});

for await (const ev of engine.chat('Tell me about Cain', {
  conversationId: 'session-1',
})) {
  if (ev.type === 'text') process.stdout.write(ev.delta);
  if (ev.type === 'debug') console.debug('[chat]', ev.phase, ev.counters);
}

Semantic

embeddingStore.searchVector(queryEmbedding, top). Cosine-similarity ranking against pre-computed node embeddings. Weight 0.6 in the merged score.

Keyword

The built-in SearchEngine matches the query against node attributes (title, tags, etc.). Weight 0.3.

Graph expansion

1-hop neighbors of the top semantic seeds. Captures relational follow-ups ("who lived with Eve?") that pure vector search misses. Weight 0.3.
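The weighted-sum merge of the three retrievers can be illustrated as follows. The weights are the ones quoted above; the Scored type, the mergeWeighted helper, and the assumption that each retriever returns scores normalized to [0, 1] are illustrative only.

```typescript
// Assumption: every retriever reports scores normalized to [0, 1].
type Scored = { id: string; score: number };

const WEIGHTS = { semantic: 0.6, keyword: 0.3, graph: 0.3 } as const;
type Retriever = keyof typeof WEIGHTS;

function mergeWeighted(results: Record<Retriever, Scored[]>, topN: number): Scored[] {
  const merged = new Map<string, number>();
  for (const retriever of Object.keys(WEIGHTS) as Retriever[]) {
    for (const { id, score } of results[retriever]) {
      // A node found by several retrievers accumulates each weighted contribution.
      merged.set(id, (merged.get(id) ?? 0) + WEIGHTS[retriever] * score);
    }
  }
  return Array.from(merged.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}

const ranked = mergeWeighted(
  {
    semantic: [
      { id: 'cain', score: 0.9 },
      { id: 'abel', score: 0.7 },
    ],
    keyword: [{ id: 'cain', score: 1.0 }],
    graph: [
      { id: 'abel', score: 1.0 },
      { id: 'eve', score: 1.0 },
    ],
  },
  20, // plays the role of chatRerankCandidates
);
console.log(ranked.map((r) => r.id)); // cain first, then abel, then eve
```

In this toy input, eve is reachable only through graph expansion yet still enters the candidate list, which is the case the graph retriever exists to cover.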

The debug stream reports merge sizes (vector-search) and rerank counters (rerank) so hosts can render badges and ops dashboards.

Citations

The engine handles citations server-side as of @inferagraph/core@^0.12.0. The model writes naturally; after the stream completes, the engine scans the assistant text against every node in the store and rewrites each title occurrence into [[token|matched-text]]. The host renders the answer through <ChatText>, supplying a renderCitation(token, matchedText) callback to wire each citation to a clickable link. The library is URL-agnostic — the host owns slug resolution and routing.

// 0.12.0 — the engine injects citations server-side as [[token|matched-text]].
// Use <ChatText> (library — @inferagraph/core/react) to render the answer; it
// parses the wire and dispatches each citation through renderCitation(token,
// matchedText). The host owns slug resolution and routing.
import { ChatText } from '@inferagraph/core/react';

<ChatText
  text={message.text}
  renderCitation={(token, matchedText) => {
    const resolved = slugResolver.resolve(token);
    if (!resolved) return <span className="warn">{matchedText}</span>;     // unknown citation
    return <a href={`/${resolved.type}/${resolved.slug}`}>{matchedText}</a>; // model casing wins
  }}
/>

Filter the token through your slug resolver so a hallucinated token renders the matched text as plain prose plus an "unknown citation" warning rather than a broken link. Citations resolve against the whole store, not just the per-turn rerank top-K — entities outside that turn's relevant set still cite when their title appears in the response.
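For hosts that cannot use the React component, the [[token|matched-text]] wire format is simple enough to parse by hand. The sketch below is an assumption-laden illustration: parseCitations, renderPlain, the Segment type, and the regular expression are not part of the library, and the real parser may handle edge cases (nested brackets, escaped pipes) that this one does not.

```typescript
type Segment =
  | { kind: 'text'; value: string }
  | { kind: 'citation'; token: string; matchedText: string };

// Split engine output into plain-text and citation segments.
function parseCitations(text: string): Segment[] {
  const pattern = /\[\[([^\[\]|]+)\|([^\[\]]+)\]\]/g;
  const segments: Segment[] = [];
  let last = 0;
  for (let m = pattern.exec(text); m !== null; m = pattern.exec(text)) {
    if (m.index > last) segments.push({ kind: 'text', value: text.slice(last, m.index) });
    segments.push({ kind: 'citation', token: m[1], matchedText: m[2] });
    last = m.index + m[0].length;
  }
  if (last < text.length) segments.push({ kind: 'text', value: text.slice(last) });
  return segments;
}

// Stand-in for a host slug resolver: unknown tokens degrade to flagged plain text.
function renderPlain(text: string, knownTokens: Set<string>): string {
  return parseCitations(text)
    .map((s) => {
      if (s.kind === 'text') return s.value;
      return knownTokens.has(s.token)
        ? `<${s.token}>${s.matchedText}`
        : `${s.matchedText} (unknown citation)`;
    })
    .join('');
}

const wire = '[[cain|Cain]] was the son of [[eve|Eve]].';
console.log(renderPlain(wire, new Set(['cain'])));
```

Keeping the matched text separate from the token is what lets the rendered answer preserve the model's casing while the link target comes from the resolver.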

Slug-routed hosts: citationKey (0.12.0)

On hosts whose node.id is a UUID but whose URLs use a slug, set AIEngineConfig.citationKey: 'slug' (or whichever attribute holds the citation token). The engine emits [[slug|matched-text]] tokens after the model stream completes; highlight() / focus() tool calls still take the canonical id. Full pattern in Chat API → citationKey.

Hallucinated ids: onToolCallOutcome (0.9.3+)

When the model emits a highlight() or focus() tool call referencing an id the renderer hasn't seen, SceneController now returns {appliedIds, unknownIds} and the chat hook flows the partition into onToolCallOutcome. Render an "N highlighted, M unknown" chip beside the answer to surface dropped ids — see Visualization → Diagnostics.
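The applied/unknown partition and the chip text can be sketched like this. partitionIds and chipLabel are hypothetical helpers; only the {appliedIds, unknownIds} result shape comes from the description above.

```typescript
type ToolCallOutcome = { appliedIds: string[]; unknownIds: string[] };

// Split requested ids by whether the renderer knows them.
function partitionIds(requested: string[], knownIds: Set<string>): ToolCallOutcome {
  const appliedIds: string[] = [];
  const unknownIds: string[] = [];
  for (const id of requested) {
    (knownIds.has(id) ? appliedIds : unknownIds).push(id);
  }
  return { appliedIds, unknownIds };
}

// Text for the chip rendered beside the answer.
function chipLabel(outcome: ToolCallOutcome): string {
  return `${outcome.appliedIds.length} highlighted, ${outcome.unknownIds.length} unknown`;
}

const outcome = partitionIds(['cain', 'abel', 'lilith'], new Set(['cain', 'abel', 'eve']));
console.log(chipLabel(outcome));
```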

Conversation memory

Pass conversationId on each chat() call. The engine fetches the most recent priorTurnLimit turns from the configured ConversationStore, builds the messages array as [system, ...priorTurns, user], and appends the user and assistant turns after the stream closes. Each turn records retrievedNodeIds[], so a follow-up such as "tell me more about him" can resolve "him" against the prior turn's retrieved set without re-running an ambiguous retrieval.

// In-memory store ships in core for dev / tests / single-process.
import { inMemoryConversationStore } from '@inferagraph/core/data';
engine.setConversationStore(inMemoryConversationStore());

// Production: Redis-backed store survives restarts and is shareable across
// processes. Same ConversationStore interface — one-line swap. Cosmos and
// SQL equivalents (cosmosConversationStore, sqlConversationStore) ship in
// @inferagraph/cosmosdb and @inferagraph/sql respectively.
import { redisConversationStore } from '@inferagraph/redis';
engine.setConversationStore(redisConversationStore({
  url: process.env.REDIS_URL,
  keyPrefix: 'inferagraph:conversation', // default
  ttlSeconds: 86_400,                  // 24h, refreshed on every appendTurn
}));
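The message assembly and the pronoun fallback can be pictured with a small standalone sketch. buildMessages, pronounCandidates, and both type shapes are assumptions for illustration; the engine's internal representation is not documented here.

```typescript
// Hypothetical shapes; only retrievedNodeIds is named by the library docs.
type StoredTurn = { role: 'user' | 'assistant'; text: string; retrievedNodeIds: string[] };
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Build [system, ...priorTurns, user], keeping only the latest priorTurnLimit turns.
function buildMessages(
  system: string,
  priorTurns: StoredTurn[],
  userText: string,
  priorTurnLimit = 8,
): ChatMessage[] {
  const recent = priorTurnLimit > 0 ? priorTurns.slice(-priorTurnLimit) : [];
  return [
    { role: 'system', content: system },
    ...recent.map((t): ChatMessage => ({ role: t.role, content: t.text })),
    { role: 'user', content: userText },
  ];
}

// Candidate referents for a pronoun: ids from the latest turn that retrieved anything.
function pronounCandidates(priorTurns: StoredTurn[]): string[] {
  for (let i = priorTurns.length - 1; i >= 0; i--) {
    if (priorTurns[i].retrievedNodeIds.length > 0) return priorTurns[i].retrievedNodeIds;
  }
  return [];
}

const history: StoredTurn[] = [
  { role: 'user', text: 'Tell me about Cain', retrievedNodeIds: ['cain', 'abel'] },
  { role: 'assistant', text: 'Cain was the firstborn.', retrievedNodeIds: ['cain', 'abel'] },
];
const messages = buildMessages('Answer from the graph.', history, 'Tell me more about him');
console.log(messages.length, pronounCandidates(history));
```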

See Chat API → Multi-turn memory for the request-side wiring and Caching for the Redis impl's storage layout, TTL, and defensive parsing.

Diagnostic surface

Every phase of the RAG pipeline emits a debug ChatEvent the host can render. Phases include vector-search, rerank, pronoun-resolve, retrieval-empty, substitution-fired, engine-empty, and warmup-failed (0.9.0+, fires on the next chat() call after a background warmup error). Use the onDiagnostic callback prop on <InferaGraph> to surface them as collapsible badges beneath chat bubbles.

Full event shape and an example consumer rendering badges live in Visualization → Diagnostics.
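As a minimal consumer, the sketch below turns a stream of events into badge labels. The event fields (type, phase, counters, delta) follow the chat loop shown earlier on this page; the type names, the toBadges helper, and the counter names in the sample data are hypothetical.

```typescript
// Assumed event shapes, inferred from the fields used in the chat loop above.
type ChatDebugEvent = { type: 'debug'; phase: string; counters?: Record<string, number> };
type ChatTextEvent = { type: 'text'; delta: string };
type ChatEvent = ChatDebugEvent | ChatTextEvent;

// One label per debug event: "phase: k=v k=v", or just the phase when no counters exist.
function toBadges(events: ChatEvent[]): string[] {
  const badges: string[] = [];
  for (const ev of events) {
    if (ev.type !== 'debug') continue;
    const counters = ev.counters ?? {};
    const parts = Object.keys(counters).map((k) => `${k}=${counters[k]}`);
    badges.push(parts.length > 0 ? `${ev.phase}: ${parts.join(' ')}` : ev.phase);
  }
  return badges;
}

// Counter names below are made up for the example.
const sampleEvents: ChatEvent[] = [
  { type: 'debug', phase: 'vector-search', counters: { merged: 20 } },
  { type: 'text', delta: 'Cain' },
  { type: 'debug', phase: 'rerank', counters: { candidates: 20, kept: 8 } },
  { type: 'debug', phase: 'retrieval-empty' },
];
console.log(toBadges(sampleEvents));
```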

Provider configuration

The new RAG knobs on AIEngineConfig — all additive, all with sensible defaults.

| Option | Default | When to tune |
|---|---|---|
| embeddingContentKeys | ['content', 'description', 'body', 'summary'] | Set to ['content'] when only one attribute holds the document body and the rest are short metadata. |
| chatRerankEnabled | true | Disable on slow networks where the extra N parallel completions add too much latency. |
| chatRerankCandidates | 20 | Top-N from hybrid retrieval that get re-ranked. Increase for higher recall on broad queries. |
| chatRerankTopK | 8 | Top-K kept after rerank, fed to the chat-completion call. Balance prompt size vs. recall. |
| priorTurnLimit | 8 | Multi-turn memory window. Higher values increase prompt cost; lower values lose context faster. |

For per-package setup details (vector index policy, container provisioning, Redis layout) see Datasources → CosmosDB and Caching.