Dynamic Persona MoE RAG: Building a Memory-Driven Synthetic Intelligence with Knowledge Graphs and Persona Evolution

A comprehensive guide to building a memory-driven synthetic intelligence system that combines dynamic persona modeling, knowledge graphs, and mixture of experts RAG to create an evolving cognitive architecture.

AI · Knowledge Graphs · Persona Engineering · RAG · Memory Systems · Synthetic Intelligence · Machine Learning · LLM

Introduction

What I am building with this blog system is not a publishing pipeline in the conventional sense, but a continuously evolving synthetic intelligence that treats writing itself as a form of memory. The archive of markdown files is not merely content to be rendered, indexed, or searched. It is the long-term memory substrate of the system, a historical record of thought that can be reasoned over, recomposed, and transformed as new information arrives. Each post becomes both an artifact and a structural element in a larger cognitive system whose primary purpose is synthesis rather than retrieval.

Short-Term Memory: The Mutable Persona State

Short-term memory is implemented as a mutable persona state file that captures the current configuration of personality weights. These weights represent traits, priorities, tone, and behavioral tendencies that hold definite values at any given moment but are continuously adjusted in response to new events. Any incoming stimulus, whether user input, system signals, or external data, can trigger updates to these weights through arbitrary update functions. These functions may be heuristic, learned, or reinforcement-driven, allowing the persona to shift gradually rather than reset between interactions. In this way, short-term memory acts as a living state vector that reflects the chatbot's immediate context and recent history.

The short-term persona is grounded in a knowledge graph derived from source documents, structured data, or predefined schemas. These graphs may be generated dynamically or partially pre-constructed with constraints and parameters that define allowable structures. Instead of modifying system prompts directly, higher-level queries can be issued to adjust the persona weights themselves, effectively changing how the system interprets and composes context. Long-form documents in the knowledge base can be treated as time-indexed personas, capturing snapshots of perspective that evolve as new information arrives. This enables the system to reason not only over content, but over how its interpretive stance has changed across time.

A central goal of this architecture is the generation of new personas as first-class artifacts. As interactions accumulate and new data is ingested, the system synthesizes updated or entirely new persona configurations. These personas are persisted within a file or graph-based structure, allowing them to be recalled, compared, or analyzed longitudinally. Time series data associated with persona evolution can be surfaced to the user interface as reports, visualizations, or signals that influence other backend processes. Reinforcement learning signals and continuous analysis of incoming data streams, such as user input or RSS feeds, drive the selective strengthening, weakening, or branching of these personas.

The persona's primary operational role is to function as a lens through which the final language model inference is executed. This lens encodes the variables, constraints, and stylistic parameters that shape generation. By abstracting these variables away from any single model provider, the system can supply a consistent set of inputs to different LLMs, achieving a degree of deterministic behavior across inference engines. While outputs will never be perfectly identical, the persona ensures that the same conceptual and stylistic biases are applied regardless of provider. In more advanced implementations, this lens also includes parameters governing agentic behavior, such as planning depth, tool usage preferences, or context assembly strategies.

As architectures scale beyond simple vector-based retrieval augmented generation, the persona lens expands to include instructions for composing context from heterogeneous sources. These may include symbolic reasoning outputs, structured database queries, procedural memories, or dynamically generated subgraphs. The persona therefore not only influences generation, but actively shapes how context is assembled before inference. It becomes a coordinating structure that mediates between memory, reasoning, and language.

Long-Term Memory: The Temporal Knowledge Graph

Long-term memory is realized through the continuous ingestion of data into a persistent knowledge graph. This graph accumulates information over time and encodes relationships between entities, events, concepts, and prior interactions. From the current state of this graph, a persona lens can be derived dynamically, reflecting the system's accumulated experience rather than its immediate conversational state. The knowledge graph is composed of subgraphs that correspond to specific usage contexts, queries, or temporal windows, allowing the system to reason over both structure and recency.

These subgraphs may be constructed on demand at query time or incrementally maintained as structured representations that evolve with continued use. Nodes and relationships gain or lose salience based on how frequently and how recently they are accessed. Time series information is therefore intrinsic to the graph, enabling decay functions, reinforcement effects, and temporal heuristics. The persona lens generated from long-term memory is informed by this temporal structure, weighting recent and relevant knowledge more heavily while still retaining access to deeper historical context. In this way, long-term memory provides continuity and identity, while short-term memory provides adaptability and situational awareness, both unified through the evolving persona framework.

Core Abstractions

Persona as a State Vector

A persona at a given moment in time can be described as a collection of interpretable trait keys, where each key has an associated numeric weight that changes over time. Each key represents a specific behavioral or stylistic dimension such as tone, epistemic stance, verbosity, or abstraction level. The full persona is therefore the complete set of these key-weight pairs at that moment. The set of keys is fixed across the system, while the weights evolve as the system interacts with new data and events.

This representation is not prompt text. It is a structured control surface that governs how context is assembled and how generation is shaped.

Short-Term Memory as a State Transition System

Short-term memory can be understood as a process that transforms the current persona into a new persona in response to an event. The persona at the next moment in time is produced by applying an update function to the current persona and the triggering event. The event may be a user message, a retrieved document, or an internal system signal.

In a typical update rule, each persona weight at the next moment is computed as a combination of its previous value and a contribution derived from the event. A decay factor controls how much of the old value is retained, while an event influence factor controls how strongly the event pushes the weight in a new direction. This ensures smooth adaptation rather than abrupt shifts.

class PersonaState:
    def __init__(self, weights: dict[str, float]):
        self.weights = weights

    def update(self, event_features: dict[str, float], alpha=0.9, beta=0.1):
        # alpha: decay factor, how much of the previous value is retained.
        # beta: event influence, how strongly the event pushes the weight.
        for k, delta in event_features.items():
            self.weights[k] = alpha * self.weights.get(k, 0.0) + beta * delta

This is your short-term memory: volatile, contextual, and continuously rewritten.
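
A quick usage sketch (the trait keys and event feature values here are illustrative):

persona = PersonaState({"tone": 0.5, "abstraction": 0.7})

# A formal, concrete user message arrives; its extracted features nudge
# the weights rather than overwrite them.
persona.update({"tone": 0.9, "abstraction": 0.2})
print(persona.weights)  # approximately {'tone': 0.54, 'abstraction': 0.65}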

Long-Term Memory as a Temporal Knowledge Graph

Knowledge Graph Definition

Long-term memory is represented as a directed, labeled graph with time-aware metadata. The graph consists of a set of nodes, a set of edges connecting those nodes, and a timing function that assigns timestamps and decay-related metadata to edges.

Nodes represent entities such as documents, concepts, users, or personas. Edges represent labeled relationships between those entities. Each node and edge stores semantic embeddings, symbolic attributes, and usage statistics that track how often and how recently they are accessed.

class KGNode:
    def __init__(self, node_id, embedding, metadata):
        self.id = node_id
        self.embedding = embedding  # semantic embedding of the entity
        self.metadata = metadata    # symbolic attributes (type, title, ...)
        self.last_used = None       # timestamp of most recent access
        self.use_count = 0          # total number of accesses

Subgraph Extraction as Time-Conditioned Retrieval

Rather than returning isolated text chunks, retrieval returns a subgraph of the knowledge graph. A node is included in the retrieved subgraph if its relevance to the query, multiplied by its recency, exceeds a threshold.

Relevance is computed using embedding similarity or symbolic matching. Recency is computed using an exponential decay function that decreases as the time since last access increases. Nodes accessed more recently therefore contribute more strongly to the retrieved context.

import math
from datetime import datetime

def recency_weight(node, now=None, lambda_=0.01):
    # lambda_ sets the decay rate; tune it to the timescale of your timestamps.
    if now is None:
        now = datetime.now().timestamp()
    if node.last_used is None:
        return 0.5  # neutral prior for nodes that have never been accessed
    return math.exp(-lambda_ * (now - node.last_used))
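
Combining the two signals, a minimal extraction sketch (the cosine helper and the 0.3 threshold are assumptions, not fixed parts of the design):

import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def extract_subgraph(nodes, query_embedding, threshold=0.3):
    # A node joins the subgraph when relevance times recency clears the threshold.
    return [
        node for node in nodes
        if cosine_sim(query_embedding, node.embedding) * recency_weight(node) >= threshold
    ]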

Persona Generation from Graph State

Persona as a Projection of the Knowledge Graph

A persona can be generated from long-term memory by projecting the retrieved subgraph into persona space. This projection aggregates information from each node in the subgraph into persona dimensions.

For each persona dimension, the final weight is computed by summing contributions from all nodes in the subgraph. Each contribution is the product of the node's recency weight and a feature-mapping function that translates node properties into that persona dimension.

def persona_from_subgraph(nodes, feature_maps):
    weights = {}
    for node in nodes:
        r = recency_weight(node)
        for k, fn in feature_maps.items():
            weights[k] = weights.get(k, 0.0) + r * fn(node)
    return PersonaState(weights)

This is how long-term memory produces a persona lens.

Dynamic Persona Mixture of Experts (MoE)

Expert Personas

The system maintains a set of expert personas. Each expert corresponds to a distinct reasoning style, domain specialization, rhetorical mode, or historical snapshot of perspective.

Gating Function

A gating mechanism selects and weights expert personas based on the current query, the active short-term persona, and the relevant region of the knowledge graph. The output of this gating process is a normalized set of weights that sum to one, indicating how much each expert persona should contribute.

import numpy as np
from scipy.spatial.distance import cosine

def gate_experts(query_embedding, persona, experts):
    scores = []
    for expert in experts:
        # Semantic alignment between the query and the expert's domain,
        # plus a bias from the active short-term persona.
        score = 1 - cosine(query_embedding, expert.embedding)
        score += persona.weights.get(expert.bias_key, 0.0)
        scores.append(score)
    # Softmax normalization: gating weights are positive and sum to one.
    scores = np.array(scores)
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

This is mixture-of-experts applied to personas rather than models.

Expert Composition

The final persona is constructed by computing a weighted sum of expert personas. Each expert's persona weights are multiplied by its gating weight, and the results are summed across all experts to produce a single composite persona.

def mix_personas(experts, weights):
    final = {}
    for expert, w in zip(experts, weights):
        for k, v in expert.weights.items():
            final[k] = final.get(k, 0.0) + w * v
    return PersonaState(final)
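
Chained together, an illustrative sketch assuming each expert object exposes `embedding`, `bias_key`, and `weights` attributes:

# Gate against the current query and short-term persona, then blend.
gating_weights = gate_experts(query_embedding, short_term_persona, experts)
final_persona = mix_personas(experts, gating_weights)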

Persona as an Inference Lens

Deterministic Context Assembly

The persona deterministically controls how context is assembled, including system instructions, memory ordering, compression strategies, and agent behaviors.

def build_llm_context(persona, subgraph, user_query):
    instructions = persona_to_instructions(persona)  # persona -> system prompt
    memory = serialize_subgraph(subgraph, persona)   # persona-ordered memory block
    return {
        "system": instructions,
        "context": memory,
        "query": user_query
    }

This provides a provider-invariant interface: same structure, same variables, different engines.
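
`serialize_subgraph` is referenced above but not defined in this post; a minimal sketch, assuming nodes carry a `title` in their metadata and that recency ordering is an acceptable default:

def serialize_subgraph(subgraph, persona):
    # Order memory by recency; a fuller version would let persona weights
    # select compression and ordering strategies.
    ordered = sorted(subgraph, key=recency_weight, reverse=True)
    return "\n".join(f"- {node.metadata.get('title', node.id)}" for node in ordered)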

Reinforcement and Evolution

Persona Reinforcement

When feedback is received, persona weights are adjusted in the direction encouraged by that feedback. Each weight is incremented proportionally to the reward signal and a learning rate. Conceptually, this reinforces behaviors that led to positive outcomes and weakens those that did not.

def reinforce(persona, reward, lr=0.01):
    # Positive reward strengthens the current configuration as a whole;
    # negative reward weakens it.
    for k in persona.weights:
        persona.weights[k] += lr * reward

Persistence and Time-Series Personas

Each persona snapshot is saved as a timestamped file, enabling replay, regression analysis, visualization, and diagnostics. Personas become first-class temporal objects rather than ephemeral prompts.
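
A minimal persistence sketch, assuming the /memory/personas layout shown later in this post:

import json
from datetime import datetime
from pathlib import Path

def snapshot_persona(persona, root="memory/personas"):
    # Every snapshot is a new, timestamped, immutable JSON file.
    Path(root).mkdir(parents=True, exist_ok=True)
    path = Path(root) / f"{datetime.now():%Y-%m-%dT%H-%M-%S}.json"
    path.write_text(json.dumps(persona.weights, indent=2), encoding="utf-8")
    return path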

System Overview

At a high level, the system is composed of five persistent backend subsystems:

  1. Markdown Memory Store - the canonical long-term memory substrate
  2. Knowledge Graph Builder - structural representations extracted from memory
  3. Persona Engine - short-term and long-term perspective modeling
  4. Dynamic Persona MoE RAG Orchestrator - expert routing and synthesis
  5. Generation and Ingestion Pipeline - the output-to-memory feedback loop

The defining principle of the system is that every output feeds forward into future structure. Generated text is not discarded after inference. Instead, it is re-ingested, embedded, structured, and allowed to influence future persona formation and retrieval behavior. Nothing is thrown away; information is only reweighted over time.

Implementation Guide

1. Long-Term Memory: Markdown as Canonical Storage

File System Layout

Markdown files act as the authoritative source of truth. A traditional database is intentionally avoided as the primary store in order to preserve immutability and temporal traceability.

/memory
  /posts
    2024-01-12-graph-rag.md
    2024-03-09-persona-evolution.md
  /personas
    2024-03-09T12-30-00.json
  /snapshots
    kg_2024-03-09.pkl

Once a markdown file is published, it is never edited. Any revision creates a new file. This ensures that time is represented explicitly in the memory substrate, enabling causal reasoning and historical reconstruction.
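
A sketch of publishing under this convention (the slug-based filename is an assumption that matches the layout above):

from datetime import datetime
from pathlib import Path

def publish_post(text, slug, root="memory/posts"):
    # Never edit in place: every publish or revision creates a new dated file.
    Path(root).mkdir(parents=True, exist_ok=True)
    path = Path(root) / f"{datetime.now():%Y-%m-%d}-{slug}.md"
    path.write_text(text, encoding="utf-8")
    return path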

Markdown Ingestion

Each markdown document is parsed into multiple representations: raw text, section hierarchy, metadata, embeddings, and symbolic entities. The ingestion pipeline assigns each document a timestamp derived from frontmatter or filename conventions.

import os
import frontmatter

def ingest_markdown(path):
    with open(path, 'r', encoding='utf-8') as f:
        post = frontmatter.load(f)

    # post.content is the body with frontmatter already stripped, which is
    # safer than splitting on '---' (the body itself may contain one).
    text = post.content.strip()
    embedding = embed(text)            # embedding model, defined elsewhere
    entities = extract_entities(text)  # entity extraction, defined elsewhere

    return {
        "text": text,
        "embedding": embedding,
        "entities": entities,
        "timestamp": extract_date(post.get('date', os.path.basename(path))),
        "metadata": post.metadata
    }

The timestamp extraction logic attempts multiple date formats and defaults to the current time only if no valid temporal signal is available. This guarantees that every document participates in the temporal structure of memory.
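
A sketch of that logic (the regex approach and ISO date format are assumptions):

import re
from datetime import datetime

def extract_date(value):
    # Accepts frontmatter dates or filenames like 2024-03-09-persona-evolution.md;
    # falls back to the current time only when no date is found.
    match = re.search(r"\d{4}-\d{2}-\d{2}", str(value))
    if match:
        return datetime.strptime(match.group(), "%Y-%m-%d").timestamp()
    return datetime.now().timestamp()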

2. Knowledge Graph Construction

Graph Schema

The knowledge graph is append-only and explicitly temporal. Nodes represent entities such as documents, concepts, personas, agents, or topics. Edges represent labeled relationships between those entities, such as references, elaborations, contradictions, or evolutionary transitions.

Each node stores:

  • A semantic embedding
  • The last time it was accessed
  • The number of times it has been accessed
  • Its creation timestamp

These fields allow the graph to encode both semantic structure and usage history.

Incremental Graph Updates

The graph is updated whenever new markdown is ingested, new personas are created, or new synthesis outputs are generated. Graph growth is incremental and monotonic.

class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}  # node_id -> KGNode
        self.edges = {}  # (source_id, target_id) -> relationship label and metadata

Node access timestamps and counts are updated whenever a node participates in retrieval or synthesis. This provides the raw data necessary for recency-based weighting and decay.
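
A sketch of that bookkeeping (the function placement and names are illustrative):

from datetime import datetime

def record_access(graph, node_id):
    # Called whenever a node participates in retrieval or synthesis; this
    # feeds the recency and frequency signals used for decay and weighting.
    node = graph.nodes[node_id]
    node.last_used = datetime.now().timestamp()
    node.use_count += 1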

3. Persona Engine

Persona Definition

A persona is defined as a structured vector of fixed dimensions. Each dimension corresponds to a behavioral or interpretive axis, such as tone, abstraction level, epistemic caution, synthesis depth, or agentic autonomy.

{
  "tone": 0.7,
  "abstraction": 0.9,
  "epistemic_caution": 0.4,
  "synthesis_depth": 0.95,
  "agentic_autonomy": 0.6
}

The dimensional schema is global and invariant across the system. Only the values evolve.
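
A constant capturing that invariant schema (taken from the example above) keeps the later code self-consistent:

PERSONA_KEYS = [
    "tone",
    "abstraction",
    "epistemic_caution",
    "synthesis_depth",
    "agentic_autonomy",
]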

Short-Term Persona State

The short-term persona exists only for the duration of a session. It is initialized from a base persona and then updated incrementally as events occur.

Each update blends the previous value with an event-derived signal, ensuring continuity rather than abrupt change.

persona_t = PersonaState(base_persona.weights.copy())
persona_t.update(event_features)

Although short-term personas are reset between sessions, they are logged for later analysis.

Long-Term Persona Derivation

Long-term personas are derived by aggregating signals from relevant regions of the knowledge graph. Each node contributes to persona dimensions proportionally to how recently and how frequently it has been accessed.

def persona_from_graph(subgraph, feature_maps):
    weights = {k: 0.0 for k in PERSONA_KEYS}
    for node in subgraph:
        r = recency_weight(node)
        for k, fn in feature_maps.items():
            weights[k] += r * fn(node)
    return PersonaState(weights)

This mechanism allows accumulated experience, rather than immediate interaction, to shape perspective.

4. Dynamic Persona MoE RAG

Subgraph Retrieval (Not Vector RAG)

Retrieval returns a structured subgraph rather than isolated chunks of text. Each node is scored based on a combination of semantic relevance and temporal recency.

Relevance is computed via embedding similarity. Recency is computed via exponential decay based on time since last access. The final score is the product of these two signals.

score = relevance * recency

This ensures that old but important knowledge does not vanish, while recent and relevant knowledge is prioritized.

Persona Experts

Each persona expert is a persisted snapshot with known behavioral tendencies and historical context. Experts are associated with specific graph regions and agent behaviors.

Experts are selected based on overlap between their relevance criteria and the retrieved subgraph.

Gating and Composition

If multiple experts are applicable, a gating mechanism assigns weights to each expert based on query alignment and persona bias. These weights are normalized so that they sum to one.

The final persona is produced by computing a weighted average of expert persona weights across all dimensions.

final_persona = mix_personas(experts, weights)

This produces a smooth interpolation rather than a hard switch.

5. Agentic Context Assembly

Agents do not generate text. They prepare structure.

Each agent operates under constraints imposed by the persona. For example, an abstraction agent may produce higher-level summaries when abstraction weight is high, or defer entirely when it is low.

class SummarizerAgent:
    def __init__(self, persona):
        self.persona = persona

    def run(self, subgraph):
        # The abstraction weight decides how aggressively nodes are summarized.
        level = self.persona.weights.get("abstraction", 0.5)
        return summarize_nodes(subgraph, level)  # summarizer defined elsewhere

Agents collectively assemble structured context, which is then passed to the language model.

6. LLM Inference as Execution

The language model receives three things:

  • System instructions derived from persona weights
  • Structured context produced by agents
  • The user or synthesis prompt

The persona determines tone, abstraction level, and analytical depth deterministically.

def persona_to_system(persona):
    # Illustrative mapping; the thresholds are arbitrary choices.
    tone = "formal" if persona.weights.get("tone", 0.5) > 0.6 else "informal"
    caution = "hedge uncertain claims" if persona.weights.get("epistemic_caution", 0.5) > 0.5 else "state conclusions directly"
    return f"Write in a {tone} register and {caution}."

The language model is treated as an execution engine, not a memory store.

7. Feedback Loop: Output to Memory

Every generated output is:

  1. Saved as markdown
  2. Embedded
  3. Parsed
  4. Added to the knowledge graph
  5. Used to update persona trajectories

def feedback_loop(response):
    # Each helper is sketch-level; the steps match the list above.
    filename = generate_timestamped_filename()
    save_markdown(filename, response)   # 1. saved as markdown
    doc = ingest_markdown(filename)     # 2. embedded, 3. parsed
    graph.add_document(doc)             # 4. added to the knowledge graph
    update_persona_trajectory(doc)      # 5. persona trajectories updated

This closes the loop. Output becomes memory. Memory shapes future perspective. Perspective shapes future output.

Time-Series and Diagnostics

The system tracks persona drift by measuring how much persona weights change between successive snapshots. Large drift indicates instability or exploration. Low drift indicates convergence or rigidity.

Topic entropy is computed by examining the distribution of document topics over time. High entropy indicates exploration across domains. Low entropy indicates thematic narrowing.
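
Both diagnostics reduce to small computations over snapshots; a sketch, assuming snapshots are dicts of persona weights and topics is a list of topic labels:

import math

def persona_drift(prev, curr):
    # L2 distance between successive persona weight snapshots.
    keys = set(prev) | set(curr)
    return math.sqrt(sum((curr.get(k, 0.0) - prev.get(k, 0.0)) ** 2 for k in keys))

def topic_entropy(topics):
    # Shannon entropy of the topic distribution over a time window.
    total = len(topics)
    return -sum(
        (topics.count(t) / total) * math.log2(topics.count(t) / total)
        for t in set(topics)
    )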

These diagnostics are essential for understanding and debugging synthetic cognition.

Why This Works

This architecture avoids prompt fragility, stateless memory, vector-only retrieval collapse, and provider lock-in.

Instead, it provides explicit memory, inspectable cognition, deterministic synthesis, and identity that evolves over time.

Summary: Architectural Identity

This system is not chat memory.

It is a stateful control system built on a temporal knowledge graph, using persona-based mixture-of-experts routing to deterministically shape inference.

Dynamic Persona MoE RAG is a system where retrieval retrieves structure, memory generates perspective, and personas shape cognition over time.