
From Scripted to Sentient

A Framework for Persistent AI NPCs in Multiplayer Worlds

By Byiringiro Thierry · 2026-02

game-ai npcs llms agents multiplayer


1. Abstract

Generative NPCs have shipped in commercial games — Inworld powers characters in EX-ZODIAC and a handful of indie titles; Convai shipped in NVIDIA's Project G-Assist demos; Suck Up! and AI People are LLM-NPC indie hits. But the bar is still single-session, single-player coherence. No shipped 2026 game maintains NPC mental state across multiple sessions, multiple co-existing players, or hundreds of hours of accumulated player interaction.

This paper presents a four-layer architecture for persistent NPCs — agents that:

  1. Remember what you said yesterday, and what your guildmate said three weeks ago.
  2. Form opinions about player factions, NPC factions, and individual players.
  3. Pursue long-horizon goals that survive across sessions.
  4. Stay in character despite hundreds of hours of player provocation.

The remainder of this paper details each layer, the engineering trade-offs, and the unsolved problems.

2. Why "1.5 generation" is the wrong target

A typical 2026 LLM-NPC pipeline:

Player input → LLM with [system prompt + character bio + last 10 dialog turns]
            → response (text)
            → optional: emotion-tag classifier → animation system

This is one-shot dialog. It's better than scripted dialog trees but not qualitatively different. The NPC does not learn from this conversation. The next conversation starts fresh. There is no "Did you remember to bring me the herb you promised?" — the NPC has no notion of a promise.
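To make the gap concrete, here is a minimal TypeScript sketch of that one-shot loop. All names (`OneShotNPC`, `completeLLM`, `oneShotTurn`) are illustrative and the LLM client is a stub; no particular SDK is assumed.

```typescript
// Illustrative types -- not from any specific engine or SDK.
type DialogTurn = { speaker: "player" | "npc"; text: string };

interface OneShotNPC {
  bio: string;           // static character bio
  history: DialogTurn[]; // last K turns only; dropped on session end
}

// Stand-in for a real LLM client call.
async function completeLLM(prompt: string): Promise<string> {
  return `[reply to: ${prompt.slice(-40)}]`;
}

// The whole 1.5-generation pipeline: bio + recent turns + input, one call.
async function oneShotTurn(
  npc: OneShotNPC,
  playerInput: string,
  k = 10,
): Promise<string> {
  const recent = npc.history
    .slice(-k)
    .map((t) => `${t.speaker}: ${t.text}`)
    .join("\n");
  const prompt = `${npc.bio}\n${recent}\nplayer: ${playerInput}\nnpc:`;
  const reply = await completeLLM(prompt);
  // The only state that survives the turn is the rolling dialog history.
  npc.history.push(
    { speaker: "player", text: playerInput },
    { speaker: "npc", text: reply },
  );
  return reply;
}
```

Note that nothing outside `history` is ever written: no promise, no opinion, no goal. That is the whole limitation.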

What players actually want from "AI NPCs" is what they imagine: characters who know them, judge them, change their behavior. The gap between the 1.5-generation pipeline and that experience is the architectural gap this paper addresses.

3. The four layers

Layer 1: Short-term episodic memory (per-session)

The trivial layer — what 1.5-generation already does. Store the last K turns of dialog verbatim. Pass them in-prompt to the LLM. Drop on session end.

The engineering challenge here is prompt budget, not memory architecture. K = 20 at 100 tokens/turn = 2,000 tokens of context per LLM call. With input priced at ~$0.003/1K tokens (Claude Sonnet 4.6, 2026), a 10-NPC scene at 1 turn/second per NPC consumes 20,000 input tokens per second — roughly $3.60/minute on inference. This is the floor. Lower numbers come from prompt caching and quantization, not from skipping this layer.
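The prompt-budget constraint can be enforced with a small per-session buffer. This is a sketch under assumptions: `EpisodicBuffer` is a hypothetical name, and the ~4-characters-per-token estimate stands in for a real tokenizer.

```typescript
// Hypothetical per-session episodic buffer; caps both turn count and tokens.
type Turn = { speaker: string; text: string };

class EpisodicBuffer {
  private turns: Turn[] = [];
  constructor(private maxTurns = 20, private maxTokens = 2000) {}

  // Rough token estimate (~4 chars/token); a real tokenizer would be exact.
  private tokens(t: Turn): number {
    return Math.ceil(t.text.length / 4);
  }

  push(turn: Turn): void {
    this.turns.push(turn);
    // Evict oldest turns until both the turn cap and token cap hold.
    while (
      this.turns.length > this.maxTurns ||
      this.turns.reduce((n, t) => n + this.tokens(t), 0) > this.maxTokens
    ) {
      this.turns.shift();
    }
  }

  // Rendered verbatim into the prompt; dropped entirely on session end.
  asPrompt(): string {
    return this.turns.map((t) => `${t.speaker}: ${t.text}`).join("\n");
  }
}
```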

Layer 2: Long-term semantic memory

Where the 1.5-generation stops. We need an NPC who, three weeks later, can recall: "You're the player who helped me find my brother's killer. I remember."

Three retrieval mechanisms, used together:

  1. Vector retrieval over an event log. Every significant interaction is summarized by a "memory writer" call (cheap, 100 tokens of summary per event). The summary is embedded and stored. On future LLM calls, embed the current context, retrieve top-K relevant past memories, prepend to prompt.
  2. Symbolic structured memory. Some facts are not well-served by embedding similarity — they're better stored as structured assertions. The NPC's beliefs about player faction membership, completed quests, debts owed, etc. live in a typed relational structure that the LLM can be told to read/write via tool-use.
  3. Periodic compaction. Naive vector memory grows unboundedly. Every 100 events, run a "memory consolidator" LLM call that takes the recent batch + a sample of older memories and produces consolidated memories: themes, summarized arcs, and discarded trivia. Vector store is rebuilt around the consolidated set. Aligned with human memory consolidation in sleep — the metaphor is not accidental.
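Mechanism 1 above can be sketched in a few lines. `MemoryStore`, `embed`, and the letter-frequency embedding are all stand-ins; a production system would call a real embedding model and use an approximate-nearest-neighbor index rather than the brute-force cosine scan shown here.

```typescript
// Sketch of the Layer-2 event log: summaries are embedded on write and
// retrieved by cosine similarity on read.
type Memory = { summary: string; vec: number[] };

// Toy 26-dim letter-frequency "embedding" -- a stand-in for a real model.
function embed(text: string): number[] {
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i]++;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return na && nb ? dot / (na * nb) : 0;
}

class MemoryStore {
  private memories: Memory[] = [];

  // Called by the "memory writer" after each significant event.
  write(summary: string): void {
    this.memories.push({ summary, vec: embed(summary) });
  }

  // Top-K relevant past memories for the current context, to prepend to prompt.
  retrieve(context: string, k: number): string[] {
    const q = embed(context);
    return [...this.memories]
      .sort((a, b) => cosine(q, b.vec) - cosine(q, a.vec))
      .slice(0, k)
      .map((m) => m.summary);
  }
}
```

The consolidator of mechanism 3 would periodically rewrite `memories` wholesale; the store's interface does not change.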

Layer 3: Social-graph memory

Multiplayer worlds add a dimension single-player NPC work hasn't had to solve: NPCs need to track other players and other NPCs as distinct entities with their own histories.

Architecture:

                  ┌──────────────────┐
                  │   NPC: Borric    │
                  │   (innkeeper)    │
                  └─────────┬────────┘
        ╔═══════════════════╪═══════════════════╗
   ┌────▼────┐          ┌───▼────┐          ┌──▼─────┐
   │ Player1 │          │ Player2│          │ NPC:   │
   │ Trust:7 │          │ Trust:2│          │ Fenwick│
   │ Debt:0g │          │ Debt:5g│          │ Trust:5│
   └─────────┘          └────────┘          └────────┘

Each relationship is its own typed record: trust, debt, a last-interaction tag, and 3-5 free-form sentiment notes per entity. On any LLM call involving a specific player, only the relevant relationship records are loaded into the prompt, so NPC behavior adapts without the cost of loading the full relationship set.

The hard part is consistency. If Player1 brings their guildmate Player3 to meet Borric, Borric should react based on his Player1 relationship and his lack of prior relationship with Player3 (curious, polite, but reserved). This requires the LLM call to be aware of the witness graph — who is present, who knows whom. Easy to get wrong.
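A minimal sketch of the relationship records and the witness-aware context assembly described above; all type and method names are hypothetical. Strangers are surfaced explicitly so the prompt can request the curious-but-reserved reaction.

```typescript
// Hypothetical typed relationship record, matching the diagram above.
interface Relationship {
  trust: number;   // 0-10
  debtGold: number;
  lastTag: string; // e.g. "helped_quest", "stole_from_cellar"
  notes: string[]; // 3-5 free-form sentiment notes
}

class SocialGraph {
  private edges = new Map<string, Relationship>();

  get(npcId: string, otherId: string): Relationship | undefined {
    return this.edges.get(`${npcId}->${otherId}`);
  }

  set(npcId: string, otherId: string, rel: Relationship): void {
    this.edges.set(`${npcId}->${otherId}`, rel);
  }

  // Only relationships with entities actually present are loaded; entities
  // with no prior record are flagged as strangers for the prompt.
  promptContext(npcId: string, present: string[]): string {
    return present
      .map((id) => {
        const rel = this.get(npcId, id);
        return rel
          ? `${id}: trust ${rel.trust}, debt ${rel.debtGold}g (${rel.lastTag})`
          : `${id}: stranger (no prior relationship)`;
      })
      .join("\n");
  }
}
```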

Layer 4: Goal-stack reasoning

This is where NPCs become agents instead of responders. Each NPC has a stack of active goals:

Borric's goal stack (March 12):
  1. (Active)  Restock ale before tavern reopens at 18:00
  2. (Active)  Find out who's been stealing from the cellar
  3. (Dormant) Reconcile with brother Fenwick (paused since 2023-10-04)
  4. (Active)  Repay debt of 30g to the merchant guild

Each tick (or each LLM call), the NPC consults the stack and decides which goal to pursue or which goal a player interaction should advance/disrupt. New goals are added by player interaction ("I'll find that herb for you") and resolved when world state matches the goal's success condition.

This converts NPCs from reactive dialog generators into proactive world participants. They become NPCs you can interrupt — not NPCs you have to summon.
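The tick logic above can be sketched as a priority-ordered stack with per-goal success conditions. Field names and the `WorldState` shape are assumptions for illustration, not an engine API.

```typescript
// Sketch: each tick, resolve any goals whose success condition now holds,
// then return the highest-priority remaining active goal to pursue.
type WorldState = Record<string, boolean>;

interface Goal {
  desc: string;
  status: "active" | "dormant" | "done";
  priority: number; // higher = more urgent
  isSatisfied: (world: WorldState) => boolean;
}

class GoalStack {
  constructor(private goals: Goal[] = []) {}

  // New goals arrive from player interaction ("I'll find that herb for you").
  add(goal: Goal): void {
    this.goals.push(goal);
  }

  tick(world: WorldState): Goal | undefined {
    for (const g of this.goals) {
      if (g.status === "active" && g.isSatisfied(world)) g.status = "done";
    }
    return this.goals
      .filter((g) => g.status === "active")
      .sort((a, b) => b.priority - a.priority)[0];
  }
}
```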

4. Engineering trade-offs

4.1. Cost per NPC per hour. Layer-1 alone: ~$0.50/NPC-hour for a moderately chatty NPC at 2026 prices. Add Layers 2-4: ~$1.20-2.00/NPC-hour. For a 50-NPC city, that's $60-100/player-hour. Untenable for free-to-play, painful for $60 retail games.

The path down: aggressive prompt-caching (Anthropic's prompt cache cuts long-prompt costs ~10×); per-NPC fine-tuning (small models run locally; only escalate to frontier models for hard reasoning calls); and budgeted-LLM-call architecture — most NPC ticks are scripted-state-machine; LLM calls are reserved for actual dialog and goal-stack updates.
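The budgeted-call gate is the simplest of the three levers to show in code. Event kinds and escalation rules here are illustrative; the point is that the default path never touches the LLM.

```typescript
// Sketch of the budgeted-call gate: most NPC ticks stay in the scripted
// state machine; only events needing open-ended reasoning escalate.
type NpcEvent =
  | { kind: "idle" }
  | { kind: "ambient_bark" }
  | { kind: "player_dialog"; text: string }
  | { kind: "goal_review" };

function needsLLM(ev: NpcEvent): boolean {
  switch (ev.kind) {
    case "player_dialog":
      return true; // open-ended language in -> LLM out
    case "goal_review":
      return true; // re-plan the goal stack
    case "ambient_bark":
      return false; // pick from pre-written lines
    case "idle":
      return false; // pure state machine
  }
}
```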

4.2. Multiplayer cache invalidation. If Borric's relationship with Player1 changes (Player1 just stole from him), the cached prompt that includes Borric's relationship record is now stale. In a single-player game, this is easy — invalidate locally. In a multiplayer game, the server must invalidate caches for any NPC any player is currently interacting with whose memory might be affected. This is essentially the same problem as MMORPG mob-aggro propagation, but applied to social state.
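One way to track this is dependency-tagged cache entries: each cached prompt records which memory records it embeds, and a change to any record evicts every entry that depends on it. A sketch with hypothetical names:

```typescript
// Sketch: cached prompt prefixes tagged with the memory records they embed.
class PromptCache {
  private entries = new Map<string, { prompt: string; deps: Set<string> }>();

  put(key: string, prompt: string, deps: string[]): void {
    this.entries.set(key, { prompt, deps: new Set(deps) });
  }

  get(key: string): string | undefined {
    return this.entries.get(key)?.prompt;
  }

  // Called when a record changes, e.g. relationship "rel:borric:player1"
  // after Player1 steals from Borric.
  invalidateDep(dep: string): void {
    for (const [key, e] of this.entries) {
      if (e.deps.has(dep)) this.entries.delete(key);
    }
  }
}
```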

4.3. Determinism for replay. Multiplayer games need deterministic replay for cheat-detection and bug reproduction. LLM outputs are non-deterministic by default (temperature > 0 + sampling). Solution: log every LLM call's prompt + response. Replay re-uses the logged response rather than re-calling the LLM. Storage cost is real (gigabytes per server-day) but tractable.
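The log-and-replay wrapper is straightforward to sketch. `ReplayableLLM` and its method names are illustrative; a replay that sees a prompt differing from the recorded one signals real non-determinism elsewhere in the simulation.

```typescript
// Sketch: live runs call the model and record the response; replay runs
// return the recorded response byte-for-byte without calling the model.
type LLMFn = (prompt: string) => Promise<string>;

class ReplayableLLM {
  private log: { prompt: string; response: string }[] = [];
  private cursor = 0;

  constructor(private live: LLMFn, private mode: "record" | "replay") {}

  async complete(prompt: string): Promise<string> {
    if (this.mode === "replay") {
      const entry = this.log[this.cursor++];
      if (!entry || entry.prompt !== prompt) {
        // The simulation diverged before the LLM did.
        throw new Error("replay divergence: prompt mismatch");
      }
      return entry.response;
    }
    const response = await this.live(prompt);
    this.log.push({ prompt, response });
    return response;
  }

  exportLog() { return this.log; }
  importLog(log: { prompt: string; response: string }[]) {
    this.log = log;
    this.cursor = 0;
  }
}
```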

4.4. Latency budget. Players notice >500ms NPC response latency. Frontier LLM calls average 1.2–2.0s for typical NPC prompt sizes. Mitigations: stream tokens (start showing the response as it arrives); pre-compute likely responses speculatively (NPC "thinks" about what they'd say if the player approached); cache responses to common openers.

5. The personality-drift problem

The unsolved problem of long-horizon LLM-NPCs: personality drift.

Over hundreds of hours, an NPC who started gruff-but-fair gradually becomes either deferential (because LLMs naturally accommodate user preferences) or unhinged (because adversarial players probe for exploit prompts). Either direction breaks immersion.

Three partial mitigations:

  1. Personality anchors in every prompt. The system prompt re-states the NPC's core traits on every LLM call, not just at session start. Drift is fought turn by turn.
  2. Periodic self-evaluation. Every N events, a "personality auditor" LLM call reviews recent NPC outputs against the character bio. If drift detected, recent memories are flagged for re-summarization with a stronger trait anchor.
  3. Player-facing impact limits. Hard rules: no NPC ever changes faction allegiance through dialog alone, no NPC ever crosses a "personality red line" (a hostile NPC never becomes a follower based on words). These are scripted guardrails on top of the LLM.
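Mitigation 2 is mostly prompt plumbing. A sketch of the auditor call, with the auditor model stubbed and the DRIFT/OK answer protocol invented for illustration:

```typescript
// Sketch of the periodic personality audit described above.
interface AuditResult {
  drift: boolean;
  evidence: string;
}

async function auditPersonality(
  bio: string,
  recentOutputs: string[],
  llm: (prompt: string) => Promise<string>,
): Promise<AuditResult> {
  const prompt = [
    "You are a character-consistency auditor.",
    `Character bio: ${bio}`,
    "Recent lines spoken by this character:",
    ...recentOutputs.map((o, i) => `${i + 1}. ${o}`),
    "Answer DRIFT or OK, then one sentence of evidence.",
  ].join("\n");
  const verdict = await llm(prompt);
  // A DRIFT verdict flags recent memories for re-summarization with a
  // stronger trait anchor.
  return { drift: verdict.trimStart().startsWith("DRIFT"), evidence: verdict };
}
```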

None of these is fully satisfactory. Personality drift remains, in my view, the hardest open problem in the space.

6. Reference architecture (TypeScript)

// One LLM call per NPC turn.
async function runNPCTurn(
  npc: NPC,
  playerInput: string,
  ctx: GameContext,
): Promise<NPCResponse> {
  // Layer 1: short-term — last 10 dialog turns
  const shortTerm = await loadEpisodic(npc.id, ctx.session.id, 10);

  // Layer 2: long-term — top-5 relevant memories
  const longTerm = await retrieveVector(npc.id, playerInput, 5);
  const symbolic = await loadSymbolicFacts(npc.id, /* about */ ctx.player.id);

  // Layer 3: relationships
  const witnessGraph = await loadWitnessGraph(ctx.scene.id);
  const relationship = await loadRelationship(npc.id, ctx.player.id);

  // Layer 4: goal stack — top 3 active goals
  const goals = await loadGoals(npc.id, "active", 3);

  // Compose prompt with strong personality anchor
  const prompt = buildPrompt({
    persona: npc.persona,            // "Gruff-but-fair innkeeper. ..."
    coreTraits: npc.traits,           // ["loyal", "stubborn", "kind to children"]
    shortTerm, longTerm, symbolic,
    witnessGraph, relationship,
    goals,
    playerInput,
  });

  const response = await llm.complete(prompt, { cache: ctx.session.id });

  // Write memory + update goals + update relationship
  await persistMemory(npc.id, ctx.session.id, playerInput, response);
  await updateGoals(npc.id, response.goalUpdates);
  await updateRelationship(npc.id, ctx.player.id, response.relationshipDelta);

  return response;
}

The interesting part is not the LLM call. It's the three memory, goal, and relationship updates that follow each turn. These are what make the NPC persist.

7. Conclusion

The path from scripted NPCs to truly sentient ones is not a model-scaling story. We do not need GPT-6 to have great NPCs. We need an architectural commitment at the game-engine level: persistent memory layers, social-graph awareness, goal-stack reasoning, and budgeted LLM calls.

The first game studio that ships this architecture as a first-class engine system (analogous to Havok for physics or FMOD for audio) wins the AI-NPC race. The technology is not the moat — the engineering plumbing is.

