An attempt to craft my own wiki for the AI era.
Theory and Foundation Layer
- math fundamentals
- CS and programming fundamentals
- AI fundamentals
- LLM fundamentals
Basic Theory and Patterns Evolution
| Stage | Technical Focus | Capability Shift | Core Architectural Constraint |
|---|---|---|---|
| Stage 1 | Transformers / MoE | Large-scale language processing | Lack of intent alignment or reasoning |
| Stage 2 | Instruction Fine-Tuning | Improved alignment with user goals | Brittle across diverse or novel tasks |
| Stage 3 | RLHF | Human-centric value alignment | Highly dependent on human evaluation |
| Stage 4 | Tool Integration | Active capability via external APIs | Lack of autonomous planning/memory |
| Stage 5 | RAG | Real-time factual grounding | Static knowledge and grounding issues |
| Stage 6 | Single-Agent Autonomy | Autonomous planning and execution | Limited to sequential, linear problem-solving |
| Stage 7 | Multi-Agent Collaboration | Distributed, specialized orchestration | High coordination and state complexity |
| Stage 8 | Persistent Expert Agents | Long-term learning and domain expertise | Ongoing research into self-evolving memory |
LLM Ops
- MLOps: Abstract out the common computing/storage layer, taking care of capacity, scheduling, scaling, and load balancing.
- compute layer: GPU cluster setup and management to fully utilize the hardware
- Scaling: Automatic and seamless scaling up and down, from on-premise to cloud when needed
- Scaling: Support multiple models with dynamic model loading
- Operations: Monitor usage of computing resources, and status of training and inference jobs
- Operations: Generate data for usage stats and metrics dashboard, and alert when anomaly detected
- Platform: Training / Fine-tuning: improve training throughput, reliability and efficiency
- Platform: Inference: Leverage the latest and most efficient open source framework for LLM inference to reduce latency and improve throughput
- Platform: Evaluation and Benchmarking: automatically evaluate models' performance on datasets of interests
- Platform: A/B Testing: capability for online A/B testing to compare features
- Unified AI Gateway: Abstract out the common API/SDK layer, taking care of authentication, authorization, rate limiting, error handling, logging, monitoring, and alerts.
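The gateway responsibilities above (authentication, rate limiting, error handling, logging) can be sketched in a few dozen lines. This is a minimal, hypothetical illustration, not any real gateway's API: `AIGateway`, `TokenBucketLimiter`, and the `backend` callable are all invented stand-ins.

```python
import time
from collections import deque

class TokenBucketLimiter:
    """Simple sliding-window rate limiter: at most `max_calls` per `window` seconds."""
    def __init__(self, max_calls: int, window: float):
        self.max_calls, self.window = max_calls, window
        self.calls: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that fell out of the window, then check capacity.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

class AIGateway:
    """Hypothetical unified gateway: auth check, rate limit, retry, and logging
    in front of any model backend (`backend` is a plain callable here)."""
    def __init__(self, backend, api_keys: set[str], limiter: TokenBucketLimiter):
        self.backend, self.api_keys, self.limiter = backend, api_keys, limiter
        self.log: list[str] = []

    def call(self, api_key: str, prompt: str, retries: int = 2):
        if api_key not in self.api_keys:
            raise PermissionError("unknown API key")
        if not self.limiter.allow(time.monotonic()):
            raise RuntimeError("rate limit exceeded")
        for attempt in range(retries + 1):
            try:
                out = self.backend(prompt)
                self.log.append(f"ok attempt={attempt}")
                return out
            except Exception as e:  # retry transient backend failures
                self.log.append(f"error attempt={attempt}: {e}")
        raise RuntimeError("backend failed after retries")
```

A real gateway would add per-tenant quotas, streaming, and async I/O; the point is that these concerns live in one layer, not in every caller.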
LLM Train
- pre-training
- post-training
- LLM knowledge distillation
LLM Inference
- GPU resource management
- API / SDK encapsulation
- rate limiting
- error handling
- logging
- monitoring
- alerts
- notifications
LLM Fine-tune
- prefix tuning, prompt tuning, and variants
- SFT
- RLHF / RLAIF / DPO variants
- LoRA and QLoRA variants
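The core idea behind LoRA is small enough to state directly: instead of updating the full weight matrix W, train a low-rank pair (A, B) and add the scaled product at inference, y = W x + (alpha / r) * B (A x). A minimal pure-Python sketch of that math (tiny dense matrices, no real framework):

```python
def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha: float, r: int):
    """y = W x + (alpha / r) * B (A x); only A (r x d_in) and B (d_out x r)
    are trained, W stays frozen."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

QLoRA keeps the same adapter math but stores the frozen W in quantized form; the trained A/B remain in higher precision.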
LLM RAG
basic patterns
- dense vector-based RAG
- sparse vector-based RAG
- graph-based RAG
SOP
- ingest documents; chunking and embedding strategies (including structured data)
- recall with hybrid search
- format, references and citations
- re-rank, query-rewriting, multi-hop, graph or table augmentation
- composable and modular RAG system architecture
- domain-specific retrieval pipelines; continuous ingestion
- quality metrics, evals and quality dashboards
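The "recall with hybrid search" step above can be sketched by blending a dense score with a sparse keyword score. Everything here is a stand-in: `embed` uses bag-of-words counts in place of a real embedding model, and the 0.6/0.4 weighting is an arbitrary example, not a recommendation.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a dense embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_overlap(q: str, d: str) -> float:
    # Sparse signal: Jaccard overlap of the raw token sets.
    qs, ds = set(q.lower().split()), set(d.lower().split())
    return len(qs & ds) / len(qs | ds) if qs | ds else 0.0

def hybrid_search(query: str, docs: list[str], w_dense: float = 0.6, top_k: int = 2):
    """Score each doc with a weighted blend of dense and sparse signals."""
    qv = embed(query)
    scored = [
        (w_dense * cosine(qv, embed(d)) + (1 - w_dense) * keyword_overlap(query, d), d)
        for d in docs
    ]
    scored.sort(key=lambda s: -s[0])
    return [d for _, d in scored[:top_k]]
```

In production the dense side is a vector index and the sparse side is BM25; re-ranking then reorders this candidate set.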
LLM Prompt Engineering
- prompt-engineering best practices for humans -> write the best prompts for your tasks
- classic patterns: one / few shot, chain-of-thought, self-consistency, ReAct, etc.
- prompt management (versioning, testing, validation, safety, etc.)
- AI-driven prompt optimization (prompts refined and optimized automatically by AI)
- DSPy, textGuard, promptWizard, GRAD-SUM, ell, StarGo ...
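Two of the classic patterns above (few-shot examples plus a chain-of-thought cue) compose into a single prompt template. A minimal sketch; the function name and layout are my own, not from any of the listed frameworks:

```python
def build_prompt(task: str, examples: list[tuple[str, str]], question: str,
                 chain_of_thought: bool = True) -> str:
    """Assemble a few-shot prompt; optionally append a chain-of-thought cue."""
    parts = [task.strip(), ""]
    for q, a in examples:  # each (question, answer) pair is one shot
        parts += [f"Q: {q}", f"A: {a}", ""]
    parts.append(f"Q: {question}")
    parts.append("A: Let's think step by step." if chain_of_thought else "A:")
    return "\n".join(parts)
```

Frameworks like DSPy automate exactly this assembly step, then optimize the examples and instructions against an eval metric instead of hand-tuning them.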
Agentic System Context Engineering
Why it matters: attention cost grows with context length; when noise dominates, decision quality drops—often called context rot. Many “model weakness” issues trace to how context is packed, not raw window size.
- Layer by stability and frequency (keep each layer doing one job):
- Resident: identity, project rules, hard prohibitions—short, stable, executable every turn.
- On-demand: skills and domain playbooks—index in prompt, load full text only when matched.
- Runtime inject: time, channel IDs, user prefs—append after stable prefixes.
- Memory: cross-session facts (e.g. MEMORY.md)—retrieve, do not dump everything by default.
- System / hooks: deterministic checks (linters, guards)—not repeated prose in the prompt.
- Write context: memories · state · scratch-pad · file-backed artifacts for large tool JSON (filesystem as the context interface).
- Select context: tools retrieval · docs / knowledge retrieval · memory retrieval
- mem0 example for long-term memory management
- Compress context (pick strategy for the failure mode):
- sliding window (cheap, loses early decisions)
- LLM summary / branch summarization (keep architecture decisions, open work, constraints)
- tool-result compaction (replace bulky outputs with pass/fail + pointers; preserve identifiers verbatim)
- Prompt caching: stable prefixes (system prompt, tool defs, long docs) cache best; put volatile content after stable blocks; volatile tool sets hurt hit rate.
- Skills descriptors: treat them as routing conditions, not marketing copy—Use when / Don’t use when, concrete counterexamples; load one skill when clearly matched.
- Isolate context: in state · environment / sandbox · partitions among agents
Make agents select tools to organize and manage runtime context (CRUD can be agent-driven, but deterministic rules stay in code or hooks).
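The layering above (resident rules first, on-demand skills next, volatile runtime facts after the stable prefix) can be sketched as a context assembler. This is an illustrative stand-in: the section names and the naive keyword-based skill matching are assumptions, not a real framework's behavior.

```python
def assemble_context(resident: str, skills: dict[str, str], runtime: dict[str, str],
                     memory_hits: list[str], user_msg: str) -> str:
    """Assemble context in stability order: resident rules first (cache-friendly),
    then matched skill text, then volatile runtime facts, memories, and the message."""
    # Naive routing stand-in: load a skill's full text only when its name matches.
    matched = [text for name, text in skills.items() if name.lower() in user_msg.lower()]
    blocks = [resident]
    if matched:
        blocks.append("## Skills\n" + "\n".join(matched))
    if runtime:
        blocks.append("## Runtime\n" + "\n".join(f"{k}: {v}" for k, v in runtime.items()))
    if memory_hits:
        blocks.append("## Memory\n" + "\n".join(memory_hits))
    blocks.append("## User\n" + user_msg)
    return "\n\n".join(blocks)
```

Because the resident block always comes first and never changes, it is the part a provider's prompt cache can reuse across turns; everything volatile is appended after it.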
LLM Select
- pick and compose right LLMs for the task
- model family selection
- open-source LLM families
- commercial LLM families
- latency, cost, throughput, quality, etc.
- LLM parameters (tokens, top-p, temperature, etc.)
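Picking the right LLM against the latency / cost / quality criteria above is often just constrained maximization: filter out models that break the latency or budget constraint, then take the best-quality survivor. A hypothetical sketch (the profile fields and numbers are invented examples):

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    quality: float      # 0..1, higher is better (e.g. from your eval suite)
    latency_ms: float   # typical p50 latency
    cost_per_1k: float  # USD per 1k tokens

def pick_model(models: list[ModelProfile], max_latency_ms: float,
               budget_per_1k: float) -> ModelProfile:
    """Filter by latency/cost constraints, then pick the highest-quality survivor."""
    candidates = [m for m in models
                  if m.latency_ms <= max_latency_ms and m.cost_per_1k <= budget_per_1k]
    if not candidates:
        raise ValueError("no model satisfies the constraints")
    return max(candidates, key=lambda m: m.quality)
```

Routing per request (cheap model for easy inputs, strong model for hard ones) is the dynamic version of this same selection.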
LLM Agentic Systems
Runtime shape: loop, workflow, and control
- Minimal agent loop: perceive → decide → act → feedback until the model stops with plain text. In mature stacks, the loop stays thin; new behavior is added via tools + handlers, prompt structure, and externalized state (files/DB), not by bloating the loop into a hand-written state machine. Let the model reason; let the harness own boundaries and state.
- Workflow vs. agent: if execution paths are fixed in code, you have a workflow; if the LLM chooses the next step, you have an agent. Labels are often blurred in products—pick the control model that fits risk and clarity, not hype.
- Common control patterns (usually combined): prompt chaining (linear stages + optional code gates); routing (classify input → specialized handlers/models); parallelization (shard work or run multiple samples for consensus); orchestrator–workers (decompose, delegate, merge); evaluator–optimizer (generate → score → revise until a quality bar is met).
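The minimal perceive → decide → act loop above stays thin when the harness only dispatches and records. A sketch under stated assumptions: `model` is any callable returning either `("tool", name, args)` or `("final", text)`; this decision format is invented for illustration, not a real provider API.

```python
def agent_loop(model, tools: dict, user_msg: str, max_turns: int = 8) -> str:
    """Minimal agent loop: the model decides, the harness executes and feeds back.
    The loop owns boundaries (turn limit, error capture); the model owns reasoning."""
    history = [("user", user_msg)]
    for _ in range(max_turns):
        decision = model(history)
        if decision[0] == "final":          # model stops with plain text
            return decision[1]
        _, name, args = decision            # otherwise it chose a tool
        try:
            result = tools[name](**args)
        except Exception as e:
            result = f"tool error: {e}"     # structured feedback, loop continues
        history.append(("tool", f"{name} -> {result}"))
    return "stopped: turn limit reached"
```

New behavior is added by registering tools or reshaping the prompt, not by growing this loop into a hand-written state machine.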
Core Patterns
- reasoning: CoT · BDI (Belief, Desire, Intention) · ReAct
- goal: passive goal creator · proactive goal creator
- planning: single / multi-path plan generator · plan and execute framework · graph-based control flow
- retrieval: RAG · knowledge and RAG enhancements
- reflection: self-reflection and refinement · cross-reflection · human reflection
- cooperation: voting / role / debate based · tool / agent registry
- execution: serial vs. parallel tool execution · tool execution sandbox · agent evaluator · multi-modal guardrails
- optimization: prompt / response optimizer
reference practical patterns:
Memory
- Functional layers (not just storage media):
- Working memory: current messages[] / window—tight, actively curated.
- Procedural memory: skills and SOPs—loaded on demand, not all at once.
- Episodic memory: append-only session logs (e.g. JSONL)—full trace for replay and search.
- Semantic memory: durable facts the agent curates (e.g. MEMORY.md)—injected when relevant.
- short-term vs. long-term memory
- storage backends: vector store · graph DB · relational DB · file systems
- structure: graph-based vs. tree-based
- Consolidation: when summarizing or compacting, archive originals and only advance pointers—failed consolidation should be recoverable, not a silent loss of evidence.
- A-MEM: Dynamic and Self-Evolving memory
- context-sizing control
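The consolidation rule above ("archive originals, only then advance pointers") fits in a few lines. A minimal sketch with invented record shapes; a real system would write the archive to disk before mutating the live log.

```python
import json

def consolidate(log: list[dict], archive: list[str], keep_last: int = 2) -> list[dict]:
    """Compact old entries into one summary record, but archive the originals
    first so a failed consolidation never silently loses evidence."""
    if len(log) <= keep_last:
        return log
    old, recent = log[:-keep_last], log[-keep_last:]
    archive.append(json.dumps(old))          # originals preserved verbatim
    summary = {"role": "summary",
               "content": f"{len(old)} earlier entries; topics: "
                          + ", ".join(sorted({e.get("topic", "?") for e in old}))}
    return [summary] + recent                # pointer advances only after archive
```

If the summarization step (here a trivial topic list, in practice an LLM call) produces garbage, the archived originals make the compaction reversible.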
Harness, verification, and autonomy
- Harness (often beats “just use a bigger model” for code-like tasks): acceptance baselines (what “done” means), execution boundaries (sandbox, paths, permissions), feedback signals (tests, linters, traces), and rollback / retry paths. Push work toward clear goals + automatable checks; ambiguous goals with strong automation just fail faster in the wrong direction.
- Agent-first engineering habits (OpenAI-style): keep ground truth in the repo (short AGENTS.md index + deep docs), encode rules in CI/linters/types instead of hoping prompts are read, aim for end-to-end autonomous repair loops where the agent can verify its own changes against telemetry.
- Long tasks: externalize progress—structured files (JSON feature lists, progress logs), initializer vs. coding agent splits, one in_progress task at a time, resume from disk after crashes. Slow I/O: offload to background work + inject results between turns instead of blocking the core loop.
- Security before features: allowlists, workspace path checks, audited shell, prompt-injection aware design (mark untrusted content, minimize dangerous tools, confirm sensitive sinks), provider fallbacks for outages.
Tools & Skills
- Tool design (ACI / agent-computer interface): shape tools around agent goals, not raw REST surface area—fewer, higher-level actions beat many micro-calls. Pair schemas with concrete examples; return structured errors with fix hints, not opaque strings.
- Evolving tool stacks: static giant tool dumps → tool search / discovery → programmatic orchestration (code glues tools; intermediate data stays out of the LLM) → example-rich definitions for reliability.
- Debugging order: when tools misfire, fix descriptions and boundaries first, then revisit model choice. Trim tools that are better as shell, static docs, or skills.
- Framework vs. LLM messages: keep rich internal events out of the model transcript; filter to standard roles/content before each API call.
- tool-call and skills management
- code execution · html / web-page generation · browser-use · VM use · web search
- multi-step workflow
- traditional multi-step workflow
- agent skills (fixed patterns as sub-agents) — skills best practices for engineering
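The "structured errors with fix hints, not opaque strings" principle above can be made concrete with a small dispatcher. A hypothetical sketch: the registry shape (`required` list plus `fn`) and the error vocabulary are invented for illustration.

```python
def run_tool(registry: dict, name: str, args: dict) -> dict:
    """Dispatch a tool call; errors come back structured with a fix hint
    the model can act on, instead of an opaque failure string."""
    if name not in registry:
        # Cheap substring match as a stand-in for real fuzzy tool lookup.
        close = [t for t in registry if name in t or t in name]
        return {"ok": False, "error": "unknown_tool",
                "hint": f"available: {sorted(registry)}; did you mean {close}?"}
    spec = registry[name]
    missing = [p for p in spec["required"] if p not in args]
    if missing:
        return {"ok": False, "error": "missing_args",
                "hint": f"required parameters: {spec['required']}, missing: {missing}"}
    return {"ok": True, "result": spec["fn"](**args)}
```

A model that receives `missing_args` with the exact parameter names usually self-corrects on the next turn; a bare stack trace rarely has that effect.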
Agentic Flow & Interface
- agentic-flow prompting
- ReAct agent
- reflection × planning × action
- RPA loop: perception × reasoning × action
- Effective HITL (Human in the Loop)
- user-interface customization
- continuous learning loop (telemetry → evals → prompt / knowledge updates)
Reliability & Safety
- human-in-the-loop (HITL) · basic principles for agent building
- hallucination prevention and mitigation
- Evaluation discipline
- Objects: task (what to do) · trial (one run) · grader (how to score); separate transcript (what was said/done in the loop) from outcome (what changed in the environment). Cover both to catch “talked success” vs. real effects.
- Pass@k (capability probing with multiple samples) vs Pass^k (regression-style repeated checks)—don’t mix interpretations.
- Prefer code graders when answers are checkable; use model/human judges where semantics matter; calibrate automated judges with human spot checks.
- If scores move oddly, debug the harness first (flaky environments, grader bugs, stale tasks) before rewriting the agent—bad evals send you chasing ghosts.
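The Pass@k vs. Pass^k distinction above is easy to get wrong, so here are both as code. `pass_at_k` is the standard unbiased estimator from the HumanEval paper (Chen et al.); `pass_pow_k` is the simple reliability product.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn from n trials (c of them correct) succeeds. Capability probe."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p: float, k: int) -> float:
    """pass^k: probability that k independent repeated runs ALL succeed.
    A regression-style reliability check, not a capability probe."""
    return p ** k
```

Note how they diverge: a model passing half its trials looks decent on pass@4 (capability) but terrible on pass^4 (reliability), which is exactly why mixing the interpretations misleads.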
- safety, security, compliance, governance
- content filters · PII redaction · secure key management
- prompt injection defenses · retrieval hygiene · tool permissioning
- policy layers (allow/deny lists) · sensitive actions with human approval
- compliance: data retention · audit trails · red-team exercises
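The policy-layer bullets above (allow/deny lists, human approval for sensitive actions) compose into one small check. A sketch with invented names; the `SENSITIVE` set is an example classification, not a standard.

```python
# Example classification of high-risk tools; real systems derive this from policy config.
SENSITIVE = {"shell", "send_email", "delete_file"}

def authorize(tool: str, allowlist: set[str], approved_by_human: bool = False) -> bool:
    """Policy layer: deny anything off the allowlist; sensitive tools
    additionally require explicit human approval before they run."""
    if tool not in allowlist:
        return False
    if tool in SENSITIVE and not approved_by_human:
        return False
    return True
```

Keeping this check in code (a hook the harness always runs) rather than in prompt prose means a prompt-injected instruction cannot talk the agent past it.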
Performance & Cost
- metrics: cost · latency · throughput · prompting logs · tool-call logs
- token budgeting · caching · short prompts · prompt cache
- reranking before generation · response compression · approximate search tuning
- distillation / routing to small models · speculative decoding
- SLAs with adaptive quality tiers · cost/perf dashboards
- Tracing & observability: persist full prompts, messages, tool calls/results, optional reasoning traces, tokens, and latency per run. Emit events (tool_start / tool_end / turn_end) once and fan out to logs, UI, and eval queues. Blend human sampling (to learn failure modes) with LLM-based trace scoring (for scale), using the former to calibrate the latter.
Anti-patterns (engineering)
- mega system prompt as the knowledge base instead of skills/files
- tool sprawl and overlapping names → routing confusion
- no verifiable “done” definition per task class
- multi-agent without isolation, protocols, or worktrees—un-debuggable state drift
- skipping memory consolidation on long sessions
- shipping changes without evals; letting the suite saturate without harder cases
- constraints only in prose—use hooks, tools, and automated checks
Multi-Agent Systems (MAS)
- topology: centralized vs. decentralized · hierarchical vs. flat · serial vs. parallel · supervisor vs. peer
- Collaboration mechanics: agree on a structured protocol (append-only JSONL inboxes, explicit statuses) + task graph + isolation (worktrees) before optimizing parallelism. Sub-agents should return summaries to parents; keep search/debug chatter in the child context to avoid cross-agent hallucination cascades—add cross-checks (second agent, tests, compilers) where stakes are high.
- memory sharing: Blackboard Model · state-based vs. memory-based
- storage: vector store · graph DB · relational DB
- communication protocol: end-to-end · broadcast · shared-memory channels
- tool invocation protocol: MCP (Model Context Protocol)
- human roles in the agentic loop: supervisor · loop participant · meta-agent
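The "append-only JSONL inboxes with explicit statuses" protocol above can be sketched directly. This is a toy in-memory stand-in (a real version appends to a file per agent); the message schema is an invented example.

```python
import json

class Inbox:
    """Append-only JSONL inbox: agents communicate via explicit, replayable
    messages with statuses instead of sharing mutable state."""
    def __init__(self):
        self.lines: list[str] = []  # each line is one immutable JSON record

    def send(self, sender: str, task_id: str, status: str, summary: str) -> None:
        self.lines.append(json.dumps(
            {"from": sender, "task": task_id, "status": status, "summary": summary}))

    def pending(self) -> list[dict]:
        """Tasks not yet marked done; derived by replaying the full log."""
        msgs = [json.loads(line) for line in self.lines]
        done = {m["task"] for m in msgs if m["status"] == "done"}
        return [m for m in msgs if m["status"] != "done" and m["task"] not in done]
```

Because the log is append-only, any supervisor (or a human) can replay it to reconstruct exactly what each sub-agent reported, which is what makes multi-agent state drift debuggable.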
LLM Product Engineering
Classic Protocols
- MCP (Model Context Protocol)
- A2A (Agent2Agent Protocol) with ADK
- A2UI Protocol: widgets and components rendered from AI
- Ag-UI (Agentic UI Protocol)
- Agent to Editor (Client) Protocol
Frameworks
- ai-sdk (node / javascript)
- LangChain (python) / LangGraph (python)
- AutoGPT
- AgentOps
- MetaGPT
- CrewAI
- ...
| Feature | CrewAI | LangGraph | AutoGen |
|---|---|---|---|
| Primary Approach | Role-based / Team structure | Graph-based / State machine | Conversation-based interaction |
| State Management | Central Orchestrator | Strong-typed Stateful Graphs | Contextual Memory Engine |
| Task Allocation | Bidding Mechanism / Role | Predefined Node Transitions | Iterative Agent Dialogue |
| Complexity Level | Intuitive / Low-to-Moderate | Advanced / High Control | Modular / Moderate-to-High |
| Best Use Case | Cross-functional projects | Supply chain / Data pipelines | Software development / Coding |
Platforms
Model Services Vendors:
- Open Router
- Claude / Gemini / Grok / OpenAI / DeepSeek / ...
LLM Orchestration Platforms:
- OpenAI Agent Builder
- Dify / Coze
- n8n
- Gumloop (AgentHub)
Observation: Monitoring real-time agent actions, including tool usage and reasoning paths.
- LangSmith
Test and Evaluation
- Langfuse
- PromptFoo
LLM Deep Scenarios
AI-first product systems
VibeCoding
- basic principles and manifesto
OpenSource research:
- Gemini CLI
- Cursor
Manus - General Agentic System
patterns:
- monolithic
- pipeline sub-systems
- multi-agent sub-systems (MoA)
- hybrid mixed
info resources:
- domain-specific / public information retrieval
context:
- memory management
- context management / compress and optimize
plan strategies:
- static workflow
- intent to plan
- unified intent planning
OpenSource research:
- OpenManus
DeepResearch
- OpenResearch
NoteBook
- Google NotebookLM
MultiModal
- Gen Image
- Gen Video
- Gen Audio
- Gen 3D objects
Reference
trace
- (26-01-04) add more products and frameworks to the wiki
- (26-02-07) add more details about LLM Ops and Infra.
- (26-03-17) provide clean structure and content for the agentic section.
- (26-03-25) absorb agent architecture notes: loop vs workflow, harness, context layers, ACI tools, memory/consolidation, evals/traces, multi-agent protocols. post from X