đź’  AI First Product Engineer Wiki

March 25, 2026

An attempt to craft my own wiki for the AI era.

Theory and Foundation Layer

  • math fundamentals
  • CS and programming fundamentals
  • AI fundamentals
  • LLM fundamentals

Basic Theory and Patterns Evolution

| Stage | Technical Focus | Capability Shift | Core Architectural Constraint |
|---|---|---|---|
| Stage 1 | Transformers / MoE | Large-scale language processing | Lack of intent alignment or reasoning |
| Stage 2 | Instruction Fine-Tuning | Improved alignment with user goals | Brittle across diverse or novel tasks |
| Stage 3 | RLHF | Human-centric value alignment | Highly dependent on human evaluation |
| Stage 4 | Tool Integration | Active capability via external APIs | Lack of autonomous planning/memory |
| Stage 5 | RAG | Real-time factual grounding | Static knowledge and grounding issues |
| Stage 6 | Single-Agent Autonomy | Autonomous planning and execution | Limited to sequential, linear problem-solving |
| Stage 7 | Multi-Agent Collaboration | Distributed, specialized orchestration | High coordination and state complexity |
| Stage 8 | Persistent Expert Agents | Long-term learning and domain expertise | Ongoing research into self-evolving memory |

LLM Ops

  • MLOps: Abstract out the common computing/storage layer, taking care of capacity, scheduling, scaling, and load balancing.
    • compute layer: GPU cluster setup and management to fully utilize the hardware
    • Scaling: Automatic and seamless scaling up and down, from on-premise to cloud when needed
    • Scaling: Support multiple models with dynamic model loading
    • Operations: Monitor usage of computing resources, and status of training and inference jobs
    • Operations: Generate data for usage stats and metrics dashboard, and alert when anomaly detected
    • Platform: Training / Fine-tuning: improve training throughput, reliability and efficiency
    • Platform: Inference: Leverage the latest and most efficient open source framework for LLM inference to reduce latency and improve throughput
    • Platform: Evaluation and Benchmarking: automatically evaluate models' performance on datasets of interests
    • Platform: A/B Testing: capability for online A/B testing to compare features
  • Unified AI Gateway: Abstract out the common API/SDK layer, taking care of authentication, authorization, rate limiting, error handling, logging, monitoring, and alerts.
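
A minimal sketch of the unified-gateway idea, assuming hypothetical names (`UnifiedGateway`, `model_fn`): one wrapper owns authentication, a token-bucket rate limit, latency logging, and error surfacing, so every provider call goes through the same chokepoint.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai-gateway")

class RateLimitError(Exception):
    pass

class UnifiedGateway:
    """Hypothetical gateway: auth + token-bucket rate limiting + logging around any model call."""

    def __init__(self, api_keys, rate_per_sec=5, burst=10):
        self.api_keys = set(api_keys)
        self.rate, self.burst = rate_per_sec, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def _take_token(self):
        # Refill the bucket based on elapsed time, then spend one token per call.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            raise RateLimitError("rate limit exceeded")
        self.tokens -= 1

    def call(self, api_key, model_fn, prompt):
        if api_key not in self.api_keys:
            raise PermissionError("unknown API key")
        self._take_token()
        t0 = time.monotonic()
        try:
            return model_fn(prompt)  # provider-specific call plugged in here
        finally:
            log.info("model call took %.3fs", time.monotonic() - t0)

gw = UnifiedGateway(api_keys={"k1"})
print(gw.call("k1", lambda p: p.upper(), "hello"))  # → HELLO
```

The same `call` boundary is a natural place to hang monitoring, alerting, and usage metering later.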

LLM Train

  • pre-training
  • post-training
  • LLM knowledge distillation

LLM Inference

  • GPU resource management
  • API / SDK encapsulation
  • rate limiting
  • error handling
  • logging
  • monitoring
  • alerts
  • notifications

LLM Fine-tune

  • prefix tuning, prompt tuning, and their variants
  • SFT
  • RLHF / RLAIF / DPO variants
  • LoRA and QLoRA variants
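
The LoRA idea above fits in a few lines: freeze the pretrained weight `W` and learn a low-rank update `A @ B` scaled by `alpha / r`. A minimal NumPy sketch (toy shapes, no training loop):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16):
    """Forward pass of a LoRA-adapted linear layer: y = x @ (W + (alpha/r) * A @ B)."""
    r = A.shape[1]
    return x @ (W + (alpha / r) * (A @ B))

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_in, d_out))     # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01  # small random init
B = np.zeros((r, d_out))               # B starts at zero, so the adapter is a no-op initially

x = rng.normal(size=(1, d_in))
# Zero-init B ⇒ output identical to the base model at the start of fine-tuning.
assert np.allclose(lora_linear(x, W, A, B), x @ W)
```

Only `A` and `B` (here 8×2 + 2×4 = 24 params) are trained, versus 32 in `W`; the saving grows with layer size, and QLoRA adds quantization of the frozen `W` on top.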

LLM RAG

basic patterns

  • dense vector-based RAG
  • sparse vector-based RAG
  • graph-based RAG

SOP

  • ingest documents; chunking and embedding strategies (including structured data)
  • recall with hybrid search
  • format, references and citations
  • re-rank, query-rewriting, multi-hop, graph or table augmentation
  • composable and modular RAG system architecture
  • domain-specific retrieval pipelines; continuous ingestion
  • quality metrics, evals and quality dashboards
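
The "recall with hybrid search" step usually merges a dense and a sparse ranking. One common merge is Reciprocal Rank Fusion; a self-contained sketch (doc ids are made up):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc; sum and re-sort."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d1", "d2", "d3"]  # ranking from the vector index (hypothetical ids)
sparse = ["d1", "d3", "d4"]  # ranking from BM25 / keyword search
print(rrf([dense, sparse]))  # docs ranked high in both lists win
```

RRF needs only ranks, not comparable scores, which is why it is a popular default before a dedicated re-ranker is added.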

LLM Prompt Engineering

  • prompt engineering best practices for humans → write the best prompts for your tasks
    • classic patterns: one-/few-shot, chain-of-thought, self-consistency, ReAct, etc.
  • prompt management (versioning, testing, validation, safety, etc.)
  • AI-driven prompt optimization (prompts refined automatically by AI)
    • DSPy, textGuard, promptWizard, GRAD-SUM, ell, StarGo ...
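
The few-shot and chain-of-thought patterns above are just prompt assembly; a minimal sketch (the template wording is illustrative, not a standard):

```python
def build_prompt(task, examples, question, cot=True):
    """Assemble a few-shot prompt; optionally append a chain-of-thought trigger."""
    lines = [task, ""]
    for q, a in examples:  # few-shot demonstrations
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines.append(f"Q: {question}")
    # The classic zero-shot CoT trigger phrase; drop it for a direct answer.
    lines.append("A: Let's think step by step." if cot else "A:")
    return "\n".join(lines)

p = build_prompt(
    "Answer the arithmetic question.",
    [("2 + 2", "4"), ("3 * 3", "9")],
    "7 * 6",
)
print(p)
```

Tools like DSPy automate exactly this layer: the demonstrations and instructions become optimizable parameters rather than hand-written strings.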

Agentic System Context Engineering

Why it matters: attention cost grows with context length; when noise dominates, decision quality drops—often called context rot. Many “model weakness” issues trace to how context is packed, not raw window size.

  • Layer by stability and frequency (keep each layer doing one job):
    • Resident: identity, project rules, hard prohibitions—short, stable, executable every turn.
    • On-demand: skills and domain playbooks—index in prompt, load full text only when matched.
    • Runtime inject: time, channel IDs, user prefs—append after stable prefixes.
    • Memory: cross-session facts (e.g. MEMORY.md)—retrieve, do not dump everything by default.
    • System / hooks: deterministic checks (linters, guards)—not repeated prose in the prompt.
  • Write context: memories · state · scratch-pad · file-backed artifacts for large tool JSON (filesystem as the context interface).
  • Select context: tools retrieval · docs / knowledge retrieval · memory retrieval
    • mem0 example for long-term memory management
  • Compress context (pick strategy for the failure mode):
    • sliding window (cheap, loses early decisions)
    • LLM summary / branch summarization (keep architecture decisions, open work, constraints)
    • tool-result compaction (replace bulky outputs with pass/fail + pointers; preserve identifiers verbatim)
  • Prompt caching: stable prefixes (system prompt, tool defs, long docs) cache best; put volatile content after stable blocks; volatile tool sets hurt hit rate.
  • Skills descriptors: treat them as routing conditions, not marketing copy—Use when / Don’t use when, concrete counterexamples; load one skill when clearly matched.
  • Isolate context: in state · environment / sandbox · partitions among agents

Make agents select tools to organize and manage runtime context (CRUD can be agent-driven, but deterministic rules stay in code or hooks).
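
A minimal sketch of the layering above, with hypothetical content: stable resident text first (cache-friendly prefix), at most one matched skill, then volatile runtime facts and retrieved memory after the stable blocks.

```python
RESIDENT = "You are the project assistant. Never delete files outside the workspace."

SKILLS = {  # on-demand layer: indexed by name, full text loaded only when matched
    "deploy": "SKILL deploy: run tests, build, then push the release tag.",
    "review": "SKILL review: check diffs for style and security issues.",
}

def assemble_context(user_msg, runtime_facts, memory_snippets):
    parts = [RESIDENT]                                   # stable prefix: best for prompt caching
    matched = [s for name, s in SKILLS.items() if name in user_msg.lower()]
    parts += matched[:1]                                 # load one skill when clearly matched
    parts += [f"[runtime] {f}" for f in runtime_facts]   # volatile content after stable blocks
    parts += [f"[memory] {m}" for m in memory_snippets]  # retrieved, not dumped wholesale
    parts.append(f"[user] {user_msg}")
    return "\n\n".join(parts)

ctx = assemble_context("please deploy v2", ["time=2026-03-25"], ["user prefers dark mode"])
print(ctx.splitlines()[0])
```

Keyword matching stands in for whatever routing the skill descriptors drive; the point is the ordering and the "one job per layer" discipline.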

LLM Select

  • pick and compose right LLMs for the task
    • model family selection
      • open-source LLMs family
      • commercial LLMs family
    • latency, cost, throughput, quality, etc.
  • LLM parameters (tokens, top-p, temperature, etc.)
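
Temperature and top-p interact at the sampling step; a self-contained nucleus-sampling sketch over toy logits makes the mechanics concrete:

```python
import numpy as np

def sample_top_p(logits, top_p=0.9, temperature=0.7, rng=None):
    """Nucleus sampling: temperature-scale logits, keep the smallest set of tokens
    whose cumulative probability ≥ top_p, renormalize, then sample from that set."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # the "nucleus"
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

logits = np.array([2.0, 1.0, 0.2, -1.0])
token = sample_top_p(logits, top_p=0.9, temperature=0.7, rng=np.random.default_rng(0))
print(token)  # one of the two high-probability tokens
```

Lower temperature sharpens the distribution before truncation; lower top-p shrinks the nucleus. Providers expose both, so it helps to know which failure (repetition vs. incoherence) each knob addresses.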

LLM Agentic Systems

Runtime shape: loop, workflow, and control

  • Minimal agent loop: perceive → decide → act → feedback until the model stops with plain text. In mature stacks, the loop stays thin; new behavior is added via tools + handlers, prompt structure, and externalized state (files/DB), not by bloating the loop into a hand-written state machine. Let the model reason; let the harness own boundaries and state.
  • Workflow vs. agent: if execution paths are fixed in code, you have a workflow; if the LLM chooses the next step, you have an agent. Labels are often blurred in products—pick the control model that fits risk and clarity, not hype.
  • Common control patterns (usually combined): prompt chaining (linear stages + optional code gates); routing (classify input → specialized handlers/models); parallelization (shard work or run multiple samples for consensus); orchestrator–workers (decompose, delegate, merge); evaluator–optimizer (generate → score → revise until a quality bar is met).
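
The minimal loop described above can be sketched in a few lines, with a stub model standing in for a real LLM (the `{"tool": ..., "args": ...}` shape is an assumption, not any provider's wire format):

```python
import json

def run_agent(model, tools, user_msg, max_turns=8):
    """Thin loop: the model returns plain text (done) or a tool call;
    the harness executes the tool and feeds the result back."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = model(messages)
        if isinstance(reply, str):
            return reply  # plain text ⇒ the agent stops
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": reply["tool"], "result": result})})
    return "max turns reached"

# Stub model: calls the add tool once, then answers in plain text.
def stub_model(messages):
    if messages[-1]["role"] == "user":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return "the sum is " + json.loads(messages[-1]["content"])["result"]

print(run_agent(stub_model, {"add": lambda a, b: str(a + b)}, "add 2 and 3"))  # → the sum is 5
```

Note what the loop does not contain: no task-specific branches. New behavior arrives via the `tools` dict and prompt structure, which is the "keep the loop thin" point above.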

Core Patterns

  • reasoning: CoT · BDI (Belief, Desire, Intention) · ReAct
  • goal: passive goal creator · proactive goal creator
  • planning: single / multi-path plan generator · plan and execute framework · graph-based control flow
  • retrieval: RAG · knowledge and RAG enhancements
  • reflection: self-reflection and refinement · cross-reflection · human reflection
  • cooperation: voting / role / debate based · tool / agent registry
  • execution: serial vs. parallel tool execution · tool execution sandbox · agent evaluator · multi-modal guardrails
  • optimization: prompt / response optimizer

Reference practical patterns:

Memory

  • Functional layers (not just storage media):
    • Working memory: current messages[] / window—tight, actively curated.
    • Procedural memory: skills and SOPs—loaded on demand, not all at once.
    • Episodic memory: append-only session logs (e.g. JSONL)—full trace for replay and search.
    • Semantic memory: durable facts the agent curates (e.g. MEMORY.md)—injected when relevant.
  • short-term vs. long-term memory
    • storage backends: vector store · graph DB · relational DB · file systems
    • structure: graph-based vs. tree-based
  • Consolidation: when summarizing or compacting, archive originals and only advance pointers—failed consolidation should be recoverable, not a silent loss of evidence.
  • A-MEM: Dynamic and Self-Evolving memory
  • context-sizing control
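
A toy sketch of two of the layers above, with in-memory stand-ins for the files: an append-only episodic log (JSONL-per-event) plus curated semantic facts recalled by naive keyword overlap (a real system would use embeddings).

```python
import json, io

class Memory:
    """Sketch: episodic log (append-only JSONL) + semantic facts retrieved on demand."""

    def __init__(self):
        self.episodic = io.StringIO()  # stands in for a session .jsonl file
        self.semantic = []             # durable facts, e.g. lines of MEMORY.md

    def log_event(self, event):
        # Full trace kept for replay and search, never rewritten in place.
        self.episodic.write(json.dumps(event) + "\n")

    def remember(self, fact):
        self.semantic.append(fact)

    def recall(self, query):
        # Naive keyword overlap; inject only matching facts, not the whole store.
        words = set(query.lower().split())
        return [f for f in self.semantic if words & set(f.lower().split())]

m = Memory()
m.log_event({"turn": 1, "tool": "search"})
m.remember("user prefers Python for examples")
print(m.recall("which language does the user prefer"))
```

The separation matters for consolidation: summaries are derived from the episodic log, so a bad summary can always be rebuilt from the archived originals.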

Harness, verification, and autonomy

  • Harness (often beats “just use a bigger model” for code-like tasks): acceptance baselines (what “done” means), execution boundaries (sandbox, paths, permissions), feedback signals (tests, linters, traces), and rollback / retry paths. Push work toward clear goals + automatable checks; ambiguous goals with strong automation just fail faster in the wrong direction.
  • Agent-first engineering habits (OpenAI-style): keep ground truth in the repo (short AGENTS.md index + deep docs), encode rules in CI/linters/types instead of hoping prompts are read, aim for end-to-end autonomous repair loops where the agent can verify its own changes against telemetry.
  • Long tasks: externalize progress—structured files (JSON feature lists, progress logs), initializer vs. coding agent splits, one in_progress task at a time, resume from disk after crashes. Slow I/O: offload to background work + inject results between turns instead of blocking the core loop.
  • Security before features: allowlists, workspace path checks, audited shell, prompt-injection aware design (mark untrusted content, minimize dangerous tools, confirm sensitive sinks), provider fallbacks for outages.
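
The workspace-path and allowlist checks above are small but easy to get wrong (`../` escapes). A minimal sketch, assuming a hypothetical workspace root:

```python
from pathlib import Path

WORKSPACE = Path("/tmp/agent-workspace").resolve()  # hypothetical sandbox root
ALLOWED_CMDS = {"ls", "cat", "git"}                 # allowlist, not denylist

def check_path(p):
    """Reject paths that escape the workspace (including ../ tricks) before any file tool runs."""
    resolved = (WORKSPACE / p).resolve()
    if WORKSPACE not in [resolved, *resolved.parents]:
        raise PermissionError(f"path escapes workspace: {p}")
    return resolved

def check_cmd(cmd):
    if cmd.split()[0] not in ALLOWED_CMDS:
        raise PermissionError(f"command not allowlisted: {cmd}")

print(check_path("notes/plan.md"))
try:
    check_path("../../etc/passwd")
except PermissionError as e:
    print("blocked:", e)
```

These checks belong in code or hooks, not in the prompt: a deterministic guard cannot be talked out of its rules by injected content.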

Tools & Skills

  • Tool design (ACI / agent-computer interface): shape tools around agent goals, not raw REST surface area—fewer, higher-level actions beat many micro-calls. Pair schemas with concrete examples; return structured errors with fix hints, not opaque strings.
  • Evolving tool stacks: static giant tool dumps → tool search / discovery → programmatic orchestration (code glues tools; intermediate data stays out of the LLM) → example-rich definitions for reliability.
  • Debugging order: when tools misfire, fix descriptions and boundaries first, then revisit model choice. Trim tools that are better as shell, static docs, or skills.
  • Framework vs. LLM messages: keep rich internal events out of the model transcript; filter to standard roles/content before each API call.
  • tool-call and skills management
    • code execution · html / web-page generation · browser-use · VM use · web search
  • multi-step workflow

Agentic Flow & Interface

  • agentic-flow prompting
    • ReAct agent
    • reflection × planning × action
    • RPA loop: perception × reasoning × action
    • Effective HITL (Human in the Loop)
  • user-interface customization
  • continuous learning loop (telemetry → evals → prompt / knowledge updates)

Reliability & Safety

  • human-in-loop (HITL) · basic principles for agent build
  • hallucination prevention and mitigation
  • Evaluation discipline
    • Objects: task (what to do) · trial (one run) · grader (how to score); separate transcript (what was said/done in the loop) from outcome (what changed in the environment). Cover both to catch “talked success” vs. real effects.
    • Pass@k (capability probing with multiple samples) vs Pass^k (regression-style repeated checks)—don’t mix interpretations.
    • Prefer code graders when answers are checkable; use model/human judges where semantics matter; calibrate automated judges with human spot checks.
    • If scores move oddly, debug the harness first (flaky environments, grader bugs, stale tasks) before rewriting the agent—bad evals send you chasing ghosts.
  • safety, security, compliance, governance
    • content filters · PII redaction · secure key management
    • prompt injection defenses · retrieval hygiene · tool permissioning
    • policy layers (allow/deny lists) · sensitive actions with human approval
    • compliance: data retention · audit trails · red-team exercises
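
The Pass@k vs. Pass^k distinction above is easy to pin down in code. Pass@k uses the standard unbiased estimator (given `c` of `n` sampled attempts passed); Pass^k asks that all `k` runs pass:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: P(at least one of k sampled attempts passes),
    given that c of n independent attempts passed."""
    if n - c < k:
        return 1.0  # cannot draw k all-failing attempts
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p, k):
    """Pass^k: P(ALL of k independent runs pass) — a regression-style reliability bar."""
    return p ** k

print(round(pass_at_k(n=10, c=3, k=5), 3))  # → 0.917 (capability rises with more samples)
print(round(pass_pow_k(p=0.9, k=5), 3))     # → 0.59 (reliability falls as k grows)
```

The same 30% single-shot solve rate looks strong under Pass@5 and weak under Pass^5, which is why mixing the two interpretations misleads.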

Performance & Cost

  • metrics: cost · latency · throughput · prompting logs · tool-call logs
  • token budgeting · caching · short prompts · prompt cache
  • reranking before generation · response compression · approximate search tuning
  • distillation / routing to small models · speculative decoding
  • SLAs with adaptive quality tiers · cost/perf dashboards
  • Tracing & observability: persist full prompts, messages, tool calls/results, optional reasoning traces, tokens, and latency per run. Emit events (tool_start / tool_end / turn_end) once and fan out to logs, UI, and eval queues. Blend human sampling (to learn failure modes) with LLM-based trace scoring (for scale), using the former to calibrate the latter.
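
A minimal sketch of the "emit once, fan out" tracing pattern, with list-based sinks standing in for logs, UI streams, and eval queues:

```python
import time

class Tracer:
    """Emit each event exactly once; fan it out to any number of sinks."""

    def __init__(self, *sinks):
        self.sinks = sinks  # callables: log writer, UI stream, eval queue, ...

    def emit(self, kind, **fields):
        event = {"kind": kind, "ts": time.time(), **fields}
        for sink in self.sinks:
            sink(event)

log_sink, eval_queue = [], []  # stand-ins for real sinks
tracer = Tracer(log_sink.append, eval_queue.append)

tracer.emit("tool_start", tool="search", args={"q": "llm evals"})
tracer.emit("tool_end", tool="search", latency_ms=120, tokens=350)
tracer.emit("turn_end", outcome="answered")

print(len(log_sink), len(eval_queue))  # → 3 3
```

Because every sink sees the same events, human sampling and LLM-based trace scoring read from one stream, which keeps the calibration between them honest.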

Anti-patterns (engineering)

  • mega system prompt as the knowledge base instead of skills/files
  • tool sprawl and overlapping names → routing confusion
  • no verifiable “done” definition per task class
  • multi-agent without isolation, protocols, or worktrees—un-debuggable state drift
  • skipping memory consolidation on long sessions
  • shipping changes without evals; letting the suite saturate without harder cases
  • constraints only in prose—use hooks, tools, and automated checks

Multi-Agent Systems (MAS)

  • topology: centralized vs. decentralized · hierarchical vs. flat · serial vs. parallel · supervisor vs. peer
  • Collaboration mechanics: agree on a structured protocol (append-only JSONL inboxes, explicit statuses) + task graph + isolation (worktrees) before optimizing parallelism. Sub-agents should return summaries to parents; keep search/debug chatter in the child context to avoid cross-agent hallucination cascades—add cross-checks (second agent, tests, compilers) where stakes are high.
  • memory sharing: Blackboard Model · state-based vs. memory-based
    • storage: vector store · graph DB · relational DB
  • communication protocol: end-to-end · broadcast · shared-memory channels
  • tool invocation protocol: MCP (Model Context Protocol)
  • human roles in the agentic loop: supervisor · loop participant · meta-agent
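
The append-only JSONL inbox with explicit statuses can be sketched directly (an in-memory list stands in for the `.jsonl` file; field names are illustrative):

```python
import json

class Inbox:
    """Append-only JSONL inbox: agents communicate via messages with explicit statuses."""

    def __init__(self):
        self.lines = []  # stands in for an inbox.jsonl file

    def post(self, sender, task_id, status, summary):
        # Sub-agents return short summaries to parents; debug chatter stays in the child.
        self.lines.append(json.dumps(
            {"from": sender, "task": task_id, "status": status, "summary": summary}))

    def read(self, status=None):
        msgs = [json.loads(line) for line in self.lines]
        return [m for m in msgs if status is None or m["status"] == status]

inbox = Inbox()
inbox.post("researcher", "t1", "done", "found 3 relevant papers")
inbox.post("coder", "t2", "in_progress", "implementing retriever")
print([m["task"] for m in inbox.read(status="done")])  # → ['t1']
```

Append-only messaging keeps state drift debuggable: nothing is overwritten, so any agent's view can be replayed from the log.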

LLM Product Engineering

Classic Protocols

  • MCP (Model Context Protocol)
  • A2A (Agent-to-Agent Protocol) with ADK
  • A2UI Protocol: widgets and components rendered from AI
  • AG-UI (Agent-User Interaction Protocol)
  • Agent to Editor (Client) Protocol

Frameworks

  • ai-sdk (node / javascript)
  • LangChain (python) / LangGraph (python)
  • AutoGPT
  • AgentOps
  • MetaGPT
  • CrewAI
  • ...
| Feature | CrewAI | LangGraph | AutoGen |
|---|---|---|---|
| Primary Approach | Role-based / Team structure | Graph-based / State machine | Conversation-based interaction |
| State Management | Central Orchestrator | Strong-typed Stateful Graphs | Contextual Memory Engine |
| Task Allocation | Bidding Mechanism / Role | Predefined Node Transitions | Iterative Agent Dialogue |
| Complexity Level | Intuitive / Low-to-Moderate | Advanced / High Control | Modular / Moderate-to-High |
| Best Use Case | Cross-functional projects | Supply chain / Data pipelines | Software development / Coding |

Platforms

Model Services Vendors:

  • Open Router
  • Claude / Gemini / Grok / OpenAI / DeepSeek / ...

LLM Orchestration Platforms:

  • OpenAI Agent Builder
  • Dify / Coze
  • n8n
  • Gumloop (AgentHub)

Observation: Monitoring real-time agent actions, including tool usage and reasoning paths.

  • LangSmith

Test and Evaluation

  • Langfuse
  • PromptFoo

LLM Deep Scenarios

AI First product systems

VibeCoding

  • basic principles and manifesto

OpenSource research:

  • Gemini CLI
  • Cursor

Arno's BP for VibeCoding

Manus - General Agentic System

patterns:

  • monolithic
  • pipeline sub-systems
  • multi-agent sub-systems (MoA)
  • hybrid mixed

info resources:

  • domain-specific / public information retrieval

context:

  • memory management
  • context management / compress and optimize

plan strategies:

  • static workflow
  • intent to plan
  • unified intent planning

OpenSource research:

  • OpenManus

DeepResearch

  • OpenResearch

NoteBook

  • Google NotebookLM

MultiModal

  • Gen Image
  • Gen Video
  • Gen Audio
  • Gen 3D objects

Reference


trace

  • (26-01-04) add more products and frameworks to the wiki
  • (26-02-07) add more details about LLM Ops and Infra.
  • (26-03-17) provide clean structure and content for the agentic section.
  • (26-03-25) absorb agent architecture notes: loop vs workflow, harness, context layers, ACI tools, memory/consolidation, evals/traces, multi-agent protocols. post from X

Arno Crafting Apps

ELABORATION STUDIO 🦄

Elaborate your ideas and solve your problems with AI in a fully boosted-context way ~