# Python + LangChain Agent Rules
## Project Context
You are building LLM-powered features with LangChain — chains, RAG pipelines, tool-calling agents, and conversational workflows. Use LangChain Expression Language (LCEL) for all chain composition. Keep prompts versioned, LLM calls observable, and agent execution bounded.
## Code Style & Structure
- Use Python 3.12+ type hints. Use Pydantic v2 models for all input/output schemas.
- Follow PEP 8. Format with `ruff format`. Lint with `ruff check`.
- Prefer `async def` for I/O-bound LLM and retrieval calls. Use `asyncio.gather` for parallel retrieval.
- Keep chain definitions (`src/chains/`) separate from API handlers, CLI scripts, and orchestration logic.
- Store prompt templates in `src/prompts/` as dedicated modules. Never define prompts inline in route files.
- Keep one tool per file in `src/tools/`. Tool docstrings must be precise — the LLM reads them to decide when to call the tool.
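The async guideline above can be sketched as follows — the retrieval coroutines are hypothetical stand-ins; real code would await `retriever.ainvoke(query)` on LangChain retriever objects the same way:

```python
import asyncio

# Hypothetical retrieval coroutines; each stands in for an I/O-bound
# vector-store or API call that benefits from running concurrently.
async def search_docs(query: str) -> list[str]:
    await asyncio.sleep(0)  # placeholder for real network I/O
    return [f"doc for {query}"]

async def search_faq(query: str) -> list[str]:
    await asyncio.sleep(0)
    return [f"faq for {query}"]

async def retrieve_all(query: str) -> list[str]:
    # Run both retrieval calls concurrently instead of sequentially.
    docs, faqs = await asyncio.gather(search_docs(query), search_faq(query))
    return docs + faqs
```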
## Project Structure
```
src/
  chains/         # LCEL chain definitions per use case
  prompts/        # ChatPromptTemplate definitions + few-shot examples
  tools/          # @tool-decorated functions, one per file
  retrieval/
    indexing.py   # Document loading, chunking, embedding, upsert
    retrieval.py  # Retriever construction, hybrid search, reranking
  schemas/        # Pydantic output models for structured LLM responses
  config.py       # Settings(BaseSettings) for model names, temperatures
  callbacks/      # Custom callback handlers for observability
```
## Chain Composition (LCEL)
- Build all chains with the pipe operator: `chain = prompt | model | output_parser`.
- Use `RunnableParallel` for concurrent branches: `RunnableParallel(context=retriever, question=RunnablePassthrough())`.
- Use `RunnablePassthrough.assign(key=fn)` to add computed keys to the chain's running dict.
- Add `.with_fallbacks([backup_chain])` on production chains. LLM API errors must not surface as 500s.
- Add `.with_retry(stop_after_attempt=3, wait_exponential_jitter=True)` on model invocations for transient errors.
- Use `.with_structured_output(PydanticModel)` to enforce typed LLM responses. Always validate with Pydantic.
- Never use legacy `LLMChain`, `ConversationChain`, or `SequentialChain` — they are superseded by LCEL.
## Prompt Engineering
- Define all prompts with `ChatPromptTemplate.from_messages([(role, template), ...])`.
- Use `MessagesPlaceholder('history')` for conversation history injection.
- Never concatenate user input directly into prompt strings — always use template variables `{variable}`.
- Version prompt templates in code. Include a comment header with: purpose, expected inputs, output format.
- Define few-shot examples in `FewShotChatMessagePromptTemplate` with a `SemanticSimilarityExampleSelector`.
- Keep system prompts focused: define the AI's role, constraints, and exact output format.
## RAG & Retrieval
- Use `RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)` as the default. Tune per document type.
- Embed and store in a persistent vector store (pgvector, Pinecone, Chroma) — never in-memory for production.
- Build retrievers with `.as_retriever(search_type='mmr', search_kwargs={'k': 5, 'fetch_k': 20})` for diversity.
- Implement hybrid search with `EnsembleRetriever(retrievers=[bm25_retriever, vector_retriever], weights=[0.4, 0.6])` — note `EnsembleRetriever` takes keyword arguments.
- Always add a reranking step: use a cross-encoder or `CohereRerank` before passing docs to the LLM.
- Add metadata to documents at indexing time (source, date, section). Use it for retrieval filtering.
- Use `create_retrieval_chain(retriever, question_answer_chain)` for standard RAG. Return source documents.
## Memory & Conversation History
- Use `RunnableWithMessageHistory` for conversational chains. Store history in Redis or PostgreSQL — never in memory.
- Use `trim_messages(messages, max_tokens=4000, token_counter=model)` to fit history within context window.
- Pass explicit `session_id` to support multiple concurrent conversations per user.
- For long-running sessions, periodically fold older messages into a running summary with a dedicated summarization chain (the legacy `ConversationSummaryBufferMemory` is deprecated).
## Agents & Tools
- Use `create_react_agent` or `create_tool_calling_agent` (for models with native tool/function calling). Avoid the legacy `initialize_agent`.
- Define tools with `@tool` decorator. The docstring is the tool description the model receives — be precise.
- Use Pydantic models as `args_schema` on tools for structured, validated tool inputs.
- Limit the agent's toolset to 5–10 focused tools. More tools degrade selection accuracy.
- Set `max_iterations=10` and `handle_parsing_errors=True` on `AgentExecutor`. Unbounded agents are a production risk.
- Validate tool inputs before execution. Sanitize tool outputs before passing them back to the agent.
## Error Handling
- Catch `OutputParserException` and retry with a repair prompt: add the failed output and a correction instruction.
- Handle `langchain_core.exceptions.LangChainException` subclasses, and catch provider-specific exceptions (rate limits, API errors, context length exceeded) at the model-call boundary.
- Log full invocation traces with LangSmith or a custom `BaseCallbackHandler` that records input, output, and latency.
- Validate structured outputs against the expected Pydantic schema before returning them downstream.
## Cost & Observability
- Log `token_usage` from every LLM response in a callback handler. Aggregate cost per pipeline, user, and day.
- Use small/fast models (GPT-4o-mini, Claude Haiku) for classification, routing, and structured extraction.
- Reserve large models (GPT-4o, Claude Opus) for complex reasoning tasks. Default to the smaller model first.
- Cache deterministic LLM calls with `set_llm_cache(SQLiteCache('.cache.db'))` in development.
- Set `max_tokens` on every model call to prevent unexpectedly long, costly completions.
## Testing
- Use `FakeListChatModel` or `FakeListLLM` for unit tests. Test chain logic, not LLM behavior.
- Test retrieval quality: measure recall@5 on a golden document-question dataset.
- Use `pytest-asyncio` for async chain tests. Test with `await chain.ainvoke(inputs)`.
- Mock vector store calls in retrieval unit tests. Test the full RAG pipeline integration-style against a small corpus.