Module 11: AI Engineering Ecosystem

Overview

This module gives you a comprehensive, opinionated map of the AI engineering ecosystem. By the end, you will know which framework to reach for and — more importantly — when to reach for none of them.

The honest truth: most AI framework decisions are made by developers who tried one thing first, got comfortable, and never revisited the choice. This module is designed to give you the intellectual scaffolding to make a deliberate decision every time.

1. Landscape Overview

The AI Engineering Stack

Think of the stack in three layers:

┌──────────────────────────────────────────────────────────┐
│                    MODEL PROVIDERS                        │
│  Anthropic (Claude)  •  OpenAI (GPT)  •  Google (Gemini) │
│  Meta (Llama, open weights)  •  Mistral  •  Cohere        │
└─────────────────────────┬────────────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────────────┐
│               ORCHESTRATION FRAMEWORKS                    │
│  LangChain  •  LangGraph  •  LlamaIndex  •  AutoGen       │
│  CrewAI  •  Semantic Kernel  •  OpenAI Assistants         │
└─────────────────────────┬────────────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────────────┐
│                    INFRASTRUCTURE                         │
│  Vector DBs: Chroma, Pinecone, Weaviate, pgvector         │
│  Observability: LangSmith, LangFuse, Arize, Helicone      │
│  Deployment: Modal, Fly.io, AWS Bedrock, Vertex AI        │
│  IDE Tools: Cursor, Copilot, Claude Code, Windsurf        │
└──────────────────────────────────────────────────────────┘

Each layer has its own pace of change. Model providers ship new capabilities every few months. Orchestration frameworks have a notoriously rapid API churn — LangChain alone has had three major architectural shifts since 2023. Infrastructure is the most stable layer, but even vector database APIs evolve.

Why the Ecosystem Changes So Fast

The capability surface is expanding. When GPT-4 launched, function calling did not exist. When Anthropic released Claude 3.5, the context window jumped to 200K. Every new model capability breaks old assumptions baked into frameworks.
The “right” abstraction is still unknown. Unlike web frameworks (where the request/response cycle is settled), nobody fully agrees on what an “agent” is, how memory should work, or whether you should model workflows as graphs, as conversations, or as code. Frameworks are hypothesis tests.
Competitive pressure. Every framework company needs to be the default choice before the market stabilizes. Rapid feature shipping creates rapid breakage.
The open source speed. Individual developers contribute integrations faster than any engineering team can review and stabilize them. Quality is uneven.

How to Stay Current

Follow the changelogs, not the blog posts. The blog posts celebrate features; the changelogs tell you what broke.
Watch GitHub issues, especially ones tagged breaking-change or deprecation.
Build a personal test harness: a small project you use to evaluate new versions before adopting them.
Treat each framework upgrade as a dependency update with a migration cost, not a free improvement.
Separate the stable from the unstable: the core concepts (chains, graphs, retrieval) are stable; the API surface is not.

How to Evaluate a Framework

When evaluating any AI framework, ask four questions:

1. Abstraction level — does this help or hide?
A good abstraction makes the common case easy while keeping the uncommon case possible. A bad abstraction makes the common case slightly easier while making the uncommon case impossible. LangChain’s RetrievalQA chain is an example of an abstraction that hides too much: it works fine until you need to customize retrieval logic, at which point you end up fighting the framework.

2. Lock-in — how hard is it to leave?
Framework lock-in in AI is real. If your entire prompt management strategy lives in LangChain’s PromptTemplate system, migrating to another provider is harder than it needs to be. Prefer frameworks that are “thin wrappers” around standard patterns. Avoid frameworks that force you to store data in their proprietary formats.

3. Debuggability — when it breaks, can you see why?
Production AI systems fail in subtle ways: wrong context retrieved, tool call malformed, agent loop stuck. A framework that gives you full visibility (LangSmith traces, structured logs) is worth more than one that saves you 50 lines of code but makes failures opaque. LangGraph gives you a full execution trace. LangChain’s older chain syntax gives you almost nothing.

4. Community — who is using this in production?
Stack Overflow questions, GitHub issues with real answers, and conference talks with production war stories are signals. A framework used only in tutorials and demos has not been battle-tested. LangChain has real production usage; newer frameworks may not.

2. LangChain

What It Is

LangChain is an abstraction layer for building LLM applications. It provides components for prompts, LLMs, output parsers, retrievers, memory, and agents — plus glue to compose them.

At its core, LangChain is betting that “the chain” is the right unit of composition for LLM apps. A chain takes an input, runs it through a sequence of operations (prompt formatting, LLM call, output parsing), and returns an output.

LCEL: LangChain Expression Language

LCEL is the modern way to build LangChain applications (introduced in 2023 to replace the older Chain classes). The syntax uses the pipe operator to compose Runnables:

chain = prompt | llm | parser
result = chain.invoke({"question": "What is RAG?"})

Every component in LCEL implements the Runnable interface, which means every component supports:

.invoke() — sync single call
.ainvoke() — async single call
.batch() — sync batch
.stream() — streaming output
.astream() — async streaming

This uniformity is LCEL’s main contribution. Before LCEL, each chain type had different interfaces and streaming required special handling.

Key Components

Prompts:

from langchain_core.prompts import ChatPromptTemplate
 
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}"),
])

LLM connectors:

from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI
 
llm = ChatAnthropic(model="claude-haiku-4-5-20251001")

Output parsers:

from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
 
# StrOutputParser: extracts the text content from an AIMessage
parser = StrOutputParser()
 
# JsonOutputParser: parses JSON from LLM output
json_parser = JsonOutputParser(pydantic_object=MyModel)

Retrievers and RAG plumbing:

from langchain_core.runnables import RunnablePassthrough
 
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

RunnablePassthrough is a no-op Runnable that passes its input through unchanged. It is the mechanism for “fan-in” in LCEL: you merge a retrieved context with the original question into a single dict, then pass that dict to the prompt.

Pros

Enormous ecosystem: 300+ integrations — LLM providers, vector databases, document loaders, tools. If something exists, LangChain probably has an integration.
Great for RAG: the retriever abstraction is well-designed. You can swap Chroma for Pinecone for pgvector without changing your chain.
Community resources: the most-documented AI framework. Tutorials, Stack Overflow answers, cookbook examples all exist.
LangSmith observability: tight integration with LangSmith for tracing and evaluation. If you are debugging a LangChain app, LangSmith is genuinely excellent.
LCEL streaming: once you adopt LCEL, streaming works uniformly across the entire chain without special handling.

Cons

Heavy abstraction: LangChain has over 500K lines of Python across its repositories. That is a lot of code standing between you and the model. When something goes wrong, you are debugging through multiple abstraction layers.
Debug hell: “Why did my LLM receive this prompt?” is a surprisingly hard question to answer in vanilla LangChain. You need LangSmith or verbose callbacks to see what is actually being sent.
Over-engineered for simple use cases: a Q&A over a single document does not need RetrievalQA. A five-line script using the Anthropic SDK directly is cleaner, faster, and easier to maintain.
Rapid API churn: LangChain has deprecated and replaced its core APIs multiple times. LLMChain → LCEL chains. ConversationalRetrievalChain → custom LCEL. Code you write today may not run in six months.
Implicit magic: LangChain does things invisibly. The document splitter has defaults you may not agree with. The retriever ranks by cosine similarity unless you override it. These defaults are invisible until they cause a problem.
Import confusion: langchain, langchain-core, langchain-community, langchain-anthropic — knowing which package contains which class is a recurring annoyance.

When to Use LangChain

You need 10+ integrations and do not want to write all the glue code yourself
Your team is already familiar with LangChain and migration cost is not worth it
You are building a RAG system and want to try multiple vector stores
You want LangSmith tracing without setting up your own observability infrastructure
The project is a prototype or a proof-of-concept where time-to-working matters more than maintainability

When NOT to Use LangChain

Simple use cases: a single-turn Q&A, a summarization endpoint, a classification task. Use the bare SDK.
Predictable behavior is critical: LangChain’s defaults can surprise you. In production, you want to know exactly what prompt is being sent to the model.
Debugging matters and you cannot afford LangSmith: LangChain without LangSmith is painful to debug.
Long-term maintenance: LangChain code written 18 months ago frequently needs significant updates. If you are building something that needs to last, prefer lower-level code.
Custom retrieval logic: LangChain’s retriever abstraction works well for standard cases but fights you when you need hybrid search, custom reranking, or dynamic retrieval strategies.

Code Example: Minimal RAG with LCEL

from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
 
vectorstore = Chroma.from_texts(
    ["LangChain provides LLM abstractions", "LCEL uses pipe syntax"],
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()
 
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)
llm = ChatAnthropic(model="claude-haiku-4-5-20251001")
 
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
 
print(chain.invoke("What is LCEL?"))

3. LangGraph

What It Is

LangGraph is LangChain’s graph-based orchestration layer for stateful, multi-step agents. Where LangChain builds chains (DAGs with no cycles), LangGraph builds graphs with cycles — enabling the retry loops, reflection steps, and conditional branches that real agents need.

The mental model: a LangGraph application is a state machine. You define a shared state dict, nodes that read and modify that state, and edges that determine which node runs next.

Core Concepts

State: a TypedDict that flows through the entire graph

from typing import TypedDict, Annotated
import operator
 
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # append-only list
    tool_calls: list
    final_answer: str

The Annotated[list, operator.add] pattern is LangGraph’s “reducer” syntax: instead of replacing the messages list on each update, nodes can append to it.

Nodes: plain Python functions that take state and return a partial state update

def call_llm(state: AgentState) -> dict:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}
 
def execute_tools(state: AgentState) -> dict:
    # parse tool calls from last message, execute them
    results = [tool.invoke(tc) for tc in state["messages"][-1].tool_calls]
    return {"messages": results}

Edges: connections between nodes

from langgraph.graph import StateGraph, END
 
graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tools", execute_tools)
graph.set_entry_point("llm")
graph.add_conditional_edges(
    "llm",
    lambda state: "tools" if state["messages"][-1].tool_calls else END
)
graph.add_edge("tools", "llm")
 
app = graph.compile()

Why Cycles Matter

A DAG (directed acyclic graph) cannot retry. If your LLM call fails, or the LLM says “I need more information” after using a tool, a DAG has no way to loop back and try again. This is the fundamental limitation of LangChain’s chain model for agentic workflows.

LangGraph’s cycles enable:

Retry loops: if a tool fails, loop back to the LLM to decide what to do next
Reflection: the agent can critique its own output and revise
Multi-turn tool use: the agent can call tools, observe results, call more tools, and only return when done
Human-in-the-loop: pause execution at a checkpoint and wait for human input

Human-in-the-Loop

# Pause before executing a dangerous tool
app = graph.compile(interrupt_before=["execute_tools"])
 
# Run until the interrupt
result = app.invoke(initial_state, config={"configurable": {"thread_id": "1"}})
 
# Human reviews the planned tool calls, then resumes
app.invoke(None, config={"configurable": {"thread_id": "1"}})

The interrupt_before parameter pauses the graph before the specified node runs. The graph state is saved to a checkpointer. A human can inspect the state, modify it, and then resume by calling invoke again with the same thread_id.

Persistence: Checkpointers

A checkpointer serializes and stores the graph state at each step. This enables:

Resuming interrupted runs: if your server crashes mid-execution, you can resume from the last checkpoint
Multi-session agents: a user can start a conversation, close their browser, and pick up where they left off
Debugging: you can replay any previous state to investigate failures

from langgraph.checkpoint.sqlite import SqliteSaver
 
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    app = graph.compile(checkpointer=checkpointer)
    result = app.invoke(
        {"messages": [HumanMessage(content="Hello")]},
        config={"configurable": {"thread_id": "session-123"}}
    )

Pros

Production-grade: LangGraph is the most mature framework for complex stateful agents
Explicit control flow: the graph structure makes execution order auditable and debuggable
Streaming support: stream individual node outputs as they complete
Human-in-the-loop is first-class: not bolted on, it is a core design principle
LangSmith integration: full trace visibility into every node execution
Cycles: the only mainstream Python framework that properly handles agentic loops

Cons

Still coupled to LangChain: LangGraph uses LangChain’s message types, runnables, and tool calling conventions. If you are avoiding LangChain, LangGraph is not an escape.
Learning curve: the state reducer pattern, the graph compilation model, and the checkpointer abstraction all require time to internalize.
Overhead for simple tasks: a two-step pipeline does not need a state graph. The verbosity of LangGraph is only justified for truly complex workflows.
Relatively young: LangGraph reached 1.0 in early 2024. Production patterns are still being established.

When to Use LangGraph

Complex stateful agents with multiple tools and conditional paths
Any agent that needs to loop (retry, reflection, iterative refinement)
Workflows requiring human approval or human-in-the-loop checkpoints
Long-running agent sessions that need to be resumable
When you need an explicit, auditable execution trace

4. LlamaIndex

What It Is

LlamaIndex is a data framework for LLM applications. While LangChain tries to be a general-purpose LLM toolkit, LlamaIndex has a laser focus: making it easy to connect your data to LLMs for retrieval-augmented generation.

If your primary problem is “I have a lot of data and I need an LLM to reason over it,” LlamaIndex is almost certainly the better choice than LangChain.

Core Data Abstractions

Documents and Nodes:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
 
# A Document is a raw piece of content
doc = Document(text="LlamaIndex is a data framework...", metadata={"source": "docs"})
 
# A Node is a chunk of a document with metadata and relationships
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents([doc])
# Each Node knows its parent Document and neighboring Nodes

The Node abstraction is more sophisticated than LangChain’s. Nodes track their source document, their position within it, and their relationships to neighboring nodes. This metadata flows into retrieval and makes it possible to do context-aware chunking.

Index Types:

VectorStoreIndex: stores embeddings, retrieves by semantic similarity. The default for most RAG use cases.
SummaryIndex: stores documents as a flat list, retrieves by summarizing all documents. Good for small corpora.
KnowledgeGraphIndex: extracts entities and relationships into a graph. Good for structured knowledge.
TreeIndex: builds a hierarchical tree of summaries. Good for very long documents.

Query Engines:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
 
documents = SimpleDirectoryReader("data/").load_data()
index = VectorStoreIndex.from_documents(documents)
 
# A QueryEngine combines retrieval + synthesis
query_engine = index.as_query_engine()
response = query_engine.query("What is the main argument?")
print(response)           # synthesized answer
print(response.source_nodes)  # retrieved chunks that were used

The source_nodes attribute is a key advantage: every response includes the retrieved nodes that produced it, making hallucination audits possible.

Data Connectors:
LlamaIndex has a rich library of data loaders (llama-hub) covering PDFs, web pages, Notion, Google Docs, databases, S3, GitHub, and many more. The quality varies but the breadth is excellent.

Advanced RAG Features

LlamaIndex has the best out-of-the-box support for advanced RAG techniques:

Sentence window retrieval: retrieve individual sentences but include neighboring context
Auto-merging retrieval: retrieve at the leaf node level but merge back to parent nodes when coverage is sufficient
Recursive retrieval: use an LLM to decompose complex queries before retrieval
Re-ranking: integrate cross-encoders (Cohere Rerank, etc.) to rerank retrieved results

Pros

Best RAG abstractions: LlamaIndex was designed for RAG from the ground up. The Document/Node/Index model is more thoughtful than LangChain’s.
Data connectors: the breadth of data loaders is unmatched
Response with provenance: every query response includes source nodes, enabling verifiable answers
Advanced RAG support: sentence window, auto-merging, recursive retrieval — all available out of the box
Flexible index types: different indexes for different retrieval strategies without changing the query interface

Cons

More complex for simple chains: if you just want to call an LLM, LlamaIndex is heavier than the bare SDK
Less general-purpose: LlamaIndex is excellent for retrieval but less suited for general-purpose agent workflows
Smaller ecosystem: fewer integrations than LangChain for non-retrieval use cases
Can feel over-engineered: the multi-level abstraction (Document → Node → Index → QueryEngine → Response) is powerful but requires investment to understand

When to Use LlamaIndex

Data-heavy applications where retrieval quality is the primary concern
Complex RAG pipelines with custom chunking, reranking, or query decomposition
When you need rich provenance and source attribution in responses
Multi-document reasoning over heterogeneous data sources
When LangChain’s RAG abstractions have let you down

5. AutoGen (Microsoft)

What It Is

AutoGen is Microsoft’s multi-agent conversation framework. Its central insight is simple but powerful: use conversations as the coordination mechanism between agents.

Instead of explicitly defining a workflow graph (like LangGraph), AutoGen lets agents coordinate by talking to each other. An assistant agent proposes a plan; a code executor agent executes the code; a critic agent reviews the output. The coordination emerges from the conversation.

Core Agent Types

AssistantAgent: backed by an LLM. Generates responses, plans, code.

from autogen import AssistantAgent
 
assistant = AssistantAgent(
    name="assistant",
    llm_config={"model": "gpt-4", "api_key": "..."},
    system_message="You are a helpful AI assistant."
)

UserProxyAgent: represents a human or a code execution environment. Can run code automatically.

from autogen import UserProxyAgent
 
executor = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",  # fully automated
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

The basic pattern:

executor.initiate_chat(
    assistant,
    message="Write a Python script that fetches the current Bitcoin price."
)

The executor sends the message to the assistant, the assistant generates code, the executor runs it, the executor sends the output back to the assistant, and the loop continues until the task is done or a termination condition is met.

Group Chat

For multiple agents:

from autogen import GroupChat, GroupChatManager
 
groupchat = GroupChat(
    agents=[planner, coder, critic],
    messages=[],
    max_round=10,
    speaker_selection_method="auto",  # LLM picks who speaks next
)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)
user.initiate_chat(manager, message="Build a web scraper for Hacker News.")

Pros

Natural coordination: for workflows that are inherently conversational, AutoGen feels natural. The agents exchange messages the way humans would.
Code generation workflows: AutoGen is the best framework for “write code, run it, fix it” loops. The UserProxyAgent’s code execution is seamless.
Flexible termination: conversations can end when a condition is met, when a human says so, or after a maximum number of rounds.
Group chat dynamics: multiple agents with different roles converging on a solution is powerful for complex tasks.

Cons

Less structured than LangGraph: the conversation flow in AutoGen is emergent rather than explicitly defined. This is great for flexibility but bad for predictability. In production, “the agents will figure it out” is not a reliable architecture.
Debugging group chats is hard: when a group chat goes off track, diagnosing which agent said the wrong thing and why is painful.
Not stateful by default: AutoGen conversations do not persist across sessions out of the box. Production use requires custom state management.
Microsoft ecosystem: tight coupling to Azure OpenAI and GPT models. Using Claude requires configuration.

When to Use AutoGen

Code generation and data analysis workflows where the “write-execute-fix” loop is the primary pattern
Research automation where you want agents to explore approaches conversationally
When conversation flow is the natural structure of the task (e.g., multi-turn negotiation between agents)
Rapid prototyping of multi-agent scenarios where predictability matters less than exploration

6. CrewAI

What It Is

CrewAI is a role-based multi-agent framework. Instead of defining agents by their technical properties (backed by LLM, can execute code), you define agents by their roles in an organization: a researcher, a writer, a reviewer, a project manager.

The role metaphor maps naturally onto business processes, which is CrewAI’s primary target market.

Core Concepts

Agent: has a role, goal, and backstory

from crewai import Agent
 
researcher = Agent(
    role="Research Analyst",
    goal="Find and synthesize accurate information on {topic}",
    backstory="You are an expert researcher with 10 years of experience...",
    tools=[search_tool, web_scraper],
    llm="claude-haiku-4-5-20251001",
    verbose=True,
)

Task: a unit of work assigned to an agent

from crewai import Task
 
research_task = Task(
    description="Research the current state of {topic} and produce a 500-word summary.",
    expected_output="A 500-word summary with bullet points.",
    agent=researcher,
)

Crew: a collection of agents and tasks

from crewai import Crew, Process
 
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,  # or Process.hierarchical
)
result = crew.kickoff(inputs={"topic": "quantum computing"})

Process Types

Sequential: tasks run one after another, each receiving the output of the previous. Simple and predictable.

Hierarchical: a manager agent (LLM-backed) decides which agent should work on which task and in what order. More flexible, more unpredictable.

The hierarchical process is a good example of a feature that sounds powerful and often creates problems: the manager agent’s decisions can be inconsistent, making the overall workflow unpredictable. For production use, sequential is almost always better.

Pros

Easy to get started: the role/task/crew mental model is intuitive and maps onto how humans organize work
Good documentation: CrewAI has invested in tutorials and examples
Fast prototyping: defining a crew of specialized agents with clear roles is quick
Tool integration: CrewAI integrates with LangChain tools, so you get access to a large tool ecosystem

Cons

Less control than LangGraph: the execution flow is determined by the process type (sequential or hierarchical), not by explicit graph edges. You cannot define custom conditional logic easily.
The hierarchical manager is unreliable: the manager agent uses an LLM to decide task routing. LLM decisions are non-deterministic. Production systems should not have LLMs making routing decisions unless absolutely necessary.
Opinionated structure: CrewAI’s role metaphor works well for business processes but is a bad fit for technical workflows.
Smaller community: fewer production case studies than LangChain or LangGraph.
Limited observability: no first-party observability equivalent to LangSmith.

When to Use CrewAI

Business process automation: market research → drafting → review workflows
Quick multi-agent prototypes where you want to demonstrate the concept
Teams that prefer high-level abstractions and are not building production agents
When the task is naturally role-based (different types of expertise needed)

7. OpenAI Assistants API

What It Is

The OpenAI Assistants API provides managed infrastructure for stateful AI applications. Instead of building your own state management, context window management, and tool execution, you use OpenAI’s managed service.

Architecture

The three core objects:

Thread: a conversation session. Stores all messages. Persists across requests.
Message: a single message in a Thread, from the user or the assistant.
Run: the execution of an Assistant on a Thread. You create a Run, poll it until it completes, and read the resulting messages.

User → creates Message in Thread
     → creates Run on Thread
     → polls Run until completed / requires_action
     → if requires_action: submit tool outputs
     → reads final Messages from Thread

Built-in Tools

Code Interpreter: runs Python code in a sandboxed environment. The Assistant can write and execute code to answer math questions, analyze data, generate charts.
File Search: a managed RAG pipeline. Upload files, the API handles chunking, embedding, and retrieval.

When to Use

You want managed state without building it yourself
Your team does not have the engineering capacity to maintain a retrieval pipeline
Compliance requirements favor a managed service over self-hosted components
You are building a product on top of GPT-4 and do not want to manage conversation history

Cons

Vendor lock-in: Assistants API is OpenAI-only. Moving to Claude or Gemini means rewriting your entire state management layer.
Less control: you cannot customize the retrieval strategy, the chunking logic, or the tool execution environment beyond what OpenAI exposes.
Cost: the file search tool charges per GB of vector storage per day. At scale, this adds up.
Polling: the Run model requires polling to check completion status. This is clunky for real-time applications (though streaming Runs help).
Opaque failures: when a Run fails, diagnosing why is harder than it would be with a self-hosted pipeline.

8. Semantic Kernel (Microsoft)

What It Is

Semantic Kernel is Microsoft’s AI framework, designed primarily for enterprise .NET applications. It has a Python SDK, but the .NET SDK is the first-class citizen.

Core Concepts

Plugins: named collections of functions (semantic functions backed by prompts, or native functions backed by code)
Planners: LLM-backed components that compose plugins to solve a goal (similar to AutoGen’s manager agent concept)
Memory: vector-backed semantic memory with a query interface

When to Use

You are in a Microsoft/Azure ecosystem and need native Azure integration
Your team writes C# and needs an AI framework with first-class .NET support
Enterprise requirements (Azure Active Directory, Azure Key Vault, compliance tooling) are driving the technology stack

When NOT to Use

Python-first teams: the Python SDK lags behind the .NET SDK
Startups: the enterprise abstraction overhead is not worth it for small teams
When you need the latest model capabilities: Semantic Kernel tends to lag behind model releases

9. IDE AI Tools

Cursor

Cursor is a VS Code fork with deep AI integration. It is the most capable AI IDE for day-to-day coding.

Key features:

Tab completion: context-aware, multi-line completions that understand your entire codebase
Inline edit (Cmd+K): select code, describe a change, Cursor applies it
Chat (Cmd+L): full conversation with access to codebase context
Composer: multi-file edits in a single interaction

Cursor Rules (.cursor/rules/): project-level instructions that are injected into every AI request. Use them to define coding standards, architectural patterns, and conventions. A good set of Cursor rules dramatically improves the quality of AI-generated code for your specific project.

# .cursor/rules/python.mdc
- Always use type annotations
- Prefer dataclasses over plain dicts for structured data
- Write docstrings for all public functions
- Use pathlib instead of os.path

When Cursor is the right choice:

You are doing rapid development and want the fastest feedback loop
Tab completion is a primary productivity driver
Your team is comfortable adopting a new IDE

Cursor’s limitation: it is still a VS Code fork. Deep IDE features (plugins, keybindings, extensions) may not work identically to VS Code. The AI features are excellent; the everything-else is good-but-not-perfect.

Windsurf (Codeium)

Windsurf is Codeium’s AI-first IDE, positioned as a Cursor alternative. The core “Cascade” feature is an agentic AI that can read files, run commands, and make multi-file changes.

Windsurf’s Cascade can be more autonomous than Cursor’s Composer — it will run shell commands and read error output without asking for confirmation. This is either a feature or a risk depending on your comfort level.

When Windsurf is the right choice:

You want a slightly more autonomous agentic experience than Cursor
You prefer Codeium’s pricing model over Cursor’s
Your team is already on Codeium for code completion

GitHub Copilot

The original AI coding assistant. Copilot is deeply integrated into VS Code, JetBrains, and Neovim — which is its primary advantage.

Key features:

Inline completion: ghost text that completes your current line/block
Copilot Chat: conversational interface within the IDE
Workspace context: /workspace command gives Copilot access to your full codebase

Copilot Instructions (.github/copilot-instructions.md): project-level instructions, similar to Cursor rules. Less flexible than Cursor’s rule system (single file, no per-directory rules), but sufficient for most cases.

When Copilot is the right choice:

Your team is in the GitHub ecosystem and GitHub Enterprise is already the platform
You need AI assistance across multiple IDEs (VS Code + JetBrains + Vim)
You want the path of least resistance: Copilot is one extension away in VS Code

Copilot’s limitation: the chat experience and multi-file editing capability lag behind Cursor. It is excellent for completion; it is less impressive for complex agentic tasks.

Claude Code

Claude Code is Anthropic’s terminal-based AI agent. Unlike IDE tools, Claude Code operates at the shell level: it reads files, runs commands, and makes changes to your codebase through tool calls.

Key strengths:

Complex multi-file refactoring: Claude Code excels at tasks like “rename this function everywhere and update all the tests” or “migrate this module to the new API”
Agentic task execution: Claude Code can run tests, read error output, and iterate until the tests pass — without a human in the loop
No IDE dependency: works in any terminal, on any server, in Docker containers

When Claude Code is the right choice:

Complex multi-file tasks where you want the AI to handle the full scope
Refactoring work that requires understanding project-wide context
Running on remote servers or in CI/CD pipelines
When you trust the AI to make changes and want maximum autonomy

Claude Code’s limitation: it is a terminal tool, not an IDE. If you want tight IDE integration (inline completions, visual diffs in the editor), Cursor is better. Claude Code is a command-line power tool.

Comparison Table

Tool	Best Scenario	Inline Completion	Multi-file Editing	IDE Integration	Terminal
Cursor	Daily coding, rapid iteration	Excellent	Good (Composer)	VS Code fork	No
Windsurf	Autonomous agentic tasks	Good	Excellent (Cascade)	VS Code fork	No
GitHub Copilot	Multi-IDE teams, GitHub Enterprise	Excellent	Limited	Native VS Code/JB	No
Claude Code	Complex refactoring, CI/CD, remote	No	Excellent	None	Yes

The honest recommendation: use Cursor for your daily IDE and Claude Code for complex, scope-heavy tasks that benefit from full codebase context and autonomous execution. Copilot is the best choice if multi-IDE support is a hard requirement.

10. Framework Decision Guide

The following decision tree is opinionated. The underlying principle is: start with the bare SDK and add framework complexity only when you feel the pain of not having it.

What are you building?
│
├── A simple LLM call, summarization, or classification
│   └── → Bare Anthropic/OpenAI SDK. No framework needed.
│       Lines of code: ~10. Don't add LangChain here.
│
├── RAG over documents
│   ├── Small corpus, simple queries
│   │   └── → Bare SDK + Chroma/pgvector. Implement retrieval yourself.
│   │         You will understand what is happening. LlamaIndex is overkill.
│   ├── Large corpus, complex queries, custom chunking
│   │   └── → LlamaIndex. The best RAG abstractions.
│   └── Rapid prototype with many document types
│       └── → LangChain LCEL + whatever vector store. Fast to write.
│             Plan to rewrite it properly later.
│
├── A single agent with tools (search, code execution, APIs)
│   ├── Stateless, simple tool loop
│   │   └── → Bare SDK with tool calling. It is ~50 lines of code.
│   │         Do not add a framework for this.
│   └── Stateful, needs to resume, complex conditional logic
│       └── → LangGraph. The right tool.
│
├── Multiple AI agents collaborating
│   ├── Quick prototype, business process flow
│   │   └── → CrewAI. Fast to get started, good docs.
│   │         Do not put it in production without careful testing.
│   ├── Code generation + execution loop
│   │   └── → AutoGen. Built for this use case.
│   └── Production-grade, explicit control flow
│       └── → LangGraph. You define the graph. No surprises.
│
├── Managed threads + tools, don't want to build state management
│   └── → OpenAI Assistants API.
│         Accept the vendor lock-in if it is worth the engineering savings.
│
└── .NET/Azure enterprise environment
    └── → Semantic Kernel. Designed for this context.

The “Start Bare” Principle

The most common mistake in AI engineering is reaching for a framework before writing a single line of direct SDK code. Frameworks solve real problems, but they also introduce:

New concepts to learn
Bugs that are not your bugs
Abstractions that leak under load
Migration costs when the framework changes its API (and it will)

The bare Anthropic SDK is 10 lines for a basic LLM call. The bare SDK with tool calling is ~50 lines. The bare SDK with RAG is ~100 lines (embedding + vector search + retrieval). At ~100 lines, you may genuinely benefit from a framework. Below that, you probably do not.

11. Interview Flashcards

Q1: Compare LangChain and LangGraph — when would you use each?

A: LangChain and LangGraph solve different problems. LangChain is an abstraction layer for composing LLM calls into chains (sequential, DAG-style pipelines) using LCEL’s pipe syntax. It excels at RAG and integration-heavy applications where you need many connectors. LangGraph is a graph-based orchestration framework for stateful agents — it adds cycles (loops), shared state, checkpointing, and human-in-the-loop to the picture.

Use LangChain when you need many integrations fast and your workflow is linear. Use LangGraph when you need loops, conditional branches, or persistent state — i.e., real agents. Note: LangGraph is built on LangChain, so using LangGraph means accepting LangChain as a dependency.

Q2: What is LCEL and what problem does it solve?

A: LCEL (LangChain Expression Language) is LangChain’s composable chain syntax, introduced to replace the older LLMChain, RetrievalQA, and related chain classes. It uses the pipe operator to compose Runnables: chain = prompt | llm | parser.

The problem it solves: before LCEL, every chain type had a different interface, streaming required special handling, and async support was inconsistent. LCEL gives every component a uniform Runnable interface with .invoke(), .ainvoke(), .batch(), and .stream(), making composition predictable and streaming uniform across the entire chain.

Q3: What makes LlamaIndex different from LangChain for RAG?

A: LlamaIndex was designed for RAG from the ground up; LangChain added RAG as one feature among many. The differences are:

Data model: LlamaIndex’s Document/Node model preserves relationships between chunks (parent document, neighboring nodes, position). LangChain’s document model is simpler and loses these relationships.
Index types: LlamaIndex has multiple index types (Vector, Summary, KnowledgeGraph, Tree) with different retrieval tradeoffs. LangChain essentially has one: vector store retrieval.
Advanced RAG: sentence window retrieval, auto-merging, recursive retrieval, and query decomposition are built into LlamaIndex. In LangChain, you implement these yourself.
Response provenance: LlamaIndex responses include the source nodes used to generate them. LangChain requires extra work to surface this.

Choose LlamaIndex when retrieval quality and data richness are the primary concerns.

Q4: What is AutoGen’s core pattern and when is it better than LangGraph?

A: AutoGen’s core pattern is conversation-as-coordination: agents coordinate by sending messages to each other, rather than following an explicitly defined graph. An AssistantAgent generates responses and code; a UserProxyAgent executes code and reports results. The loop continues via message exchange until a termination condition is met.

AutoGen is better than LangGraph when:

The workflow is a natural conversation (e.g., write code → run it → fix errors → repeat)
You want agents to figure out the coordination structure themselves (exploratory tasks)
You are building a code generation or data analysis workflow and want the “write-run-fix” loop handled automatically

LangGraph is better when you need explicit, auditable control flow, predictable execution, and production-grade reliability. AutoGen’s emergent coordination is great for exploration, risky for production.

Q5: How do CrewAI Processes differ — sequential vs hierarchical?

A: In sequential process, tasks run in the order defined, each receiving the previous task’s output as context. Task 1 → Task 2 → Task 3. Predictable, auditable, recommended for production.

In hierarchical process, a manager agent (backed by an LLM) dynamically assigns tasks to agents and determines the execution order. The manager sees the available agents and tasks and decides what to do next.

The hierarchical process sounds powerful but introduces an LLM decision-maker at the routing layer — meaning the execution order is non-deterministic. For production workflows, sequential is almost always the right choice unless the task genuinely requires dynamic task allocation. Even then, LangGraph with conditional edges gives you more control than CrewAI’s hierarchical process.

Q6: What are the trade-offs of using the OpenAI Assistants API vs building your own?

Assistants API advantages:

No state management code to write or maintain
Built-in file search (RAG) and code interpreter
Handles context window management automatically
Scales without infrastructure work

Assistants API disadvantages:

Vendor lock-in: tightly coupled to OpenAI. Moving to Claude requires rewriting state management.
Less control: chunking strategy, retrieval algorithm, and tool execution are managed by OpenAI. You cannot customize these.
Cost: file search charges for vector storage. At scale, this is significant.
Opacity: when something goes wrong, you have limited visibility into why.

Build your own advantages:

Full control over every component
Provider-agnostic: switch between Claude, GPT, Gemini without rewriting
Custom retrieval logic, custom chunking, custom tool execution

The right choice depends on team size and constraints. A small team building fast should consider the Assistants API for the time savings. A team building a long-term product should invest in their own infrastructure to avoid lock-in.

A: Claude Code for the most complex, scope-heavy refactoring tasks; Cursor as the day-to-day IDE. Here is why:

Claude Code operates at the agent level: it reads your entire codebase, executes commands, reads error output, and iterates until the task is complete. For a task like “migrate all API calls from v1 to v2, update all tests, and fix any type errors” — Claude Code handles the full scope autonomously.

Cursor is excellent for day-to-day coding and handles moderate multi-file edits through its Composer feature. But for truly complex, cross-cutting refactors, the agentic approach of Claude Code is more effective.

For a team: set up Cursor as the standard IDE (it makes everyone faster day-to-day) and introduce Claude Code for specific high-complexity tasks where autonomous execution is worth it.

Q8: Why might you choose the bare Anthropic SDK over LangChain?

A: Several reasons:

Simplicity: a direct SDK call is 10 lines. A LangChain LCEL chain for the same task is 30+ lines. For simple tasks, the overhead is not worth it.
Debuggability: with the bare SDK, you know exactly what prompt is being sent, what the response is, and where a failure occurred. LangChain’s abstractions hide this.
Stability: the Anthropic SDK follows the API spec closely and changes slowly. LangChain’s API changes frequently. Code built on the bare SDK lasts longer.
Performance: LangChain adds latency through its abstraction layers. For high-throughput applications, this matters.
No breaking changes from framework upgrades: you depend on Anthropic’s API, which is stable. You do not depend on LangChain’s abstractions, which are not.

The one reason to use LangChain over the bare SDK: you need many integrations (document loaders, vector stores, tool wrappers) that LangChain already has and you do not want to write.

Q9: What is a LangGraph checkpointer and why does it matter for production agents?

A: A LangGraph checkpointer is a persistence backend that serializes and stores the graph state at every step of execution. Built-in options include MemorySaver (in-memory, for development), SqliteSaver, and Postgres-based savers.

Why it matters for production:

Resilience: if your application crashes mid-execution (server restart, timeout, network failure), the agent can resume from the last saved checkpoint rather than starting over. Long-running tasks that take 10+ LLM calls to complete need this.
Human-in-the-loop: checkpointers enable interrupt_before/interrupt_after — pausing execution for human review and resuming after approval. This is architecturally impossible without persistence.
Multi-session agents: a user can start a task, close the browser, and resume the next day. The agent remembers where it was.
Debugging and auditing: checkpointers create an audit trail of every state the agent passed through. When something goes wrong, you can inspect the exact state at each step.

Without a checkpointer, a LangGraph agent is stateless across sessions. For most production use cases, this is unacceptable.

Q10: What is the biggest risk of building on top of LangChain?

A: The biggest risk is API churn combined with heavy dependency depth.

LangChain has gone through major architectural shifts: the original Chain classes were replaced by LCEL; LLMChain and RetrievalQA are now “legacy”; langchain-community split off from langchain-core. Code written 12-18 months ago frequently requires significant updates to run on current LangChain.

The compounding factor: LangChain’s import structure is complex (langchain, langchain-core, langchain-community, langchain-anthropic, etc.), and each package has its own version constraints. Upgrades often require navigating dependency conflicts.

The practical risk: you build a production application on LangChain, it works. Six months later, a security update requires upgrading a dependency. The upgrade breaks LangChain. Now you have an urgent security patch that requires a LangChain migration. This has happened to real teams.

Mitigation strategies:

Pin LangChain versions tightly in production
Write thin wrappers around LangChain components so the blast radius of an API change is contained
Evaluate whether the abstractions LangChain provides are worth the maintenance cost for your specific use case
For new projects, seriously consider whether the bare SDK is sufficient

Summary

Framework	Best For	Avoid When
LangChain	RAG, many integrations, rapid prototyping	Simple tasks, debugging-critical, long-term maintenance
LangGraph	Stateful agents, complex workflows, production	Simple chains, no LangChain appetite
LlamaIndex	Data-heavy RAG, complex retrieval	General-purpose agents
AutoGen	Code generation, conversational multi-agent	Predictable production workflows
CrewAI	Business process automation, quick multi-agent demos	Production agents needing reliability
OpenAI Assistants	Managed state + tools, small teams	Vendor-independence, cost-sensitive
Semantic Kernel	.NET/Azure enterprise	Python-first teams
Bare SDK	Everything that doesn’t need a framework	When integrations > 10 and time <

Final principle: the best framework is the one that solves your problem with the least additional complexity. Start bare. Add frameworks when you feel the pain. Never add a framework in anticipation of future complexity — that complexity may never arrive.

Study Notes by Niladri & AI

Explorer

README

Module 11: AI Engineering Ecosystem

Overview

1. Landscape Overview

The AI Engineering Stack

Why the Ecosystem Changes So Fast

How to Stay Current

How to Evaluate a Framework

2. LangChain

What It Is

LCEL: LangChain Expression Language

Key Components

Pros

Cons

When to Use LangChain

When NOT to Use LangChain

Code Example: Minimal RAG with LCEL

3. LangGraph

What It Is

Core Concepts

Why Cycles Matter

Human-in-the-Loop

Persistence: Checkpointers

Pros

Cons

When to Use LangGraph

4. LlamaIndex

What It Is

Core Data Abstractions

Advanced RAG Features

Pros

Cons

When to Use LlamaIndex

5. AutoGen (Microsoft)

What It Is

Core Agent Types

Group Chat

Pros

Cons

When to Use AutoGen

6. CrewAI

What It Is

Core Concepts

Process Types

Pros

Cons

When to Use CrewAI

7. OpenAI Assistants API

What It Is

Architecture

Built-in Tools

When to Use

Cons

8. Semantic Kernel (Microsoft)

What It Is

Core Concepts

When to Use

When NOT to Use

9. IDE AI Tools

Cursor

Windsurf (Codeium)

GitHub Copilot

Claude Code

Comparison Table

10. Framework Decision Guide

The “Start Bare” Principle

11. Interview Flashcards

Q1: Compare LangChain and LangGraph — when would you use each?

Q2: What is LCEL and what problem does it solve?

Q3: What makes LlamaIndex different from LangChain for RAG?

Q4: What is AutoGen’s core pattern and when is it better than LangGraph?

Q5: How do CrewAI Processes differ — sequential vs hierarchical?

Q6: What are the trade-offs of using the OpenAI Assistants API vs building your own?

Q7: What IDE AI tool would you recommend for a team doing complex multi-file refactoring?

Q8: Why might you choose the bare Anthropic SDK over LangChain?

Q9: What is a LangGraph checkpointer and why does it matter for production agents?

Q10: What is the biggest risk of building on top of LangChain?

Summary