Large language models are powerful but isolated. They generate text based on training data, but they cannot check current inventory, query a customer database, or create records in external systems. Most production LLM applications require bridging this gap.

This case study presents an architecture for extending LLM capabilities through structured tool integration. The approach progresses from basic function calling patterns through standardized protocols suitable for production deployment. The focus is practical: system prompts that produce reliable structured output, application-layer orchestration that maintains state and coordinates actions, and protocol choices that enable interoperability with commercial LLM providers.

Context and Problem

LLMs are static and stateless. They do not learn after training and retain no memory between calls. With temperature set to zero, outputs become more stable, but true determinism is difficult to guarantee across different runtimes, quantization levels, and provider versions. Treat validation as the real determinism layer: measure JSON-valid rate, schema-valid rate, and tool-call accuracy in CI, and catch regressions when prompts or models change. LLMs are also inherently isolated, with no access to external systems, APIs, or real-time data. What appears to be dynamic, goal-directed behavior emerges entirely from how applications structure inputs and handle outputs.
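
A minimal sketch of such a harness, where call_model, the test cases, and the per-action required fields are hypothetical stand-ins for the real client, prompt, and labeled test set:

import json

def measure(call_model, test_cases, required_fields):
    """Compute JSON-valid rate, schema-valid rate, and tool-call accuracy.

    call_model(text) returns the raw model output string; test_cases is a list of
    (input_text, expected_action) pairs; required_fields maps each action name to
    the set of keys its schema requires. All three are stand-ins.
    """
    json_ok = schema_ok = action_ok = 0
    for text, expected_action in test_cases:
        raw = call_model(text)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        json_ok += 1
        if not isinstance(data, dict):
            continue
        action = data.get("action")
        if action in required_fields and required_fields[action] <= data.keys():
            schema_ok += 1
            if action == expected_action:
                action_ok += 1
    n = len(test_cases) or 1
    return {
        "json_valid_rate": json_ok / n,
        "schema_valid_rate": schema_ok / n,
        "tool_call_accuracy": action_ok / n,
    }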

This distinction matters because it clarifies where responsibility lies. The model predicts text. The application layer manages state, schedules tasks, invokes tools, and coordinates multi-step workflows. Conflating these leads to architectural confusion and unrealistic expectations about model capabilities.

The practical problem: how do you build systems where an LLM can answer questions about current data, interact with databases, or trigger actions in external services? The answer is function calling, a pattern where the model generates structured output that the application interprets as instructions to execute code.

Hardware and Model Selection

The examples in this case study were developed on a workstation with an NVIDIA A4000 (16GB VRAM). Consumer GPUs in the 16-24GB range work equally well. For organizations without GPU infrastructure, the same patterns run on CPU via llama.cpp. This is slower but acceptable for batch processing where latency is not critical.

Model: Microsoft Phi-4, quantized to 6-8 bits per weight using ExLlamaV2. This fits in 16GB VRAM for typical context lengths, but validate with your target context size and concurrency; KV cache grows with context and can dominate memory at longer sequences.

The choice of a smaller local model is deliberate. For function calling and structured extraction with straightforward schemas, compact models often match larger alternatives when prompts are well-constructed. This breaks down with complex nested schemas, noisy inputs, multilingual content, or adversarial edge cases. Evaluate on representative data before committing to a model size.

Local inference eliminates per-token API costs and keeps data on-premises. For organizations processing sensitive customer data or operating under compliance constraints, this matters. However, maintaining a local inference stack has its own costs: GPU drivers, quantization tooling, model updates, and monitoring. The break-even calculation is not just tokens versus hardware; it includes operational overhead.

That said, the patterns themselves are model-agnostic. The same prompts and orchestration logic work with hosted APIs (OpenAI, Anthropic, Google) when local deployment is not feasible or when tasks require capabilities beyond smaller models.

Function Calling via System Prompts

Function calling does not require special model capabilities or API features. Instruction-following LLMs can be guided to produce structured output through careful prompt engineering, though models vary significantly in how reliably they follow schemas under varied inputs.

That said, prompt-based JSON extraction has limits. For production systems, prefer provider-native tool calling APIs when available. Otherwise, require strict JSON-only responses with schema validation and a bounded repair loop (parse, validate, retry with error feedback). For local models, grammar-constrained decoding (llama.cpp grammars, Outlines, etc.) is more reliable than parsing JSON out of free-form text. The approach shown here is pedagogical; the edge cases accumulate in production.

The core approach uses in-context learning: providing examples within the prompt that demonstrate expected behavior. LLMs are pattern matchers. When they see clear examples of inputs and corresponding outputs, they replicate those formats with high fidelity. The model is not “learning” in any persistent sense; it is recognizing patterns in the current context and extrapolating.

A system prompt defines available functions, their parameters, and output format. Few-shot examples show the model when and how to invoke each function. The more examples, the more reliably the model generalizes to novel inputs, though for simple schemas a single example often suffices.

Prompt Structure

Consider a customer service application that needs to look up accounts, check order status, and create support tickets. The system prompt defines the available actions as JSON schemas:

You are a command-to-JSON converter. Return only a JSON object describing the user's intent.

Output must match one of these schemas exactly:

{ "action": "lookup_customer", "customer_id": "", "email": "", "phone": "" }
{ "action": "check_order", "order_id": "", "customer_id": "" }
{ "action": "create_ticket", "customer_id": "", "subject": "", "priority": "" }
{ "action": "search_knowledge_base", "query": "" }

If the user request doesn't match any command, return:
{ "action": "none" }

Use exact identifiers when provided. Leave fields blank if unknown.
Return ONLY valid JSON. No other text.

Examples:
User: look up customer 4521
Assistant: { "action": "lookup_customer", "customer_id": "4521", "email": "", "phone": "" }

User: check the status of order 78234
Assistant: { "action": "check_order", "order_id": "78234", "customer_id": "" }

User: how do I reset my password?
Assistant: { "action": "search_knowledge_base", "query": "password reset" }

The explicit examples anchor the pattern. Without them, models may return JSON with different quoting conventions, add explanatory text, or use alternative field names. The examples constrain the output space.

With this prompt structure, natural language inputs produce predictable structured output:

User Input | Model Output
“pull up the account for customer 4521” | { "action": "lookup_customer", "customer_id": "4521", "email": "", "phone": "" }
“what’s the status of order 78234” | { "action": "check_order", "order_id": "78234", "customer_id": "" }
“create a high priority ticket for customer 4521 about a billing discrepancy” | { "action": "create_ticket", "customer_id": "4521", "subject": "billing discrepancy", "priority": "high" }
“what’s the company’s return policy?” | { "action": "search_knowledge_base", "query": "return policy" }

When the model does produce malformed output, common failure modes include adding explanatory text before or after the JSON, using slightly different field names, or wrapping output in markdown code blocks. These are addressable through parsing logic that extracts JSON from surrounding text and normalizes field names. Production implementations should validate output against the schema and retry with a clarifying prompt when validation fails.
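
A minimal sketch of that salvage step, assuming output that may be wrapped in markdown fences or surrounded by prose (the helper name is illustrative; a full implementation would also normalize field names and re-prompt on failure):

import json
import re

def extract_json(raw: str) -> dict | None:
    """Best-effort recovery of a JSON object from model output.

    Strips markdown code fences and takes the first {...} span in the text.
    Returns None if nothing parseable is found; callers should then retry.
    """
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None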

The application parses this JSON and executes the corresponding code. The model never directly queries databases or creates tickets. It produces structured instructions that the application interprets.
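
One simple way to structure that interpretation is a dispatch table mapping action names to handlers. A minimal sketch, assuming handler functions such as lookup_customer and create_ticket exist elsewhere in the application:

def execute_action(call: dict) -> str:
    """Route a validated tool call to application code.

    The handler functions referenced here are assumed to exist elsewhere;
    this only shows the dispatch pattern.
    """
    handlers = {
        "lookup_customer": lambda c: lookup_customer(c["customer_id"], c["email"], c["phone"]),
        "check_order": lambda c: check_order(c["order_id"], c["customer_id"]),
        "create_ticket": lambda c: create_ticket(c["customer_id"], c["subject"], c["priority"]),
        "search_knowledge_base": lambda c: search_knowledge_base(c["query"]),
    }
    handler = handlers.get(call.get("action"))
    if handler is None:
        return "No matching action."  # covers {"action": "none"} and unknown values
    return handler(call)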

Application-Layer Orchestration

The application layer handles everything the model cannot: state management, scheduling, external API calls, and action execution. The model generates text. The application acts on that text.

A Minimal Agent Implementation

This example monitors an intake folder for incoming documents such as contracts, invoices, and reports. It summarizes each using an LLM and distributes summaries via email. The implementation below is illustrative; production use would require persistent state (the in-memory set is lost on restart), idempotency, locking, and proper error recovery:

import os
import time
import requests
import smtplib
from email.message import EmailMessage

# Configuration
FOLDER = "./intake"
API_URL = "http://localhost:8000/v1/chat/completions"
SYSTEM_PROMPT = "You are a helpful assistant that summarizes documents clearly and concisely."
EMAIL_ADDRESS = os.environ.get("AGENT_EMAIL_ADDRESS", "agent@example.com")  # sender address (placeholder)
EMAIL_PASSWORD = os.environ.get("AGENT_EMAIL_PASSWORD", "")                 # sender credentials (placeholder)
SEND_TO = os.environ.get("AGENT_SEND_TO", "user@example.com")               # recipient address (placeholder)
SMTP_SERVER = os.environ.get("AGENT_SMTP_SERVER", "smtp.example.com")       # placeholder SMTP host
SMTP_PORT = int(os.environ.get("AGENT_SMTP_PORT", "587"))
PROCESSED = set()  # in-memory only; lost on restart (see note above)

def summarize(content):
    payload = {
        "model": "phi-4",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Summarize this document:\n\n{content}"}
        ],
        "temperature": 0.5
    }
    response = requests.post(API_URL, json=payload, timeout=120)  # bound the wait on a slow or stalled inference server
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

def send_email(subject, body):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = EMAIL_ADDRESS
    msg["To"] = SEND_TO
    msg.set_content(body)
    
    with smtplib.SMTP(SMTP_SERVER, SMTP_PORT) as smtp:
        smtp.starttls()
        smtp.login(EMAIL_ADDRESS, EMAIL_PASSWORD)
        smtp.send_message(msg)

def run_agent():
    os.makedirs(FOLDER, exist_ok=True)
    
    while True:
        files = [f for f in os.listdir(FOLDER) if f.endswith(".txt")]
        
        for filename in files:
            path = os.path.join(FOLDER, filename)
            if path in PROCESSED:
                continue
            
            with open(path, "r") as f:
                content = f.read()
            
            try:
                summary = summarize(content)
                send_email(f"Summary: {filename}", summary)
                PROCESSED.add(path)
            except Exception as e:
                print(f"Error processing {filename}: {e}")
        
        time.sleep(5)

if __name__ == "__main__":
    run_agent()

The model’s role is limited to text generation. The application handles file monitoring, state tracking (which files have been processed), email delivery, error handling, and scheduling. This separation is fundamental: models generate, applications act.

For multi-turn tool-using conversations, the statelessness creates additional complexity. The application must maintain enough context to re-prompt the model effectively at each turn, including prior tool results, conversation history, and any planning state. This context management becomes a significant part of the orchestration logic.

Scheduled and Autonomous Behavior

For autonomous behavior, the same pattern extends to scheduled tasks. An LLM can generate job specifications (time, function call, parameters) that the application adds to a scheduler.

A request like “send me a summary of new support tickets every morning at 8am” produces a job definition:

{
    "schedule": "0 8 * * *",
    "action": "summarize_and_email",
    "params": {
        "source": "support_tickets",
        "filter": "created_since_yesterday",
        "recipient": "user@example.com"
    }
}

The application adds this to a cron-like scheduler. At 8am daily, the scheduler executes the job: it fetches tickets, passes them to the LLM for summarization, and emails the results. The model does not “remember” to run tasks. It has no memory between calls. The scheduler invokes the model at specified times with appropriate context.
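
A sketch of that scheduling step, assuming APScheduler is available and that a summarize_and_email handler exists elsewhere in the application:

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

# ACTIONS maps action names from job definitions to application functions;
# summarize_and_email is a hypothetical handler assumed to exist elsewhere.
ACTIONS = {"summarize_and_email": summarize_and_email}

def register_job(scheduler: BlockingScheduler, job: dict) -> None:
    """Translate an LLM-generated job definition into a scheduled task."""
    handler = ACTIONS[job["action"]]
    scheduler.add_job(handler, CronTrigger.from_crontab(job["schedule"]), kwargs=job["params"])

scheduler = BlockingScheduler()
register_job(scheduler, job_definition)  # job_definition: the parsed JSON shown above
scheduler.start()                        # blocks; the scheduler now owns execution timing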

This is not the LLM becoming stateful or autonomous. It is application code implementing scheduling, with the LLM as a text-generation component within that system.

External Tool Integration

Function calling becomes more powerful when connected to external services. The pattern remains the same: the model generates structured output indicating a function call, the application executes it, and results are fed back to the model for synthesis.

Web Search Integration

Business applications often need current information (competitor announcements, regulatory changes, market data) that falls outside training data. The system prompt instructs the model to return JSON when a search is needed:

You are an assistant that can search the web when needed.
When the user asks for recent information, external data, or facts you may not have,
respond with ONLY a JSON object: {"query": "your search terms"}
Otherwise, answer directly.

Examples:
User: What are the current OSHA requirements for warehouse safety training?
Assistant: {"query": "OSHA warehouse safety training requirements 2025"}

User: How do I calculate gross margin?
Assistant: Gross margin is calculated as (Revenue - Cost of Goods Sold) / Revenue, expressed as a percentage.

User: What did the Fed announce about interest rates this week?
Assistant: {"query": "Federal Reserve interest rate announcement this week"}

Implementation

The application detects function calls in the model’s output, executes the search, and injects results into the conversation:

import json
import requests
from pydantic import BaseModel, ValidationError
from duckduckgo_search import DDGS

# Local OpenAI-compatible endpoint, as in the earlier example
API_URL = "http://localhost:8000/v1/chat/completions"

class SearchQuery(BaseModel):
    query: str

def call_llm(messages):
    """Send the conversation to the chat completions endpoint and return the parsed response."""
    response = requests.post(
        API_URL,
        json={"model": "phi-4", "messages": messages, "temperature": 0.2},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()

def parse_tool_call(content: str) -> dict | None:
    """Parse and validate JSON tool call with retry on failure.
    
    Production pattern: expect JSON-only response, validate against schema,
    retry with error feedback if malformed. For local models, grammar-constrained
    decoding (llama.cpp grammars, Outlines) avoids parsing issues entirely.
    """
    try:
        data = json.loads(content.strip())
        validated = SearchQuery(**data)
        return validated.model_dump()
    except (json.JSONDecodeError, ValidationError):
        return None

def parse_with_retry(content: str, history: list, max_retries: int = 1) -> dict | None:
    """Retry with error feedback if initial parse fails."""
    result = parse_tool_call(content)
    if result:
        return result
    
    for _ in range(max_retries):
        history.append({
            "role": "user", 
            "content": "Invalid JSON. Return only: {\"query\": \"your search terms\"}"
        })
        response = call_llm(history)
        content = response["choices"][0]["message"]["content"]
        result = parse_tool_call(content)
        if result:
            return result
    return None

def web_search(query, num_results=3):
    """Execute web search and format results."""
    try:
        ddgs = DDGS()
        results = ddgs.text(query, max_results=num_results)
        return [
            {
                "title": r.get("title", "No title"),
                "link": r.get("href", ""),
                "snippet": r.get("body", "No snippet")
            }
            for r in results
        ]
    except Exception as e:
        return [{"error": str(e)}]

def format_results(results):
    """Format search results for injection into conversation."""
    if not results or "error" in results[0]:
        return "Search failed. Please try a different query."
    
    formatted = []
    for r in results:
        formatted.append(f"{r['snippet']} (Source: {r['title']}, {r['link']})")
    return "\n\n".join(formatted)

def process_with_tools(user_message, history):
    """Main loop: call LLM, detect tool use, execute, synthesize."""
    history.append({"role": "user", "content": user_message})
    
    # First LLM call
    response = call_llm(history)
    content = response["choices"][0]["message"]["content"]
    
    # Parse and validate tool call
    function_args = parse_with_retry(content, history)
    
    if function_args:
        # Execute search
        query = function_args.get("query", "")
        results = web_search(query)
        result_text = format_results(results)
        
        # Inject results and get final response
        history.append({"role": "assistant", "content": content})
        history.append({"role": "assistant", "content": f"Search results:\n{result_text}"})
        
        final_response = call_llm(history)
        return final_response["choices"][0]["message"]["content"]
    
    return content

The model now answers questions about current regulations, market conditions, or any information available through search, despite having no direct internet access. The application bridges the gap.

Error Handling and Limitations

External tool integration introduces failure modes that pure text generation does not have:

  • Network failures: Search APIs may be unavailable. The application should return graceful error messages and optionally retry.
  • Rate limiting: External services impose request limits. Implement backoff and caching for repeated queries.
  • Result quality: Search results may be irrelevant, outdated, or contradictory. The model synthesizes what it receives; low-quality inputs produce low-quality outputs.
  • Latency: Each tool call adds round-trip time.
  • Prompt injection: Web pages and documents can contain text designed to manipulate the model. Treat retrieved content as untrusted: consider quoting or bracketing it clearly, limiting what the model can do after retrieval, and validating outputs.

Production implementations should log tool invocations, cache results where appropriate, and implement circuit breakers to prevent cascading failures when external services degrade.
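
A minimal sketch of the caching and circuit-breaker ideas, wrapping the web_search function above (the failure threshold, cooldown, and TTL values are illustrative):

import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cooldown period after repeated errors."""
    def __init__(self, max_failures=3, cooldown=60):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True
        return (time.time() - self.opened_at) > self.cooldown

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            self.opened_at = time.time()

_cache: dict[str, tuple[float, list]] = {}
_breaker = CircuitBreaker()

def cached_search(query: str, ttl: int = 300) -> list:
    """web_search with a TTL cache and a circuit breaker in front of it."""
    now = time.time()
    if query in _cache and now - _cache[query][0] < ttl:
        return _cache[query][1]
    if not _breaker.allow():
        return [{"error": "search temporarily disabled (circuit open)"}]
    results = web_search(query)
    ok = not (results and "error" in results[0])
    _breaker.record(ok)
    if ok:
        _cache[query] = (now, results)
    return results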

Retrieval-Augmented Generation

RAG extends the tool integration pattern to private knowledge bases. Instead of searching the web, the application searches a vector database of embedded documents and injects relevant content into the prompt. This is how LLMs answer questions about internal policies, product specifications, client records, or any proprietary information.

The Pipeline

RAG combines retrieval and generation in a structured flow:

  1. Ingestion: Documents are chunked and converted to vector embeddings.
  2. Storage: Embeddings are stored with references to source text.
  3. Retrieval: User queries are embedded and compared against stored vectors.
  4. Augmentation: Top-matching chunks are injected into the LLM prompt as context.
  5. Generation: The model generates responses grounded in retrieved content.

Document Chunking

Documents must be split into chunks small enough for embedding models and LLM context windows. Two strategies:

Whole-file ingestion: For short documents (policies, FAQs, product descriptions), embed the entire file as a single chunk. Simple and preserves context.

Overlapping segments: For longer documents, split into chunks of 500-1000 tokens with 50-100 token overlap. The overlap ensures that information spanning chunk boundaries is not lost.

def chunk_document(text, chunk_size=800, overlap=100):
    """Split document into overlapping chunks (sizes are in words, a rough proxy for tokens)."""
    words = text.split()
    chunks = []
    start = 0
    
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        if end >= len(words):
            break
        start = end - overlap
    
    return chunks

Embedding and Storage

Each chunk is converted to a vector embedding using a model like BAAI/bge-large-en. The embedding captures semantic meaning, enabling similarity search that goes beyond keyword matching.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en')

def ingest_documents(file_paths):
    """Embed and store documents."""
    store = []
    
    for path in file_paths:
        with open(path, 'r') as f:
            content = f.read()
        
        chunks = chunk_document(content)
        
        for i, chunk in enumerate(chunks):
            embedding = model.encode(chunk)
            store.append({
                "source": path,
                "chunk_index": i,
                "text": chunk,
                "embedding": embedding
            })
    
    return store

Retrieval and Relevance

At query time, the user’s question is embedded and compared against stored vectors using cosine similarity. Top matches are returned if they exceed a relevance threshold.

Note: The code below is pedagogical. The brute-force O(N) search works for small document sets but does not scale. Production systems use approximate nearest neighbor indexes (FAISS, HNSW) or managed vector databases. Thresholds are corpus- and embedding-model-dependent; the 0.5 below is arbitrary. Calibrate with labeled query-document pairs from your actual data.

def retrieve(query, store, top_k=3, threshold=0.5):
    """Find relevant chunks for a query."""
    query_embedding = model.encode(query)
    
    results = []
    for item in store:
        similarity = np.dot(query_embedding, item["embedding"]) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(item["embedding"])
        )
        if similarity >= threshold:
            results.append({
                "text": item["text"],
                "source": item["source"],
                "score": similarity
            })
    
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:top_k]

The threshold prevents injection of irrelevant content. If no chunks exceed the threshold, the model responds without retrieved context, or explicitly states that no relevant information was found.

Prompt Augmentation

Retrieved chunks are injected into the system prompt, instructing the model to use them:

def build_rag_prompt(query, retrieved_chunks):
    """Construct prompt with retrieved context."""
    context = "\n\n".join([
        f"[Source: {r['source']}]\n{r['text']}" 
        for r in retrieved_chunks
    ])
    
    return f"""Answer the user's question based on the following context. 
If the context doesn't contain relevant information, say so.

Context:
{context}

Question: {query}"""

Example: Internal Knowledge Base

Given three simple text files:

  • returns.txt: “Our return policy allows returns within 30 days of purchase with original receipt. Items must be unused and in original packaging.”
  • shipping.txt: “Standard shipping takes 5-7 business days. Express shipping (2-3 days) is available for an additional $15.”
  • warranty.txt: “All products include a 1-year manufacturer warranty covering defects in materials and workmanship.”

Query: “What’s the return window?”

The system embeds the query, finds that returns.txt has the highest similarity, retrieves it, and injects it into the prompt. The model responds: “Returns are accepted within 30 days of purchase, provided you have the original receipt and the item is unused in original packaging.”

The model used information it was never trained on, grounded in a specific source document.
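
Tying the pieces above together, a minimal end-to-end sketch. It reuses ingest_documents, retrieve, and build_rag_prompt from this section and assumes a call_llm helper like the one sketched in the web-search example:

# Ingest the three policy files, then answer a question against them.
store = ingest_documents(["returns.txt", "shipping.txt", "warranty.txt"])

query = "What's the return window?"
chunks = retrieve(query, store, top_k=3, threshold=0.5)

if chunks:
    prompt = build_rag_prompt(query, chunks)
else:
    # Fallback: answer without retrieved context and say so
    prompt = f"{query}\n\n(No relevant internal documents were found.)"

response = call_llm([{"role": "user", "content": prompt}])
print(response["choices"][0]["message"]["content"])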

Limitations

RAG is not a perfect solution:

  • Chunking artifacts: Information split across chunk boundaries may not be retrieved together. Overlapping chunks mitigate but do not eliminate this. Structure-aware chunking (respecting headings, sections, paragraphs) often works better than fixed word counts.
  • Retrieval precision: Semantic similarity alone is imperfect. Questions phrased differently than the source may not match well. Hybrid retrieval (BM25 + vectors) improves recall; re-ranking models (cross-encoders) improve precision by scoring query-document pairs more carefully after initial retrieval.
  • Context window limits: You can only inject so much retrieved content. For complex queries requiring synthesis across many documents, RAG may be insufficient.
  • Hallucination risk: Models may still generate information not present in retrieved context, especially if instructed ambiguously.
  • Real-world document messiness: Production corpora have inconsistent formatting, duplicate content, stale documents, and access control requirements. Metadata filtering (by tenant, document type, freshness) becomes essential.

RAG provides access to company handbooks, technical documentation, client correspondence, or domain-specific content without fine-tuning and without sending proprietary data to external providers. The simple implementation shown here works for small, clean document collections. Larger or messier corpora require hybrid retrieval, re-ranking, metadata filtering, and evaluation pipelines (frameworks like RAGAS can help measure retrieval recall and answer groundedness).
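
As an illustration of the re-ranking step mentioned above, a minimal sketch using a cross-encoder from sentence-transformers on top of the retrieve function from this section (the model name and candidate count are illustrative):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query, store, candidates=20, top_k=3):
    """Wide first-stage retrieval, then re-score query-chunk pairs with a cross-encoder."""
    first_stage = retrieve(query, store, top_k=candidates, threshold=0.0)
    if not first_stage:
        return []
    pairs = [(query, r["text"]) for r in first_stage]
    scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
    for r, score in zip(first_stage, scores):
        r["rerank_score"] = float(score)
    first_stage.sort(key=lambda r: r["rerank_score"], reverse=True)
    return first_stage[:top_k]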

Model Context Protocol for Production Deployment

The patterns above work but lack standardization. Each integration requires custom parsing logic, bespoke function definitions, and provider-specific implementations. Model Context Protocol (MCP) addresses this by providing a common interface for exposing tools to LLMs.

MCP defines how servers expose capabilities and how clients (LLM applications) discover and invoke them. A tool becomes accessible to any MCP-compatible client, including commercial providers like ChatGPT and Claude, through a single implementation. Note that MCP is a relatively new and evolving standard; evaluate library maturity and spec stability for your use case.

Setting Up an MCP Server

First, create sample data to work with:

import sqlite3

def init_database():
    con = sqlite3.connect("demo.db")
    cur = con.cursor()
    
    cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        email TEXT NOT NULL
    )""")
    
    cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        product TEXT NOT NULL,
        amount REAL NOT NULL,
        order_date TEXT NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customers(id)
    )""")
    
    cur.executemany(
        "INSERT OR IGNORE INTO customers (id, name, email) VALUES (?, ?, ?)",
        [
            (1, "Alice Johnson", "alice@example.com"),
            (2, "Bob Smith", "bob@example.com"),
            (3, "Charlie Lee", "charlie@example.com"),
        ]
    )
    
    cur.executemany(
        "INSERT OR IGNORE INTO orders (id, customer_id, product, amount, order_date) VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "Laptop", 1200.00, "2025-09-01"),
            (2, 1, "Mouse", 25.50, "2025-09-02"),
            (3, 2, "Keyboard", 75.00, "2025-09-03"),
            (4, 3, "Monitor", 300.00, "2025-09-05"),
        ]
    )
    
    con.commit()
    con.close()

Then implement the MCP server with tool definitions:

import sqlite3
import json
from typing import Optional, List, Any
from fastmcp import FastMCP

DB_PATH = "demo.db"
mcp = FastMCP("mcp_server_example")

def query(sql: str, params: tuple = ()) -> List[dict]:
    """Execute query and return results as list of dicts."""
    con = sqlite3.connect(DB_PATH)
    con.row_factory = sqlite3.Row
    cur = con.cursor()
    cur.execute(sql, params)
    rows = [dict(r) for r in cur.fetchall()]
    con.close()
    return rows

@mcp.tool()
def get_customer(customer_id: int) -> str:
    """Fetch a single customer by id."""
    data = query(
        "SELECT id, name, email FROM customers WHERE id = ?", 
        (int(customer_id),)
    )
    return json.dumps(data[0] if data else {}, indent=2)

@mcp.tool()
def list_orders(
    customer_id: Optional[int] = None, 
    min_amount: Optional[float] = None
) -> str:
    """List orders with optional filters for customer_id and minimum amount."""
    sql = "SELECT id, customer_id, product, amount, order_date FROM orders WHERE 1=1"
    params: List[Any] = []
    
    if customer_id is not None:
        sql += " AND customer_id = ?"
        params.append(int(customer_id))
    if min_amount is not None:
        sql += " AND amount >= ?"
        params.append(float(min_amount))
    
    sql += " ORDER BY order_date, id"
    return json.dumps(query(sql, tuple(params)), indent=2)

if __name__ == "__main__":
    mcp.run(transport="http", host="127.0.0.1", port=8000)

FastMCP handles protocol details: JSON-RPC messages, transport negotiation, tool discovery. The developer defines functions with type hints and docstrings; the library exposes them as MCP tools. The docstrings become tool descriptions that help the LLM understand when to invoke each function.
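
For exercising the server outside a commercial client, a minimal sketch using FastMCP's bundled client. It assumes the HTTP transport's default /mcp path; the client API is evolving, so check current FastMCP documentation:

import asyncio
from fastmcp import Client

async def main():
    # Connect over HTTP to the server started above; FastMCP infers the transport from the URL.
    async with Client("http://127.0.0.1:8000/mcp") as client:
        tools = await client.list_tools()
        print([t.name for t in tools])           # expect: get_customer, list_orders
        result = await client.call_tool("list_orders", {"min_amount": 100})
        print(result)

if __name__ == "__main__":
    asyncio.run(main())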

Transport Options

MCP supports multiple transports, each suited to different deployment scenarios:

  • stdio: Communication over standard input/output. Suitable for local development where the LLM process spawns the MCP server directly. No network stack required. Cannot connect to external LLM interfaces like ChatGPT or Claude.
  • HTTP: Network-accessible service. Required for integration with commercial LLM providers. Expect valid TLS for remote connectors; plan for proper certificates.
  • HTTP streaming: Persistent connections for real-time bidirectional communication. Useful for tools that produce incremental results or monitor live data.

For external access, HTTP transport with a reverse proxy handles SSL termination. A minimal Caddy configuration:

my.domain.com {
  handle /mcp* {
    reverse_proxy 127.0.0.1:8000
  }
}

This obtains an SSL certificate automatically via Let’s Encrypt and forwards HTTPS requests to the local MCP server.

Connecting to ChatGPT

With the server running and accessible via HTTPS, you can connect it to ChatGPT. The specific UI steps below may change as MCP tooling evolves; consult current provider documentation:

  1. In ChatGPT, enable Developer Mode in Settings → Connectors → Advanced
  2. Create a new connector with your MCP server URL (e.g., https://my.domain.com/mcp)
  3. Start a new conversation with Developer Mode enabled
  4. Query naturally: “Show me all orders over $100”

ChatGPT discovers the available tools, recognizes the query requires list_orders, invokes it with min_amount=100, and synthesizes the results into a natural response.

Production Considerations

The examples above are intentionally minimal. Production deployments require additional safeguards:

  • Authentication: Implement OAuth or API key validation. The examples above are open to anyone who knows the URL.
  • Input validation: The code above uses parameterized queries, which prevents SQL injection. Patterns that allow arbitrary SQL (like a sql_select tool accepting raw queries) are dangerous even with SELECT-only restrictions. Attackers can extract data, cause denial of service, or exploit database-specific features.
  • Prompt injection: Retrieved documents or web pages can contain text that attempts to override model instructions. Content from external sources should be treated as untrusted and isolated from system prompts where possible.
  • Tool misuse: Models may invoke write operations when they should only read, or access data they should not. Design tools with minimal capabilities (narrow scope, bounded outputs). Implement per-tool authorization, separate read and write tools, and require human approval for high-impact actions.
  • Data exfiltration: Models can inadvertently return sensitive data from tool responses. Apply output filtering and avoid exposing tools that return unbounded data.
  • Rate limiting: Prevent abuse and control costs. A malicious or buggy client could invoke tools thousands of times.
  • Error handling: Return structured errors that help the LLM understand what went wrong. “Customer not found” is more useful than a stack trace.
  • Logging and monitoring: Log every tool invocation with timestamp, parameters, user identity, and results. Essential for debugging, audit, and incident response.
  • Resource limits: Bound query results to prevent memory exhaustion. The list_orders function should limit returned rows, as sketched below.
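
A revised list_orders illustrating the row cap, reusing the query helper and tool decorator from the server above (the cap value is illustrative):

@mcp.tool()
def list_orders(
    customer_id: Optional[int] = None,
    min_amount: Optional[float] = None,
    limit: int = 100,
) -> str:
    """List orders with optional filters; results are capped to bound memory and payload size."""
    sql = "SELECT id, customer_id, product, amount, order_date FROM orders WHERE 1=1"
    params: List[Any] = []

    if customer_id is not None:
        sql += " AND customer_id = ?"
        params.append(int(customer_id))
    if min_amount is not None:
        sql += " AND amount >= ?"
        params.append(float(min_amount))

    sql += " ORDER BY order_date, id LIMIT ?"
    params.append(max(1, min(int(limit), 100)))  # never return more than 100 rows
    return json.dumps(query(sql, tuple(params)), indent=2)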

Security in tool-integrated LLM systems is a primary design concern, not a late-stage add-on. The combination of natural language interfaces and external system access creates attack surface that traditional applications do not have.

FastMCP and similar libraries are appropriate for prototypes, internal tools, and small-scale deployments. Production systems handling sensitive data or significant traffic require more robust infrastructure. There is a real tradeoff between custom orchestration code (simpler to deploy, harder to standardize) and managed MCP servers (interoperable, but another service to operate and secure).

Architecture Summary

The complete architecture separates concerns across three layers:

Layer | Responsibility | Implementation
Model | Text generation, intent recognition, response synthesis | LLM (local or hosted)
Orchestration | State management, scheduling, function call parsing, result injection | Application code
Tools | External capabilities: search, database access, APIs, business systems | MCP servers, direct integrations

The model generates structured output indicating desired actions. The orchestration layer interprets this output, invokes appropriate tools, and feeds results back to the model. Tools execute actual operations against external systems.

This separation enables flexibility. Models can be swapped without changing tool implementations. Tools can be added without modifying orchestration logic. The interfaces between layers (JSON function calls, MCP protocol) provide stable contracts.

When This Approach Applies

Tool-integrated LLM systems are appropriate when:

  • Tasks require information beyond the model’s training data (current prices, client records, live inventory)
  • Workflows involve actions in external systems (CRM updates, ticket creation, order processing)
  • Users expect natural language interfaces to existing services
  • Requirements include grounding responses in specific document collections (policy manuals, product catalogs, compliance documentation)

They add unnecessary complexity when:

  • The task is pure text generation with no external dependencies
  • Structured APIs already exist and users can interact with them directly
  • Latency requirements preclude the round-trips inherent in tool invocation
  • Data volumes are small enough that context injection handles everything

The overhead of function calling, RAG retrieval, and MCP communication adds latency and failure modes. These are acceptable costs when the capabilities justify them, but not every application needs tool integration.

Cost and Privacy Considerations

Local inference eliminates per-token API costs. For organizations processing significant volume, the savings compound over time. More importantly, local deployment keeps data on-premises. For customer records, financial data, medical information, or anything subject to compliance requirements, this eliminates data exposure concerns entirely.

The break-even calculation depends on volume and hardware costs. For occasional use, hosted APIs may be more economical. For sustained workloads with sensitive data, local deployment is often the better choice.

Summary

Extending LLM capabilities requires clear separation between model and application responsibilities. Models generate text; applications manage state, invoke tools, and coordinate actions. Function calling bridges these layers through structured output that applications interpret as instructions.

The architecture presented here progresses through four levels of capability:

  • System prompt function calling: Works with any instruction-following model. Guides the model to produce structured JSON that applications parse and execute. Requires careful prompt engineering and robust parsing logic, but has no external dependencies.
  • Application-layer orchestration: Enables autonomous agents, scheduled tasks, and multi-step workflows. The application maintains state, handles errors, and coordinates execution. For write operations, idempotency matters; retries without idempotency keys are a common source of duplicate actions.
  • Retrieval-Augmented Generation: Grounds responses in external knowledge without fine-tuning. Chunking, embedding, and similarity search add complexity but enable LLMs to answer questions about proprietary documents, current policies, or domain-specific content.
  • Model Context Protocol: Standardizes tool exposure for interoperability across providers. A single MCP implementation makes tools accessible to ChatGPT, Claude, and any other MCP-compatible client. Production deployment requires attention to authentication, validation, and operational concerns.

Each layer builds on the previous. Starting from the basic insight that models are stateless and applications orchestrate makes the more sophisticated patterns easier to understand and the architectural decisions easier to justify.

The key points:

  • LLMs are static, stateless, and isolated. Dynamic behavior emerges from application-layer orchestration.
  • Function calling via system prompts works with any capable model. For production reliability, constrained decoding or provider tool-call APIs often help.
  • In-context learning (examples in the prompt) is essential for reliable structured output.
  • RAG enables access to private knowledge without fine-tuning or external data exposure. Production systems need hybrid retrieval and metadata filtering.
  • MCP provides standardization but adds deployment complexity; evaluate whether the interoperability benefits justify the overhead.
  • Local inference eliminates API costs and keeps data on-premises, but includes operational overhead beyond hardware.
  • Security is a primary design concern, not a late-stage add-on. Prompt injection, tool misuse, and data exfiltration are real risks in tool-integrated systems.