We work on systems that need to behave predictably over time. With LLM components, that means being explicit about what is reliable, what is useful with constraints, and what should remain assistive. Most problems we see come from treating these as the same category.
How we think about LLM tasks
LLMs can be productive in many contexts, but reliability is not uniform. We use a simple taxonomy when scoping work, choosing architectures, and deciding where validation belongs.
Tier 1 — Highly reliable
(bounded, auditable, low surprise)
These are tasks where the output space is constrained and evaluation criteria are clear. They usually integrate well into production pipelines because failure modes are visible and downstream impact can be contained.
- Text classification: topic and intent labeling, policy labeling, metadata tagging
- Structured transformation: JSON/XML/CSV/Markdown conversions, normalization, schema mapping
- Information extraction: entities, fields, relations, slot filling, unstructured to structured records
- Semantic comparison: similarity scoring, near-duplicate detection, record linking and entity resolution
- Embedding generation: vector representations for semantic search and retrieval
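
To make the bounded-output property concrete, here is a minimal sketch of a Tier 1 classification step. The `complete()` client and the label set are hypothetical stand-ins, not a specific library's API; the point is that when the output space is closed, validation reduces to a set-membership check and failures are visible.

```python
# Minimal sketch of a bounded classification task, assuming a generic
# `complete(prompt) -> str` LLM client (hypothetical; swap in your own).
ALLOWED_LABELS = {"billing", "technical", "account", "other"}

def classify_ticket(text: str, complete) -> str:
    prompt = (
        "Classify the support ticket into exactly one of: "
        + ", ".join(sorted(ALLOWED_LABELS))
        + ".\nRespond with the label only.\n\nTicket:\n" + text
    )
    label = complete(prompt).strip().lower()
    # The output space is closed, so validation is a set-membership check;
    # anything outside the set is a visible, containable failure.
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Model returned out-of-vocabulary label: {label!r}")
    return label
```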
Tier 2 — Reliable with guardrails
(grounding and review required)
These are effective when the system provides constraints such as documents, templates, or schemas, and when review is part of the workflow. The main design goal is making assumptions explicit and verification inexpensive.
- Source-grounded question answering: document-bound and retrieval-augmented Q&A with traceability
- Text quality and consistency checks: grammar and spelling error detection, guideline compliance, contradiction flags
- Rewriting and controlled style transfer: paraphrasing, simplification, tone adjustment within defined constraints
- Controlled generation: template filling, boilerplate generation, outline expansion with review
- Translation and summarization: language translation, register adjustment, and source-tied summaries. Quality varies by language pair and domain; medical and legal translation requires review, and abstractive summaries can introduce subtle shifts in emphasis.
- Code understanding and documentation: code summarization, explanation, and documentation generation. The code provides ground truth, but validation still matters for downstream use.
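
As one illustration of the guardrail pattern, here is a sketch of source-grounded Q&A with traceability. The `complete()` client and the refusal token are assumptions, not a specific library's API; what matters is that citations must resolve to real passages and that "no answer" is an explicit, safe outcome rather than a guess.

```python
# Minimal sketch of source-grounded Q&A, assuming a generic
# `complete(prompt) -> str` client (hypothetical) and pre-retrieved passages.
# Guardrail: answers must cite passage IDs that exist, or the system refuses.
import re

def grounded_answer(question: str, passages: dict[str, str], complete) -> tuple[str, list[str]]:
    context = "\n".join(f"[{pid}] {text}" for pid, text in passages.items())
    prompt = (
        "Answer using ONLY the passages below. Cite passage IDs in brackets. "
        "If the passages do not contain the answer, reply exactly: INSUFFICIENT.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    answer = complete(prompt).strip()
    if answer == "INSUFFICIENT":
        raise LookupError("Sources do not support an answer; escalate to review.")
    cited = re.findall(r"\[([^\]]+)\]", answer)
    # Traceability check: every citation must point at a real passage.
    unknown = [pid for pid in cited if pid not in passages]
    if not cited or unknown:
        raise ValueError(f"Answer lacks valid citations (unknown: {unknown}).")
    return answer, cited
```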
Tier 3 — Assistive and generative
(highest variance, validation required)
These can be useful accelerators, but the outputs are harder to validate exhaustively. We generally avoid placing these in unattended workflows.
- Code generation and modification: translation, refactoring, test generation (reviewed)
- Data preparation and augmentation: labeling assistance, normalization, synthetic text generation (validated)
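
One way to keep a generative step out of the unattended path is to stage its output for human review rather than applying it directly. The sketch below assumes that pattern; the file-backed queue and field names are illustrative only.

```python
# Minimal sketch of isolating a generative component behind review.
# Output is staged for a human decision; nothing ships automatically.
import json
import pathlib
import uuid

REVIEW_DIR = pathlib.Path("review_queue")  # assumed location, adjust as needed

def stage_for_review(task: str, generated: str) -> pathlib.Path:
    REVIEW_DIR.mkdir(exist_ok=True)
    item = {
        "id": str(uuid.uuid4()),
        "task": task,
        "output": generated,
        "status": "pending_review",
    }
    path = REVIEW_DIR / f"{item['id']}.json"
    path.write_text(json.dumps(item, indent=2))
    return path  # a reviewer approves or rejects from here
```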
Implications for system design
- Automation is reserved for bounded tasks. If the output space is open-ended, we assume oversight is required.
- Validation is part of the system. We design checks, sampling, logging, and review paths rather than relying on prompts (see the sketch after this list).
- Generative components are isolated. When used, they sit behind review, constraints, or offline processes.
- Grounding is explicit. If a system answers questions, it should be clear what sources were used and what was inferred.
- Maintainability matters. We build for portability, documentation, observability, and long-term operational control.
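
A minimal sketch of what "validation is part of the system" can look like, assuming a hypothetical `run_model()` call and a task-specific `check()` validator: every output is checked and logged, failures go to a review path, and a sample of passing outputs is spot-checked as well.

```python
# Minimal sketch of validation as a system concern, not a prompt concern.
# `run_model` and `check` are hypothetical stand-ins for your own pipeline.
import logging
import random

logger = logging.getLogger("llm_pipeline")
REVIEW_SAMPLE_RATE = 0.05  # assumed rate; tune per task and risk level

def validated_call(inp: str, run_model, check, review_queue: list) -> str:
    out = run_model(inp)
    ok = check(out)  # task-specific validator, e.g. schema or label check
    logger.info("llm_call ok=%s input_len=%d output_len=%d", ok, len(inp), len(out))
    if not ok:
        review_queue.append((inp, out))  # failed check: human review path
        raise ValueError("Output failed validation; routed to review.")
    if random.random() < REVIEW_SAMPLE_RATE:
        review_queue.append((inp, out))  # spot-check passing outputs too
    return out
```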
What this means in practice
Fluent output does not equal dependable behavior. We design around real failure modes: missing context, ambiguous inputs, shifting requirements, and the tendency of generative systems to produce plausible text when they should instead fail safely.
LLMs are flexible, but systems are brittle. Some problems are better solved with rules, traditional machine learning, or structured workflows. We are explicit about that distinction.
This approach tends to produce systems that are easier to reason about, easier to audit, and less surprising to operate.
