AI Capability Gap Analysis
What AI actually can/can't do. Documented, not theorized.
The problem
AI capability discussions are dominated by hype and fear. What's missing: systematic documentation of what models actually do well vs. where they fail, based on real usage rather than benchmarks or speculation.
- Benchmarks measure narrow tasks, not practical capability
- Marketing claims don't match user experience
- Failure modes aren't systematically documented
- Capability evolution is rapid but poorly tracked
Data source
353K conversation messages (2023-2025) across:
- Coding — TypeScript, Python, infrastructure, debugging
- Research — Literature review, synthesis, fact-checking
- Writing — Technical docs, analysis, editing
- Reasoning — Problem decomposition, planning, decision support
All conversations preserved with metadata. Searchable via semantic embeddings.
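As a rough illustration of the search layer, a minimal sketch of embedding the corpus and querying it. The JSONL layout, field names, and the sentence-transformers model are assumptions for illustration, not the project's actual pipeline.

```python
# Hypothetical sketch: embed conversation messages and run a semantic search over them.
# The file path, field names (role, text, topic, timestamp), and model choice are assumptions.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def load_messages(path: str) -> list[dict]:
    """Read one message per line: {"role": ..., "text": ..., "topic": ..., "timestamp": ...}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def build_index(messages: list[dict]) -> np.ndarray:
    """Embed message text and L2-normalize so a dot product equals cosine similarity."""
    vectors = model.encode([m["text"] for m in messages], batch_size=64)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query: str, index: np.ndarray, messages: list[dict], k: int = 5) -> list[dict]:
    """Return the k messages most semantically similar to the query."""
    q = model.encode([query])[0]
    q /= np.linalg.norm(q)
    top = np.argsort(index @ q)[::-1][:k]
    return [messages[i] for i in top]

if __name__ == "__main__":
    msgs = load_messages("conversations.jsonl")  # hypothetical path
    idx = build_index(msgs)
    for hit in search("debugging a failing TypeScript build", idx, msgs):
        print(hit["timestamp"], hit["topic"], hit["text"][:80])
```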
Findings: What works
Code generation
- Boilerplate and pattern implementation
- API integration when docs are clear
- Refactoring with explicit constraints
- Test generation for defined behavior
Research
- Structured information aggregation
- Cross-referencing multiple sources
- Summarization of known domains
- Question generation and framing
Writing
- First drafts from bullet points
- Format transformation (prose ↔ structured)
- Editing with specific instructions
- Technical documentation
Findings: What fails
Code generation
- Novel algorithm design without examples
- System architecture decisions
- Debugging without clear reproduction
- Performance optimization (suggestions often target the wrong bottleneck)
Research
- Distinguishing reliable vs. unreliable sources
- Identifying what's actually unknown
- Novel synthesis beyond pattern matching
- Current events accuracy
Reasoning
- Multi-step logical chains (> 5 steps)
- Constraint satisfaction with many variables
- Catching its own errors without prompting
- Knowing when it doesn't know
Patterns observed
Capability is context-dependent
The same task succeeds or fails depending on how it's framed, what context is provided, how closely it resembles training data, and whether verification is possible.
Confidence doesn't correlate with accuracy
The model expresses the same confidence on correct and incorrect outputs. Users must verify independently; model self-assessment is unreliable.
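One way this gap could be quantified once outputs are hand-labeled; a minimal sketch, with field names and sample records invented for illustration rather than drawn from the corpus.

```python
# Hypothetical sketch: quantify how well expressed confidence tracks verified correctness.
# Each record pairs a stated confidence (0.0-1.0) with a human-verified label;
# the field names and sample data are assumptions.
import numpy as np

records = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.9, "correct": False},
    {"confidence": 0.7, "correct": True},
    {"confidence": 0.8, "correct": False},
]

conf = np.array([r["confidence"] for r in records], dtype=float)
correct = np.array([r["correct"] for r in records], dtype=float)

# Pearson correlation between confidence and correctness:
# a value near 0 means stated confidence carries little information about accuracy.
r = np.corrcoef(conf, correct)[0, 1]

# Calibration gap: mean confidence minus observed accuracy (positive = overconfident).
gap = conf.mean() - correct.mean()

print(f"confidence-accuracy correlation: {r:.2f}")
print(f"calibration gap: {gap:+.2f}")
```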
Iteration beats single-shot
Multi-turn refinement outperforms trying to get a perfect output on the first attempt. Correction feedback improves subsequent outputs.
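A minimal sketch of that refinement loop; `call_model` and `verify` are placeholders for whatever chat API and verification step (tests, linting, human review) a given workflow uses.

```python
# Hypothetical sketch of the iterate-and-correct pattern described above.
# call_model and verify are placeholders, not references to any specific API.
from typing import Callable

def refine(task: str,
           call_model: Callable[[list[dict]], str],
           verify: Callable[[str], tuple[bool, str]],
           max_turns: int = 4) -> str:
    """Ask for output, verify it, and feed the failure back as a correction prompt."""
    messages = [{"role": "user", "content": task}]
    output = call_model(messages)
    for _ in range(max_turns):
        ok, feedback = verify(output)
        if ok:
            return output
        # Correction feedback drives the next attempt instead of restarting from scratch.
        messages += [
            {"role": "assistant", "content": output},
            {"role": "user", "content": f"That failed verification: {feedback}. Please revise."},
        ]
        output = call_model(messages)
    return output  # best effort after max_turns
```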
Open questions
- How to predict which tasks will succeed vs. fail?
- What's the optimal human-AI task division?
- How do capabilities change across model versions?
- What prompt patterns reliably improve output quality?
- How to build workflows around capability gaps?
Status
Active. Data collected. Initial patterns documented. Structured analysis in progress.