AI Capability Gap Analysis
What AI actually can/can't do. Documented, not theorized.
The problem
AI capability discussions are dominated by hype and fear. What's missing: systematic documentation of what models actually do well vs. where they fail, based on real usage rather than benchmarks or speculation.
- Benchmarks measure narrow tasks, not practical capability
- Marketing claims don't match user experience
- Failure modes aren't systematically documented
- Capability evolution is rapid but poorly tracked
Data source
353K conversation messages (2023-2025) across:
- Coding — TypeScript, Python, infrastructure, debugging
- Research — Literature review, synthesis, fact-checking
- Writing — Technical docs, analysis, editing
- Reasoning — Problem decomposition, planning, decision support
All conversations preserved with metadata. Searchable via semantic embeddings.
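As a rough illustration of the search layer, a minimal sketch of embedding the corpus and querying it. The JSONL layout, field names, and the sentence-transformers model are assumptions for illustration, not the project's actual pipeline.

```python
# Hypothetical sketch: embed conversation messages and run a semantic search over them.
# The file path, field names (role, text, topic, timestamp), and model choice are assumptions.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def load_messages(path: str) -> list[dict]:
    """Read one message per line: {"role": ..., "text": ..., "topic": ..., "timestamp": ...}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def build_index(messages: list[dict]) -> np.ndarray:
    """Embed message text and L2-normalize so a dot product equals cosine similarity."""
    vectors = model.encode([m["text"] for m in messages], batch_size=64)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query: str, index: np.ndarray, messages: list[dict], k: int = 5) -> list[dict]:
    """Return the k messages most semantically similar to the query."""
    q = model.encode([query])[0]
    q /= np.linalg.norm(q)
    top = np.argsort(index @ q)[::-1][:k]
    return [messages[i] for i in top]

if __name__ == "__main__":
    msgs = load_messages("conversations.jsonl")  # hypothetical path
    idx = build_index(msgs)
    for hit in search("debugging a failing TypeScript build", idx, msgs):
        print(hit["timestamp"], hit["topic"], hit["text"][:80])
```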
Findings: What works
Code generation
- Boilerplate and pattern implementation
- API integration when docs are clear
- Refactoring with explicit constraints
- Test generation for defined behavior
Research
- Structured information aggregation
- Cross-referencing multiple sources
- Summarization of known domains
- Question generation and framing
Writing
- First drafts from bullet points
- Format transformation (prose ↔ structured)
- Editing with specific instructions
- Technical documentation
Findings: What fails
Code generation
- Novel algorithm design without examples
- System architecture decisions
- Debugging without clear reproduction
- Performance optimization (suggestions often target the wrong bottleneck)
Research
- Distinguishing reliable vs. unreliable sources
- Identifying what's actually unknown
- Novel synthesis beyond pattern matching
- Current events accuracy
Reasoning
- Multi-step logical chains (> 5 steps)
- Constraint satisfaction with many variables
- Catching its own errors without prompting
- Knowing when it doesn't know
Patterns observed
Capability is context-dependent
The same task succeeds or fails depending on how it's framed, what context is provided, how closely it resembles training data, and whether verification is possible.
Confidence doesn't correlate with accuracy
The model expresses the same confidence on correct and incorrect outputs. Users must verify independently; model self-assessment is unreliable.
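One way this gap could be quantified once outputs are hand-labeled; a minimal sketch, with field names and sample records invented for illustration rather than drawn from the corpus.

```python
# Hypothetical sketch: quantify how well expressed confidence tracks verified correctness.
# Each record pairs a stated confidence (0.0-1.0) with a human-verified label;
# the field names and sample data are assumptions.
import numpy as np

records = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.9, "correct": False},
    {"confidence": 0.7, "correct": True},
    {"confidence": 0.8, "correct": False},
]

conf = np.array([r["confidence"] for r in records], dtype=float)
correct = np.array([r["correct"] for r in records], dtype=float)

# Pearson correlation between confidence and correctness:
# a value near 0 means stated confidence carries little information about accuracy.
r = np.corrcoef(conf, correct)[0, 1]

# Calibration gap: mean confidence minus observed accuracy (positive = overconfident).
gap = conf.mean() - correct.mean()

print(f"confidence-accuracy correlation: {r:.2f}")
print(f"calibration gap: {gap:+.2f}")
```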
Iteration beats single-shot
Multi-turn refinement outperforms trying to get a perfect output on the first attempt. Correction feedback improves subsequent outputs.
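A minimal sketch of that refinement loop; `call_model` and `verify` are placeholders for whatever chat API and verification step (tests, linting, human review) a given workflow uses.

```python
# Hypothetical sketch of the iterate-and-correct pattern described above.
# call_model and verify are placeholders, not references to any specific API.
from typing import Callable

def refine(task: str,
           call_model: Callable[[list[dict]], str],
           verify: Callable[[str], tuple[bool, str]],
           max_turns: int = 4) -> str:
    """Ask for output, verify it, and feed the failure back as a correction prompt."""
    messages = [{"role": "user", "content": task}]
    output = call_model(messages)
    for _ in range(max_turns):
        ok, feedback = verify(output)
        if ok:
            return output
        # Correction feedback drives the next attempt instead of restarting from scratch.
        messages += [
            {"role": "assistant", "content": output},
            {"role": "user", "content": f"That failed verification: {feedback}. Please revise."},
        ]
        output = call_model(messages)
    return output  # best effort after max_turns
```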
Open questions
- How to predict which tasks will succeed vs. fail?
- What's the optimal human-AI task division?
- How do capabilities change across model versions?
- What prompt patterns reliably improve output quality?
- How to build workflows around capability gaps?
Status
Active. Data collected. Initial patterns documented. Structured analysis in progress.