AIED 2026: LLM Difficulty Estimation Across 200 Conditions
We're presenting new work at AIED 2026 on a question that matters for anyone building AI-powered assessments: can large language models accurately estimate how hard a test question is?
Item difficulty estimation is foundational to adaptive testing, item bank development, and quality assurance. Traditional approaches (classical item analysis or IRT calibration) require hundreds of student responses per item. If LLMs could reliably estimate difficulty from the item text alone, it would dramatically accelerate assessment development, especially for new content areas where student response data doesn't yet exist.
What we tested
We evaluated 15+ language models across 200 experimental conditions, varying prompt strategies, item formats, subject areas, and grade levels. We used our open item bank of 34,000+ CC-licensed assessment items with known psychometric parameters as ground truth.
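A condition grid like this is typically the cross-product of the experimental factors. The sketch below illustrates the mechanism with made-up factor levels; the paper's actual models, factors, and level counts differ, and this toy grid does not multiply out to 200.

```python
from itertools import product

# Hypothetical factor levels -- purely illustrative, not the study's design.
models = ["model_a", "model_b"]
prompts = ["zero_shot", "few_shot", "rubric"]
subjects = ["math", "reading"]
grades = ["grade_4", "grade_8"]

# Enumerate every combination of factor levels as one experimental condition.
conditions = [
    {"model": m, "prompt": p, "subject": s, "grade": g}
    for m, p, s, g in product(models, prompts, subjects, grades)
]
print(len(conditions))  # 2 * 3 * 2 * 2 = 24 in this toy grid
```

Each condition then gets evaluated against the same ground-truth items, which is what makes results comparable across cells of the grid.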
Key findings
The short version: LLMs can estimate difficulty, but with important caveats. Performance varies significantly by model, subject area, and prompt strategy. Some combinations achieve correlations above 0.7 with empirical difficulty, which is useful for rough calibration. Others fail badly, especially on items that are hard for reasons LLMs don't naturally attend to: misleading distractors, heavy reading load, or gaps in prerequisite knowledge.
The full paper, dataset, and evaluation code will be published openly after the conference. This is exactly the kind of infrastructure work Impact-Edu.ai was built to do: rigorous evaluation that benefits the entire field, not just one vendor.