QA Evaluation
Used human judges to assess correctness of response strings
- document provides context for answer
- NIST assessors trained for this task
- 3 independent assessments per question
- assessor judgments do differ
- final judgment set used adjudicated majority opinion...
- ... but found such an expensive judgment set is not necessary
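
As a rough illustration of the majority-opinion step, the sketch below combines three independent judgments per question. The label names, question ids, and the `majority_judgment` helper are hypothetical, and the slide does not say how adjudication was actually carried out. Here a question with no strict majority is simply flagged with `None` for a separate adjudication pass.

```python
from collections import Counter

def majority_judgment(judgments):
    """Return the label given by a strict majority of assessors, else None."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    # A strict majority needs more than half of the votes.
    return label if votes > len(judgments) / 2 else None

# Hypothetical data: three independent assessments per question.
assessments = {
    "q1": ["correct", "correct", "incorrect"],
    "q2": ["incorrect", "incorrect", "incorrect"],
    "q3": ["correct", "unsupported", "incorrect"],
}

# None marks a question where the assessors split and adjudication is needed.
final = {qid: majority_judgment(labels) for qid, labels in assessments.items()}
print(final)
```

With exactly three assessors and two possible labels a strict majority always exists; the `None` case only arises when more than two labels are in use, as in the hypothetical `q3`.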