Benchmark Results

Accuracy comparison across models and benchmarks from the paper

ARC Challenge

OpenBookQA

GSM8K

MMLU-Pro

MATH

NameIndex

MiddleMatch

Example Comparisons

See how prompt repetition changes actual model responses

Multiple Choice

Math

Retrieval