Introduction

Introducing LiveBench: a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:

  • LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
  • Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge (see the scoring sketch just after this list).
  • LiveBench currently contains 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.
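
A minimal sketch of what this kind of objective scoring can look like is below. It is not LiveBench's actual scoring code; the function names and normalization rules are illustrative assumptions. The point is that each answer is checked programmatically against a known ground truth, with no LLM judge involved.

import re

# Illustrative only: a toy exact-match scorer against an objective ground truth.
# LiveBench's real per-task scorers are more involved; these names are hypothetical.
def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and strip trailing punctuation before comparing."""
    return re.sub(r"\s+", " ", text.strip().lower()).strip(" .")

def score_exact_match(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the normalized answers match exactly, else 0.0."""
    return float(normalize(model_answer) == normalize(ground_truth))

print(score_exact_match("  The answer is 42. ", "the answer is 42"))  # prints 1.0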

We will evaluate your model on LiveBench! Open a GitHub issue or email us at livebench.ai@gmail.com!

Leaderboard

We update questions each month so that the benchmark completely refreshes every 6 months. The initial version was LiveBench-2024-06-24. The next version, LiveBench-2024-07-25, added more coding questions and a new spatial reasoning task. All questions are available here. The most recent version is LiveBench-2024-08-31, with updated math questions.
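
As a rough sketch, assuming the questions are mirrored as Hugging Face datasets under a livebench organization (e.g. a livebench/math dataset with a livebench_release_date field on each record), you could load and filter them as shown below; treat the dataset, split, and field names as placeholders and check the question release linked above for the canonical paths.

from datasets import load_dataset

# Placeholder dataset, split, and field names; verify against the official question release.
questions = load_dataset("livebench/math", split="test")

# Keep only the questions belonging to one monthly release.
release = "2024-08-31"
subset = [q for q in questions if str(q.get("livebench_release_date", "")).startswith(release)]
print(f"{len(subset)} questions in the {release} release")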

Note: the o1 results are preliminary! Since the o1 models introduce a new inference paradigm, we will continue to double-check their outputs, as well as the default inference settings and prompting techniques in LiveBench (for all models, not just the o1 models). LiveBench is truly "live", and we will update it as needed in response to new developments in the field.
[Interactive leaderboard omitted: the 2024-08-31 release table lists each Model and its Global Average score.]

BibTeX


@article{livebench,
  author    = {White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah},
  title     = {LiveBench: A Challenging, Contamination-Free LLM Benchmark},
  journal   = {arXiv preprint arXiv:2406.19314},
  year      = {2024},
}