Chatbot Arena Elo measures relative user preference from blind head-to-head votes on LMArena. It is a live human-judgment signal for conversational quality on real user prompts.
Large-scale blinded pairwise voting reduces single-rater bias; continuous updates surface quality shifts quickly after model releases.
Voter population is self-selected and may not match enterprise or domain users; style, verbosity, and safety tone can influence votes independently of factual correctness.
Higher is better. Scores are Elo-like ratings (typically 900-1400). Compare relative rankings rather than absolute values, which shift as new models enter.
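To see why the gap between two ratings matters more than either absolute value, here is a minimal sketch of the textbook Elo model. It is an illustration only: the function names, the 400-point scale, and K=32 are assumptions taken from the classic chess formulation, and the leaderboard's actual fitting procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Predicted probability that model A is preferred over model B.

    Uses the classic logistic Elo curve (scale=400 is an assumed default).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Apply one vote. outcome_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.

    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    # Points gained by one model are lost by the other.
    return rating_a + delta, rating_b - delta


# The prediction depends only on the rating difference, not the absolute level.
print(expected_score(1300, 1200))  # about 0.64
print(expected_score(1100, 1000))  # same value: the gap is still 100 points

# Two equally rated models: the winner gains what the loser gives up.
print(update(1200, 1200, 1.0))
```

Under these assumptions a 100-point lead corresponds to roughly a 64% predicted preference rate, whether the pair sits at 1300 vs 1200 or 1100 vs 1000.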