Artificial Analysis announces AA-Omniscience, a benchmark for knowledge and hallucination across 40+ topics; Claude 4.1 Opus takes first place in its key metric
@artificialanlys: .@AnthropicAI takes the top three spots for lowest hallucination rate, with Claude 4.5 Haiku leading at 28%, over three times lower than GPT-5 (high) and Gemini 2.5 Pro. Claude 4.5 Sonnet and Claude 4.1 Opus follow in second and third at 48%. [image]

Ethan Mollick / @emollick: I applaud the effort to have a new measure of hallucination, but I think this is ultimately just a measure of the threshold of refusals in answering trivia questions given a set system prompt. The questions are incredibly specific and nearly impossible without web lookup.

@scaling01: LLMs might be stupid after all. They can refuse to answer questions on which they are uncertain without getting a penalty; nonetheless, there are only three models that answer slightly more questions correctly than incorrectly: Claude 4.1 Opus, GPT-5.1, and Grok 4. [image]

@teortaxestex: > hallucination/knowledge eval — Unsurprisingly, it *very* well tracks the raw model scale of open ones. I think V3.2 may trade places with 0528 after bug fixes. The only question is: *Haiku*?

@artificialanlys: Models with the highest accuracy, including Grok 4, GPT-5.1, and Gemini 2.5 Pro, do not lead the Omniscience Index due to their tendency to guess rather than abstain. Claude 4.1 Opus has the best balance of accuracy (31%) and hallucination (48%), giving it the highest score on the index. [image]

@zephyr_z9: This is a great benchmark. [image]

@artificialanlys: Grok 4 by @xai, GPT-5 by @OpenAI, and Gemini 2.5 Pro by @GoogleDeepMind achieve the highest accuracy in AA-Omniscience. They do not achieve the highest Omniscience Index because of the low hallucination rates of @AnthropicAI's Claude models. [image]

@artificialanlys: Larger models tend to have higher levels of embedded knowledge, with Kimi K2 Thinking and DeepSeek R1 (0528) topping accuracy charts over smaller models. This advantage does not always hold on the Omniscience Index; for example, Llama 3.1 405B from @AIatMeta beats the larger Kimi K2. [image]

Max Weinbach / @mweinbach: This is a great benchmark, and none of the open-weight models right now are, imo, usable in any meaningful agentic workload because of this. Like... it needs a positive number, and some of these being SO wrong is insane, though search tools + grounding could help.

@artificialanlys: Read more about the evaluation and methodology in our AA-Omniscience paper (published arXiv link coming later today): https://huggingface.co/... Explore sample questions and evaluate your model on the public set of AA-Omniscience with our HuggingFace dataset.

@artificialanlys: Models differ in their performance across the six domains of AA-Omniscience; no model dominates across all. While @AnthropicAI's Claude 4.1 Opus leads in Law, Software Engineering, and Humanities & Social Sciences, GPT-5.1 from @OpenAI achieves the highest Omniscience Index on [image]
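To relate the three numbers quoted in the thread, here is a minimal Python sketch of one plausible reading of the metrics: accuracy counts correct answers over all questions, the hallucination rate measures how often a model guesses wrong rather than abstaining when it does not know, and an abstention-aware index of correct-minus-incorrect goes negative for models that guess freely (which is why "it needs a positive number" and why only three models clear zero). The class, field names, and exact formulas below are assumptions inferred from the thread, not Artificial Analysis's published methodology.

```python
# Hypothetical sketch of an abstention-aware benchmark score, inferred
# from the thread's description of AA-Omniscience. Formulas are
# assumptions, not the published methodology.
from dataclasses import dataclass

@dataclass
class Result:
    correct: int    # questions answered correctly
    incorrect: int  # questions answered but wrong (hallucinations)
    abstained: int  # questions the model declined to answer

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained

    def accuracy(self) -> float:
        """Share of all questions answered correctly."""
        return self.correct / self.total

    def hallucination_rate(self) -> float:
        """Among questions not answered correctly, how often the model
        guessed wrong instead of abstaining (assumed definition)."""
        missed = self.incorrect + self.abstained
        return self.incorrect / missed if missed else 0.0

    def omniscience_index(self) -> float:
        """Correct minus incorrect as a share of all questions:
        abstentions are neutral, wrong guesses are penalized."""
        return (self.correct - self.incorrect) / self.total

# Illustrative numbers only, not any model's actual results.
r = Result(correct=300, incorrect=450, abstained=250)
print(f"accuracy          {r.accuracy():.0%}")            # 30%
print(f"hallucination     {r.hallucination_rate():.0%}")  # 64%
print(f"omniscience index {r.omniscience_index():+.0%}")  # -15%
```

Under this reading, a high-accuracy model that always guesses can still land below zero, while a lower-accuracy model that abstains when unsure can come out ahead, matching the thread's observation that Grok 4, GPT-5.1, and Gemini 2.5 Pro top accuracy but not the index.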