
Andrej Karpathy’s latest project, LLM Council, is drawing attention across the AI community for a fresh approach to model comparison: using AI systems themselves as the evaluators. In the setup, multiple large language models anonymously critique and rank each other’s answers without knowing which model produced each one. In early results, OpenAI’s GPT-5.1 has emerged as the strongest performer, even though recent benchmark reports suggested Google’s Gemini 3.0 may have taken a competitive edge.
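Karpathy’s actual project wires this loop up against live model APIs; purely as an illustration of the anonymous peer-ranking idea described above, here is a minimal sketch in which `query_model` and the ranking function are hypothetical stand-ins for real model calls:

```python
import random

def query_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion API call;
    # here each "model" just returns a canned string.
    return f"[{model}] answer to: {prompt}"

def council_round(models, prompt, rank_fn, seed=0):
    """One anonymous peer-review round in the spirit of LLM Council.

    1. Every model answers the prompt.
    2. Answers are shuffled and given neutral labels, hiding authorship.
    3. Every model ranks the anonymized answers (best first).
    4. Rankings are aggregated by mean position (lower is better).
    """
    rng = random.Random(seed)
    answers = {m: query_model(m, prompt) for m in models}

    # Anonymize: shuffle and relabel so reviewers can't tell who wrote what.
    shuffled = list(answers.items())
    rng.shuffle(shuffled)
    labels = {f"Response {chr(65 + i)}": m for i, (m, _) in enumerate(shuffled)}
    anonymous = {lbl: ans for lbl, (_, ans) in zip(labels, shuffled)}

    # Each reviewer returns an ordering of the labels, best first.
    positions = {lbl: [] for lbl in anonymous}
    for reviewer in models:
        ranking = rank_fn(reviewer, anonymous)
        for pos, lbl in enumerate(ranking):
            positions[lbl].append(pos)

    # De-anonymize and sort models by their average rank position.
    mean_rank = {labels[lbl]: sum(p) / len(p) for lbl, p in positions.items()}
    return sorted(mean_rank, key=mean_rank.get)
```

In a real deployment, `rank_fn` would itself prompt the reviewer model to order the anonymized responses; the version here only fixes the shape of the loop, not the judgment.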
Karpathy shared his observations while testing the method on longer-form comprehension tasks: “The models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model.” He also acknowledged that the peer-generated rankings sometimes diverge from his own assessment, noting that GPT-5.1 can feel “too wordy,” while Gemini 3.0 tends to be “more condensed,” and Claude “too terse.”
What makes LLM Council significant is its focus on real-world reasoning and preference judgment instead of purely quantitative benchmarks. Models not only evaluate correctness but also critique clarity, depth, and relevance—factors that matter to end-users but are often harder to encode in traditional testing.
Interestingly, the experiment has already sparked discussion among developers and researchers. Vasuman M., founder and CEO of Varick AI Agents, responded to Karpathy’s update, saying he had previously built a similar internal assessment system and observed the same performance hierarchy. He stated that GPT-5.1 won “every single time,” adding that when competing models are told a response originated from GPT, “they fold immediately and start correcting themselves.”
Karpathy has emphasized that the system itself is still an informal prototype, created quickly using a lightweight coding approach and not intended as a definitive capability ranking. However, its early outputs are prompting broader reflection on how AI evaluations should evolve—particularly as generative models become more comparable in quality and more embedded in high-stakes business and research workflows.
With AI developers, analysts, and users increasingly seeking transparent and practical benchmarks, LLM Council could serve as an influential step toward more human-aligned evaluation frameworks—where intelligence is judged not just by test scores, but by how effectively models critique, learn, and reason in collaboration.
