
Andrej Karpathy’s latest project, LLM Council, is drawing attention across the AI community for a fresh approach to model comparison: using AI systems themselves as the evaluators. In the setup, multiple large language models anonymously critique and rank each other’s answers without knowing which model produced each one. In early results, OpenAI’s GPT-5.1 has emerged as the strongest performer, even though recent benchmark reports suggested Google’s Gemini 3.0 may have taken a competitive edge.
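Karpathy’s actual project wires this loop up against live model APIs; purely as an illustration of the anonymous peer-ranking idea described above, here is a minimal sketch in which `query_model` and the ranking function are hypothetical stand-ins for real model calls:

```python
import random

def query_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion API call;
    # here each "model" just returns a canned string.
    return f"[{model}] answer to: {prompt}"

def council_round(models, prompt, rank_fn, seed=0):
    """One anonymous peer-review round in the spirit of LLM Council.

    1. Every model answers the prompt.
    2. Answers are shuffled and given neutral labels, hiding authorship.
    3. Every model ranks the anonymized answers (best first).
    4. Rankings are aggregated by mean position (lower is better).
    """
    rng = random.Random(seed)
    answers = {m: query_model(m, prompt) for m in models}

    # Anonymize: shuffle and relabel so reviewers can't tell who wrote what.
    shuffled = list(answers.items())
    rng.shuffle(shuffled)
    labels = {f"Response {chr(65 + i)}": m for i, (m, _) in enumerate(shuffled)}
    anonymous = {lbl: ans for lbl, (_, ans) in zip(labels, shuffled)}

    # Each reviewer returns an ordering of the labels, best first.
    positions = {lbl: [] for lbl in anonymous}
    for reviewer in models:
        ranking = rank_fn(reviewer, anonymous)
        for pos, lbl in enumerate(ranking):
            positions[lbl].append(pos)

    # De-anonymize and sort models by their average rank position.
    mean_rank = {labels[lbl]: sum(p) / len(p) for lbl, p in positions.items()}
    return sorted(mean_rank, key=mean_rank.get)
```

In a real deployment, `rank_fn` would itself prompt the reviewer model to order the anonymized responses; the version here only fixes the shape of the loop, not the judgment.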
Karpathy shared his observations while testing the method on longer-form comprehension tasks: “The models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model.” He also acknowledged that the peer-generated rankings sometimes diverge from his own assessment, noting that GPT-5.1 can feel “too wordy,” while Gemini 3.0 tends to be “more condensed,” and Claude “too terse.”
What makes LLM Council significant is its focus on real-world reasoning and preference judgment instead of purely quantitative benchmarks. Models not only evaluate correctness but also critique clarity, depth, and relevance—factors that matter to end-users but are often harder to encode in traditional testing.
Interestingly, the experiment has already sparked discussion among developers and researchers. Vasuman M., founder and CEO of Varick AI Agents, responded to Karpathy’s update, saying he had previously built a similar internal assessment system and observed the same performance hierarchy. He stated that GPT-5.1 won “every single time,” adding that when competing models are told a response originated from GPT, “they fold immediately and start correcting themselves.”
Karpathy has emphasized that the system itself is still an informal prototype, created quickly using a lightweight coding approach and not intended as a definitive capability ranking. However, its early outputs are prompting broader reflection on how AI evaluations should evolve—particularly as generative models become more comparable in quality and more embedded in high-stakes business and research workflows.
With AI developers, analysts, and users increasingly seeking transparent and practical benchmarks, LLM Council could serve as an influential step toward more human-aligned evaluation frameworks—where intelligence is judged not just by test scores, but by how effectively models critique, learn, and reason in collaboration.
