David Pawlan
Co-Founder
Hey y’all,
From AI scientists to shady leaderboard tactics and mini models beating giants, this week’s AI news is full of shake-ups. Microsoft’s new reasoning models are surprisingly powerful (and phone-sized), Claude got a serious productivity upgrade, and decentralized models might just be the next big thing.
Let’s break it all down — fast.
🏆 LMArena’s leaderboard credibility questioned
A damning study from Cohere Labs, MIT, and Stanford suggests the popular LMArena leaderboard might be rigged in favor of tech giants like OpenAI and Google. Allegations include private testing, silent model removals, and biased sampling. LMArena denies wrongdoing, but the episode casts doubt on benchmark integrity — just as Llama 4 Maverick’s drama fades.
Why it matters: Leaderboards shape perception — and funding. If they’re gamed, the whole AI model race loses meaning.
🧠 Microsoft and Anthropic go small, go smart
Microsoft’s new Phi-4 models show that small can be mighty. The flagship 14B-parameter Phi-4-reasoning outpaces larger models like o1-mini and even holds up against DeepSeek's 671B titan. Meanwhile, Anthropic’s new Claude Integrations take the setup pain out of MCP, letting Claude plug into apps like Zapier or Square, and its upgraded Research mode can now dig through the web and connected apps for up to 45 minutes.
Why it matters: Power is shifting from bloated models to nimble, task-specific ones that run on your laptop or smartphone — no data center required.
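Want to kick the tires on one of these small models yourself? Here’s a minimal sketch using Hugging Face transformers. The model id ("microsoft/Phi-4-reasoning") and generation settings are assumptions on my part, so check the model card for the exact name and hardware requirements.

```python
# Minimal sketch: run a small reasoning model locally with Hugging Face transformers.
# The model id is assumed; a 14B model still wants a decent GPU or a quantized build,
# but nothing close to a data center.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumed id, check the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Walk through your reasoning: is 97 prime?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Strip the prompt tokens and print only the model's reply
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```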
🔬 AI scientists enter the chat
FutureHouse launched “AI scientists” — agents that can review research, answer deep scientific questions, and in one case (hello, Phoenix), help you design new chemistry experiments from scratch. This push into public-facing research agents is backed by none other than Eric Schmidt.
Why it matters: It’s a glimpse into a future where AI doesn’t just summarize papers — it creates the next breakthroughs.
🌐 A new kind of AI model: decentralized and user-owned
Vana and Flower Labs are teaming up to build a “user-owned” large language model, Collective-1. The model is powered by volunteered compute and personal data, with the goal of reaching 100B parameters.
Why it matters: This decentralized approach could let smaller players compete with tech giants — and give users control over their data (finally).
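For a sense of how “volunteered compute” training works under the hood, here’s a toy sketch of federated averaging, the basic pattern behind frameworks like Flower: each participant trains on data that never leaves their machine and only shares weight updates, which a coordinator averages. This is a generic illustration, not Collective-1’s actual training recipe.

```python
# Toy federated averaging: volunteers train locally, a coordinator averages the results.
import numpy as np

def local_update(weights: np.ndarray, local_data: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Stand-in for a local training step: nudge weights toward the local data mean."""
    gradient = weights - local_data.mean(axis=0)
    return weights - lr * gradient

def federated_average(weight_list: list[np.ndarray]) -> np.ndarray:
    """Coordinator step: average the weights returned by all participants."""
    return np.mean(weight_list, axis=0)

# Three "volunteers", each with private data that stays on their own machine.
rng = np.random.default_rng(0)
global_weights = np.zeros(4)
volunteer_data = [rng.normal(loc=i, size=(100, 4)) for i in range(3)]

for round_num in range(5):
    local_weights = [local_update(global_weights, data) for data in volunteer_data]
    global_weights = federated_average(local_weights)
    print(f"round {round_num}: {np.round(global_weights, 3)}")
```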
New this week:
🎶 Suno v4.5 debuts 8-minute AI songs and better genre control
📻 Australian radio ran an AI host for 6 months — no one noticed
🕵️ Google’s AMIE now reads medical images during diagnosis
🛰️ AI is now uncannily good at guessing where a photo was taken
Conduct Recursive Research Iterations
Prompt: Act as a recursive research optimizer. Analyze results, identify gaps, refine searches, and repeat until you hit max-quality insights.
Use it when you're deep in research mode and want to push past surface-level summaries.
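If you’d rather wire that pattern into a script than paste the prompt by hand, here’s a rough sketch of the loop: draft findings, have the model name its own gaps, refine, repeat. The SDK, model name, and iteration count are my assumptions, not part of the original prompt.

```python
# Rough sketch of the "recursive research optimizer" loop using the OpenAI Python SDK.
# Each pass feeds the previous findings back in, so the model critiques and extends
# its own work instead of restating the first summary.
from openai import OpenAI

client = OpenAI()

def recursive_research(question: str, max_iterations: int = 3) -> str:
    notes = ""
    for _ in range(max_iterations):
        prompt = (
            "Act as a recursive research optimizer.\n"
            f"Question: {question}\n"
            f"Current findings (may be empty):\n{notes}\n\n"
            "1. Summarize what is solidly established so far.\n"
            "2. Identify the biggest gaps or weak claims.\n"
            "3. Propose refined sub-questions, then answer them.\n"
            "Return the improved findings."
        )
        response = client.chat.completions.create(
            model="gpt-4o",  # assumed model name
            messages=[{"role": "user", "content": prompt}],
        )
        notes = response.choices[0].message.content
    return notes

print(recursive_research("What do we actually know about small reasoning models?"))
```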
Benchmark trust is breaking down, Microsoft and Anthropic are proving small can be powerful, and AI scientists are moving from theory to hands-on discovery. Meanwhile, decentralized models and Claude’s new integrations offer a peek at AI’s more open, connected future.
Catch you on the next iteration,
—David