How We Broke Top AI Agent Benchmarks: And What Comes Next
Who would have thought… “The benchmarks aren’t measuring what you think they’re measuring” https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
And leads to strange results.
Unless you think it will solve itself.
“The most interesting thing about the Moon is us.”
I am always surprised how many people do not understand this. “Models do not (broadly speaking) learn over time. They can be tuned by their operators, or periodically rebuilt with new inputs or feedback from users and experts. Models also… Read More »The Future of Everything is Lies, I Guess
This is why virtue signaling is dumb. “virtus, channeled by the other virtues, leads to admirable deeds” https://acoup.blog/2024/03/29/fireside-friday-march-29-2024-on-roman-values/
More of this. “Christina Koch became the first woman to travel to the vicinity of the moon. The last time humans went to the moon, women could not have their own credit card” https://lizplank.substack.com/p/artemis-ii-is-competency-porn-and
Central committees rarely work. “In 1958, Mao ordered every village in China to produce steel. Farmers melted down their cooking pots in backyard furnaces and reported spectacular numbers. The steel was useless. The crops rotted. Thirty million people starved. In 2026,… Read More »The AI Great Leap Forward
It is all about stories. “The corporation does not wonder whether formation is possible. It has quarterly OKRs around it.” https://tantaman.com/2026-04-06-formation-we-admit-to.html
The Matrix was really ahead of its time. I do not think we will have an AI apocalypse, but the current “agent” talk was perfectly captured by the scene with the Agent Smiths. [Several Agent Smith Clones walk in] Agent Smith Clone 1:… Read More »“Agents”