The AI Scientist, Another Step Closer to AGI?

Sakana AI, Grok, OpenAI, Anthropic

The research community is closely watching the advances needed to unlock artificial general intelligence (AGI). This week we may be one step closer: new research from Sakana AI outlines how the company has automated the scientific research lifecycle. We also see notable progress in autonomous software engineering agents, demonstrated by the team at Cosine. Meanwhile, the latest chatbot rankings show a highly competitive landscape in speed, price, and capability.

We’ve also launched a podcast covering some of the breaking news stories. If you prefer video, subscribe to our YouTube channel! Check out this week’s episode, where we cover OpenAI Strawberry, Google’s ping-pong robot, and the price cuts from OpenAI and Google.

— Sasha Krecinic

Sakana AI has launched The AI Scientist, an AI system that automates scientific research. The company claims it can manage the entire machine learning research lifecycle. Developed with researchers from the University of Oxford and the University of British Columbia, the system is open source and has produced papers in areas such as language modeling and diffusion. Many in the research community see this as a potential flywheel that could let AI systems improve themselves. [The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery]

Cosine AI says it has built an AI software engineer that scored 30.08% on the SWE-Bench benchmark, surpassing Amazon and Cognition. The model is designed to mimic human software engineering behavior, and Cosine claims it scores 50% on the SWE-Bench Lite benchmark, with potential to take on further challenges in the future. [via @AlistairPullen]

OpenAI's ChatGPT-4o has taken the lead in the Chatbot Arena with a score of 1314, surpassing Google's Gemini-1.5-Pro-Exp. It collected more than 11,000 community votes while being tested under the masked name "anonymous chatbot". The model ranks highly in Math, Coding, and Instruction-Following, with a notable improvement in coding, where it scores over 30 points higher than its predecessor. Meanwhile, Grok 2, a recent addition, reached third place overall and ranked #2 in Coding, #2 in Math, and #4 in Hard Prompts. Super impressive results from the newcomer! [lmsys.org]

Anthropic has introduced a prompt caching feature for its Claude AI models, now available in public beta, which lets developers cache frequently used context between API calls. Anthropic says the feature can reduce costs by up to 90% and latency by up to 85%, which suits use cases like conversational agents, coding assistants, and large document processing. Writing content to the cache costs slightly more than a normal input, while reading cached content is significantly cheaper. Early adopters like Notion are already seeing substantial benefits. [Prompt caching with Claude]
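
To make the pricing trade-off concrete, here is a rough back-of-the-envelope sketch in Python. The function name, the base price of $3 per million input tokens, and the write and read multipliers (1.25x and 0.10x) are illustrative assumptions, not figures from the summary above, so check the current pricing page before relying on them. The sketch also assumes the cache stays warm across all calls.

```python
def caching_cost(prefix_tokens, calls, base_price_per_mtok,
                 write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached_cost, cached_cost) in dollars for a shared prompt prefix.

    All prices and multipliers are hypothetical placeholders.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    per_token = base_price_per_mtok / 1_000_000
    # Without caching, every call pays full price for the whole prefix.
    uncached = calls * prefix_tokens * per_token
    # With caching, the first call writes the prefix at a premium and
    # each later call reads it at a discount (assumes no cache expiry).
    cached = prefix_tokens * per_token * (
        write_multiplier + (calls - 1) * read_multiplier
    )
    return uncached, cached


# Example: a 100,000-token document reused across 50 calls.
uncached, cached = caching_cost(
    prefix_tokens=100_000, calls=50, base_price_per_mtok=3.00
)
savings = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, savings {savings:.0%}")
```

Under these assumptions, caching costs more than a plain request when the prefix is used only once, and the savings grow toward the read discount as the same prefix is reused across more calls.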

Google has introduced "Gemini Live," a new feature that enhances conversational AI interactions on Android devices. Available to Gemini Advanced subscribers, Gemini Live allows for more natural conversations: users can brainstorm ideas, interrupt to ask questions, and pause chats to resume later. The feature is now rolling out in English and aims to make mobile AI interactions smoother and more responsive, with significant potential for accessibility and productivity improvements. [Gemini makes your mobile device a powerful AI assistant]