Apple’s upgraded AI models underwhelm on performance


Apple has announced updates to the AI models that power its suite of Apple Intelligence features across iOS, macOS, and more. But according to the company’s own benchmarks, the models underperform older models from rival tech firms, including OpenAI.

Apple said in a blog post Monday that human testers rated the quality of text generated by its newest “Apple On-Device” model — which runs offline on devices such as the iPhone — “comparably” to, but not better than, text from similarly sized Google and Alibaba models. Meanwhile, those same testers rated Apple’s more capable new model, called “Apple Server” and designed to run in the company’s data centers, behind OpenAI’s year-old GPT-4o.

In a separate test evaluating how well Apple’s models analyze images, human raters preferred Meta’s Llama 4 Scout over Apple Server, according to Apple. That’s a bit surprising, given that Llama 4 Scout performs worse than leading models from AI labs like Google, Anthropic, and OpenAI on a number of benchmarks.

The benchmark results add credence to reports suggesting Apple’s AI research division has struggled to catch up to competitors in the cutthroat AI race. Apple’s AI capabilities in recent years have underwhelmed, and a promised Siri upgrade has been delayed indefinitely. Some customers have sued Apple, accusing the firm of marketing AI features for its products that it hasn’t yet delivered.

In addition to generating text, Apple On-Device, which is roughly 3 billion parameters in size, drives features like summarization and text analysis. (Parameter count is a rough measure of a model’s size and capacity; models with more parameters generally perform better than those with fewer.) As of Monday, third-party developers can tap into it via Apple’s Foundation Models framework.
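For developers, access looks roughly like the sketch below. This is a minimal, hedged example based on the Foundation Models API Apple has shown publicly; type names such as SystemLanguageModel and LanguageModelSession reflect that announcement, the summarization prompt is purely illustrative, and exact availability behavior may differ by device and OS release.

```swift
import Foundation
import FoundationModels

// Minimal sketch: ask the on-device model to summarize a piece of text.
// Availability depends on the device, OS version, and whether
// Apple Intelligence is enabled, so check before creating a session.
func summarize(_ text: String) async throws -> String {
    guard case .available = SystemLanguageModel.default.availability else {
        throw NSError(domain: "FoundationModelsUnavailable", code: 1)
    }

    // A session wraps a conversation with the on-device model;
    // instructions steer its behavior for subsequent prompts.
    let session = LanguageModelSession(
        instructions: "Summarize the user's text in two sentences."
    )

    let response = try await session.respond(to: text)
    return response.content
}
```

Because the model runs locally, calls like this don’t require a network connection, which is the main draw of the roughly 3-billion-parameter on-device model despite its middling benchmark showing.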

Apple says both Apple On-Device and Apple Server boast improved tool use and efficiency compared to their predecessors, and can understand around 15 languages. That’s thanks in part to an expanded training dataset that includes image data, PDFs, documents, manuscripts, infographics, tables, and charts.

Kyle Wiggers is TechCrunch’s AI Editor. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Manhattan with his partner, a music therapist.
