The Information: Where LLMs Are Falling Short

I’m back from the International Conference on Machine Learning in Vancouver, one of the biggest annual meetups for artificial intelligence researchers. And this year, the conference underscored all the ways in which large language models are still falling short of everyone’s expectations, despite the immense progress that got us to this point.

Researchers even called into question some of the most promising techniques that have gained popularity over the last year, such as chain-of-thought reasoning, or asking the models to describe how they arrived at an answer—their “thoughts,” so to speak.

For instance, one research paper presented at the event explained how chain-of-thought can actually hurt model performance on certain tasks due to “overthinking.” In one example, models were shown strings of letters that followed some rule that the model didn’t know. Then the models were shown another string of letters and asked whether it followed the unknown rule or not.

Humans typically perform better on this task when they’re told to go with their gut feeling. The models showed a similar pattern: they performed worse when asked to explain their reasoning. Models are great at finding patterns, but when there are so many possibilities for what the unknown rule or pattern might be, they tend to overthink and end up at the wrong answer, the paper argued.

Several papers at ICML pointed out that today’s large language models are rooted in text and that much more work is needed to get AI to understand and generate images, video and audio, a capability known as multimodality.

One paper from Google DeepMind explained how models that generate audio can only generate tens of seconds of speech at a time before devolving into gibberish because of memory limitations.

The Google DeepMind researchers came up with a new type of model that’s able to generate nearly 20 minutes of speech in one go. Still, it’s crazy that even with all the advances in LLMs, speech AI is so far behind, especially given the number of startups that want to build AI personal assistants to converse with us.

Another presentation showed how multimodal LLMs fall short when it comes to answering questions where they need to understand images, like diagrams in physics questions.

For instance, in a question that asks the model to predict what a sheet of paper might look like after folding and cutting it in certain ways, the LLM tries to describe (in text form) how it would solve the problem and gets confused. Humans typically arrive at the right answer more easily by visually imagining folding and cutting the paper, according to the research paper.

In another sign of how AI research has gotten more practical and commercial in recent years, ICML also featured lots of papers and presentations about benchmarks, or ways to measure the performance of LLMs on real-world tasks.

These include SWE-Lancer, which measures models’ ability to code using real freelance software engineering tasks sourced from Upwork; WindowsAgentArena, which tests AI agents’ ability to navigate the Windows operating system and apps like Excel and PowerPoint; and LOB-Bench, which measures how well AI models can generate realistic limit order book data, a type of financial market data. Together, these benchmarks show how AI research has gotten more practical and less theoretical in recent years.