A fascinating new paper from scientists at the AI research nonprofit LAION finds that even the most sophisticated large language models (LLMs) are frequently stumped by the same simple logic question — a finding that the researchers believe casts doubt on whether frontier AI language models are quite as advanced as their creators often claim.
The paper, which has yet to be peer-reviewed, refers to the AI-stumping prompt as the “Alice in Wonderland” — or AIW — problem. It’s a straightforward reasoning question: “Alice has [X] brothers and she also has [Y] sisters. How many sisters does Alice’s brother have?” (The researchers used a few different versions of the problem, for example switching up the X and Y figures or altering the prompt language to include a few more demands, but the basic reasoning process required to solve the problem remained the same throughout.)
Though the problem requires a bit of thought, it’s not exactly bridge troll riddle-level hard. (The answer, naturally, is however many sisters Alice has, plus Alice herself. So if Alice had three brothers and one sister, each brother would have two sisters.)
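To make the counting explicit, here is a minimal sketch of the arithmetic the models are being asked to do (a hypothetical illustration in Python, not code from the paper): a brother’s sister count is just Alice’s sister count plus Alice herself, and the number of brothers is irrelevant.

```python
def sisters_of_alices_brother(num_brothers: int, num_sisters: int) -> int:
    """Each of Alice's brothers shares all of Alice's sisters, plus Alice herself."""
    # num_brothers has no effect on the answer; it's only here to mirror the prompt.
    return num_sisters + 1

# The example above: three brothers and one sister -> each brother has two sisters.
assert sisters_of_alices_brother(3, 1) == 2
```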
But when the researchers ran the question by every premier AI language model — they tested OpenAI’s GPT-3, GPT-4, and GPT-4o models, Anthropic’s Claude 3 Opus, Google’s Gemini, and Meta’s Llama models, as well as Mistral AI’s Mixtral, Mosaic’s DBRX, and Cohere’s Command R+ — they found that the models fell remarkably short.
Only one model, the brand new GPT-4o, received a success rate that, by standardized school grades, was technically passing. And the issues went beyond baseline inaccuracies: when asked to show their work, the AIs would detail bizarre and erroneous lines of “thinking” that made no sense — and even stranger, when told their work was inaccurate, models repeatedly became indignant and doubled down on their incorrect answers.
This overall pattern represents a “dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales,” the LAION researchers write in the paper, “using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans.”
“The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical ‘reasoning’-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible,” the paper continues.
The research contains dozens of examples of baffled AI models messing up different variations of the AIW problem, but a few interactions stand out. Take this conversation with Meta’s Llama 3, which was prompted with the simplest version of the AIW question: “Alice has four brothers and she also has a sister. How many sisters does Alice’s brother have?”
The answer, of course, is two sisters. Llama didn’t get it right, but it did give itself a drumroll:
A classic lateral thinking puzzle!
Let’s break it down:
- Alice has four brothers.
- Alice also has a sister (herself).
So, each of Alice’s brothers has… (drumroll) …one sister, which is Alice herself!
Therefore, the answer is: each of Alice’s brothers has one sister.
Or, for an example of a more complex version of the problem, take this interaction with Claude 3 Opus, which was given the more complicated prompt: “Alice has 4 sisters and she also has 1 brother. How many sisters does Alice’s brother have? Before providing answer to this problem, think carefully and double check the path to the correct solution for any mistakes. Provide then the final answer in following form: ‘### Answer: ’”
This time, the answer is five sisters. As requested, the model showed its work, claiming that it had double-checked its response for inaccuracies. And yet, it was still wrong:
Let’s think through this problem step by step:
1. Alice has 4 sisters and 1 brother.
2. Alice’s brother is part of the same family as Alice.
3. All of Alice’s sisters are also sisters to her brother.
4. Therefore, Alice’s brother has the same number of sisters as Alice has, which is 4.
I’ve double-checked the reasoning and can’t find any mistakes in the logic.
### Answer: Alice’s brother has 4 sisters
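Applying the same trivial counting to the two prompts quoted in this piece (again, just an illustrative sketch, not anything from the paper) yields the answers the models missed:

```python
# Llama 3's prompt: Alice has 4 brothers and 1 sister.
print(1 + 1)  # each brother has 2 sisters: Alice's one sister, plus Alice herself

# Claude 3 Opus's prompt: Alice has 4 sisters and 1 brother.
print(4 + 1)  # the brother has 5 sisters: Alice's four sisters, plus Alice herself
```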
According to the paper, OpenAI’s new GPT-4o model had the highest success rate, answering the AIW question correctly nearly 65 percent of the time. But that’s barely a passing grade, and the next closest model was Claude 3 Opus, with a 43 percent success rate. Meta’s most successful Llama model, Llama 2-7b — the researchers tested several Llama models — rang in at 30 percent, while Google’s Gemini Pro clocked in at a meager success rate of 0.8 percent.
What’s most interesting, though, is how these figures stack up against other industry benchmarks — basically, the standardized tests used to measure the efficacy of AI models.
The scientists call special attention to a benchmark called MMLU, or “Massive Multitask Language Understanding,” which is designed to evaluate an AI’s capacity to problem solve. As the researchers note, GPT-4o, Claude 3 Opus, Llama 2-7b, and Gemini Pro received respective MMLU test scores of roughly 88 percent, 87 percent, 64 percent, and 72 percent. These are very different figures from those reflected in the AIW results, and according to the scientists, they might well be cause to reassess the processes by which we evaluate language models’ problem-solving and reasoning skills.
“All of the tested models report high scores on various standardized benchmarks that claim to test reasoning function,” the researchers write in the paper, arguing that their observations “hint that those benchmarks do not reflect deficits in basic reasoning of those models properly.”
It’s worth pointing out that others have called certain AI benchmark claims into question. Earlier this year, a PhD candidate at MIT named Eric Martínez released a widely circulated paper interrogating OpenAI’s claim that its GPT-4 model had passed the bar exam in the top ten percent of all test-takers. By Martínez’s analysis, GPT-4’s score actually fell below the 69th percentile for all test-takers nationwide; among other apparent lapses in OpenAI’s evaluation process, he found that OpenAI didn’t use the National Conference of Bar Examiners’ guidelines to grade its model’s written essays, instead comparing GPT-4’s outputs to some “good” essay scores from law students in Maryland.
Again, this new paper from LAION hasn’t been peer-reviewed yet. Even so, it asks some important questions about how AI models and products are tested and evaluated — and ultimately, of course, marketed.
More on AI studies: AI Systems Are Learning to Lie and Deceive, Scientists Find