
In New Experiment, Young Children Destroy AI at Basic Tasks


A recent study found that human children absolutely crush AI tools in basic problem-solving and thinking tasks, with scientists determining that AI has one serious blind spot: innovation.

The study, conducted by researchers at the University of California, Berkeley and published in the journal Perspectives on Psychological Science, provides a fascinating glimpse into the ability — or lack thereof — of currently available large language model (LLM) programs to produce truly novel ideas. Trained on countless gigabytes of human-created data, AI tools are incredibly good at predictive, statistics-centered tasks. After all, AI isn’t human, but it’s tasked with imitating humans, and imitation inherently runs counter to truly fresh, imaginative, and inventive thinking.

But while this talent for copycatting and pattern-finding makes AI programs good at certain tasks, it apparently makes them really bad at others that are extremely rudimentary. Human children, on the other hand? Unlike the AIs, their innovative problem-solving skills were spot on — suggesting that there’s a bit more to human comprehension and reasoning than the mass-guzzling of internet-scraped data can replicate (at least for now, AI boosters might argue).

The study focused heavily on tool use, particularly tool innovation, as a means to test problem-solving skills in its subjects. Tool innovation, as the researchers put it, “can involve designing new tools from scratch, but it can also refer to discovering and using old tools in new ways to solve novel problems.”

In other words, the ability to innovate doesn’t just mean inventing a new object, like a lightbulb or an airplane. It also means being able to use existing tools in unconventional ways. And so, to test each group’s innovation abilities, the researchers presented several AI models — a list that included the likes of OpenAI’s GPT-4 and GPT-3.5 Turbo, Anthropic’s Claude, and Google’s FLAN-T5, among others — as well as several children, all aged three to seven, with a “series of problems in which a goal has to be executed in the absence of the typical tool.”

In one such test, for example, the researchers asked subjects to draw a circle — simple enough! But rather than provide a more conventional circle-drawing tool, like a compass or a stencil, participants were given the choice of a ruler, a teapot, or a stove to accomplish the task. The study found that 85 percent of the time, children chose correctly, opting for the teapot, whose round base could serve as a makeshift stencil.

The resource-chugging AIs, meanwhile, kept reaching for the ruler; the only program that came close to the kids’ success was GPT-4, which had a 76 percent success rate.

That the AIs continued to opt for the ruler is notable. To most humans, apparently even children, the ruler is obviously not helpful for this specific goal, given its famously straight edges. But again, AI is predictive, and in the provided options, a ruler is the only object conventionally used to draw some shapes. Teapots, meanwhile, are made for — well, tea. So, by the AI’s logic, the ruler makes the most sense.

“Discovering novel functions in everyday tools is not about finding the statistically nearest neighbor from lexical co-occurrence patterns,” the researchers write. “Rather, it is about appreciating the more abstract functional analogies and causal relationships between objects that do not necessarily belong to the same category or are associated in text.”

The AIs also fell flat in a test of the ability to infer novel causal structures, meaning the capability to discover new cause-and-effect relationships and use them to achieve a goal.

Per the study, the researchers introduced each group to a virtual “blicket detector,” a whimsical machine that lights up and plays music when certain objects, the “blickets,” are placed on it, but not when others are. Why a given object is or isn’t a blicket doesn’t necessarily make sense, but that’s kind of the point: the goal is to test whether an AI or a child can observe the system in action and infer its cause-and-effect structure through their own experimentation.

Once again, during these tests, the human kids excelled, with the researchers writing that “even 4-year-old children spontaneously acted on the systems and discovered their structure — they figured out which ones were blickets and used them to make the machine go.” Conversely, according to the study, the AIs “struggled to produce the relevant causal structures, even after massive amounts of training compared with children.”

There are some caveats to the study, most notably the fact that it’s difficult to measure and compare human cognition against that of AI when malleable concepts like “intelligence” don’t have widely agreed-upon definitions.

But the research is fascinating, and may well bolster the notion that though AI models have been largely designed in the human brain’s image — not to mention trained on countless human outputs — their reasoning processes remain inherently different from those of human beings. Similar in some ways, sure, but certainly not the same. And as we move forward in this nascent AI era, understanding those differences may prove essential for knowing where and how to use AI — and where it may not be so useful after all.

“Although we do not know the details of children’s learning algorithms or data,” wrote the researchers, “we do know that, unlike large language and language-and-vision models, children are curious, active, self-supervised, and intrinsically motivated.”

More on AI: GPT-5 Is Officially on the Way, OpenAI Says
