May 19, 2024

Socially Awkward Robots

In the span of just two months, Anthropic, Google, Meta, and OpenAI took turns claiming the top spot in the leaderboards. But as we observe this fierce competition of capital, computation, and crunch time, we must ask ourselves: are we sprinting in the right direction?

The benchmark problem

New model releases are judged by their benchmark performances on reasoning, math, coding, and language understanding. The benchmarks usually take the form of standardized tests - sets of multiple choice or short answer problems. Models that fall short are relegated to obscurity, a manifestation of the "file-drawer problem" where only top-scoring models gain attention and bear offspring.1

This reward scheme nudges researchers and engineers to optimize for the most widely used metrics. And that is okay, if the goal of AI development is to create the best test takers in the world. Yet the marketing rhetoric positions commercially available AI agents as assistants, teachers, and companions.

I find this misalignment between the objective and the promise problematic. It may be a sign that the time is ripe to reconsider our benchmarks.

There's more to intelligence than test scores

The first lesson I learned in dementia research is that human cognitive function is multifaceted. A college admissions officer doesn't sort candidates solely by their SAT scores, nor does a hiring manager rely exclusively on IQ tests. We know that a person who scores 98 on a test is not inherently more intelligent than someone scoring 94.

We also know, from our own experiences, that test scores and social skills don't always go hand in hand. Cramming for a test doesn't necessarily enhance a person's empathy, judgment, or emotional intelligence. My dog scored a 0 on HumanEval but he consistently aces being a good buddy.

Focusing on the current set of narrowly defined benchmarks may lead us to overlook potentially groundbreaking innovations. Suppose a model has subpar performance on the graduate-level GPQA benchmark, but is proficient at explaining concepts in a manner best suited to the target audience. This model might be a better fit as a tutor than one that gets a perfect score on quantum mechanics and astrophysics tests.

The case for diversity in model development

Just as diversity is beneficial in teams and organizations, developing a diverse array of AI models is necessary for healthy progress. The current benchmarks push us to overfit on a very particular objective function, one that prioritizes test-taking prowess over social and emotional skills.2

Picture a model that can break down complex topics for different audiences, adapting examples and vocabulary on the fly. Or one that can read the emotional subtext in a conversation and respond with empathy. Such abilities are essential to our cognitive functions, but are not represented in the benchmarks for model evaluation.

If companies want us to believe their AI systems are the good friends they claim them to be, they need to back it up. Developing benchmarks for the traits that truly matter for these roles - navigating social dynamics, showing empathy - may be more complex than making a multiple choice test. But it's a challenge we should embrace, drawing on insights from neurology, psychiatry, psychology, sociology, anthropology, and more. The more our models advance, the more we can learn from looking inward at what makes us human.

The race for artificial intelligence is far from over, and it's time to reconsider the finish line. If we keep running in the current direction, we'll likely end up with socially awkward problem solvers. And as one of them, I know that there are enough of us in the field.


  1. Comparison is more lenient toward models that introduce new modalities or models focusing on efficiency.

  2. On a tangentially related note, one concerning trend I see is companies conflating a model's ability to mimic certain stylistic traits with genuine personality or cognitive depth. The recent announcement of GPT-4o prompted parallels to the film Her (2013). However, training a model to sound human-like doesn't necessarily imbue it with the capacity for friendship. It's just as likely for a killer robot to sound like Scarlett Johansson as it is for a benevolent robot to sound like HAL.