OpenAI's GPT-5 Growth Challenges and the Complexity of Evaluating AI Agents

OpenAI's GPT-5 has seen mixed growth, with user engagement rising but falling short of expectations, while evaluating AI agents remains complex due to the need for standardized tests and the variability in their performance.

1

OpenAI's growth trajectory for GPT-5 has faced significant challenges despite user engagement increases.

OpenAI anticipated reaching 1 billion weekly active users (WAUs) by the end of 2025, but as of early 2026, it only reached 910 million. The initial release of GPT-5 was met with criticism for not significantly improving capabilities over GPT-4, leading to a slowdown in growth. However, subsequent updates, GPT-5.1 and GPT-5.2, have improved user experience and performance, prompting expectations of reaching 2.75 billion WAUs by 2030.

1.1

User engagement has fluctuated post-GPT-5 release.

Despite reaching 910 million WAUs, OpenAI's growth has not met its ambitious targets, indicating a need for better user retention strategies.

1.2

Updates have improved user experience.

The updates to GPT-5 have made the model more personable and effective in workplace tasks, suggesting that continuous improvement is crucial for user satisfaction.

1.3

Future growth relies on broader integration.

OpenAI's plans to integrate ChatGPT into devices and expand geographically through affordable subscriptions are critical for achieving its long-term user engagement goals.
2

Evaluating AI agents presents unique challenges compared to traditional chatbots.

The complexity of AI agents, which perform a variety of tasks beyond simple question answering, complicates their evaluation. Unlike chatbots, agents require standardized tests that can accurately assess their performance across different environments and tasks. For instance, Anthropic's Claude Opus 4.6 scored highly on coding tasks, but the evaluation was marred by uncertainties due to the saturation of the benchmark, highlighting the need for more diverse testing methods.

2.1

Standardized tests for agents are difficult to create.

Creating effective tests for AI agents involves simulating environments, which can lead to issues like 'reward hacking' if not designed properly.

2.2

Performance variability complicates evaluations.

Differences in scaffolds used to develop agents can lead to inconsistent evaluation results, making it hard to compare agents fairly.

2.3

Ongoing maintenance is crucial for accurate assessments.

As software updates occur, both environments and scaffolds can degrade, necessitating continuous evaluation efforts that are less critical for static chatbot tests.
3

The future of AI agents hinges on their adaptability and user interaction.

As AI agents become more integrated into daily tasks, their effectiveness will depend on how well they can adapt to user needs and environments. For example, Arena's Code Arena tests agents on coding tasks, allowing users to interact and rate performance, which could lead to more practical assessments of their capabilities. This user-centric approach is essential for understanding how agents can enhance productivity in various applications.

3.1

User interaction will shape agent evaluations.

Allowing users to rate agent performance in real-world tasks, like generating websites, provides valuable insights into their practical utility.

3.2

Real-world applications will drive agent development.

As agents are deployed in diverse settings, their ability to perform tasks like managing Slack communications will reveal their strengths and weaknesses.

3.3

Future evaluations may involve complex problem-solving.

As demonstrated by Anastasios Angelopoulos's work on statistical proofs, evaluating agents on complex tasks will be crucial for assessing their true capabilities.