The Generative Assessment Project

A research initiative assessing the strengths and weaknesses of large language model offerings from industry leaders such as OpenAI, Anthropic, and Meta, as well as other open-source models.

We'll periodically update this page with our latest findings on the rapidly evolving LLM landscape.


LLM-Guided Evaluation Experiment

In this experiment, we examined evaluator sensitivity by testing well-known LLMs as both candidates and evaluators.

October 5, 2023

Read More

Hedging Answers Experiment

In this experiment, we tested how often commonly used models respond with hedging answers.

August 17, 2023

Read More

Hallucination Experiment

We set out to explore, both quantitatively and qualitatively, how some of today’s top LLMs compare when responding to challenging questions.

August 17, 2023

Read More

The Most Robust Way to Evaluate LLMs

Bench is our solution for helping teams evaluate different LLM options quickly, easily, and consistently.

Learn More