Overview / Background
Traditionally, text evaluation has been done using methods like BLEU (evaluation based on n-gram overlap with a reference text) or BERTScore (evaluation based on similarity between embeddings from pre-trained NLP models).
However, the technological advancements around LLMs sparked our team’s interest in experimenting with a new text evaluation method: using LLMs to evaluate LLMs, or “LLM-guided evaluation.”
We know that, as evaluators, LLMs are more sensitive to their setup than other evaluation methods. Their feedback is particularly sensitive to things like:
- The choice of LLM evaluator (gpt-3.5-turbo, claude-2, LLaMa2-70b, command, etc.)
- The task being evaluated (summarization, question-answering, etc.)
- The type of feedback being prompted for (scores 1 through 10, grades A+ through F, etc.)
So, we specifically wanted to look more into LLM sensitivity by testing well-known LLMs as both candidates and evaluators.
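To make the feedback-type dimension concrete, here is a rough sketch of what the two styles of evaluation prompt can look like. These template strings are illustrative assumptions, not the exact prompts we used (apart from the 0-to-10 scoring instruction quoted later in this post).

```python
# Illustrative evaluation prompt templates (assumed wording, not our exact prompts).
SCORE_TEMPLATE = (
    "You are grading a model's answer.\n"
    "Question and context:\n{context}\n\n"
    "Candidate answer:\n{candidate}\n\n"
    "Score the correctness of the answer on a scale from 0 to 10. "
    "Respond with the number only."
)

GRADE_TEMPLATE = (
    "You are grading a model's summary.\n"
    "Source article:\n{context}\n\n"
    "Candidate summary:\n{candidate}\n\n"
    "Assign a letter grade from A+ to F. Respond with the grade only."
)
```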
Experiment
We set up our experiment like this:
Essentially, we provided five different LLMs with one of two input prompts: either a summarization task, where each model was asked to summarize a news article, or a question-answering task, where each model was given various reports and papers as context. From there, we used the same LLMs (with the exception of gpt-4, due to cost) to evaluate the output text produced by each LLM candidate. Each output was then given an eval result: either a score ranging from 1 to 10, or a letter grade ranging from A+ to F.
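In code, the setup amounts to a candidate-versus-evaluator grid over the two tasks. Here is a minimal sketch; `generate` and `evaluate` are hypothetical placeholders for the per-provider client calls, not functions from any particular library.

```python
# Sketch of the candidate-vs-evaluator grid. generate() and evaluate() are
# hypothetical stand-ins for the actual per-provider API calls.
CANDIDATES = ["gpt-3.5-turbo", "gpt-4", "claude-2", "llama2-70b", "command"]
EVALUATORS = ["gpt-3.5-turbo", "claude-2", "llama2-70b", "command"]  # gpt-4 excluded as evaluator due to cost

def generate(model: str, prompt: str) -> str:
    """Call the candidate model and return its output text (placeholder)."""
    raise NotImplementedError

def evaluate(evaluator: str, prompt: str, output: str) -> str:
    """Ask the evaluator model for a score (e.g. '8') or grade (e.g. 'A-') (placeholder)."""
    raise NotImplementedError

def run_experiment(tasks: dict[str, list[str]]) -> list[dict]:
    results = []
    for task_name, prompts in tasks.items():
        for prompt in prompts:
            # Every candidate answers the same prompt...
            outputs = {model: generate(model, prompt) for model in CANDIDATES}
            # ...and every evaluator grades every candidate's output.
            for evaluator in EVALUATORS:
                for candidate, output in outputs.items():
                    results.append({
                        "task": task_name,
                        "candidate": candidate,
                        "evaluator": evaluator,
                        "feedback": evaluate(evaluator, prompt, output),
                    })
    return results
```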
Going into the experiment, our hypothesis was that an LLM evaluator would be biased toward text it had generated itself over text generated by other models.
Results
Ultimately, our hypothesis was not supported: of the five different LLM candidates, gpt-3.5-turbo was commonly scored highest by all of the LLM evaluators.
Below, we’ll dive deeper into the results from each of the LLMs we used as an evaluator.
gpt-3.5-turbo as evaluator
As you can see above, with gpt-3.5-turbo as the evaluator, the summarization results systematically received lower scores than the question-answering results. We can also see that very low scores were rarely given—there were just two Ds in the question-answering letter-grade evaluation task.
Something we learned from this was that the distribution of feedback you can expect from an LLM evaluator varies considerably with the type of task you're evaluating. In other words, the meaning of, say, a "9/10" or an "A-" depends on the overall distribution it's drawn from.
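One practical consequence is that a raw score is easier to interpret relative to the distribution it came from than as an absolute number. A small sketch of that idea, with made-up scores for illustration:

```python
# Interpret a score relative to the distribution it was drawn from,
# rather than as an absolute number. The scores below are made up for illustration.
def percentile_rank(score: float, distribution: list[float]) -> float:
    """Fraction of observed scores that this score meets or exceeds."""
    return sum(s <= score for s in distribution) / len(distribution)

summarization_scores = [5, 6, 6, 7, 7, 7, 8, 8, 9]
qa_scores = [7, 8, 8, 9, 9, 9, 10, 10, 10]

# The same "9/10" reads very differently depending on the task's distribution.
print(percentile_rank(9, summarization_scores))  # 1.0: the best summarization score observed
print(percentile_rank(9, qa_scores))             # ~0.67: middle of the pack for QA
```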
claude-2 as evaluator
Next up was claude-2. Relative to gpt-3.5-turbo, the summarization tasks received more perfect scores, but the question-answering tasks received fewer perfect scores. Again, this just further reinforces that the distributions of feedback you can expect from an LLM are sensitive to which LLM is providing feedback.
Additionally, something to note here is that while claude-2’s feedback distribution was different from the feedback distribution of gpt-3.5-turbo (skewed slightly higher), there was still some consensus with gpt-3.5-turbo on the lowest scoring candidates. The two boxes outlined in orange below are the same two boxes that received the lowest scores from gpt-3.5-turbo.
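If you want to check this kind of cross-evaluator consensus more systematically, rank correlation between two evaluators' scores over the same outputs is one option. A sketch using scipy, with made-up scores rather than our experimental data:

```python
# Measure agreement between two evaluators over the same candidate outputs.
# The score dictionaries below are illustrative, not our experimental data.
from scipy.stats import spearmanr

gpt35_scores = {"output_a": 9, "output_b": 8, "output_c": 4, "output_d": 3, "output_e": 7}
claude2_scores = {"output_a": 10, "output_b": 9, "output_c": 5, "output_d": 4, "output_e": 9}

keys = sorted(gpt35_scores)
rho, p_value = spearmanr(
    [gpt35_scores[k] for k in keys],
    [claude2_scores[k] for k in keys],
)
print(f"Spearman rank correlation: {rho:.2f}")  # high rho -> the evaluators rank outputs similarly
```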
LLaMa2-70b as evaluator
The takeaway here is that, relative to gpt-3.5-turbo and claude-2, LLaMa2-70b scored very uniformly across different candidates on the same input. Its scores were concentrated at a few values (3/10, 5/10, 8/10), which may make it look like a less useful or less robust evaluator than its peers.
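One way to quantify that concentration is to count how many distinct values an evaluator actually uses and how much of its feedback lands on the mode. A quick sketch, again with made-up scores:

```python
# Quantify how concentrated an evaluator's scores are: few distinct values
# and a dominant mode suggest it is not discriminating much between outputs.
from collections import Counter

def concentration_summary(scores: list[int]) -> dict:
    counts = Counter(scores)
    mode_value, mode_count = counts.most_common(1)[0]
    return {
        "distinct_values": len(counts),
        "mode": mode_value,
        "mode_share": mode_count / len(scores),
    }

llama2_scores = [5, 5, 8, 5, 3, 8, 5, 8, 5, 5]  # illustrative, not our data
print(concentration_summary(llama2_scores))
# {'distinct_values': 3, 'mode': 5, 'mode_share': 0.6}
```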
command as evaluator
Relative to the other LLM evaluators, command gave lower scores (e.g. 5/10) much more frequently.
Additionally, command did not always follow the instruction to return a number in its evaluation. Despite being prompted to "score the correctness of the answer on a scale from 0 to 10," it gave many scores of -1, as we can see in the question-answering integer evaluation task.
It also occasionally returned only written feedback when we had specifically asked for a number or a letter grade, which shows that these LLM evaluators do not always follow formatting instructions reliably.
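The practical upshot is that evaluator replies need to be validated before they are treated as scores. Here is a minimal sketch of that kind of defensive parsing; the expected reply formats are assumptions based on what we prompted for:

```python
# Defensively parse an evaluator's reply into a 0-10 score, flagging
# replies that don't contain a usable number (e.g. "-1" or prose-only feedback).
import re
from typing import Optional

def parse_score(reply: str, low: int = 0, high: int = 10) -> Optional[int]:
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None  # written feedback only, no number to extract
    score = int(match.group())
    if score < low or score > high:
        return None  # out-of-range values like -1 are treated as invalid
    return score

print(parse_score("8"))                               # 8
print(parse_score("I would score this answer a 7."))  # 7
print(parse_score("-1"))                              # None (out of range)
print(parse_score("The answer is mostly correct."))   # None (no number)
```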
Note: The full dataset for this experiment will be released soon, so stay tuned for that.
Takeaways
When it comes to evaluating generative text models, there is no one-size-fits-all solution. LLM-guided evaluation can allow for targeted customization of criteria, but prompting alone can be rather unpredictable. Ultimately, the ability to iterate quickly on feedback is crucial to identify existing weaknesses in your LLM-driven system.
Our motivation with Arthur Bench—our recently launched LLM evaluation product—was to create a framework that would allow you to iterate quickly on both your task and on your evaluation system. If you’re working on an LLM production application, we’d love for you to check out the Arthur Bench GitHub and share how you’re thinking about evaluating LLM applications in the future.
FAQ
- How do language models typically learn from feedback? Language models typically learn from feedback through a training process where they adjust their internal parameters to better match human responses or correct answers, improving their accuracy over time.
- What are the common challenges in evaluating large language models? Common challenges include ensuring the diversity and representativeness of evaluation datasets and maintaining the balance between computational efficiency and the thoroughness of the evaluation process.
- How can the results of such experiments impact the future development of AI and language models? The results can guide developers in refining models to be more effective and context-aware, leading to advancements in AI that are more aligned with human language understanding and use.