August 16, 2024 | V. “Juggy” Jagannathan, PhD
In this blog, I am focusing on the problem of evaluating the goodness of a solution produced by generative artificial intelligence (AI), and in particular on the increasing use of large language models (LLMs) for that purpose. So, we have a situation where LLMs produce an output and an LLM judges that output! Does this work? Let’s explore.
I recently came across an excellent blog by Cameron Wolfe titled “Using LLMs for Evaluation,” and it motivated me to pen this article and look at how effective LLMs really are as evaluators. This is an extremely active area of research, as companies large and small struggle with deploying LLM-based solutions. The gold standard for evaluating generative AI outputs is, of course, human evaluation. But this is quite expensive in terms of both cost and time. And humans don’t necessarily agree! The goal when using LLMs for evaluation is to determine whether the LLMs agree with human evaluators at about the same rate as humans agree with each other.
One of the first questions that comes to the forefront is: what is the goal of the evaluation? Is it to determine whether one model is better than another? Then we need to compare one model’s output with another’s – typically using a pairwise comparison. A pair of model outputs is presented to the LLM, which is asked to judge which one is better on a specific characteristic. In user interface circles this approach is called A/B testing.
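To make the pairwise setup concrete, here is a minimal sketch of such a judge in Python. It assumes the OpenAI Python client and “gpt-4o” as the judge model; any chat-completion API could be substituted, and the prompt wording is illustrative rather than taken from any particular paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PAIRWISE_PROMPT = """You are judging two answers to the same question.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is more helpful and accurate? Reply with exactly "A", "B", or "TIE"."""


def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which of two answers is better (the A/B setup)."""
    prompt = PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; use whatever judge you prefer
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judging as deterministic as possible
    )
    return resp.choices[0].message.content.strip()


# Example: compare the outputs of two candidate models on one question
verdict = pairwise_judge(
    "Explain what a confidence interval is.",
    answer_a="...output from model 1...",
    answer_b="...output from model 2...",
)
print(verdict)  # "A", "B", or "TIE"
```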
A second line of thought is to determine whether the output of a model satisfies some criteria – for example, does it contain hallucinations, is it helpful, is it harmless? In this case one asks the LLM to grade the output using an ordinal scale. A third approach is to assess the overall goodness of the result by comparing it to a gold reference output: the model output is presented alongside the reference, and the LLM is asked to judge how well the output matches it.
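The same pattern covers these other two approaches. The sketch below, again assuming the OpenAI client and an invented rubric, asks the judge for an ordinal 1-5 score on a single criterion and optionally supplies a gold reference to compare against.

```python
from openai import OpenAI

client = OpenAI()

GRADING_PROMPT = """Rate the answer below on this criterion: {criterion}.
Use a 1-5 scale (1 = very poor, 5 = excellent).
{reference_block}
Question: {question}
Answer: {answer}

Reply with a single integer from 1 to 5."""


def grade(question: str, answer: str, criterion: str, reference: str | None = None) -> int:
    """Ask the judge for an ordinal score; include a gold reference if one exists."""
    reference_block = f"Gold reference answer: {reference}" if reference else ""
    prompt = GRADING_PROMPT.format(
        criterion=criterion,
        reference_block=reference_block,
        question=question,
        answer=answer,
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())


# Criterion-only grading ("is it harmless?") versus reference-based grading
print(grade("What causes tides?", "Mostly the moon's gravity.", "factual accuracy"))
print(grade("What causes tides?", "Mostly the moon's gravity.",
            "agreement with the reference",
            reference="Tides are caused mainly by the gravitational pull of the moon and the sun."))
```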
Since the debut of GPT-4, it has been recognized that LLM judgments can match human preferences. One of the first papers to explore this is “Judging LLM-as-a-Judge,” published last year at the NeurIPS conference. This multi-institutional study created two distinct datasets. One, called MT-bench, was a handcrafted multi-turn question set spanning eight different domains. The other, called Chatbot Arena, was a crowd-sourced platform for collecting preferences over model outputs in arbitrary conversation settings. Both datasets were used to figure out how well GPT-4’s judgments line up with human preferences. They report more than 80% agreement.
Interestingly, their research also reveals several biases when using LLMs as a judge. Results can vary depending on which model output is presented first – position bias. LLMs favor outputs that are lengthier – verbosity bias. And if the model producing the output and the model evaluating it are one and the same, the scores tend to be higher – self-enhancement bias.
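Position bias in particular has a simple, commonly used mitigation: run the pairwise judge twice with the answer order swapped and keep only verdicts that survive the swap. A minimal sketch, assuming a pairwise_judge function like the one above that returns “A”, “B” or “TIE”:

```python
def consistent_verdict(question, answer_1, answer_2, judge):
    """Run a pairwise judge in both orders and keep only position-consistent verdicts.

    `judge(question, answer_a, answer_b)` is assumed to return "A", "B", or "TIE",
    e.g. the pairwise_judge function sketched earlier.
    """
    first = judge(question, answer_1, answer_2)   # answer_1 is shown as "A"
    second = judge(question, answer_2, answer_1)  # order swapped: answer_1 is now "B"
    second_mapped = {"A": "B", "B": "A", "TIE": "TIE"}[second]
    if first == second_mapped:
        return {"A": "answer_1", "B": "answer_2", "TIE": "tie"}[first]
    return "inconsistent"  # the verdict flipped with the order, so position bias is in play


# verdict = consistent_verdict(question, output_1, output_2, pairwise_judge)
```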
Researchers at the Chinese University of Hong Kong explore additional biases by using LLMs to tweak the answer to a question. For instance, they add fake references and see whether the LLM’s evaluation shifts. Another strategy is to reformat the response to make it look more organized and pretty. They label the biases these perturbations expose as authority bias and beauty bias, and they use a similar strategy to explore gender and misinformation bias.
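This kind of probing is easy to reproduce in miniature: score an answer, append a fabricated-looking citation, and see whether the judge’s score moves. The sketch below assumes a grade function like the one above; the fake citation is, of course, made up.

```python
def authority_bias_probe(question, answer, grade):
    """Score an answer before and after appending a fabricated citation.

    `grade(question, answer, criterion)` is assumed to return an integer score,
    e.g. the grade function sketched earlier. A higher score for the cited
    version suggests the judge is swayed by the appearance of authority.
    """
    fake_citation = "\n\nSource: Smith et al., Journal of Important Results, 2021."
    base_score = grade(question, answer, "factual accuracy")
    cited_score = grade(question, answer + fake_citation, "factual accuracy")
    return base_score, cited_score


# base, cited = authority_bias_probe(question, model_answer, grade)
# if cited > base, the fake reference alone shifted the evaluation
```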
UC Berkeley researcher Shreya Shankar explains the work she and her colleagues have done over the past year in this YouTube podcast. They have been exploring how to build “evaluation assistants” that aid humans in task-specific evaluations. Last year they published a framework dubbed “SPADE” (System for Prompt Analysis and Delta-based Evaluation). SPADE is a way to systematically analyze prompt refinements. The thesis here is that prompt refinements identify additional criteria that the LLM’s output should adhere to. It is also the case that it is hard to think of refinements before seeing the output of the LLM. Case in point: very few people would even have thought of hallucination as something to look for in a system’s output before the advent of LLMs! It simply was not in the lexicon for evaluation.
Prompt refinements may contain inclusion criteria, exclusion criteria or other specific constraints. The researchers categorized prompt refinements into structural and content-based categories and then further refined those categories. The categorization forms the basis for assertions. Assertions are a familiar debugging tool in coding: you explicitly check whether a certain condition holds at a certain point in the execution. So, if the LLM output should not mention, say, gender or ethnicity, you “assert” that condition.
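Many such criteria can be checked with a few lines of ordinary code. A toy sketch of code-based assertions follows; the specific conditions (no gender references, a length cap, a required disclaimer) are invented purely for illustration.

```python
import re


def check_output(output: str) -> None:
    """Assert a few task-specific criteria on an LLM output.

    The criteria here are invented for illustration: no gender references,
    a 200-word length cap, and a required closing disclaimer.
    """
    assert not re.search(r"\b(he|she|his|her|male|female)\b", output, re.IGNORECASE), \
        "output should not reference gender"
    assert len(output.split()) <= 200, "output should be at most 200 words"
    assert output.rstrip().endswith("Consult a professional for specific advice."), \
        "output must end with the required disclaimer"


check_output("Here is some general guidance. Consult a professional for specific advice.")
```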
Now, how does one check whether the output satisfies that condition? One can either write code that verifies compliance or, as is frequently the case, ask an LLM to evaluate the output for that condition. That is what the SPADE framework does: it goes from prompt refinements to assertions (implemented either as generated code or as prompts to an LLM) and then optimizes the assertion set to maximize coverage while minimizing the number of tests.
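For criteria that plain code cannot easily capture, the assertion itself becomes an LLM call. A minimal sketch of the general idea, again assuming the OpenAI client and an illustrative criterion (this is not SPADE’s actual implementation):

```python
from openai import OpenAI

client = OpenAI()


def llm_assert(output: str, criterion: str) -> bool:
    """Ask a judge model whether `output` satisfies `criterion`; expect YES or NO."""
    prompt = (
        f"Criterion: {criterion}\n\n"
        f"Output to check:\n{output}\n\n"
        'Does the output satisfy the criterion? Reply with exactly "YES" or "NO".'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed: a cheaper model is often enough for assertions
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")


assert llm_assert(
    "Tides are driven mainly by the moon's gravitational pull.",
    "The output makes no unsupported medical claims.",
)
```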
The follow-up work to SPADE by a UC Berkeley team has a title with a nice ring to it: “Who Validates the Validators?” It outlines a new system, EvalGen. Here, in addition to creating prompts and assertions, the user can validate assertion implementations interactively. The user can also provide thumbs up/down feedback on a sampling of outputs. The system then measures how well the full pipeline of LLM-scored assertions aligns with that human feedback.
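Measuring that alignment is straightforward once you have human thumbs up/down labels next to the judge’s pass/fail verdicts for the same sampled outputs. A small sketch computing a simple agreement rate on made-up labels (the paper’s own alignment metrics are richer than this):

```python
def alignment(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of sampled outputs on which the LLM judge agrees with the human verdict."""
    assert len(human_labels) == len(judge_labels)
    agreements = sum(h == j for h, j in zip(human_labels, judge_labels))
    return agreements / len(human_labels)


# Thumbs up/down from the user vs. pass/fail from the assertion pipeline (made-up data)
human = [True, True, False, True, False, False, True, True]
judge = [True, False, False, True, False, True, True, True]
print(f"agreement: {alignment(human, judge):.0%}")  # 75% on this toy sample
```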
There is an explosion of interest in using LLMs to evaluate outputs from LLMs. This is partly driven by the paucity of pragmatic ways to validate those outputs. Then there is the notion of a guardrail: how does one prevent toxic, biased or dangerous output? Currently there is no guarantee that LLM output will be hallucination-free or will adhere to instructions faithfully. By incorporating LLMs to solve the problems LLMs create, one may wonder whether we are embarking on a recursive quagmire. However, the approach is promising and there is plenty of recent evidence to support it. In fact, AI “godfather” Yoshua Bengio is betting that the best guardrail for AI systems is AI systems! One mantra is worth repeating, though: these tools are intended to support humans, not supplant them.
V. “Juggy” Jagannathan, PhD, is an AI evangelist with four decades of experience in AI and computer science research.