
Evaluating the GPT-5 Series on Custom Benchmarks


OpenAI’s GPT-5 models just dropped, and there’s a lot of buzz. The demos look exciting, and reported benchmark results show improvements in general intelligence, coding, and reasoning, along with fewer hallucinations. But do you need to upgrade? Should you use GPT-5, GPT-5-mini, or GPT-5-nano? And most importantly, how do you evaluate model performance in your own application?

With new models coming out all the time, these are questions we get frequently at HumanSignal. The details change, but the answer is always to build confidence in AI quality by testing on representative data. In this post, we’ll walk through the process of building a custom benchmark, outline our evaluation method, and share some early findings on the newest OpenAI models.

We hope you’ll find this example a helpful guide to building your own benchmarks! (See also our post on Why Benchmarks Matter.)

Building a custom benchmark

We started where every good benchmark starts: with real data. For this example, we tackled a receipts extraction use case, where our AI system processes images into structured data with store names, addresses, line items, and dates. We collected receipt images representing the messy reality a production system would face.

Then we selected a representative dataset of receipts spanning different:

  • Merchant types (restaurants, retail, grocery stores, international)
  • Image quality (pristine, hand-held, crumpled)
  • Complexity (simple purchases vs. lengthy lists)

Some sample tasks we'll be evaluating

Next, we created a template (in Label Studio) defining exactly how we wanted the results to be structured:

  • Receipt category (grocery, restaurant, retail, etc.)
  • Merchant address
  • Purchase date
  • Individual line items with descriptions and prices

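For reference, a template like this can also be set up programmatically. Below is a minimal sketch using the label-studio-sdk Python package's Client interface; the field names, choices, and URLs are illustrative, not our exact production configuration.

```python
# Minimal sketch: create a Label Studio project with a receipt-extraction template
# and import a few receipt tasks. Assumes the label-studio-sdk package; the field
# names, choices, and URLs below are illustrative.
from label_studio_sdk import Client

LABEL_CONFIG = """
<View>
  <Image name="receipt" value="$image"/>
  <Choices name="category" toName="receipt" choice="single">
    <Choice value="Grocery"/>
    <Choice value="Restaurant"/>
    <Choice value="Retail"/>
    <Choice value="Other"/>
  </Choices>
  <TextArea name="merchant_address" toName="receipt" placeholder="Merchant address"/>
  <TextArea name="purchase_date" toName="receipt" placeholder="Purchase date"/>
  <TextArea name="line_items" toName="receipt" placeholder="One line item per row: description, price"/>
</View>
"""

ls = Client(url="https://app.humansignal.com", api_key="YOUR_API_KEY")
project = ls.start_project(
    title="GPT-5 Receipt Extraction Benchmark",
    label_config=LABEL_CONFIG,
)
# Import tasks as a list of dicts whose keys match the template's value fields.
project.import_tasks([{"image": "https://example.com/receipts/receipt_001.jpg"}])
```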

We then annotated the data (using AI-assisted pre-labeling with human review) so we’d have a high-quality ground truth dataset to evaluate GPT-5 outputs against.

Since we now had definitive ground truth, we could use statistical agreement metrics to measure performance. No subjective scoring this time; objective accuracy measurements apply here!
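To make the idea concrete, here is a simplified sketch of a field-level agreement score between a prediction and a ground truth annotation. Label Studio computes its own agreement metrics for us; this toy version just averages exact-match accuracy over the extracted fields, and the field names are illustrative.

```python
# Simplified sketch of a field-level agreement score between a model prediction
# and a ground-truth annotation. Not the exact metric Label Studio computes;
# field names are illustrative.
FIELDS = ["category", "merchant_address", "purchase_date", "line_items"]

def field_agreement(prediction: dict, ground_truth: dict) -> float:
    """Average exact-match accuracy over the expected fields."""
    matches = sum(
        str(prediction.get(f, "")).strip().lower() == str(ground_truth.get(f, "")).strip().lower()
        for f in FIELDS
    )
    return matches / len(FIELDS)

# Toy example: one mismatched field out of four -> 0.75
pred = {"category": "Grocery", "merchant_address": "123 Main St", "purchase_date": "2025-08-07", "line_items": "milk 3.99"}
gt   = {"category": "Retail",  "merchant_address": "123 Main St", "purchase_date": "2025-08-07", "line_items": "milk 3.99"}
print(field_agreement(pred, gt))  # 0.75
```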

Example of a ground truth labeled task


(Want a more detailed breakdown of the different types of GT vs. LLM-judge benchmarks and how to use them? See our post: How to Build AI Benchmarks that Evolve with your Models.)

Our evaluation method

Equipped with our benchmark of high-quality labeled data, we then generated outputs from the GPT-5 models to compare against our ground truth. We could have sent each receipt image to the LLMs, then captured and compared their responses by hand, but we preferred a more systematic and scalable approach.

In this case, we used Label Studio Prompts to integrate LLMs into the evaluation workflow. With Prompts, we connected GPT-5 (you can also connect other LLMs, including custom ones), bulk-evaluated model responses against our ground truth labels, and tracked model performance across different versions. Since all LLM responses were populated into the Label Studio annotation interface, we could also review, annotate, and compare responses side by side.
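For context, hand-rolling those per-image extraction calls ourselves would have looked roughly like the sketch below, using the OpenAI Python SDK. The prompt wording and JSON fields are illustrative, and Prompts handles this orchestration (plus the bulk scoring and version tracking) for us.

```python
# Sketch of the per-receipt extraction call that Label Studio Prompts automates.
# Uses the OpenAI Python SDK; the prompt and JSON fields are illustrative.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "Extract the following fields from this receipt image and return JSON with keys: "
    "category (grocery | restaurant | retail | other), merchant_address, "
    "purchase_date, and line_items (a list of {description, price})."
)

def extract_receipt(image_url: str, model: str = "gpt-5-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # maximum determinism for extraction; drop if your endpoint rejects it
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": EXTRACTION_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```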

This is what our evaluation setup looked like:

  • Models: GPT-5, GPT-5-mini, GPT-5-nano with temperature 0 (maximum determinism for extraction tasks)
  • Prompt: Standard extraction instructions for the specific fields we’re looking for
  • Scoring: Custom agreement metric between prediction and ground truth (provided by Label Studio), plus cost and latency

For additional insight, we manually reviewed tasks (especially low-scoring ones) to understand failure points and spot patterns the metrics don’t capture.
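If you export per-task results (e.g. as CSV), surfacing the lowest-scoring tasks for manual review takes only a couple of lines of pandas; the file and column names below are hypothetical and depend on your export format.

```python
# Sketch: surface the lowest-scoring tasks for manual review.
# File and column names are hypothetical; adjust to your export format.
import pandas as pd

results = pd.read_csv("benchmark_results.csv")  # e.g. task_id, model, agreement, cost, latency
worst = results.sort_values("agreement").head(5)
print(worst[["task_id", "model", "agreement"]])
```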

Results from GPT-5-mini against ground truth labels


Results

We then broke down results into three groups: quantitative, qualitative, and baseline.

Quantitative:

On our benchmark of 20 tasks annotated with ground truth:

Model      | Accuracy (Avg) | # Tasks (>80% Accuracy) | Cost   | Total Inference Time
gpt-5      | 0.89           | 15                      | $0.548 | 1067 seconds
gpt-5-mini | 0.87           | 15                      | $0.058 | 227 seconds
gpt-5-nano | 0.77           | 7                       | $0.022 | 194 seconds

GPT-5 had the best overall accuracy, but GPT-5-mini performed nearly as well. Given almost 10x cost savings and roughly 5x faster inference in our application, we might choose to proceed with gpt-5-mini in this scenario (and spend more time refining our prompt).

Qualitative:

After investigating the lowest-scoring tasks and doing some failure analysis (via human-in-the-loop review), we observed the following:

  • Receipts that belong in the “Retail” category were sometimes classified as “Grocery,” especially when the line items were primarily food or consumables
  • The “Address” field expected a physical location, but the models also returned email addresses found on the receipt
  • Extraction accuracy dropped on receipts with more than ~10 line items

Task mislabeled as ‘Grocery’ category and missing line items in extraction


Based on those findings, we made some refinements to the prompt used with GPT-5-mini (which had the best balance of performance, cost, and speed on our tasks) to see if it’d improve task performance. Although we have a relatively small sample of tasks, we did see some improvement.
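The refinements were prompt-level instructions targeting the failure modes above. A sketch of the kind of guidance we added (illustrative wording, not our exact prompt):

```python
# Sketch of prompt refinements targeting the failure modes above
# (illustrative wording, not our exact production prompt).
PROMPT_REFINEMENTS = """
Additional instructions:
- Category: use "Retail" for general retailers even if most line items are food
  or consumables; reserve "Grocery" for grocery stores and supermarkets.
- merchant_address must be a physical street address; never return an email address.
- Extract every line item, even on long receipts; do not truncate the list.
"""
```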

Evaluation results with GPT-5-mini and an improved prompt


Baseline:

It was also interesting to compare the responses between models. We ran an average pairwise consensus score across the three models’ responses (using Label Studio’s built-in Agreement Score), and were able to see which tasks resulted in the most disagreement (and agreement) between models:

  • 1 task had <70% agreement
  • 5 tasks had 70-80% agreement
  • 9 tasks had 80-90% agreement
  • 5 tasks had >90% agreement
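Conceptually, the average pairwise consensus for a task looks like the sketch below. Label Studio’s built-in Agreement Score computes this for us; the agreement_fn here could be any per-pair metric (for example, the illustrative field_agreement sketch from earlier).

```python
# Sketch: average pairwise consensus across several models' outputs for one task.
# agreement_fn can be any per-pair metric, e.g. the field_agreement sketch above.
from itertools import combinations
from typing import Callable

def pairwise_consensus(outputs: list[dict], agreement_fn: Callable[[dict, dict], float]) -> float:
    """Average agreement over all unordered pairs of model outputs for one task."""
    pairs = list(combinations(outputs, 2))
    return sum(agreement_fn(a, b) for a, b in pairs) / len(pairs)

# Example usage (with outputs from gpt-5, gpt-5-mini, and gpt-5-nano for one task):
# score = pairwise_consensus([out_gpt5, out_mini, out_nano], field_agreement)
```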

Tasks with lower agreement tended to have the following traits:

  • Items that spanned more than one line on the receipt
  • Longer list of line items
  • Deductions in the list of line items, such as discounts or voided items
  • Receipts that were faded or creased in the image

Almost all tasks had >77% agreement score between the GPT-5 models


Task with the lowest (64%) agreement between all models (Side-by-side of GPT-5 vs. GPT-5-nano)


Task with the highest (99%) agreement between all models (Side-by-side of GPT-5 vs. GPT-5-nano)


Discussion

Now we’ve built more confidence in how the different GPT-5 models perform in our application, and identified some failure points to remediate. However, running this benchmark is just a starting point. When it comes to production AI systems, there are more facets to consider in a robust evaluation system. For example:

  • Open-ended, subjective tasks like evaluating chatbot conversations need a more nuanced approach than GT comparison, such as rubric-based scoring, LLM-judges, human evaluators, or a combination of these
  • Benchmarks are evolving systems, to be expanded (e.g. adversarial test cases) and refined (e.g. more granular metrics) as you learn more about your domain
  • Updating your AI system likely involves more than swapping the base model to GPT-5: how do you track and evaluate changes across prompts, model parameters, system integrations, and more?
  • Even for the same model, results may vary across benchmark runs, so look out for the tasks with the most variation, consistency in failure modes, and other patterns in performance
  • The most valuable insight into model improvements can be found in failure analysis. Although it’s reassuring to see test cases pass, the cases where models fail can be more useful because they reveal important gaps in the AI application. (We encountered and wrote about this recently too!)
  • As your test suite grows, so does your need for a scalable evaluation system that maintains the quality of insight you’d expect from human-in-the-loop review. This usually means a combination of more sophisticated automation (e.g. LLM-judges with RAG for context) and human-in-the-loop workflows

Comparing Against GPT-4o

We also compared the new GPT-5 model suite with the previous GPT-4o model. GPT-4o scored an accuracy of 0.88 on our benchmark, putting it right in line with the results we saw from the GPT-5 series: it performed better than the mini and nano models, but worse than the GPT-5 base model.

In yesterday’s launch, the folks at OpenAI noted that GPT-5 particularly shines in writing and coding tasks. An interesting update to the benchmark would be to ask each model to summarize the receipt rather than simply stating what is on it, to test reading comprehension. Be careful, though: as with all summarization tasks, this would also require an update to our scoring methodology, since there can be many right ways to summarize a single task.

Your turn

Whether you’re working with document extraction, content generation, or chatbot conversation evaluation, the principles are the same:

  1. Create a custom benchmark by
    1. Collecting real data from your domain
    2. Defining a scoring method (ground truth, custom criteria, etc.)
  2. Run your benchmark with algorithmic metrics, human review, LLM-judges, or a combination
  3. Analyze your quantitative, qualitative, and baseline results to make informed decisions

Ready to get started today?

Build your own benchmark and run it on Label Studio – we’ll provide the GPT-5 credits to get you started! Sign up in Starter Cloud.