
How to Monitor Models in Production with Label Studio

In this webinar, Alec Harris and Michaela Kaplan walk through how to set up a production monitoring workflow using Label Studio. You'll learn how to pull real model outputs from logs, evaluate them with human-in-the-loop review, and decide when to retrain based on quality thresholds.

Transcript

Nate:
Okay, cool. Well, we are officially two minutes past the hour, so I’m going to go ahead and kick this off. Folks can join as they’re able. Welcome, everybody, to today’s webinar—How to Monitor Models in Production with Label Studio. It’s a really important topic that addresses a serious issue many teams face, and it offers an interesting way to use Label Studio to ensure your model continues to perform well in production. I’m really looking forward to this session.

Before we begin, a couple of quick announcements. First, this webinar is being recorded. We’ll email a copy afterward, along with resources and links mentioned today. Second, we’ll have time at the end for Q&A. You can drop questions in the Q&A widget in the Zoom toolbar—or in chat, since we’re monitoring both. Please send in your questions—we’d love to answer them. That’s it for announcements.

Today’s speakers are Alec Harris, one of our expert product managers at HumanSignal, and Michaela Kaplan, our ML evangelist. They both have deep experience with this process and methodology. With that, I’ll turn it over to Alec to kick things off.

Alec:
Thanks, Nate. I’m Alec, product manager here at HumanSignal. I’ll set the stage for about ten minutes and then hand things over to Michaela for a live demo.

Here’s the scenario: you and your team have invested countless hours building your model. You’ve done exploratory data analysis and labeling, picked an architecture, evaluated performance, and deployed it to production. Maybe you even threw a little celebration—the results were strong and the feedback was promising. But here’s the challenge: production rarely matches your lab environment. Inputs shift over time. Vocabulary evolves. World events change user behavior. And adversarial contexts—like spam—create a constant battle.

This problem isn’t new, but it becomes amplified in the GenAI era. Now model outputs vary too. LLMs are inherently non-deterministic. We often increase temperature for creativity, but that invites inconsistency. We still can’t fully eliminate hallucinations. If you’ve grounded your LLM with a RAG architecture, your knowledge base is still mutable—your documents, codebase, internal tools: all of it evolves. So even internally, our chatbot sometimes behaves differently week to week.

The real question is: how do you maintain confidence in your model as things change? Many teams rely on user feedback—but that’s reactive and risky. Others conduct manual “vibe checks” by pulling random inferences into spreadsheets—unreliable, labor-intensive, and hard to scale.

Advanced approaches like drift detection, LLM-as-a-judge, and LLM evaluation tools like RAGAS are powerful—but not trivial to implement, and typically require expert oversight or incur high cost. At the core, what all of this demonstrates is that you can’t fully replace the human in the loop.

Today we propose a systematic, repeatable, flexible way to monitor production models using Label Studio. It centers human judgment while allowing scale and automation. This approach follows a simple loop: deploy your model, sample production logs, upload to Label Studio for evaluation, measure against thresholds, and if performance drops, use the labeled data to fine-tune or retrain. That way, you catch issues before users do and maintain confidence through constant cycles of evaluation and learning.
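
A minimal skeleton of that loop, to make its shape concrete. Every function below is a hypothetical placeholder rather than part of the package shown in the demo, and the 90% threshold is an assumed example value.

```python
# Skeleton of the monitoring loop described above. All helpers are hypothetical
# placeholders; the demo package implements these steps in its own scripts.

QUALITY_THRESHOLD = 0.90  # assumed example: 90% of predictions kept by reviewers


def sample_production_logs() -> list[dict]:
    """Pull a sample of recent model inputs and outputs from log storage."""
    return []  # placeholder


def review_in_label_studio(samples: list[dict]) -> list[dict]:
    """Upload samples as tasks with pre-annotations; return human-reviewed results."""
    return samples  # placeholder


def agreement_rate(reviewed: list[dict]) -> float:
    """Fraction of tasks where reviewers kept the model's prediction unchanged."""
    return 1.0  # placeholder


def run_monitoring_cycle() -> None:
    samples = sample_production_logs()
    reviewed = review_in_label_studio(samples)
    rate = agreement_rate(reviewed)
    if rate < QUALITY_THRESHOLD:
        print(f"Agreement {rate:.0%} is below threshold; retrain on the reviewed data.")
    else:
        print(f"Agreement {rate:.0%}; keep observing.")


if __name__ == "__main__":
    run_monitoring_cycle()
```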

With that, I’ll hand it over to Michaela to walk through the demo.

Michaela:
Thanks, Alec. Before I begin, I see a hand raised—please drop your question in the Q&A tab or chat so we can get to it. I’ll go ahead and share my screen.

First, head over to the Label Studio Examples GitHub repository and clone it. We’ll be working in the model_monitoring folder. The package is structured to be modular and customizable. You’ll want to start with the config.ini file—it includes your Label Studio credentials, project settings, sampling parameters, and log storage credentials (e.g., S3 or GCP). You can also configure optional email alerts. Next is scrape_logs.py, where you define how to parse your production logs into Label Studio tasks. For the demo, I used synthetic logs generated via ChatGPT, but you'll update this logic to match your own log format.
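
As a rough illustration of the scraping step (not the repository's actual code), here is a sketch that assumes JSON-lines logs with hypothetical input_text, predicted_label, and confidence fields, plus an assumed [sampling] section in config.ini:

```python
# Illustrative sketch only: the real scrape_logs.py and config.ini define their
# own fields. Assumes JSON-lines logs with hypothetical "input_text",
# "predicted_label", and "confidence" keys; adapt the parsing to your format.
import configparser
import json

config = configparser.ConfigParser()
config.read("config.ini")
min_confidence = 0.0
if config.has_section("sampling"):  # assumed section and key names
    min_confidence = config.getfloat("sampling", "min_confidence", fallback=0.0)


def scrape_logs(path: str) -> list[dict]:
    """Turn raw production log lines into Label Studio task dicts."""
    tasks = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["confidence"] < min_confidence:
                continue  # sample only the records you want humans to review
            tasks.append({
                "data": {
                    "text": record["input_text"],
                    "model_label": record["predicted_label"],
                    "confidence": record["confidence"],
                }
            })
    return tasks
```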

Once you configure scraping and sampling logic—by date, confidence score, or other metrics—you get a list of tasks to upload. That then feeds into the main script monitor_project_with_label_studio.py. It connects to Label Studio via the API, clones or creates a new monitoring project, uploads the tasks, and then loads the model outputs as pre-annotations. We generate those pre-annotations using helper functions in utils.py, which build the correct prediction payload for your labeling schema.
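
To give a sense of what that upload step can look like with the Label Studio Python SDK, here is a hedged sketch: the URL, API key, and the simple sentiment labeling config are placeholders, and the demo script handles project creation and payload building through its own helpers.

```python
# Hedged sketch of uploading tasks with model outputs as pre-annotations via the
# Label Studio Python SDK. URL, API key, and labeling config are placeholders.
from label_studio_sdk import Client

LABEL_CONFIG = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
  </Choices>
</View>
"""

ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
project = ls.start_project(title="Production monitoring", label_config=LABEL_CONFIG)

# Each task carries the raw input in "data" and the model output as a prediction,
# so reviewers see it pre-filled and only need to correct mistakes.
tasks = [
    {
        "data": {"text": "The checkout flow keeps timing out."},
        "predictions": [
            {
                "model_version": "prod-2024-06",
                "result": [
                    {
                        "from_name": "sentiment",
                        "to_name": "text",
                        "type": "choices",
                        "value": {"choices": ["Negative"]},
                    }
                ],
            }
        ],
    }
]
project.import_tasks(tasks)
```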

Once the project is live, users review and edit model predictions in Label Studio. We then export the annotated results as JSON and run an evaluation script (evaluate.py) to compute how often predictions changed compared to model output. You check whether edits exceeded your quality threshold—if so, it flags performance issues; if not, you continue observation. And that's the core flow: scraping logs, loading into Label Studio for human review, exporting, evaluating, and retraining as needed. The whole process supports automation via cron jobs, Airflow, or other orchestrators.
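
As a rough sketch of the kind of check the evaluation step performs, here is what comparing annotations against predictions in a JSON export could look like for a single-choice labeling config; evaluate.py implements its own logic and thresholds, and the 90% cutoff below is an assumed example.

```python
# Hedged sketch of the evaluation step: from a Label Studio JSON export, count
# how often human reviewers changed the model's pre-annotation. Field access
# assumes a simple single-choice labeling config.
import json

QUALITY_THRESHOLD = 0.90  # assumed: flag the model if under 90% of predictions survive review


def choice_of(result_list: list[dict]) -> str | None:
    """Pull the selected choice out of a prediction or annotation result list."""
    for item in result_list:
        if item.get("type") == "choices":
            return item["value"]["choices"][0]
    return None


def agreement_rate(export_path: str) -> float:
    with open(export_path) as f:
        tasks = json.load(f)
    kept = 0
    total = 0
    for task in tasks:
        if not task.get("annotations") or not task.get("predictions"):
            continue
        predicted = choice_of(task["predictions"][0]["result"])
        reviewed = choice_of(task["annotations"][0]["result"])
        total += 1
        if predicted == reviewed:
            kept += 1
    return kept / total if total else 1.0


if __name__ == "__main__":
    rate = agreement_rate("export.json")
    status = "OK" if rate >= QUALITY_THRESHOLD else "below threshold: consider retraining"
    print(f"Reviewer agreement: {rate:.0%} ({status})")
```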

Nate:
Thanks, Michaela and Alec—that was super helpful. Now we have time for Q&A. Please drop your questions in chat or Q&A.

Q: How do you help engineering teams assemble evaluation sets and run evaluations without overwhelming them?

Michaela:
This package automates the sampling and monitoring steps, eliminating manual overhead. Humans still review samples, but you can delegate that to subject-matter experts or non-technical stakeholders. It’s flexible, and you can dial down sampling volume as you refine your cadence.

Alec:
Exactly. You don’t need to label thousands every week—start smaller and use real data to define a sustainable volume.

Q: How can an LLM be leveraged to QA annotation work?

Michaela:
You can use Label Studio’s ML backend to connect your own models—or LLMs—to pre-annotate tasks. For enterprise users, our “Prompts” feature lets human reviewers QA model-generated labels through prompt-enabled annotation workflows.
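
For reference, here is a minimal sketch of what an ML backend that pre-annotates tasks can look like, based on the label-studio-ml LabelStudioMLBase interface; exact signatures and return types vary by version, and the keyword heuristic below stands in for a real model or LLM call.

```python
# Hedged sketch of a pre-annotating ML backend using the classic label-studio-ml
# interface. Treat the signature and return shape as illustrative; the keyword
# check is a placeholder for an actual model or LLM call.
from label_studio_ml.model import LabelStudioMLBase


class LLMJudgeBackend(LabelStudioMLBase):
    def predict(self, tasks, **kwargs):
        predictions = []
        for task in tasks:
            text = task["data"].get("text", "")
            # Placeholder heuristic; swap in your model or LLM scoring here.
            label = "Positive" if "great" in text.lower() else "Negative"
            predictions.append({
                "model_version": "llm-judge-v0",
                "result": [{
                    "from_name": "sentiment",
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [label]},
                }],
            })
        return predictions
```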

Q: How do you identify workflow stages where human supervision is essential? And how do you know when automation is appropriate?

Michaela:
If a task is difficult or ambiguous for a human, it’s likely difficult for a model too. Disagreement among annotators or confusion in labeling guidelines are signals that you should keep humans in the loop. If model performance on a held-out test set is solid, you can consider automation—but always validate before scaling.
