
Bootstrapping Data Labels with GPT-4

Introduction

Data labeling is a critical component of machine learning projects: annotated datasets are required for both training and evaluation. The process, however, can be time-consuming and expensive, especially for large-scale projects with many data points. What if we could leverage the power of advanced language models like GPT-4 to reduce the cost and effort involved in data labeling tasks?

GPT-4, or Generative Pre-trained Transformer 4, is a state-of-the-art language model developed by OpenAI. Its remarkable ability to understand and generate human-like text has opened up many applications in natural language processing (NLP) and beyond. In this blog post, we'll explore how you can use GPT-4 to bootstrap labels for various tasks, significantly reducing the time and cost involved in the labeling process. We'll focus on sentiment classification as a primary example and demonstrate how prompt engineering can enable you to create accurate and reliable labels using GPT-4.

GPT-4 as a Prediction Engine to Pre-label Data

As in writing, editing is often less strenuous than composing the original work. That’s why starting with pre-labeled data is more attractive than starting with a blank slate. Using GPT-4 as a prediction engine to pre-label data builds on its ability to understand context and generate human-like text. By leveraging these capabilities, we can reduce the manual effort required for data labeling, which cuts costs and makes the labeling process far less tedious.

So how do we do this? If you’ve used GPT models, you’re probably familiar with prompts. Prompts set the context for the model before it begins generating output and can be tweaked and engineered to help the model deliver highly specific results. This means we can create prompts that GPT-4 can use to generate text that looks like model predictions. For our use case, we will craft our prompts in a way that guides the model toward producing the desired output format as well.

Let’s take a straightforward example of sentiment analysis. If we are trying to classify the sentiment of a given string of text as "positive," "negative," or "neutral," we could provide a prompt like:

"Classify the sentiment of the following text as 'positive', 'negative', or 'neutral': <input_text>"

Once we have a well-structured prompt, we can use the OpenAI API to generate predictions. Here's an example using Python (shown with gpt-3.5-turbo; you can substitute a GPT-4 model name if you have API access):

import openai
import re

openai.api_key = "<your_api_key>"

def get_sentiment(input_text):
  # Ask the model to classify the sentiment and return it as JSON.
  prompt = f"Respond in the json format: {{'response': sentiment_classification}}\nText: {input_text}\nSentiment (positive, neutral, negative):"
  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
      "role": "user",
      "content": prompt
    }],
    max_tokens=40,
    n=1,
    stop=None,
    temperature=0.5,
  )
  response_text = response.choices[0].message['content'].strip()
  # Extract the sentiment keyword; fall back to "neutral" if the model
  # responds in an unexpected format.
  match = re.search("negative|neutral|positive", response_text)
  sentiment = match.group(0) if match else "neutral"
  # Add input_text back in for the result
  return {"text": input_text, "response": sentiment}

We can run this with a single example to inspect the output we’re receiving from the API.

# Test single example
sample_text = "I had a terrible time at the party last night!"
sentiment = get_sentiment(sample_text)
print("Result\n",f"{sentiment}")
Result
{
  'text': 'I had a terrible time at the party last night!',
  'response': 'negative'
}

Once we’re satisfied with our prompt and the results we’re getting, we can scale this up to our entire dataset. Here, we’ll assume a text file with one example per line.

import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

with open(input_file_path, "r") as input_file, open(output_file_path, "w") as output_file:
  examples = []
  for line in input_file:
    text = line.strip()
    if text:
      # Classify each line and wrap the result in Label Studio's task format.
      examples.append(convert_ls_format(get_sentiment(text)))
  output_file.write(json.dumps(examples))
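
The loop above calls convert_ls_format, which is defined in the full example notebook linked below. As a rough sketch of what such a helper might look like, assuming the default Sentiment Analysis template shown later (a <Text> tag named my_text reading $reviewText and a <Choices> tag named sentiment):

def convert_ls_format(result):
  # Wrap a sentiment result as a Label Studio task with a pre-annotation.
  # The data key and tag names below are assumptions based on the default
  # template; adjust them to match your labeling configuration.
  return {
    "data": {"reviewText": result["text"]},
    "predictions": [{
      "result": [{
        "from_name": "sentiment",
        "to_name": "my_text",
        "type": "choices",
        "value": {"choices": [result["response"].capitalize()]},
      }]
    }],
  }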

We can import the data with pre-labeled predictions into Label Studio and have reviewers verify or correct the labels. This approach significantly reduces the manual work required for data labeling, as human reviewers only need to validate or correct the model-generated labels rather than annotate the entire dataset from scratch. See our full example notebook here.

Note: In most situations, OpenAI is allowed to use any information sent to its APIs to further train its models. So it’s important not to send protected or private data to these APIs for labeling if we don’t want to expose that information more broadly.

Reviewing Pre-labeled Data in Label Studio

Once we have our pre-labeled data ready, we will import it into a data labeling tool, such as Label Studio, for review. This section will guide you through setting up a Label Studio project, importing the pre-labeled data, and reviewing the annotations.

Reviewing Sentiment Classification in Label Studio

Step 1: Install and Launch Label Studio

First, you need to have Label Studio installed on your machine. You can install it using pip:

pip install label-studio

After installing Label Studio, launch it by running the following command:

label-studio

This will open Label Studio in your default web browser.

Step 2: Create a New Project

Click on "Create Project" and enter a project name, such as "Review Bootstrapped Labels.” Next, you need to define the labeling configuration. For Sentiment Analysis, we can use the text Sentiment Analysis Text Classification.

These templates are configurable, so if we want to change any of the properties, it’s really straightforward. The default labeling configuration is shown below.

<View>
  <Header value="Choose text sentiment:"/>
  <Text name="my_text" value="$reviewText"/>
  <Choices name="sentiment" toName="my_text" choice="single" showInline="true">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>

Click "Create" to finish setting up the project.

Step 3: Import Pre-labeled Data

To import the pre-labeled data, click the "Import" button, choose JSON as the format, and select the pre-labeled data file generated earlier (for example, output_responses.json). The data will be imported along with the pre-populated predictions.

Step 4: Review and Update Labels

After importing the data, you can review the model-generated labels. The annotation interface will display the pre-labeled sentiment for each text sample, and reviewers can either accept or correct the suggested label.

You can improve quality further by having multiple annotators review each example.

By utilizing GPT-4-generated labels as a starting point, the review process becomes much more efficient, and reviewers can focus on validating or correcting the annotations rather than creating them from scratch.

Step 5: Export Labeled Data

Once the review process is complete, you can export the labeled data by clicking the "Export" button in the "Data Manager" tab. Choose the desired output format (e.g., JSON, CSV, or TSV), and save the labeled dataset for further use in your machine learning project.

How much does it cost?

One question rolling around in my mind was: “How much did this cost me at the end of the day?”

For language models, OpenAI charges based on the number of tokens in your request. Tokens roughly correspond to words, though special characters and emojis can each count as a separate token. OpenAI’s pricing page states, “You can think of tokens as pieces of words, where 1,000 tokens is about 750 words.” For more information on how tokens are counted, see this page.
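
If you want to count tokens exactly rather than estimate, OpenAI’s tiktoken library can tokenize text with the encoding a given model uses. A minimal sketch (tiktoken is an extra dependency, not something the workflow above requires):

import tiktoken  # pip install tiktoken

# Use the same encoding as the model we plan to call.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "I had a terrible time at the party last night!"
print(len(encoding.encode(text)))  # number of tokens in this example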

The cost per token differs according to the model used.

For example, the GPT-4 8K-context model costs $0.03/1K tokens for prompt tokens and $0.06/1K tokens for completion (generated) tokens, while the GPT-3.5-turbo model costs $0.002/1K tokens for both.

Model           Prompt      Completion   Context   1K Prompt + 1K Response Pair
--------------------------------------------------------------------------------
gpt-4-8k        $0.03/1K    $0.06/1K     8K        $0.09
gpt-4-32k       $0.06/1K    $0.12/1K     32K       $0.18
gpt-3.5-turbo   $0.002/1K   $0.002/1K    4K        $0.004

Summary of token prices for OpenAI. Source: OpenAI forum.

To estimate the cost of pre-labeling a dataset, we can use a simple formula that considers the number of examples in the dataset, the price per token for prompts and completions, and the average number of tokens per example.

C_t = E * (C_p * (T_p + T_s) + C_c * T_r)

Where:

C_t: Total cost of pre-labeling the dataset.

E: Number of examples in the dataset.

C_p: Cost per token for the prompt.

C_c: Cost per token for the completion.

T_p: Number of tokens in the prompt.

T_s: Average number of tokens per example.

T_r: Number of tokens in the result.

Additionally, we can calculate the total number of tokens in the dataset as follows:

T_t = E * T_s

Using this formula, we can estimate the total cost by multiplying the number of examples by the per-example cost: the prompt-token price applied to the prompt plus the example text, plus the completion-token price applied to the generated result.

For instance, suppose we have a dataset of 1,000 examples that we want to pre-label for sentiment analysis with GPT-4, with a prompt price of $0.03 per 1K tokens, a completion price of $0.06 per 1K tokens, a prompt length of 20 tokens, an average example length of 80 tokens, and an average result length of 3 tokens. The total cost of pre-labeling would be:

C_t = 1000 * (0.03/1000 * (20 + 80) + 0.06/1000 * 3)

= 1000 * (0.03/1000 * 100 + 0.06/1000 * 3)

= 1000 * (0.003 + 0.00018)

= 1000 * 0.00318

= $3.18

In this example, pre-labeling the dataset using GPT-4 would cost $3.18. Note: The same dataset with GPT-3.5-turbo would cost ~$0.21.
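
This calculation is easy to script. Here’s a small helper that mirrors the formula above (the function and argument names are ours, chosen for readability):

def estimate_cost(examples, prompt_price, completion_price,
                  prompt_tokens, avg_example_tokens, result_tokens):
  # C_t = E * (C_p * (T_p + T_s) + C_c * T_r), with prices given per token.
  return examples * (prompt_price * (prompt_tokens + avg_example_tokens)
                     + completion_price * result_tokens)

# GPT-4 (8K context): $0.03/1K prompt, $0.06/1K completion
print(estimate_cost(1000, 0.03 / 1000, 0.06 / 1000, 20, 80, 3))    # 3.18
# GPT-3.5-turbo: $0.002/1K for both prompt and completion
print(estimate_cost(1000, 0.002 / 1000, 0.002 / 1000, 20, 80, 3))  # ~0.21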

If our pre-labeling task requires less specialized knowledge, we may want to use a less robust model to save cost. It’s usually worth manually reviewing a handful of examples with varying levels of complexity to get a sense of how accurate one model is compared to another. For information on the models, see the OpenAI Pricing page.

Beyond Sentiment Analysis: Label Any NLP Task

Labeling a Named Entity Recognition (NER) example in Label Studio

The exciting thing about this approach is that it’s not limited to sentiment analysis. We can pre-label data for various NLP tasks using GPT-4 by using prompt engineering and guiding it to produce the correct output. Here are a few examples of prompts for different NLP tasks, all of which can be reviewed and labeled in Label Studio. Remember that more complex tasks may require longer prompts, which will count towards your token count and subsequent cost.

Summarization

Prompt

Respond in the json format with a summary for the following text: {'summary': summary}

Text: Artificial intelligence (AI) is a rapidly developing technology that has the potential to change the way we live and work. Unlike traditional software, AI systems can learn from data and make decisions without being explicitly programmed for every situation, which could help solve some of the world's most pressing problems. At the same time, there are concerns about the risks of AI and the ethical considerations that come with it, such as bias in algorithms and the displacement of jobs. As the technology advances, it is important to carefully consider AI's impact and to ensure that it is developed and used in a responsible and beneficial way.

Summary:

Response

{'summary': 'Artificial intelligence is a rapidly developing technology that can learn from data and make decisions without being explicitly programmed. It has the potential to help solve some of the world’s most pressing problems, but there are concerns about risks and ethical considerations such as bias and job displacement. It is important to consider AI’s impact and ensure responsible and beneficial use.'}

Taxonomy

Prompt

Respond in the JSON format with a diagnosis for the following text: {'diagnosis': diagnosis}, where the possible diagnoses are Pneumonia, Acute bronchitis, Asthma exacerbation, Congestive heart failure, Chronic obstructive pulmonary disease (COPD), Pulmonary embolism, Pleural effusion, Interstitial lung disease, or Lung cancer.

Text: The patient presented with shortness of breath and a persistent cough. On physical examination, crackles were heard in the lower lung fields. Chest X-ray revealed bilateral infiltrates consistent with pneumonia. The patient was started on antibiotics and showed improvement in symptoms within 48 hours.

Diagnosis:

Response

{'diagnosis': 'Pneumonia'}

Named Entity Recognition (NER)

Prompt

You are an expert NER labeler. Only return JSON. Do not return explanations. Return the CoNLL format for the following sentence in JSON format.

Harry Potter was a student at Hogwarts

Response

{
  "data": [
    {
      "word": "Harry",
      "entity": "B-PER"
    },
    {
      "word": "Potter",
      "entity": "I-PER"
    },
    {
      "word": "was",
      "entity": "O"
    },
    {
      "word": "a",
      "entity": "O"
    },
    {
      "word": "student",
      "entity": "O"
    },
    {
      "word": "at",
      "entity": "O"
    },
    {
      "word": "Hogwarts",
      "entity": "B-LOC"
    }
  ]
}
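
To review a response like this in Label Studio, the word/entity pairs need to be turned into character spans. A minimal sketch of that conversion, assuming a labeling configuration with a <Labels> tag named "label" applied to a <Text> tag named "text" (the tag names and the one-span-per-token behavior are our simplifications, not part of the original notebook):

def conll_to_ls_task(sentence, tokens):
  # Convert word/entity pairs (like the JSON above) into a Label Studio task
  # with character-span predictions. Adjacent B-/I- tokens stay as separate
  # spans here for simplicity; merge them if you want one span per entity.
  results, cursor = [], 0
  for item in tokens:
    word, tag = item["word"], item["entity"]
    start = sentence.index(word, cursor)  # locate this token in the sentence
    end = start + len(word)
    cursor = end
    if tag != "O":
      results.append({
        "from_name": "label",
        "to_name": "text",
        "type": "labels",
        "value": {"start": start, "end": end, "text": word,
                  "labels": [tag.split("-", 1)[-1]]},  # strip the B-/I- prefix
      })
  return {"data": {"text": sentence}, "predictions": [{"result": results}]}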

There’s no substitute for human evaluation, but as we can see, there are endless possibilities for using the GPT models to give us a starting point. It’s usually much easier to edit labels than to generate them from scratch.

Conclusion

By leveraging GPT-4 for pre-labeling and Label Studio for reviewing and correcting the annotations, you can significantly streamline the data labeling process, saving both time and cost. We showed an example of pre-labeling sentiment analysis data, but the same approach applies to many more tasks, and the results can be passed to Label Studio, where skilled annotators can refine and curate the predictions to produce high-quality datasets ready for model training.
