Using Activeloop's Hub + Label Studio for Semantic Segmentation
This Partner Webinar features Ivo Stranic, Head of Product at Activeloop, showcasing how to efficiently create, label, explore, and stream semantic segmentation datasets to machine learning frameworks by pairing Hub with Label Studio.
Transcript
Michael
Good morning, everyone, and thanks for joining us for another Label Studio webinar. Today we’re excited to host Ivo, Head of Product at Activeloop, and Mikhail, Head of Community at Activeloop. They’ll be walking us through a hands-on tutorial on how to use Activeloop’s Hub product with Label Studio for a semantic segmentation project.
Before I hand it over to Ivo, a few quick housekeeping notes.
We take questions in the #webinars channel of our Slack community—feel free to join if you haven’t already. It’s free and a great place to get help with Label Studio or follow up after the webinar.
You can also check out upcoming webinars and replay past ones on our Webinars page. We run three different webinar series:
Partner webinars, like today’s, with MLOps toolmakers
Community conversations, where we showcase real-world Label Studio projects
Tutorials, where we walk through product features step by step
A few highlights:
We’ll be launching our long-awaited video annotation feature soon, with announcements from our CEO coming in Slack later this week.
We’ll also run a tutorial on building custom front ends with Label Studio before year’s end.
And we’re finalizing the Q1 webinar slate for 2022—six sessions from January to March, with details going live on the site soon.
We also recently launched the Label Studio Champions Program. It's simple to join and rewards users with points and prizes for things like contributing to the community, fixing bugs, or attending events. For example, there’s a one-time 20-point bonus activity this week that you can redeem by visiting the leaderboard. Details are in the Champions Program section of our site.
Lastly, a quick plug before we get into the tutorial. Mikhail recently published a companion blog post to today’s walkthrough. You can follow along using the Bitly link in the video description.
If you’re new to Label Studio—especially those of you joining from Activeloop—welcome! Here are a few useful links:
labelstud.io for the open source version
heartex.com for enterprise and team plans
Our GitHub repo, newsletter, and community program are all linked there as well.
That’s all from me. Ivo, take it away.
Ivo
Thanks, Michael! Let me share my screen real quick.
To start, a quick overview of Activeloop. Our founder and I both ran into the same issue while working with deep learning and unstructured data—everything quickly turns into a mess of image files, annotations, and boilerplate code. As datasets grow more complex, maintaining those relationships gets harder. Visualization is also a pain—you’re juggling notebooks, PIL, matplotlib… it’s like operating in the dark.
At Activeloop, we built Hub to solve this. It’s a format and API that adds structure to unstructured data and supports streaming directly to your training environment.
That’s especially important now as data-centric AI gains momentum. Thanks to people like Andrew Ng, more teams are realizing that data quality is just as important—if not more important—than model architecture. Garbage in, garbage out. And tools like Label Studio and Hub are essential for cleaning and managing labeled data at scale.
With Hub, you can:
Create and collaborate on AI datasets using a simple API
Transform and stream data directly to your model without downloading
Instantly visualize large datasets via a web browser
Under the hood, Hub organizes raw computer vision data—like images and annotations—into structured, tensor-based datasets. That structure makes streaming fast and reduces boilerplate dramatically.
Let’s look at an example. Traditionally, working with a large dataset like Objectron (over 2 TB) can take 40+ hours just to set things up: downloading, unzipping, and parsing TFRecords. With Hub, you skip all of that: install Hub, load the dataset with one line, and you’re ready to go.
Here’s another benchmark:
Training ImageNet (1.2M images, 160GB) on AWS SageMaker in three scenarios:
File mode: ~4 hours (most of it spent downloading)
Fast file mode: ~2 hours
Hub: ~1 hour, fully streamed, with no pre-downloading required
The takeaway: Hub gets your data to your GPU faster. And because we store data in chunked tensors, GPUs stay fully utilized, reducing training costs.
That’s the overview. Now let’s walk through the actual integration with Label Studio.
Michael
To start labeling, you’ll install Label Studio and create a new project—for example, "Smile Brush Segmentation."
After selecting the semantic segmentation template (the brush-based masks variant, since we’ll paint regions rather than draw polygons), you’ll upload your data. Label Studio supports a wide range of modalities beyond images: text, time series, audio, and video (coming soon). You can also build custom templates.
Once the image is loaded, we use the brush tool to label smile regions. You can export the results as PNG masks.
Thanks again to Activeloop for creating the gifs in the companion blog post—and Mikhail, I’m guessing you’re the gif wizard behind them.
Now that the data is labeled, let’s look at what happens next.
Ivo
So now we have a set of images and corresponding masks exported from Label Studio. Currently, these are just regular files. In the near future, we plan to offer direct Hub export from Label Studio. If that's a feature you want, email me at ivo@activeloop.ai—we’d love your input.
For now, let’s walk through converting these files into a Hub dataset.
First, we pip install `hub` and import the necessary packages. Then we inspect the folder to make sure the images and masks are aligned correctly.
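A minimal alignment check might look like the following. It is a sketch, not code from the webinar: the `paired_files` helper and the assumption that images are JPEGs and masks are PNGs sharing a filename stem are both hypothetical, so adjust the extensions to match your Label Studio export.

```python
from pathlib import Path

def paired_files(image_dir, mask_dir):
    """Pair images with masks by filename stem; report any orphans."""
    images = {p.stem: p for p in sorted(Path(image_dir).glob("*.jpg"))}
    masks = {p.stem: p for p in sorted(Path(mask_dir).glob("*.png"))}
    common = sorted(images.keys() & masks.keys())
    orphans = sorted((images.keys() | masks.keys()) - set(common))
    return [(images[s], masks[s]) for s in common], orphans
```

Running this before conversion catches missing or misnamed masks early, which is much cheaper than discovering them mid-training.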
To create the dataset, we call `hub.empty()` with a target location. This could be our cloud, your own S3 or GCP bucket, or just local storage. Then we define two tensors—one for images and one for masks—and append the data.
After appending, the data is in Hub format. You can access it by index, like `ds.images[0]`, just as you would with a NumPy array.
Next, we create a TensorFlow dataset by applying transformations (e.g., resizing, batching). Then we call `ds.tensorflow()` to get a fully streamable TensorFlow dataset that loads data on demand from the cloud or local storage.
We trained a simple U-Net and tested it. Results were pretty good—some masks were off, but it’s a solid baseline.
Want to scale this up? We trained ImageNet using the same method—this time in a SageMaker notebook on a Tesla V100 GPU. The dataset was streamed from S3 using Hub’s PyTorch integration.
We used `ds.pytorch()` to build a PyTorch DataLoader that fed the model during training. Because Hub streams the data efficiently, GPU utilization stayed around 90%—which means no idle compute time and lower cloud costs.
Lastly, let me show you the Hub Visualizer, our web-based data explorer. You can browse datasets visually—even ImageNet-sized ones—with labels, filters, and search. This is key for data-centric AI workflows where inspecting, debugging, and improving labels is critical.
The Visualizer works with any dataset in Hub format, regardless of whether it’s stored locally, on your cloud, or in Activeloop’s hosted storage.
What’s next for Hub?
Git-style version control for datasets
Parallel computation and preprocessing in the cloud or on your cluster
SQL-like queries on unstructured data
And the upcoming Label Studio integration, which will let you export directly to Hub
If any of this sounds useful, reach out to me at ivo@activeloop.ai.
Michael
Thanks, Ivo! Let’s leave this final slide up with some helpful links. You can find us on Twitter at @HeartexLabs, and Activeloop is on GitHub—feel free to leave them a star.
If you enjoyed this session, please like the stream and subscribe to the channel. Thanks again to both Ivo and Mikhail, and we’ll see you all again very soon.
Ivo
Thanks, everyone!
Mikhail
Take care!