Partner Webinars

Customized Layout Detection for Scientific PDFs with LayoutParser and Label Studio

End users commonly encounter unique layouts that require data annotation and fine-tuning a customized model to reach the accuracy they need. In this webinar, we show how to build a model for scientific PDF extraction, using Label Studio to create the training dataset and LayoutParser to detect the layout and extract structured information.

Transcript

We’re live! Welcome, everyone. This is a webinar featuring the Layout Parser core team—Shannon and Ben—and we’re focusing on building customized layout parsing models for scientific paper PDFs. Shannon and Ben, can you introduce yourselves?

Shannon
I’m currently a research scientist at the Allen Institute for AI. I work on the Semantic Scholar team, which is a platform designed to organize scientific literature and make it easier to search and read papers.

Ben
I’m a fourth-year PhD candidate in computer science and engineering at the University of Washington. Shannon and I met about a year ago, and our research interests overlap a lot—especially in the intersection of machine learning and cultural heritage. I work with digitized newspapers and historical documents, where layout analysis is foundational. That led me to participate in Layout Parser projects and collaborate with Shannon to improve access to cultural materials.

Michael
How did the project actually start?

Shannon
I was working on historical document digitization with Professor Melissa Dell. The documents had complex layouts, and existing OCR tools couldn’t handle them well. We started designing a layout-based parsing layer on top of OCR. That evolved into the first version of Layout Parser. Later, Ben and others joined to help build what’s now the current version.

Michael
How old is the project?

Shannon
Almost three years—we started around mid-2019.

Ben
Shannon is being modest. He’s the real project lead behind Layout Parser. I’ve just been lucky to contribute to it.

Michael
Ben, how did you choose your PhD focus?

Ben
My work centers on human-AI interaction, especially exploratory search systems. One project, Newspaper Navigator with the Library of Congress, aimed to rethink how people search over visual content in historic newspapers. I liked how this space connects interactive machine learning with collaboration across disciplines like the humanities and social sciences. Working on layout analysis with Shannon fits right in.

Michael
Is this like the old-school microfilm readers we see in movies?

Ben
Exactly. You start with raw microfilm, digitize it, and it becomes part of large databases like Chronicling America. That’s where our work on layout analysis helps make sense of the visuals.

Michael
Let’s dive into the demo.

Ben
First, I’ll give a five-minute intro, then turn it over to Shannon, who’ll walk through the demo. We'll leave at least 10 minutes for Q&A at the end.

The central question we’re tackling is: How do we convert complex documents—like digitized newspapers or scientific PDFs—into structured, usable data? It’s not enough to extract raw text. That approach loses essential information about the visual structure.

For example, scientific papers rely on layout to convey meaning—titles, abstracts, sections, and citations are all communicated through their position and formatting. Layout Parser is a deep learning–based library that detects these visual regions. It’s pip-installable and provides APIs to load, fine-tune, and apply models.

With Label Studio, you can annotate your training data—highlighting layout regions like bibliographies or tables—then use that to fine-tune a Layout Parser model. Once trained, you can apply that model across your dataset.

Now I’ll hand it over to Shannon to walk through the step-by-step demo.

Shannon
We’re going to walk through customizing Layout Parser models using Label Studio, with scientific papers as the use case.

First, we define the problem: we want to detect individual bibliography items at the end of a paper. On the left, you’ll see the raw PDF. On the right, what we want—each reference block highlighted and ready for export into CSV or JSON.

We’ll start by testing existing models from the Layout Parser model zoo. These include community-contributed models for various document types. We chose one trained on scientific papers. When we run it on a sample PDF, it performs okay, but it’s not perfect—it misses some boxes or merges regions incorrectly. That’s why we train our own model.
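
In code, that first test is only a few lines. Here is a minimal sketch using Layout Parser's Python API; the model zoo path and label map follow the PubLayNet entry in the docs and are meant as an illustration, not the exact configuration from the demo.

```python
import cv2
import layoutparser as lp

# Load a page image exported from the PDF (path is illustrative).
image = cv2.imread("paper_page.png")
image = image[..., ::-1]  # OpenCV loads BGR; Layout Parser expects RGB

# Load a model-zoo model trained on scientific papers (PubLayNet).
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

# Detect layout regions on the page and inspect the results.
layout = model.detect(image)
for block in layout:
    print(block.type, block.coordinates, block.score)
```

Running this on a few sample pages is usually enough to see whether a pretrained model is good enough or whether fine-tuning is worth the effort.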

Next, we create a training dataset. You can use your own documents or download open-access PDFs. Once you have data, install Label Studio locally. It has great instructions on GitHub. We start the server, create a project—say, “Bib Item Annotation”—and upload our images.

Label Studio’s interface is flexible. You can customize how annotation works, like moving the image to one side and adding labels on the other. Then, you just draw rectangles around each bibliography item. We pre-annotated a few examples to save time.

Once annotated, you export the dataset. We choose COCO format because it works well for object detection. The export gives you a JSON file with bounding boxes and a folder of corresponding images.

We inspect the exported data using utility scripts to confirm everything looks right. Then we move on to training.
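A quick way to do that sanity check is to read the COCO JSON directly; the file name below is a placeholder for whatever Label Studio exported.

```python
import json
from collections import Counter

# COCO-format export from Label Studio (placeholder file name).
with open("result.json") as f:
    coco = json.load(f)

print(len(coco["images"]), "images,", len(coco["annotations"]), "annotations")

# Count how many bibliography boxes were drawn on each page.
boxes_per_image = Counter(ann["image_id"] for ann in coco["annotations"])
for img in coco["images"]:
    print(img["file_name"], boxes_per_image.get(img["id"], 0), "boxes")
```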

Layout Parser has a dedicated training repo. Clone it locally and move your annotation data into the data folder. Then, use a provided utility script to split the dataset into training and testing files.
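If you prefer not to use the provided script, a plain random split over the COCO file is easy to write by hand; the sketch below is an illustrative alternative, not the repo's own utility.

```python
import json
import random

with open("result.json") as f:
    coco = json.load(f)

# Shuffle the pages and hold out 20% for testing.
random.seed(42)
images = coco["images"][:]
random.shuffle(images)
cut = int(0.8 * len(images))

def subset(imgs):
    ids = {img["id"] for img in imgs}
    return {
        "images": imgs,
        "annotations": [a for a in coco["annotations"] if a["image_id"] in ids],
        "categories": coco["categories"],
    }

for name, imgs in [("train.json", images[:cut]), ("test.json", images[cut:])]:
    with open(name, "w") as f:
        json.dump(subset(imgs), f)
```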

We use the training script—modified slightly for our bibliography use case—and specify the paths to our config and data files. On a GPU, training might take 30 minutes to an hour, depending on dataset size.
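Under the hood, the training script wraps Detectron2, so what it does amounts roughly to registering the COCO splits and fine-tuning a Faster R-CNN. The sketch below is a simplified stand-in for the repo's script, with placeholder paths and hyperparameters.

```python
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the COCO-format splits produced earlier (paths are placeholders).
register_coco_instances("bib_train", {}, "data/train.json", "data/images")
register_coco_instances("bib_test", {}, "data/test.json", "data/images")

# Start from a COCO-pretrained Faster R-CNN and fine-tune it on the bib data.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("bib_train",)
cfg.DATASETS.TEST = ("bib_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # a single "bib item" class
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 2000
cfg.OUTPUT_DIR = "outputs/bib_model"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
with open(os.path.join(cfg.OUTPUT_DIR, "config.yaml"), "w") as f:
    f.write(cfg.dump())  # saved so Layout Parser can reload the model later

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```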

Once trained, we load the model and test it on a sample page. Compared to the original pretrained model, the fine-tuned model performs better—it detects cleaner regions and misses fewer boxes.
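Loading the fine-tuned weights back into Layout Parser looks roughly like this; the paths and the single-class label map belong to this illustration.

```python
import cv2
import layoutparser as lp

# Point Layout Parser at the config and weights written out during training
# (paths are illustrative).
bib_model = lp.Detectron2LayoutModel(
    config_path="outputs/bib_model/config.yaml",
    model_path="outputs/bib_model/model_final.pth",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.7],
    label_map={0: "BibItem"},
)

image = cv2.imread("sample_bib_page.png")[..., ::-1]
layout = bib_model.detect(image)
print(f"Detected {len(layout)} bibliography items")
```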

Then we use the model to extract text. The process is:

Run layout detection on pages with bibliography sections

Use bounding boxes to extract PDF tokens inside those regions

Merge the tokens to reconstruct the full text for each bibliography item

We can load the results into a pandas dataframe, which gives us a structured list of bib entries. One issue we noticed is ordering. Sometimes the references are out of order.

To fix that, we use the token index associated with each region to reorder the entries logically. It’s a simple but effective trick, and it makes the output much more usable.
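Putting the three steps and the reordering trick together, the extraction loop can be sketched as below. It assumes the fine-tuned model from the previous step, uses Layout Parser's load_pdf helper (which relies on the pdfplumber and pdf2image extras), and matches tokens to boxes with plain coordinate checks rather than any particular utility from the demo.

```python
import layoutparser as lp
import numpy as np
import pandas as pd

# Fine-tuned model from the training step (paths are illustrative).
bib_model = lp.Detectron2LayoutModel(
    config_path="outputs/bib_model/config.yaml",
    model_path="outputs/bib_model/model_final.pth",
    label_map={0: "BibItem"},
)

DPI = 144
SCALE = DPI / 72  # PDF token coordinates are in points (72 per inch)

# Load per-page tokens and rendered page images from the PDF.
page_tokens, page_images = lp.load_pdf("paper.pdf", load_images=True, dpi=DPI)

rows = []
for page_idx, (tokens, image) in enumerate(zip(page_tokens, page_images)):
    # 1. Run layout detection on the rendered page image.
    layout = bib_model.detect(np.array(image))

    # 2. and 3. For each detected bib box, gather the PDF tokens whose centers
    # fall inside it and merge them back into a single string.
    for region in layout:
        x1, y1, x2, y2 = (c / SCALE for c in region.coordinates)
        inside = [
            (i, tok.text)
            for i, tok in enumerate(tokens)
            if x1 <= (tok.coordinates[0] + tok.coordinates[2]) / 2 <= x2
            and y1 <= (tok.coordinates[1] + tok.coordinates[3]) / 2 <= y2
        ]
        if not inside:
            continue
        rows.append({
            "page": page_idx,
            "first_token": inside[0][0],  # token index, used to restore order
            "text": " ".join(text for _, text in inside),
        })

# Sort by page and token index so entries follow the logical reading order.
df = pd.DataFrame(rows).sort_values(["page", "first_token"]).reset_index(drop=True)
print(df.head())
```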

Michael
What were people using before Layout Parser?

Shannon
Before Layout Parser, people would often search GitHub for one-off solutions that worked only for very specific layouts. But those tools weren’t scalable or easy to generalize. Layout Parser is designed to be flexible—it can work across many document types using a unified interface and fine-tuning workflow.

Ben
And it’s not just about layout detection. Layout Parser provides infrastructure from start to finish—from annotation to training to deployment. That full pipeline is really valuable.

Michael
So before Layout Parser, people had to build custom solutions for each use case. Now they can generalize across layouts?

Shannon
Exactly. And with Label Studio, you can label the data you need to make that generalization possible.

Michael
Any questions from the audience before we move into the next part of the demo?

Ben
Let’s keep going and loop back to questions in a few minutes.

Shannon
Great. So once the model is trained, we test it on other documents. We download new PDFs, run the model, and extract layouts. It’s not perfect, but the improvement is clear—and adding more training data can further increase accuracy.

We then extract the text inside the bounding boxes, page by page, using Layout Parser’s PDF token utilities. We combine this text into structured entries and load it into a dataframe. One final enhancement is sorting entries using the PDF token order so they match the logical reading order.

There’s still more work to be done, like handling references split across pages or formatting multi-line entries, but we’re getting close to a clean, automated pipeline.

Michael
There was a question earlier—how large of a GPU is needed for training?

Shannon
You’ll want something with at least 12GB of VRAM—an AWS p2.xlarge instance should be fine. For smaller datasets, training takes under an hour.

Michael
Another question: Have you tried active learning with Layout Parser?

Shannon
Yes, we published a paper on optimizing active learning for layout object annotation. The idea is to prioritize the most informative samples for labeling. Not all examples are equally useful for training. Active learning lets us focus on the ones that are hardest for the model, improving performance with less manual effort.

Michael
Have you integrated that into the tool?

Shannon
Not directly, but the pipeline supports it. You can run uncertainty sampling and feed results into Label Studio for labeling.
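As a rough sketch, a least-confidence pass over unlabeled pages could look like this; the scoring heuristic is just one simple choice, and the model paths are the illustrative ones from earlier.

```python
import glob

import cv2
import layoutparser as lp

# Fine-tuned model from the earlier steps (paths are illustrative).
model = lp.Detectron2LayoutModel(
    config_path="outputs/bib_model/config.yaml",
    model_path="outputs/bib_model/model_final.pth",
    label_map={0: "BibItem"},
)

# Score each unlabeled page by its least-confident detection; pages where the
# model is most uncertain are the most informative ones to annotate next.
scores = []
for path in glob.glob("unlabeled_pages/*.png"):
    layout = model.detect(cv2.imread(path)[..., ::-1])
    page_score = min((b.score for b in layout), default=0.0)
    scores.append((page_score, path))

for score, path in sorted(scores)[:20]:
    print(f"{score:.2f}  {path}")
```

The selected pages can then be uploaded to the Label Studio project for annotation, closing the loop.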

Michael
Someone asked whether Layout Parser can replace PAWLS.

Shannon
PAWLS is another tool we built for editing PDF annotations. It snaps bounding boxes to PDF tokens, which is helpful for well-structured digital PDFs. But if your documents are scanned or low-quality, that snapping becomes unreliable. In those cases, using Label Studio is more flexible.

Ben
Exactly. If you're working with historical or noisy scans, you’ll want the adaptability Label Studio offers.

Michael
Another question: Can Layout Parser handle multilingual documents?

Shannon
Yes, and this is one of its strengths. Layout Parser works on images—it doesn’t rely on language models. That means it can handle documents with mixed languages like Traditional Chinese and English without issue. In fact, visual patterns across different scripts can actually help the model distinguish content more effectively.

Michael
Can Layout Parser also predict block reading order?

Shannon
Yes. There’s ongoing research on that, including transformer-based approaches for predicting reading order as a graph or sequence. While it’s not fully integrated yet, it’s something we’re working on incorporating into Layout Parser.

Michael
Thanks, everyone. We’ll share the slides and code after the session. If you have more questions, drop them in Slack.

Shannon
Thanks for joining!

Ben
Thank you!
