Building a public dataset for FairScan
November 14, 2025
To automatically detect documents, FairScan uses an image segmentation model. The model's accuracy depends heavily on the quality and size of its training data, which needs to reflect the situations the model should be able to handle. For FairScan, I decided to build a dedicated dataset and to make it public. That turned out to be quite an experience.
Why build a public dataset?
Before starting, I looked for existing datasets. I found datasets for OCR or for unwarping documents. I also found a dataset for document scanning, but it was focused on videos and only included a few documents. I could not find anything that matched what I needed, so I had to build it myself.
I could have built a private dataset, but I did not seriously consider that option. FairScan is an open source app and the spirit of open source development is to make it possible for other people to understand how an application works and to modify it. For a model, the training data is at least as important as the source code. Publishing the code without publishing the dataset would have made little sense, especially for an app that aims to be respectful. It was clear to me I had to build a public dataset.
What the dataset contains
A dataset should contain examples of what the model is expected to produce for given inputs. For an image segmentation model that should detect documents, the input is an image and the expected output is a mask that shows which pixels correspond to a document. Here is an example:

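In array terms, an input/output pair like the one above can be sketched as follows. This is only an illustration of the data structure, not FairScan's actual pipeline code, and the tiny shapes are made up:

```python
import numpy as np

# A hypothetical 4x6-pixel input photo: height x width x RGB channels.
image = np.zeros((4, 6, 3), dtype=np.uint8)

# The expected output is a binary mask with the same height and width:
# 1 where a pixel belongs to a document, 0 everywhere else.
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 1  # the document occupies this region of the photo
```

The model is trained to produce the mask from the image, so each photo in the dataset is stored together with its mask.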
To make the model reliable, the dataset needs to cover the situations in which people use a scanning app:
- Various types of documents, for example standalone sheets, books and magazines
- Various supports and backgrounds, for example a busy desk, a kitchen table or a carpet
- Various framing and perspectives, for example close, far or low angle
- Various lighting conditions, for example bright sunlight or a dim lamp
Generally speaking, models perform better when they have more data, including some difficult examples, such as a white document on a white background. For an early development version of the app, I started with a dataset of 100 images. As of today, FairScan's dataset contains more than 600 images. My family often saw me taking pictures of documents in strange situations, for example in dim light on the white background of the bathroom sink.
Annotating images
Taking pictures is only one part of the job. The other part is to annotate them, which means defining the expected output for each image. Among the multiple tools that exist for that purpose, I used labelme. It allows you to draw polygons directly on the image to create segmentation masks. For a simple sheet of paper lying very flat, it can be quick. For the curved page of a book, approximating the shape with a polygon can require many clicks.
Since the model's precision depends on the precision of the dataset, I did this carefully, zooming in to place polygon vertices as close as possible to the visible contours. It was certainly tedious, but it was not the only time-consuming task.
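labelme saves each annotation as a JSON file containing the polygon points, which then has to be rasterized into the mask the model trains on. A minimal sketch of that conversion, assuming the standard labelme JSON layout (a "shapes" list plus "imageWidth" and "imageHeight" fields) and using Pillow to draw the polygons:

```python
import json
from PIL import Image, ImageDraw

def labelme_to_mask(annotation_path):
    """Rasterize the polygons of a labelme JSON annotation into a
    binary mask: 255 for document pixels, 0 for the background."""
    with open(annotation_path) as f:
        ann = json.load(f)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        # Keep only polygon annotations; labelme also supports other types.
        if shape.get("shape_type", "polygon") != "polygon":
            continue
        points = [tuple(p) for p in shape["points"]]
        draw.polygon(points, fill=255)
    return mask
```

The function name and the black-and-white convention are my own choices here; the important part is that every click made in labelme ends up as a polygon vertex in this JSON.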
Documents for a public dataset
One problem I faced is that I cannot take pictures of just any document and put them in a public dataset. Doing so could lead to copyright issues. I take that seriously, especially since I want FairScan to be respectful not only of its users but also of other people's work.
Some documents can be used without any issue. That includes everything that is in the public domain, and Wikimedia Commons has many such documents. I believe that forms issued by public administrations are also acceptable. I added photos of documents that I created in the past, or that were created by people I know who agreed to that usage. That was a start, but it was not enough to represent the variety of documents people scan. Think of books, magazines or color prints.
So I created additional documents myself. I took inspiration from layouts I found at home or online, and recreated documents with dummy content such as Lorem ipsum or with photos from Unsplash, which is allowed by their licence. I created business cards, flyers and textbook pages that I inserted into real books. They only need to look like real documents:

Avoiding overfitting
Creating documents takes time, so the temptation is to reuse them in many images. However, this creates a risk of overfitting. If the dataset contains many images of the same document, the model may learn to recognize that specific document instead of learning what a document looks like in general. I tried to limit the dataset to a maximum of 5 images per document. To reach more than 600 images, I still needed over 120 different documents.
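A rule like "at most 5 images per document" is easy to enforce with a small script. The sketch below assumes a hypothetical file naming scheme where the part of the filename before the first underscore identifies the document (e.g. "invoice-03_kitchen.jpg"); FairScan's actual dataset layout may differ:

```python
from collections import Counter
from pathlib import Path

MAX_IMAGES_PER_DOCUMENT = 5

def documents_over_limit(image_dir):
    """Return the documents that appear in too many images, assuming
    filenames like '<document-id>_<variant>.jpg'."""
    counts = Counter(p.stem.split("_")[0]
                     for p in Path(image_dir).glob("*.jpg"))
    return {doc: n for doc, n in counts.items()
            if n > MAX_IMAGES_PER_DOCUMENT}
```

Running such a check before each training run makes it harder for one over-photographed document to quietly dominate the dataset.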
Looking back at the process, building a public dataset clearly takes a significant amount of work. Life is easier for companies that train models without disclosing the training data: they can afford to be less careful with copyright, and they can outsource the annotation to low-paid workers who label data all day long. That's how the AI economy works. I chose a different path and I have no regrets. Yes, it required effort, but I feel real satisfaction in knowing that this dataset forms the foundation of FairScan. It is fully aligned with the goal of building a respectful app, and FairScan is a small demonstration that this approach is possible.