7. Newspapers

When working with newspapers, Layout Recognition can be challenging. For good results, first apply a Field Model, then adjust the advanced configuration settings for line detection.

Processing newspapers may be challenging if you run the Text Recognition directly. The issue often lies not in the model's ability to read the text but in Layout Recognition, which is integrated into Text Recognition as its first step.

The solution is to run the Layout and Text Recognition as two separate steps.

For best results, we suggest first using a Field Model to segment the newspaper page into separate text blocks (articles and headings) and then running the Layout Recognition while keeping the existing text regions.

Adjusting the advanced layout configuration settings helps ensure that every line is recognised along its entire length.

Most of these settings depend on the type of newspaper and the image quality you are dealing with, so finding good values can be a trial-and-error process. We recommend testing the settings on a few pages before running the Layout Recognition on the entire document.
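One way to choose the test pages is to spread them evenly across the document, so that front pages, inner pages and back pages are all represented. The following is a minimal sketch of that idea in Python; the function name `sample_pages` and the default of five pages are illustrative assumptions, not part of Transkribus.

```python
def sample_pages(total_pages: int, sample_size: int = 5) -> list[int]:
    """Return 1-based page numbers spread evenly across a document.

    Hypothetical helper for picking test pages; not a Transkribus function.
    """
    if total_pages < 1 or sample_size < 1:
        raise ValueError("total_pages and sample_size must be positive")
    if sample_size >= total_pages:
        # Small document: test every page.
        return list(range(1, total_pages + 1))
    if sample_size == 1:
        return [1]
    # Integer arithmetic keeps the first and last page in the sample.
    return [
        1 + (i * (total_pages - 1)) // (sample_size - 1)
        for i in range(sample_size)
    ]


print(sample_pages(40, 5))
```

The resulting page numbers can then be selected by hand in the document before starting the recognition.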

In general, these are the steps to follow to recognise newspaper layout:

Step 1: Newspaper Page Segmentation (articles, headings...)

This step identifies the text regions. In the case of newspapers, it is essential to segment the page into separate text blocks, such as articles and headings. Otherwise, you will end up with a large text region comprising the entire page, which will result in a mixed-up reading order of the lines.
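To illustrate why a single page-wide region mixes up the reading order, the toy example below sorts the lines of a two-column page in two ways. It is a simplified sketch, assuming lines are ordered top to bottom by their vertical position; the coordinates, region names and functions are made up for illustration and do not reflect how Transkribus computes reading order internally.

```python
from typing import NamedTuple


class Line(NamedTuple):
    text: str
    x: int       # left edge of the line, in pixels
    y: int       # vertical position of the line, in pixels
    region: str  # text region the line belongs to


# Two articles side by side; the right column sits slightly lower.
lines = [
    Line("A1", 100, 200, "article_a"),
    Line("A2", 100, 240, "article_a"),
    Line("B1", 900, 210, "article_b"),
    Line("B2", 900, 250, "article_b"),
]


def order_as_single_region(lines):
    """Treat the whole page as one region: sort top to bottom only."""
    return [l.text for l in sorted(lines, key=lambda l: (l.y, l.x))]


def order_by_region(lines, region_order):
    """Read each region completely before moving on to the next one."""
    ordered = []
    for region in region_order:
        in_region = [l for l in lines if l.region == region]
        ordered.extend(l.text for l in sorted(in_region, key=lambda l: l.y))
    return ordered


print(order_as_single_region(lines))                       # lines of both articles interleaved
print(order_by_region(lines, ["article_a", "article_b"]))  # each article read in full
```

With one region, the lines of the two articles alternate; with separate regions, each article is read from its first line to its last.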

  1. Select the document or the specific page(s) you want to process.
  2. Click on the "Recognise" button.
  3. Go to the top of the recognition section and choose "Fields."
  4. Search for a suitable public or private Field model trained to segment newspapers.
  5. Start the Recognition.

If a public Field model for newspapers is not available or does not perform as expected on your documents, you can train a Field model specifically for your newspapers. This allows you to choose whether the model should simply segment the pages into text regions or also add structural tags (such as article, heading, advertisement, list, figure, etc.). For more information on training a Field model, please refer to this article.

Step 2: Layout Recognition with Advanced Settings

  1. Select the document or the specific page(s) you want to process.
  2. Click on the "Recognise" button.
  3. Go to the top of the recognition section and choose "Layout."
  4. Choose the baseline model and modify the Advanced Settings.
    In our experience, the following configuration settings have proven effective for most newspapers. However, the best values may vary depending on the quality of the images and the type of newspaper you are working with.

    Layout Model: Mixed Line Orientation
    (click on Advanced Settings)
    Generation of Text Regions (Layout Blocks): Keep existing
    Image Scaling: Upscale
    (click on Baseline Options)
    Minimal Baseline Length: Low
    Baseline Accuracy Threshold: High
    Use Trained Separators: No
    Max distance for merging baselines: Medium
    Split Lines on Regions border: Yes

    [Figure: Baseline Settings - Newspapers]
    We recommend starting with the suggested settings and adjusting them, if needed, based on the information provided on the Advanced Layout Configuration Settings page.

    In some cases, we have found that doubling the size of the images before uploading them to Transkribus improves the results.
  5. Start the Layout Recognition.
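The image resizing mentioned in step 4 has to be done outside Transkribus, before upload. The following is a minimal sketch, assuming the third-party Pillow library is installed; the function names and file names are hypothetical, and the choice of Lanczos resampling is an assumption rather than a Transkribus recommendation.

```python
from pathlib import Path


def doubled_size(width: int, height: int, factor: int = 2) -> tuple[int, int]:
    """Return the pixel dimensions after enlarging by `factor`."""
    return width * factor, height * factor


def upscale_image(src: Path, dst: Path, factor: int = 2) -> None:
    """Save an enlarged copy of `src` at `dst` (hypothetical helper)."""
    # Pillow is a third-party package (pip install Pillow). It is imported
    # here so that doubled_size() can be used without it.
    from PIL import Image

    with Image.open(src) as img:
        new_size = doubled_size(img.width, img.height, factor)
        # Lanczos resampling is an assumed choice for scanned text.
        img.resize(new_size, Image.LANCZOS).save(dst)


# Example call with made-up file names:
# upscale_image(Path("page_001.jpg"), Path("page_001_2x.jpg"))
```

Doubling both dimensions quadruples the pixel count, so it is worth trying this on a few pages first to check whether the larger files actually improve line detection.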

Step 3: Text Recognition

Apply the most appropriate Text Model to automatically transcribe the content of the newspapers, as explained on this page.