
Data Preparation

Transcribe at least 25 pages before training a Text Recognition model: these pages will be the data (Ground Truth) on which the model will train and learn to recognise a new script.


Before starting the training of a Text Recognition model, you need to prepare the Ground Truth data, i.e. the images and the corresponding accurate transcriptions on which the model will learn. 

Ground Truth is a term used in Machine Learning. In Transkribus, it is used to indicate the images and the corresponding transcriptions used to train the Artificial Intelligence. The transcriptions should be as accurate as possible because any mistake in the Ground Truth will train the model to learn something wrong. 

Depending on the type of material and the number of hands, between 5,000 and 15,000 words (around 25-75 pages) of transcribed material are required to start. In general, the neural networks of the Text Recognition engine learn quickly: the more training data they have, the better the results will be.

If you are working on printed material, 5,000 words should be sufficient to achieve a good Character Error Rate.

In the case of handwritten documents, our advice is to train the model on at least 10,000 words for each hand. Models trained on a large amount of training data (more than 100,000 words) comprising many hands from the same period and region should be capable of recognising hands not seen at all during training: the results, however, will probably be somewhat worse than the Character Error Rate measured on the Validation Data suggests.
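
The Character Error Rate mentioned above is the proportion of wrongly recognised characters: the number of character insertions, deletions and substitutions needed to turn the automatic transcription into the correct text, divided by the length of the correct text. As a rough, purely illustrative sketch (the example strings are made up and this code is not part of Transkribus), it can be computed like this:

    # Minimal sketch of a Character Error Rate (CER) calculation.
    # The Ground Truth and recognised strings below are made-up examples.

    def levenshtein(reference: str, hypothesis: str) -> int:
        """Minimum number of character insertions, deletions and substitutions."""
        previous = list(range(len(hypothesis) + 1))
        for i, ref_char in enumerate(reference, start=1):
            current = [i]
            for j, hyp_char in enumerate(hypothesis, start=1):
                cost = 0 if ref_char == hyp_char else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    def character_error_rate(reference: str, hypothesis: str) -> float:
        """Edit distance divided by the length of the correct (Ground Truth) text."""
        return levenshtein(reference, hypothesis) / max(len(reference), 1)

    ground_truth = "In the beginning was the Word"
    recognised = "In the beginnyng was the Vord"
    print(f"CER: {character_error_rate(ground_truth, recognised):.2%}")  # 6.90%

Two substituted characters in a 29-character line give the roughly 7% shown above; the Character Error Rate reported after training is computed in the same spirit on the Validation Data.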

The Ground Truth should include examples of all the scripts that you want your model to be able to transcribe. It is possible to train models capable of recognising two or more hands, languages, types of writing or alphabets at the same time: however, all these variants must be present in a representative manner in the Ground Truth.

The pages to include in the Ground Truth are, therefore, important because they will affect the effectiveness of the model. For instance, if you want to train a model that recognises the hands of three different writers, you will have to transcribe about 10,000 words for each writer. In the case of a writer whose handwriting changed over time, the Ground Truth should comprise pages written over various years that are representative of the changes.
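
As a quick back-of-the-envelope check (purely illustrative: the writer names and page texts below are hypothetical, and the figure of roughly 200 words per page simply follows from the 5,000-15,000 words ≈ 25-75 pages mentioned above), you could tally the transcribed words per hand and estimate how many pages each writer still needs:

    # Illustrative sketch only: check that every hand in the planned Ground Truth
    # reaches roughly 10,000 transcribed words. The writers and page texts are
    # hypothetical examples, not a Transkribus export.

    TARGET_WORDS_PER_HAND = 10_000
    WORDS_PER_PAGE = 200  # rough average implied by 5,000-15,000 words ~ 25-75 pages

    transcribed_pages = {
        "writer_A": ["first transcribed page ...", "second transcribed page ..."],
        "writer_B": ["only one page so far ..."],
        "writer_C": [],
    }

    for hand, pages in transcribed_pages.items():
        words = sum(len(page.split()) for page in pages)
        missing = max(TARGET_WORDS_PER_HAND - words, 0)
        pages_to_go = -(-missing // WORDS_PER_PAGE)  # ceiling division
        print(f"{hand}: {words} words transcribed, about {pages_to_go} pages still to go")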

There are two ways to create the Ground Truth:

  1. Manually:
    Run the Layout Recognition on the pages to be included in the Ground Truth and transcribe them accurately, as explained on the Transcribing Manually page. Then save them as Ground Truth.

  2. Partly automatically, partly manually:
    If there is a Text Recognition model that works sufficiently well on your documents, but you would like to train a more accurate one, you can first run the model on your documents, as explained on this page. Then manually correct the automatically generated transcriptions and save them as Ground Truth.

In both cases, it is important that the Ground Truth transcriptions are as accurate and correct as possible and that you are consistent with your editorial choices.

Conventions

The most common approach is to create a consistent transcript that accurately represents what you read in your document, including errors and punctuation. This is known as a diplomatic transcription: combined words, upper and lower case, superscripts and subscripts, and punctuation marks are all transcribed as they appear in the document. The advantage of this approach is a strong model that transcribes exactly what is shown in the image.

However, the neural networks can learn, to a certain extent, to apply your transcription conventions. If the conventions are adopted consistently in all transcriptions, and the Ground Truth is large enough, the model can learn to separate words that appear combined in the documents, normalise historical spelling, transcribe superscripts and subscripts in line with the rest of the text, and expand abbreviations (see the next section).

In particular:

  • Diacritical characters (e.g. accents, circumflexes, cedillas, hyphens, tildes): it is up to you whether you want the Text Recognition model to make a diplomatic transcription or to normalise words according to modern orthography. Both approaches are fine; you just need to choose one and be consistent.

  • i/j and I/J: the letters “i” and “j” were often used interchangeably. You can decide to transcribe the letters as they appear in the document or to follow the spelling in use today.

  • u/v and U/V: historical documents often use “v” at the beginning of words and “u” in the middle and at the end. You can decide to transcribe the letters as they appear in the document or to follow the spelling in use today.
  • Ligatures: common combinations of letters that form a new character. They can be transcribed in full, using the single characters composing the ligature (e.g. “præs” becomes “praes”; see the sketch after this list).
  • S-characters: the letter “s” can appear in different forms. Normal and long “s” (with descender) can both be transcribed as a normal “s”, or according to their shape as “s” or “ſ” (U+017F). Double “s” or “ß” (sharp “s” or “Eszett”) is transcribed according to the original text.
  • Hyphenated words: when hyphenated words appear at the end of a line, they should be transcribed and broken up according to the original text. Add a “-” at the end of the line only if it is present in the original.
  • Text styles: with the Tags button, you can tag words or portions of words as bold, italic, strikethrough, underline, superscript or subscript. If you include these tags when training the model, they will be added automatically when recognising new pages (for now, this feature is only available in Transkribus eXpert: read more about it on the Model Setup and Training page).
  • Fonts: different fonts such as Kurrent or Antiqua are not specially marked.
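
Whatever conventions you choose, it helps to write them down as explicit rules. As a purely illustrative sketch (these rules are one possible convention, not something Transkribus requires, and a strictly diplomatic transcription would skip this step entirely), resolving ligatures and normalising the long “s” in exported plain-text transcriptions could look like this:

    # Illustrative sketch only: one possible set of transcription conventions
    # (resolve ligatures, normalise the long s), applied consistently to
    # plain-text transcriptions before saving them as Ground Truth.

    CONVENTIONS = {
        "æ": "ae", "Æ": "Ae",   # ligatures transcribed in full
        "œ": "oe", "Œ": "Oe",
        "ſ": "s",               # long s (U+017F) normalised to a normal "s"
    }

    def apply_conventions(text: str) -> str:
        for original, replacement in CONVENTIONS.items():
            text = text.replace(original, replacement)
        return text

    print(apply_conventions("præſentia"))  # -> "praesentia"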

Each user can use the conventions best suited to their needs. What is important is to be consistent: we recommend taking note of your decisions while transcribing the Ground Truth pages and adding the conventions you used in the Details field of the Text Recognition model.

Abbreviations

According to your needs, you can decide to train the model to:

  • Keep the abbreviated form: transcribe the abbreviations as they appear in the documents, using the base characters or the special characters most similar to those written by the writer.
  • Transcribe the expanded form: the neural networks are often able to learn to recognise and use expansions, especially if they appear frequently. You just need to write the expansion of the abbreviation in the transcriptions, taking care to expand it in the same way every time (see the sketch after this list).
  • Tag the abbreviation and add the corresponding expansion as a property: in the Ground Truth, transcribe the abbreviations as they appear, tag them and add the expanded form in the “expansion” field (a property of the Abbreviation Tag). When training the model, check the option to train the Abbreviation tags including the expansions as well (read more about it on the Model Setup and Training page).
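
If you opt for the expanded form, a fixed lookup table is a simple way to make sure every occurrence of an abbreviation is always resolved identically. A purely illustrative sketch (the abbreviations and expansions below are made-up Latin examples, not a list provided by Transkribus):

    # Illustrative sketch only: expand abbreviations with a fixed lookup table
    # so that every occurrence is resolved in exactly the same way.

    EXPANSIONS = {
        "d'nus": "dominus",
        "eccl'ia": "ecclesia",
        "sc'dm": "secundum",
    }

    def expand_abbreviations(line: str) -> str:
        for abbreviated, expanded in EXPANSIONS.items():
            line = line.replace(abbreviated, expanded)
        return line

    print(expand_abbreviations("sc'dm d'nus dixit"))  # -> "secundum dominus dixit"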

Transkribus eXpert (deprecated)

Before starting the training of a Text Recognition model, you need to prepare the Ground Truth data, i.e. the images and the corresponding accurate transcriptions on which the model will learn.

Ground Truth is a term used in Machine Learning. In Transkribus, it is used to indicate the images and the corresponding transcriptions used to train the Artificial Intelligence. The transcriptions should be as accurate as possible because any mistake in the Ground Truth will train the model to learn something wrong.

Depending on the type of material and the number of hands, between 5,000 and 15,000 words (around 25-75 pages) of transcribed material are required to start. In general, the neural networks of the Handwritten Text Recognition engine learn quickly: the more training data they have, the better the results will be.

If you are working on printed material, 5,000 words should be sufficient to achieve a good Character Error Rate.

In the case of handwritten documents, our advice is to train the model on at least 10,000 words for each hand. Models trained on a large amount of training data (more than 100,000 words) comprising many hands from the same period and region should be capable of recognising hands not seen at all during training: the results, however, will probably be somewhat worse than the Character Error Rate measured on the Validation Data suggests.

The Ground Truth should include examples of all the scripts that you want your model to be able to transcribe. It is possible to train models capable of recognising two or more hands, languages, types of writing or alphabets at the same time: however, all these variants must be present in a representative manner in the Ground Truth.

The pages to include in the Ground Truth are, therefore, important because they will affect the effectiveness of the model. If you want to train a model that recognises the hands of three different writers, you will have to transcribe about 10,000 words for each writer. In the case of a writer whose handwriting changed over time, the Ground Truth should comprise pages written over various years that are representative of the changes.

There are two ways to create the Ground Truth:

  1. Manually:
    Run the Layout Recognition on the pages to be included in the Ground Truth; transcribe them accurately in the Text editor and save them as Ground Truth.

  2. Partly automatically, partly manually:
    If there is a Text Recognition model that works sufficiently well on your documents, but you would like to train a more accurate one, you can first run the model on your pages. Then manually correct the automatically generated transcriptions and save them as Ground Truth.

In both cases, it is important that the Ground Truth transcriptions are as accurate and correct as possible and that you are consistent with your editorial choices.

Conventions

The most common approach is to create a consistent transcript that accurately represents what you read in your document, including errors and punctuation. This is known as a diplomatic transcription: combined words, upper and lower case, superscripts and subscripts, and punctuation marks are all transcribed as they appear in the document. The advantage of this approach is a strong model that transcribes exactly what is shown in the image.

However, the neural networks can learn, to a certain extent, to apply your transcription conventions. If the conventions are adopted consistently in all transcriptions, and the Ground Truth is large enough, the model can learn to separate words that appear combined in the document, normalise historical spelling, transcribe superscripts and subscripts in line with the rest of the text, and expand abbreviations (see the next section).

In particular:

  • Diacritical characters (e.g. accents, circumflexes, cedillas, hyphens, tildes): it is up to you whether you want the Text Recognition model to make a diplomatic transcription or to normalise words according to modern orthography. Both approaches are fine; you just need to choose one and be consistent.
  • i/j and I/J: the letters “i” and “j” were often used interchangeably. You can decide to transcribe the letters as they appear in the document or to follow the spelling in use today.
  • u/v and U/V: historical documents often use “v” at the beginning of words and “u” in the middle and at the end. You can decide to transcribe the letters as they appear in the document or to follow the spelling in use today.
  • Ligatures: common combinations of letters that form a new character. They can be transcribed in full, using the single characters composing the ligature (e.g. “præs” becomes “praes”).
  • S-characters: the letter “s” can appear in different forms. Normal and long “s” (with descender) can both be transcribed as a normal “s”, or according to their shape as “s” or “ſ” (U+017F). Double “s” or “ß” (sharp “s” or “Eszett”) is transcribed according to the original text.
  • Hyphenated words: when hyphenated words appear at the end of a line, they should be transcribed and broken up according to the original text. Add a “-” at the end of the line only if it is present in the original.
  • Text styles: with the Formatting Bar at the bottom of the Text Editor, you can tag words or portions of words as bold, italic, subscript, superscript, underline, and strikethrough. If you include these tags when training the model, they will be added automatically when recognising new pages (read more about it on the Model Setup and Training page).
  • Fonts: different fonts such as Kurrent or Antiqua are not specially marked.

Each user can use the conventions best suited to their needs. What is important is to be consistent: we recommend taking note of your decisions while transcribing the pages and writing down the conventions you used in the Details field of the model.

Abbreviations

According to your needs, you can decide to train the model to:

  • Keep the abbreviated form: transcribe the abbreviations as they appear in the documents, using the base characters or the special characters most similar to those written by the writer.
  • Transcribe the expanded form: the neural networks are often able to learn to recognise and use expansions, especially if they appear frequently. You just need to write the expansions of the abbreviations in the transcriptions, taking care to expand them in the same way every time.
  • Tag the abbreviation and add the corresponding expansion as a property: in the Ground Truth, transcribe the abbreviations as they appear, tag them and add the expanded form in the “expansion” field (a property of the Abbreviation Tag). When training the model, check the option to train the Abbreviation tags with expansions as well (read more about it on the Model Setup and Training page).