The Text-Image Matching (TIM) feature in Transkribus allows you to import and align existing text transcriptions with images of documents. This feature is particularly useful for those who have pre-existing transcriptions in .docx or .txt formats.
How to Use Text-Image Matching
Step 1: Upload Your Images and Open a Document
Before you can match your transcription, you need to upload the images of your documents to Transkribus and open the relevant document to see all of your pages in your document.
Step 2: Import Text
-
Click the "Import Text" Button: This button is located in the main interface of your document workspace on the top right.
-
Choose a Matching Option:
- Use Existing Transcriptions: If you have already performed text recognition in Transkribus, you can use that recognised text to match against your transcription that you can upload as a
.docx
or.txt
file. - Recognise and Match Text: This option lets you upload a
.docx
or.txt
file and match it directly with the images using a model.
- Use Existing Transcriptions: If you have already performed text recognition in Transkribus, you can use that recognised text to match against your transcription that you can upload as a
Step 3: Select a Model
For text matching, you must select a model that aligns with the language and style of your document. Models can be either public or private. For instance, if you are working with German texts, you might select a model designed for German text recognition.
Step 4: Upload Your Transcription
- Upload Your File: You can upload your transcription in
.docx
or.txt
format. - Optional: Adjust Settings for Improved Matching:
- Block threshold: Adjust how closely paragraphs should match recognised text regions.
- Line Threshold: Set how similar a line of text needs to be to match the document image. The default is 0.45, meaning about two characters (from the automatic transcription) per line can be mismatched and still be recognised.
- Preserve Line Order: Keep the original reading order of lines in your transcription.
- Use Source Line Feeds: If your transcription has line breaks, enable this setting to match the entire line.
- Allow Double Matching: This setting allows a single line of text to match multiple instances, such as repeated text in the document.
- Add Unmatched Text at the End: Any text that cannot be matched will be added to the bottom of the document.
Step 5: Start the Matching Process
Once your settings are configured, start the process. Transkribus will first recognise the text in your document images and then match it against your uploaded transcription.
Viewing Results
After the matching process is complete:
- Matched Regions: You’ll see regions marked as "matched," indicating that the provided transcription has been successfully aligned with the document image.
- Review Unmatched Text: Any unmatched text will be added to a separate section at the bottom of the document for your review.
Tips for Usage
To achieve the best results, consider the following tips:
-
Default Settings: The default settings are optimised for lines with up to two character errors. Adjust settings based on the quality of your source text.
-
Source Line Feeds: If your transcription includes correct line feeds, ensure that the "Use Source Line Feeds" option is checked.
-
Page Breaks: Mark correct page breaks in your transcription with the word
TRP_PAGEBREAK
at the end of a page to improve matching speed and quality. -
Line Threshold:
- Increasing the threshold to 0.65 will result in fewer lines being matched but with higher accuracy (99% correct).
- Lower the threshold if your text has more than two character errors per line.
-
Handling Noise: If your transcription includes extra notes or information not present in the image, this can reduce the matching quality.
-
Double Matching: Generally, enabling double matching may reduce quality but is useful for documents with repeated text that needs to be matched to multiple lines or pages.