Discover all the formats available for downloading your documents from Transkribus so that you can store, publish or further analyse your transcriptions.
If you want to work with your images and transcriptions outside of Transkribus, you can download your documents from the platform. Different formats are available.
To download your images and transcriptions, select the document(s) or page(s) you would like to export. Click “Export” on the Tools menu and choose any of the following options:
- Images: to download the JPG file of each page of the document/each selected page.
- Files: to download the transcriptions and tags:
- PAGE XML: is an XML-based page image representation framework that records information on image characteristics in addition to layout structure and page content. Choosing this format, you will download an XML file for each page, containing the layout information as well as the content of the page.
The complete format definition used in Transkribus can be accessed here.
- ALTO: is a special output format which allows you to input the exported document into other programs working with this format. Choosing this format, you will download an XML file for each page, containing the content and layout information of the page.
It is often used in combination with METS for the description of the whole digitized object and the creation of references across the ALTO files, e.g. description of the reading sequence.
More information about ALTO can be found at: http://www.loc.gov/standards/alto/
If you select Images, PAGE XML and/or ALTO, you will also get a metadata XML file and a METS (Metadata Encoding and Transmission Standard) file containing the links to the PAGE XML, ALTO and/or image files, depending on which options you choose. A METS file is like a container which includes all the background information about a file. More detailed information about METS can be found at: http://www.loc.gov/standards/mets/
- PDF: to download a PDF file of the selected document/pages.
The PDF will have two layers: the transcribed text (called OCR) and the image. Use the Layers panel to show or hide the content associated with each layer.
Thanks to the text layer, the PDF becomes searchable, and the term you are looking for will be highlighted (note that the highlighted area may not perfectly match the word in the image because the word coordinates are estimated from the line coordinates and are therefore only approximate).
- TEI: this option is for people working with the Text Encoding Initiative (TEI). The Text Encoding Initiative is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s.
With this option, you will download a TEI XML file created using the XSLT from Dario Kampkaspar available here: https://github.com/dariok/page2tei
- Docx: you will get the transcriptions in Word files, one file per document.
- Tag XLSX: if you would like to export the textual tags you assigned to your transcription, select this option to produce an Excel file with individual tabs for each tag category and one tab with an overview of all the tags.
- Table XLSX: if your document presents tables, use this option to export them in Excel format. Each table will be exported as a separate sheet of the Excel file.
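Since the PAGE XML export gives you one XML file per page, the transcription can be read back with any XML parser. The following is a minimal sketch in Python, not an official Transkribus tool: it assumes the usual PAGE structure of `TextLine` elements that each carry a `TextEquiv`/`Unicode` child, and it ignores the XML namespace so that it does not depend on a particular schema version. The sample document, its ids and its text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative PAGE XML fragment; ids, text and image name are invented.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Coords points="10,10 500,10 500,60 10,60"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="10,70 500,70 500,120 10,120"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def page_xml_lines(xml_text: str) -> list[tuple[str, str]]:
    """Return (line id, text) pairs for every TextLine, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        text = ""
        # Read only the line's own TextEquiv, not those of nested Word elements.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                for grandchild in child:
                    if local_name(grandchild.tag) == "Unicode":
                        text = grandchild.text or ""
        lines.append((element.get("id", ""), text))
    return lines


if __name__ == "__main__":
    for line_id, text in page_xml_lines(SAMPLE_PAGE):
        print(line_id, text)
```

To use it on a real export, pass the contents of one of the downloaded page files to `page_xml_lines` instead of the sample string.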
When you start the job, the download will be processed on the Transkribus server, and you will receive an email with the link to download your files. The link expires in two weeks. You can always check the status of the download with the Jobs button.
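Once the export has been downloaded and unpacked, the METS file can serve as an index of what the package contains. The sketch below, written in Python under stated assumptions, lists the file links recorded in a METS document. It relies only on the standard METS elements `fileGrp`, `file` and `FLocat` with an `xlink:href` attribute; the group names and paths in the sample are invented, and the layout of a real Transkribus export may differ.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Illustrative METS fragment; group ids and file paths are invented.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                 xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp ID="IMG">
      <mets:file ID="IMG_1">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="0001.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp ID="PAGEXML">
      <mets:file ID="PAGEXML_1">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def mets_file_links(xml_text: str) -> list[tuple[str, str]]:
    """Return (file group label, link) pairs for every file listed in the METS."""
    root = ET.fromstring(xml_text)
    links = []
    for group in root.iterfind(".//mets:fileGrp", NS):
        # A group may be labelled with USE or ID; take whichever is present.
        label = group.get("USE") or group.get("ID") or ""
        # Only direct child files, so nested groups are not counted twice.
        for flocat in group.iterfind("mets:file/mets:FLocat", NS):
            href = flocat.get(XLINK_HREF)
            if href:
                links.append((label, href))
    return links


if __name__ == "__main__":
    for label, href in mets_file_links(SAMPLE_METS):
        print(label, href)
```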
For now, there is no option to export and display tags in PDF and Docx. More download options (tags, line and page breaks, tables…) will be implemented soon. In the meantime, you can use the export function in Transkribus eXpert, as described below.
Transkribus eXpert (deprecated)
If you want to work with your images and transcriptions outside of Transkribus, you can export your documents from the platform. Different export formats and features are available to suit your needs.
To open the Export window, click on the folder icon with the green arrow pointing to the right in the Main Bar:
The Export document window that opens has two tabs, between which you have to choose:
- Server export: the export will be processed on the Transkribus server, and you will receive a link to download your files. The export will not slow your computer down, and the process will not be interrupted if you switch your computer off. After starting the download, you can check the progress of your export by clicking the “Jobs” button in the “Server” tab.
- Client export: the files will be saved directly to your computer. Please choose where you would like to save the exported files: type the file location in the “Base folder” box at the top of the window.
These are the available export formats:
- Transkribus Document: if you export your transcription as a Transkribus Document, you will produce a METS (Metadata Encoding and Transmission Standard) file containing the links to the PAGE XML, ALTO and/or image files, depending on which options you choose.
A METS file is like a container which includes all the background information about a file. More detailed information about METS can be found at: http://www.loc.gov/standards/mets/
In conjunction with the METS file, you can export your document in these formats:
- PAGE: is an XML-based page image representation framework that records information on image characteristics in addition to layout structure and page content. The complete format definition used in Transkribus can be accessed here.
- ALTO: is a special output format which allows you to input the exported document into other programs working with this format. The format is XML-based and is commonly used for OCR output. It is often used in combination with METS for the description of the whole digitized object and the creation of references across the ALTO files, e.g. description of the reading sequence.
More information about ALTO can be found at: http://www.loc.gov/standards/alto/
With the “Split Lines Into Words” option, Transkribus will divide the lines into words. The program does this by analysing the spaces between words, even if no word segmentation has been performed previously.
- Images: choose this option to download the image file of each page of the document/the selected pages.
In Image type, you can choose to download the Original (the image you uploaded) or the JPEG compressed version of the image (the one you see in the Transkribus Image Window).
Under “Filename pattern”, you can choose how the filename will be composed. The second option, “filename”, is the standard one. With this option, the exported file will have the same name as the file you imported. This is important if you want to match local transcripts with the images in Transkribus: if you export a document, adjust it externally and then upload it to Transkribus again, the filenames need to match for the program to recognise the files properly.
- PDF: when you export a PDF file, you can choose between these options:
- “Images plus text layer”: you will see two layers in the exported PDF document: OCR (the transcribed text) and image (image of the document).
- “Images only”: you will produce a PDF file with the document as an image. This means that you will not see the transcribed text.
- “Extra text page”: the transcribed text will be added to the PDF as an extra page after each image.
- “Highlight tags”: select this option to highlight the tags in the exported PDF file. The tags will be shown in the same colours used in Transkribus. At the end of the document, there will also be a legend explaining the meaning of the different colours.
- “Highlight article”: the articles will be highlighted with different colours in the exported PDF.
- “PDF/A”: for long-term preservation.
You can also choose the font and image type to use in the PDF.
- TEI: this option is for people working with the Text Encoding Initiative (TEI). The Text Encoding Initiative is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. More information at: http://www.tei-c.org/index.xml
You can choose to create the TEI XML file using the XSLT from Dario Kampkaspar (available here: https://github.com/dariok/page2tei) or the “Client Export” format. You can try both and decide which one best suits your needs.
With the “Client Export” format, you can tick the option to export the predefined tags and attributes only: this creates a valid TEI file, but note that all the tags and attributes created by you will be ignored.
Transkribus enables you to choose the zones you need (no zones; zone per region; zone per line; zone per word). Furthermore, you can choose between line tags and line breaks to tag lines.
- DOCX: by choosing this option, you will get your transcriptions in Word files. You can select options relating to line breaks, abbreviations and more according to your needs.
Select “Export selected tags” to make the tags visible in the exported DOCX file. After the export of the Word document, please open it and do the following:
- Click on the paragraph button in the Home menu of Word
- Go to “References” and choose “Insert Index”
- An Office window will open up
- Select “Right align page numbers” and press “OK”
- A confirmation window will pop up: click “Yes”
- An overview of the tags should now appear at the end of the document. If the overview of the tags does not appear, click “Update Index”. This should solve the problem.
- TXT: if you do not usually work with Microsoft Word, it is possible to export your transcription as a simple TXT file.
You can choose to split the text into text files from a start tag to an end tag and create several text files; those files can be named according to one or more attributes of the tag.
- Tag Export (Excel): If you would like to export the tags you assigned to your transcription, select this option to produce an Excel file with individual tabs for each tag category and one tab with an overview of all the tags.
As described above, you can also export the tags in PDF and DOCX files.
- Table Export into Excel: if your document presents tables, use this option to export them in Excel format. Each table will be exported as a separate sheet of the Excel file. However, you can check the option “Create one large table” if you prefer to have all the tables of the selected pages in one table in one Excel sheet. You can also choose to export only one column of your table with the cells' image snippets.
- Page metadata into Excel: to export the page-related metadata in the Metadata-page tab in the Excel format.
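When the ALTO export is used together with “Split Lines Into Words”, each word is written out as its own element, which makes it possible to read words and their positions programmatically. The Python sketch below assumes the standard ALTO layout, in which words are `String` elements with `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes; the sample file, its ids and its coordinates are invented, and the namespace is ignored so that the code does not depend on one ALTO version.

```python
import xml.etree.ElementTree as ET

# Illustrative ALTO fragment; ids, words and coordinates are invented.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="10" VPOS="20" WIDTH="120" HEIGHT="30">
            <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="30"/>
            <SP/>
            <String CONTENT="world" HPOS="70" VPOS="20" WIDTH="60" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_words(xml_text: str) -> list[tuple[str, float, float, float, float]]:
    """Return (word, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        words.append(
            (
                element.get("CONTENT", ""),
                # Missing coordinates fall back to 0 rather than raising.
                float(element.get("HPOS", "0")),
                float(element.get("VPOS", "0")),
                float(element.get("WIDTH", "0")),
                float(element.get("HEIGHT", "0")),
            )
        )
    return words


if __name__ == "__main__":
    for word in alto_words(SAMPLE_ALTO):
        print(word)
```

Keep in mind that, as noted for the PDF text layer, word positions derived from line coordinates are estimates, so the boxes returned here should be treated as approximate.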
In addition to the export format, these options are selectable during the export:
- Version Status: to export a particular version of the document. If you select “Ground Truth”, for example, Transkribus will export only those pages of the document which you have marked as “Ground Truth.”
For the export, the program consults previous versions of your document. This means that if you choose to export all “In Progress” pages, the program will export all pages which have been marked as “In Progress”, even if their status is now updated. The program will export the latest “In Progress” version of your document. If you would like to export a specific former “In Progress” version of your page, open this version of the page in Transkribus. Open the Export window and select “Loaded version for current page” (available only for the Client Export). In the “Pages” option, select “Current” before confirming.
- Word Layer: if checked, the text from word layer segmentation will be exported (it works only if you have previously selected the “Add estimated word coordinates” during the Text Recognition).
- Blackening: If you have blacked out sensitive sections of your transcription, these words or phrases can also be hidden in the exported files. To do this, select “Do blackening” in the export options. Note: this option only works for Word, PDF and METS files.
- Create Title Page: with this option, a title page based on the information added in the “Document” tab within the “Metadata” tab is created. In the “Document” tab, you can add information about the title, author, language and date of your document. You can also create an Editorial Declaration to explain how exactly your document has been transcribed (more info on the Editorial Declaration on this page).
- Pages to be exported: select which pages you wish to export. You can export all the pages in your document or just the current page.
- All tags/chosen tags: to choose which tags you want to export.