Upload Documents

To upload a new document, click Project List in the project drop-down menu, then click the cloud upload button. The following formats are accepted:

  1. TXT, PDF, HTML and DOCX

  2. Native PDF

  3. JPG, PNG

  4. AWS OCR Textract (ZIP file containing JSON textract output and its associated PDF or image file)

  5. JSON: you can upload a JSON file with existing entities and relations. This is useful if you have a pre-annotated JSON file and you don’t want to re-annotate. The JSON should follow the format below:

  6. CSV (UTF-8): you can upload a CSV file containing one document per row. Note that the CSV must be UTF-8 encoded.

  7. TSV: you can upload a .tsv file containing pre-tagged tokens following the IOB format as shown below. Documents are separated by the token -DOCSTART- -X- O O at the start of each document.

  8. ZIP: you can upload a ZIP file containing TXT, PDF or HTML files. This is useful for uploading documents in bulk.
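
For the CSV, TSV and ZIP formats, the files can be prepared locally before upload. The following Python sketch is one possible way to do so, not the platform's own tooling. All function names are hypothetical. The TSV layout (one token and one tag per line, separated by a tab) is an assumption, since this page only specifies the IOB scheme and the -DOCSTART- separator line. The default source encoding cp1252 is likewise only an example.

```python
import zipfile
from pathlib import Path

# Separator line quoted in the TSV description above.
DOCSTART = "-DOCSTART- -X- O O"


def to_iob(tokens, spans):
    """Tag tokens with IOB labels.

    spans is a list of (start, end_exclusive, label) token index ranges.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return list(zip(tokens, tags))


def render_tsv(documents):
    """Render tagged documents as text, one DOCSTART line per document.

    Assumption: each token line is "token<TAB>tag".
    """
    lines = []
    for tagged in documents:
        lines.append(DOCSTART)
        lines.extend(f"{token}\t{tag}" for token, tag in tagged)
    return "\n".join(lines) + "\n"


def reencode_csv_to_utf8(src, dst, source_encoding="cp1252"):
    """Rewrite a CSV file as UTF-8, leaving its line endings untouched."""
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


def bundle_zip(paths, zip_path):
    """Pack files into a flat ZIP archive for bulk upload."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            archive.write(path, arcname=Path(path).name)
```

For example, to_iob(["Acme", "Corp", "hired", "Jane"], [(0, 2, "ORG"), (3, 4, "PER")]) tags the first two tokens B-ORG and I-ORG, the third O, and the last B-PER. Check the resulting file against the format example provided by the platform before uploading.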

Once you select your documents, you will then be prompted to choose between 3 pre-annotation options:

  • Dictionary: Auto-annotate the uploaded document(s) using the project's dictionary (not applicable for JSON and TSV upload)

  • Models: Auto-annotate the uploaded document(s) using your trained ML model

  • No pre-annotation: No auto-annotation

You also have the option to remove duplicate documents during upload by checking "remove duplicate documents".
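
This page does not describe how the platform decides that two documents are duplicates. If you want to check a batch locally before uploading, the following Python sketch is one option. It assumes that "duplicate" means byte-identical file content, which may differ from the platform's own rule, and the function name find_duplicates is hypothetical.

```python
import hashlib
from pathlib import Path


def find_duplicates(paths):
    """Return (duplicate_path, first_seen_path) pairs for byte-identical files."""
    first_seen = {}   # content digest -> first path with that content
    duplicates = []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest in first_seen:
            duplicates.append((str(path), first_seen[digest]))
        else:
            first_seen[digest] = str(path)
    return duplicates
```

Files that differ only in whitespace or encoding are not detected by this check, because it compares raw bytes.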

For native PDF, JPG and PNG uploads that use OCR, you can choose among the following OCR engine options:

  • Default: Engine will be chosen based on the language of the project

  • OCR 1: Engine based on AWS Textract

  • OCR 2: Engine based on Google Vision API

  • OCR 3: Engine based on Microsoft Azure OCR, includes the option to automatically parse tables
