Building a dataset

To train Typless models for data extraction, you need to build a dataset of documents for each document type.

Using your existing data

📘 Use it for pre-training

You can use existing data to achieve state-of-the-art accuracy for your data extraction.

  1. Use all data from documents that have already been manually processed and stored in the database to build a dataset for your document type.

  2. Upload each original file together with its correct values from your database to train Typless before going to production.

  3. Use the code to start:

1 Open file as base64 string (Lines 4-6)

Make sure you are pointing to the right path when opening the file.

2 Create payload (Lines 8-64)

The payload consists of learning fields, line items, file name, and a base64 string-encoded file.

3 Specify values for learning fields (Lines 12, 16, 20, etc.)

For every field you have defined in your document type, write the correct value.

4 Specify values for line items (Lines 39-60)

For every line-item row, add an array of line-item fields with the correct values.

5 Add file info (Lines 61-62)

Add the base64-encoded file and the file name.

6 Specify document type name (Line 63)

7 Authorize with API key (Line 72)

Authorize with your API key, prefixed with the word Token.

8 Execute the request (Lines 75-77)

Execute the request and check the response to make sure that everything went smoothly.

import requests
import base64

file_name = 'name_of_your_document.pdf'
with open(file_name, 'rb') as file:
    base64_data = base64.b64encode(file.read()).decode('utf-8')

payload = {
    "learning_fields": [
        {
            "name": "supplier_name",
            "value": "Amazing Company"
        },
        {
            "name": "receiver_name",
            "value": "Amazing Client"
        },
        {
            "name": "invoice_number",
            "value": "3"
        },
        {
            "name": "purchase_order_number",
            "value": "234778"
        },
        {
            "name": "pay_due_date",
            "value": "2021-03-31"
        },
        {
            "name": "issue_date",
            "value": "2021-02-01"
        },
        {
            "name": "total_amount",
            "value": "15.0000"
        }
    ],
    "line_items": [
        [
            {
                "name": "product_number",
                "value": ""
            },
            {
                "name": "product_description",
                "value": "Amazing service"
            },
            {
                "name": "quantity",
                "value": "1"
            },
            {
                "name": "price",
                "value": "15.0000"
            }

        ]

    ],
    "file": base64_data,
    "file_name": file_name,
    "document_type_name": "line-item-invoice"
}


url = "https://developers.typless.com/api/add-document"

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "<<apiKey>>"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.json())

Response:

{
	"details":[
		"0cb9660762f20e13850d36cd45b48d44b63059f7"
	],
	"message":"Document added successfully."
}
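When the correct values already live in your database, the payload can be assembled programmatically instead of by hand. The following is a minimal sketch, assuming each stored record is available as a plain dictionary of field values plus a list of line-item rows. The helper names `to_name_value_list` and `build_add_document_payload` are hypothetical and not part of the Typless API; only the payload keys shown in the example above are used, and values are passed through unchanged, so format them as strings first.

```python
import base64


def to_name_value_list(values):
    """Convert {"field": "value"} into the [{"name": ..., "value": ...}] shape."""
    return [{"name": name, "value": value} for name, value in values.items()]


def build_add_document_payload(file_bytes, file_name, fields, line_item_rows,
                               document_type_name):
    """Assemble an add-document payload from values you already store.

    Assumption: `fields` and each entry of `line_item_rows` are plain dicts
    keyed by the field names defined in your document type.
    """
    return {
        "learning_fields": to_name_value_list(fields),
        "line_items": [to_name_value_list(row) for row in line_item_rows],
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "file_name": file_name,
        "document_type_name": document_type_name,
    }


# In-memory bytes stand in for a real PDF read from disk.
payload = build_add_document_payload(
    file_bytes=b"%PDF-1.4 example",
    file_name="name_of_your_document.pdf",
    fields={"supplier_name": "Amazing Company", "invoice_number": "3"},
    line_item_rows=[{"product_description": "Amazing service", "quantity": "1"}],
    document_type_name="line-item-invoice",
)
```

The resulting dictionary can be sent with the same POST request as in the example above, once per stored document.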

Using live data

📘 Use it in a live environment

Using live data allows you to improve your data extraction continuously and automate new suppliers on the fly.

Typless improves continuously through a closed feedback loop in which you send back the correct values for each extracted document. Check out the example below.

1 Create payload (Lines 5-58)

Create payload with the following parameters:

  • learning_fields

  • line_items

  • document_object_id

  • document_type_name

2 Create fields feedback data (Lines 6-35)

Set the correct data values for all the defined fields that are on the document.

3 Create line items feedback data (Lines 36-55)

Add all the line items with correct data values that are on the document.

4 Set document object id (Line 56)

Set document_object_id to the value returned in the object_id key of the extraction response. Read more about the object id here.

5 Document type name (Line 57)

Set the document type name you are providing feedback for.

6 Specify headers (Lines 59-63)

Set the correct headers; make sure the Content-Type is application/json. In the Authorization header, put your API key prefixed with the word Token.

7 Execute the request (Lines 65-67)

Send the POST request with the set payload, headers, and URL.

import requests

url = 'https://developers.typless.com/api/add-document-feedback'

payload = {
    "learning_fields": [
        {
            "name": "supplier_name",
            "value": "Amazing Company"
        },
        {
            "name": "receiver_name",
            "value": "Another Amazing Client"
        },
        {
            "name": "invoice_number",
            "value": "350"
        },
        {
            "name": "purchase_order_number",
            "value": "345677"
        },
        {
            "name": "pay_due_date",
            "value": "2021-02-28"
        },
        {
            "name": "issue_date",
            "value": "2021-01-01"
        },
        {
            "name": "total_amount",
            "value": "259.0000"
        }
    ],
    "line_items": [
        [
            {
                "name": "product_number",
                "value": ""
            },
            {
                "name": "product_description",
                "value": "Amazing service"
            },
            {
                "name": "quantity",
                "value": "1"
            },
            {
                "name": "price",
                "value": "259.0000"
            }
        ]
    ],
    "document_object_id": "ID-FROM-EXTRACTION-RESPONSE",  # object_id from the extraction response
    "document_type_name": "line-item-invoice"
}
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "<<apiKey>>"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.json())

Response:

{
	"details":[
		"0cb96695b4c677c1d6c5562d523aa9541cb5dda8"
	],
	"message":"Values added successfully."
}
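In a live integration, the feedback payload is usually built from the extraction response and the values an operator has confirmed. The sketch below illustrates one way to do that, assuming the extraction response has been parsed into a dictionary that exposes `object_id` at the top level and that corrected values are kept in plain dictionaries. The helpers `to_name_value_list` and `build_feedback_payload` are hypothetical names, not part of the Typless API.

```python
def to_name_value_list(values):
    """Convert {"field": "value"} into the [{"name": ..., "value": ...}] shape."""
    return [{"name": name, "value": value} for name, value in values.items()]


def build_feedback_payload(extraction_response, corrected_fields,
                           corrected_line_items, document_type_name):
    """Build an add-document-feedback payload for one extracted document.

    Assumption: `extraction_response` is the parsed JSON of the extraction
    call and carries the document id under the "object_id" key.
    """
    return {
        "learning_fields": to_name_value_list(corrected_fields),
        "line_items": [to_name_value_list(row) for row in corrected_line_items],
        "document_object_id": extraction_response["object_id"],
        "document_type_name": document_type_name,
    }


# Stand-in for a parsed extraction response; only object_id is read.
extraction_response = {"object_id": "ID-FROM-EXTRACTION-RESPONSE"}

feedback = build_feedback_payload(
    extraction_response,
    corrected_fields={"supplier_name": "Amazing Company", "total_amount": "259.0000"},
    corrected_line_items=[{"product_description": "Amazing service", "price": "259.0000"}],
    document_type_name="line-item-invoice",
)
```

Keeping the builder separate from the HTTP call makes it easy to verify the payload before it is sent as the JSON body of the feedback request.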

Using a training room

For smaller volumes of documents and for testing purposes, you can use the training room. In the training room, you can train documents for your document type and perform test extractions to quickly see results. Each document type has its own training room. Data that you confirm here as correct will be used to train your document type.
