Custom document

Extracting metadata and line items from any custom document with examples in Python and Node.

This general guide covers how to extract data from any pseudo-structured document, with examples in Python and Node. You will learn how to train Typless and extract data from documents in many languages and character sets.

You will need:

  • Two different examples of a document

  • Correct values for at least one document

  • 15 minutes of your time

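This guide shows you how to create a document type, build its dataset, run training, extract data, and improve models with feedback.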

Getting your API Key

The Authorization header value for your API key is: Token YOUR-API-KEY (log in if you do not see one). You can also find the API key on the Settings page.
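A minimal sketch of building that header in Python, assuming the key is kept in an environment variable. The variable name TYPLESS_API_KEY and the helper build_headers are illustrative choices, not part of the Typless API.

```python
import os


def build_headers(api_key: str) -> dict:
    """Return request headers with the key prefixed by the word Token."""
    if not api_key:
        raise ValueError("API key is empty")
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Token {api_key}",
    }


# TYPLESS_API_KEY is a hypothetical variable name; use whatever your deployment provides.
headers = build_headers(os.environ.get("TYPLESS_API_KEY", "YOUR-API-KEY"))
```

Keeping the key out of the source file makes it harder to commit it by accident.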

1. Create a new document type

Before you start extracting data, you need to define a document type. Navigate to the Dashboard page and click on the New document type button in the top right corner of the table. Next, select the Custom document card.

A document type is used for all your suppliers. Click on Document type to learn more.

You will have to define all metadata fields and line item fields you want to extract. The only exception is supplier_name, which must be present on each document type.

To ensure consistent training and data extraction, Typless uses 3 field data types:

Field type and what it is used for:

  • STRING: General string fields like document numbers, addresses, company names, payment references, IBANs, ...

  • DATE: Dates like issue date, pay due date, date of service, delivery date, contract date, ...

  • NUMBER: Numbers you want to perform calculations with, like total amount, net amount, ...
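The request examples later in this guide send every value as a string: dates as YYYY-MM-DD and amounts with four decimal places. Below is a small sketch of formatting helpers for that, assuming those are the shapes you want to send. The helper names are illustrative, and the fields guide remains the authority on accepted formats.

```python
from datetime import date
from decimal import Decimal


def format_date(value: date) -> str:
    """Format a date as YYYY-MM-DD, like "2021-03-31" in the examples."""
    return value.isoformat()


def format_number(value) -> str:
    """Format a number with four decimals, like "15.0000" in the examples."""
    # Going through str() avoids binary floating-point noise for float inputs.
    return f"{Decimal(str(value)):.4f}"


print(format_date(date(2021, 3, 31)))  # 2021-03-31
print(format_number(15))               # 15.0000
```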

Want to learn more about defining fields? Check out the fields or line items guide.

2. Add suppliers

Once your document type is created, you need to add data to the dataset of your document type.

To add a document to the dataset, use the add-document endpoint or the training room, where you can easily upload a file and fill out the necessary information.

The dataset is created by uploading an original file with the correct value for each field defined inside the document type:

1 Open file as base64 string (Lines 4-6)

Make sure you are pointing to the right path when opening the file.

2 Create payload (Lines 8-73)

The payload consists of learning fields, line items, file name, and a base64 string-encoded file.

3 Specify values for learning fields (Lines 9-60)

For every field you have defined in your document type, write the correct value.

4 Specify values for line items (Lines 47, 51, 55)

For every line-item row, add an array of line item fields with the correct values.

5 Add file info (Lines 61-62)

Add the file as a base64 string and the file name.

6 Specify document type name (Line 63)

7 Authorize with API key (Line 72)

Authorize with your API key, prepended with the word Token.

8 Execute the request (Lines 75-77)

Execute the request and make sure that everything went smoothly.

import requests
import base64

file_name = 'name_of_your_document.pdf'
with open(file_name, 'rb') as file:
    base64_data = base64.b64encode(file.read()).decode('utf-8')

payload = {
    "learning_fields": [
        {
            "name": "supplier_name",
            "value": "Amazing Company"
        },
        {
            "name": "receiver_name",
            "value": "Amazing Client"
        },
        {
            "name": "invoice_number",
            "value": "3"
        },
        {
            "name": "purchase_order_number",
            "value": "234778"
        },
        {
            "name": "pay_due_date",
            "value": "2021-03-31"
        },
        {
            "name": "issue_date",
            "value": "2021-02-01"
        },
        {
            "name": "total_amount",
            "value": "15.0000"
        }
    ],
    "line_items": [
        [
            {
                "name": "product_number",
                "value": ""
            },
            {
                "name": "product_description",
                "value": "Amazing service"
            },
            {
                "name": "quantity",
                "value": "1"
            },
            {
                "name": "price",
                "value": "15.0000"
            }

        ]

    ],
    "file": base64_data,
    "file_name": file_name,
    "document_type_name": "line-item-invoice"
}


url = "https://developers.typless.com/api/add-document"

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Token YOUR-API-KEY"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.json())

Response:

{
	"details":[
		"0cb9660762f20e13850d36cd45b48d44b63059f7"
	],
	"message":"Document added successfully."
}

As you can see, to achieve high accuracy, Typless only requires the values that are present in the document. However, there are some rules to keep in mind when providing values.
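The name/value lists in the payload can be generated from plain dictionaries instead of being written by hand. Here is a sketch, with to_fields and build_add_document_payload as hypothetical helper names. The payload keys mirror the request above.

```python
import base64


def to_fields(values: dict) -> list:
    """Turn {"invoice_number": "3"} into [{"name": "invoice_number", "value": "3"}]."""
    return [{"name": name, "value": str(value)} for name, value in values.items()]


def build_add_document_payload(file_bytes, file_name, document_type_name, fields, rows):
    """Assemble the add-document payload from raw file bytes and plain dicts."""
    return {
        "learning_fields": to_fields(fields),
        # One inner list per line-item row.
        "line_items": [to_fields(row) for row in rows],
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "file_name": file_name,
        "document_type_name": document_type_name,
    }


payload = build_add_document_payload(
    b"%PDF-1.4 placeholder bytes",  # in practice: the bytes read from your file
    "name_of_your_document.pdf",
    "line-item-invoice",
    {"supplier_name": "Amazing Company", "invoice_number": "3", "total_amount": "15.0000"},
    [{"product_description": "Amazing service", "quantity": "1", "price": "15.0000"}],
)
```

The resulting dictionary can be passed as the json argument of the POST request shown earlier.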

Want to learn more about providing training values? Check out the fields or line items guide.

3. Execute training

To see the results immediately, you can trigger the training process on the Dashboard page. Look for your document type in the list and start the training from its row.

Need more information about training? You can read more about it here.

4. Extract data from documents

After the training is finished, you can start extracting data accurately from the documents of trained suppliers.

To extract data from a document, use the extract-data endpoint.

1 Open file as base64 string (Lines 4-6)

Open the file in binary mode and encode it as a base64 string. Make sure that your file is in the same directory as the script.

2 Create payload (Lines 8-12)

Create request payload with all the required parameters:

  • file

  • file_name

  • document_type_name

3 Specify headers (Lines 16-20)

Make sure that the Content-Type is set to application/json.

4 Authorize with your API key (Line 19)

You can get your API key at https://app.typless.com/settings/profile.

5 Execute the request (Line 22)

Send the request and wait for the response.

import requests
import base64

file_name = 'name_of_your_document.pdf'
with open(file_name, 'rb') as file:
    base64_data = base64.b64encode(file.read()).decode('utf-8')

payload = {
    "file": base64_data,
    "file_name": file_name,
    "document_type_name": "line-item-invoice"
}

url = "https://developers.typless.com/api/extract-data"

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Token YOUR-API-KEY"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.json())

Response:

{
    "file_name": "name_of_your_document.pdf",
    "object_id": "1cb25cc8-c9fa-4149-9a83-b4ed6a2173b9",
    "extracted_fields": [
        {
            "name": "supplier",
            "values": [
                {
                    "x": -1,
                    "y": -1,
                    "width": -1,
                    "height": -1,
                    "value": "ScaleGrid",
                    "confidence_score": "0.968",
                    "page_number": -1
                }
            ],
            "data_type": "AUTHOR"
        },
        {
            "name": "invoice_number",
            "values": [
                {
                    "x": 1989,
                    "y": 545,
                    "width": 323,
                    "height": 54,
                    "value": "20190500005890",
                    "confidence_score": "0.250",
                    "page_number": 0
                },
                {
                    "x": 167,
                    "y": 574,
                    "width": 391,
                    "height": 54,
                    "value": "GB123456789",
                    "confidence_score": "0.250",
                    "page_number": 0
                }
            ],
            "data_type": "STRING"
        },
        {
            "name": "issue_date",
            "values": [
                {
                    "x": 2072,
                    "y": 628,
                    "width": 240,
                    "height": 54,
                    "value": "2019-06-05",
                    "confidence_score": "0.358",
                    "page_number": 0
                }
            ],
            "data_type": "DATE"
        },
        {
            "name": "total_amount",
            "values": [
                {
                    "x": 2146,
                    "y": 1196,
                    "width": 126,
                    "height": 54,
                    "value": "47.5300",
                    "confidence_score": "0.990",
                    "page_number": 0
                }
            ],
            "data_type": "NUMBER"
        }
    ],
    "line_items": [
        [
            {
                "name": "Description",
                "values": [
                    {
                        "x": 208,
                        "y": 1196,
                        "width": 1022,
                        "height": 50,
                        "value": "5/2019-MongoBackend-MgmtStandalone-Small-744 hours",
                        "confidence_score": "0.661",
                        "page_number": 0
                    }
                ],
                "data_type": "STRING"
            },
            {
                "name": "Price",
                "values": [
                    {
                        "x": 2146,
                        "y": 1196,
                        "width": 126,
                        "height": 54,
                        "value": "47.5300",
                        "confidence_score": "0.582",
                        "page_number": 0
                    }
                ],
                "data_type": "NUMBER"
            },
            {
                "name": "Quantity",
                "values": [
                    {
                        "x": 1979,
                        "y": 1196,
                        "width": 23,
                        "height": 54,
                        "value": "1",
                        "confidence_score": "0.647",
                        "page_number": 0
                    }
                ],
                "data_type": "NUMBER"
            }
        ]
    ],
    "customer": null
}

Need a more in-depth explanation of the response? You can read about it here.
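Each extracted field carries a list of candidate values, and confidence_score arrives as a string. The sketch below picks the highest-scoring candidate per field and flags low-confidence ones for manual review. The 0.5 threshold and the function name are arbitrary illustrations, not Typless recommendations.

```python
def best_values(response: dict, threshold: float = 0.5) -> dict:
    """Map each field name to its top candidate and a needs_review flag."""
    result = {}
    for field in response.get("extracted_fields", []):
        candidates = field.get("values", [])
        if not candidates:
            continue
        # confidence_score is a string such as "0.968", so convert before comparing.
        top = max(candidates, key=lambda c: float(c["confidence_score"]))
        score = float(top["confidence_score"])
        result[field["name"]] = {
            "value": top["value"],
            "confidence": score,
            "needs_review": score < threshold,
        }
    return result


# A trimmed-down version of the response above, used as sample input.
sample = {
    "extracted_fields": [
        {"name": "total_amount", "data_type": "NUMBER",
         "values": [{"value": "47.5300", "confidence_score": "0.990"}]},
        {"name": "issue_date", "data_type": "DATE",
         "values": [{"value": "2019-06-05", "confidence_score": "0.358"}]},
    ]
}
summary = best_values(sample)
print(summary["total_amount"]["needs_review"])  # False
print(summary["issue_date"]["needs_review"])    # True
```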

5. Continuously improve models

Typless embraces the fact that the world is changing all the time. That's why you can improve models on the fly by providing correct data after extraction. Let's say your company has a new partner, Best Supplier. You don't need to start over with building the dataset. You can simply extract the data and send the correct values once your users have verified them. You can learn more about providing feedback on the building dataset page.

Add a supplier with feedback:

1 Create payload (Lines 5-58)

Create payload with the following parameters:

  • learning_fields

  • line_items

  • document_object_id

  • document_type_name

2 Create fields feedback data (Lines 5-35)

Set the correct values for all the defined fields that are on the document.

3 Create line items feedback data (Lines 36-55)

4 Set document object id (Line 56)

Set document_object_id to the value of the object_id key from the extraction response. Read more about the object id here.

5 Document type name (Line 57)

Set the name of the document type you are providing feedback for.

6 Specify headers (Lines 59-62)

Set the correct headers and make sure that the Content-Type is application/json. In the Authorization header, put your API key prepended with the word Token.

7 Execute the request (Lines 65-67)

Send the POST request with the set payload, headers, and URL.

import requests

url = 'https://developers.typless.com/api/add-document-feedback'

payload = {
  "learning_fields": [
        {
            "name": "supplier_name",
            "value": "Amazing Company"
        },
        {
            "name": "receiver_name",
            "value": "Another Amazing Client"
        },
        {
            "name": "invoice_number",
            "value": "350"
        },
        {
            "name": "purchase_order_number",
            "value": "345677"
        },
        {
            "name": "pay_due_date",
            "value": "2021-02-28"
        },
        {
            "name": "issue_date",
            "value": "2021-01-01"
        },
        {
            "name": "total_amount",
            "value": "259.0000"
        }
  ],
  "line_items": [
        [
            {
                "name": "product_number",
                "value": ""
            },
            {
                "name": "product_description",
                "value": "Amazing service"
            },
            {
                "name": "quantity",
                "value": "1"
            },
            {
                "name": "price",
                "value": "259.0000"
            }
        ]
  ],
  "document_object_id": "ID-FROM-EXTRACTION-RESPONSE",
  "document_type_name": "line-item-invoice"
}
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Token YOUR-API-KEY"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.json())

Response:

{
	"details":[
		"0cb96695b4c677c1d6c5562d523aa9541cb5dda8"
	],
	"message":"Values added successfully."
}

To send feedback, use the add-document-feedback endpoint with the object_id from the extraction response.
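The link between the two calls is the object_id. As a sketch, the feedback payload could be derived directly from a stored extraction response plus the values your users verified. The function name and argument shapes are assumptions made for illustration.

```python
def build_feedback_payload(extraction: dict, document_type_name: str,
                           fields: dict, rows: list) -> dict:
    """Assemble an add-document-feedback payload for an already extracted document."""
    def to_fields(values: dict) -> list:
        return [{"name": name, "value": str(value)} for name, value in values.items()]

    return {
        "learning_fields": to_fields(fields),
        "line_items": [to_fields(row) for row in rows],
        # object_id from the extraction response identifies the document.
        "document_object_id": extraction["object_id"],
        "document_type_name": document_type_name,
    }


# Sample input: only the object_id of the extraction response is needed here.
extraction = {"object_id": "1cb25cc8-c9fa-4149-9a83-b4ed6a2173b9"}
feedback_payload = build_feedback_payload(
    extraction,
    "line-item-invoice",
    {"supplier_name": "Amazing Company", "invoice_number": "350"},
    [{"product_description": "Amazing service", "quantity": "1", "price": "259.0000"}],
)
```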

Running Typless live

The only thing you need to do to automate your manual data entry is to integrate these simple API calls into your system.
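One possible shape for such an integration is sketched below: extract, let a person verify, then send the verified values back as feedback. The post and review callables are hypothetical hooks supplied by your own system, so the sketch makes no assumption about your HTTP client or review interface.

```python
import base64

API_BASE = "https://developers.typless.com/api"


def process_document(file_bytes, file_name, document_type_name, post, review):
    """Extract data, let a reviewer correct it, then send the result as feedback.

    post(url, payload) -> dict            sends a JSON POST, returns the parsed response
    review(extraction) -> (fields, rows)  returns verified name/value lists
    Both callables are supplied by the caller and are hypothetical hooks.
    """
    extraction = post(f"{API_BASE}/extract-data", {
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "file_name": file_name,
        "document_type_name": document_type_name,
    })
    fields, rows = review(extraction)
    feedback = post(f"{API_BASE}/add-document-feedback", {
        "learning_fields": fields,
        "line_items": rows,
        "document_object_id": extraction["object_id"],
        "document_type_name": document_type_name,
    })
    return extraction, feedback
```

In practice, post could wrap the same requests call used in the examples above, with the Authorization header set, and review could be whatever screen your users already use to confirm invoice data.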

Have any questions or need some help? Email us at support@typless.com.
