Custom document
Extracting metadata and line items from any custom document with examples in Python and Node.
This general guide covers how to extract data from any semi-structured document, with examples in Python and Node. You will learn how to train Typless and extract data from a variety of documents in many languages and character sets.
You will need:
Two different examples of a document
Correct values for at least one document
15 minutes of your time
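This guide shows you how to create a document type, add suppliers, execute training, extract data from documents, and continuously improve models.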
Getting your API Key
Send your API key in the Authorization header in this form: Token YOUR-API-KEY
(Log in if you do not see one.)
You can also obtain the API key by visiting the Settings page.
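As a minimal sketch of the header format described above, the helper below builds the headers used by the requests in this guide. The function name build_headers and the environment variable TYPLESS_API_KEY are assumptions made for illustration, not part of the Typless API.

```python
import os


def build_headers(api_key: str) -> dict:
    """Return the HTTP headers used by the requests in this guide."""
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        # The key is sent in the form "Token YOUR-API-KEY"
        "Authorization": f"Token {api_key}",
    }


# TYPLESS_API_KEY is a hypothetical environment variable name;
# the fallback is the placeholder used in this guide.
headers = build_headers(os.environ.get("TYPLESS_API_KEY", "YOUR-API-KEY"))
```

Keeping the key out of the source code, for example in an environment variable, avoids committing it to version control by accident.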
1. Create a new document type
Before you start extracting data, you need to define a document type. Navigate to the Dashboard page and click on the New document type button in the top right corner of the table. Next, select the Custom document card.
You will have to define all the metadata fields and line item fields you want to extract. The only exception is supplier_name, which must be present on each document type.
To ensure consistent training and data extraction, Typless uses 3 field data types:
STRING
General string fields like document numbers, address, company names, payment references, IBANs, ...
DATE
Dates like issue date, pay due, date of service, delivery date, contract date, ...
NUMBER
Numbers you want to perform calculations with like total amount, net amount, ...
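The three data types can be checked locally before values are sent as training data. The sketch below is an illustration only: the function is_valid_value is hypothetical, and the accepted formats (YYYY-MM-DD for dates, a decimal string for numbers, an empty string for a missing value) are assumptions inferred from the example payloads in this guide.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def is_valid_value(data_type: str, value: str) -> bool:
    """Check a field value against the assumed format for its data type."""
    if value == "":
        # Assumption: an empty string marks a value that is not on the document
        return True
    if data_type == "STRING":
        return True
    if data_type == "DATE":
        try:
            # Assumed date format, as in "2021-03-31"
            datetime.strptime(value, "%Y-%m-%d")
            return True
        except ValueError:
            return False
    if data_type == "NUMBER":
        try:
            # Assumed number format, as in "15.0000"
            return Decimal(value).is_finite()
        except InvalidOperation:
            return False
    raise ValueError(f"Unknown data type: {data_type}")


print(is_valid_value("DATE", "2021-03-31"))
print(is_valid_value("NUMBER", "15,00"))
```

A check like this catches locale-specific input, such as a decimal comma or a day-first date, before it reaches the dataset.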
2. Add suppliers
Once your document type is created, you need to add data to the dataset of your document type.
The dataset is created by uploading an original file with the correct value for each field defined inside the document type:
import requests
import base64
file_name = 'name_of_your_document.pdf'
# Read the original file and encode it as a base64 string
with open(file_name, 'rb') as file:
    base64_data = base64.b64encode(file.read()).decode('utf-8')
payload = {
"learning_fields": [
{
"name": "supplier_name",
"value": "Amazing Company"
},
{
"name": "receiver_name",
"value": "Amazing Client"
},
{
"name": "invoice_number",
"value": "3"
},
{
"name": "purchase_order_number",
"value": "234778"
},
{
"name": "pay_due_date",
"value": "2021-03-31"
},
{
"name": "issue_date",
"value": "2021-02-01"
},
{
"name": "total_amount",
"value": "15.0000"
}
],
"line_items": [
[
{
"name": "product_number",
"value": ""
},
{
"name": "product_description",
"value": "Amazing service"
},
{
"name": "quantity",
"value": "1"
},
{
"name": "price",
"value": "15.0000"
}
]
],
"file": base64_data,
"file_name": file_name,
"document_type_name": "line-item-invoice"
}
url = "https://developers.typless.com/api/add-document"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Token YOUR-API-KEY"
}
response = requests.request("POST", url, json=payload, headers=headers)
print(response.json())
Response:
{
"details":[
"0cb9660762f20e13850d36cd45b48d44b63059f7"
],
"message":"Document added successfully."
}
As you can see, to achieve high accuracy, Typless only requires the values that are present in the document. However, there are some rules to keep in mind when providing values.
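Writing the name/value lists by hand gets repetitive once you add more than a few documents. The sketch below builds the same payload shape from plain dictionaries. The helper names to_fields and build_add_document_payload are hypothetical; only the payload keys are taken from the request shown above.

```python
import base64


def to_fields(values: dict) -> list:
    """Turn {"field": value} into the [{"name": ..., "value": ...}] shape."""
    return [{"name": name, "value": str(value)} for name, value in values.items()]


def build_add_document_payload(file_bytes, file_name, document_type_name,
                               metadata, line_items):
    """Assemble a payload with the same keys as the add-document example."""
    return {
        "learning_fields": to_fields(metadata),
        # Each line item is its own list of name/value pairs
        "line_items": [to_fields(item) for item in line_items],
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "file_name": file_name,
        "document_type_name": document_type_name,
    }


payload = build_add_document_payload(
    file_bytes=b"%PDF-1.4 example bytes",  # stand-in for the real file contents
    file_name="name_of_your_document.pdf",
    document_type_name="line-item-invoice",
    metadata={"supplier_name": "Amazing Company", "total_amount": "15.0000"},
    line_items=[{"product_description": "Amazing service", "quantity": "1"}],
)
```

The resulting dictionary can be passed as the json argument of the POST request in the same way as the hand-written payload.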
3. Execute training
👍 Training is executed automatically every day at 10 PM CET
It runs for all of your suppliers that have new documents in the dataset, across all your document types, free of charge.
To immediately see the results, you can trigger the training process on the Dashboard page.
Look for your document type in the list and click the button that triggers training.
4. Extract data from documents
After the training is finished, you can start precisely extracting data from documents from trained suppliers.
import requests
import base64
file_name = 'name_of_your_document.pdf'
# Read the file and encode it as a base64 string
with open(file_name, 'rb') as file:
    base64_data = base64.b64encode(file.read()).decode('utf-8')
payload = {
"file": base64_data,
"file_name": file_name,
"document_type_name": "line-item-invoice"
}
url = "https://developers.typless.com/api/extract-data"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Token YOUR-API-KEY"
}
response = requests.request("POST", url, json=payload, headers=headers)
print(response.json())
Response:
{
"file_name": "name_of_your_document.pdf",
"object_id": "1cb25cc8-c9fa-4149-9a83-b4ed6a2173b9",
"extracted_fields": [
{
"name": "supplier",
"values": [
{
"x": -1,
"y": -1,
"width": -1,
"height": -1,
"value": "ScaleGrid",
"confidence_score": "0.968",
"page_number": -1
}
],
"data_type": "AUTHOR"
},
{
"name": "invoice_number",
"values": [
{
"x": 1989,
"y": 545,
"width": 323,
"height": 54,
"value": "20190500005890",
"confidence_score": "0.250",
"page_number": 0
},
{
"x": 167,
"y": 574,
"width": 391,
"height": 54,
"value": "GB123456789",
"confidence_score": "0.250",
"page_number": 0
}
],
"data_type": "STRING"
},
{
"name": "issue_date",
"values": [
{
"x": 2072,
"y": 628,
"width": 240,
"height": 54,
"value": "2019-06-05",
"confidence_score": "0.358",
"page_number": 0
}
],
"data_type": "DATE"
},
{
"name": "total_amount",
"values": [
{
"x": 2146,
"y": 1196,
"width": 126,
"height": 54,
"value": "47.5300",
"confidence_score": "0.990",
"page_number": 0
}
],
"data_type": "NUMBER"
}
],
"line_items": [
[
{
"name": "Description",
"values": [
{
"x": 208,
"y": 1196,
"width": 1022,
"height": 50,
"value": "5/2019-MongoBackend-MgmtStandalone-Small-744 hours",
"confidence_score": "0.661",
"page_number": 0
}
],
"data_type": "STRING"
},
{
"name": "Price",
"values": [
{
"x": 2146,
"y": 1196,
"width": 126,
"height": 54,
"value": "47.5300",
"confidence_score": "0.582",
"page_number": 0
}
],
"data_type": "NUMBER"
},
{
"name": "Quantity",
"values": [
{
"x": 1979,
"y": 1196,
"width": 23,
"height": 54,
"value": "1",
"confidence_score": "0.647",
"page_number": 0
}
],
"data_type": "NUMBER"
}
]
],
"customer": null
}
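Each extracted field can carry several candidate values, each with its own confidence score. One possible way to consume the response is to keep the highest-scoring candidate per field and flag low scores for manual review. The sketch below assumes the response shape shown above; the helper names and the 0.5 threshold are illustrative choices, not Typless recommendations.

```python
def best_value(field: dict):
    """Return the candidate with the highest confidence, or None if there is none."""
    candidates = field.get("values") or []
    if not candidates:
        return None
    # confidence_score arrives as a string such as "0.968"
    return max(candidates, key=lambda c: float(c["confidence_score"]))


def summarize(response: dict, min_confidence: float = 0.5) -> dict:
    """Map each field name to its best value and a needs_review flag."""
    summary = {}
    for field in response.get("extracted_fields", []):
        best = best_value(field)
        if best is None:
            summary[field["name"]] = {"value": None, "confidence": 0.0,
                                      "needs_review": True}
            continue
        confidence = float(best["confidence_score"])
        summary[field["name"]] = {
            "value": best["value"],
            "confidence": confidence,
            # min_confidence is an arbitrary example threshold
            "needs_review": confidence < min_confidence,
        }
    return summary


# Abbreviated version of the response shown above
sample = {
    "extracted_fields": [
        {"name": "invoice_number", "data_type": "STRING", "values": [
            {"value": "20190500005890", "confidence_score": "0.250"},
            {"value": "GB123456789", "confidence_score": "0.250"},
        ]},
        {"name": "total_amount", "data_type": "NUMBER", "values": [
            {"value": "47.5300", "confidence_score": "0.990"},
        ]},
    ]
}
summary = summarize(sample)
```

With this sample, total_amount passes the threshold while invoice_number is flagged for review, which is a natural point to ask a user to confirm the value.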
5. Continuously improve models
Typless embraces the fact that the world changes all the time. That's why you can improve models on the fly by providing correct data after extraction. Let's say your company has a new partner, Best Supplier. You don't need to start building the dataset over. You can simply extract the data and send back the correct values once your users have verified them. You can learn more about providing feedback on the building dataset page.
Add a supplier with feedback:
import requests
url = 'https://developers.typless.com/api/add-document-feedback'
payload = {
"learning_fields": [
{
"name": "supplier_name",
"value": "Amazing Company"
},
{
"name": "receiver_name",
"value": "Another Amazing Client"
},
{
"name": "invoice_number",
"value": "350"
},
{
"name": "purchase_order_number",
"value": "345677"
},
{
"name": "pay_due_date",
"value": "2021-02-28"
},
{
"name": "issue_date",
"value": "2021-01-01"
},
{
"name": "total_amount",
"value": "259.0000"
}
],
"line_items": [
[
{
"name": "product_number",
"value": ""
},
{
"name": "product_description",
"value": "Amazing service"
},
{
"name": "quantity",
"value": "1"
},
{
"name": "price",
"value": "259.0000"
}
]
],
# ID from the extraction response (its object_id)
"document_object_id": "ID-FROM-EXTRACTION-RESPONSE",
"document_type_name": "line-item-invoice"
}
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Token YOUR-API-KEY"
}
response = requests.request("POST", url, json=payload, headers=headers)
print(response.json())
Response:
{
"details":[
"0cb96695b4c677c1d6c5562d523aa9541cb5dda8"
],
"message":"Values added successfully."
}
📘 Closed workflow loop - improve models live!
Use every action from your users to adapt and improve Typless models without any extra costs.
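To close the loop, the feedback payload can be assembled from the extraction response and the values your users confirmed. The sketch below assumes that the object_id of the extraction response is the ID expected in document_object_id; the function build_feedback_payload is hypothetical.

```python
def build_feedback_payload(extraction_response: dict, document_type_name: str,
                           verified_fields: dict, verified_line_items: list) -> dict:
    """Build a feedback payload from an extraction response and verified values."""
    def to_fields(values: dict) -> list:
        return [{"name": name, "value": str(value)} for name, value in values.items()]

    return {
        "learning_fields": to_fields(verified_fields),
        "line_items": [to_fields(item) for item in verified_line_items],
        # Assumption: object_id from extraction is the ID that feedback expects
        "document_object_id": extraction_response["object_id"],
        "document_type_name": document_type_name,
    }


feedback_payload = build_feedback_payload(
    extraction_response={"object_id": "1cb25cc8-c9fa-4149-9a83-b4ed6a2173b9"},
    document_type_name="line-item-invoice",
    verified_fields={"supplier_name": "Amazing Company", "invoice_number": "350"},
    verified_line_items=[{"product_description": "Amazing service", "price": "259.0000"}],
)
```

Storing the object_id next to the document in your own system makes it possible to send feedback later, after a user has reviewed the values.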
Running Typless live
To automate your manual data entry, all you need to do is integrate these simple API calls into your system.
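One way such an integration might look is a single function that extracts data, lets a user verify it, and sends the verified values back as feedback. This is a sketch under stated assumptions: process_document, post, and review are hypothetical names, and the HTTP call is injected so the flow can be tried without network access.

```python
import base64

BASE_URL = "https://developers.typless.com/api"


def process_document(post, review, file_bytes, file_name, document_type_name):
    """Extract data, have it verified, then send the verified values as feedback.

    post(url, payload) -> dict sends a JSON POST and returns the parsed response.
    review(extraction) -> dict returns {"learning_fields": [...], "line_items": [...]}.
    """
    extraction = post(f"{BASE_URL}/extract-data", {
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "file_name": file_name,
        "document_type_name": document_type_name,
    })
    verified = review(extraction)
    feedback = post(f"{BASE_URL}/add-document-feedback", {
        "learning_fields": verified["learning_fields"],
        "line_items": verified["line_items"],
        # Assumption: object_id from extraction identifies the document
        "document_object_id": extraction["object_id"],
        "document_type_name": document_type_name,
    })
    return extraction, feedback


# With the requests library, post could be written as:
# def post(url, payload):
#     response = requests.post(url, json=payload, headers=headers, timeout=30)
#     response.raise_for_status()
#     return response.json()
```

Passing post and review as arguments keeps the workflow independent of the HTTP library and of the review interface in your system.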