Building a dataset
To train Typless models for data extraction, you first need to build a dataset of documents for each document type.
Using your existing data
Use all data from documents that have already been manually processed and stored in the database to build a dataset for your document type.
For each document, upload the original file together with the correct values from your database, so that Typless is trained before you go to production.
Use the following code to get started:
import requests
import base64
file_name = 'name_of_your_document.pdf'
with open(file_name, 'rb') as file:
    base64_data = base64.b64encode(file.read()).decode('utf-8')
payload = {
"learning_fields": [
{
"name": "supplier_name",
"value": "Amazing Company"
},
{
"name": "receiver_name",
"value": "Amazing Client"
},
{
"name": "invoice_number",
"value": "3"
},
{
"name": "purchase_order_number",
"value": "234778"
},
{
"name": "pay_due_date",
"value": "2021-03-31"
},
{
"name": "issue_date",
"value": "2021-02-01"
},
{
"name": "total_amount",
"value": "15.0000"
}
],
"line_items": [
[
{
"name": "product_number",
"value": ""
},
{
"name": "product_description",
"value": "Amazing service"
},
{
"name": "quantity",
"value": "1"
},
{
"name": "price",
"value": "15.0000"
}
]
],
"file": base64_data,
"file_name": file_name,
"document_type_name": "line-item-invoice"
}
url = "https://developers.typless.com/api/add-document"
headers = {
"Accept": "application/json",
"Content-Type": "application/json",
"Authorization": "<<apiKey>>"
}
response = requests.request("POST", url, json=payload, headers=headers)
print(response.json())
Response:
{
"details":[
"0cb9660762f20e13850d36cd45b48d44b63059f7"
],
"message":"Document added successfully."
}
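When the correct values already live in your own database, it can help to generate the payload from a record instead of writing it by hand. The following is a minimal sketch, not part of the Typless API: the helper names `to_fields` and `build_add_document_payload`, the sample record, and the placeholder file bytes are all hypothetical. It only reproduces the payload shape shown in the example above.

```python
import base64


def to_fields(values):
    """Turn a {name: value} mapping into the list-of-dicts shape used above.

    Values are converted to strings; None becomes an empty string.
    """
    return [
        {"name": name, "value": "" if value is None else str(value)}
        for name, value in values.items()
    ]


def build_add_document_payload(file_bytes, file_name, document_type_name,
                               fields, line_items):
    """Assemble a payload in the same shape as the add-document example."""
    return {
        "learning_fields": to_fields(fields),
        "line_items": [to_fields(item) for item in line_items],
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "file_name": file_name,
        "document_type_name": document_type_name,
    }


# Hypothetical record, as it might come out of your own database.
record = {
    "supplier_name": "Amazing Company",
    "invoice_number": 3,
    "total_amount": "15.0000",
}
items = [
    {
        "product_number": None,
        "product_description": "Amazing service",
        "quantity": 1,
        "price": "15.0000",
    }
]

payload = build_add_document_payload(
    b"placeholder file bytes",  # in practice: the bytes of the original file
    "name_of_your_document.pdf",
    "line-item-invoice",
    record,
    items,
)
```

The resulting `payload` can then be posted exactly as in the example above. Looping over your stored records with a helper like this is one way to upload a whole historical dataset.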
Using live data
Typless continuously improves with a closed feedback loop where you provide correct values for the extracted document. Check out the example below.
import requests
url = 'https://developers.typless.com/api/add-document-feedback'
payload = {
"learning_fields": [
{
"name": "supplier_name",
"value": "Amazing Company"
},
{
"name": "receiver_name",
"value": "Another Amazing Client"
},
{
"name": "invoice_number",
"value": "350"
},
{
"name": "purchase_order_number",
"value": "345677"
},
{
"name": "pay_due_date",
"value": "2021-02-28"
},
{
"name": "issue_date",
"value": "2021-01-01"
},
{
"name": "total_amount",
"value": "259.0000"
}
],
"line_items": [
[
{
"name": "product_number",
"value": ""
},
{
"name": "product_description",
"value": "Amazing service"
},
{
"name": "quantity",
"value": "1"
},
{
"name": "price",
"value": "259.0000"
}
]
],
"document_object_id": "ID-FROM-EXTRACTION-RESPONSE",  # replace with the ID returned by the extraction call
"document_type_name": "line-item-invoice"
}
headers = {
"Accept": "application/json",
"Content-Type": "application/json",
"Authorization": "<<apiKey>>"
}
response = requests.request("POST", url, json=payload, headers=headers)
print(response.json())
Response:
{
"details":[
"0cb96695b4c677c1d6c5562d523aa9541cb5dda8"
],
"message":"Values added successfully."
}
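In a live integration, the feedback call is usually prepared in code once a user has confirmed or corrected the extracted values. The sketch below shows one possible way to wire that up, under stated assumptions: `build_feedback_request` and `as_fields` are hypothetical helpers, the ID and values are placeholders, and the standard-library `urllib.request` is used in place of `requests` so the block has no third-party dependencies. It prepares the request but does not send it.

```python
import json
import urllib.request

FEEDBACK_URL = "https://developers.typless.com/api/add-document-feedback"


def as_fields(values):
    """Convert a {name: value} mapping to the list-of-dicts shape used above."""
    return [
        {"name": name, "value": "" if value is None else str(value)}
        for name, value in values.items()
    ]


def build_feedback_request(api_key, document_object_id, document_type_name,
                           fields, line_items):
    """Prepare (but do not send) a POST request with the corrected values."""
    payload = {
        "learning_fields": as_fields(fields),
        "line_items": [as_fields(item) for item in line_items],
        "document_object_id": document_object_id,
        "document_type_name": document_type_name,
    }
    return urllib.request.Request(
        FEEDBACK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": api_key,
        },
        method="POST",
    )


# Placeholder values; the ID must come from your own extraction response.
request = build_feedback_request(
    "<<apiKey>>",
    "ID-FROM-EXTRACTION-RESPONSE",
    "line-item-invoice",
    {"invoice_number": 350, "total_amount": "259.0000"},
    [{"product_description": "Amazing service", "price": "259.0000"}],
)

# To send it, pass the request to urllib.request.urlopen() and parse the
# JSON body of the response.
```

Keeping the request construction separate from the network call makes it easy to check the payload in a test before any data is sent.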
Using a training room
For smaller volumes of documents and for testing purposes, you can use the training room. In the training room, you can train documents for your document type and run test extractions to see results quickly. Each document type has its own training room. Data that you confirm here as correct will be used to train your document type.