🚀 Our new and improved config V3 is now live! See API reference for details.
import requests

url = "https://platform.reducto.ai/parse"

# Request body: each option carries the value shown in this reference example.
payload = {
    "input": "<string>",
    "enhance": {
        "agentic": [],
        "summarize_figures": True
    },
    "retrieval": {
        "chunking": {"chunk_mode": "disabled"},
        "filter_blocks": [],
        "embedding_optimized": False
    },
    "formatting": {
        "add_page_markers": False,
        "table_output_format": "dynamic",
        "merge_tables": False,
        "include": []
    },
    "spreadsheet": {
        "split_large_tables": {
            "enabled": True,
            "size": 50
        },
        "include": [],
        "clustering": "accurate",
        "exclude": []
    },
    "settings": {
        "ocr_system": "standard",
        "force_url_result": False,
        "return_ocr_data": False,
        "return_images": [],
        "embed_pdf_metadata": False,
        "persist_results": False
    }
}

# Bearer token authentication; replace <token> with your auth token.
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
{
  "job_id": "<string>",
  "duration": 123,
  "usage": {
    "num_pages": 123,
    "credits": 123
  },
  "result": {
    "type": "<string>",
    "chunks": [
      {
        "content": "<string>",
        "embed": "<string>",
        "enriched": "<string>",
        "blocks": [
          {
            "type": "Header",
            "bbox": {
              "left": 123,
              "top": 123,
              "width": 123,
              "height": 123,
              "page": 123,
              "original_page": 123
            },
            "content": "<string>",
            "image_url": "<string>",
            "confidence": "low",
            "granular_confidence": {
              "extract_confidence": 123,
              "parse_confidence": 123
            }
          }
        ],
        "enrichment_success": false
      }
    ],
    "ocr": {
      "words": [
        {
          "text": "<string>",
          "bbox": {
            "left": 123,
            "top": 123,
            "width": 123,
            "height": 123,
            "page": 123,
            "original_page": 123
          },
          "confidence": 123,
          "chunk_index": 123
        }
      ],
      "lines": [
        {
          "text": "<string>",
          "bbox": {
            "left": 123,
            "top": 123,
            "width": 123,
            "height": 123,
            "page": 123,
            "original_page": 123
          },
          "confidence": 123,
          "chunk_index": 123
        }
      ]
    },
    "custom": "<unknown>"
  },
  "pdf_url": "<string>",
  "studio_link": "<string>"
}
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
For parse/split/extract pipelines, the URL of the document to be processed. You can provide one of the following:
1. A publicly available URL
2. A presigned S3 URL
3. A reducto:// prefixed URL obtained from the /upload endpoint after directly uploading a document
4. A jobid:// prefixed URL obtained from a previous /parse invocation
For edit pipelines, this should be a string containing the edit instructions.
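The accepted input forms can be told apart by their URL scheme. The following is a minimal sketch of a client-side check; `classify_input` and its return labels are hypothetical names introduced here for illustration and are not part of the API.

```python
from urllib.parse import urlparse


def classify_input(value: str) -> str:
    """Label the kind of `input` string (hypothetical helper, not part of the API)."""
    scheme = urlparse(value).scheme.lower()
    if scheme in ("http", "https"):
        return "url"           # public URL or presigned S3 URL
    if scheme == "reducto":
        return "upload"        # obtained from the /upload endpoint
    if scheme == "jobid":
        return "previous_job"  # obtained from a previous /parse invocation
    raise ValueError(f"unsupported input: {value!r}")
```

A check like this only catches malformed values before the request is sent; the service remains the authority on what it accepts.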
Agentic uses vision language models to enhance the accuracy of the output of different types of extraction. This will incur a cost and latency increase.
If True, summarize figures using a small vision language model. Defaults to True.
Choose how to partition chunks. Variable mode chunks by character length and visual context. Section mode chunks by section headers. Page mode chunks according to pages. Page sections mode chunks first by page, then by sections within each page. Disabled returns one single chunk.
Available options: variable, section, page, disabled, block, page_sections
The approximate size of chunks (in characters) that the document will be split into. Defaults to null, in which case the chunk size varies between 250 and 1500 characters.
A list of block types to filter out from 'content' and 'embed' fields. By default, no blocks are filtered.
Available options: Header, Footer, Title, Section Header, Page Number, List Item, Figure, Table, Key Value, Text, Comment, Signature
If True, use embedding optimized mode. Defaults to False.
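As an illustration of the retrieval options above, here is a minimal sketch that builds the `retrieval` object and rejects values outside the listed options. The helper `build_retrieval` is hypothetical, and the `chunk_size` key is an assumed field name for the chunk size setting, since the request example on this page does not show it.

```python
CHUNK_MODES = {"variable", "section", "page", "disabled", "block", "page_sections"}
BLOCK_TYPES = {
    "Header", "Footer", "Title", "Section Header", "Page Number", "List Item",
    "Figure", "Table", "Key Value", "Text", "Comment", "Signature",
}


def build_retrieval(chunk_mode="disabled", chunk_size=None,
                    filter_blocks=(), embedding_optimized=False):
    """Build the `retrieval` section of the payload (hypothetical helper)."""
    if chunk_mode not in CHUNK_MODES:
        raise ValueError(f"unknown chunk_mode: {chunk_mode!r}")
    unknown = set(filter_blocks) - BLOCK_TYPES
    if unknown:
        raise ValueError(f"unknown block types: {sorted(unknown)}")
    chunking = {"chunk_mode": chunk_mode}
    if chunk_size is not None:
        chunking["chunk_size"] = chunk_size  # assumed field name
    return {
        "chunking": chunking,
        "filter_blocks": list(filter_blocks),
        "embedding_optimized": embedding_optimized,
    }
```

For example, `build_retrieval("section", filter_blocks=["Header", "Footer"])` yields a section-chunked configuration that drops page headers and footers from the 'content' and 'embed' fields.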
If True, add page markers to the output. Defaults to False. Useful for extracting data with page specific information.
The mode to use for table output. Defaults to dynamic, which returns md for simpler tables and html for more complex tables.
Available options: html, json, md, jsonbbox, dynamic, csv
A flag to indicate if consecutive tables with the same number of columns should be merged. Defaults to False.
A list of formatting to include in the output.
Available options: change_tracking, highlight, comments, hyperlinks
Whether to include cell color and formula information in the output.
Available options: cell_colors, formula
In a spreadsheet with different tables inside, we enable splitting up the tables by default. Accurate mode applies more powerful models for superior accuracy, at 5× the default per-cell rate. Disabling will register as one large table.
Available options: accurate, fast, disabled
Whether to exclude hidden sheets, rows, or columns in the output.
Available options: hidden_sheets, hidden_rows, hidden_cols
Standard is our best multilingual OCR system. Legacy only supports Germanic languages and is available for backwards compatibility.
Available options: standard, legacy
Force the result to be returned in URL form.
Force the URL to be downloaded as a specific file extension (e.g. .png).
If True, return OCR data in the result. Defaults to False.
Whether to return images for the specified block types. By default, no images are returned.
Available options: figure, table
If True, embed OCR metadata into the returned PDF. Defaults to False.
If True, persist the results indefinitely. Defaults to False.
The timeout for the job in seconds.
Password to decrypt password-protected documents.
Successful Response
The duration of the parse request in seconds.
The response from the document processing service. Note that there can be two types of responses, Full Result and URL Result. This is due to limitations on the max return size on HTTPS. If the response is too large, it will be returned as a presigned URL in the URL response. You should handle this in your application.
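One way to handle both response types in application code is sketched below. The full-result shape matches the example on this page; the URL-result shape (`"type": "url"` with the presigned link under a `"url"` key) is an assumption made for illustration, and `load_result` and `_fetch_json` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def _fetch_json(url):
    """Download and decode a JSON document from a presigned URL."""
    with urlopen(url) as resp:
        return json.load(resp)


def load_result(result, fetch=_fetch_json):
    """Return the full result, downloading it first if only a URL was returned.

    Assumes a URL result looks like {"type": "url", "url": "<presigned url>"}.
    """
    if result.get("type") == "url":
        return fetch(result["url"])
    return result
```

Calling `load_result(response.json()["result"])` then gives the same structure regardless of how large the parsed document was. The `fetch` parameter exists so the download step can be swapped out, for example in tests.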
type = 'full'
Available options: "full"
The content of the chunk extracted from the document.
Chunk content optimized for embedding and retrieval.
The enriched content of the chunk extracted from the document.
The type of block extracted from the document.
Available options: Header, Footer, Title, Section Header, Page Number, List Item, Figure, Table, Key Value, Text, Comment, Signature
The bounding box of the block extracted from the document.
The page number of the bounding box (1-indexed).
The page number in the original document of the bounding box (1-indexed).
The content of the block extracted from the document.
(Experimental) The URL of the image associated with the block.
The confidence for the block. It is either low or high and takes into account factors like OCR and table structure.
Granular confidence scores for the block, given as a dictionary. The scores will not be None if the user has enabled numeric confidence scores.
Whether the enrichment was successful.
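The block-level `confidence` field lends itself to flagging content for manual review. The following is a minimal sketch that walks a full result and collects blocks marked low; `low_confidence_blocks` is a hypothetical helper, and it assumes the chunk and block layout shown in the response example on this page.

```python
def low_confidence_blocks(result):
    """Collect every block whose `confidence` is "low" (hypothetical helper)."""
    return [
        block
        for chunk in result.get("chunks", [])
        for block in chunk.get("blocks", [])
        if block.get("confidence") == "low"
    ]
```

Each returned block still carries its `bbox`, so a reviewer can be pointed at the page and region in question.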
The page number of the bounding box (1-indexed).
The page number in the original document of the bounding box (1-indexed).
OCR confidence score between 0 and 1, where 1 indicates highest confidence
The index of the chunk that the word belongs to.
The page number of the bounding box (1-indexed).
The page number in the original document of the bounding box (1-indexed).
OCR confidence score between 0 and 1, where 1 indicates highest confidence
The index of the chunk that the line belongs to.
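Because every OCR word and line carries a `chunk_index`, OCR output can be tied back to the chunk it belongs to. Below is a minimal sketch, assuming OCR data was requested and that `ocr` has the `words` layout shown in the response example; `words_by_chunk` is a hypothetical helper name.

```python
from collections import defaultdict


def words_by_chunk(ocr):
    """Group OCR word texts by their `chunk_index` (hypothetical helper)."""
    grouped = defaultdict(list)
    for word in ocr.get("words", []):
        grouped[word["chunk_index"]].append(word["text"])
    return dict(grouped)
```

The same grouping applies to `ocr["lines"]` by iterating over that list instead.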
The storage URL of the converted PDF file.
The link to the studio pipeline for the document.