Overview
Document processing is asynchronous and follows this workflow:
- Upload document to trigger ingestion job
- Poll job status until completion
- Access extracted content via API (covered in separate documentation)
Prerequisites
S3 Configuration
You must configure your own S3 bucket in the system configuration. Your ECS application requires the following S3 permissions:
IAM Policy Example:
Upload Document
Endpoint
Headers
Request Body (Multipart Form Data)
| Field | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | Document file to upload and process |
| fileType | String | No | Override automatic file type detection |
| extractionFormat | String | No | Desired output format for extracted content |
| csvDelimiter | String | No | CSV delimiter preference (when extractionFormat=csv) |
| htmlIncludeStyles | Boolean | No | Include CSS styling in HTML extraction |
| preserveFormatting | Boolean | No | Maintain original document formatting |
| extractImages | Boolean | No | Extract and encode images separately |
| extractTables | Boolean | No | Extract tables as structured data |
| connectorDataId | String | No | Used to batch file uploads for connectors |
Supported File Types
- Microsoft Office: .docx, .doc, .docm, .dotx, .dotm, .xlsx, .pptx
- Web Formats: .html
- Rich Text: .rtf
- Data Files: .csv
- Other: .pdf
File Limits
- Maximum file size: 100MB
- Processing time: 2 minutes to 1 hour depending on document complexity
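Checking a file against these limits before uploading avoids a round trip that would end in INVALID_FILE_TYPE or FILE_TOO_LARGE. The following is a minimal client-side sketch, not part of the API. The helper name `check_upload` is hypothetical, and reading "100MB" as 100 × 1024 × 1024 bytes is an assumption; the service may count bytes differently.

```python
from pathlib import Path

# Extensions taken from the "Supported File Types" list above.
SUPPORTED_EXTENSIONS = {
    ".docx", ".doc", ".docm", ".dotx", ".dotm", ".xlsx", ".pptx",
    ".html", ".rtf", ".csv", ".pdf",
}

# Assumption: "100MB" means 100 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_SIZE_BYTES}")
    return problems
```

The check only looks at the file name and size, so a corrupted file with a valid extension would still pass it and be rejected by the service.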
Extraction Format Options
Text Formats
- plain: Plain text extraction (default)
- html: HTML with optional styling preservation
- markdown: Markdown format conversion
Data Formats
- csv: Comma-separated values (for tabular data)
- json: Structured JSON format
- xml: XML structured format
Binary Formats
- base64: Base64 encoded content for binary preservation
Format-Specific Parameters
CSV Extraction (extractionFormat=csv)
HTML Extraction (extractionFormat=html)
JSON Extraction (extractionFormat=json)
Example Request
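As an illustration of a multipart upload, here is a hedged Python sketch using the third-party `requests` library. The URL in `UPLOAD_URL`, the bearer-token header, the helper names, and the encoding of booleans as "true"/"false" strings are all assumptions made for the example. Substitute the actual endpoint and headers for your deployment. The form field names come from the request body table.

```python
from typing import Optional

# Hypothetical placeholder: replace with the real upload endpoint.
UPLOAD_URL = "https://api.example.com/documents/upload"

# Values listed under "Extraction Format Options".
ALLOWED_FORMATS = {"plain", "html", "markdown", "csv", "json", "xml", "base64"}


def build_form_fields(
    extraction_format: str = "plain",
    csv_delimiter: Optional[str] = None,
    html_include_styles: Optional[bool] = None,
    extract_tables: Optional[bool] = None,
) -> dict:
    """Build the optional multipart form fields, omitting anything left unset."""
    if extraction_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown extractionFormat: {extraction_format}")
    fields = {"extractionFormat": extraction_format}
    if csv_delimiter is not None:
        if extraction_format != "csv":
            raise ValueError("csvDelimiter only applies when extractionFormat=csv")
        fields["csvDelimiter"] = csv_delimiter
    if html_include_styles is not None:
        # Assumption: booleans are sent as the strings "true"/"false".
        fields["htmlIncludeStyles"] = "true" if html_include_styles else "false"
    if extract_tables is not None:
        fields["extractTables"] = "true" if extract_tables else "false"
    return fields


def upload_document(path: str, token: str, **options) -> dict:
    """POST the file as multipart form data and return the parsed JSON response."""
    import requests  # third-party; imported lazily so build_form_fields works without it

    with open(path, "rb") as handle:
        response = requests.post(
            UPLOAD_URL,
            headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
            files={"file": handle},
            data=build_form_fields(**options),
            timeout=60,
        )
    response.raise_for_status()
    return response.json()
```

A call such as `upload_document("report.pdf", token, extraction_format="csv", csv_delimiter=";")` would send the file along with the two CSV-related fields. The shape of the returned JSON is not assumed here.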
Success Response (200 OK)
Poll Ingestion Job
Endpoint
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| jobId | String | Yes | Job ID returned from document upload |
Headers
Example Request
Response Format
Job Pending/Processing (200 OK)
Job Completed (200 OK)
Job Failed (200 OK)
Job Status Values
| Status | Description |
|---|---|
| pending | Job queued for processing |
| processing | Document being processed |
| completed | Processing finished successfully |
| failed | Processing failed after all retries |
Processing Stages
- uploading_to_s3: Uploading file to storage
- detecting_file_type: Analyzing file format
- extracting_text: Extracting textual content
- extracting_images: Processing images and figures
- extracting_tables: Processing tabular data
- extracting_metadata: Gathering document metadata
- formatting_output: Converting to requested format
- storing_results: Saving extracted content to databases
Polling Best Practices
Recommended Strategy
Python Example
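One possible polling loop is sketched below. It is not a prescribed strategy. The starting interval (5 seconds), the cap (60 seconds), the doubling between polls, and the `status` field name in the response are assumptions. The terminal statuses come from the Job Status Values table. The caller supplies `fetch_status`, a hypothetical function that performs the GET request for a given job and returns the decoded JSON body.

```python
import time
from typing import Callable

# Statuses after which no further polling is needed (from the status table).
TERMINAL_STATUSES = {"completed", "failed"}


def poll_job(
    fetch_status: Callable[[], dict],
    initial_delay: float = 5.0,
    max_delay: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal status or the timeout elapses."""
    delay = initial_delay
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job.get('status')!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # back off, capped at max_delay
```

The default timeout matches the one-hour processing limit. Jobs that go through automatic retries can take longer, so a larger value may be appropriate. Passing `sleep` as a parameter keeps the loop testable without real delays.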
Error Handling
Common Error Codes
| Error Code | Description | Resolution |
|---|---|---|
| INVALID_FILE_TYPE | File type not supported or corrupted | Check file format and integrity |
| FILE_TOO_LARGE | File exceeds 100MB limit | Reduce file size or split document |
| UPLOAD_FAILED | S3 upload unsuccessful | Check S3 permissions and connectivity |
| EXTRACTION_FAILED | Content extraction error | Verify document is not password protected |
| TIMEOUT_EXCEEDED | Processing took longer than 1 hour | Try splitting large documents |
| QUOTA_EXCEEDED | Processing quota reached | Wait for quota reset or upgrade plan |
| INVALID_PARAMETERS | Invalid extraction parameters | Check parameter format and values |
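A client can turn the table above into a lookup so that a failed job produces an actionable message. The sketch below copies the resolutions verbatim. The helper name `advise` is hypothetical, and the set of codes where resubmitting the same file might succeed is an assumption, not documented behavior.

```python
# Resolutions copied from the error code table.
RESOLUTIONS = {
    "INVALID_FILE_TYPE": "Check file format and integrity",
    "FILE_TOO_LARGE": "Reduce file size or split document",
    "UPLOAD_FAILED": "Check S3 permissions and connectivity",
    "EXTRACTION_FAILED": "Verify document is not password protected",
    "TIMEOUT_EXCEEDED": "Try splitting large documents",
    "QUOTA_EXCEEDED": "Wait for quota reset or upgrade plan",
    "INVALID_PARAMETERS": "Check parameter format and values",
}

# Assumption: these failures depend on the environment rather than the file,
# so resubmitting the unchanged file later could succeed.
RESUBMIT_UNCHANGED = {"UPLOAD_FAILED", "QUOTA_EXCEEDED"}


def advise(error_code: str) -> str:
    """Return a human-readable next step for an error code."""
    resolution = RESOLUTIONS.get(error_code)
    if resolution is None:
        return f"Unrecognized error code {error_code}"
    if error_code in RESUBMIT_UNCHANGED:
        return f"{resolution}, then resubmit the same file"
    return f"{resolution}, then upload a corrected file"
```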
Error Response Format
Client Errors (4xx)
Server Errors (5xx)
Retry Strategy
The system automatically retries failed jobs up to 3 times with increasing backoff delays. Intermediate failure states are visible through the polling endpoint.
Retry Schedule:
- First retry: After 1 minute
- Second retry: After 5 minutes
- Third retry: After 15 minutes
- Final failure: Returned to user
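The schedule above implies an upper bound on how long a job can stay unresolved, which is useful when choosing a client-side polling timeout. The arithmetic below assumes that each delay is measured from the preceding failure and that every attempt may run for the full one-hour processing limit. Neither assumption is confirmed by this document.

```python
from datetime import timedelta

# Delays before each automatic retry, taken from the retry schedule.
RETRY_DELAYS = [timedelta(minutes=1), timedelta(minutes=5), timedelta(minutes=15)]

# Assumption: each attempt may run up to the 1-hour processing limit.
MAX_ATTEMPT_DURATION = timedelta(hours=1)

total_wait = sum(RETRY_DELAYS, timedelta())
attempts = 1 + len(RETRY_DELAYS)  # the original attempt plus three retries
worst_case = total_wait + attempts * MAX_ATTEMPT_DURATION

print(total_wait)  # 0:21:00
print(worst_case)  # 4:21:00
```

Under these assumptions the retry delays alone add 21 minutes, and a job that times out on every attempt would take a little over four hours to reach its final failure.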
Data Retention
File Storage
- Hot Storage: 8 days in primary S3 storage
- Total Retention: 90 days (moved to cheaper storage after 8 days)
- Access: Users can access files directly from S3 during retention period
- Deletion: No notification provided before automatic deletion
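Because no notification precedes deletion, a client may want to compute the relevant dates itself. This sketch assumes both periods are counted from the upload time and that the 90-day total includes the 8 days of hot storage. The function name `retention_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

HOT_STORAGE_DAYS = 8
TOTAL_RETENTION_DAYS = 90


def retention_window(uploaded_at: datetime) -> tuple[datetime, datetime]:
    """Return (end of hot storage, deletion time) for a file uploaded at uploaded_at."""
    return (
        uploaded_at + timedelta(days=HOT_STORAGE_DAYS),
        uploaded_at + timedelta(days=TOTAL_RETENTION_DAYS),
    )


hot_until, deleted_at = retention_window(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(hot_until.date())   # 2024-01-09
print(deleted_at.date())  # 2024-03-31
```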