Overview

Document processing is asynchronous and follows this workflow:
  1. Upload document to trigger ingestion job
  2. Poll job status until completion
  3. Access extracted content via API (covered in separate documentation)

Prerequisites

S3 Configuration

You must configure your own S3 bucket in the system configuration. Your ECS application requires the following S3 permissions.

IAM Policy Example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name"
    }
  ]
}
ECS Task Role Configuration: Attach the above policy to your ECS task role, or define permissions at the role level in your AWS configuration.
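
If you manage IAM programmatically, one way to attach the inline policy above is with boto3. This is a minimal sketch: the role and policy names are placeholders for your own values, and the policy document is assumed to be saved locally as policy.json.

import boto3

# Sketch: attach the S3 policy above (saved as policy.json) to the ECS task role
# as an inline policy. RoleName and PolicyName are placeholder values.
with open("policy.json") as f:
    policy_document = f.read()

boto3.client("iam").put_role_policy(
    RoleName="your-ecs-task-role",
    PolicyName="document-ingestion-s3-access",
    PolicyDocument=policy_document,
)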

Upload Document

Endpoint

POST /ingest/document

Headers

Authorization: Bearer <access_token>
Content-Type: multipart/form-data

Request Body (Multipart Form Data)

Field               Type     Required  Description
file                File     Yes       Document file to upload and process
fileType            String   No        Override automatic file type detection
extractionFormat    String   No        Desired output format for extracted content
csvDelimiter        String   No        CSV delimiter preference (when extractionFormat=csv)
htmlIncludeStyles   Boolean  No        Include CSS styling in HTML extraction
preserveFormatting  Boolean  No        Maintain original document formatting
extractImages       Boolean  No        Extract and encode images separately
extractTables       Boolean  No        Extract tables as structured data
connectorDataId     String   No        Used to batch file uploads for connectors

Supported File Types

  • Microsoft Word: .docx, .doc, .docm, .dotx, .dotm
  • Microsoft Excel: .xlsx
  • Microsoft PowerPoint: .pptx
  • Web Formats: .html
  • Rich Text: .rtf
  • Data Files: .csv
  • Other: .pdf

File Limits

  • Maximum file size: 100MB
  • Processing time: 2 minutes to 1 hour depending on document complexity
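
A simple client-side check against these limits before uploading can save a round trip. The sketch below assumes the extensions listed above and interprets the 100MB limit as binary megabytes.

import os

SUPPORTED_EXTENSIONS = {
    ".docx", ".doc", ".docm", ".dotx", ".dotm",
    ".xlsx", ".pptx", ".html", ".rtf", ".csv", ".pdf",
}
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # 100MB, assuming binary megabytes

def validate_for_upload(path: str) -> None:
    # Reject unsupported extensions and oversized files before calling the API.
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")
    if os.path.getsize(path) > MAX_FILE_SIZE_BYTES:
        raise ValueError("File exceeds the 100MB upload limit")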

Extraction Format Options

Text Formats

  • plain - Plain text extraction (default)
  • html - HTML with optional styling preservation
  • markdown - Markdown format conversion

Data Formats

  • csv - Comma-separated values (for tabular data)
  • json - Structured JSON format
  • xml - XML structured format

Binary Formats

  • base64 - Base64 encoded content for binary preservation

Format-Specific Parameters

CSV Extraction (extractionFormat=csv)

  • csvDelimiter: "," | ";" | "|" | "\t"
  • includeHeaders: true | false
  • quoteFields: true | false

HTML Extraction (extractionFormat=html)

  • htmlIncludeStyles: true | false
  • stripScripts: true | false (default: true)
  • preserveImages: true | false

JSON Extraction (extractionFormat=json)

  • includeMetadata: true | false
  • nestTables: true | false
  • separateImages: true | false
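
For illustration only, a hedged sketch of how these options might be supplied: the request-body table above lists csvDelimiter and htmlIncludeStyles as form fields, and this sketch assumes the remaining format-specific options are accepted the same way (verify against your deployment). The dictionary can be merged into the data of the upload request shown below.

# Sketch only: format-specific options passed as additional multipart form fields.
# Treating every option as a form field is an assumption to verify.
csv_options = {
    "extractionFormat": "csv",
    "csvDelimiter": ";",
    "includeHeaders": "true",
    "quoteFields": "true",
}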

Example Request

curl -X POST "https://api.yourdomain.com/ingest/document" \
  -H "Authorization: Bearer your-access-token" \
  -F "[email protected]" \
  -F "fileType=docx" \
  -F "extractionFormat=json" \
  -F "extractImages=true" \
  -F "extractTables=true"

Success Response (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "pending",
  "created_at": "2024-01-15T10:30:00Z",
  "estimated_completion": "2024-01-15T10:35:00Z"
}

Poll Ingestion Job

Endpoint

POST /ingest/job/{jobId}

Path Parameters

Parameter  Type    Required  Description
jobId      String  Yes       Job ID returned from document upload

Headers

Authorization: Bearer <access_token>
Content-Type: application/json

Example Request

curl -X POST "https://api.yourdomain.com/ingest/job/ingest-abc123def456" \
  -H "Authorization: Bearer your-access-token" \
  -H "Content-Type: application/json"

Response Format

Job Pending/Processing (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "processing",
  "progress": 45,
  "current_stage": "extracting_tables",
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:32:30Z",
  "estimated_completion": "2024-01-15T10:35:00Z"
}

Job Completed (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "completed",
  "progress": 100,
  "created_at": "2024-01-15T10:30:00Z",
  "completed_at": "2024-01-15T10:34:12Z",
  "processing_time_seconds": 252,
  "file_info": {
    "original_filename": "document.docx",
    "file_size_bytes": 2048576,
    "s3_url": "s3://your-bucket/ingested/ingest-abc123def456/document.docx",
    "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  },
  "extraction_results": {
    "total_pages": 15,
    "total_words": 3247,
    "total_images": 8,
    "total_tables": 3,
    "extraction_format": "json",
    "content_id": "content-abc123def456"
  }
}

Job Failed (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "failed",
  "progress": 30,
  "created_at": "2024-01-15T10:30:00Z",
  "failed_at": "2024-01-15T10:32:45Z",
  "error": {
    "error_code": "INVALID_FILE_TYPE",
    "message": "File type not supported or corrupted",
    "details": "Unable to parse document structure"
  },
  "retry_count": 3,
  "max_retries": 3
}

Job Status Values

Status      Description
pending     Job queued for processing
processing  Document being processed
completed   Processing finished successfully
failed      Processing failed after all retries

Processing Stages

  • uploading_to_s3 - Uploading file to storage
  • detecting_file_type - Analyzing file format
  • extracting_text - Extracting textual content
  • extracting_images - Processing images and figures
  • extracting_tables - Processing tabular data
  • extracting_metadata - Gathering document metadata
  • formatting_output - Converting to requested format
  • storing_results - Saving extracted content to databases
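
When surfacing progress to users, the current_stage value can be mapped to a friendlier label. The labels below are illustrative only, not part of the API.

# Illustrative mapping from current_stage values to display labels.
STAGE_LABELS = {
    "uploading_to_s3": "Uploading file to storage",
    "detecting_file_type": "Analyzing file format",
    "extracting_text": "Extracting text",
    "extracting_images": "Processing images",
    "extracting_tables": "Processing tables",
    "extracting_metadata": "Gathering metadata",
    "formatting_output": "Formatting output",
    "storing_results": "Saving results",
}

def describe(job_status: dict) -> str:
    # Combine the stage label with the reported progress percentage.
    stage = job_status.get("current_stage", "")
    return f"{STAGE_LABELS.get(stage, stage)} ({job_status.get('progress', 0)}%)"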

Polling Best Practices

async function pollJobStatus(jobId) {
  // Assumes `accessToken` is defined in the enclosing scope.
  const maxAttempts = 50; // worst case roughly 12 hours with the capped backoff below
  let attempt = 0;
  let delay = 5 * 60 * 1000; // Start with 5 minutes
  const maxDelay = 15 * 60 * 1000; // Cap at 15 minutes
  
  while (attempt < maxAttempts) {
    try {
      const response = await fetch(`/ingest/job/${jobId}`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${accessToken}`,
          'Content-Type': 'application/json'
        }
      });
      
      const jobStatus = await response.json();
      
      if (jobStatus.status === 'completed') {
        return jobStatus;
      }
      
      if (jobStatus.status === 'failed') {
        throw new Error(`Job failed: ${jobStatus.error.message}`);
      }
      
      // Exponential backoff with jitter
      await new Promise(resolve => setTimeout(resolve, delay + Math.random() * 1000));
      delay = Math.min(delay * 1.2, maxDelay);
      attempt++;
      
    } catch (error) {
      console.error(`Polling attempt ${attempt} failed:`, error);
      attempt++;
      
      if (attempt >= maxAttempts) {
        throw new Error('Max polling attempts reached');
      }
      
      // Shorter delay for error retry
      await new Promise(resolve => setTimeout(resolve, 30000));
    }
  }
  
  throw new Error('Job did not complete within expected timeframe');
}

Python Example

import time
import requests
import random
from typing import Dict, Any

def poll_job_status(job_id: str, access_token: str, base_url: str) -> Dict[Any, Any]:
    max_attempts = 50
    attempt = 0
    delay = 300  # 5 minutes
    max_delay = 900  # 15 minutes
    
    while attempt < max_attempts:
        try:
            response = requests.post(
                f"{base_url}/ingest/job/{job_id}",
                headers={
                    "Authorization": f"Bearer {access_token}",
                    "Content-Type": "application/json"
                }
            )
            response.raise_for_status()
            job_status = response.json()
            
            if job_status["status"] == "completed":
                return job_status
            elif job_status["status"] == "failed":
                raise Exception(f"Job failed: {job_status['error']['message']}")
            
            # Exponential backoff with jitter
            sleep_time = delay + random.uniform(0, 1)
            time.sleep(sleep_time)
            delay = min(delay * 1.2, max_delay)
            attempt += 1
            
        except requests.RequestException as e:
            print(f"Polling attempt {attempt} failed: {e}")
            attempt += 1
            
            if attempt >= max_attempts:
                raise Exception("Max polling attempts reached")
            
            time.sleep(30)  # Shorter delay for error retry
    
    raise Exception("Job did not complete within expected timeframe")

Error Handling

Common Error Codes

Error Code          Description                             Resolution
INVALID_FILE_TYPE   File type not supported or corrupted    Check file format and integrity
FILE_TOO_LARGE      File exceeds 100MB limit                Reduce file size or split document
UPLOAD_FAILED       S3 upload unsuccessful                  Check S3 permissions and connectivity
EXTRACTION_FAILED   Content extraction error                Verify document is not password protected
TIMEOUT_EXCEEDED    Processing took longer than 1 hour      Try splitting large documents
QUOTA_EXCEEDED      Processing quota reached                Wait for quota reset or upgrade plan
INVALID_PARAMETERS  Invalid extraction parameters           Check parameter format and values

Error Response Format

Client Errors (4xx)

{
  "error_code": "INVALID_FILE_TYPE",
  "message": "The uploaded file type is not supported",
  "details": {
    "supported_types": [".docx", ".doc", ".html", ".rtf", ".csv"],
    "detected_type": ".pdf"
  }
}

Server Errors (5xx)

{
  "error_code": "INTERNAL_SERVER_ERROR",
  "message": "An unexpected error occurred during processing",
  "request_id": "req-abc123def456",
  "retry_after": 300
}
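
A client can branch on error_code to decide whether a retry is worthwhile. Which codes are safe to retry is a judgment call, so the classification below is only a suggested sketch, not something the API prescribes.

# Sketch: classify documented error codes; the retryable set is an assumption.
RETRYABLE = {"UPLOAD_FAILED", "QUOTA_EXCEEDED", "INTERNAL_SERVER_ERROR"}

def should_retry(error_response: dict) -> bool:
    # Retry only codes that plausibly succeed on a later attempt.
    return error_response.get("error_code") in RETRYABLE

def retry_delay_seconds(error_response: dict, default: int = 300) -> int:
    # Server errors may include retry_after (seconds); otherwise use a default.
    return int(error_response.get("retry_after", default))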

Retry Strategy

The system automatically retries failed jobs up to 3 times with exponential backoff. Intermediate failure states are visible through the polling endpoint.

Retry Schedule:
  • First retry: After 1 minute
  • Second retry: After 5 minutes
  • Third retry: After 15 minutes
  • Final failure: Returned to user

Data Retention

File Storage

  • Hot Storage: 8 days in primary S3 storage
  • Total Retention: 90 days (moved to cheaper storage after 8 days)
  • Access: Users can access files directly from S3 during retention period
  • Deletion: No notification provided before automatic deletion
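
During the retention window the original file can be fetched straight from S3. The sketch below parses the s3_url returned by a completed job (bucket and key layout taken from the example response) and assumes AWS credentials with s3:GetObject on the configured bucket.

from urllib.parse import urlparse
import boto3

def download_from_s3_url(s3_url: str, local_path: str) -> None:
    # s3_url looks like s3://bucket/key/path from the completed-job response.
    parsed = urlparse(s3_url)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    boto3.client("s3").download_file(bucket, key, local_path)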

Extracted Content

Extracted content is stored in the system databases and available via API (covered in separate documentation). This content has different retention policies than the original files.

Integration Example

Complete Workflow

async function ingestDocument(file, options = {}) {
  // Assumes `accessToken` is in scope and reuses pollJobStatus from the polling example above.
  // 1. Upload document
  const formData = new FormData();
  formData.append('file', file);
  
  if (options.fileType) formData.append('fileType', options.fileType);
  if (options.extractionFormat) formData.append('extractionFormat', options.extractionFormat);
  if (options.extractImages) formData.append('extractImages', options.extractImages);
  if (options.extractTables) formData.append('extractTables', options.extractTables);
  
  const uploadResponse = await fetch('/ingest/document', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${accessToken}`
    },
    body: formData
  });
  
  const { jobId } = await uploadResponse.json();
  console.log(`Document upload started. Job ID: ${jobId}`);
  
  // 2. Poll for completion
  const completedJob = await pollJobStatus(jobId);
  console.log('Document processing completed:', completedJob);
  
  // 3. Extract content (covered in separate API documentation)
  const contentId = completedJob.extraction_results.content_id;
  return contentId;
}

// Usage
const contentId = await ingestDocument(file, {
  extractionFormat: 'json',
  extractImages: true,
  extractTables: true
});

Troubleshooting

Common Issues

Problem: Job stays in “pending” status for an extended time
Solution: High processing queue; implement proper polling with exponential backoff

Problem: “UPLOAD_FAILED” error
Solution: Verify S3 bucket permissions and network connectivity

Problem: “EXTRACTION_FAILED” for Word documents
Solution: Check if the document is password-protected or corrupted

Problem: Processing takes longer than expected
Solution: Large or complex documents may take up to 1 hour; ensure your polling timeout is sufficient

Problem: Cannot access S3 files
Solution: Verify IAM permissions include both object-level and bucket-level access