Overview

Document processing is asynchronous and follows this workflow:
  1. Upload document to trigger ingestion job
  2. Poll job status until completion
  3. Access extracted content via API (covered in separate documentation)

Prerequisites

S3 Configuration

You must configure your own S3 bucket in the system configuration. Your ECS application requires the following S3 permissions.

IAM Policy Example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name"
    }
  ]
}
ECS Task Role Configuration: Attach the above policy to your ECS task role, or define permissions at the role level in your AWS configuration.
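
If you manage IAM programmatically, one way to attach the inline policy above is with boto3. This is a minimal sketch: the role and policy names are placeholders for your own values, and the policy document is assumed to be saved locally as policy.json.

import boto3

# Sketch: attach the S3 policy above (saved as policy.json) to the ECS task role
# as an inline policy. RoleName and PolicyName are placeholder values.
with open("policy.json") as f:
    policy_document = f.read()

boto3.client("iam").put_role_policy(
    RoleName="your-ecs-task-role",
    PolicyName="document-ingestion-s3-access",
    PolicyDocument=policy_document,
)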

Upload Document

Endpoint

POST /ingest/document

Headers

Authorization: Bearer <access_token>
Content-Type: multipart/form-data

Request Body (Multipart Form Data)

Field               Type     Required  Description
file                File     Yes       Document file to upload and process
fileType            String   No        Override automatic file type detection
extractionFormat    String   No        Desired output format for extracted content
csvDelimiter        String   No        CSV delimiter preference (when extractionFormat=csv)
htmlIncludeStyles   Boolean  No        Include CSS styling in HTML extraction
preserveFormatting  Boolean  No        Maintain original document formatting
extractImages       Boolean  No        Extract and encode images separately
extractTables       Boolean  No        Extract tables as structured data
connectorDataId     String   No        Used to batch file uploads for connectors

Supported File Types

  • Microsoft Word: .docx, .doc, .docm, .dotx, .dotm
  • Microsoft Excel: .xlsx
  • Microsoft PowerPoint: .pptx
  • Web Formats: .html
  • Rich Text: .rtf
  • Data Files: .csv
  • Other: .pdf

File Limits

  • Maximum file size: 100MB
  • Processing time: 2 minutes to 1 hour depending on document complexity
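
A simple client-side check against these limits before uploading can save a round trip. The sketch below assumes the extensions listed above and interprets the 100MB limit as binary megabytes.

import os

SUPPORTED_EXTENSIONS = {
    ".docx", ".doc", ".docm", ".dotx", ".dotm",
    ".xlsx", ".pptx", ".html", ".rtf", ".csv", ".pdf",
}
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # 100MB, assuming binary megabytes

def validate_for_upload(path: str) -> None:
    # Reject unsupported extensions and oversized files before calling the API.
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")
    if os.path.getsize(path) > MAX_FILE_SIZE_BYTES:
        raise ValueError("File exceeds the 100MB upload limit")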

Extraction Format Options

Text Formats

  • plain - Plain text extraction (default)
  • html - HTML with optional styling preservation
  • markdown - Markdown format conversion

Data Formats

  • csv - Comma-separated values (for tabular data)
  • json - Structured JSON format
  • xml - XML structured format

Binary Formats

  • base64 - Base64 encoded content for binary preservation

Format-Specific Parameters

CSV Extraction (extractionFormat=csv)

  • csvDelimiter: "," | ";" | "|" | "\t"
  • includeHeaders: true | false
  • quoteFields: true | false

HTML Extraction (extractionFormat=html)

  • htmlIncludeStyles: true | false
  • stripScripts: true | false (default: true)
  • preserveImages: true | false

JSON Extraction (extractionFormat=json)

  • includeMetadata: true | false
  • nestTables: true | false
  • separateImages: true | false
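
For illustration only, a hedged sketch of how these options might be supplied: the request-body table above lists csvDelimiter and htmlIncludeStyles as form fields, and this sketch assumes the remaining format-specific options are accepted the same way (verify against your deployment). The dictionary can be merged into the data of the upload request shown below.

# Sketch only: format-specific options passed as additional multipart form fields.
# Treating every option as a form field is an assumption to verify.
csv_options = {
    "extractionFormat": "csv",
    "csvDelimiter": ";",
    "includeHeaders": "true",
    "quoteFields": "true",
}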

Example Request

curl -X POST "https://api.yourdomain.com/ingest/document" \
  -H "Authorization: Bearer your-access-token" \
  -F "[email protected]" \
  -F "fileType=docx" \
  -F "extractionFormat=json" \
  -F "extractImages=true" \
  -F "extractTables=true"

Success Response (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "pending",
  "created_at": "2024-01-15T10:30:00Z",
  "estimated_completion": "2024-01-15T10:35:00Z"
}

Poll Ingestion Job

Endpoint

POST /ingest/job/{jobId}

Path Parameters

Parameter  Type    Required  Description
jobId      String  Yes       Job ID returned from document upload

Headers

Authorization: Bearer <access_token>
Content-Type: application/json

Example Request

curl -X POST "https://api.yourdomain.com/ingest/job/ingest-abc123def456" \
  -H "Authorization: Bearer your-access-token" \
  -H "Content-Type: application/json"

Response Format

Job Pending/Processing (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "processing",
  "progress": 45,
  "current_stage": "extracting_tables",
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:32:30Z",
  "estimated_completion": "2024-01-15T10:35:00Z"
}

Job Completed (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "completed",
  "progress": 100,
  "created_at": "2024-01-15T10:30:00Z",
  "completed_at": "2024-01-15T10:34:12Z",
  "processing_time_seconds": 252,
  "file_info": {
    "original_filename": "document.docx",
    "file_size_bytes": 2048576,
    "s3_url": "s3://your-bucket/ingested/ingest-abc123def456/document.docx",
    "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  },
  "extraction_results": {
    "total_pages": 15,
    "total_words": 3247,
    "total_images": 8,
    "total_tables": 3,
    "extraction_format": "json",
    "content_id": "content-abc123def456"
  }
}

Job Failed (200 OK)

{
  "jobId": "ingest-abc123def456",
  "status": "failed",
  "progress": 30,
  "created_at": "2024-01-15T10:30:00Z",
  "failed_at": "2024-01-15T10:32:45Z",
  "error": {
    "error_code": "INVALID_FILE_TYPE",
    "message": "File type not supported or corrupted",
    "details": "Unable to parse document structure"
  },
  "retry_count": 3,
  "max_retries": 3
}

Job Status Values

Status      Description
pending     Job queued for processing
processing  Document being processed
completed   Processing finished successfully
failed      Processing failed after all retries

Processing Stages

  • uploading_to_s3 - Uploading file to storage
  • detecting_file_type - Analyzing file format
  • extracting_text - Extracting textual content
  • extracting_images - Processing images and figures
  • extracting_tables - Processing tabular data
  • extracting_metadata - Gathering document metadata
  • formatting_output - Converting to requested format
  • storing_results - Saving extracted content to databases
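
When surfacing progress to users, the current_stage value can be mapped to a friendlier label. The labels below are illustrative only, not part of the API.

# Illustrative mapping from current_stage values to display labels.
STAGE_LABELS = {
    "uploading_to_s3": "Uploading file to storage",
    "detecting_file_type": "Analyzing file format",
    "extracting_text": "Extracting text",
    "extracting_images": "Processing images",
    "extracting_tables": "Processing tables",
    "extracting_metadata": "Gathering metadata",
    "formatting_output": "Formatting output",
    "storing_results": "Saving results",
}

def describe(job_status: dict) -> str:
    # Combine the stage label with the reported progress percentage.
    stage = job_status.get("current_stage", "")
    return f"{STAGE_LABELS.get(stage, stage)} ({job_status.get('progress', 0)}%)"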

Polling Best Practices

async function pollJobStatus(jobId) {
  // Assumes `accessToken` is defined in the enclosing scope.
  const maxAttempts = 50; // worst case roughly 12 hours with the capped backoff below
  let attempt = 0;
  let delay = 5 * 60 * 1000; // Start with 5 minutes
  const maxDelay = 15 * 60 * 1000; // Cap at 15 minutes
  
  while (attempt < maxAttempts) {
    try {
      const response = await fetch(`/ingest/job/${jobId}`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${accessToken}`,
          'Content-Type': 'application/json'
        }
      });
      
      const jobStatus = await response.json();
      
      if (jobStatus.status === 'completed') {
        return jobStatus;
      }
      
      if (jobStatus.status === 'failed') {
        throw new Error(`Job failed: ${jobStatus.error.message}`);
      }
      
      // Exponential backoff with jitter
      await new Promise(resolve => setTimeout(resolve, delay + Math.random() * 1000));
      delay = Math.min(delay * 1.2, maxDelay);
      attempt++;
      
    } catch (error) {
      console.error(`Polling attempt ${attempt} failed:`, error);
      attempt++;
      
      if (attempt >= maxAttempts) {
        throw new Error('Max polling attempts reached');
      }
      
      // Shorter delay for error retry
      await new Promise(resolve => setTimeout(resolve, 30000));
    }
  }
  
  throw new Error('Job did not complete within expected timeframe');
}

Python Example

import time
import requests
import random
from typing import Dict, Any

def poll_job_status(job_id: str, access_token: str, base_url: str) -> Dict[Any, Any]:
    max_attempts = 50
    attempt = 0
    delay = 300  # 5 minutes
    max_delay = 900  # 15 minutes
    
    while attempt < max_attempts:
        try:
            response = requests.post(
                f"{base_url}/ingest/job/{job_id}",
                headers={
                    "Authorization": f"Bearer {access_token}",
                    "Content-Type": "application/json"
                }
            )
            response.raise_for_status()
            job_status = response.json()
            
            if job_status["status"] == "completed":
                return job_status
            elif job_status["status"] == "failed":
                raise Exception(f"Job failed: {job_status['error']['message']}")
            
            # Exponential backoff with jitter
            sleep_time = delay + random.uniform(0, 1)
            time.sleep(sleep_time)
            delay = min(delay * 1.2, max_delay)
            attempt += 1
            
        except requests.RequestException as e:
            print(f"Polling attempt {attempt} failed: {e}")
            attempt += 1
            
            if attempt >= max_attempts:
                raise Exception("Max polling attempts reached")
            
            time.sleep(30)  # Shorter delay for error retry
    
    raise Exception("Job did not complete within expected timeframe")

Error Handling

Common Error Codes

Error Code          Description                             Resolution
INVALID_FILE_TYPE   File type not supported or corrupted    Check file format and integrity
FILE_TOO_LARGE      File exceeds 100MB limit                Reduce file size or split document
UPLOAD_FAILED       S3 upload unsuccessful                  Check S3 permissions and connectivity
EXTRACTION_FAILED   Content extraction error                Verify document is not password protected
TIMEOUT_EXCEEDED    Processing took longer than 1 hour      Try splitting large documents
QUOTA_EXCEEDED      Processing quota reached                Wait for quota reset or upgrade plan
INVALID_PARAMETERS  Invalid extraction parameters           Check parameter format and values

Error Response Format

Client Errors (4xx)

{
  "error_code": "INVALID_FILE_TYPE",
  "message": "The uploaded file type is not supported",
  "details": {
    "supported_types": [".docx", ".doc", ".html", ".rtf", ".csv"],
    "detected_type": ".pdf"
  }
}

Server Errors (5xx)

{
  "error_code": "INTERNAL_SERVER_ERROR",
  "message": "An unexpected error occurred during processing",
  "request_id": "req-abc123def456",
  "retry_after": 300
}
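
A client can branch on error_code to decide whether a retry is worthwhile. Which codes are safe to retry is a judgment call, so the classification below is only a suggested sketch, not something the API prescribes.

# Sketch: classify documented error codes; the retryable set is an assumption.
RETRYABLE = {"UPLOAD_FAILED", "QUOTA_EXCEEDED", "INTERNAL_SERVER_ERROR"}

def should_retry(error_response: dict) -> bool:
    # Retry only codes that plausibly succeed on a later attempt.
    return error_response.get("error_code") in RETRYABLE

def retry_delay_seconds(error_response: dict, default: int = 300) -> int:
    # Server errors may include retry_after (seconds); otherwise use a default.
    return int(error_response.get("retry_after", default))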

Retry Strategy

The system automatically retries failed jobs up to 3 times with exponential backoff. Intermediate failure states are visible through the polling endpoint.

Retry Schedule:
  • First retry: After 1 minute
  • Second retry: After 5 minutes
  • Third retry: After 15 minutes
  • Final failure: Returned to user

Data Retention

File Storage

  • Hot Storage: 8 days in primary S3 storage
  • Total Retention: 90 days (moved to cheaper storage after 8 days)
  • Access: Users can access files directly from S3 during retention period
  • Deletion: No notification provided before automatic deletion
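
During the retention window the original file can be fetched straight from S3. The sketch below parses the s3_url returned by a completed job (bucket and key layout taken from the example response) and assumes AWS credentials with s3:GetObject on the configured bucket.

from urllib.parse import urlparse
import boto3

def download_from_s3_url(s3_url: str, local_path: str) -> None:
    # s3_url looks like s3://bucket/key/path from the completed-job response.
    parsed = urlparse(s3_url)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    boto3.client("s3").download_file(bucket, key, local_path)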

Extracted Content

Extracted content is stored in the system databases and available via API (covered in separate documentation). This content has different retention policies than the original files.

Integration Example

Complete Workflow

async function ingestDocument(file, options = {}) {
  // Assumes `accessToken` is in scope and reuses pollJobStatus from the polling example above.
  // 1. Upload document
  const formData = new FormData();
  formData.append('file', file);
  
  if (options.fileType) formData.append('fileType', options.fileType);
  if (options.extractionFormat) formData.append('extractionFormat', options.extractionFormat);
  if (options.extractImages) formData.append('extractImages', options.extractImages);
  if (options.extractTables) formData.append('extractTables', options.extractTables);
  
  const uploadResponse = await fetch('/ingest/document', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${accessToken}`
    },
    body: formData
  });
  
  const { jobId } = await uploadResponse.json();
  console.log(`Document upload started. Job ID: ${jobId}`);
  
  // 2. Poll for completion
  const completedJob = await pollJobStatus(jobId);
  console.log('Document processing completed:', completedJob);
  
  // 3. Extract content (covered in separate API documentation)
  const contentId = completedJob.extraction_results.content_id;
  return contentId;
}

// Usage
const contentId = await ingestDocument(file, {
  extractionFormat: 'json',
  extractImages: true,
  extractTables: true
});

Troubleshooting

Common Issues

Problem: Job stays in “pending” status for an extended time
Solution: High processing queue; implement proper polling with exponential backoff

Problem: “UPLOAD_FAILED” error
Solution: Verify S3 bucket permissions and network connectivity

Problem: “EXTRACTION_FAILED” for Word documents
Solution: Check if the document is password-protected or corrupted

Problem: Processing takes longer than expected
Solution: Large or complex documents may take up to 1 hour; ensure your polling timeout is sufficient

Problem: Cannot access S3 files
Solution: Verify IAM permissions include both object-level and bucket-level access