Document Generation

Document generation in Artos is an automated process that transforms source documents into professionally formatted regulatory documents using MRT templates.

Overview

Document generation runs in six stages, which the pipeline steps later on this page break down further:

Extract - Extract content from source documents
Classify - Classify extracted content by document type
Ingest - Organize classified content for processing
Outline - Create document outline from MRT template
Orchestrate - Apply extraction rules and processing
Generate - Produce final formatted document

The Generation Pipeline

Step 1: Request Generation

Submit a generation request that names the source documents and the MRT template:

POST /api/v1/documents/generate
{
  "document_type": "CSR",
  "file_paths": ["s3://bucket/protocol.pdf", "s3://bucket/data.xlsx"],
  "document_set_key": "project-2024",
  "document_set_name": "Q1 CSR Documents",
  "generic_mrt_id": "csr-template-uuid",
  "output_name": "CSR_Final.docx"
}

Returns: 202 Accepted with task ID

{
  "message": "Document generation started",
  "task_id": "celery-task-abc-123"
}
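
As a minimal sketch, the request above could be assembled in Python with the standard library. The field names come from the example payload; the base URL, the bearer-token header (borrowed from the Typical Workflow script), and the helper name build_generate_request are assumptions for illustration only.

```python
import json
import urllib.request


def build_generate_request(base_url, token, payload):
    """Build (but do not send) the POST request for document generation.

    base_url and token are placeholders; this helper is hypothetical.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/documents/generate",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


payload = {
    "document_type": "CSR",
    "file_paths": ["s3://bucket/protocol.pdf", "s3://bucket/data.xlsx"],
    "document_set_key": "project-2024",
    "document_set_name": "Q1 CSR Documents",
    "generic_mrt_id": "csr-template-uuid",
    "output_name": "CSR_Final.docx",
}

request = build_generate_request("https://api.example.com", "TOKEN", payload)
# Sending is left to the caller, e.g. urllib.request.urlopen(request);
# the response body is expected to carry "task_id".
```

Keeping construction separate from sending makes the payload easy to inspect before any generation task is queued.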

Step 2: Extract Content

The system extracts content from source documents:

Source Documents:
├─ protocol.pdf (document structure, sections, text)
├─ safety_report.docx (safety data, tables, findings)
└─ efficacy_data.xlsx (results, analysis, statistics)

↓ Extraction Engine

Extracted Content:
├─ Document Type: CSR
├─ Identified Sections:
│  ├─ Introduction/Background
│  ├─ Methodology/Study Design
│  ├─ Results
│  ├─ Safety Analysis
│  ├─ Efficacy Analysis
│  └─ Conclusions
├─ Structured Data:
│  ├─ Tables (efficacy, safety)
│  ├─ Key Statistics
│  └─ Findings
└─ Metadata:
   ├─ Source Document Mapping
   ├─ Confidence Scores
   └─ Processing Metadata

Step 3: Classify Content

Content is automatically classified based on:

Document structure
Section headers
Content type indicators
Extraction rules

Extracted Content
↓ Classification Rules
├─ Safety Data: 95% confidence
├─ Efficacy Data: 92% confidence
├─ Methodology: 89% confidence
└─ Background: 96% confidence

Step 4: Create Outline

The outline is generated from the MRT template:

{
  "outline_id": "outline-uuid",
  "template_id": "csr-template-uuid",
  "sections": [
    {
      "section_id": "section-1",
      "order_index": 0,
      "level": 1,
      "title": "Executive Summary",
      "status": "pending"
    },
    {
      "section_id": "section-2",
      "order_index": 1,
      "level": 1,
      "title": "Methodology",
      "status": "pending"
    }
  ]
}
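
A client that receives an outline like the one above might walk it in template order. The sketch below assumes the outline is available to the caller as parsed JSON, which this page does not state; the helper name pending_sections is hypothetical.

```python
outline = {
    "outline_id": "outline-uuid",
    "template_id": "csr-template-uuid",
    "sections": [
        {"section_id": "section-2", "order_index": 1, "level": 1,
         "title": "Methodology", "status": "pending"},
        {"section_id": "section-1", "order_index": 0, "level": 1,
         "title": "Executive Summary", "status": "pending"},
    ],
}


def pending_sections(outline):
    """Return the titles of pending sections, sorted by order_index."""
    ordered = sorted(outline["sections"], key=lambda s: s["order_index"])
    return [s["title"] for s in ordered if s["status"] == "pending"]


titles = pending_sections(outline)
```

Sorting on order_index rather than relying on array order keeps the result stable even if sections arrive out of sequence.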

Step 5: Apply Extraction Rules

For each section, extraction rules are applied to find and process content:

Template Section: "Safety Analysis"
├─ Extraction Rule 1: Find adverse events table
│  └─ Source: safety_report.docx, Table 3
│  └─ Result: ✓ Found and extracted
├─ Extraction Rule 2: Extract safety conclusions
│  └─ Source: protocol.pdf, Section 5.2
│  └─ Result: ✓ Found and extracted
└─ Extraction Rule 3: Summarize safety profile
   └─ Source: Multiple sources
   └─ Result: ✓ Generated summary

Step 6: Populate Outline

Extracted content is populated into the outline:

{
  "section_id": "section-4",
  "title": "Safety Analysis",
  "extracted_content": {
    "adverse_events_table": [...],
    "safety_summary": "...",
    "confidence_score": 0.94
  },
  "sources": [
    "safety_report.docx",
    "protocol.pdf"
  ]
}

Step 7: Generate Document

The outline is converted to a formatted DOCX document:

Outline
├─ Formatting Rules Applied
├─ Styles Applied (fonts, colors, spacing)
├─ Headers and Footers Added
├─ Table of Contents Generated
├─ References Formatted
└─ Final DOCX File Created

Step 8: Return Result

The document is saved and its retrieval information is returned:

{
  "message": "Document generation complete",
  "document_id": "doc-xyz-789",
  "status": "Complete",
  "output_location": "s3://bucket/CSR_Final.docx",
  "file_name": "CSR_Final.docx"
}

Status Tracking

Monitor generation progress using the status endpoint:

GET /api/v1/documents/status/{task_id}

Status values:

Generating - Currently processing
Complete - Successfully finished
Failed - Error occurred

# Immediately after request
GET /api/v1/documents/status/celery-task-abc-123
→ status: "Generating"

# After processing
GET /api/v1/documents/status/celery-task-abc-123
→ status: "Complete"
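
The polling pattern above can be sketched in Python as a loop with a timeout. The function name wait_for_generation, the 5-second interval, and the 30-minute default timeout are assumptions (the timeout is loosely based on the processing times listed under Performance Considerations). The HTTP call is injected as fetch_status so that the loop itself stays independent of any particular client library.

```python
import time


def wait_for_generation(fetch_status, interval=5.0, timeout=1800.0,
                        sleep=time.sleep):
    """Poll until the task reports Complete or Failed, or the timeout elapses.

    fetch_status is any zero-argument callable returning the parsed JSON
    body of GET /api/v1/documents/status/{task_id} as a dict.
    """
    waited = 0.0
    while True:
        body = fetch_status()
        status = body.get("status")
        if status == "Complete":
            return body
        if status == "Failed":
            # Surface the server-provided error message when present.
            raise RuntimeError(body.get("error", "generation failed"))
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Unlike an unbounded loop, this version gives up after the timeout, so a stuck task does not block the caller indefinitely.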

Retrieval

Once complete, retrieve the document:

# Get document details
GET /api/v1/documents/{document_id}

# Response includes metadata, sections, and access information
{
  "document": {
    "document_id": "doc-xyz-789",
    "file_name": "CSR_Final.docx",
    "status": "Complete",
    "sections": [...],
    "s3_location": "s3://bucket/CSR_Final.docx"
  }
}

# Download via presigned URL
GET /api/v1/files/documents
→ Returns presigned S3 URLs for direct download

Generation Configuration

Document Selection

Optionally specify which sections to include:

{
  "generic_mrt_id": "template-uuid",
  "selected_section_ids": ["section-1", "section-3", "section-5"]
}

This generates only the specified sections, omitting others.

Document Instructions

Provide document-level instructions:

{
  "document_instructions": "Use company style guide. Emphasize safety data. Exclude preliminary findings."
}

Instructions are passed to the extraction rules and the formatting engine.

Style Guides

Apply a specific style guide:

{
  "style_guide_id": "company-style-2024"
}

Style guides control:

Font and font sizes
Colors and formatting
Section numbering style
Citation format
Table formatting
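
The three configuration options in this section could be merged into a generation request as sketched below. This assumes the optional fields travel in the same JSON body as the generation request, which this page implies but does not state outright; the helper name with_generation_options is hypothetical.

```python
def with_generation_options(payload, selected_section_ids=None,
                            document_instructions=None, style_guide_id=None):
    """Return a copy of payload with only the options that were provided."""
    options = {
        "selected_section_ids": selected_section_ids,
        "document_instructions": document_instructions,
        "style_guide_id": style_guide_id,
    }
    merged = dict(payload)  # shallow copy; the caller's dict is not mutated
    merged.update({k: v for k, v in options.items() if v is not None})
    return merged


base = {"document_type": "CSR", "generic_mrt_id": "template-uuid"}
configured = with_generation_options(
    base,
    selected_section_ids=["section-1", "section-3", "section-5"],
    style_guide_id="company-style-2024",
)
```

Options left as None are omitted from the body entirely rather than sent as null.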

Quality Assurance

Confidence Scores

Each extraction includes a confidence score (0-1):

Safety Data Extraction:
├─ Adverse Events: 0.96 (high confidence)
├─ Lab Findings: 0.89 (good confidence)
├─ Conclusions: 0.92 (high confidence)
└─ Overall Section: 0.92

High confidence (>0.9) indicates reliable extraction. Lower confidence may require manual review.
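
One way to act on these scores is to flag every extraction that does not clear the 0.9 mark for manual review. The sketch below uses the example values above; the function name needs_review and the flat name-to-score mapping are assumptions, not part of the API.

```python
REVIEW_THRESHOLD = 0.9  # "high confidence" is described above as > 0.9


def needs_review(scores, threshold=REVIEW_THRESHOLD):
    """Return the names of extractions whose score is not above threshold."""
    return sorted(name for name, score in scores.items() if score <= threshold)


scores = {
    "Adverse Events": 0.96,
    "Lab Findings": 0.89,
    "Conclusions": 0.92,
}
flagged = needs_review(scores)
```

With the example values, only the Lab Findings extraction falls at or below the threshold.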

Content Validation

Rules are applied to validate extracted content:

Completeness - All required sections present
Consistency - Data consistent across document
Compliance - Meets regulatory requirements
Format - Proper structure and formatting
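
The completeness rule, for instance, could be approximated on the client side as shown below. This is an illustrative sketch only: it is not the service's own validation logic, and the helper name missing_sections and the input shapes are assumptions modeled on the populated-outline example in Step 6.

```python
def missing_sections(required_titles, populated_sections):
    """Return required titles that have no populated section, in order."""
    present = {
        section["title"]
        for section in populated_sections
        if section.get("extracted_content")  # empty or absent counts as missing
    }
    return [title for title in required_titles if title not in present]


populated = [
    {"title": "Methodology", "extracted_content": {"summary": "..."}},
    {"title": "Safety Analysis", "extracted_content": {}},
]
missing = missing_sections(["Methodology", "Safety Analysis"], populated)
```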

Error Handling

If generation fails:

GET /api/v1/documents/status/{task_id}
→ status: "Failed"
→ error: "Missing required section: Safety Analysis"

Common causes:

Missing source documents
Template not found
Invalid extraction rules
Insufficient data in sources
Processing timeout

Troubleshooting:

Verify all source files were uploaded
Confirm template ID is correct
Check that source documents contain required data
Review extraction rule configuration
Try with smaller documents first
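
For transient causes such as a processing timeout, resubmitting may be enough. The sketch below shows a generic retry with exponential backoff; the name retry, the attempt count, and the delays are assumptions. Retrying will not help with causes like a wrong template ID or missing source data, which need to be corrected first.

```python
import time


def retry(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call operation(); on an exception, back off and try again.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would typically wrap the submit-and-poll sequence in a single operation, so that each retry starts a fresh generation task.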

Typical Workflow

#!/bin/bash

# 1. Upload source documents
curl -X POST "$API/api/v1/files/upload" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file_name=protocol.pdf" \
  -F "file=@protocol.pdf" \
  -F "container=documents"

# 2. Request generation
RESPONSE=$(curl -s -X POST "$API/api/v1/documents/generate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type": "CSR",
    "file_paths": ["s3://bucket/protocol.pdf"],
    "document_set_key": "project-2024",
    "document_set_name": "Q1 CSR",
    "generic_mrt_id": "template-uuid",
    "output_name": "CSR_Final.docx"
  }')

TASK_ID=$(echo "$RESPONSE" | jq -r '.task_id')

# 3. Poll status
while true; do
  STATUS=$(curl -s -X GET "$API/api/v1/documents/status/$TASK_ID" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.status')

  [ "$STATUS" = "Complete" ] && break
  [ "$STATUS" = "Failed" ] && exit 1

  sleep 5
done

# 4. Retrieve and download
curl -X GET "$API/api/v1/files/documents" \
  -H "Authorization: Bearer $TOKEN" | jq '.files'

Performance Considerations

Processing Time

Typical processing times:

Simple documents (single source): 2-5 minutes
Complex documents (multiple sources): 5-15 minutes
Large datasets: 15-30+ minutes

Time depends on:

Source document size
Number of sections
Complexity of extraction rules
Available processing resources

Limits

Max file size: 100 MB per document
Max sections: 100 per template
Max extraction rules: 500 per template
Max concurrent generations: 10 per organization
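
A pre-flight check against the file size limit can save a failed upload. The sketch below assumes "100 MB" means 100 * 1024 * 1024 bytes, which this page does not specify; the helper name oversized_files is hypothetical.

```python
import os

# Assumption: the documented 100 MB limit is interpreted as mebibytes.
MAX_FILE_BYTES = 100 * 1024 * 1024


def oversized_files(paths, limit=MAX_FILE_BYTES, getsize=os.path.getsize):
    """Return the paths whose size in bytes exceeds limit."""
    return [path for path in paths if getsize(path) > limit]
```

The getsize parameter defaults to os.path.getsize for local files and can be swapped out when sizes come from elsewhere, such as object-store metadata.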

Best Practices

Organize Sources - Ensure source documents are well-structured
Test Rules - Validate extraction rules on small samples first
Monitor Progress - Use status polling to track generation
Handle Errors - Implement error handling and retry logic
Archive Results - Keep generated documents for compliance
Version Control - Track template versions and changes

Related Topics

MRT Workflow - Template structure and concepts
Async Operations - How async processing works
Documents API - API documentation
Status Polling - Track generation
