Document Generation

Document generation in Artos is an automated process that transforms source documents into professionally formatted regulatory documents using MRT templates.

Overview

The document generation process:
  1. Extract - Pull content from source documents
  2. Classify - Label extracted content by document type
  3. Ingest - Organize classified content for processing
  4. Outline - Create document outline from MRT template
  5. Orchestrate - Apply extraction rules and processing
  6. Generate - Produce final formatted document

The Generation Pipeline

Step 1: Request Generation

Submit a generation request with source documents and template:
POST /api/v1/documents/generate
{
  "document_type": "CSR",
  "file_paths": ["s3://bucket/protocol.pdf", "s3://bucket/data.xlsx"],
  "document_set_key": "project-2024",
  "document_set_name": "Q1 CSR Documents",
  "generic_mrt_id": "csr-template-uuid",
  "output_name": "CSR_Final.docx"
}
Returns: 202 Accepted with task ID
{
  "message": "Document generation started",
  "task_id": "celery-task-abc-123"
}
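
As a minimal client-side sketch of the request above, the following Python builds the POST without sending it. The base URL, the bearer-token header (borrowed from the workflow script later on this page), and the helper name build_generate_request are assumptions for illustration, not part of the documented API.

```python
import json
import urllib.request


def build_generate_request(base_url, token, payload):
    """Build (but do not send) the POST /api/v1/documents/generate request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/documents/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # auth scheme assumed
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "document_type": "CSR",
    "file_paths": ["s3://bucket/protocol.pdf", "s3://bucket/data.xlsx"],
    "document_set_key": "project-2024",
    "document_set_name": "Q1 CSR Documents",
    "generic_mrt_id": "csr-template-uuid",
    "output_name": "CSR_Final.docx",
}
# "https://artos.example.com" is a placeholder host, not a real endpoint.
request = build_generate_request("https://artos.example.com", "TOKEN", payload)

# To send it: urllib.request.urlopen(request), then read "task_id" from the
# JSON body of the 202 response.
```

Keeping request construction separate from sending makes the payload easy to inspect or log before a long-running generation is started.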

Step 2: Extract Content

The system extracts content from source documents:
Source Documents:
├─ protocol.pdf (document structure, sections, text)
├─ safety_report.docx (safety data, tables, findings)
└─ efficacy_data.xlsx (results, analysis, statistics)

↓ Extraction Engine

Extracted Content:
├─ Document Type: CSR
├─ Identified Sections:
│  ├─ Introduction/Background
│  ├─ Methodology/Study Design
│  ├─ Results
│  ├─ Safety Analysis
│  ├─ Efficacy Analysis
│  └─ Conclusions
├─ Structured Data:
│  ├─ Tables (efficacy, safety)
│  ├─ Key Statistics
│  └─ Findings
└─ Metadata:
   ├─ Source Document Mapping
   ├─ Confidence Scores
   └─ Processing Metadata

Step 3: Classify Content

Content is automatically classified based on:
  • Document structure
  • Section headers
  • Content type indicators
  • Extraction rules
Extracted Content
↓ Classification Rules
├─ Safety Data: 95% confidence
├─ Efficacy Data: 92% confidence
├─ Methodology: 89% confidence
└─ Background: 96% confidence

Step 4: Create Outline

The outline is generated from the MRT template:
{
  "outline_id": "outline-uuid",
  "template_id": "csr-template-uuid",
  "sections": [
    {
      "section_id": "section-1",
      "order_index": 0,
      "level": 1,
      "title": "Executive Summary",
      "status": "pending"
    },
    {
      "section_id": "section-2",
      "order_index": 1,
      "level": 1,
      "title": "Methodology",
      "status": "pending"
    }
  ]
}
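
To illustrate how order_index and level can be read together, here is a small sketch that prints an outline as an indented list. It assumes only the fields shown in the JSON above; the helper name render_outline is hypothetical.

```python
def render_outline(outline):
    """Return one indented line per section, ordered by order_index."""
    sections = sorted(outline["sections"], key=lambda s: s["order_index"])
    lines = []
    for section in sections:
        indent = "  " * (section["level"] - 1)  # level 1 is treated as top level
        lines.append(f"{indent}{section['title']} [{section['status']}]")
    return lines


outline = {
    "outline_id": "outline-uuid",
    "template_id": "csr-template-uuid",
    "sections": [
        # Deliberately out of order to show that order_index drives the result.
        {"section_id": "section-2", "order_index": 1, "level": 1,
         "title": "Methodology", "status": "pending"},
        {"section_id": "section-1", "order_index": 0, "level": 1,
         "title": "Executive Summary", "status": "pending"},
    ],
}
for line in render_outline(outline):
    print(line)
# Executive Summary [pending]
# Methodology [pending]
```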

Step 5: Apply Extraction Rules

For each section, extraction rules are applied to find and process content:
Template Section: "Safety Analysis"
├─ Extraction Rule 1: Find adverse events table
│  └─ Source: safety_report.docx, Table 3
│  └─ Result: ✓ Found and extracted
├─ Extraction Rule 2: Extract safety conclusions
│  └─ Source: protocol.pdf, Section 5.2
│  └─ Result: ✓ Found and extracted
└─ Extraction Rule 3: Summarize safety profile
   └─ Source: Multiple sources
   └─ Result: ✓ Generated summary

Step 6: Populate Outline

The outline is populated with the extracted content:
{
  "section_id": "section-4",
  "title": "Safety Analysis",
  "extracted_content": {
    "adverse_events_table": [...],
    "safety_summary": "...",
    "confidence_score": 0.94
  },
  "sources": [
    "safety_report.docx",
    "protocol.pdf"
  ]
}
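
The population step can be pictured as a merge keyed on section_id. The sketch below is one way to model it on the client side, not the service's actual implementation; the helper name populate_outline and the "populated" status value are hypothetical.

```python
def populate_outline(outline, populated_sections):
    """Attach extracted content to matching outline sections by section_id.

    Returns a new outline; sections without extracted content are unchanged.
    """
    by_id = {p["section_id"]: p for p in populated_sections}
    merged = []
    for section in outline["sections"]:
        match = by_id.get(section["section_id"])
        if match is None:
            merged.append(dict(section))
            continue
        merged.append({
            **section,
            "extracted_content": match["extracted_content"],
            "sources": match["sources"],
            "status": "populated",  # hypothetical status value
        })
    return {**outline, "sections": merged}


outline = {
    "outline_id": "outline-uuid",
    "sections": [
        {"section_id": "section-1", "title": "Executive Summary", "status": "pending"},
        {"section_id": "section-4", "title": "Safety Analysis", "status": "pending"},
    ],
}
populated = [{
    "section_id": "section-4",
    "title": "Safety Analysis",
    # Table rows and summary text are elided in the example above; placeholders here.
    "extracted_content": {"adverse_events_table": [], "safety_summary": "...",
                          "confidence_score": 0.94},
    "sources": ["safety_report.docx", "protocol.pdf"],
}]
result = populate_outline(outline, populated)
print([s["status"] for s in result["sections"]])  # ['pending', 'populated']
```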

Step 7: Generate Document

The outline is converted to a formatted DOCX document:
Outline
├─ Formatting Rules Applied
├─ Styles Applied (fonts, colors, spacing)
├─ Headers and Footers Added
├─ Table of Contents Generated
├─ References Formatted
└─ Final DOCX File Created

Step 8: Return Result

The document is saved and its retrieval information is returned:
{
  "message": "Document generation complete",
  "document_id": "doc-xyz-789",
  "status": "Complete",
  "output_location": "s3://bucket/CSR_Final.docx",
  "file_name": "CSR_Final.docx"
}

Status Tracking

Monitor generation progress using the status endpoint:
GET /api/v1/documents/status/{task_id}
Status values:
  • Generating - Currently processing
  • Complete - Successfully finished
  • Failed - Error occurred
# Immediately after request
GET /api/v1/documents/status/celery-task-abc-123
 status: "Generating"

# After processing
GET /api/v1/documents/status/celery-task-abc-123
 status: "Complete"
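
A polling loop over this endpoint might look like the sketch below. The HTTP call is injected as fetch_status so the loop can be exercised without a server; the function name, the 5-second interval, and the 30-minute default timeout are assumptions chosen for illustration.

```python
import time


def wait_for_generation(fetch_status, task_id, interval=5.0, timeout=1800.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the task is Complete or Failed, or the timeout elapses.

    fetch_status(task_id) must return the decoded JSON body of
    GET /api/v1/documents/status/{task_id}.
    """
    deadline = clock() + timeout
    while True:
        body = fetch_status(task_id)
        status = body.get("status")
        if status == "Complete":
            return body
        if status == "Failed":
            raise RuntimeError(body.get("error", "generation failed"))
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} still {status!r} after {timeout}s")
        sleep(interval)


# Usage with a stubbed fetcher that reports Generating twice, then Complete.
responses = iter([{"status": "Generating"}, {"status": "Generating"},
                  {"status": "Complete"}])
result = wait_for_generation(lambda _task_id: next(responses),
                             "celery-task-abc-123", sleep=lambda _seconds: None)
print(result["status"])  # Complete
```

Bounding the loop with a timeout avoids waiting forever if a task never leaves the Generating state.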

Retrieval

Once complete, retrieve the document:
# Get document details
GET /api/v1/documents/{document_id}

# Response includes metadata, sections, and access information
{
  "document": {
    "document_id": "doc-xyz-789",
    "file_name": "CSR_Final.docx",
    "status": "Complete",
    "sections": [...],
    "s3_location": "s3://bucket/CSR_Final.docx"
  }
}

# Download via presigned URL
GET /api/v1/files/documents
 Returns presigned S3 URLs for direct download
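
If a client needs the bucket and object key from s3_location (for logging or for matching against file listings), the value can be split with the standard library. This is a sketch that assumes locations always have the s3://bucket/key form shown above; split_s3_location is a hypothetical helper.

```python
from urllib.parse import urlparse


def split_s3_location(location):
    """Split "s3://bucket/key" into (bucket, key)."""
    parsed = urlparse(location)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 location: {location!r}")
    return parsed.netloc, parsed.path.lstrip("/")


document = {
    "document_id": "doc-xyz-789",
    "file_name": "CSR_Final.docx",
    "status": "Complete",
    "s3_location": "s3://bucket/CSR_Final.docx",
}
print(split_s3_location(document["s3_location"]))  # ('bucket', 'CSR_Final.docx')
```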

Generation Configuration

Section Selection

Optionally specify which sections to include:
{
  "generic_mrt_id": "template-uuid",
  "selected_section_ids": ["section-1", "section-3", "section-5"]
}
This generates only the specified sections, omitting others.
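
The effect of selected_section_ids can be modelled as a filter over the template's sections. The sketch below illustrates the semantics on the client side only; how the service treats unknown IDs is not stated here, so raising an error is an assumption, as is the helper name select_sections.

```python
def select_sections(outline_sections, selected_section_ids=None):
    """Keep only the selected sections, preserving template order.

    With no selection (None), every section is kept.
    """
    if selected_section_ids is None:
        return list(outline_sections)
    wanted = set(selected_section_ids)
    unknown = wanted - {s["section_id"] for s in outline_sections}
    if unknown:
        # Assumed behavior: fail early rather than silently skip a typo.
        raise ValueError(f"unknown section ids: {sorted(unknown)}")
    return [s for s in outline_sections if s["section_id"] in wanted]


sections = [{"section_id": f"section-{n}"} for n in range(1, 6)]
kept = select_sections(sections, ["section-1", "section-3", "section-5"])
print([s["section_id"] for s in kept])
# ['section-1', 'section-3', 'section-5']
```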

Document Instructions

Provide document-level instructions:
{
  "document_instructions": "Use company style guide. Emphasize safety data. Exclude preliminary findings."
}
Instructions are passed to the extraction rules and the formatting engine.

Style Guides

Apply a specific style guide:
{
  "style_guide_id": "company-style-2024"
}
Style guides control:
  • Fonts and font sizes
  • Colors and formatting
  • Section numbering style
  • Citation format
  • Table formatting
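
The three configuration options in this section can be combined into one request body. The sketch below assumes they are sent as extra top-level fields of the generation request, which the fragments above suggest but do not state outright; with_generation_options is a hypothetical helper.

```python
def with_generation_options(payload, selected_section_ids=None,
                            document_instructions=None, style_guide_id=None):
    """Return a copy of payload with only the options that were provided."""
    options = {
        "selected_section_ids": selected_section_ids,
        "document_instructions": document_instructions,
        "style_guide_id": style_guide_id,
    }
    provided = {key: value for key, value in options.items() if value is not None}
    return {**payload, **provided}


base = {"generic_mrt_id": "template-uuid"}
configured = with_generation_options(
    base,
    selected_section_ids=["section-1", "section-3", "section-5"],
    style_guide_id="company-style-2024",
)
print(sorted(configured))
# ['generic_mrt_id', 'selected_section_ids', 'style_guide_id']
```

Omitting unset options, rather than sending null values, leaves the service free to apply its own defaults.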

Quality Assurance

Confidence Scores

Each extraction includes a confidence score (0-1):
Safety Data Extraction:
├─ Adverse Events: 0.96 (high confidence)
├─ Lab Findings: 0.89 (good confidence)
├─ Conclusions: 0.92 (high confidence)
└─ Overall Section: 0.92
Scores above 0.9 indicate high confidence and a generally reliable extraction. Lower scores may require manual review.
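
A review queue can be derived from these scores with a simple threshold check. In the sketch below, the overall section score is taken to be the mean of the item scores; that matches the example figures but is an assumption, and review_report is a hypothetical helper.

```python
from statistics import mean

REVIEW_THRESHOLD = 0.9  # scores above this are treated as high confidence


def review_report(scores, threshold=REVIEW_THRESHOLD):
    """Summarize per-item confidence scores for one section."""
    needs_review = sorted(name for name, score in scores.items()
                          if score <= threshold)
    return {"overall": round(mean(scores.values()), 2),
            "needs_review": needs_review}


report = review_report({"Adverse Events": 0.96, "Lab Findings": 0.89,
                        "Conclusions": 0.92})
print(report)  # {'overall': 0.92, 'needs_review': ['Lab Findings']}
```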

Content Validation

Rules are applied to validate extracted content:
  • Completeness - All required sections present
  • Consistency - Data consistent across document
  • Compliance - Meets regulatory requirements
  • Format - Proper structure and formatting
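
As an illustration of the completeness rule, the sketch below compares required section titles with the sections actually present. The list of required titles is supplied by the caller here; where the service stores that list is not described, and missing_sections is a hypothetical helper.

```python
def missing_sections(required_titles, document_sections):
    """Return required titles that have no matching section, in required order."""
    present = {section["title"] for section in document_sections}
    return [title for title in required_titles if title not in present]


required = ["Executive Summary", "Methodology", "Safety Analysis"]
sections = [{"title": "Executive Summary"}, {"title": "Methodology"}]
for title in missing_sections(required, sections):
    print(f"Missing required section: {title}")
# Missing required section: Safety Analysis
```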

Error Handling

If generation fails:
GET /api/v1/documents/status/{task_id}
 status: "Failed"
 error: "Missing required section: Safety Analysis"
Common causes:
  • Missing source documents
  • Template not found
  • Invalid extraction rules
  • Insufficient data in sources
  • Processing timeout
Troubleshooting:
  1. Verify all source files were uploaded
  2. Confirm template ID is correct
  3. Check that source documents contain required data
  4. Review extraction rule configuration
  5. Try with smaller documents first

Typical Workflow

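#!/bin/bash
set -euo pipefail

# Requires API (base URL) and TOKEN (bearer token) in the environment, plus jq.
: "${API:?API must be set}"
: "${TOKEN:?TOKEN must be set}"

# 1. Upload source documents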
curl -X POST "$API/api/v1/files/upload" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file_name=protocol.pdf" \
  -F "file=@protocol.pdf" \
  -F "container=documents"

# 2. Request generation
RESPONSE=$(curl -s -X POST "$API/api/v1/documents/generate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type": "CSR",
    "file_paths": ["s3://bucket/protocol.pdf"],
    "document_set_key": "project-2024",
    "document_set_name": "Q1 CSR",
    "generic_mrt_id": "template-uuid",
    "output_name": "CSR_Final.docx"
  }')

TASK_ID=$(echo "$RESPONSE" | jq -r '.task_id')

# 3. Poll status
while true; do
  STATUS=$(curl -s -X GET "$API/api/v1/documents/status/$TASK_ID" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.status')

  [ "$STATUS" = "Complete" ] && break
  [ "$STATUS" = "Failed" ] && { echo "Generation failed for task $TASK_ID" >&2; exit 1; }

  sleep 5
done

# 4. Retrieve and download
curl -X GET "$API/api/v1/files/documents" \
  -H "Authorization: Bearer $TOKEN" | jq '.files'

Performance Considerations

Processing Time

Typical processing times:
  • Simple documents (single source): 2-5 minutes
  • Complex documents (multiple sources): 5-15 minutes
  • Large datasets: 15-30+ minutes
Time depends on:
  • Source document size
  • Number of sections
  • Complexity of extraction rules
  • Available processing resources

Limits

  • Max file size: 100 MB per document
  • Max sections: 100 per template
  • Max extraction rules: 500 per template
  • Max concurrent generations: 10 per organization
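
These limits can be checked before a request is submitted. The sketch below covers the per-file, per-template section, and per-template rule limits; it assumes "100 MB" means 100 × 1024 × 1024 bytes, which the list does not specify, and limit_violations is a hypothetical helper.

```python
MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB per document (binary MB assumed)
MAX_SECTIONS = 100                  # per template
MAX_EXTRACTION_RULES = 500          # per template


def limit_violations(file_sizes, section_count, rule_count):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 100 MB file limit")
    if section_count > MAX_SECTIONS:
        problems.append(f"{section_count} sections exceeds the limit of {MAX_SECTIONS}")
    if rule_count > MAX_EXTRACTION_RULES:
        problems.append(
            f"{rule_count} extraction rules exceeds the limit of {MAX_EXTRACTION_RULES}")
    return problems


problems = limit_violations(
    {"protocol.pdf": 5 * 1024 * 1024, "data.xlsx": 150 * 1024 * 1024},
    section_count=120,
    rule_count=10,
)
for problem in problems:
    print(problem)
```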

Best Practices

  1. Organize Sources - Ensure source documents are well-structured
  2. Test Rules - Validate extraction rules on small samples first
  3. Monitor Progress - Use status polling to track generation
  4. Handle Errors - Implement error handling and retry logic
  5. Archive Results - Keep generated documents for compliance
  6. Version Control - Track template versions and changes
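
For the retry logic mentioned under Handle Errors, a small exponential-backoff wrapper is one common approach. The sketch below retries only on transient exceptions; which errors count as transient for this API is an assumption, and a task that ends in Failed because of a missing template or source file should be corrected rather than retried.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep,
          retry_on=(TimeoutError, ConnectionError)):
    """Call operation() up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay


# Usage with a stub that fails twice before succeeding.
calls = {"count": 0}


def flaky_submit():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network error")
    return "ok"


delays = []
outcome = retry(flaky_submit, attempts=3, base_delay=1.0, sleep=delays.append)
print(outcome, delays)  # ok [1.0, 2.0]
```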