Document Generation
Document generation in Artos is an automated process that transforms source documents into professionally formatted regulatory documents using templates.
Overview
The document generation process:
- Extract - Extract content from source documents
- Classify - Classify extracted content by document type
- Ingest - Organize classified content for processing
- Outline - Create document outline from template
- Orchestrate - Apply extraction rules and processing
- Generate - Produce final formatted document
The Generation Pipeline
Step 1: Request Generation
Submit a generation request that names the source documents and the template:
POST /api/v1/documents/generate
{
"document_type": "CSR",
"file_paths": ["org-id/documents/protocol.pdf", "org-id/documents/data.xlsx"],
"document_set_key": "project-2024",
"document_set_name": "Q1 CSR Documents",
"generic_mrt_id": "csr-template-uuid",
"output_name": "CSR_Final.docx"
}
Returns 202 Accepted with a task ID:
{
"message": "Document generation started",
"task_id": "celery-task-abc-123"
}
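A request body like the one above can be assembled in a small shell helper. This is a minimal sketch, not part of the Artos tooling: `build_generate_payload` is a hypothetical name, and the helper assumes that field values contain no quotes or backslashes, because it performs no JSON escaping.

```shell
# Build the JSON body for POST /api/v1/documents/generate.
# Hypothetical helper. Assumption: values need no JSON escaping.
# Usage: build_generate_payload TYPE SET_KEY SET_NAME TEMPLATE_ID OUTPUT_NAME FILE_PATH...
build_generate_payload() {
  local doc_type="$1" set_key="$2" set_name="$3" template_id="$4" output_name="$5"
  shift 5

  # Join the remaining arguments into a comma-separated list of JSON strings.
  local paths="" p
  for p in "$@"; do
    paths="${paths}${paths:+, }\"$p\""
  done

  printf '{"document_type": "%s", "file_paths": [%s], "document_set_key": "%s", "document_set_name": "%s", "generic_mrt_id": "%s", "output_name": "%s"}\n' \
    "$doc_type" "$paths" "$set_key" "$set_name" "$template_id" "$output_name"
}
```

The printed body can be passed to `curl -d`. For values that may contain special characters, a JSON-aware tool such as `jq` is the safer choice.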
Step 2: Extract Content
The system extracts content from source documents:
Source Documents:
├─ protocol.pdf (document structure, sections, text)
├─ safety_report.docx (safety data, tables, findings)
└─ efficacy_data.xlsx (results, analysis, statistics)
↓ Extraction Engine
Extracted Content:
├─ Document Type: CSR
├─ Identified Sections:
│ ├─ Introduction/Background
│ ├─ Methodology/Study Design
│ ├─ Results
│ ├─ Safety Analysis
│ ├─ Efficacy Analysis
│ └─ Conclusions
├─ Structured Data:
│ ├─ Tables (efficacy, safety)
│ ├─ Key Statistics
│ └─ Findings
└─ Metadata:
├─ Source Document Mapping
├─ Confidence Scores
└─ Processing Metadata
Step 3: Classify Content
Content is automatically classified based on:
- Document structure
- Section headers
- Content type indicators
- Extraction rules
Extracted Content
↓ Classification Rules
├─ Safety Data: 95% confidence
├─ Efficacy Data: 92% confidence
├─ Methodology: 89% confidence
└─ Background: 96% confidence
Step 4: Create Outline
The outline is generated from the template:
{
"outline_id": "outline-uuid",
"template_id": "csr-template-uuid",
"sections": [
{
"section_id": "section-1",
"order_index": 0,
"level": 1,
"title": "Executive Summary",
"status": "pending"
},
{
"section_id": "section-2",
"order_index": 1,
"level": 1,
"title": "Methodology",
"status": "pending"
}
]
}
Step 5: Apply Extraction Rules
For each section, extraction rules are applied to find and process content:
Template Section: "Safety Analysis"
├─ Extraction Rule 1: Find adverse events table
│ └─ Source: safety_report.docx, Table 3
│ └─ Result: ✓ Found and extracted
├─ Extraction Rule 2: Extract safety conclusions
│ └─ Source: protocol.pdf, Section 5.2
│ └─ Result: ✓ Found and extracted
└─ Extraction Rule 3: Summarize safety profile
└─ Source: Multiple sources
└─ Result: ✓ Generated summary
Step 6: Populate Outline
Extracted content is populated into the outline:
{
"section_id": "section-4",
"title": "Safety Analysis",
"extracted_content": {
"adverse_events_table": [...],
"safety_summary": "...",
"confidence_score": 0.94
},
"sources": [
"safety_report.docx",
"protocol.pdf"
]
}
Step 7: Generate Document
The outline is converted to a formatted DOCX document:
Outline
├─ Formatting Rules Applied
├─ Styles Applied (fonts, colors, spacing)
├─ Headers and Footers Added
├─ Table of Contents Generated
├─ References Formatted
└─ Final DOCX File Created
Step 8: Return Result
The document is saved and retrieval information is returned:
{
"message": "Document generation complete",
"document_id": "doc-xyz-789",
"status": "Complete",
"output_location": "org-id/output/CSR_Final.docx",
"file_name": "CSR_Final.docx"
}
Status Tracking
Monitor generation progress using the status endpoint:
GET /api/v1/documents/status/{task_id}
Status values:
- Pending - Accepted but not yet picked up by a worker
- Ingesting - Source documents are being ingested
- Generating - Document content is being generated
- Ready - Successfully finished
- Failed - Error occurred
# Immediately after request
GET /api/v1/documents/status/{task_id}
→ status: "Generating"
# After processing
GET /api/v1/documents/status/{task_id}
→ status: "Ready"
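A client usually wants to stop polling on either terminal status and to give up after a bounded number of attempts. The following is a minimal sketch under stated assumptions: `is_terminal_status` and `poll_until_done` are hypothetical names, and the first argument to `poll_until_done` is a caller-supplied command that prints the current status string (for example, a wrapper around the status endpoint).

```shell
# Return 0 if the status is one of the two terminal values listed above.
is_terminal_status() {
  case "$1" in
    Ready|Failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll until a terminal status is seen or max_attempts is reached.
# Prints the final status. Returns 0 for Ready, 1 for Failed, 2 on timeout.
# Usage: poll_until_done FETCH_COMMAND MAX_ATTEMPTS INTERVAL_SECONDS
poll_until_done() {
  local fetch_status="$1" max_attempts="$2" interval="$3"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$("$fetch_status")
    if is_terminal_status "$status"; then
      echo "$status"
      if [ "$status" = "Ready" ]; then return 0; else return 1; fi
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Timeout"
  return 2
}
```

Separating the fetch command from the loop keeps the loop testable without network access and lets the caller decide how authentication and JSON parsing are handled.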
Retrieval
Once generation is complete, retrieve the document:
# Get document details
GET /api/v1/documents/{document_id}
# Response includes metadata, sections, and access information
{
"document": {
"document_id": "doc-xyz-789",
"file_name": "CSR_Final.docx",
"status": "Complete",
"sections": [...],
"s3_location": "org-id/output/CSR_Final.docx"
}
}
# Download via presigned URL
GET /api/v1/files/documents
→ Returns presigned S3 URLs for direct download
Generation Configuration
Section Selection
Optionally specify which sections to include:
{
"generic_mrt_id": "template-uuid",
"selected_section_ids": ["section-1", "section-3", "section-5"]
}
This generates only the specified sections, omitting others.
Document Instructions
Provide document-level instructions:
{
"document_instructions": "Use company style guide. Emphasize safety data. Exclude preliminary findings."
}
Instructions are passed to the extraction rules and the formatting engine.
Style Guides
Apply a specific style guide:
{
"style_guide_id": "company-style-2024"
}
Style guides control:
- Fonts and font sizes
- Colors and formatting
- Section numbering style
- Citation format
- Table formatting
Quality Assurance
Confidence Scores
Each extraction includes a confidence score (0-1):
Safety Data Extraction:
├─ Adverse Events: 0.96 (high confidence)
├─ Lab Findings: 0.89 (good confidence)
├─ Conclusions: 0.92 (high confidence)
└─ Overall Section: 0.92
High confidence (>0.9) indicates reliable extraction.
Lower confidence may require manual review.
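The 0.9 threshold can be applied mechanically to pick out extractions for manual review. This is a sketch only: `needs_review` and `flag_low_confidence` are hypothetical names, the input format (one `label score` pair per line) is an assumption made for illustration, and `awk` is used because the shell has no floating-point comparison.

```shell
# Exit status 0 when the score is at or below the threshold (default 0.9),
# meaning the extraction is not in the "high confidence" band.
needs_review() {
  local score="$1" threshold="${2:-0.9}"
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 <= t + 0) }'
}

# Read "label score" pairs from stdin and print the labels that need review.
flag_low_confidence() {
  local threshold="${1:-0.9}" label score
  while read -r label score; do
    if needs_review "$score" "$threshold"; then
      echo "$label"
    fi
  done
}
```

With the scores from the example above, only the lab findings entry (0.89) falls at or below the threshold.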
Content Validation
Rules are applied to validate extracted content:
- Completeness - All required sections present
- Consistency - Data consistent across document
- Compliance - Meets regulatory requirements
- Format - Proper structure and formatting
Error Handling
If generation fails:
GET /api/v1/documents/status/{task_id}
→ status: "Failed"
→ error: "Missing required section: Safety Analysis"
Common causes:
- Missing source documents
- Template not found
- Invalid extraction rules
- Insufficient data in sources
- Processing timeout
Troubleshooting:
- Verify all source files were uploaded
- Confirm template ID is correct
- Check that source documents contain required data
- Review extraction rule configuration
- Try with smaller documents first
Typical Workflow
#!/bin/bash
# 1. Upload source documents
curl -X POST "$API/api/v1/files/upload" \
-H "Authorization: Bearer $TOKEN" \
-F "file_name=protocol.pdf" \
-F "file_content=@protocol.pdf" \
-F "container=documents"
# 2. Request generation
RESPONSE=$(curl -s -X POST "$API/api/v1/documents/generate" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"document_type": "CSR",
"file_paths": ["org-id/documents/protocol.pdf"],
"document_set_key": "project-2024",
"document_set_name": "Q1 CSR",
"generic_mrt_id": "template-uuid",
"output_name": "CSR_Final.docx"
}')
TASK_ID=$(echo "$RESPONSE" | jq -r '.task_id')
# 3. Poll status
while true; do
STATUS=$(curl -s -X GET "$API/api/v1/documents/status/$TASK_ID" \
-H "Authorization: Bearer $TOKEN" | jq -r '.status')
[ "$STATUS" = "Ready" ] && break
[ "$STATUS" = "Failed" ] && exit 1
sleep 5
done
# 4. Retrieve and download
curl -X GET "$API/api/v1/files/documents" \
-H "Authorization: Bearer $TOKEN" | jq '.files'
Processing Time
Typical processing times:
- Simple documents (single source): 2-5 minutes
- Complex documents (multiple sources): 5-15 minutes
- Large datasets: 15-30+ minutes
Time depends on:
- Source document size
- Number of sections
- Complexity of extraction rules
- Available processing resources
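These ranges can be turned into a polling budget so that a client does not wait indefinitely. The helper below is a hypothetical sketch: `poll_attempts_for_budget` is not an Artos function, and the choice of budget (for example, 30 minutes for large datasets) is left to the caller.

```shell
# Number of polls needed to cover a time budget at a fixed interval,
# rounded up so the budget is fully covered.
# Usage: poll_attempts_for_budget BUDGET_MINUTES INTERVAL_SECONDS
poll_attempts_for_budget() {
  local budget_minutes="$1" interval_seconds="$2"
  echo $(( (budget_minutes * 60 + interval_seconds - 1) / interval_seconds ))
}
```

For a 30-minute budget polled every 5 seconds, this yields 360 attempts.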
Limits
- Max file size: 100 MB per document
- Max sections: 100 per template
- Max extraction rules: 500 per template
- Max concurrent generations: 10 per organization
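The file size limit can be checked locally before uploading. The sketch below is illustrative: `check_source_files` and `MAX_FILE_BYTES` are hypothetical names, and it assumes that "100 MB" means 100 × 1024 × 1024 bytes, which the limits above do not specify.

```shell
# Assumption: "100 MB" is interpreted as 100 MiB.
MAX_FILE_BYTES=$((100 * 1024 * 1024))

# Print each path that is missing or larger than MAX_FILE_BYTES.
# Returns 1 if any problem was found, 0 otherwise.
check_source_files() {
  local rc=0 f size
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      rc=1
      continue
    fi
    # Arithmetic expansion strips any padding that wc adds on some systems.
    size=$(( $(wc -c < "$f") ))
    if [ "$size" -gt "$MAX_FILE_BYTES" ]; then
      echo "too large: $f"
      rc=1
    fi
  done
  return "$rc"
}
```

Running this before the upload step reports problems locally instead of through a failed generation.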
Best Practices
- Organize Sources - Ensure source documents are well-structured
- Test Rules - Validate extraction rules on small samples first
- Monitor Progress - Use status polling to track generation
- Handle Errors - Implement error handling and retry logic
- Archive Results - Keep generated documents for compliance
- Version Control - Track template versions and changes
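The retry advice above can be sketched as a generic wrapper with exponential backoff. `retry_with_backoff` is a hypothetical helper, not an Artos utility. It retries any command that exits non-zero, so it suits transient failures such as timeouts. A request that fails because of a missing template or invalid input will fail again on every attempt.

```shell
# Run a command up to max_attempts times, doubling the delay after each failure.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, `retry_with_backoff 3 2 curl -sf ...` would try the request up to three times, waiting 2 and then 4 seconds between attempts.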