# Document Generation

> Understanding the document generation process

Document generation in Artos is an automated process that transforms source documents into professionally formatted regulatory documents using templates.

## Overview

The generation process has six stages:

1. **Extract** - Extract content from source documents
2. **Classify** - Classify extracted content by document type
3. **Ingest** - Organize classified content for processing
4. **Outline** - Create document outline from template
5. **Orchestrate** - Apply extraction rules and processing
6. **Generate** - Produce final formatted document

## The Generation Pipeline

### Step 1: Request Generation

Submit a generation request with source documents and template:

```bash theme={null}
POST /api/v1/documents/generate
{
  "document_type": "CSR",
  "file_paths": ["org-id/documents/protocol.pdf", "org-id/documents/data.xlsx"],
  "document_set_key": "project-2024",
  "document_set_name": "Q1 CSR Documents",
  "generic_mrt_id": "csr-template-uuid",
  "output_name": "CSR_Final.docx"
}
```

**Returns**: `202 Accepted` with a task ID

```json theme={null}
{
  "message": "Document generation started",
  "task_id": "celery-task-abc-123"
}
```
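As a rough client-side sketch, the request above could be assembled in Python with only the standard library. The host `api.example.invalid` and the helper names `build_generation_request` and `parse_task_id` are hypothetical; bearer-token authentication is taken from the workflow script later on this page. The request is built but not sent.

```python
import json
import urllib.request


def build_generation_request(base_url, token, payload):
    """Build (but do not send) the POST request from Step 1."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/documents/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_task_id(response_body):
    """Extract the task ID from the 202 Accepted response body."""
    return json.loads(response_body)["task_id"]


payload = {
    "document_type": "CSR",
    "file_paths": ["org-id/documents/protocol.pdf", "org-id/documents/data.xlsx"],
    "document_set_key": "project-2024",
    "document_set_name": "Q1 CSR Documents",
    "generic_mrt_id": "csr-template-uuid",
    "output_name": "CSR_Final.docx",
}
# api.example.invalid is a placeholder host, not the real Artos base URL.
request = build_generation_request("https://api.example.invalid", "TOKEN", payload)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything is submitted.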

### Step 2: Extract Content

The system extracts content from source documents:

```
Source Documents:
├─ protocol.pdf (document structure, sections, text)
├─ safety_report.docx (safety data, tables, findings)
└─ efficacy_data.xlsx (results, analysis, statistics)

↓ Extraction Engine

Extracted Content:
├─ Document Type: CSR
├─ Identified Sections:
│  ├─ Introduction/Background
│  ├─ Methodology/Study Design
│  ├─ Results
│  ├─ Safety Analysis
│  ├─ Efficacy Analysis
│  └─ Conclusions
├─ Structured Data:
│  ├─ Tables (efficacy, safety)
│  ├─ Key Statistics
│  └─ Findings
└─ Metadata:
   ├─ Source Document Mapping
   ├─ Confidence Scores
   └─ Processing Metadata
```

### Step 3: Classify Content

Content is automatically classified based on:

* Document structure
* Section headers
* Content type indicators
* Extraction rules

```
Extracted Content
↓ Classification Rules
├─ Safety Data: 95% confidence
├─ Efficacy Data: 92% confidence
├─ Methodology: 89% confidence
└─ Background: 96% confidence
```

### Step 4: Create Outline

The outline is generated from the template:

```json theme={null}
{
  "outline_id": "outline-uuid",
  "template_id": "csr-template-uuid",
  "sections": [
    {
      "section_id": "section-1",
      "order_index": 0,
      "level": 1,
      "title": "Executive Summary",
      "status": "pending"
    },
    {
      "section_id": "section-2",
      "order_index": 1,
      "level": 1,
      "title": "Methodology",
      "status": "pending"
    }
  ]
}
```
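To illustrate how the `order_index` and `level` fields might be consumed, here is a small sketch that turns an outline like the one above into an indented listing. The function name `render_outline` is hypothetical, and treating `level` as a 1-based nesting depth is an assumption.

```python
def render_outline(outline):
    """Return one indented line per section, ordered by order_index."""
    sections = sorted(outline["sections"], key=lambda s: s["order_index"])
    lines = []
    for section in sections:
        # Assumption: level 1 is a top-level section, level 2 is nested once, etc.
        indent = "  " * (section["level"] - 1)
        lines.append(f'{indent}{section["title"]} [{section["status"]}]')
    return lines


outline = {
    "outline_id": "outline-uuid",
    "template_id": "csr-template-uuid",
    "sections": [
        {"section_id": "section-2", "order_index": 1, "level": 1,
         "title": "Methodology", "status": "pending"},
        {"section_id": "section-1", "order_index": 0, "level": 1,
         "title": "Executive Summary", "status": "pending"},
    ],
}

for line in render_outline(outline):
    print(line)
# Executive Summary [pending]
# Methodology [pending]
```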

### Step 5: Apply Extraction Rules

For each section, extraction rules are applied to find and process content:

```
Template Section: "Safety Analysis"
├─ Extraction Rule 1: Find adverse events table
│  └─ Source: safety_report.docx, Table 3
│  └─ Result: ✓ Found and extracted
├─ Extraction Rule 2: Extract safety conclusions
│  └─ Source: protocol.pdf, Section 5.2
│  └─ Result: ✓ Found and extracted
└─ Extraction Rule 3: Summarize safety profile
   └─ Source: Multiple sources
   └─ Result: ✓ Generated summary
```

### Step 6: Populate Outline

Extracted content is populated into the outline:

```json theme={null}
{
  "section_id": "section-4",
  "title": "Safety Analysis",
  "extracted_content": {
    "adverse_events_table": [...],
    "safety_summary": "...",
    "confidence_score": 0.94
  },
  "sources": [
    "safety_report.docx",
    "protocol.pdf"
  ]
}
```

### Step 7: Generate Document

The outline is converted to a formatted DOCX document:

```
Outline
├─ Formatting Rules Applied
├─ Styles Applied (fonts, colors, spacing)
├─ Headers and Footers Added
├─ Table of Contents Generated
├─ References Formatted
└─ Final DOCX File Created
```

### Step 8: Return Result

The document is saved and retrieval information is returned:

```json theme={null}
{
  "message": "Document generation complete",
  "document_id": "doc-xyz-789",
  "status": "Complete",
  "output_location": "org-id/output/CSR_Final.docx",
  "file_name": "CSR_Final.docx"
}
```

## Status Tracking

Monitor generation progress using the status endpoint:

```bash theme={null}
GET /api/v1/documents/status/{task_id}
```

Status values:

* **Pending** - Accepted but not yet picked up by a worker
* **Ingesting** - Source documents are being ingested
* **Generating** - Document content is being generated
* **Ready** - Successfully finished
* **Failed** - Error occurred

```bash theme={null}
# Immediately after request
GET /api/v1/documents/status/{task_id}
→ status: "Generating"

# After processing
GET /api/v1/documents/status/{task_id}
→ status: "Ready"
```
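A client may find it convenient to model the status values listed above as an enumeration, so that polling code can tell terminal states from in-progress ones. This is a minimal sketch; the names `GenerationStatus` and `is_terminal` are hypothetical, and the set of values is taken only from the list in this section.

```python
from enum import Enum


class GenerationStatus(str, Enum):
    PENDING = "Pending"
    INGESTING = "Ingesting"
    GENERATING = "Generating"
    READY = "Ready"
    FAILED = "Failed"


# States after which further polling is pointless.
TERMINAL = frozenset({GenerationStatus.READY, GenerationStatus.FAILED})


def is_terminal(raw_status):
    """True once polling can stop. Values outside the list raise ValueError."""
    return GenerationStatus(raw_status) in TERMINAL
```

Raising on an unrecognized value is a deliberate choice here: it surfaces a vocabulary mismatch immediately instead of letting a poll loop run indefinitely.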

## Retrieval

Once complete, retrieve the document:

```bash theme={null}
# Get document details
GET /api/v1/documents/{document_id}

# Response includes metadata, sections, and access information
{
  "document": {
    "document_id": "doc-xyz-789",
    "file_name": "CSR_Final.docx",
    "status": "Complete",
    "sections": [...],
    "s3_location": "org-id/output/CSR_Final.docx"
  }
}

# Download via presigned URL
GET /api/v1/files/documents
→ Returns presigned S3 URLs for direct download
```

## Generation Configuration

### Document Selection

Optionally specify which sections to include:

```json theme={null}
{
  "generic_mrt_id": "template-uuid",
  "selected_section_ids": ["section-1", "section-3", "section-5"]
}
```

This generates only the specified sections, omitting others.
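Before submitting a selection, a client could check the requested IDs against the sections it knows the template contains. The sketch below assumes the caller already has the outline's section IDs; the helper name `build_section_selection` is hypothetical.

```python
def build_section_selection(template_id, outline_section_ids, wanted_ids):
    """Return the request fragment, rejecting IDs that are not in the outline."""
    known = set(outline_section_ids)
    unknown = [sid for sid in wanted_ids if sid not in known]
    if unknown:
        raise ValueError(f"Section IDs not in the template outline: {unknown}")
    return {
        "generic_mrt_id": template_id,
        "selected_section_ids": list(wanted_ids),
    }


fragment = build_section_selection(
    "template-uuid",
    ["section-1", "section-2", "section-3", "section-4", "section-5"],
    ["section-1", "section-3", "section-5"],
)
```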

### Document Instructions

Provide document-level instructions:

```json theme={null}
{
  "document_instructions": "Use company style guide. Emphasize safety data. Exclude preliminary findings."
}
```

Instructions are passed to the extraction rules and the formatting engine.

### Style Guides

Apply a specific style guide:

```json theme={null}
{
  "style_guide_id": "company-style-2024"
}
```

Style guides control:

* Font and font sizes
* Colors and formatting
* Section numbering style
* Citation format
* Table formatting

## Quality Assurance

### Confidence Scores

Each extraction includes a confidence score (0-1):

```
Safety Data Extraction:
├─ Adverse Events: 0.96 (high confidence)
├─ Lab Findings: 0.89 (good confidence)
├─ Conclusions: 0.92 (high confidence)
└─ Overall Section: 0.92
```

A score above 0.9 indicates a reliable extraction.
Lower scores may require manual review.
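The 0.9 threshold can be applied mechanically to decide which extractions to route to a reviewer. The sketch below uses the scores from the example above. The helper names are hypothetical, and the unweighted mean in `section_score` is an assumption: this page does not say how the overall section score is computed, although the mean does reproduce the 0.92 shown.

```python
REVIEW_THRESHOLD = 0.9  # from the text: scores above 0.9 are treated as reliable


def needs_review(scores, threshold=REVIEW_THRESHOLD):
    """Return the names of extractions whose score is at or below the threshold."""
    return sorted(name for name, score in scores.items() if score <= threshold)


def section_score(scores):
    """Unweighted mean of the item scores, rounded to two decimals (an assumption)."""
    return round(sum(scores.values()) / len(scores), 2)


safety_scores = {"Adverse Events": 0.96, "Lab Findings": 0.89, "Conclusions": 0.92}
print(needs_review(safety_scores))   # ['Lab Findings']
print(section_score(safety_scores))  # 0.92
```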

### Content Validation

Rules are applied to validate extracted content:

* **Completeness** - All required sections present
* **Consistency** - Data consistent across document
* **Compliance** - Meets regulatory requirements
* **Format** - Proper structure and formatting

## Error Handling

If generation fails:

```bash theme={null}
GET /api/v1/documents/status/{task_id}
→ status: "Failed"
→ error: "Missing required section: Safety Analysis"
```

Common causes:

* Missing source documents
* Template not found
* Invalid extraction rules
* Insufficient data in sources
* Processing timeout

Troubleshooting:

1. Verify all source files were uploaded
2. Confirm template ID is correct
3. Check that source documents contain required data
4. Review extraction rule configuration
5. Try with smaller documents first
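On the client side, a failed status can be turned into an exception so that it is not silently ignored. This sketch assumes the status response is a JSON object with top-level `status` and `error` keys, as the example above suggests; the class and function names are hypothetical.

```python
class GenerationFailed(Exception):
    """Raised when the status response reports a failed generation."""


def raise_if_failed(status_body):
    """Return the body unchanged, or raise if it reports a failure."""
    if status_body.get("status") == "Failed":
        # Assumption: the failure reason is carried in an "error" field.
        raise GenerationFailed(status_body.get("error", "no error message returned"))
    return status_body


try:
    raise_if_failed({"status": "Failed",
                     "error": "Missing required section: Safety Analysis"})
except GenerationFailed as exc:
    print(f"Generation failed: {exc}")
# Generation failed: Missing required section: Safety Analysis
```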

## Typical Workflow

```bash theme={null}
#!/bin/bash

# 1. Upload source documents
curl -X POST "$API/api/v1/files/upload" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file_name=protocol.pdf" \
  -F "file_content=@protocol.pdf" \
  -F "container=documents"

# 2. Request generation (-s suppresses curl's progress meter)
RESPONSE=$(curl -s -X POST "$API/api/v1/documents/generate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type": "CSR",
    "file_paths": ["org-id/documents/protocol.pdf"],
    "document_set_key": "project-2024",
    "document_set_name": "Q1 CSR",
    "generic_mrt_id": "template-uuid",
    "output_name": "CSR_Final.docx"
  }')

TASK_ID=$(echo "$RESPONSE" | jq -r '.task_id')

# 3. Poll status until the task reaches a terminal state
while true; do
  STATUS=$(curl -s -X GET "$API/api/v1/documents/status/$TASK_ID" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.status')

  # "Ready" is the success value in the status list; "Complete" is also
  # accepted because the document responses on this page use it.
  case "$STATUS" in
    Ready|Complete) break ;;
    Failed) echo "Generation failed" >&2; exit 1 ;;
  esac

  sleep 5
done

# 4. Retrieve and download
curl -X GET "$API/api/v1/files/documents" \
  -H "Authorization: Bearer $TOKEN" | jq '.files'
```
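The polling step of this workflow can also be written with an upper bound on how long to wait. The following Python sketch takes a caller-supplied `fetch_status` function (hypothetical; it would wrap the status endpoint) and treats `Ready` as the success value from the status list. The 30-minute default timeout is an assumption, not a documented limit.

```python
import time


def wait_for_generation(fetch_status, *, timeout_s=1800.0, interval_s=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until "Ready"; raise on "Failed" or on timeout."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "Ready":
            return status
        if status == "Failed":
            raise RuntimeError("Document generation failed")
        if clock() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real waiting.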

## Performance Considerations

### Processing Time

Typical processing times:

* **Simple documents** (single source): 2-5 minutes
* **Complex documents** (multiple sources): 5-15 minutes
* **Large datasets**: 15-30+ minutes

Time depends on:

* Source document size
* Number of sections
* Complexity of extraction rules
* Available processing resources

### Limits

* **Max file size**: 100 MB per document
* **Max sections**: 100 per template
* **Max extraction rules**: 500 per template
* **Max concurrent generations**: 10 per organization
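These limits can be checked locally before a request is submitted, which avoids waiting for a failure that could have been predicted. The sketch below is a hypothetical pre-flight check; interpreting 100 MB as 100 × 1024 × 1024 bytes is an assumption, and the concurrency limit is left out because it cannot be verified from the client alone.

```python
MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB; binary megabytes is an assumption
MAX_SECTIONS = 100
MAX_RULES = 500


def check_limits(file_sizes, section_count, rule_count):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 100 MB file limit")
    if section_count > MAX_SECTIONS:
        problems.append(f"{section_count} sections exceeds the limit of {MAX_SECTIONS}")
    if rule_count > MAX_RULES:
        problems.append(f"{rule_count} extraction rules exceeds the limit of {MAX_RULES}")
    return problems
```

In practice, `file_sizes` could be filled from `os.path.getsize` for each source file before upload.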

## Best Practices

1. **Organize Sources** - Ensure source documents are well-structured
2. **Test Rules** - Validate extraction rules on small samples first
3. **Monitor Progress** - Use status polling to track generation
4. **Handle Errors** - Implement error handling and retry logic
5. **Archive Results** - Keep generated documents for compliance
6. **Version Control** - Track template versions and changes
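For the error-handling practice above, one common pattern is retrying transient failures with exponential backoff. The sketch below is generic Python, not part of the Artos API: `with_retries` is a hypothetical helper, and which exceptions count as transient is an assumption the caller should adjust.

```python
import time


def with_retries(operation, *, attempts=3, base_delay_s=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Retrying is safest for read-only calls such as status checks. Retrying a generation request may start a second generation if the first request actually succeeded, so it is worth confirming how the API handles duplicates before wrapping that call.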

## Related Topics

* **[Template Workflow](mrt-workflow)** - Template structure and concepts
* **[Async Operations](async-operations)** - How async processing works
* **[Documents API](../api-reference/documents)** - API documentation
* **[Status Polling](../api-reference/documents#get-document-status)** - Track generation