Documents & Retrieval

Documents Management

List All Documents

Retrieve a list of all processed documents accessible to your user account.

Endpoint

GET /documents/all

Headers

Authorization: Bearer <access_token>
Content-Type: application/json

Query Parameters

Parameter	Type	Required	Description
`page`	Integer	No	Page number for pagination (default: 1)
`limit`	Integer	No	Number of documents per page (default: 20, max: 100)
`file_type`	String	No	Filter by file type (docx, html, rtf, csv)
`document_name`	String	No	Filter by document name (partial match)

Example Request

curl -X GET "https://api.yourdomain.com/documents/all?page=1&limit=20&file_type=docx" \
  -H "Authorization: Bearer your-access-token" \
  -H "Content-Type: application/json"
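The query parameters above can also be assembled in code. The following is a minimal sketch, not part of the API: `build_list_params` is a hypothetical helper that drops unset filters and checks values against the documented defaults and maximum before a request is sent. It assumes the four listed file types are the only valid ones; whether the server rejects or clamps out-of-range values is not stated, so failing early on the client is a design choice.

```python
from typing import Dict, Optional

# File types listed in the query parameter table (assumed to be exhaustive).
ALLOWED_FILE_TYPES = {"docx", "html", "rtf", "csv"}
MAX_LIMIT = 100  # documented maximum for `limit`


def build_list_params(page: int = 1, limit: int = 20,
                      file_type: Optional[str] = None,
                      document_name: Optional[str] = None) -> Dict[str, object]:
    """Build query parameters for GET /documents/all (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if file_type is not None and file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(f"unsupported file_type: {file_type}")

    # Only include filters the caller actually set.
    params: Dict[str, object] = {"page": page, "limit": limit}
    if file_type is not None:
        params["file_type"] = file_type
    if document_name:
        params["document_name"] = document_name
    return params
```

The returned dictionary can be passed as the `params` argument of an HTTP client such as `requests.get`.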

Success Response (200 OK)

[
  {
    "document_id": "doc-abc123def456",
    "filename": "clinical-trial-protocol.docx",
    "upload_date": "2024-01-15T10:30:00Z",
    "file_size": 2048576,
    "s3_url": "s3://your-bucket/ingested/doc-abc123def456/clinical-trial-protocol.docx",
    "file_type": "docx",
    "document_type": "Clinical Trial Protocol",
    "description": "Phase III randomized controlled trial protocol",
    "file_synopsis": "A comprehensive protocol for evaluating efficacy and safety...",
    "processing_status": "completed"
  },
  {
    "document_id": "doc-def456ghi789",
    "filename": "regulatory-guidelines.html",
    "upload_date": "2024-01-14T15:22:00Z",
    "file_size": 1024000,
    "s3_url": "s3://your-bucket/ingested/doc-def456ghi789/regulatory-guidelines.html",
    "file_type": "html",
    "document_type": "Regulatory Guidelines",
    "description": "FDA guidance document for clinical trials",
    "file_synopsis": "Updated FDA guidelines covering trial design requirements...",
    "processing_status": "completed"
  }
]

Permission-Based Access

Admin users: See all documents in the organization
Regular users: Only see documents assigned to them
Note: Only documents with processing_status: "completed" are returned

Get Specific Document

Retrieve detailed information about a specific document by filename.

Endpoint

GET /documents/{filename}

Path Parameters

Parameter	Type	Required	Description
`filename`	String	Yes	Full filename including any group/folder structure

Headers

Authorization: Bearer <access_token>
Content-Type: application/json

Example Request

curl -X GET "https://api.yourdomain.com/documents/clinical-trial-protocol.docx" \
  -H "Authorization: Bearer your-access-token" \
  -H "Content-Type: application/json"

Success Response (200 OK)

{
  "document_id": "doc-abc123def456",
  "filename": "clinical-trial-protocol.docx",
  "upload_date": "2024-01-15T10:30:00Z",
  "file_size": 2048576,
  "s3_url": "s3://your-bucket/ingested/doc-abc123def456/clinical-trial-protocol.docx",
  "file_type": "docx",
  "document_type": "Clinical Trial Protocol",
  "description": "Phase III randomized controlled trial protocol",
  "file_synopsis": "A comprehensive protocol for evaluating efficacy and safety of the investigational drug across multiple clinical sites...",
  "processing_status": "completed",
  "extraction_metadata": {
    "total_pages": 45,
    "total_words": 12847,
    "total_images": 8,
    "total_tables": 12,
    "extraction_format": "json",
    "last_processed": "2024-01-15T10:34:12Z"
  }
}
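Because `filename` may include a group or folder structure, it can contain characters such as `/` or spaces that are not safe in a URL path segment. The sketch below shows one way to build the request URL; `document_url` is a hypothetical helper, and it assumes the server expects the whole filename percent-encoded as a single path segment, which this page does not confirm.

```python
from urllib.parse import quote


def document_url(base_url: str, filename: str) -> str:
    """Build the GET /documents/{filename} URL (hypothetical helper)."""
    # safe="" also encodes "/" so a folder prefix stays inside one path segment.
    # Assumption: the server decodes the segment back into the full filename.
    return f"{base_url.rstrip('/')}/documents/{quote(filename, safe='')}"
```

If the API instead expects folder separators to remain literal slashes, pass `safe="/"` to `quote`.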

Document Deletion

Only admin users can delete documents from the system.

Endpoint

DELETE /documents/{document_id}

Headers

Authorization: Bearer <access_token>
Content-Type: application/json

Permission Required: Admin-level access

Success Response (200 OK)

{
  "message": "Document successfully deleted",
  "document_id": "doc-abc123def456",
  "deleted_at": "2024-01-15T16:30:00Z"
}

Note: Deleting a document removes both the original file and all extracted content from the system.
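This section has no request example, so here is a minimal sketch of how the call might be prepared with Python's standard library. `build_delete_request` is a hypothetical helper; it only constructs the request so that the URL, method, and headers can be inspected before anything irreversible is sent.

```python
import urllib.request


def build_delete_request(base_url: str, document_id: str,
                         access_token: str) -> urllib.request.Request:
    """Prepare (but do not send) a DELETE /documents/{document_id} request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/documents/{document_id}",
        method="DELETE",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# To send it (requires an admin token; deletion cannot be undone):
#     with urllib.request.urlopen(request) as response:
#         body = response.read()
```

Keeping construction separate from sending makes it easy to log or confirm the target `document_id` first.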

Content Retrieval

The retrieval system uses hybrid search (vector embeddings + keyword matching with reranking) to find relevant passages from your ingested documents. This functionality is specifically designed to support document generation workflows.

Important Limitation

This is not a general-purpose search API. The retrieval functionality is optimized for finding content that will be used in AI document generation. For comprehensive document search needs, consider dedicated search solutions.

Retrieve Relevant Passages

Endpoint

GET /retrieve

Headers

Authorization: Bearer <access_token>
Content-Type: application/json

Query Parameters

Parameter	Type	Required	Description
`q`	String	Yes	Search query or question
`sourceUrls`	String	No	Comma-separated list of S3 URLs to search within
`topK`	Integer	No	Number of results to return (default: 5, recommended: 10-25)
`search_type`	String	No	Search method: `hybrid` (default), `vector`, or `keyword`

Search Types

Hybrid Search (Default)

Combines vector embeddings and keyword matching with intelligent reranking for optimal results.

curl -X GET "https://api.yourdomain.com/retrieve?q=patient%20eligibility%20criteria&topK=15" \
  -H "Authorization: Bearer your-access-token"
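This page does not say how the service merges vector and keyword results. For intuition only, the sketch below uses reciprocal rank fusion (RRF), a common way to combine ranked lists; it is an illustrative stand-in, not the service's actual reranker, and the passage IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists into one; each input list is best-first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute more to the fused score.
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break alphabetically for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


vector_hits = ["p3", "p1", "p2"]   # hypothetical passage IDs from vector search
keyword_hits = ["p1", "p4", "p3"]  # hypothetical passage IDs from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['p1', 'p3', 'p4', 'p2']
```

Passages that rank well in both lists (`p1`, `p3`) rise above passages found by only one method, which is the general behavior a hybrid search aims for.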

Vector Search Only

Uses semantic similarity through vector embeddings to find conceptually related content.

curl -X GET "https://api.yourdomain.com/retrieve?q=adverse%20events%20reporting&search_type=vector&topK=10" \
  -H "Authorization: Bearer your-access-token"

Keyword Search Only

Traditional text-based search using keyword matching.

curl -X GET "https://api.yourdomain.com/retrieve?q=FDA%20approval%20process&search_type=keyword&topK=20" \
  -H "Authorization: Bearer your-access-token"

Filtering by Source Documents

Search within specific documents by providing their S3 URLs:

curl -X GET "https://api.yourdomain.com/retrieve?q=dosage%20recommendations&sourceUrls=s3://bucket/doc1.docx,s3://bucket/doc2.html&topK=10" \
  -H "Authorization: Bearer your-access-token"
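When S3 URLs are placed in a query string, characters such as `:`, `/`, and `,` should be percent-encoded. The sketch below builds a `/retrieve` URL with the standard library; `build_retrieve_url` is a hypothetical helper, and it assumes the server decodes the query string in the usual way (including `+` for spaces).

```python
from typing import List, Optional
from urllib.parse import urlencode

SEARCH_TYPES = {"hybrid", "vector", "keyword"}


def build_retrieve_url(base_url: str, query: str,
                       source_urls: Optional[List[str]] = None,
                       top_k: int = 5, search_type: str = "hybrid") -> str:
    """Build a GET /retrieve URL with properly encoded parameters."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unknown search_type: {search_type}")

    params = {"q": query, "topK": top_k, "search_type": search_type}
    if source_urls:
        # The API takes a single comma-separated string of S3 URLs.
        params["sourceUrls"] = ",".join(source_urls)
    return f"{base_url.rstrip('/')}/retrieve?{urlencode(params)}"
```

HTTP clients that accept a parameter dictionary (for example `requests.get(..., params=...)`) perform the same encoding automatically.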

Success Response (200 OK)

{
  "results": [
    {
      "url": "s3://your-bucket/ingested/doc-abc123def456/clinical-trial-protocol.docx",
      "snippet": "Primary efficacy endpoint will be assessed using the modified intention-to-treat population, defined as all randomized patients who received at least one dose of study medication...",
      "score": 0.8547,
      "document_title": "Phase III Clinical Trial Protocol",
      "document_type": "Clinical Trial Protocol",
      "file_synopsis": "A comprehensive protocol for evaluating efficacy and safety of the investigational drug"
    },
    {
      "url": "s3://your-bucket/ingested/doc-def456ghi789/regulatory-guidelines.html",
      "snippet": "The FDA recommends that primary endpoints be clinically meaningful and directly related to patient benefit. Statistical significance should be accompanied by clinical relevance...",
      "score": 0.7923,
      "document_title": "FDA Regulatory Guidelines",
      "document_type": "Regulatory Guidelines",
      "file_synopsis": "Updated FDA guidelines covering trial design requirements and endpoint selection"
    },
    {
      "url": "s3://your-bucket/ingested/doc-ghi789jkl012/statistical-plan.docx",
      "snippet": "Sample size calculations are based on detecting a 15% difference between treatment groups with 80% power and alpha level of 0.05...",
      "score": 0.7654,
      "document_title": "Statistical Analysis Plan",
      "document_type": "Statistical Analysis Plan",
      "file_synopsis": "Detailed statistical methodology for the clinical trial analysis"
    }
  ],
  "query": "primary efficacy endpoint",
  "search_type": "hybrid",
  "total_results": 3,
  "processing_time_ms": 247
}

Response Fields

Field	Description
`url`	S3 URL of the source document
`snippet`	Relevant text passage (length not configurable)
`score`	Relevance score (0.0 to 1.0, higher = more relevant)
`document_title`	Human-readable document title
`document_type`	Document classification
`file_synopsis`	Brief summary of document content
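Several results may come from the same source document. As a sketch of how the fields above might be consumed, the hypothetical helper below keeps only the highest-scoring passage per `url` and discards anything under a caller-chosen score threshold. The threshold is an application decision; the API does not define a cutoff.

```python
from typing import Any, Dict, List


def best_snippet_per_document(response: Dict[str, Any],
                              min_score: float = 0.0) -> List[Dict[str, Any]]:
    """Keep the highest-scoring result for each source URL, best first."""
    best: Dict[str, Dict[str, Any]] = {}
    for result in response.get("results", []):
        if result["score"] < min_score:
            continue
        current = best.get(result["url"])
        if current is None or result["score"] > current["score"]:
            best[result["url"]] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)
```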

Empty Results Response

When no relevant content is found:

{
  "results": [],
  "query": "nonexistent topic",
  "search_type": "hybrid",
  "total_results": 0,
  "processing_time_ms": 156
}
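An empty result set is a normal 200 response, not an error, so callers need their own policy for it. One possible policy, sketched below, is to retry once without the `sourceUrls` filter when a scoped search finds nothing. `search_with_broadening` and the `search` callable are hypothetical; the callable stands in for whatever function issues the `/retrieve` request.

```python
from typing import Any, Callable, Dict, List, Optional

# (query, source_urls) -> parsed JSON response from /retrieve
SearchFn = Callable[[str, Optional[List[str]]], Dict[str, Any]]


def search_with_broadening(search: SearchFn, query: str,
                           source_urls: Optional[List[str]] = None) -> Dict[str, Any]:
    """Search within source_urls first; if nothing matches, search all documents."""
    response = search(query, source_urls)
    if response.get("total_results", 0) == 0 and source_urls:
        # Scoped search was empty: drop the filter and try once more.
        response = search(query, None)
    return response
```

Whether broadening is appropriate depends on the workflow; for generation tasks tied to specific documents, returning the empty result may be the better choice.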

Rate Limiting

Both the document and retrieval endpoints are rate limited because of the underlying LLM processing they require.

Rate Limits by User Level

Admin Users

Document listing: 100 requests per minute
Document retrieval: 100 requests per minute
Content search: 50 requests per minute

Regular Users

Document listing: 50 requests per minute
Document retrieval: 50 requests per minute
Content search: 20 requests per minute

Rate Limit Headers

X-RateLimit-Limit: 50
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1640995200
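A client can use these headers to decide how long to pause before its next call. The sketch below assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which the sample value suggests but this page does not state; `seconds_until_reset` is a hypothetical helper.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Return how long to wait before the next request; 0.0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: the reset value is expressed in epoch seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

The `now` argument exists so the function can be tested with a fixed clock; in normal use it can be omitted.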

Rate Limit Exceeded Response (429)

{
  "error_code": "RATE_LIMIT_EXCEEDED",
  "message": "Search rate limit exceeded",
  "retry_after": 60,
  "requests_remaining": 0,
  "reset_time": "2024-01-15T10:35:00Z"
}
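The `retry_after` field tells the client how many seconds to wait. A minimal retry loop built on it might look like the sketch below; `call_with_retry` and the `send` callable are hypothetical, and the 60-second fallback used when `retry_after` is missing is an assumption.

```python
import time
from typing import Any, Callable, Dict, Tuple

# A zero-argument callable returning (HTTP status, parsed JSON body).
Send = Callable[[], Tuple[int, Dict[str, Any]]]


def call_with_retry(send: Send, max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Call `send`, waiting `retry_after` seconds after each 429 response."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 429:
            return body
        if attempt < max_attempts:
            # Fall back to 60 seconds if the body omits `retry_after`.
            sleep(float(body.get("retry_after", 60)))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Only 429 responses are retried here; other error statuses are returned to the caller unchanged.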

Document Access & Download

Accessing Full Document Content

To access the complete content of a document, use the S3 URL provided in the document metadata:

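# Python example: read a document's bytes from its S3 URL
import boto3
from botocore.exceptions import BotoCoreError, ClientError

class DocumentAccess:
    def __init__(self, aws_credentials):
        # aws_credentials: dict of boto3 client keyword arguments, e.g.
        # aws_access_key_id, aws_secret_access_key, region_name
        self.s3_client = boto3.client('s3', **aws_credentials)
    
    def get_document_content(self, s3_url):
        """Return the full document content as bytes, given its S3 URL"""
        # Split s3://bucket/key into bucket name and object key
        prefix = 's3://'
        if not s3_url.startswith(prefix):
            raise ValueError(f"Not an S3 URL: {s3_url}")
        bucket_name, _, object_key = s3_url[len(prefix):].partition('/')
        if not bucket_name or not object_key:
            raise ValueError(f"S3 URL is missing a bucket or key: {s3_url}")
        
        try:
            response = self.s3_client.get_object(
                Bucket=bucket_name,
                Key=object_key
            )
            return response['Body'].read()
        except (BotoCoreError, ClientError) as e:
            # Keep the original error attached for debugging
            raise RuntimeError(f"Failed to access document: {e}") from e

# Usage (your_aws_credentials is a placeholder dict that you supply)
doc_access = DocumentAccess(your_aws_credentials)
content = doc_access.get_document_content("s3://your-bucket/path/to/document.docx")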

JavaScript Example

// Using AWS SDK for JavaScript
import AWS from 'aws-sdk';

class DocumentAccess {
    constructor(awsConfig) {
        this.s3 = new AWS.S3(awsConfig);
    }
    
    async getDocumentContent(s3Url) {
        const urlParts = s3Url.replace('s3://', '').split('/');
        const bucket = urlParts[0];
        const key = urlParts.slice(1).join('/');
        
        try {
            const response = await this.s3.getObject({
                Bucket: bucket,
                Key: key
            }).promise();
            
            return response.Body;
        } catch (error) {
            throw new Error(`Failed to access document: ${error.message}`);
        }
    }
}

Error Handling

Common Error Codes

Error Code	HTTP Status	Description
`DOCUMENT_NOT_FOUND`	404	Requested document does not exist
`ACCESS_DENIED`	403	User lacks permission to access document
`INVALID_QUERY`	400	Search query is malformed or empty
`INVALID_SOURCE_URL`	400	One or more source URLs are invalid
`RATE_LIMIT_EXCEEDED`	429	Too many requests within time window
`SEARCH_TIMEOUT`	408	Search query took too long to process

Error Response Format

{
  "error_code": "DOCUMENT_NOT_FOUND",
  "message": "The requested document could not be found",
  "details": {
    "document_id": "doc-nonexistent123",
    "user_id": "user-abc123"
  },
  "request_id": "req-def456ghi789"
}
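Client code can wrap this structure in an exception type so that callers branch on `error_code` rather than parsing messages. The sketch below is one way to do that; `APIError` is a hypothetical class, and treating `RATE_LIMIT_EXCEEDED` and `SEARCH_TIMEOUT` as retryable is an assumption about sensible client behavior, not something the API specifies.

```python
from typing import Any, Dict

# Assumption: these two codes describe transient conditions worth retrying.
RETRYABLE_CODES = {"RATE_LIMIT_EXCEEDED", "SEARCH_TIMEOUT"}


class APIError(Exception):
    """Exception built from the documented error response format."""

    def __init__(self, status: int, body: Dict[str, Any]):
        self.status = status
        self.error_code = body.get("error_code", "UNKNOWN")
        self.details = body.get("details", {})
        self.request_id = body.get("request_id")
        super().__init__(f"{self.error_code} ({status}): {body.get('message', '')}")

    @property
    def retryable(self) -> bool:
        return self.error_code in RETRYABLE_CODES
```

Logging `request_id` alongside the error makes it easier to correlate a failure with server-side records.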

Best Practices

Efficient Document Management

Use pagination for large document collections
Filter by file type when looking for specific document formats
Monitor processing status before attempting retrieval
Cache document lists when possible to reduce API calls
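The caching advice above can be implemented with a small time-based cache. The sketch below is illustrative: `CachedDocumentList` is a hypothetical class, the five-minute lifetime is an arbitrary default, and `fetch` stands in for whatever function calls `GET /documents/all`.

```python
import time
from typing import Any, Callable, List, Optional


class CachedDocumentList:
    """Reuse a fetched document list until it is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[], List[Any]], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[List[Any]] = None
        self._expires_at = 0.0

    def get(self) -> List[Any]:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache is empty or stale: call the API and restart the timer.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

A shorter lifetime suits workflows where documents are uploaded or deleted often, since cached lists will not reflect those changes until they expire.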

Effective Content Retrieval

Be specific in queries - more specific queries yield better results
Use appropriate topK values - recommended range is 10-25
Choose the right search type:
- Hybrid: Best for most use cases
- Vector: Better for conceptual/semantic queries
- Keyword: Better for exact term matching
Limit source URLs when searching specific documents
Implement proper error handling for empty results

Rate Limit Management

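// Example: queue search requests to stay within the rate limit
class APIClient {
    constructor(baseUrl, accessToken) {
        this.baseUrl = baseUrl;
        this.accessToken = accessToken;
        this.requestQueue = [];
        this.isProcessing = false;
        this.rateLimit = {
            remaining: 50,
            resetTime: null // epoch milliseconds, set from X-RateLimit-Reset
        };
    }
    
    async search(query, options = {}) {
        return new Promise((resolve, reject) => {
            this.requestQueue.push({ query, options, resolve, reject });
            this.processQueue();
        });
    }
    
    // Send GET /retrieve; options may carry topK, search_type, or sourceUrls
    async makeSearchRequest(query, options) {
        const params = new URLSearchParams({ q: query, ...options });
        const response = await fetch(`${this.baseUrl}/retrieve?${params}`, {
            headers: { Authorization: `Bearer ${this.accessToken}` }
        });
        // Record the quota even when the request failed (for example, on 429)
        this.updateRateLimit(response.headers);
        if (!response.ok) {
            throw new Error(`Search failed with status ${response.status}`);
        }
        return response.json();
    }
    
    // Store the quota reported by the rate limit headers
    updateRateLimit(headers) {
        const remaining = headers.get('X-RateLimit-Remaining');
        const reset = headers.get('X-RateLimit-Reset');
        if (remaining !== null) this.rateLimit.remaining = Number(remaining);
        // Assumes X-RateLimit-Reset is a Unix timestamp in seconds
        if (reset !== null) this.rateLimit.resetTime = Number(reset) * 1000;
    }
    
    async processQueue() {
        if (this.isProcessing || this.requestQueue.length === 0) return;
        if (this.rateLimit.remaining <= 0 && this.rateLimit.resetTime !== null) {
            const waitTime = this.rateLimit.resetTime - Date.now();
            if (waitTime > 0) {
                // Quota exhausted: try again once the window resets
                setTimeout(() => this.processQueue(), waitTime);
                return;
            }
        }
        
        this.isProcessing = true;
        const request = this.requestQueue.shift();
        
        try {
            const data = await this.makeSearchRequest(request.query, request.options);
            request.resolve(data);
        } catch (error) {
            request.reject(error);
        } finally {
            this.isProcessing = false;
            setTimeout(() => this.processQueue(), 100); // Small delay between requests
        }
    }
}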

Integration Example

Complete Document Search Workflow

import requests
from typing import Dict, List, Optional

class DocumentSearchClient:
    def __init__(self, base_url: str, access_token: str):
        self.base_url = base_url
        self.access_token = access_token
        self.headers = {
            'Authorization': f'Bearer {access_token}',
            'Content-Type': 'application/json'
        }
    
    def get_all_documents(self, file_type: Optional[str] = None) -> List[Dict]:
        """Retrieve all accessible documents, following pagination"""
        documents: List[Dict] = []
        params = {'page': 1, 'limit': 100}  # 100 is the documented maximum
        if file_type:
            params['file_type'] = file_type
        
        while True:
            response = requests.get(
                f"{self.base_url}/documents/all",
                headers=self.headers,
                params=params
            )
            response.raise_for_status()
            batch = response.json()
            documents.extend(batch)
            # Assumes a page shorter than `limit` is the last page
            if len(batch) < params['limit']:
                return documents
            params['page'] += 1
    
    def search_documents(self, query: str, source_urls: Optional[List[str]] = None, 
                        top_k: int = 15, search_type: str = "hybrid") -> Dict:
        """Search for relevant content across documents"""
        params = {
            'q': query,
            'topK': top_k,
            'search_type': search_type
        }
        
        if source_urls:
            params['sourceUrls'] = ','.join(source_urls)
        
        response = requests.get(
            f"{self.base_url}/retrieve",
            headers=self.headers,
            params=params
        )
        response.raise_for_status()
        return response.json()
    
    def search_specific_documents(self, query: str, document_types: List[str]) -> Dict:
        """Search within documents of specific types"""
        # First, get documents of specified types
        all_docs = self.get_all_documents()
        target_docs = [
            doc for doc in all_docs 
            if doc['document_type'] in document_types
        ]
        
        if not target_docs:
            return {"results": [], "message": "No documents found of specified types"}
        
        # Extract S3 URLs for targeted search
        source_urls = [doc['s3_url'] for doc in target_docs]
        
        # Perform search within these documents
        return self.search_documents(query, source_urls=source_urls)

# Usage example
client = DocumentSearchClient("https://api.yourdomain.com", "your-access-token")

# Search for clinical trial information in protocol documents
results = client.search_specific_documents(
    query="patient inclusion exclusion criteria",
    document_types=["Clinical Trial Protocol", "Study Protocol"]
)

print(f"Found {len(results['results'])} relevant passages:")
for result in results['results']:
    print(f"- {result['document_title']}: {result['snippet'][:100]}...")
