Documents Management

List All Documents

Retrieve a list of all processed documents accessible to your user account.
Endpoint
GET /documents/all
Headers
Authorization: Bearer <access_token>
Content-Type: application/json
Query Parameters
Parameter | Type | Required | Description
page | Integer | No | Page number for pagination (default: 1)
limit | Integer | No | Number of documents per page (default: 20, max: 100)
file_type | String | No | Filter by file type (docx, html, rtf, csv)
document_name | String | No | Filter by document name (partial match)
Example Request
curl -X GET "https://api.yourdomain.com/documents/all?page=1&limit=20&file_type=docx" \
  -H "Authorization: Bearer your-access-token" \
  -H "Content-Type: application/json"
Success Response (200 OK)
[
  {
    "document_id": "doc-abc123def456",
    "filename": "clinical-trial-protocol.docx",
    "upload_date": "2024-01-15T10:30:00Z",
    "file_size": 2048576,
    "s3_url": "s3://your-bucket/ingested/doc-abc123def456/clinical-trial-protocol.docx",
    "file_type": "docx",
    "document_type": "Clinical Trial Protocol",
    "description": "Phase III randomized controlled trial protocol",
    "file_synopsis": "A comprehensive protocol for evaluating efficacy and safety...",
    "processing_status": "completed"
  },
  {
    "document_id": "doc-def456ghi789",
    "filename": "regulatory-guidelines.html",
    "upload_date": "2024-01-14T15:22:00Z",
    "file_size": 1024000,
    "s3_url": "s3://your-bucket/ingested/doc-def456ghi789/regulatory-guidelines.html",
    "file_type": "html",
    "document_type": "Regulatory Guidelines",
    "description": "FDA guidance document for clinical trials",
    "file_synopsis": "Updated FDA guidelines covering trial design requirements...",
    "processing_status": "completed"
  }
]
Permission-Based Access
  • Admin users: See all documents in the organization
  • Regular users: Only see documents assigned to them
  • Note: Only documents with processing_status: "completed" are returned
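
Because the endpoint returns at most 100 documents per call, collecting a full listing means walking the pages. The sketch below is one way to do that, not a documented client. It assumes that a page shorter than `limit` marks the end of the listing (the response shown above carries no explicit "last page" indicator), and `fetch_page` is a hypothetical callable you supply that performs the actual `GET /documents/all` request.

```python
from typing import Callable, Dict, Iterator, List


def iter_documents(fetch_page: Callable[[int, int], List[Dict]],
                   limit: int = 100) -> Iterator[Dict]:
    """Yield every document by requesting pages 1, 2, 3, ... in order."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    page = 1
    while True:
        batch = fetch_page(page, limit)
        yield from batch
        # Assumption: a page shorter than `limit` is the last one.
        if len(batch) < limit:
            break
        page += 1
```

In practice `fetch_page` would wrap an HTTP call that passes `page` and `limit` as query parameters and returns the decoded JSON array.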

Get Specific Document

Retrieve detailed information about a specific document by filename.
Endpoint
GET /documents/{filename}
Path Parameters
Parameter | Type | Required | Description
filename | String | Yes | Full filename including any group/folder structure
Headers
Authorization: Bearer <access_token>
Content-Type: application/json
Example Request
curl -X GET "https://api.yourdomain.com/documents/clinical-trial-protocol.docx" \
  -H "Authorization: Bearer your-access-token" \
  -H "Content-Type: application/json"
Success Response (200 OK)
{
  "document_id": "doc-abc123def456",
  "filename": "clinical-trial-protocol.docx",
  "upload_date": "2024-01-15T10:30:00Z",
  "file_size": 2048576,
  "s3_url": "s3://your-bucket/ingested/doc-abc123def456/clinical-trial-protocol.docx",
  "file_type": "docx",
  "document_type": "Clinical Trial Protocol",
  "description": "Phase III randomized controlled trial protocol",
  "file_synopsis": "A comprehensive protocol for evaluating efficacy and safety of the investigational drug across multiple clinical sites...",
  "processing_status": "completed",
  "extraction_metadata": {
    "total_pages": 45,
    "total_words": 12847,
    "total_images": 8,
    "total_tables": 12,
    "extraction_format": "json",
    "last_processed": "2024-01-15T10:34:12Z"
  }
}
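
Since `filename` may include a group or folder structure, it can contain spaces or slashes that need escaping before they go into the path. The helper below is a small sketch of that step. Whether the server expects slashes inside the filename to be percent-encoded (as done here) or passed through as path separators is an assumption, so verify it against your deployment. `document_url` is a hypothetical name.

```python
from urllib.parse import quote


def document_url(base_url: str, filename: str) -> str:
    """Build the GET /documents/{filename} URL with an escaped filename."""
    # safe="" also encodes "/", so a folder prefix stays in one path segment.
    # Assumption: the API accepts percent-encoded slashes in the filename.
    return f"{base_url.rstrip('/')}/documents/{quote(filename, safe='')}"
```

For example, `document_url("https://api.yourdomain.com", "clinical-trial-protocol.docx")` reproduces the URL used in the example request above.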

Document Deletion

Only admin users can delete documents from the system.
Endpoint
DELETE /documents/{document_id}
Headers
Authorization: Bearer <access_token>
Content-Type: application/json
Permission Required: Admin level access
Success Response (200 OK)
{
  "message": "Document successfully deleted",
  "document_id": "doc-abc123def456",
  "deleted_at": "2024-01-15T16:30:00Z"
}
Note: Deleting a document removes both the original file and all extracted content from the system.
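
The section above does not include an example request, so here is a minimal sketch of one using only the Python standard library. It builds the request without sending it. `build_delete_request` is a hypothetical helper name, and the host is the same placeholder used in the other examples.

```python
import urllib.request
from urllib.parse import quote


def build_delete_request(base_url: str, access_token: str,
                         document_id: str) -> urllib.request.Request:
    """Prepare a DELETE /documents/{document_id} request (admin token required)."""
    url = f"{base_url.rstrip('/')}/documents/{quote(document_id, safe='')}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` sends it. Because deletion is permanent, it may be worth confirming the `document_id` against the listing first. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses.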

Content Retrieval

The retrieval system uses hybrid search (vector embeddings + keyword matching with reranking) to find relevant passages from your ingested documents. This functionality is specifically designed to support document generation workflows.

Important Limitation

This is not a general-purpose search API. The retrieval functionality is optimized for finding content that will be used in AI document generation. For comprehensive document search needs, consider dedicated search solutions.

Retrieve Relevant Passages

Endpoint
GET /retrieve
Headers
Authorization: Bearer <access_token>
Content-Type: application/json
Query Parameters
Parameter | Type | Required | Description
q | String | Yes | Search query or question
sourceUrls | String | No | Comma-separated list of S3 URLs to search within
topK | Integer | No | Number of results to return (default: 5, recommended: 10-25)
search_type | String | No | Search method: "hybrid" (default), "vector", "keyword"

Search Types

Hybrid Search (Default)

Combines vector embeddings and keyword matching with intelligent reranking for optimal results.
curl -X GET "https://api.yourdomain.com/retrieve?q=patient%20eligibility%20criteria&topK=15" \
  -H "Authorization: Bearer your-access-token"

Vector Search Only

Uses semantic similarity through vector embeddings to find conceptually related content.
curl -X GET "https://api.yourdomain.com/retrieve?q=adverse%20events%20reporting&search_type=vector&topK=10" \
  -H "Authorization: Bearer your-access-token"

Keyword Search Only

Traditional text-based search using keyword matching.
curl -X GET "https://api.yourdomain.com/retrieve?q=FDA%20approval%20process&search_type=keyword&topK=20" \
  -H "Authorization: Bearer your-access-token"

Filtering by Source Documents

Search within specific documents by providing their S3 URLs:
curl -X GET "https://api.yourdomain.com/retrieve?q=dosage%20recommendations&sourceUrls=s3://bucket/doc1.docx,s3://bucket/doc2.html&topK=10" \
  -H "Authorization: Bearer your-access-token"
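
The curl examples above hand-encode the query string. As a sketch of doing the same programmatically, the helper below assembles a `/retrieve` URL from the documented parameters and rejects an unknown `search_type` before any request is made. `build_retrieve_url` is a hypothetical name, and the client-side validation is an added convenience rather than documented server behavior.

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode

SEARCH_TYPES = ("hybrid", "vector", "keyword")


def build_retrieve_url(base_url: str, query: str, top_k: int = 5,
                       search_type: str = "hybrid",
                       source_urls: Optional[Sequence[str]] = None) -> str:
    """Return a fully encoded GET /retrieve URL."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"search_type must be one of {SEARCH_TYPES}")
    params = {"q": query, "topK": top_k, "search_type": search_type}
    if source_urls:
        # The API expects a single comma-separated string of S3 URLs.
        params["sourceUrls"] = ",".join(source_urls)
    # quote_via=quote encodes spaces as %20, matching the curl examples.
    query_string = urlencode(params, quote_via=quote)
    return f"{base_url.rstrip('/')}/retrieve?{query_string}"
```

With this encoding, the S3 URLs in `sourceUrls` are fully percent-encoded (`s3%3A%2F%2F...`), which avoids ambiguity when a key contains characters such as `&` or `#`.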

Success Response (200 OK)

{
  "results": [
    {
      "url": "s3://your-bucket/ingested/doc-abc123def456/clinical-trial-protocol.docx",
      "snippet": "Primary efficacy endpoint will be assessed using the modified intention-to-treat population, defined as all randomized patients who received at least one dose of study medication...",
      "score": 0.8547,
      "document_title": "Phase III Clinical Trial Protocol",
      "document_type": "Clinical Trial Protocol",
      "file_synopsis": "A comprehensive protocol for evaluating efficacy and safety of the investigational drug"
    },
    {
      "url": "s3://your-bucket/ingested/doc-def456ghi789/regulatory-guidelines.html",
      "snippet": "The FDA recommends that primary endpoints be clinically meaningful and directly related to patient benefit. Statistical significance should be accompanied by clinical relevance...",
      "score": 0.7923,
      "document_title": "FDA Regulatory Guidelines",
      "document_type": "Regulatory Guidelines", 
      "file_synopsis": "Updated FDA guidelines covering trial design requirements and endpoint selection"
    },
    {
      "url": "s3://your-bucket/ingested/doc-ghi789jkl012/statistical-plan.docx",
      "snippet": "Sample size calculations are based on detecting a 15% difference between treatment groups with 80% power and alpha level of 0.05...",
      "score": 0.7654,
      "document_title": "Statistical Analysis Plan",
      "document_type": "Statistical Analysis Plan",
      "file_synopsis": "Detailed statistical methodology for the clinical trial analysis"
    }
  ],
  "query": "primary efficacy endpoint",
  "search_type": "hybrid",
  "total_results": 3,
  "processing_time_ms": 247
}

Response Fields

Field | Description
url | S3 URL of the source document
snippet | Relevant text passage (length not configurable)
score | Relevance score (0.0 to 1.0, higher = more relevant)
document_title | Human-readable document title
document_type | Document classification
file_synopsis | Brief summary of document content
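
Since `score` runs from 0.0 to 1.0, a caller can drop weak matches before passing snippets on to a generation step. The sketch below shows one way to do that. The 0.75 default is an arbitrary illustrative cutoff, not a recommended value, and `passages_above` is a hypothetical helper name.

```python
from typing import Dict, List


def passages_above(response: Dict, min_score: float = 0.75) -> List[Dict]:
    """Keep results scoring at least `min_score`, most relevant first."""
    hits = [r for r in response.get("results", []) if r["score"] >= min_score]
    return sorted(hits, key=lambda r: r["score"], reverse=True)
```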

Empty Results Response

When no relevant content is found:
{
  "results": [],
  "query": "nonexistent topic",
  "search_type": "hybrid",
  "total_results": 0,
  "processing_time_ms": 156
}
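
An empty result set is a normal 200 response, so callers need to check `total_results` rather than rely on an error. One possible reaction, sketched below, is to retry the same query with the other search types. This fallback order is an assumption about what might help, not documented behavior, and `search` is a hypothetical callable that takes `(query, search_type)` and returns the decoded response.

```python
from typing import Callable, Dict, Sequence


def search_with_fallback(
    search: Callable[[str, str], Dict],
    query: str,
    search_types: Sequence[str] = ("hybrid", "vector", "keyword"),
) -> Dict:
    """Try each search type in turn and return the first non-empty response."""
    response: Dict = {"results": [], "query": query, "total_results": 0}
    for search_type in search_types:
        response = search(query, search_type)
        if response.get("total_results", 0) > 0:
            return response
    # Every attempt came back empty: return the last response as-is.
    return response
```

Each fallback attempt is a separate request and counts against the content search rate limit.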

Rate Limiting

The document and retrieval endpoints are both rate limited because of the LLM processing they rely on.

Rate Limits by User Level

Admin Users
  • Document listing: 100 requests per minute
  • Document retrieval: 100 requests per minute
  • Content search: 50 requests per minute
Regular Users
  • Document listing: 50 requests per minute
  • Document retrieval: 50 requests per minute
  • Content search: 20 requests per minute

Rate Limit Headers

X-RateLimit-Limit: 50
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1640995200
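
A client can read these headers to decide whether to pause before its next call. The sketch below assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which is what the example value suggests but is not stated explicitly. `seconds_to_wait` is a hypothetical helper, and `now` is injectable only to make the function easy to test.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str],
                    now: Optional[float] = None) -> float:
    """Return how long to pause before the next request (0.0 if none needed)."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: the reset header is expressed in Unix epoch seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

HTTP header names are case-insensitive, so pass a case-insensitive mapping (such as the headers object of your HTTP library) rather than a plain `dict` when reading real responses.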

Rate Limit Exceeded Response (429)

{
  "error_code": "RATE_LIMIT_EXCEEDED",
  "message": "Search rate limit exceeded",
  "retry_after": 60,
  "requests_remaining": 0,
  "reset_time": "2024-01-15T10:35:00Z"
}
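
The `retry_after` field makes a simple wait-and-retry loop possible. The sketch below assumes `retry_after` is given in seconds (the unit is not stated above). `send` is a hypothetical callable that performs one request and returns `(status_code, decoded_body)`, and `sleep` is injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, Dict, Tuple


def call_with_retry(send: Callable[[], Tuple[int, Dict]],
                    max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call `send`, pausing for `retry_after` whenever a 429 comes back."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 429:
            return body
        if attempt < max_attempts:
            # Assumption: retry_after is a number of seconds.
            sleep(float(body.get("retry_after", 1)))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```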

Document Access & Download

Accessing Full Document Content

To access the complete content of a document, use the S3 URL provided in the document metadata:
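# Python example: read a document's bytes directly from S3
import boto3
from botocore.exceptions import BotoCoreError, ClientError

class DocumentAccess:
    def __init__(self, aws_credentials):
        # aws_credentials: dict of boto3 client keyword arguments, e.g.
        # aws_access_key_id, aws_secret_access_key, region_name
        self.s3_client = boto3.client('s3', **aws_credentials)

    def get_document_content(self, s3_url):
        """Return the raw bytes of the object behind an s3:// URL"""
        prefix = 's3://'
        if not s3_url.startswith(prefix):
            raise ValueError(f"Not an S3 URL: {s3_url}")
        # Everything up to the first "/" is the bucket; the rest is the key
        bucket_name, _, object_key = s3_url[len(prefix):].partition('/')
        if not bucket_name or not object_key:
            raise ValueError(f"S3 URL must include a bucket and a key: {s3_url}")

        try:
            response = self.s3_client.get_object(
                Bucket=bucket_name,
                Key=object_key
            )
            return response['Body'].read()
        except (BotoCoreError, ClientError) as e:
            # Chain the original error so the S3 error code stays visible
            raise RuntimeError(f"Failed to access document: {e}") from e

# Usage (your_aws_credentials is a placeholder for your own credential dict)
doc_access = DocumentAccess(your_aws_credentials)
content = doc_access.get_document_content("s3://your-bucket/path/to/document.docx")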

JavaScript Example

// Using AWS SDK for JavaScript
import AWS from 'aws-sdk';

class DocumentAccess {
    constructor(awsConfig) {
        this.s3 = new AWS.S3(awsConfig);
    }
    
    async getDocumentContent(s3Url) {
        const urlParts = s3Url.replace('s3://', '').split('/');
        const bucket = urlParts[0];
        const key = urlParts.slice(1).join('/');
        
        try {
            const response = await this.s3.getObject({
                Bucket: bucket,
                Key: key
            }).promise();
            
            return response.Body;
        } catch (error) {
            throw new Error(`Failed to access document: ${error.message}`);
        }
    }
}

Error Handling

Common Error Codes

Error Code | HTTP Status | Description
DOCUMENT_NOT_FOUND | 404 | Requested document does not exist
ACCESS_DENIED | 403 | User lacks permission to access document
INVALID_QUERY | 400 | Search query is malformed or empty
INVALID_SOURCE_URL | 400 | One or more source URLs are invalid
RATE_LIMIT_EXCEEDED | 429 | Too many requests within time window
SEARCH_TIMEOUT | 408 | Search query took too long to process

Error Response Format

{
  "error_code": "DOCUMENT_NOT_FOUND",
  "message": "The requested document could not be found",
  "details": {
    "document_id": "doc-nonexistent123",
    "user_id": "user-abc123"
  },
  "request_id": "req-def456ghi789"
}
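
Because every error shares this envelope, a client can convert it into a single exception type that preserves `error_code` and `request_id` for logging and support requests. The sketch below is one way to do that. `ApiError` and `raise_for_api_error` are hypothetical names, and treating `RATE_LIMIT_EXCEEDED` and `SEARCH_TIMEOUT` as retryable is an assumption rather than documented guidance.

```python
from typing import Dict, Optional

# Assumption: these two codes describe transient conditions worth retrying.
RETRYABLE_CODES = frozenset({"RATE_LIMIT_EXCEEDED", "SEARCH_TIMEOUT"})


class ApiError(Exception):
    """Exception carrying the fields of the documented error envelope."""

    def __init__(self, status: int, error_code: str, message: str,
                 request_id: Optional[str] = None,
                 details: Optional[Dict] = None):
        super().__init__(f"{error_code} (HTTP {status}): {message}")
        self.status = status
        self.error_code = error_code
        self.request_id = request_id
        self.details = details or {}

    @property
    def retryable(self) -> bool:
        return self.error_code in RETRYABLE_CODES


def raise_for_api_error(status: int, body: Dict) -> None:
    """Raise ApiError for any 4xx/5xx response; do nothing otherwise."""
    if status < 400:
        return
    raise ApiError(
        status=status,
        error_code=body.get("error_code", "UNKNOWN"),
        message=body.get("message", ""),
        request_id=body.get("request_id"),
        details=body.get("details"),
    )
```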

Best Practices

Efficient Document Management

  1. Use pagination for large document collections
  2. Filter by file type when looking for specific document formats
  3. Monitor processing status before attempting retrieval
  4. Cache document lists when possible to reduce API calls
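
As an illustration of the caching suggestion, the sketch below keeps the most recent document list in memory for a fixed time before fetching again. The five-minute default is an arbitrary choice, `CachedDocumentList` is a hypothetical name, and `fetch` stands in for whatever function calls `GET /documents/all`.

```python
import time
from typing import Callable, Dict, List, Optional


class CachedDocumentList:
    """Hold the last document listing and refresh it after `ttl_seconds`."""

    def __init__(self, fetch: Callable[[], List[Dict]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[List[Dict]] = None
        self._expires_at = 0.0

    def get(self) -> List[Dict]:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache is empty or stale: fetch again and restart the timer.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

A cached list can miss documents uploaded or deleted since the last refresh, so a shorter lifetime suits collections that change often.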

Effective Content Retrieval

  1. Be specific in queries - more specific queries yield better results
  2. Use appropriate topK values - recommended range is 10-25
  3. Choose the right search type:
    • Hybrid: Best for most use cases
    • Vector: Better for conceptual/semantic queries
    • Keyword: Better for exact term matching
  4. Limit source URLs when searching specific documents
  5. Implement proper error handling for empty results

Rate Limit Management

// Example: Implement request queuing for rate limit compliance
class APIClient {
    constructor(baseUrl, accessToken) {
        this.baseUrl = baseUrl;
        this.accessToken = accessToken;
        this.requestQueue = [];
        this.isProcessing = false;
        this.rateLimit = {
            remaining: 50,
            resetTime: null
        };
    }
    
    async search(query, options = {}) {
        return new Promise((resolve, reject) => {
            this.requestQueue.push({ query, options, resolve, reject });
            this.processQueue();
        });
    }
    
    async processQueue() {
        if (this.isProcessing || this.requestQueue.length === 0) return;
        if (this.rateLimit.remaining <= 0) {
            const waitTime = this.rateLimit.resetTime - Date.now();
            if (waitTime > 0) {
                setTimeout(() => this.processQueue(), waitTime);
                return;
            }
        }
        
        this.isProcessing = true;
        const request = this.requestQueue.shift();
        
        try {
            const response = await this.makeSearchRequest(request.query, request.options);
            this.updateRateLimit(response.headers);
            request.resolve(response.data);
        } catch (error) {
            request.reject(error);
        } finally {
            this.isProcessing = false;
            setTimeout(() => this.processQueue(), 100); // Small delay between requests
        }
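    }

    // Illustrative helper (an assumption about how requests are sent):
    // calls GET /retrieve and returns the parsed body plus response headers
    async makeSearchRequest(query, options = {}) {
        const params = new URLSearchParams({ q: query, ...options });
        const response = await fetch(`${this.baseUrl}/retrieve?${params}`, {
            headers: { Authorization: `Bearer ${this.accessToken}` }
        });
        if (!response.ok) {
            throw new Error(`Search failed with HTTP ${response.status}`);
        }
        return { headers: response.headers, data: await response.json() };
    }

    // Illustrative helper: record the X-RateLimit-* headers. The reset value
    // is treated as Unix epoch seconds, as in the example headers
    updateRateLimit(headers) {
        const remaining = headers.get('X-RateLimit-Remaining');
        const reset = headers.get('X-RateLimit-Reset');
        if (remaining !== null) this.rateLimit.remaining = Number(remaining);
        if (reset !== null) this.rateLimit.resetTime = Number(reset) * 1000;
    }
}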

Integration Example

Complete Document Search Workflow

import requests
from typing import List, Dict

class DocumentSearchClient:
    def __init__(self, base_url: str, access_token: str):
        self.base_url = base_url
        self.access_token = access_token
        self.headers = {
            'Authorization': f'Bearer {access_token}',
            'Content-Type': 'application/json'
        }
    
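    def get_all_documents(self, file_type: str = None, limit: int = 100) -> List[Dict]:
        """Retrieve one page of accessible documents (up to `limit`, maximum 100)"""
        # Without an explicit limit the endpoint returns only 20 documents per page
        params = {'limit': limit}
        if file_type:
            params['file_type'] = file_type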
            
        response = requests.get(
            f"{self.base_url}/documents/all",
            headers=self.headers,
            params=params
        )
        response.raise_for_status()
        return response.json()
    
    def search_documents(self, query: str, source_urls: List[str] = None, 
                        top_k: int = 15, search_type: str = "hybrid") -> Dict:
        """Search for relevant content across documents"""
        params = {
            'q': query,
            'topK': top_k,
            'search_type': search_type
        }
        
        if source_urls:
            params['sourceUrls'] = ','.join(source_urls)
        
        response = requests.get(
            f"{self.base_url}/retrieve",
            headers=self.headers,
            params=params
        )
        response.raise_for_status()
        return response.json()
    
    def search_specific_documents(self, query: str, document_types: List[str]) -> Dict:
        """Search within documents of specific types"""
        # First, get documents of specified types
        all_docs = self.get_all_documents()
        target_docs = [
            doc for doc in all_docs 
            if doc['document_type'] in document_types
        ]
        
        if not target_docs:
            return {"results": [], "message": "No documents found of specified types"}
        
        # Extract S3 URLs for targeted search
        source_urls = [doc['s3_url'] for doc in target_docs]
        
        # Perform search within these documents
        return self.search_documents(query, source_urls=source_urls)

# Usage example
client = DocumentSearchClient("https://api.yourdomain.com", "your-access-token")

# Search for clinical trial information in protocol documents
results = client.search_specific_documents(
    query="patient inclusion exclusion criteria",
    document_types=["Clinical Trial Protocol", "Study Protocol"]
)

print(f"Found {len(results['results'])} relevant passages:")
for result in results['results']:
    print(f"- {result['document_title']}: {result['snippet'][:100]}...")