Tier 2: Deep Research Process

Trigger: User clicks "Get detailed documents"
Completion: Enhanced overview + chat functionality enabled

Step 1: Document Collection & Download

Job Type: processing_jobs.type = 'collection'

Process Flow

  1. Entity Navigation

    • Use entity identifier from basic discovery
    • Navigate to specific company page on Handelsregister
    • Access document overview sections
  2. Document Metadata Extraction

    • Collect available document information (DK tree structure, AD/CD PDFs)
    • Extract document identifiers, types, and access patterns
    • Build document download queue
  3. Intelligent Document Selection

    • Weigh each document's expected value against its processing cost
    • Prioritize recent documents and key document types
    • Skip obvious duplicates or low-value documents
  4. Bulk Download

    • Download selected PDFs from AD/CD categories
    • Navigate DK document tree for original source documents
    • Handle ZIP files and extract contents
  5. File Management

    • Store files with 48-hour retention policy
    • Create documents table records with metadata
    • Generate file hashes for deduplication (see the hashing sketch after this list)
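
A minimal sketch of the hashing step, assuming SHA-256 over the raw file bytes. The register_download helper and the in-memory seen-hash set are hypothetical stand-ins for the real documents-table lookup:

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB blocks so large PDFs/ZIPs stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def register_download(path: Path, seen_hashes: set[str]) -> bool:
    """Return True if the file is new; duplicates are skipped before any record is created."""
    digest = file_sha256(path)
    if digest in seen_hashes:
        return False  # identical bytes already downloaded for this entity
    seen_hashes.add(digest)
    # ...create the documents-table record here: path, digest, 48-hour retention deadline
    return True
```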

Next Job: Creates processing_jobs.type = 'ocr' for the entity
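
Each step hands off to the next by appending a row to processing_jobs. A sketch of that hand-off, assuming a SQLite-style table with entity_id and status columns (both assumptions; the real schema may differ):

```python
import sqlite3

def enqueue_next_job(conn: sqlite3.Connection, entity_id: int, job_type: str) -> None:
    """Append the follow-up job; a worker picks up rows with status = 'pending'."""
    conn.execute(
        "INSERT INTO processing_jobs (entity_id, type, status) VALUES (?, ?, 'pending')",
        (entity_id, job_type),
    )
    conn.commit()

# After a successful collection run:
# enqueue_next_job(conn, entity_id, "ocr")
```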

Step 2: OCR & Text Extraction

Job Type: processing_jobs.type = 'ocr'

Process Flow

  1. Hash-Based Deduplication

    • Calculate content hash for each document page
    • Check document_pages table for existing results with same hash
    • Skip pages whose hash already has a stored result (see the sketch after this list)
  2. OCR Processing

    • Convert PDFs/TIFs to text page by page
    • Handle mixed content (text + images + signatures)
    • Process only pages not found in deduplication check
  3. Quality Validation

    • Verify OCR results for readability and completeness
    • Flag low-quality pages for potential manual review
    • Generate confidence scores for OCR accuracy
  4. Database Storage

    • Store page-level content in document_pages table
    • Include source document references and page numbers
    • Link to original files for verification
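
A sketch of the dedup-then-OCR loop, assuming SHA-256 page hashes and a pluggable ocr_page callable (hypothetical; e.g. a Tesseract wrapper returning text plus a confidence score):

```python
import hashlib

def ocr_new_pages(pages: list[bytes], known_hashes: set[str], ocr_page) -> list[dict]:
    """Run OCR only on pages whose content hash has no row in document_pages yet."""
    results = []
    for number, page_bytes in enumerate(pages, start=1):
        digest = hashlib.sha256(page_bytes).hexdigest()
        if digest in known_hashes:
            continue  # an identical page was already processed and stored
        text, confidence = ocr_page(page_bytes)  # e.g. a Tesseract wrapper
        results.append({
            "page_number": number,
            "content_hash": digest,
            "text": text,
            "ocr_confidence": confidence,  # low scores get flagged for manual review
        })
        known_hashes.add(digest)
    return results
```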

Next Job: Creates processing_jobs.type = 'chunking' for the entity

Step 3: Semantic Chunking & Preparation

Job Type: processing_jobs.type = 'chunking'

Process Flow

  1. Content Aggregation

    • Collect all processed pages for entity's documents
    • Organize content by document type and chronology
    • Prepare text for semantic analysis
  2. LLM-Based Chunking

    • Use a lightweight LLM to create semantic chunks with surrounding context
    • Maintain logical document boundaries and relationships
    • Generate chunk summaries and topic identification
  3. Chunk Enhancement

    • Generate importance scores for each chunk
    • Add document type tags and metadata
    • Create cross-references between related chunks
  4. Embedding Generation

    • Create vector embeddings for each chunk (see the sketch after this list)
    • Optimize embeddings for RAG-based chat functionality
    • Store embedding vectors for similarity search
  5. Database Storage

    • Store chunks in document_chunks table
    • Include complete source traceability to original documents
    • Link chunks to specific pages and document sections
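
A sketch of the chunk record and the retrieval it enables. The Chunk fields, the embed callable, and the brute-force search are all assumptions standing in for the real document_chunks schema and vector index:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Chunk:
    entity_id: int
    document_id: int
    page_numbers: list[int]                # traceability back to document_pages
    text: str
    summary: str = ""
    importance: float = 0.0
    embedding: list[float] = field(default_factory=list)

def embed_chunks(chunks: list[Chunk], embed) -> None:
    """Attach one vector per chunk; `embed` is a batched embedding-model call."""
    for chunk, vector in zip(chunks, embed([c.text for c in chunks])):
        chunk.embedding = vector

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector: list[float], chunks: list[Chunk], k: int = 5) -> list[Chunk]:
    """Brute-force similarity search; production would use a vector index instead."""
    return sorted(chunks, key=lambda c: cosine(query_vector, c.embedding), reverse=True)[:k]
```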

Completion: Update entities.processing_status = 'deep_research_complete'
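
The final status flip is a single update; a sketch reusing the connection style from the job sketch above and assuming an id primary key on entities (an assumption):

```python
def mark_deep_research_complete(conn, entity_id: int) -> None:
    """Unlock the enhanced overview and chat once the last chunk is stored."""
    conn.execute(
        "UPDATE entities SET processing_status = 'deep_research_complete' WHERE id = ?",
        (entity_id,),
    )
    conn.commit()
```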