Tier 2: Deep Research Process
Trigger: User clicks "Get detailed documents"
Completion: Enhanced overview + chat functionality enabled
Step 1: Document Collection & Download
Job Type: processing_jobs.type = 'collection'
Process Flow
Entity Navigation
- Use entity identifier from basic discovery
- Navigate to specific company page on Handelsregister
- Access document overview sections
Document Metadata Extraction
- Collect available document information (DK tree structure, AD/CD PDFs)
- Extract document identifiers, types, and access patterns
- Build document download queue
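A minimal sketch of what a download-queue entry could look like, assuming Python; the record name and fields are illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass

# Hypothetical queue entry for a document discovered on the register page.
@dataclass
class DocumentRef:
    entity_id: str      # entity identifier from basic discovery
    document_id: str    # identifier extracted from the document overview
    category: str       # e.g. 'AD', 'CD', or 'DK'
    doc_type: str       # e.g. 'shareholder_list', 'articles'
    issued_at: str      # ISO date, used later for recency scoring

download_queue: list[DocumentRef] = []
```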
Intelligent Document Selection
- Weigh each document's expected value against its processing cost
- Prioritize recent documents and key document types
- Skip obvious duplicates or low-value documents
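One way to express the selection heuristic above as a scoring function; the weights, type priorities, and ten-year recency window are illustrative assumptions, not tuned values:

```python
from datetime import date

# Hypothetical priorities for key document types; unknown types default to 1.
TYPE_PRIORITY = {"shareholder_list": 3, "articles": 3, "annual_report": 2}

def score(doc_type: str, issued_at: str, today: date | None = None) -> float:
    """Higher scores mean the document should be downloaded first."""
    today = today or date.today()
    age_years = (today - date.fromisoformat(issued_at)).days / 365
    recency = max(0.0, 1.0 - age_years / 10)   # recent documents score higher
    priority = TYPE_PRIORITY.get(doc_type, 1)   # key document types score higher
    return priority * (1.0 + recency)
```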
Bulk Download
- Download selected PDFs from AD/CD categories
- Navigate DK document tree for original source documents
- Handle ZIP files and extract contents
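A minimal sketch of the ZIP-handling step, assuming downloads arrive as raw bytes; the storage layout and the `.pdf` default are assumptions:

```python
import io
import zipfile
from pathlib import Path

def extract_or_store(payload: bytes, dest: Path, name: str) -> list[Path]:
    """Write a downloaded payload to disk, expanding ZIP archives in place."""
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(io.BytesIO(payload)):
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            zf.extractall(dest / name)  # note: only safe for trusted archives
            return [dest / name / member for member in zf.namelist()]
    path = dest / f"{name}.pdf"
    path.write_bytes(payload)
    return [path]
```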
File Management
- Store files with 48-hour retention policy
- Create documents table records with metadata
- Generate file hashes for deduplication (sketched below)
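A sketch of the hashing and retention bookkeeping; the documents-table columns shown are assumptions based on the fields named above:

```python
import hashlib
from datetime import datetime, timedelta
from pathlib import Path

def file_record(path: Path, entity_id: str) -> dict:
    """Build a documents-table row for a stored file."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "entity_id": entity_id,
        "file_path": str(path),
        "content_hash": digest,  # used for deduplication
        "expires_at": datetime.utcnow() + timedelta(hours=48),  # retention policy
    }
```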
Next Job: Creates processing_jobs.type = 'ocr' for the entity
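The job hand-off between steps could look like the following, assuming a SQL-backed queue; the processing_jobs columns beyond `type` are assumptions:

```python
import sqlite3

def enqueue_next_job(conn: sqlite3.Connection, entity_id: str, job_type: str) -> None:
    """Create the follow-up processing job for an entity."""
    conn.execute(
        "INSERT INTO processing_jobs (entity_id, type, status) VALUES (?, ?, 'pending')",
        (entity_id, job_type),
    )
    conn.commit()

# On successful completion of the collection job:
# enqueue_next_job(conn, entity_id, "ocr")
```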
Step 2: OCR & Text Extraction
Job Type: processing_jobs.type = 'ocr'
Process Flow
Hash-Based Deduplication
- Calculate content hash for each document page
- Check the document_pages table for existing results with the same hash
- Skip processing for pages already processed
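A minimal sketch of the hash-based skip, assuming a SQLite-style lookup; the document_pages column name is an assumption:

```python
import hashlib
import sqlite3

def page_needs_ocr(conn: sqlite3.Connection, page_bytes: bytes) -> tuple[str, bool]:
    """Return the page's content hash and whether OCR is still required."""
    page_hash = hashlib.sha256(page_bytes).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM document_pages WHERE content_hash = ?", (page_hash,)
    ).fetchone()
    return page_hash, row is None  # True -> not seen before, run OCR
```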
OCR Processing
- Convert PDFs/TIFs to text page by page
- Handle mixed content (text + images + signatures)
- Process only pages not found in deduplication check
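One possible OCR stack for this step is pdf2image plus pytesseract (requires Poppler and Tesseract installed); the project may use a different engine, so treat this as an illustrative sketch. The German language pack fits Handelsregister documents:

```python
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(path: str, skip_pages: set[int]) -> dict[int, str]:
    """Rasterise each PDF page and OCR only pages not already deduplicated."""
    pages = convert_from_path(path, dpi=300)  # one PIL image per page
    return {
        i: pytesseract.image_to_string(img, lang="deu")
        for i, img in enumerate(pages, start=1)
        if i not in skip_pages
    }
```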
Quality Validation
- Verify OCR results for readability and completeness
- Flag low-quality pages for potential manual review
- Generate confidence scores for OCR accuracy
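A sketch of a page-level confidence score built from Tesseract word confidences; the 0.6 review threshold is an illustrative assumption:

```python
import pytesseract
from pytesseract import Output

def page_confidence(img) -> float:
    """Average word confidence for a page image, scaled to 0..1."""
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 marks non-text boxes
    return (sum(confs) / len(confs)) / 100 if confs else 0.0

def needs_review(img, threshold: float = 0.6) -> bool:
    """Flag low-quality pages for potential manual review."""
    return page_confidence(img) < threshold
```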
Database Storage
- Store page-level content in the document_pages table
- Include source document references and page numbers
- Link to original files for verification
Next Job: Creates processing_jobs.type = 'chunking' for the entity
Step 3: Semantic Chunking & Preparation
Job Type: processing_jobs.type = 'chunking'
Process Flow
Content Aggregation
- Collect all processed pages for entity's documents
- Organize content by document type and chronology
- Prepare text for semantic analysis
LLM-Based Chunking
- Use a lightweight model to create semantic chunks with context
- Maintain logical document boundaries and relationships
- Generate chunk summaries and topic identification
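An illustrative prompt for the lightweight chunking model; the exact wording, the JSON contract, and the model call around it are placeholders, not the project's actual setup:

```python
# Hypothetical prompt template; field names mirror the chunk metadata above.
CHUNK_PROMPT = """Split the following register document into semantic chunks.
Keep logical sections (e.g. a shareholder-list entry or a resolution) intact.
For each chunk return JSON with: text, summary, topics.

Document ({doc_type}, dated {doc_date}):
{page_text}
"""

def build_chunking_prompt(doc_type: str, doc_date: str, page_text: str) -> str:
    return CHUNK_PROMPT.format(doc_type=doc_type, doc_date=doc_date, page_text=page_text)
```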
Chunk Enhancement
- Generate importance scores for each chunk
- Add document type tags and metadata
- Create cross-references between related chunks
Embedding Generation
- Create vector embeddings for each chunk
- Optimize embeddings for RAG-based chat functionality
- Store embedding vectors for similarity search
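One way to generate the chunk embeddings, using sentence-transformers; the model choice is an assumption, and any embedding model suited to German legal text would fit here:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed chunk texts for similarity search."""
    # Normalising means cosine similarity reduces to a dot product at query time.
    return model.encode(chunks, normalize_embeddings=True).tolist()
```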
Database Storage
- Store chunks in the document_chunks table
- Include complete source traceability to original documents
- Link chunks to specific pages and document sections
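A sketch of the traceability record, assuming a relational document_chunks table; the column names are illustrative, and a real deployment would likely use a vector column rather than serialised JSON:

```python
import json
import sqlite3

def store_chunk(conn: sqlite3.Connection, chunk: dict) -> None:
    """Persist a chunk with full traceability to its source pages."""
    conn.execute(
        """INSERT INTO document_chunks
           (document_id, page_start, page_end, text, summary, embedding)
           VALUES (?, ?, ?, ?, ?, ?)""",
        (
            chunk["document_id"],  # links back to the documents table
            chunk["page_start"],   # first source page
            chunk["page_end"],     # last source page
            chunk["text"],
            chunk["summary"],
            json.dumps(chunk["embedding"]),
        ),
    )
    conn.commit()
```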
Completion: Update entities.processing_status = 'deep_research_complete'