Tier 2: Deep Research Process

Trigger: User clicks "Get detailed documents"
Completion: Enhanced overview + chat functionality enabled

Step 1: Document Collection & Download

Job Type: processing_jobs.type = 'collection'

Process Flow

  1. Entity Navigation

    • Use entity identifier from basic discovery
    • Navigate to specific company page on Handelsregister
    • Access document overview sections
  2. Document Metadata Extraction

    • Collect available document information (DK tree structure, AD/CD PDFs)
    • Extract document identifiers, types, and access patterns
    • Build document download queue
  3. Intelligent Document Selection

    • Weigh each document's expected value against its processing cost
    • Prioritize recent documents and key document types
    • Skip obvious duplicates or low-value documents
  4. Bulk Download

    • Download selected PDFs from AD/CD categories
    • Navigate DK document tree for original source documents
    • Handle ZIP files and extract contents
  5. File Management

    • Store files with 48-hour retention policy
    • Create documents table records with metadata
    • Generate file hashes for deduplication (see the hashing sketch after this list)
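
A minimal sketch of the hashing step, assuming SHA-256 over the raw file bytes. The register_download helper and the in-memory seen-hash set are hypothetical stand-ins for the real documents-table lookup:

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB blocks so large PDFs/ZIPs stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def register_download(path: Path, seen_hashes: set[str]) -> bool:
    """Return True if the file is new; duplicates are skipped before any record is created."""
    digest = file_sha256(path)
    if digest in seen_hashes:
        return False  # identical bytes already downloaded for this entity
    seen_hashes.add(digest)
    # ...create the documents-table record here: path, digest, 48-hour retention deadline
    return True
```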

Next Job: Creates processing_jobs.type = 'ocr' for the entity
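
Each step hands off to the next by appending a row to processing_jobs. A sketch of that hand-off, assuming a SQLite-style table with entity_id and status columns (both assumptions; the real schema may differ):

```python
import sqlite3

def enqueue_next_job(conn: sqlite3.Connection, entity_id: int, job_type: str) -> None:
    """Append the follow-up job; a worker picks up rows with status = 'pending'."""
    conn.execute(
        "INSERT INTO processing_jobs (entity_id, type, status) VALUES (?, ?, 'pending')",
        (entity_id, job_type),
    )
    conn.commit()

# After a successful collection run:
# enqueue_next_job(conn, entity_id, "ocr")
```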

Step 2: OCR & Text Extraction

Job Type: processing_jobs.type = 'ocr'

Process Flow

  1. Hash-Based Deduplication

    • Calculate content hash for each document page
    • Check document_pages table for existing results with same hash
    • Skip pages whose hash already has a stored result (see the sketch after this list)
  2. OCR Processing

    • Convert PDFs/TIFs to text page by page
    • Handle mixed content (text + images + signatures)
    • Process only pages not found in deduplication check
  3. Quality Validation

    • Verify OCR results for readability and completeness
    • Flag low-quality pages for potential manual review
    • Generate confidence scores for OCR accuracy
  4. Database Storage

    • Store page-level content in document_pages table
    • Include source document references and page numbers
    • Link to original files for verification
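
A sketch of the dedup-then-OCR loop, assuming SHA-256 page hashes and a pluggable ocr_page callable (hypothetical; e.g. a Tesseract wrapper returning text plus a confidence score):

```python
import hashlib

def ocr_new_pages(pages: list[bytes], known_hashes: set[str], ocr_page) -> list[dict]:
    """Run OCR only on pages whose content hash has no row in document_pages yet."""
    results = []
    for number, page_bytes in enumerate(pages, start=1):
        digest = hashlib.sha256(page_bytes).hexdigest()
        if digest in known_hashes:
            continue  # an identical page was already processed and stored
        text, confidence = ocr_page(page_bytes)  # e.g. a Tesseract wrapper
        results.append({
            "page_number": number,
            "content_hash": digest,
            "text": text,
            "ocr_confidence": confidence,  # low scores get flagged for manual review
        })
        known_hashes.add(digest)
    return results
```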

Next Job: Creates processing_jobs.type = 'chunking' for the entity

Step 3: Semantic Chunking & Preparation

Job Type: processing_jobs.type = 'chunking'

Process Flow

  1. Content Aggregation

    • Collect all processed pages for entity's documents
    • Organize content by document type and chronology
    • Prepare text for semantic analysis
  2. LLM-Based Chunking

    • Use a lightweight LLM to create semantic chunks with surrounding context
    • Maintain logical document boundaries and relationships
    • Generate chunk summaries and topic identification
  3. Chunk Enhancement

    • Generate importance scores for each chunk
    • Add document type tags and metadata
    • Create cross-references between related chunks
  4. Embedding Generation

    • Create vector embeddings for each chunk (see the sketch after this list)
    • Optimize embeddings for RAG-based chat functionality
    • Store embedding vectors for similarity search
  5. Database Storage

    • Store chunks in document_chunks table
    • Include complete source traceability to original documents
    • Link chunks to specific pages and document sections
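
A sketch of the chunk record and the retrieval it enables. The Chunk fields, the embed callable, and the brute-force search are all assumptions standing in for the real document_chunks schema and vector index:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Chunk:
    entity_id: int
    document_id: int
    page_numbers: list[int]                # traceability back to document_pages
    text: str
    summary: str = ""
    importance: float = 0.0
    embedding: list[float] = field(default_factory=list)

def embed_chunks(chunks: list[Chunk], embed) -> None:
    """Attach one vector per chunk; `embed` is a batched embedding-model call."""
    for chunk, vector in zip(chunks, embed([c.text for c in chunks])):
        chunk.embedding = vector

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector: list[float], chunks: list[Chunk], k: int = 5) -> list[Chunk]:
    """Brute-force similarity search; production would use a vector index instead."""
    return sorted(chunks, key=lambda c: cosine(query_vector, c.embedding), reverse=True)[:k]
```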

Completion: Update entities.processing_status = 'deep_research_complete'
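
The final status flip is a single update; a sketch reusing the connection style from the job sketch above and assuming an id primary key on entities (an assumption):

```python
def mark_deep_research_complete(conn, entity_id: int) -> None:
    """Unlock the enhanced overview and chat once the last chunk is stored."""
    conn.execute(
        "UPDATE entities SET processing_status = 'deep_research_complete' WHERE id = ?",
        (entity_id,),
    )
    conn.commit()
```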