Content Schema Tables
Access: Read-only for authenticated users, full access for workers
entities Table
Stores company records with extracted data and processing status.
Field Definitions
| Field | Type | Description | Enum Values |
|---|---|---|---|
| id | uuid | Primary key | - |
| name | text | Company name | - |
| legal_form | text | Legal form of the company | - |
| register_court | text | Court where company is registered | - |
| register_type | text | Register type (e.g., HRB, HRA, VR) | - |
| register_number | text | Official registration number | - |
| seat | text | Company location/seat | - |
| state | text | German state where the company is located (e.g., "Berlin", "Bayern") | - |
| registration_status | text | Current registration status (e.g., "currently registered") | - |
| address | text | Company address | - |
| raw | text | Raw data dump from the source system | - |
| si_document_xml | xml | Raw XML document from Handelsregister | - |
| si_document_json | jsonb | Parsed XML document | - |
| si_document_retrieved_at | timestamp | When XML was last retrieved | - |
| processing_status | processing_status_enum | Current basic discovery processing stage | discovered, basic_discovery_running, xml_ready, xml_parsing_running, basic_discovery_complete, xml_download_failed, xml_parsing_failed |
| shareholder_research_status | shareholder_research_status_enum | Status of optional shareholder research | not_started, ready, downloading, download_failed, downloaded, parsing, parsing_failed, complete |
| deep_research_status | deep_research_status_enum | Status of optional deep research | not_started, ready, running, failed, complete |
| created_at | timestamp | Record creation time | - |
| updated_at | timestamp | Last update time | - |
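
The processing_status values listed above can be mirrored in application code so that invalid strings are rejected early. The following is a minimal sketch, not the project's actual worker code: the class name `ProcessingStatus` and the helpers `parse_status` and `is_failed` are hypothetical, and the failure check assumes that every failure state carries a `_failed` suffix, as the listed values suggest.

```python
from enum import Enum


class ProcessingStatus(str, Enum):
    """Values of processing_status_enum as listed in the entities table."""

    DISCOVERED = "discovered"
    BASIC_DISCOVERY_RUNNING = "basic_discovery_running"
    XML_READY = "xml_ready"
    XML_PARSING_RUNNING = "xml_parsing_running"
    BASIC_DISCOVERY_COMPLETE = "basic_discovery_complete"
    XML_DOWNLOAD_FAILED = "xml_download_failed"
    XML_PARSING_FAILED = "xml_parsing_failed"


def parse_status(raw: str) -> ProcessingStatus:
    # Raises ValueError for any string that is not a listed enum value.
    return ProcessingStatus(raw)


def is_failed(status: ProcessingStatus) -> bool:
    # Assumption: failure states are exactly those ending in "_failed".
    return status.value.endswith("_failed")
```

The same pattern would apply to shareholder_research_status and deep_research_status with their own value lists.
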
files Table
Stores metadata for each unique physical file downloaded from the Handelsregister. A content-based hash (file_hash) serves as the unique key, so the same file is never stored more than once and duplicate downloads consume no additional storage.
Field Definitions
| Field | Type | Description |
|---|---|---|
| id | uuid | Primary key. |
| file_hash | text | Unique SHA256 hash of the file's content for deduplication. |
| storage_path | text | The full path to the file in the entity_documents storage bucket. |
| file_size_bytes | bigint | The size of the file in bytes. |
| mime_type | text | The MIME type of the file (e.g., application/pdf, application/zip). |
| created_at | timestamp | Timestamp of when the file record was first created. |
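
The deduplication described for this table can be illustrated with a small in-memory sketch. It assumes the hash is computed over the raw file bytes; `FileStore`, `FileRecord`, and the storage path layout `entity_documents/<hash>` are hypothetical stand-ins, not the actual storage logic.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class FileRecord:
    file_hash: str
    storage_path: str
    file_size_bytes: int
    mime_type: str


class FileStore:
    """In-memory stand-in for content.files, keyed by file_hash."""

    def __init__(self) -> None:
        self._by_hash: dict[str, FileRecord] = {}

    def register(self, content: bytes, mime_type: str) -> tuple[FileRecord, bool]:
        """Return (record, created). created is False for a duplicate."""
        file_hash = hashlib.sha256(content).hexdigest()
        existing = self._by_hash.get(file_hash)
        if existing is not None:
            # Same bytes seen before: reuse the record, store nothing new.
            return existing, False
        record = FileRecord(
            file_hash=file_hash,
            storage_path=f"entity_documents/{file_hash}",  # hypothetical layout
            file_size_bytes=len(content),
            mime_type=mime_type,
        )
        self._by_hash[file_hash] = record
        return record, True
```

In a real database the same guarantee would more likely come from a unique constraint on file_hash than from an application-level lookup.
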
documents Table
This table serves as a bridge, linking a physical file (content.files) to a specific company (content.entities) and enriching it with essential business context and metadata from the Handelsregister. It also supports hierarchical relationships, allowing us to track files that were extracted from an archive (like a ZIP).
Field Definitions
| Field | Type | Description |
|---|---|---|
| id | uuid | Primary key. |
| entity_id | uuid | Foreign key linking the document to a specific company in content.entities. |
| file_id | uuid | Foreign key linking to the actual physical file in content.files. |
| parent_document_id | uuid | If extracted from an archive, this links to the parent archive's record in this table. |
| original_filename | text | The original, often cryptic, filename as downloaded from the source. |
| display_name | text | A clean, human-readable name for the document, potentially generated by AI. |
| hr_document_path | text | The navigational path from the Handelsregister portal (e.g., VÖ/1/2). |
| document_date | date | The date printed on the document itself (e.g., date of signing). |
| received_on | date | The date the document was received by the register. |
| published_on | date | The date the document was officially published. |
| created_by | text | The source named in the Handelsregister. |
| type_of_document | text | The specific type of the document (e.g., 'Gesellschafterliste', 'Jahresabschluss'). |
| language_identifier | text | The language of the document in the Handelsregister. |
| created_at | timestamp | Record creation time. |
| updated_at | timestamp | Last update time. |
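
The hierarchical relationship via parent_document_id can be followed upward from an extracted file to the archive it came from. The sketch below works on an in-memory dictionary rather than the database; the `Document` dataclass carries only a subset of the columns, and `archive_chain` is a hypothetical helper name.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    id: str
    file_id: str
    original_filename: str
    parent_document_id: Optional[str] = None


def archive_chain(doc_id: str, docs: dict[str, Document]) -> list[str]:
    """Return filenames from the given document up to its top-level archive."""
    chain: list[str] = []
    seen: set[str] = set()
    current = docs.get(doc_id)
    # The seen set guards against accidental cycles in the parent links.
    while current is not None and current.id not in seen:
        seen.add(current.id)
        chain.append(current.original_filename)
        parent_id = current.parent_document_id
        current = docs.get(parent_id) if parent_id else None
    return chain
```

For a PDF extracted from a ZIP, the result lists the PDF first and the archive last.
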
document_pages Table
- Stores OCR results page by page with content hashes
- Enables intelligent deduplication to avoid reprocessing identical pages
- Caches the output of OCR, the most expensive operation in the pipeline
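
The caching idea behind document_pages can be sketched as a lookup keyed by a hash of the page content. This is an illustration under assumptions: the source does not say which hash is used for pages (SHA256 is borrowed from the files table), and `PageOcrCache` with its injected `ocr` callable is hypothetical.

```python
from __future__ import annotations

import hashlib
from typing import Callable


class PageOcrCache:
    """Runs OCR at most once per distinct page content."""

    def __init__(self, ocr: Callable[[bytes], str]) -> None:
        self._ocr = ocr
        self._text_by_hash: dict[str, str] = {}
        self.ocr_calls = 0  # counts real OCR invocations, for illustration

    def text_for(self, page_bytes: bytes) -> str:
        # Assumption: pages are hashed with SHA256 over their raw bytes.
        page_hash = hashlib.sha256(page_bytes).hexdigest()
        if page_hash not in self._text_by_hash:
            self.ocr_calls += 1
            self._text_by_hash[page_hash] = self._ocr(page_bytes)
        return self._text_by_hash[page_hash]
```

Passing the OCR function in as a parameter keeps the sketch independent of any particular OCR engine.
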
document_chunks Table
- Stores semantic chunks created by LLM processing
- Contains embeddings and metadata for RAG-based chat
- Prepares data for efficient question answering
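
How stored embeddings might support RAG-based chat can be shown with a small retrieval sketch. The source does not specify the similarity measure or the column names, so cosine similarity and the `text` and `embedding` keys are assumptions, and `top_chunks` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_embedding, chunks, k=2):
    """Return the text of the k chunks most similar to the query embedding."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_embedding, c["embedding"]),
        reverse=True,
    )
    return [c["text"] for c in ranked[:k]]
```

The selected chunk texts would then be passed to the LLM as context for answering the user's question. A production setup would more likely delegate the ranking to a vector index than sort in application code.
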
entity_shareholders Table
Stores shareholder information for companies, extracted from shareholder list documents via LLM processing.
Field Definitions
| Field | Type | Description |
|---|---|---|
| id | uuid | Primary key |
| entity_id | uuid | Foreign key to the company in content.entities |
| shareholder_type | text | Type of shareholder: natural_person or organization |
| first_name | text | First name (for natural persons) |
| last_name | text | Last name (for natural persons) |
| date_of_birth | date | Date of birth (for natural persons) |
| residence | text | Residence/location (wohnort, for natural persons) |
| company_name | text | Company name (for organizations) |
| register_court | text | Register court (registergericht, for organizations) |
| register_type | text | Register type (e.g., HRB, HRA, VR, for organizations) |
| register_number | text | Register number (registernummer, for organizations) |
| seat | text | Company seat/location (for organizations) |
| foreign_entity | boolean | If true, indicates a foreign entity that should not trigger automatic discovery |
| resolved_entity_id | uuid | Foreign key to content.entities if shareholder organization has been resolved to an existing entity |
| share_nominal_amount | numeric | Nominal amount of shares held (nennbetrag_anteil) |
| share_percentage | numeric | Percentage of ownership (anteil_prozent_einzel) |
| sequence_number | integer | Order/sequence number in the shareholder list (lfd_nummer) |
| source_document_id | uuid | Foreign key to the source document in content.documents |
| created_at | timestamp | Record creation time |
| updated_at | timestamp | Last update time |
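
The interplay of shareholder_type, foreign_entity, and resolved_entity_id can be read as a rule for when a shareholder should lead to discovery of a further company. The sketch below is one possible interpretation, not the documented worker logic: `should_trigger_discovery` is a hypothetical function, and treating an already resolved organization as needing no discovery is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Shareholder:
    shareholder_type: str  # "natural_person" or "organization"
    company_name: Optional[str] = None
    register_court: Optional[str] = None
    register_type: Optional[str] = None
    register_number: Optional[str] = None
    foreign_entity: bool = False
    resolved_entity_id: Optional[str] = None


def should_trigger_discovery(s: Shareholder) -> bool:
    # Natural persons are not companies, so there is nothing to discover.
    if s.shareholder_type != "organization":
        return False
    # Per the field description, foreign entities do not trigger discovery.
    if s.foreign_entity:
        return False
    # Assumption: an organization already linked to an entity needs no new run.
    return s.resolved_entity_id is None
```
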