What is a Data Model for Unstructured Data?
Here is an example of a data model in the context of unstructured data:
Data Model Example: Document Search System
This data model is tailored for unstructured data in a document search system, such as one built on a vector database like Qdrant.
Entity: Document
Field Name | Data Type | Description |
---|---|---|
id | String (UUID) | Unique identifier for the document. |
title | String | The title of the document. |
content_vector | Float Array | Dense vector representation of the document content, generated using a pre-trained language model (e.g., an OpenAI embedding model or BERT). |
metadata | Object (JSON) | Key-value pairs storing metadata about the document (e.g., author, date, tags). |
categories | Array of Strings | List of categories the document belongs to (e.g., "contract law", "intellectual property"). |
created_at | DateTime | Timestamp when the document was created. |
updated_at | DateTime | Timestamp when the document was last updated. |
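The Document entity can be sketched as a Python dataclass. This is a minimal illustration, not the schema of any particular database: the class name `Document`, the default factories, and the three-value toy vector are assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


def _now_utc() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass
class Document:
    """One record in the document search system (hypothetical class name)."""

    title: str
    content_vector: List[float]  # dense embedding of the document content
    metadata: Dict[str, Any] = field(default_factory=dict)
    categories: List[str] = field(default_factory=list)
    # Assumption: the id is generated client-side as a random UUID string.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=_now_utc)
    updated_at: datetime = field(default_factory=_now_utc)


doc = Document(
    title="Copyright Law in the Digital Age",
    content_vector=[0.123, 0.987, 0.456],  # toy vector; real embeddings are far longer
    metadata={"author": "Jane Doe"},
    categories=["copyright law", "digital media"],
)
```

Keeping `metadata` as a free-form dictionary mirrors the Object (JSON) type in the table: new keys can be added per document without touching the class definition.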
Example JSON Representation
{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "title": "Copyright Law in the Digital Age",
  "content_vector": [0.123, 0.987, 0.456, ...],
  "metadata": {
    "author": "Jane Doe",
    "publish_date": "2024-01-15",
    "language": "English"
  },
  "categories": ["copyright law", "digital media"],
  "created_at": "2024-01-15T10:00:00Z",
  "updated_at": "2024-11-30T12:00:00Z"
}
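A record in this shape can be loaded and sanity-checked with the Python standard library. The sketch below is one possible approach under stated assumptions: `load_document` and `parse_timestamp` are hypothetical helper names, and the vector is cut to the three values shown above so that the JSON parses (the original elides the rest).

```python
import json
import uuid
from datetime import datetime

# The example record, with the elided vector shortened to three values.
RAW = """
{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "title": "Copyright Law in the Digital Age",
  "content_vector": [0.123, 0.987, 0.456],
  "metadata": {
    "author": "Jane Doe",
    "publish_date": "2024-01-15",
    "language": "English"
  },
  "categories": ["copyright law", "digital media"],
  "created_at": "2024-01-15T10:00:00Z",
  "updated_at": "2024-11-30T12:00:00Z"
}
"""


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; map a trailing 'Z' to an explicit UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def load_document(raw: str) -> dict:
    """Parse a JSON document record and check the typed fields."""
    record = json.loads(raw)
    uuid.UUID(record["id"])  # raises ValueError if the id is not a valid UUID
    if not all(isinstance(x, (int, float)) for x in record["content_vector"]):
        raise ValueError("content_vector must contain only numbers")
    record["created_at"] = parse_timestamp(record["created_at"])
    record["updated_at"] = parse_timestamp(record["updated_at"])
    if record["updated_at"] < record["created_at"]:
        raise ValueError("updated_at must not precede created_at")
    return record


document = load_document(RAW)
```

The `metadata` object is passed through unchecked on purpose, since the model treats it as free-form key-value pairs.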
Why This Model?
- Flexibility: The content_vector field encodes the unstructured text and enables similarity search, while the structured metadata and categories fields support filtering and faceting.
- Extensibility: You can add new fields (e.g., "related documents") without major schema changes.
- Efficiency: Vector-based retrieval finds semantically similar text without relying on exact keyword matches, while metadata filters narrow the results precisely.
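The combination of similarity search and structured filtering can be illustrated in plain Python, without any database client. This is a sketch only: the `search` and `cosine_similarity` functions, the toy three-dimensional vectors, and the two extra titles ("Patent Basics", "Lease Agreements") are invented for the example. A vector database such as Qdrant would typically use an index instead of this exhaustive scan.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    documents: List[Dict],
    query_vector: List[float],
    category: Optional[str] = None,
    top_k: int = 3,
) -> List[Dict]:
    """Filter by category first, then rank the remaining documents by similarity."""
    candidates = [
        d for d in documents if category is None or category in d["categories"]
    ]
    ranked = sorted(
        candidates,
        key=lambda d: cosine_similarity(d["content_vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical corpus with toy vectors.
documents = [
    {"title": "Copyright Law in the Digital Age",
     "content_vector": [0.9, 0.1, 0.0],
     "categories": ["copyright law", "digital media"]},
    {"title": "Patent Basics",
     "content_vector": [0.8, 0.2, 0.1],
     "categories": ["intellectual property"]},
    {"title": "Lease Agreements",
     "content_vector": [0.0, 0.1, 0.9],
     "categories": ["contract law"]},
]

query = [1.0, 0.0, 0.0]
filtered = search(documents, query, category="copyright law")
unfiltered = search(documents, query, top_k=2)
```

With the category filter, only the copyright document is returned. Without it, the two documents whose vectors point closest to the query come back in order of similarity.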