Unstructured AI workflow
Collibra automates the transformation of unstructured enterprise content, such as documents, transcripts, and emails, into structured, governed knowledge assets. This ensures that data is high-quality, traceable, and trusted for use in Generative AI (GenAI), Retrieval-Augmented Generation (RAG), and enterprise search applications.
The automated metadata workflow
Unstructured AI replaces manual, time-intensive tagging processes with intelligent automation, reducing dependency on domain experts and custom AI pipelines. It is designed to scale and can tag millions of documents per day.
Connecting to unstructured data
This first step connects the solution to the sources of unstructured enterprise content. By scanning file repositories, the system identifies and indexes unstructured content, making it discoverable and ready for further processing. This addresses the common problems of fragmented data sources and poor discoverability.
To connect to unstructured data, Unstructured AI establishes a connection to raw files stored in repositories or to existing vector databases. It then runs a smart discovery process that scans the file repositories to identify and prioritize relevant content.
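The scanning and indexing idea can be illustrated with a minimal sketch. It is not Collibra's connector: it assumes a local directory stands in for the file repository, and the names `scan_repository`, `IndexedFile`, and `UNSTRUCTURED_EXTENSIONS` are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Assumption: these extensions stand in for "unstructured content".
UNSTRUCTURED_EXTENSIONS = {".pdf", ".docx", ".txt", ".eml", ".md"}


@dataclass(frozen=True)
class IndexedFile:
    """One index entry for a discovered file (hypothetical structure)."""
    path: str
    extension: str
    size_bytes: int


def scan_repository(root: str) -> List[IndexedFile]:
    """Walk a directory tree and index files that look like unstructured content."""
    indexed = []
    for path in sorted(Path(root).rglob("*")):
        extension = path.suffix.lower()
        if path.is_file() and extension in UNSTRUCTURED_EXTENSIONS:
            indexed.append(IndexedFile(str(path), extension, path.stat().st_size))
    return indexed
```

A real connector would also need to handle authentication, incremental scans, and prioritization, none of which this sketch attempts.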
AI-assisted schema generation
In this step, Unstructured AI automatically determines the semantic structure, or schema, required to organize the unstructured data. This eliminates the need for manual taxonomy creation, which is often time-consuming and dependent on subject matter experts. By deriving meaningful taxonomies and metadata tags, the system creates a structured framework for unstructured data.
Unstructured AI uses machine learning algorithms and large language models (LLMs) to perform AI-assisted schema generation. It discovers taxonomies and reverse-engineers metadata tags and schemas directly from the data corpus. You can also define custom tags if needed.
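One way to picture how discovered tags and custom tags might coexist in a single schema is the sketch below. The `TagDefinition` structure and the `merge_schema` function are hypothetical, and the rule that a custom tag overrides a discovered tag of the same name is an assumption made for illustration, not documented product behavior.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TagDefinition:
    """A single tag in the schema (hypothetical structure)."""
    name: str
    allowed_values: List[str]
    source: str  # "discovered" or "custom"


def merge_schema(discovered: List[TagDefinition],
                 custom: List[TagDefinition]) -> Dict[str, TagDefinition]:
    """Combine discovered and custom tags into one schema keyed by tag name."""
    schema = {tag.name: tag for tag in discovered}
    for tag in custom:
        # Assumption: a custom definition replaces a discovered one with the same name.
        schema[tag.name] = tag
    return schema
```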
LLM-based tagging and metadata extraction
Once the schema is established, Unstructured AI applies tagging and metadata extraction to the unstructured files. This process is highly scalable and captures detailed metadata at both the chunk and file levels. Unstructured AI also supports multimodal tagging, identifying information in text, images, charts, and tables. Each tag is accompanied by a confidence score and traceability information, ensuring transparency and reliability.
Unstructured AI uses LLM-based tagging to extract metadata at the document chunk level and synthesize it into file-level metadata. Multimodal tagging allows the solution to process diverse content types, and all tags include confidence scores to support explainability.
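The roll-up from chunk-level tags to file-level metadata can be sketched as follows. This is an illustrative aggregation only: the `ChunkTag` structure, the `synthesize_file_tags` function, the choice of keeping the highest chunk confidence, and the `min_confidence` cutoff are all assumptions, not the product's actual synthesis logic. The list of source chunks is kept with each tag to illustrate traceability.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChunkTag:
    """A tag assigned to one chunk of a document (hypothetical structure)."""
    chunk_id: int
    tag: str
    confidence: float  # between 0.0 and 1.0


def synthesize_file_tags(chunk_tags: List[ChunkTag],
                         min_confidence: float = 0.5) -> Dict[str, dict]:
    """Roll chunk-level tags up to file level.

    Assumption: a file-level tag takes the highest confidence seen in any
    chunk and is dropped if that value is below min_confidence.
    """
    grouped = defaultdict(list)
    for chunk_tag in chunk_tags:
        grouped[chunk_tag.tag].append(chunk_tag)

    file_tags = {}
    for tag, items in grouped.items():
        best = max(item.confidence for item in items)
        if best >= min_confidence:
            file_tags[tag] = {
                "confidence": best,
                # Source chunks are recorded so each tag can be traced back.
                "source_chunks": sorted({item.chunk_id for item in items}),
            }
    return file_tags
```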
Curating data slices
The structured metadata generated in the previous steps enables you to filter and curate datasets. This step addresses the challenge of managing duplicate, outdated, or irrelevant documents by allowing you to create subsets of files tailored to specific use cases. These curated data slices are particularly useful for data science teams and consumers who need quick access to high-quality, relevant data.
A data slice is a relevant subset of files curated from a large volume of unstructured content. The purpose of creating a data slice is to rapidly identify and isolate the most pertinent documents for a specific objective.
You can apply standardized filters and categories to the structured metadata, enabling rapid curation of data slices. This process streamlines the selection of relevant files from millions of documents.
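Filtering on structured metadata to build a data slice might look like the sketch below. The record layout (a `path` plus a `tags` dictionary of value and confidence pairs), the `curate_slice` function, and the confidence threshold are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List


def curate_slice(files: List[dict],
                 filters: Dict[str, str],
                 min_confidence: float = 0.8) -> List[str]:
    """Return paths of files whose tags match every filter with enough confidence."""
    selected = []
    for record in files:
        tags = record["tags"]
        matches = all(
            name in tags
            and tags[name]["value"] == wanted
            and tags[name]["confidence"] >= min_confidence
            for name, wanted in filters.items()
        )
        if matches:
            selected.append(record["path"])
    return selected


# Hypothetical file-level metadata produced by earlier steps.
catalog = [
    {"path": "a.pdf", "tags": {"doc_type": {"value": "contract", "confidence": 0.95},
                               "region": {"value": "EMEA", "confidence": 0.90}}},
    {"path": "b.pdf", "tags": {"doc_type": {"value": "contract", "confidence": 0.60}}},
    {"path": "c.txt", "tags": {"doc_type": {"value": "invoice", "confidence": 0.99}}},
]
contracts = curate_slice(catalog, {"doc_type": "contract"})
```

Here `contracts` contains only `a.pdf`, because `b.pdf` carries the right tag but falls below the assumed confidence threshold.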
Exporting and synchronizing metadata
To ensure effective use of the governed metadata, this step exports it to downstream systems. Unstructured AI integrates with AI tools, enabling these systems to consume enriched, contextualized, and governed data. This enhances the performance of AI workflows, such as GenAI assistants and RAG systems, by providing high-quality data inputs.
You can export metadata directly to file storage systems or vector databases.
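As a rough illustration of exporting metadata to file storage, the sketch below writes one JSON record per line (JSON Lines). The format, the `export_metadata_jsonl` function, and the record fields are assumptions for this example; the actual export formats and destinations supported by the product are not specified here.

```python
import json
from pathlib import Path
from typing import Iterable


def export_metadata_jsonl(records: Iterable[dict], destination: str) -> int:
    """Write metadata records to a JSON Lines file and return the record count."""
    path = Path(destination)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # sort_keys keeps the output stable across runs.
            handle.write(json.dumps(record, sort_keys=True) + "\n")
            count += 1
    return count
```

A downstream consumer, such as a job that loads metadata into a vector database, could then read the file line by line and attach each record to the matching document.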
Ongoing maintenance and validation
To maintain the quality and relevance of the metadata, Unstructured AI continuously updates taxonomies and allows for validation of generated metadata. This ensures that the metadata evolves as new data is ingested and as business definitions change. Validation and fine-tuning enhance the accuracy and reliability of the metadata, increasing trust and transparency.
Unstructured AI automatically updates metadata taxonomies as new data enters the platform. You can also validate and fine-tune metadata using a Human-in-the-Loop (HITL) testing studio, which supports classification accuracy and traceability.
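One common pattern for human-in-the-loop validation is to route low-confidence tags to a reviewer while accepting the rest automatically. The sketch below shows that pattern under stated assumptions: the `split_for_review` function, the tag record layout, and the `review_threshold` value are hypothetical and do not describe how the HITL testing studio works internally.

```python
from typing import List, Tuple


def split_for_review(tags: List[dict],
                     review_threshold: float = 0.75) -> Tuple[List[dict], List[dict]]:
    """Split tags into auto-accepted and needs-review lists by confidence."""
    accepted, needs_review = [], []
    for tag in tags:
        if tag["confidence"] >= review_threshold:
            accepted.append(tag)
        else:
            # Assumption: anything below the threshold goes to a human reviewer.
            needs_review.append(tag)
    return accepted, needs_review
```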
Conclusion
Unstructured AI capabilities bridge structured and unstructured data, while Collibra Platform provides a comprehensive metadata foundation that serves as the control plane for AI-driven data management. Together, they provide key benefits, such as:
- Unified governance: Enable seamless governance across structured and unstructured data, ensuring end-to-end lineage, policy enforcement, and compliance.
- Scalability: Automate metadata generation and tagging at scale, reducing manual effort and accelerating workflows.
- Trust and transparency: Deliver explainable metadata with traceability, ensuring compliance with regulatory and audit requirements.
- Improved AI outcomes: Provide high-quality, governed data inputs to AI systems, improving model accuracy and retrieval relevance.
By automating the transformation of unstructured content into governed knowledge assets, Collibra Unstructured AI empowers organizations to unlock the value of their data for AI, search, and analytics applications.