GitHub user xtangcode created a discussion: [Ideas] AI-Powered Parser Function
for Unstructured Data

### Description

We propose adding an AI-driven parser function to Apache Cloudberry that
converts unstructured data from diverse formats into structured JSON and
stores it in tables. The function would act as a bridge between unstructured
data sources and structured systems. Its core capabilities would include:

- Broad Format Support: Natively handle common unstructured formats, including:
  - PDFs (both text-based and scanned, with OCR for image-heavy files).
  - Microsoft Word documents (.docx).
  - Images/figures (e.g., PNG, JPG, SVG) containing text, charts, or diagrams.
  - Plain text files and markdown.

- LLM Agnosticism: Allow users to integrate their preferred large language 
models (LLMs) or vision-language models (VLMs), whether open-source (e.g., 
Llama 3, Mistral, Vicuna) or proprietary (e.g., GPT-4, Claude, PaLM). This 
flexibility ensures compatibility with existing user workflows, privacy 
requirements, and cost constraints.
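
To make the two capabilities above concrete, here is a minimal Python sketch of format detection plus a pluggable model backend. It is an illustration only, not a proposed Cloudberry API: `ModelBackend`, `register_backend`, `detect_media_type`, `parse_document`, and `EchoBackend` are all hypothetical names, and the stand-in backend returns a fixed JSON string instead of calling a real LLM or VLM.

```python
from pathlib import Path
from typing import Callable, Dict, Protocol


class ModelBackend(Protocol):
    """Hypothetical interface: any LLM/VLM that turns document bytes into JSON text."""

    def extract(self, content: bytes, media_type: str) -> str: ...


# Extension-to-media-type map for the formats listed in the proposal.
MEDIA_TYPES: Dict[str, str] = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".svg": "image/svg+xml",
    ".txt": "text/plain",
    ".md": "text/markdown",
}

# Registry of backend factories, keyed by a user-chosen name (assumption).
_BACKENDS: Dict[str, Callable[[], ModelBackend]] = {}


def register_backend(name: str, factory: Callable[[], ModelBackend]) -> None:
    """Make a model backend selectable by name."""
    _BACKENDS[name] = factory


def detect_media_type(filename: str) -> str:
    """Map a file name to a media type, rejecting unsupported formats."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: {suffix or filename!r}")
    return MEDIA_TYPES[suffix]


def parse_document(filename: str, content: bytes, backend: str) -> str:
    """Route a document to the chosen backend and return its JSON text."""
    if backend not in _BACKENDS:
        raise KeyError(f"unknown backend: {backend!r}")
    model = _BACKENDS[backend]()
    return model.extract(content, detect_media_type(filename))


class EchoBackend:
    """Stand-in for a real model; reports what it received as JSON text."""

    def extract(self, content: bytes, media_type: str) -> str:
        return '{"media_type": "%s", "bytes": %d}' % (media_type, len(content))


register_backend("echo", EchoBackend)
```

In this sketch, swapping models means registering a different factory under another name. An on-premises open-source model and a hosted proprietary one would both sit behind the same `extract` call.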

The parser will extract context-aware structured data (e.g., key-value pairs, 
tables, entities, relationships) from unstructured sources, normalize it into 
JSON (with auto-generated schemas or user-defined schema overrides), and 
persist the output in Cloudberry tables for querying, analytics, or integration 
with downstream agents, tools or apps.
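
The normalization step described above could look roughly like the following Python sketch. It assumes the model returns a single flat JSON object. The function names (`infer_schema`, `normalize`, `to_row`) and the field-to-type schema shape are hypothetical, chosen only to show how an auto-generated schema and a user-defined override might coexist.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Map Python types of parsed JSON values to JSON Schema style type names.
_JSON_TYPES = {
    bool: "boolean", int: "integer", float: "number",
    str: "string", list: "array", dict: "object", type(None): "null",
}


def infer_schema(record: Dict[str, Any]) -> Dict[str, str]:
    """Auto-generate a flat field -> type schema from one extracted record."""
    return {key: _JSON_TYPES[type(value)] for key, value in record.items()}


def normalize(raw_json: str, schema: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Parse model output and apply either the inferred or the user schema."""
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")
    inferred = infer_schema(record)
    if schema is None:
        return {"schema": inferred, "data": record}
    for field, expected in schema.items():
        if field not in inferred:
            raise ValueError(f"missing field: {field}")
        actual = inferred[field]
        # An integer is acceptable where the override asks for a number.
        if actual != expected and not (expected == "number" and actual == "integer"):
            raise ValueError(f"{field}: expected {expected}, got {actual}")
    # A user-defined schema also acts as a projection: keep only its fields.
    return {"schema": schema, "data": {field: record[field] for field in schema}}


def to_row(source: str, normalized: Dict[str, Any]) -> Tuple[str, str]:
    """Build (source, json_text) values for a parameterized INSERT into a
    hypothetical table with a text column and a JSON column."""
    return (source, json.dumps(normalized["data"], sort_keys=True))
```

For example, `normalize('{"invoice_id": "INV-1", "amount": 120.5}')` infers the schema from the record. Passing `{"amount": "string"}` as the override raises a `ValueError` because the extracted value is a number.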



### Use case/motivation

Unstructured data (documents, images, etc.) is a cornerstone of modern data
ecosystems, but its lack of structure makes it hard to integrate with AI
agents, analytics pipelines, and downstream applications. This feature
addresses that gap. Key use cases include:

- Enterprise Data Pipelines: Teams can parse invoices (PDF), contracts (Word), 
or HR documents to extract critical fields (dates, amounts, employee IDs) into 
JSON, storing them in Cloudberry tables for ERP integration or automated 
reporting. By supporting user-chosen LLMs, organizations with strict privacy 
policies (e.g., healthcare, finance) can use on-premises open-source models 
instead of proprietary cloud LLMs.
- Research & Academic Workflows: Scientists can convert scanned lab reports 
(PDF) or experimental figures (images) into structured data (e.g., results, 
methodologies, chart values) to feed into AI agents for literature reviews or 
cross-study analysis. Flexibility in LLMs lets researchers use specialized 
models trained on scientific text (e.g., BioLlama) for higher accuracy.
- Open-Source Ecosystem Integration: Apache Cloudberry’s community users (e.g., 
startups, nonprofits) often rely on cost-effective or custom-trained LLMs. This 
parser’s LLM-agnostic design lets them leverage their existing model 
investments (e.g., a fine-tuned Mistral model) to process unstructured data 
without vendor lock-in.
- AI Agent Enablement: By converting unstructured data into JSON and tables,
the parser lets Cloudberry-integrated AI agents access and act on diverse
data (e.g., technical manuals, customer feedback) without manual preprocessing,
enabling automation for support, content moderation, and more.

Industry tools such as Databricks’ `ai_parse_document` and Snowflake Cortex’s
`parse_document` validate this need. Apache Cloudberry’s open-source nature
and focus on user choice make LLM agnosticism a differentiator, so the feature
can serve a broader, more diverse user base.

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/1442

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

