GitHub user xtangcode created a discussion: [Ideas] AI-Powered Parser Function for Unstructured Data
### Description

We propose adding an AI-driven parser function to Apache Cloudberry that converts unstructured data from diverse formats into structured JSON and enables seamless storage in tables. This function will serve as a critical bridge between unstructured data sources and structured systems, with core capabilities such as:

- Broad Format Support: Natively handle common unstructured formats, including:
  - PDFs (both text-based and scanned, with OCR for image-heavy files).
  - Microsoft Word documents (.docx).
  - Images/figures (e.g., PNG, JPG, SVG) containing text, charts, or diagrams.
  - Plain text files and markdown.
- LLM Agnosticism: Allow users to integrate their preferred large language models (LLMs) or vision-language models (VLMs), whether open-source (e.g., Llama 3, Mistral, Vicuna) or proprietary (e.g., GPT-4, Claude, PaLM). This flexibility ensures compatibility with existing user workflows, privacy requirements, and cost constraints.

The parser will extract context-aware structured data (e.g., key-value pairs, tables, entities, relationships) from unstructured sources, normalize it into JSON (with auto-generated schemas or user-defined schema overrides), and persist the output in Cloudberry tables for querying, analytics, or integration with downstream agents, tools or apps.

### Use case/motivation

Unstructured data (documents, images, etc.) is a cornerstone of modern data ecosystems, but its lack of structure blocks seamless integration with AI agents, analytics pipelines, and downstream applications. This feature addresses this gap, with key use cases:

- Enterprise Data Pipelines: Teams can parse invoices (PDF), contracts (Word), or HR documents to extract critical fields (dates, amounts, employee IDs) into JSON, storing them in Cloudberry tables for ERP integration or automated reporting.
  By supporting user-chosen LLMs, organizations with strict privacy policies (e.g., healthcare, finance) can use on-premises open-source models instead of proprietary cloud LLMs.
- Research & Academic Workflows: Scientists can convert scanned lab reports (PDF) or experimental figures (images) into structured data (e.g., results, methodologies, chart values) to feed into AI agents for literature reviews or cross-study analysis. Flexibility in LLMs lets researchers use specialized models trained on scientific text (e.g., BioLlama) for higher accuracy.
- Open-Source Ecosystem Integration: Apache Cloudberry’s community users (e.g., startups, nonprofits) often rely on cost-effective or custom-trained LLMs. This parser’s LLM-agnostic design lets them leverage their existing model investments (e.g., a fine-tuned Mistral model) to process unstructured data without vendor lock-in.
- AI Agent Enablement: By structuring unstructured data into JSON/tables, the parser empowers Cloudberry-integrated AI agents to access and act on diverse data (e.g., technical manuals, customer feedback) without manual preprocessing, unlocking automation for support, content moderation, and more.

Industry tools like Databricks’ ai_parse_document and Snowflake Cortex’s parse_document validate this need, but Apache Cloudberry’s open-source nature and focus on user choice make LLM agnosticism a differentiator, ensuring the feature serves a broader, more diverse user base.

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/1442
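
To make the flow in the Description more concrete (a pluggable model, JSON output, and an optional user-defined schema override), here is a rough Python sketch. It is only an illustration under stated assumptions, not a proposed Cloudberry API: `parse_document`, `ModelBackend`, and `fake_backend` are hypothetical names, and the backend is a stand-in that returns canned JSON instead of calling a real LLM or VLM.

```python
import json
from typing import Callable, Optional

# Hypothetical backend signature: raw document bytes plus a prompt in,
# the model's text reply out. Any LLM/VLM client could be wrapped this way.
ModelBackend = Callable[[bytes, str], str]


def parse_document(
    content: bytes,
    backend: ModelBackend,
    schema: Optional[dict] = None,
) -> dict:
    """Ask the chosen backend for JSON and apply an optional schema override."""
    prompt = "Extract the document's fields as a single JSON object."
    if schema is not None:
        prompt += " Use exactly these keys: " + ", ".join(sorted(schema))
    parsed = json.loads(backend(content, prompt))
    if not isinstance(parsed, dict):
        raise ValueError("backend did not return a JSON object")
    if schema is not None:
        missing = [key for key in schema if key not in parsed]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        # Keep only the fields named in the override.
        parsed = {key: parsed[key] for key in schema}
    return parsed


def fake_backend(content: bytes, prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return json.dumps({"invoice_id": "INV-001", "amount": 125.5, "note": "n/a"})


result = parse_document(
    b"%PDF-1.7 ...",
    fake_backend,
    schema={"invoice_id": "string", "amount": "number"},
)
print(json.dumps(result))
```

Swapping `fake_backend` for a wrapper around an on-premises or hosted model is the only change needed to switch providers, which is the point of the LLM-agnostic design. The resulting JSON string is what would then be written to a table column; how that storage step looks inside Cloudberry is left open here.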
