[I] [FEATURE] Add a LLM-based TextExtractor (OpenAI API compatible) [stormcrawler]

via GitHub Thu, 12 Jun 2025 08:01:22 -0700


rzo1 opened a new issue, #1558:
URL: https://github.com/apache/stormcrawler/issues/1558


   ## [FEATURE] Add a LLM-based TextExtractor (OpenAI API compatible)
   
   ### Summary
   
   Introduce a new `TextExtractor` implementation for StormCrawler that 
leverages an OpenAI-compatible Large Language Model (LLM) API to perform 
intelligent content extraction from HTML pages.
   
   ### Motivation
   
   Traditional rule-based content extraction (e.g., using tag 
inclusion/exclusion patterns) can be brittle and context-insensitive. By 
incorporating a modern LLM (such as LLaMA 3 or GPT-4), we can significantly 
improve the quality of extracted text by understanding page semantics, layout 
context, and user-specific extraction intents.
   
   ### Proposed Solution
   
   Make the text extractor pluggable first: define a dedicated interface and 
instantiate the configured implementation via reflection. The LLM-based 
extractor can then be added as one implementation of that interface, selected 
through configuration options.
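
A minimal sketch of the pluggable-extractor idea, not the actual StormCrawler API. The interface name `TextExtractorSketch`, the stand-in implementation `TagStrippingExtractor`, and the configuration key `textextractor.class.name` are all hypothetical and only illustrate how a configured class name could be resolved via reflection.

```java
import java.util.Map;

// Hypothetical interface; the name and method signature are assumptions.
interface TextExtractorSketch {
    String text(String html);
}

// Trivial stand-in for a rule-based extractor: strips tags, collapses whitespace.
class TagStrippingExtractor implements TextExtractorSketch {
    public TagStrippingExtractor() {}

    @Override
    public String text(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}

class ExtractorFactory {
    // Hypothetical configuration key, chosen for this example only.
    static final String CLASS_KEY = "textextractor.class.name";

    // Loads the configured class by name and calls its no-arg constructor.
    // Falls back to the stand-in implementation when the key is absent.
    static TextExtractorSketch fromConfig(Map<String, Object> conf) {
        Object name = conf.getOrDefault(CLASS_KEY, TagStrippingExtractor.class.getName());
        try {
            Class<?> clazz = Class.forName(name.toString());
            return clazz.asSubclass(TextExtractorSketch.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate text extractor: " + name, e);
        }
    }
}
```

With this shape, an LLM-based extractor would be one more class implementing the interface, and switching to it would only require changing the configured class name.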
   
   The LLM-based extractor should provide:
   
   - Customizable system and user prompts
   - Support for an optional "user request" field to guide extraction (e.g., 
“extract only article body”)
   - An optional listener hook for purposes such as logging, monitoring, or 
billing
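
The three points above could be wired together roughly as follows. This is a sketch under stated assumptions, not the proposed implementation: the class `LlmRequestSketch`, its method names, and the use of a plain `Consumer<String>` as the listener hook are hypothetical. Only the request shape (a `POST` to `/chat/completions` with `model` and `messages`, plus a bearer token) follows the OpenAI chat completions convention. The JSON is built by hand to keep the example dependency-free, and the request is constructed but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.function.Consumer;

class LlmRequestSketch {

    // Minimal JSON string escaping for the hand-built payload below.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Combines the configurable user prompt, the optional "user request",
    // and the page HTML into one user message.
    static String userMessage(String userPrompt, String userRequest, String html) {
        StringBuilder sb = new StringBuilder(userPrompt);
        if (userRequest != null && !userRequest.isBlank()) {
            sb.append("\nRequest: ").append(userRequest);
        }
        return sb.append("\nHTML:\n").append(html).toString();
    }

    // Builds a chat-completions payload with one system and one user message.
    static String body(String model, String systemPrompt, String userMessage) {
        return "{\"model\":" + quote(model)
                + ",\"messages\":[{\"role\":\"system\",\"content\":" + quote(systemPrompt)
                + "},{\"role\":\"user\",\"content\":" + quote(userMessage) + "}]}";
    }

    // Notifies the optional listener (logging, monitoring, billing),
    // then builds the HTTP request without sending it.
    static HttpRequest request(String baseUrl, String apiKey, String body,
                               Consumer<String> listener) {
        if (listener != null) {
            listener.accept(body);
        }
        return HttpRequest.newBuilder(URI.create(baseUrl + "/chat/completions"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

A real listener would more likely receive structured data, such as token usage from the response, rather than the raw request body. The single-string callback is used here only to keep the sketch short.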
   
   
   


