rzo1 opened a new issue, #1558: URL: https://github.com/apache/stormcrawler/issues/1558
## [FEATURE] Add a LLM-based TextExtractor (OpenAI API compatible)

### Summary

Introduce a new `TextExtractor` implementation for StormCrawler that uses an OpenAI-compatible Large Language Model (LLM) API to extract content from HTML pages.

### Motivation

Traditional rule-based content extraction (e.g., using tag inclusion/exclusion patterns) can be brittle and context-insensitive. A modern LLM (such as LLaMA 3 or GPT-4) could improve the quality of extracted text by taking page semantics, layout context, and user-specific extraction intents into account.

### Proposed Solution

First, make the text extractor switchable: define a dedicated interface and instantiate the implementation named in the configuration via reflection. Then add the LLM-based extractor on top of that interface. It should provide:

- Customizable system and user prompts
- Support for an optional "user request" field to guide extraction (e.g., “extract only article body”)
- An optional listener hook for purposes such as logging, monitoring, or billing
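To make the proposal more concrete, here is a minimal sketch of the pieces listed above: a pluggable extractor interface, reflection-based instantiation from configuration, configurable prompts with an optional user request, and a listener hook. Every name in it is an assumption made for illustration (the `TextExtractor` method signatures, `ExtractionListener`, `TextExtractors.fromConfig`, and the config keys `textextractor.class`, `llm.system.prompt`, `llm.user.request`); none of it is StormCrawler's existing API. The model call is a placeholder and no HTTP request is made.

```java
import java.util.Map;

// Hypothetical interface; the actual StormCrawler interface may look different.
interface TextExtractor {
    void configure(Map<String, Object> conf);
    String extract(String html);
}

// Hypothetical optional hook, e.g. for logging, monitoring, or billing.
interface ExtractionListener {
    void onExtraction(String prompt, String response);
}

class LlmTextExtractor implements TextExtractor {
    private String systemPrompt = "Extract the main textual content from the given HTML.";
    private String userRequest = "";
    private ExtractionListener listener; // may stay null

    @Override
    public void configure(Map<String, Object> conf) {
        // Config keys are assumptions for this sketch.
        systemPrompt = String.valueOf(conf.getOrDefault("llm.system.prompt", systemPrompt));
        userRequest = String.valueOf(conf.getOrDefault("llm.user.request", userRequest));
    }

    void setListener(ExtractionListener listener) {
        this.listener = listener;
    }

    // Builds the user prompt; the optional user request is prepended when set.
    String buildUserPrompt(String html) {
        String request = userRequest.isEmpty() ? "" : "Request: " + userRequest + "\n";
        return request + "HTML:\n" + html;
    }

    @Override
    public String extract(String html) {
        String prompt = buildUserPrompt(html);
        String response = callModel(systemPrompt, prompt);
        if (listener != null) {
            listener.onExtraction(prompt, response);
        }
        return response;
    }

    // Placeholder: a real implementation would send both prompts to an
    // OpenAI-compatible chat completions endpoint and return the reply.
    protected String callModel(String system, String user) {
        return "";
    }
}

final class TextExtractors {
    private TextExtractors() {}

    // Instantiates the configured extractor class via reflection.
    // Requires a no-argument constructor on the implementation.
    static TextExtractor fromConfig(Map<String, Object> conf) {
        String className = String.valueOf(
                conf.getOrDefault("textextractor.class", LlmTextExtractor.class.getName()));
        try {
            TextExtractor extractor = Class.forName(className)
                    .asSubclass(TextExtractor.class)
                    .getDeclaredConstructor()
                    .newInstance();
            extractor.configure(conf);
            return extractor;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```

With this shape, switching extractors would be a configuration change only: setting the (assumed) `textextractor.class` key to another class name makes `TextExtractors.fromConfig(conf)` return that implementation, and callers depend only on the interface.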
