seawinde opened a new pull request, #61004:
URL: https://github.com/apache/doris/pull/61004

   ### What problem does this PR solve?
   
   ## What problem does this PR solve?
   
   This PR introduces a **data lineage collection framework** for Apache Doris, 
enabling automatic extraction and reporting of table-level and column-level 
lineage from DML operations (`INSERT INTO SELECT`, `INSERT OVERWRITE`, `CREATE 
TABLE AS SELECT`).
   
   The framework is designed as an **SPI (Service Provider Interface)** to 
allow pluggable lineage consumers, aligned with the existing 
`fe-authentication` SPI architecture using `fe-extension-spi` and 
`fe-extension-loader`.
   
   ## Architecture Design
   
   ```
   ┌─────────────────────────────────────────────────────────────────────┐
   │                         SQL Execution                               │
   │  InsertIntoTableCommand / InsertOverwriteTableCommand / CTAS        │
   └──────────────────────────┬──────────────────────────────────────────┘
                              │ PlannerHook callback
                              ▼
                       ┌──────────────┐
                       │ LineageUtils │  Extracts lineage from Nereids plan
                       └──────┬───────┘
                              │ submitLineageEvent(LineageInfo)
                              ▼
                 ┌────────────────────────┐
                 │ LineageEventProcessor  │  Async event queue + worker thread
                 │                        │
                 │  ┌──────────────────┐  │
                 │  │  Dual Discovery  │  │
                 │  │  1. ServiceLoader│  │  Built-in (classpath)
                 │  │  2. DirPluginMgr │  │  External ($plugin_dir/lineage/)
                 │  └──────────────────┘  │
                 │                        │
                 │  ClassLoadingPolicy    │  Parent-first: 
org.apache.doris.nereids.lineage.*
                 │  PluginContext         │  Configuration injection 
(plugin.path, plugin.name)
                 └───────────┬────────────┘
                             │ plugin.exec(LineageInfo)
                             ▼
                 ┌────────────────────────┐
                 │   LineagePlugin (SPI)  │  Extends fe-extension-spi Plugin
                 │   LineagePluginFactory │  Extends fe-extension-spi 
PluginFactory
                 └────────────────────────┘
                             │
                 ┌───────────┴────────────┐
                 │  External Plugins      │  Discovered from 
$plugin_dir/lineage/
                 │  (e.g. DataWorks)      │  via DirectoryPluginRuntimeManager
                 └────────────────────────┘
   ```
   
   ## Key Components
   
   ### 1. Lineage Extraction (`LineageInfoExtractor`)
   - Traverses the Nereids logical/physical plan tree to extract source-target 
relationships
   - Supports **table-level lineage**: which source tables contribute to each 
target table
   - Supports **column-level lineage**: which source columns map to each target 
column, including through expressions (functions, casts, aggregations)
   - Handles complex query patterns: JOINs, subqueries, UNIONs, window 
functions, CTEs
   
   ### 2. SPI Interfaces (aligned with `fe-authentication`)
   
   | Interface | Base Class | Description |
   |-----------|-----------|-------------|
   | `LineagePlugin` | `extends Plugin` (`fe-extension-spi`) | Plugin contract: 
`eventFilter()` + `exec(LineageInfo)` |
   | `LineagePluginFactory` | `extends PluginFactory` (`fe-extension-spi`) | 
Factory contract: `name()` + `create()` |
   
   ### 3. Event Processing (`LineageEventProcessor`)
   - **Dual discovery** (aligned with `AuthenticationPluginManager`):
     - `ServiceLoader.load()` for built-in classpath plugins
     - `DirectoryPluginRuntimeManager.loadAll()` for external directory plugins
   - **ClassLoadingPolicy**: parent-first for 
`org.apache.doris.nereids.lineage.*` to ensure SPI interface identity
   - **Async processing**: bounded `BlockingQueue` (configurable via 
`lineage_event_queue_size`) + daemon worker thread
   - **Fault tolerance**: plugin exceptions are caught and logged, never crash 
the worker
   
   ### 4. Integration Points
   - `NereidsPlanner` → `PlannerHook` callback after plan optimization
   - `InsertIntoTableCommand` / `InsertOverwriteTableCommand` / 
`CreateTableCommand` → trigger lineage extraction on successful execution
   - `ConnectContext` → carries lineage metadata (query ID, user, database, 
timing)
   - `Env.java` → `lineageEventProcessor.start()` at FE startup
   
   ## Configuration
   
   | Config | Default | Description |
   |--------|---------|-------------|
   | `activate_lineage_plugin` | `{}` (empty) | Active plugin names, e.g. 
`dataworks` |
   | `lineage_event_queue_size` | `50000` | Max pending lineage events |
   | `plugin_dir` | `$DORIS_HOME/plugins` | External plugin root directory |
   
   ## How to Use
   
   1. Implement `LineagePlugin` + `LineagePluginFactory` in an external module
   2. Register via 
`META-INF/services/org.apache.doris.nereids.lineage.LineagePluginFactory`
   3. Deploy jar + `plugin.conf` to `$plugin_dir/lineage/{plugin_name}/`
   4. Set `activate_lineage_plugin = {plugin_name}` in `fe.conf`
   
   ## Testing
   
   - `LineageEventProcessorTest`: 11 tests covering plugin lifecycle, event 
dispatch, error handling, SPI factory contract
   - `LineageInfoExtractorTest`: 25+ tests covering INSERT INTO, CTAS, INSERT 
OVERWRITE, JOINs, subqueries, expressions, multi-catalog
   - `LineageUtilsSkipTest`: Tests for lineage skip conditions
   
   
   
   Issue Number: close #xxx
   
   Related PR: #xxx
   
   Problem Summary:
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [x] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to