sjyango opened a new pull request, #59543:
URL: https://github.com/apache/doris/pull/59543

   ## Description
   
   This PR introduces native support for **Python User-Defined Functions (UDF), 
User-Defined Aggregate Functions (UDAF), and User-Defined Table Functions 
(UDTF)** in Doris, enabling users to extend SQL capabilities with custom Python 
logic for complex data processing scenarios.
   
   ## Key Features
   
   ### 🚀 **Three Function Types**
   - **UDF**: Scalar functions with row-by-row or vectorized execution (roughly 
10-100x faster than row-by-row when using Pandas/Arrow mode)
   - **UDAF**: Snowflake-style stateful aggregation with distributed merge 
support
   - **UDTF**: Table-valued functions that generate multiple output rows from a 
single input row
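
To make the UDTF behaviour concrete, here is a minimal sketch of a table function that expands one input row into several output rows. The names `split_words` and `apply_udtf` are hypothetical and do not come from this PR; the batch driver simply keeps one list of output rows per input row, which is one plausible reading of the "ListArray batch processing" mentioned in the architecture diagram.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[str, int]


def split_words(text: str) -> Iterator[Row]:
    """Hypothetical UDTF body: emit one (word, position) row per word."""
    for position, word in enumerate(text.split()):
        yield (word, position)


def apply_udtf(fn: Callable[[str], Iterable[Row]],
               inputs: List[str]) -> List[List[Row]]:
    """Run the UDTF over a batch, keeping one list of output rows per input row."""
    return [list(fn(value)) for value in inputs]


# An empty input string produces an empty list of rows, not a missing entry.
batch_result = apply_udtf(split_words, ["a b", "", "doris python udtf"])
```

Keeping an (possibly empty) entry for every input row lets the caller line the output up with the input batch again.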
   
   ### 🔧 **Production-Grade Architecture**
   - **High-Performance Communication**: Arrow Flight RPC over Unix sockets 
with zero-copy columnar data transfer
   - **Multi-Version Support**: Flexible environment management via Conda or 
venv, allowing different UDFs to use different Python versions (e.g., 3.9, 
3.10, 3.12)
   
   ### 🎯 **Deep Integration**
   - Seamless integration with Doris vectorized execution engine
   - Native support for Doris data types (including complex types like ARRAY, 
MAP, STRUCT)
   - Automatic conversion between Doris types and Python/Arrow types
   
   ## Architecture Highlights
   
   ```
   ┌──────────────────────────────────────────────────────┐
   │  Doris BE (C++)                                      │
   │  ┌────────────────────────────────────────────────┐  │
   │  │  PythonServerManager (Process Pool)            │  │
   │  │  ├─ Health Check Thread (60s interval)         │  │
   │  │  ├─ Load Balancing (min ref count)             │  │
   │  │  └─ Auto Recovery (dead process detection)     │  │
   │  └────────────────────────────────────────────────┘  │
   │  ┌────────────────────────────────────────────────┐  │
   │  │  PythonClient (Arrow Flight RPC)               │  │
   │  │  ├─ UDF: Scalar/Vectorized evaluation          │  │
   │  │  ├─ UDAF: Stateful aggregation with merge      │  │
   │  │  └─ UDTF: ListArray batch processing           │  │
   │  └────────────────────────────────────────────────┘  │
   └──────────────────────────────────────────────────────┘
                         ↕ Unix Socket
   ┌──────────────────────────────────────────────────────┐
   │  Python Process (python_server.py)                   │
   │  ┌────────────────────────────────────────────────┐  │
   │  │  FlightServer (Arrow Flight bidirectional)     │  │
   │  │  ├─ AdaptivePythonUDF (auto mode selection)    │  │
   │  │  ├─ UDAFStateManager (Snowflake interface)     │  │
   │  │  └─ UDFLoader (inline/module code execution)   │  │
   │  └────────────────────────────────────────────────┘  │
   └──────────────────────────────────────────────────────┘
   ```
   
   ## Configuration
   
   Add to `be.conf`:
   
   ```properties
   # Enable Python UDF support
   enable_python_udf_support = true
   
   # Choose one environment management mode: conda or venv
   python_env_mode = conda
   
   # For Conda mode
   python_conda_root_path = /path/to/miniconda3
   
   # For venv mode (use these instead of the two Conda settings above)
   # python_env_mode = venv
   # python_venv_root_path = /doris/python_envs
   # python_venv_interpreter_paths = /opt/python3.9/bin/python3.9:/opt/python3.12/bin/python3.12
   
   # Process pool size (0 = use CPU core count)
   max_python_process_num = 0
   ```
   
   ## Technical Highlights
   
   ### 1. Environment Management
   - **Multi-version support**: Each UDF can specify its own Python version
   - **Two modes**: Conda (full environment isolation) or venv (lightweight)
   - **Automatic discovery**: Scans available Python environments at BE startup
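
As an illustration of the automatic discovery step, the sketch below parses a colon-separated interpreter list (the same shape as `python_venv_interpreter_paths`) and asks each interpreter for its version. The function name `discover_interpreters` and the "first interpreter wins" rule are assumptions made for this example; the BE performs its own discovery in C++ and may work differently.

```python
import os
import subprocess
import sys
from typing import Dict


def discover_interpreters(paths: str) -> Dict[str, str]:
    """Map 'major.minor' version strings to interpreter paths.

    `paths` is a colon-separated list of interpreter paths.
    Entries that do not exist or fail to run are skipped.
    """
    found: Dict[str, str] = {}
    for path in filter(None, paths.split(":")):
        if not os.path.isfile(path):
            continue
        try:
            version = subprocess.run(
                [path, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
                capture_output=True, text=True, timeout=10, check=True,
            ).stdout.strip()
        except (OSError, subprocess.SubprocessError):
            continue
        # Assumption: the first interpreter listed wins for a given version.
        found.setdefault(version, path)
    return found


# The second path does not exist and is silently skipped.
envs = discover_interpreters(f"{sys.executable}:/nonexistent/python3.99")
```

Asking the interpreter itself for its version avoids trusting the file name, which can be a symlink or a generic `python3`.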
   
   ### 2. Process Pool Management
   - **Shared pool**: One pool per Python version, shared across all threads
   - **Load balancing**: Distributes requests to processes with minimum load
   - **Health monitoring**: Background thread checks process health every 60 
seconds
   - **Auto recovery**: Automatically recreates dead processes
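
The load-balancing and recovery rules above can be sketched in a few lines. This is a simplified, single-threaded model, not the `PythonServerManager` code: `Worker` and `WorkerPool` are hypothetical names, a boolean flag stands in for real process liveness, and locking is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Worker:
    """Stand-in for one Python server process (hypothetical)."""
    worker_id: int
    ref_count: int = 0   # number of clients currently using this process
    alive: bool = True


class WorkerPool:
    def __init__(self, size: int, spawn: Callable[[int], Worker]):
        self._spawn = spawn
        self.workers: List[Worker] = [spawn(i) for i in range(size)]

    def acquire(self) -> Worker:
        """Pick the live worker with the fewest active references."""
        live = [w for w in self.workers if w.alive]
        if not live:
            raise RuntimeError("no live Python workers")
        worker = min(live, key=lambda w: w.ref_count)
        worker.ref_count += 1
        return worker

    def release(self, worker: Worker) -> None:
        worker.ref_count = max(0, worker.ref_count - 1)

    def health_check(self) -> int:
        """Replace dead workers in place; return how many were recreated."""
        replaced = 0
        for i, w in enumerate(self.workers):
            if not w.alive:
                self.workers[i] = self._spawn(w.worker_id)
                replaced += 1
        return replaced


pool = WorkerPool(3, spawn=lambda i: Worker(worker_id=i))
first, second = pool.acquire(), pool.acquire()
pool.workers[2].alive = False          # simulate a crashed process
recreated = pool.health_check()        # would run every 60 s in a background thread
third = pool.acquire()                 # the fresh worker has the lowest ref count
```

In a real pool, `acquire`, `release`, and `health_check` would need to hold a lock, since the pool is shared across threads.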
   
   ### 3. Communication Protocol
   - **Arrow Flight RPC**: High-performance, language-agnostic RPC framework
   - **Unix Socket**: Local IPC for minimal latency and enhanced security
   - **Bidirectional streaming**: Efficient batch data transfer
   
   ### 4. Execution Modes
   - **Scalar mode**: Process one value at a time (simple functions)
   - **Vectorized mode**: Process entire columns with NumPy/Pandas (10-100x 
faster than scalar mode)
   - **Adaptive selection**: Automatically chooses mode based on function 
signature
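
One way to picture signature-based mode selection is shown below. This is only a sketch under stated assumptions: the rule "every parameter annotated as a list means vectorized" is invented for the example, plain Python lists stand in for Pandas/NumPy columns so the code needs no third-party packages, and `is_vectorized` and `evaluate` are hypothetical names rather than the `AdaptivePythonUDF` implementation.

```python
import inspect
from typing import Callable, List, Sequence, get_origin, get_type_hints


def is_vectorized(fn: Callable) -> bool:
    """Treat a function as vectorized if every parameter is annotated as a list."""
    hints = get_type_hints(fn)
    params = inspect.signature(fn).parameters
    if not params:
        return False
    return all(
        hints.get(name) is list or get_origin(hints.get(name)) is list
        for name in params
    )


def evaluate(fn: Callable, *columns: Sequence) -> list:
    """Call fn once per batch (vectorized) or once per row (scalar)."""
    if is_vectorized(fn):
        return list(fn(*[list(c) for c in columns]))
    return [fn(*row) for row in zip(*columns)]


def add_scalar(a: int, b: int) -> int:
    return a + b


def add_vectorized(a: List[int], b: List[int]) -> List[int]:
    return [x + y for x, y in zip(a, b)]


# Both functions give the same result; only the number of Python calls differs.
scalar_out = evaluate(add_scalar, [1, 2, 3], [10, 20, 30])
vector_out = evaluate(add_vectorized, [1, 2, 3], [10, 20, 30])
```

The scalar path makes one Python call per row, while the vectorized path makes one call per batch, which is where most of the speedup in vectorized mode comes from.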
   
   ### 5. UDAF State Management (Snowflake Style)
   - **5 lifecycle methods**: `__init__`, `accumulate`, `merge`, `finish`, 
`aggregate_state`
   - **Distributed aggregation**: Serialization/deserialization for shuffle 
operations
   - **Efficient state handling**: Place-based mapping avoids redundant 
transfers
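
The five lifecycle methods can be illustrated with a small average aggregate. The class `PythonAvg` is a hypothetical example, not code from this PR. Exposing `aggregate_state` as a property follows the Snowflake Python UDAF convention that the PR says it mirrors, and `pickle` is used here only as an assumed stand-in for whatever serialization the shuffle path actually uses.

```python
import pickle


class PythonAvg:
    """Hypothetical average UDAF using the five methods listed above."""

    def __init__(self):
        self._sum = 0.0
        self._count = 0

    @property
    def aggregate_state(self):
        # Intermediate state that can be shipped to another node.
        return (self._sum, self._count)

    def accumulate(self, value):
        if value is not None:  # ignore SQL NULLs
            self._sum += value
            self._count += 1

    def merge(self, other_state):
        other_sum, other_count = other_state
        self._sum += other_sum
        self._count += other_count

    def finish(self):
        return self._sum / self._count if self._count else None


# Two partial aggregations, as if computed on different nodes.
left, right = PythonAvg(), PythonAvg()
for v in [1, 2, None]:
    left.accumulate(v)
for v in [3, 6]:
    right.accumulate(v)

# Serialize the right-hand state, "shuffle" it, and merge it into the left.
wire_bytes = pickle.dumps(right.aggregate_state)
left.merge(pickle.loads(wire_bytes))
result = left.finish()
```

Carrying `(sum, count)` as the state, rather than a running average, is what makes `merge` exact when partial results are combined.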
   
   ## Limitations
   
   1. **Performance**: Python UDFs are slower than native C++ built-in 
functions. Best suited for complex logic that's difficult to implement in SQL.
   2. **Type support**: Special Doris types like HLL and Bitmap are not yet 
supported.
   3. **Concurrency**: Parallelism is limited by the `max_python_process_num` 
setting.
   
   ## Related Documentation
   
   - Python UDF User Guide
   - Python UDAF User Guide
   - Python UDTF User Guide
   - Python Environment Configuration Guide
   
   ---
   
   **This PR enables users to leverage the rich Python ecosystem (NumPy, 
Pandas, scikit-learn, etc.) directly within Doris SQL queries, significantly 
expanding the platform's data processing capabilities.**
   
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merges this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

