This is an automated email from the ASF dual-hosted git repository.

kassiez pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new b15ca037128 [doc](udf) refactor the java-udf function doc (#1383) b15ca037128 is described below commit b15ca03712855fd099a5345c1bb4840cc02d3403 Author: zhangstar333 <zhangs...@selectdb.com> AuthorDate: Wed Nov 27 15:49:22 2024 +0800 [doc](udf) refactor the java-udf function doc (#1383) # Versions - [x] dev - [ ] 3.0 - [ ] 2.1 - [ ] 2.0 # Languages - [x] Chinese - [x] English --- docs/query-data/udf/java-user-defined-function.md | 635 ++++++++++++--------- .../query-data/udf/java-user-defined-function.md | 604 +++++++++++--------- .../Create/CREATE-FUNCTION.md | 3 +- 3 files changed, 718 insertions(+), 524 deletions(-) diff --git a/docs/query-data/udf/java-user-defined-function.md b/docs/query-data/udf/java-user-defined-function.md index 9a6ce7d54f9..1887debd08e 100644 --- a/docs/query-data/udf/java-user-defined-function.md +++ b/docs/query-data/udf/java-user-defined-function.md @@ -1,6 +1,6 @@ --- { -"title": "Java UDF", +"title": "Java UDF、UDAF、UDTF", "language": "en" } --- @@ -25,75 +25,15 @@ under the License. --> -## Introduction to Java UDF +## Overview +Java UDF provides a Java interface for users to implement user-defined functions (UDFs) conveniently using the Java programming language. +Doris supports the use of Java to develop UDFs, UDAFs, and UDTFs. Unless otherwise specified, "UDF" in the following text refers to all types of user-defined functions. -Java UDF provides users with a Java interface for writing UDFs, making it convenient for users to execute custom functions using the Java language. +1. Java UDF: A Java UDF is a commonly used scalar function, where each input row produces a corresponding output row. Common examples include ABS and LENGTH. Notably, Hive UDFs can be directly migrated to Doris, which is convenient for users. -Doris supports writing UDFs, UDAFs, and UDTFs using JAVA. Unless otherwise specified, UDF is used as a general term for all user-defined functions in the following text. +2. Java UDAF: A Java UDAF is a user-defined aggregate function that aggregates multiple input rows into a single output row. Common examples include MIN, MAX, and COUNT. -## Creating UDF - -The implemented jar package can be placed locally or stored on a remote server for download via HTTP, but each FE and BE node must be able to access the jar package. - -Otherwise, an error message `Couldn't open file ......` will be returned. - -For more syntax help, refer to [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md). - -### UDF - -```sql -CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES ( - "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar", - "symbol"="org.apache.doris.udf.AddOne", - "always_nullable"="true", - "type"="JAVA_UDF" -); -``` - -### UDAF - -```sql -CREATE AGGREGATE FUNCTION middle_quantiles(DOUBLE,INT) RETURNS DOUBLE PROPERTIES ( - "file"="file:///pathTo/java-udaf.jar", - "symbol"="org.apache.doris.udf.demo.MiddleNumberUDAF", - "always_nullable"="true", - "type"="JAVA_UDAF" -); -``` - -### UDTF - -:::tip -UDTF is supported starting from Doris version 3.0. -::: - -Similar to UDFs, UDTFs require users to implement an `evaluate` method. However, the return value of a UDTF must be of the Array type. - -Additionally, table functions in Doris may exhibit different behaviors due to the `_outer` suffix. For more details, refer to [OUTER combinator](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md). 
- -```java -public class UDTFStringTest { - public ArrayList<String> evaluate(String value, String separator) { - if (value == null || separator == null) { - return null; - } else { - return new ArrayList<>(Arrays.asList(value.split(separator))); - } - } -} -``` - -## Using UDF - -To utilize UDFs, users must possess the `SELECT` privilege for the corresponding database. - -The usage of UDFs is identical to standard functions, with the primary distinction being that built-in functions have a global scope, while UDFs are scoped within the DB. - -When the session is linked within the database, directly using the UDF name will search for the corresponding UDF within the current DB. Otherwise, users must explicitly specify the UDF's database name, for example, `dbName.funcName`. - -## Dropping UDF - -If a UDF is no longer needed, it can be dropped using the following command, as detailed in [DROP FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION). +3. Java UDTF: A Java UDTF is a user-defined table function, where a single input row can generate one or multiple output rows. In Doris, UDTFs must be used with Lateral View to achieve row-to-column transformations. Common examples include EXPLODE and EXPLODE_SPLIT. ## Type Correspondence @@ -123,284 +63,397 @@ If a UDF is no longer needed, it can be dropped using the following command, as When creating functions, avoid using `varchar` in place of `string`, as this may cause the function to fail. ::: -## Writing UDF +## Usage Notes + +1. Complex data types (HLL, Bitmap) are not supported. + +2. Users are currently allowed to specify the maximum JVM heap size. The configuration item is the `-Xmx` part of `JAVA_OPTS` in `be.conf`. The default is 1024m. If you need to aggregate data, it is recommended to increase this value to enhance performance and reduce the risk of memory overflow. + +3. Due to issues with JVM loading classes with the same name, do not use multiple classes with the same name as UDF implementations simultaneously. If you want to update a UDF with a class of the same name, you need to restart BE to reload the classpath. + + +## Getting Started This section mainly introduces how to develop a Java UDF. Examples are provided in `samples/doris-demo/java-udf-demo/` for reference. Click [here](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo) to view details. -When writing a UDF in Java, the main entry point must be the `evaluate` function. This is consistent with other engines like Hive. In this example, we write an `AddOne` UDF to perform an increment operation on integer inputs. +The usage of UDFs is identical to standard functions, with the primary distinction being that built-in functions have a global scope, while UDFs are scoped within the DB. -It is worth mentioning that this example not only supports Java UDFs in Doris but is also a UDF supported by Hive. This means that Hive UDFs can be directly migrated to Doris. +When the session is linked within the database, directly using the UDF name will search for the corresponding UDF within the current DB. Otherwise, users must explicitly specify the UDF's database name, for example, `dbName.funcName`. -Additionally, if the UDF being defined needs to load large resource files or if you want to define global static variables, you can refer to the static variable loading method described at the bottom of the document. +In the following sections, examples will use the table `test_table`. 
The corresponding table creation script is as follows: -### UDF +```sql +CREATE TABLE `test_table` ( + id int NULL, + d1 double NULL, + str string NULL +) ENGINE=OLAP +DUPLICATE KEY(`id`) +DISTRIBUTED BY HASH(`id`) BUCKETS 1 +PROPERTIES ( +"replication_num" = "1"); + +insert into test_table values (1, 111.11, "a,b,c"); +insert into test_table values (6, 666.66, "d,e"); +``` -```java -public class AddOne extends UDF { - public Integer evaluate(Integer value) { - return value == null ? null : value + 1; + +### Introduction to Java-UDF Example +When writing a UDF in Java, the main entry point must be the `evaluate` function. This is consistent with other engines like Hive. In this example, we write an `AddOne` UDF to perform an increment operation on integer inputs. + +1. Write the corresponding Java code and package it into a JAR file. + + ```java + public class AddOne extends UDF { + public Integer evaluate(Integer value) { + return value == null ? null : value + 1; + } } -} -``` + ``` -### UDAF +2. Register and create the Java-UDF function in Doris. For more details on the syntax, refer to [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md). -When writing a UDAF using Java code, there are some required functions (marked as required) and an inner class State that must be implemented. Below is a specific example to illustrate. + ```sql + CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES ( + "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar", + "symbol"="org.apache.doris.udf.AddOne", + "always_nullable"="true", + "type"="JAVA_UDF" + ); + ``` -**Example 1** +3. To utilize UDFs, users must possess the `SELECT` privilege for the corresponding database. And to verify the successful registration of the UDF, you can use the [SHOW FUNCTIONS](../../sql-manual/sql-statements/Show-Statements/SHOW-FUNCTIONS.md) command. -The following SimpleDemo will implement a simple function similar to sum, with the input parameter being INT and the output parameter being INT. + ``` sql + select id,java_udf_add_one(id) from test_table; + +------+----------------------+ + | id | java_udf_add_one(id) | + +------+----------------------+ + | 1 | 2 | + | 6 | 7 | + +------+----------------------+ + ``` -```java -package org.apache.doris.udf.demo; +4. If a UDF is no longer needed, it can be dropped using the following command, as detailed in [DROP FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION). -import java.io.DataInputStream; -import java.io.DataOutputStream; -import java.io.IOException; -import java.util.logging.Logger; +Additionally, if your UDF requires loading large resource files or defining global static variables, you can refer to the method for loading static variables described later in this document. -public class SimpleDemo { +### Introduction to Java-UDAF Example - Logger log = Logger.getLogger("SimpleDemo"); +When writing a `UDAF` using Java, there are some functions that must be implemented (marked as required) along with an internal class State. The following example will illustrate how to implement them. - //Need an inner class to store data - /*required*/ - public static class State { - /*some variables if you need */ - public int sum = 0; - } +1. Write the corresponding Java UDAF code and package it into a JAR file. 
- /*required*/ - public State create() { - /* here could do some init work if needed */ - return new State(); - } + <details><summary> Example 1: SimpleDemo will implement a simple function similar to sum, where the input parameter is INT and the output parameter is INT.</summary> - /*required*/ - public void destroy(State state) { - /* here could do some destroy work if needed */ - } + ```java + package org.apache.doris.udf; - /*Not Required*/ - public void reset(State state) { - /*if you want this udaf function can work with window function.*/ - /*Must impl this, it will be reset to init state after calculate every window frame*/ - state.sum = 0; - } + import java.io.DataInputStream; + import java.io.DataOutputStream; + import java.io.IOException; + import java.util.logging.Logger; + + public class SimpleDemo { + + Logger log = Logger.getLogger("SimpleDemo"); - /*required*/ - //first argument is State, then other types your input - public void add(State state, Integer val) throws Exception { - /* here doing update work when input data*/ - if (val != null) { - state.sum += val; + //Need an inner class to store data + /*required*/ + public static class State { + /*some variables if you need */ + public int sum = 0; + } + + /*required*/ + public State create() { + /* here could do some init work if needed */ + return new State(); + } + + /*required*/ + public void destroy(State state) { + /* here could do some destroy work if needed */ } - } - /*required*/ - public void serialize(State state, DataOutputStream out) { - /* serialize some data into buffer */ - try { + /*Not Required*/ + public void reset(State state) { + /*if you want this udaf function can work with window function.*/ + /*Must impl this, it will be reset to init state after calculate every window frame*/ + state.sum = 0; + } + + /*required*/ + //first argument is State, then other types your input + public void add(State state, Integer val) throws Exception { + /* here doing update work when input data*/ + if (val != null) { + state.sum += val; + } + } + + /*required*/ + public void serialize(State state, DataOutputStream out) throws Exception { + /* serialize some data into buffer */ out.writeInt(state.sum); - } catch (Exception e) { - /* Do not throw exceptions */ - log.info(e.getMessage()); } - } - /*required*/ - public void deserialize(State state, DataInputStream in) { - /* deserialize get data from buffer before you put */ - int val = 0; - try { - val = in.readInt(); - } catch (Exception e) { - /* Do not throw exceptions */ - log.info(e.getMessage()); + /*required*/ + public void deserialize(State state, DataInputStream in) throws Exception { + /* deserialize get data from buffer before you put */ + int val = in.readInt(); + state.sum = val; } - state.sum = val; - } - /*required*/ - public void merge(State state, State rhs) throws Exception { - /* merge data from state */ - state.sum += rhs.sum; - } + /*required*/ + public void merge(State state, State rhs) throws Exception { + /* merge data from state */ + state.sum += rhs.sum; + } - /*required*/ - //return Type you defined - public Integer getValue(State state) throws Exception { - /* return finally result */ - return state.sum; + /*required*/ + //return Type you defined + public Integer getValue(State state) throws Exception { + /* return finally result */ + return state.sum; + } } -} -``` + ``` + </details> -**Example 2** -```java -package org.apache.doris.udf.demo; - -import java.io.DataInputStream; -import java.io.DataOutputStream; -import java.math.BigDecimal; 
-import java.util.Arrays; -import java.util.logging.Logger; - -/* UDAF to calculate the median */ -public class MedianUDAF { - Logger log = Logger.getLogger("MedianUDAF"); - - // State storage - public static class State { - // Precision of the return result - int scale = 0; - // Whether it is the first time to execute the add method for a certain aggregation condition under a certain tablet - boolean isFirst = true; - // Data storage - public StringBuilder stringBuilder; - } - - // Initialize the state - public State create() { - State state = new State(); - // Pre-initialize based on the amount of data that needs to be aggregated under each aggregation condition of each tablet to increase performance - state.stringBuilder = new StringBuilder(1000); - return state; - } - - // Process each data under respective aggregation conditions for each tablet - public void add(State state, Double val, int scale) { - try { + <details><summary> Example 2: MedianUDAF is a function that calculates the median. The input types are (DOUBLE, INT), and the output type is DOUBLE. </summary> + + ```java + package org.apache.doris.udf.demo; + + import java.io.DataInputStream; + import java.io.DataOutputStream; + import java.io.IOException; + import java.math.BigDecimal; + import java.util.Arrays; + import java.util.logging.Logger; + + /* UDAF to calculate the median */ + public class MedianUDAF { + Logger log = Logger.getLogger("MedianUDAF"); + + // State storage + public static class State { + // Precision of the return result + int scale = 0; + // Whether it is the first time to execute the add method for a certain aggregation condition under a certain tablet + boolean isFirst = true; + // Data storage + public StringBuilder stringBuilder; + } + + // Initialize the state + public State create() { + State state = new State(); + // Pre-initialize based on the amount of data that needs to be aggregated under each aggregation condition of each tablet to increase performance + state.stringBuilder = new StringBuilder(1000); + return state; + } + + // Process each data under respective aggregation conditions for each tablet + public void add(State state, Double val, int scale) { if (val != null && state.isFirst) { state.stringBuilder.append(scale).append(",").append(val).append(","); state.isFirst = false; } else if (val != null) { state.stringBuilder.append(val).append(","); } - } catch (Exception e) { - // If it cannot be guaranteed that there will be no exceptions, it is recommended to maximize exception capture in each method, as handling of exceptions thrown by Java is currently not supported - log.info("Data acquisition exception: " + e.getMessage()); } - } - - // Data needs to be output for aggregation after processing - public void serialize(State state, DataOutputStream out) { - try { + + // Data needs to be output for aggregation after processing + public void serialize(State state, DataOutputStream out) throws IOException { // Currently, only DataOutputStream is provided. 
If serialization of objects is required, methods such as concatenating strings, converting to JSON, or serializing into byte arrays can be considered // If the State object needs to be serialized, it may be necessary to implement a serialization interface for the State inner class // Ultimately, everything needs to be transmitted via DataOutputStream out.writeUTF(state.stringBuilder.toString()); - } catch (Exception e) { - log.info("Serialization exception: " + e.getMessage()); } - } - - // Obtain the output data from the data processing execution unit - public void deserialize(State state, DataInputStream in) { - try { + + // Obtain the output data from the data processing execution unit + public void deserialize(State state, DataInputStream in) throws IOException { String string = in.readUTF(); state.scale = Integer.parseInt(String.valueOf(string.charAt(0))); StringBuilder stringBuilder = new StringBuilder(string.substring(2)); - state.stringBuilder = stringBuilder; - } catch (Exception e) { - log.info("Deserialization exception: " + e.getMessage()); + state.stringBuilder = stringBuilder; } - } - - // The aggregation execution unit merges the processing results of data under certain aggregation conditions for a given key. The state1 parameter is the initialized instance during the first merge of each key - public void merge(State state1, State state2) { - try { + + // The aggregation execution unit merges the processing results of data under certain aggregation conditions for a given key. The state1 parameter is the initialized instance during the first merge of each key + public void merge(State state1, State state2) { state1.scale = state2.scale; state1.stringBuilder.append(state2.stringBuilder.toString()); - } catch (Exception e) { - log.info("Merge result exception: " + e.getMessage()); } - } - - // Output the final result after merging the data for each key - public Double getValue(State state) { - try { + + // Output the final result after merging the data for each key + public Double getValue(State state) { String[] strings = state.stringBuilder.toString().split(","); double[] doubles = new double[strings.length]; for (int i = 0; i < strings.length - 1; i++) { doubles[i] = Double.parseDouble(strings[i + 1]); } - + Arrays.sort(doubles); double n = doubles.length; if (n == 0) { return 0.0; } double index = (n - 1) / 2.0; - + int low = (int) Math.floor(index); int high = (int) Math.ceil(index); - + double value = low == high ? (doubles[low] + doubles[high]) / 2 : doubles[high]; - + BigDecimal decimal = new BigDecimal(value); return decimal.setScale(state.scale, BigDecimal.ROUND_HALF_UP).doubleValue(); - } catch (Exception e) { - log.info("Calculation exception: " + e.getMessage()); } - return 0.0; - } - - // Executed after each execution unit completes - public void destroy(State state) { - } -} -``` - -## Best Practices -### Loading static variables + // Executed after each execution unit completes + public void destroy(State state) { + } + } + ``` +</details> -Currently, in Doris, executing a UDF function, e.g., `select udf(col) from table`, will load the udf.jar package for each concurrent instance, and unload the udf.jar package when the instance ends. If the udf.jar file needs to load a file of several hundred MBs, the memory usage will increase sharply due to concurrency, potentially leading to OOM (Out of Memory). +2. Register and create the Java-UDAF function in Doris. 
For more syntax details, please refer to [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).

    ```sql
    CREATE AGGREGATE FUNCTION simple_demo(INT) RETURNS INT PROPERTIES (
    "file"="file:///pathTo/java-udaf.jar",
    "symbol"="org.apache.doris.udf.SimpleDemo",
    "always_nullable"="true",
    "type"="JAVA_UDF"
    );
    ```

3. When using Java-UDAF, you can perform aggregation either by grouping or by aggregating all results:

    ```sql
    select simple_demo(id) from test_table group by id;
    +-----------------+
    | simple_demo(id) |
    +-----------------+
    |               1 |
    |               6 |
    +-----------------+
    ```

    ```sql
    select simple_demo(id) from test_table;
    +-----------------+
    | simple_demo(id) |
    +-----------------+
    |               7 |
    +-----------------+
    ```

### Introduction to Java-UDTF Example

:::tip
UDTF is supported starting from Doris version 3.0.
:::

1. Similar to UDFs, UDTFs require users to implement an `evaluate` method. However, the return value of a UDTF must be of the Array type.

    ```java
    public class UDTFStringTest {
        public ArrayList<String> evaluate(String value, String separator) {
            if (value == null || separator == null) {
                return null;
            } else {
                return new ArrayList<>(Arrays.asList(value.split(separator)));
            }
        }
    }
    ```

2. Register and create the Java-UDTF function in Doris. This registers two UDTF functions: the second one is named with an `_outer` suffix appended to the function name and provides special handling for the case where the result is 0 rows. For more details, refer to [OUTER combinator](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md).
For more syntax details, please refer to [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).

    ```sql
    CREATE TABLES FUNCTION java_utdf(string, string) RETURNS array<string> PROPERTIES (
    "file"="file:///pathTo/java-udaf.jar",
    "symbol"="org.apache.doris.udf.demo.UDTFStringTest",
    "always_nullable"="true",
    "type"="JAVA_UDF"
    );
    ```

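    As an optional check, the sketch below assumes the function name registered above (`java_utdf`, plus the automatically created `java_utdf_outer`). It lists both variants and shows where the `_outer` form would be used when rows whose input produces no elements should be kept rather than filtered out:

    ```sql
    -- List both variants created by the registration above
    -- (illustrative; the names are the ones assumed in this example).
    SHOW FUNCTIONS LIKE 'java_utdf%';

    -- The _outer variant keeps rows for which the UDTF returns NULL or an empty array,
    -- emitting NULL for the generated column instead of dropping the row.
    select id, str, e1 from test_table lateral view java_utdf_outer(str, ',') tmp as e1;
    ```

3. 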
When using Java-UDTF, in Doris, UDTFs must be used with [`Lateral View`](../lateral-view.md) to achieve the row-to-column transformation effect: + + ```sql + select id, str, e1 from test_table lateral view java_utdf(str,',') tmp as e1; + +------+-------+------+ + | id | str | e1 | + +------+-------+------+ + | 1 | a,b,c | a | + | 1 | a,b,c | b | + | 1 | a,b,c | c | + | 6 | d,e | d | + | 6 | d,e | e | + +------+-------+------+ + ``` + +## Best Practices + +*Loading static variables* + +Currently, in Doris, executing a UDF function, e.g., `select udf(col) from table`, will load the udf.jar package for each concurrent instance, and unload the udf.jar package when the instance finish. + +If the udf.jar file needs to load a file of several hundred MBs, the memory usage will increase sharply due to concurrency, potentially leading to OOM (Out of Memory). + +Alternatively, if you want to use a connection pool, this approach will not allow you to initialize it only once in the static area. + +Here are two solutions, with the second solution requiring Doris version branch-3.0 or above. + +*Solution 1:* + +The solution is to split the resource loading code, generate a separate jar package, and have other packages directly reference this resource jar package. + +Assume the files have been split into `DictLibrary` and `FunctionUdfAR`. + +```java +public class DictLibrary { + private static HashMap<String, String> res = new HashMap<>(); + + static { + // suppose we built this dictionary from a certain local file. + res.put("key1", "value1"); + res.put("key2", "value2"); + res.put("key3", "value3"); + res.put("0", "value4"); + res.put("1", "value5"); + res.put("2", "value6"); + } + + public static String evaluate(String key) { + if (key == null) { + return null; } + return res.get(key); + } +} +``` + +```java +public class FunctionUdf { + public String evaluate(String key) { + String value = DictLibrary.evaluate(key); + return value; } +} +``` + +1. Compile the DictLibrary file separately to generate an independent jar package, resulting in a resource file DictLibrary.jar: + + ```shell + javac ./DictLibrary.java + jar -cf ./DictLibrary.jar ./DictLibrary.class ``` 2. Then compile the FunctionUdf file, directly referencing the resource package from the previous step, resulting in the FunctionUdf.jar package: @@ -410,28 +463,76 @@ Assume the files have been split into DictLibrary and FunctionUdf. jar -cvf ./FunctionUdf.jar ./FunctionUdf.class ``` -3. After the above two steps, you will get two jar packages. To allow the resource jar package to be referenced by all concurrent instances, place it in the deployment path `fe/custom_lib` 和 `be/custom_lib`. After the restarting, it will be loaded with the JVM startup. +3. After the above two steps, you will get two jar packages. To allow the resource jar package to be referenced by all concurrent instances, place it in the deployment path `be/custom_lib`. After the restarting, it will be loaded with the JVM startup. As a result, the resources will be loaded when the service starts and released when the service stops. 4. Finally, use the `create function` statement to create a UDF function ```sql CREATE FUNCTION java_udf_dict(string) RETURNS string PROPERTIES ( + "file"="file:///pathTo/FunctionUdf.jar", "symbol"="org.apache.doris.udf.FunctionUdf", "always_nullable"="true", "type"="JAVA_UDF" ); ``` -In this loading mode, both FunctionUdf.jar and DictLibrary.jar are in the custom_lib path of FE and BE. 
This way, the packages will be loaded and released with the service startup and shutdown. +*Solution 2:* -You can also customize the path to `FunctionUdf.jar` using `file:///`, but only under `custom_lib`. +The BE (Backend) globally caches the JAR file and customizes the expiration and eviction time. When creating a function, two additional properties are added: -## Usage Notes +static_load: This defines whether to use the static cache loading method. +expiration_time: This defines the expiration time of the JAR file, in minutes. +If the static cache loading method is used, the UDF instance will be cached after the first call and initialization. On subsequent calls to the UDF, the system will first search in the cache. If not found, the initialization process will be triggered. -1. Complex data types (HLL, Bitmap) are not supported. +Additionally, a background thread regularly checks the cache. If the function has not been called within the configured expiration time, it will be evicted from the cache. If the function is called, the cache timestamp will be automatically updated. -2. Users are currently allowed to specify the maximum JVM heap size. The configuration item is the `-Xmx` part of `JAVA_OPTS` in `be.conf`. The default is 1024m. If you need to aggregate data, it is recommended to increase this value to enhance performance and reduce the risk of memory overflow. +```sql +public class Print extends UDF { + static Integer val = 0; + public Integer evaluate() { + val = val + 1; + return val; + } +} +``` + +```sql +CREATE FUNCTION print_12() RETURNS int +PROPERTIES ( + "file" = "file:///path/to/java-udf-demo-jar-with-dependencies.jar", + "symbol" = "org.apache.doris.udf.Print", + "always_nullable"="true", + "type" = "JAVA_UDF", + "static_load" = "true", // default value is false + "expiration_time" = "60" // default value is 360 minutes +); +``` -3. For Char type UDFs, use the String type when creating the function. +As we can see, the result keeps incrementing, which proves that the loaded JAR file is not being unloaded and reloaded. Instead, the variables are being re-initialized to 0. -4. Due to issues with JVM loading classes with the same name, do not use multiple classes with the same name as UDF implementations simultaneously. If you want to update a UDF with a class of the same name, you need to restart BE to reload the classpath. +```sql +mysql [test_query_qa]>select print_12(); ++------------+ +| print_12() | ++------------+ +| 1 | ++------------+ +1 row in set (0.40 sec) + +mysql [test_query_qa]>select print_12(); ++------------+ +| print_12() | ++------------+ +| 2 | ++------------+ +1 row in set (0.03 sec) + +mysql [test_query_qa]>select print_12(); ++------------+ +| print_12() | ++------------+ +| 3 | ++------------+ +1 row in set (0.04 sec) + +``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/udf/java-user-defined-function.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/udf/java-user-defined-function.md index b09b136c204..f1ff590dcdb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/udf/java-user-defined-function.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/udf/java-user-defined-function.md @@ -1,6 +1,6 @@ --- { -"title": "Java UDF", +"title": "Java UDF、UDAF、UDTF", "language": "zh-CN" } --- @@ -24,73 +24,15 @@ specific language governing permissions and limitations under the License. 
--> - -## Java UDF 介绍 - +## 概述 Java UDF 为用户提供 UDF 编写的 Java 接口,以方便用户使用 Java 语言进行自定义函数的执行。 - Doris 支持使用 JAVA 编写 UDF、UDAF 和 UDTF。下文如无特殊说明,使用 UDF 统称所有用户自定义函数。 - -## 创建 UDF - -实现的 jar 包可以放在本地也可以存放在远程服务端通过 HTTP 下载,但必须让每个 FE 和 BE 节点都能获取到 jar 包。 - -否则将会返回错误状态信息 `Couldn't open file ......`。 - -更多语法帮助可参阅 [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION). - -### UDF - -```sql -CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES ( - "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar", - "symbol"="org.apache.doris.udf.AddOne", - "always_nullable"="true", - "type"="JAVA_UDF" -); -``` - -### UDAF - -```sql -CREATE AGGREGATE FUNCTION middle_quantiles(DOUBLE,INT) RETURNS DOUBLE PROPERTIES ( - "file"="file:///pathTo/java-udaf.jar", - "symbol"="org.apache.doris.udf.demo.MiddleNumberUDAF", - "always_nullable"="true", - "type"="JAVA_UDAF" -); -``` - -### UDTF - -:::tip -UDTF 自 Doris 3.0 版本开始支持 -::: - -```sql -CREATE TABLES FUNCTION java-utdf(string, string) RETURNS array<string> PROPERTIES ( - "file"="file:///pathTo/java-udaf.jar", - "symbol"="org.apache.doris.udf.demo.UDTFStringTest", - "always_nullable"="true", - "type"="JAVA_UDF" -); -``` - -## 使用 UDF - -用户使用 UDF 必须拥有对应数据库的 `SELECT` 权限。 - -UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函数的作用域是全局的,而 UDF 的作用域是 DB 内部。 - -当链接 Session 位于数据内部时,直接使用 UDF 名字会在当前 DB 内部查找对应的 UDF。否则用户需要显示的指定 UDF 的数据库名字,例如 `dbName.funcName`。 - -## 删除 UDF - -当你不再需要 UDF 函数时,你可以通过下述命令来删除一个 UDF 函数,可以参考 [DROP FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION.md) +1. Java UDF 是较为常见的自定义标量函数(Scalar Function),即每输入一行数据,就会有一行对应的结果输出,较为常见的有ABS,LENGTH等。值得一提的是对于用户来讲,Hive UDF 是可以直接迁移至 Doris 的。 +2. Java UDAF 即为自定义的聚合函数(Aggregate Function),即在输入多行数据进行聚合后,仅输出一行对应的结果,较为常见的有MIN,MAX,COUNT等。 +3. JAVA UDTF 即为自定义的表函数(Table Function),即每输一行数据,可以产生一行或多行的结果,在 Doris 中需要结合Lateral View 使用可以达到行转列的效果,较为常见的有EXPLODE,EXPLODE_SPLIT等。 ## 类型对应关系 - | Doris 数据类型 | Java UDF 参数类型 | | ---------------- | ------------------------------------------ | | Bool | Boolean | @@ -118,208 +60,231 @@ UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函 在创建函数时,请务必使用 `string` 类型而不是 `varchar`,否则可能会导致函数执行失败。 ::: -## UDF 的编写 +## 使用限制 -本小节主要介绍如何开发一个 Java UDF。在 `samples/doris-demo/java-udf-demo/` 下提供示例,可点击查看[参考示例](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo) +1. 不支持复杂数据类型(HLL,Bitmap)。 +2. 当前允许用户自己指定JVM最大堆大小,配置项是 be.conf 中的 `JAVA_OPTS` 的 -Xmx 部分。默认 1024m,如果需要聚合数据,建议调大一些,增加性能,减少内存溢出风险。 +3. 
由于 jvm 加载同名类的问题,不要同时使用多个同名类作为 udf 实现,如果想更新某个同名类的 udf,需要重启 be 重新加载 classpath。 -使用 Java 代码编写 UDF,UDF 的主入口必须为 `evaluate` 函数。这一点与 Hive 等其他引擎保持一致。在本示例中,我们编写了 `AddOne` UDF 来完成对整型输入进行加一的操作。 -值得一提的是,本例不只是 Doris 支持的 Java UDF,同时还是 Hive 支持的 UDF,也就是说,对于用户来讲,Hive UDF 是可以直接迁移至 Doris 的。 +## 快速上手 +本小节主要介绍如何开发一个 Java UDF。在 `samples/doris-demo/java-udf-demo/` 下提供了示例,可供参考,查看点击[这里](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo) + +UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函数的作用域是全局的,而 UDF 的作用域是 DB 内部。 +所以如果当前链接 session 位于数据库DB 内部时,直接使用 UDF 名字会在当前 DB 内部查找对应的 UDF。否则用户需要显示的指定 UDF 的数据库名字,例如 `dbName.funcName`。 + +接下来的章节介绍实例,均会在`test_table` 上做测试,对应建表如下: -另外,如果定义的 UDF 中需要加载很大的资源文件,或者希望可以定义全局的 Static 变量,可以参照文档下方的 Static 变量加载方式。 +```sql +CREATE TABLE `test_table` ( + id int NULL, + d1 double NULL, + str string NULL +) ENGINE=OLAP +DUPLICATE KEY(`id`) +DISTRIBUTED BY HASH(`id`) BUCKETS 1 +PROPERTIES ( +"replication_num" = "1"); + +insert into test_table values (1, 111.11, "a,b,c"); +insert into test_table values (6, 666.66, "d,e"); +``` +### Java-UDF 实例介绍 -### UDF +使用 Java 代码编写 UDF,UDF 的主入口必须为 `evaluate` 函数。这一点与 Hive 等其他引擎保持一致。在本示例中,我们编写了 `AddOne` UDF 来完成对整型输入进行加一的操作。 -```java -public class AddOne extends UDF { - public Integer evaluate(Integer value) { - return value == null ? null : value + 1; +1. 首先编写对应的Java 代码,打包生成JAR 包。 + + ```java + public class AddOne extends UDF { + public Integer evaluate(Integer value) { + return value == null ? null : value + 1; + } } -} -``` + ``` -### UDAF +2. 在Doris 中注册创建 Java-UDF 函数。 更多语法帮助可参阅 [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md). -在使用 Java 代码编写 UDAF 时,有一些必须实现的函数 (标记 `required`) 和一个内部类 State,下面将以一个具体的实例来说明。 + ```sql + CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES ( + "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar", + "symbol"="org.apache.doris.udf.AddOne", + "always_nullable"="true", + "type"="JAVA_UDF" + ); + ``` -**示例 1** +3. 用户使用 UDF 必须拥有对应数据库的 `SELECT` 权限。 + 如果想查看注册成功的对应UDF 函数,可以使用[SHOW FUNCTIONS](../../sql-manual/sql-statements/Show-Statements/SHOW-FUNCTIONS.md) 命令。 + + ``` sql + select id,java_udf_add_one(id) from test_table; + +------+----------------------+ + | id | java_udf_add_one(id) | + +------+----------------------+ + | 1 | 2 | + | 6 | 7 | + +------+----------------------+ + ``` -下面的 SimpleDemo 将实现一个类似的 SUM 的简单函数,输入参数 INT,输出参数是 INT。 +4. 当不再需要 UDF 函数时,可以通过下述命令来删除一个 UDF 函数,可以参考 [DROP FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION.md) -```java -package org.apache.doris.udf.demo; +另外,如果定义的 UDF 中需要加载很大的资源文件,或者希望可以定义全局的 static 变量,可以参照文档下方的 static 变量加载方式。 -import java.io.DataInputStream; -import java.io.DataOutputStream; -import java.io.IOException; -import java.util.logging.Logger; +### Java-UDAF 实例介绍 -public class SimpleDemo { +在使用 Java 代码编写 UDAF 时,有一些必须实现的函数 (标记 required) 和一个内部类 State,下面将以具体的实例来说明。 - Logger log = Logger.getLogger("SimpleDemo"); +1. 
首先编写对应的Java UDAF 代码,打包生成JAR 包。 - //Need an inner class to store data - /*required*/ - public static class State { - /*some variables if you need */ - public int sum = 0; - } + <details><summary> 示例1: SimpleDemo 将实现一个类似的 sum 的简单函数,输入参数 INT,输出参数是 INT</summary> - /*required*/ - public State create() { - /* here could do some init work if needed */ - return new State(); - } + ```java + package org.apache.doris.udf; - /*required*/ - public void destroy(State state) { - /* here could do some destroy work if needed */ - } + import java.io.DataInputStream; + import java.io.DataOutputStream; + import java.io.IOException; + import java.util.logging.Logger; - /*Not Required*/ - public void reset(State state) { - /*if you want this udaf function can work with window function.*/ - /*Must impl this, it will be reset to init state after calculate every window frame*/ - state.sum = 0; - } + public class SimpleDemo { + + Logger log = Logger.getLogger("SimpleDemo"); - /*required*/ - //first argument is State, then other types your input - public void add(State state, Integer val) throws Exception { - /* here doing update work when input data*/ - if (val != null) { - state.sum += val; + //Need an inner class to store data + /*required*/ + public static class State { + /*some variables if you need */ + public int sum = 0; + } + + /*required*/ + public State create() { + /* here could do some init work if needed */ + return new State(); + } + + /*required*/ + public void destroy(State state) { + /* here could do some destroy work if needed */ } - } - /*required*/ - public void serialize(State state, DataOutputStream out) { - /* serialize some data into buffer */ - try { + /*Not Required*/ + public void reset(State state) { + /*if you want this udaf function can work with window function.*/ + /*Must impl this, it will be reset to init state after calculate every window frame*/ + state.sum = 0; + } + + /*required*/ + //first argument is State, then other types your input + public void add(State state, Integer val) throws Exception { + /* here doing update work when input data*/ + if (val != null) { + state.sum += val; + } + } + + /*required*/ + public void serialize(State state, DataOutputStream out) throws IOException { + /* serialize some data into buffer */ out.writeInt(state.sum); - } catch (Exception e) { - /* Do not throw exceptions */ - log.info(e.getMessage()); } - } - /*required*/ - public void deserialize(State state, DataInputStream in) { - /* deserialize get data from buffer before you put */ - int val = 0; - try { - val = in.readInt(); - } catch (Exception e) { - /* Do not throw exceptions */ - log.info(e.getMessage()); + /*required*/ + public void deserialize(State state, DataInputStream in) throws IOException { + /* deserialize get data from buffer before you put */ + int val = in.readInt(); + state.sum = val; } - state.sum = val; - } - /*required*/ - public void merge(State state, State rhs) throws Exception { - /* merge data from state */ - state.sum += rhs.sum; - } + /*required*/ + public void merge(State state, State rhs) throws Exception { + /* merge data from state */ + state.sum += rhs.sum; + } - /*required*/ - //return Type you defined - public Integer getValue(State state) throws Exception { - /* return finally result */ - return state.sum; + /*required*/ + //return Type you defined + public Integer getValue(State state) throws Exception { + /* return finally result */ + return state.sum; + } } -} -``` + ``` + </details> -**示例 2** -```java -package org.apache.doris.udf.demo; - - -import 
java.io.DataInputStream; -import java.io.DataOutputStream; -import java.math.BigDecimal; -import java.util.Arrays; -import java.util.logging.Logger; - -/*UDAF 计算中位数*/ -public class MedianUDAF { - Logger log = Logger.getLogger("MedianUDAF"); - - //状态存储 - public static class State { - //返回结果的精度 - int scale = 0; - //是否是某一个 tablet 下的某个聚合条件下的数据第一次执行 add 方法 - boolean isFirst = true; - //数据存储 - public StringBuilder stringBuilder; - } + <details><summary> 示例2: MedianUDAF是一个计算中位数的功能,输入类型为(DOUBLE, INT), 输出为DOUBLE </summary> - //状态初始化 - public State create() { - State state = new State(); - //根据每个 tablet 下的聚合条件需要聚合的数据量大小,预先初始化,增加性能 - state.stringBuilder = new StringBuilder(1000); - return state; - } + ```java + package org.apache.doris.udf.demo; + + import java.io.DataInputStream; + import java.io.DataOutputStream; + import java.math.BigDecimal; + import java.util.Arrays; + import java.util.logging.Logger; + + /*UDAF 计算中位数*/ + public class MedianUDAF { + Logger log = Logger.getLogger("MedianUDAF"); + + //状态存储 + public static class State { + //返回结果的精度 + int scale = 0; + //是否是某一个 tablet 下的某个聚合条件下的数据第一次执行 add 方法 + boolean isFirst = true; + //数据存储 + public StringBuilder stringBuilder; + } + //状态初始化 + public State create() { + State state = new State(); + //根据每个 tablet 下的聚合条件需要聚合的数据量大小,预先初始化,增加性能 + state.stringBuilder = new StringBuilder(1000); + return state; + } - //处理执行单位处理各自 tablet 下的各自聚合条件下的每个数据 - public void add(State state, Double val, int scale) { - try { + + //处理执行单位处理各自 tablet 下的各自聚合条件下的每个数据 + public void add(State state, Double val, int scale) throws IOException { if (val != null && state.isFirst) { state.stringBuilder.append(scale).append(",").append(val).append(","); state.isFirst = false; } else if (val != null) { state.stringBuilder.append(val).append(","); } - } catch (Exception e) { - //如果不能保证一定不会异常,建议每个方法都最大化捕获异常,因为目前不支持处理 java 抛出的异常 - log.info("获取数据异常:" + e.getMessage()); } - } - //处理数据完需要输出等待聚合 - public void serialize(State state, DataOutputStream out) { - try { + //处理数据完需要输出等待聚合 + public void serialize(State state, DataOutputStream out) throws IOException { //目前暂时只提供 DataOutputStream,如果需要序列化对象可以考虑拼接字符串,转换 json,序列化成字节数组等方式 //如果要序列化 State 对象,可能需要自己将 State 内部类实现序列化接口 //最终都是要通过 DataOutputStream 传输 out.writeUTF(state.stringBuilder.toString()); - } catch (Exception e) { - log.info("序列化异常:" + e.getMessage()); } - } - //获取处理数据执行单位输出的数据 - public void deserialize(State state, DataInputStream in) { - try { + //获取处理数据执行单位输出的数据 + public void deserialize(State state, DataInputStream in) throws IOException { String string = in.readUTF(); state.scale = Integer.parseInt(String.valueOf(string.charAt(0))); StringBuilder stringBuilder = new StringBuilder(string.substring(2)); state.stringBuilder = stringBuilder; - } catch (Exception e) { - log.info("反序列化异常:" + e.getMessage()); } - } - //聚合执行单位按照聚合条件合并某一个键下数据的处理结果 ,每个键第一次合并时,state1 参数是初始化的实例 - public void merge(State state1, State state2) { - try { + //聚合执行单位按照聚合条件合并某一个键下数据的处理结果 ,每个键第一次合并时,state1 参数是初始化的实例 + public void merge(State state1, State state2) throws IOException { state1.scale = state2.scale; state1.stringBuilder.append(state2.stringBuilder.toString()); - } catch (Exception e) { - log.info("合并结果异常:" + e.getMessage()); } - } - //对每个键合并后的数据进行并输出最终结果 - public Double getValue(State state) { - try { + //对每个键合并后的数据进行并输出最终结果 + public Double getValue(State state) throws IOException { String[] strings = state.stringBuilder.toString().split(","); double[] doubles = new double[strings.length + 1]; doubles = 
Arrays.stream(strings).mapToDouble(Double::parseDouble).toArray(); @@ -335,101 +300,228 @@ public class MedianUDAF { BigDecimal decimal = new BigDecimal(value); return decimal.setScale(state.scale, BigDecimal.ROUND_HALF_UP).doubleValue(); - } catch (Exception e) { - log.info("计算异常:" + e.getMessage()); } - return 0.0; + + //每个执行单位执行完都会执行 + public void destroy(State state) { + } + } + ``` +</details> - //每个执行单位执行完都会执行 - public void destroy(State state) { + +2. 在Doris 中注册创建 Java-UADF 函数。 更多语法帮助可参阅 [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md). + + ```sql + CREATE AGGREGATE FUNCTION simple_demo(INT) RETURNS INT PROPERTIES ( + "file"="file:///pathTo/java-udaf.jar", + "symbol"="org.apache.doris.udf.SimpleDemo", + "always_nullable"="true", + "type"="JAVA_UDF" + ); + ``` + +3. 使用 Java-UDAF, 可以分组聚合或者聚合全部结果: + + ```sql + select simple_demo(id) from test_table group by id; + +-----------------+ + | simple_demo(id) | + +-----------------+ + | 1 | + | 6 | + +-----------------+ + ``` + + ```sql + select simple_demo(id) from test_table; + +-----------------+ + | simple_demo(id) | + +-----------------+ + | 7 | + +-----------------+ + ``` + +### Java-UDTF 实例介绍 +:::tip +UDTF 自 Doris 3.0 版本开始支持 +::: + +1. 首先编写对应的Java UDTF 代码,打包生成JAR 包。 +UDTF 和 UDF 函数一样,需要用户自主实现一个 `evaluate` 方法, 但是 UDTF 函数的返回值必须是 Array 类型。 + + ```JAVA + public class UDTFStringTest { + public ArrayList<String> evaluate(String value, String separator) { + if (value == null || separator == null) { + return null; + } else { + return new ArrayList<>(Arrays.asList(value.split(separator))); + } + } } + ``` -} +2. 在Doris 中注册创建 Java-UDTF 函数。 此时会注册两个UTDF 函数,另外一个是在函数名后面加上`_outer`后缀, 其中带后缀`_outer` 的是针对结果为0行时的特殊处理,具体可查看[OUTER 组合器](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md)。 +更多语法帮助可参阅 [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md). -``` + ```sql + CREATE TABLES FUNCTION java-utdf(string, string) RETURNS array<string> PROPERTIES ( + "file"="file:///pathTo/java-udaf.jar", + "symbol"="org.apache.doris.udf.demo.UDTFStringTest", + "always_nullable"="true", + "type"="JAVA_UDF" + ); + ``` + +3. 使用 Java-UDTF, 在Doris 中使用UDTF 需要结合 [Lateral View](../lateral-view.md), 实现行转列的效果 : + + ```sql + select id, str, e1 from test_table lateral view java_utdf(str,',') tmp as e1; + +------+-------+------+ + | id | str | e1 | + +------+-------+------+ + | 1 | a,b,c | a | + | 1 | a,b,c | b | + | 1 | a,b,c | c | + | 6 | d,e | d | + | 6 | d,e | e | + +------+-------+------+ + ``` ## 最佳实践 -### Static 变量加载 +*Static 变量加载* -当前在 Doris 中,执行一个 UDF 函数,例如 `select udf(col) from table`, 每一个并发 Instance 会加载一次 udf.jar 包,在该 instance 结束时卸载掉 udf.jar 包。 +当前在 Doris 中,执行一个 UDF 函数,例如 `select udf(col) from table`, 每一个并发 Instance 会加载一次 udf.jar 包,在该 Instance 结束时卸载掉 udf.jar 包。 所以当 udf.jar 文件中需要加载一个几百 MB 的文件时,会因为并发的原因,使得占据的内存急剧增大,容易 OOM。 +或者想使用一个连接池时,这样无法做到仅在static 区域初始化一次。 -解决方法是可以将资源加载代码拆分开,单独生成一个 jar 包文件,其他包直接引用该资源 jar 包。 +这里提供两个解决方案,其中方案二需要Doris 版本在branch-3.0 以上才行。 -假设已经拆分为了 DictLibrary 和 FunctionUdf 两个文件。 +*解决方案1:* -1. 单独编译 DictLibrary 文件,使其生成一个独立的 jar 包,这样可以得到一个资源文件 DictLibrary.jar: +是可以将资源加载代码拆分开,单独生成一个 JAR 包文件,然后其他包直接引用该资源 JAR 包。 - ```shell - javac ./DictLibrary.java - jar -cf ./DictLibrary.jar ./DictLibrary.class - ``` +假设已经将代码拆分为了 DictLibrary 和 FunctionUdf 两个文件。 - ```java - public class DictLibrary { - private static HashMap<String, String> res = new HashMap<>(); - - static { - // suppose we built this dictionary from a certain local file. 
- res.put("key1", "value1"); - res.put("key2", "value2"); - res.put("key3", "value3"); - res.put("0", "value4"); - res.put("1", "value5"); - res.put("2", "value6"); - } +```java +public class DictLibrary { + private static HashMap<String, String> res = new HashMap<>(); + + static { + // suppose we built this dictionary from a certain local file. + res.put("key1", "value1"); + res.put("key2", "value2"); + res.put("key3", "value3"); + res.put("0", "value4"); + res.put("1", "value5"); + res.put("2", "value6"); + } - public static String evaluate(String key) { - if (key == null) { - return null; - } - return res.get(key); + public static String evaluate(String key) { + if (key == null) { + return null; } + return res.get(key); } - ``` +} +``` - ```java - public class FunctionUdf { - public String evaluate(String key) { - String value = DictLibrary.evaluate(key); - return value; - } +```java +public class FunctionUdf { + public String evaluate(String key) { + String value = DictLibrary.evaluate(key); + return value; } +} +``` + +1. 单独编译 DictLibrary 文件,使其生成一个独立的 JAR 包,这样可以得到一个资源文件包 DictLibrary.jar: + + ```shell + javac ./DictLibrary.java + jar -cf ./DictLibrary.jar ./DictLibrary.class ``` -2. 编译 FunctionUdf 文件,可以直接引用上一步的到的资源包,这样可以得到 UDF 的 FunctionUdf.jar 包。 +2. 编译 FunctionUdf 文件,需要引用上一步的到的资源包最为库使用,这样打包后可以得到 UDF 的 FunctionUdf.jar 包。 ```shell javac -cp ./DictLibrary.jar ./FunctionUdf.java jar -cvf ./FunctionUdf.jar ./FunctionUdf.class ``` -3. 经过上面两步之后,会得到两个 jar 包,由于想让资源 jar 包被所有的并发引用,所以需要将它放到指定路径 `fe/custom_lib` 和 `be/custom_lib` 下面,服务重启之后就可以随着 JVM 的启动加载进来。 +3. 由于想让资源 JAR 包被所有的并发引用,所以想让它被JVM 直接加载,可以将它放到指定路径 `be/custom_lib` 下面,BE 服务重启之后就可以随着 JVM 的启动加载进来,因此都会随着服务启动而加载,停止而释放。 -4. 最后利用 `CREATE FUNCTION` 语句创建一个 UDF 函数 +4. 最后利用 `CREATE FUNCTION` 语句创建一个 UDF 函数, 这样每次卸载仅是FunctionUdf.jar。 ```sql CREATE FUNCTION java_udf_dict(string) RETURNS string PROPERTIES ( + "file"="file:///pathTo/FunctionUdf.jar", "symbol"="org.apache.doris.udf.FunctionUdf", "always_nullable"="true", "type"="JAVA_UDF" ); ``` -使用该加载方式时,FunctionUdf.jar 和 DictLibrary.jar 都在 FE 和 BE 的 custom_lib 路径下,因此都会随着服务启动而加载,停止而释放,不再需要指定 File 的路径。 +*解决方案2:* -也可以使用 file:/// 方式自定义 FunctionUdf.jar 的路径,但是 DictLibrary.jar 只能放在 custom_lib 下。 +BE 全局缓存 JAR 包,自定义过期淘汰时间,在create function 时增加两个属性字段,其中 +static_load: 用于定义是否使用静态cache 加载的方式。 -## 使用须知 +expiration_time: 用于定义 JAR 包的过期时间,单位为分钟。 -1. 不支持复杂数据类型(HLL,Bitmap)。 +若使用静态cache 加载方式,则在第一次调用该UDF 函数时,在初始化之后会将该UDF 的实例缓存起来,在下次调用该UDF 时,首先会在cache 中进行查找,如果没有找到,则会进行相关初始化操作。 -2. 当前允许用户自己指定 JVM 最大堆大小,配置项是 be.conf 中的 `JAVA_OPTS` 的 -Xmx 部分。默认 1024m,如果需要聚合数据,建议调大一些,增加性能,减少内存溢出风险。 +并且后台有线程定期检查,如果在配置的过期淘汰时间内,一直没有被调用过,则会从缓存cache 中清理掉。如果被调用时,则会自动更新缓存时间点。 -3. Char 类型的 UDF 在 `CREATE FUNCTION` 时需要使用 String 类型。 +```sql +public class Print extends UDF { + static Integer val = 0; + public Integer evaluate() { + val = val + 1; + return val; + } +} +``` -4. 由于 JVM 加载同名类的问题,不要同时使用多个同名类作为 UDF 实现,如果想更新某个同名类的 UDF,需要重启 BE 重新加载 Classpath。 +```sql +CREATE FUNCTION print_12() RETURNS int +PROPERTIES ( + "file" = "file:///path/to/java-udf-demo-jar-with-dependencies.jar", + "symbol" = "org.apache.doris.udf.Print", + "always_nullable"="true", + "type" = "JAVA_UDF", + "static_load" = "true", // default value is false + "expiration_time" = "60" // default value is 360 minutes +); +``` +可以看到结果是一直在递增的,证明加载的 JAR 包没有被卸载后又加载,导致重新初始化变量为 0. 
+```sql +mysql [test_query_qa]>select print_12(); ++------------+ +| print_12() | ++------------+ +| 1 | ++------------+ +1 row in set (0.40 sec) + +mysql [test_query_qa]>select print_12(); ++------------+ +| print_12() | ++------------+ +| 2 | ++------------+ +1 row in set (0.03 sec) + +mysql [test_query_qa]>select print_12(); ++------------+ +| print_12() | ++------------+ +| 3 | ++------------+ +1 row in set (0.04 sec) +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-FUNCTION.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-FUNCTION.md index 0b28fa1a407..ea580424c20 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-FUNCTION.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-FUNCTION.md @@ -79,7 +79,8 @@ CREATE [GLOBAL] [AGGREGATE] [ALIAS] FUNCTION function_name - `properties`: 用于设定函数相关属性,能够设置的属性包括: - - `file`: 表示的包含用户UDF的jar包,当在多机环境时,也可以使用http的方式下载jar包。这个参数是必须设定的。 + - `file`: 表示的包含用户UDF的jar包,实现的 jar 包可以放在本地也可以存放在远程服务端通过 HTTP 下载,但必须让每个 FE 和 BE 节点都能获取到 jar 包。 + 这个参数是必须设定的。 - `symbol`: 表示的是包含UDF类的类名。这个参数是必须设定的 --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org