xinyiZzz opened a new pull request #4959: URL: https://github.com/apache/incubator-doris/pull/4959
## Proposed changes

By casting Decimal to Double, this change implements variance calculation for Decimal-type columns. In my local Spark 2.4 and MySQL, the variance of a Decimal is also calculated by casting to Double.

NOTE: There may be two sources of precision loss:
- The Decimal cast to Double itself
- During the variance calculation, the Double-type intermediate variables lose precision (Double carries about 16 significant digits)

## Types of changes

What types of changes does your code introduce to Doris? _Put an `x` in the boxes that apply_

- [x] Bugfix (non-breaking change which fixes an issue)

## Checklist

- [x] I have created an issue (Fix #3127) and described the bug/feature there in detail
- [x] Compiling and unit tests pass locally with my changes

## Further comments

Before fixing the variance calculation by casting Decimal to Double, I implemented a method that calculates the variance directly on Decimal, using DecimalV2Value as the type of the intermediate variables. See commit: https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066

Compared with casting to Double, calculating the variance directly on Decimal has these advantages:
- Less precision loss.
- Faster performance.
- Consistency with other functions such as `avg()` and `sum()`, which Doris computes directly on Decimal, using DecimalV2Value as the intermediate variable type.

However, calculating variance directly on Decimal has an overflow problem. When `variance > 9223372036854775807999999999/N` (9223372036854775807999999999 is the maximum value of DecimalV2Value, and N is the number of data rows), the intermediate variable overflows during the calculation and the maximum value is returned.
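The two sources of precision loss can be demonstrated with a short Python sketch (illustrative only, using Python's `decimal` module in place of Doris's DecimalV2Value):

```python
from decimal import Decimal

# Two decimals whose first 16 significant digits are identical.
a = Decimal("12345678901234560")
b = Decimal("12345678901234561")

# Loss 1: casting to a 64-bit double collapses both to the same value,
# because a double carries only ~16 significant decimal digits.
fa, fb = float(a), float(b)
print(fa == fb)  # True: the cast already erased the difference

# Loss 2: the intermediate mean and squared deviations are also doubles,
# so the sample variance computed on doubles comes out as 0.
mean_d = (fa + fb) / 2
var_double = ((fa - mean_d) ** 2 + (fb - mean_d) ** 2) / (2 - 1)
print(var_double)  # 0.0

# Computing directly on Decimal keeps the difference: the result is 0.5.
mean = (a + b) / 2
var_decimal = ((a - mean) ** 2 + (b - mean) ** 2) / (2 - 1)
print(var_decimal)  # 0.5
```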
I feel that real business workloads will rarely produce a variance greater than `9223372036854775807999999999/N`, so casting to Double means lower accuracy and worse performance. But since both Spark and MySQL use the cast-to-double method, I submitted this code.

## Precision comparison

- When the first 16 digits of the data are the same, such as `12345678901234560` and `12345678901234561`, the result calculated by casting to Double is 0 (precision loss), while the correct result from direct Decimal calculation is 0.5.
- When the variance is large, the directly calculated Decimal result `42981942961.913988699` is more accurate than the cast-to-Double result `42981942961.912369`.

## Performance comparison

Test results for variance/stddev and related functions:

- Cast-to-Double calculation: 0.054s
- Direct Decimal calculation: 0.034s

Table creation:

```
CREATE TABLE stddev_samp_test29 (event_day int, money DECIMAL(27,9) DEFAULT "0") DISTRIBUTED BY HASH(event_day) BUCKETS 5 PROPERTIES("replication_num" = "1");
```

100,000 rows of equal-length data, such as `99.00066`, were generated by:

```
import random

f = open('data.txt', 'w')
n = 100000
while n:
    f.write('0,' + "%.5f" % (99 + random.random()) + '\n')
    n -= 1
f.close()
```

Completing the preliminary research first is so important, so sad...
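For reference, the cast-to-double path can be sketched in Python as a one-pass aggregation over count, sum, and sum of squares (an illustration of the approach only, not Doris's actual C++ implementation; the helper name `var_samp_via_double` is made up):

```python
import random
from decimal import Decimal

def var_samp_via_double(values):
    """Sample variance of Decimal inputs via the cast-to-double approach."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        x = float(v)       # Decimal -> Double cast (first source of precision loss)
        count += 1
        total += x
        total_sq += x * x  # Double intermediates (second source of precision loss)
    if count < 2:
        return None
    mean = total / count
    return (total_sq - count * mean * mean) / (count - 1)

# Roughly the same data shape as the benchmark above: 99 + a 5-digit fraction.
data = [Decimal("99") + Decimal(random.random()).quantize(Decimal("0.00001"))
        for _ in range(1000)]
print(var_samp_via_double(data))
```

The one-pass `sum`/`sum of squares` form matches how aggregate functions typically accumulate state, but it is numerically less stable than a two-pass or Welford-style computation, which compounds the precision concerns described above.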