xinyiZzz opened a new pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959


   ## Proposed changes
   
   By casting Decimal to Double, this change implements variance calculation for Decimal-type columns.
   In my local Spark 2.4 and MySQL, the variance of a Decimal column is also calculated by casting to Double.
   NOTE: there are two potential sources of precision loss (illustrated by the snippet below):
       - casting Decimal to Double
       - the Double intermediate variables used during the variance calculation (a Double carries only about 16 significant digits)
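
   A minimal Python illustration (not Doris code) of this precision limit: a Decimal value with more than 16 significant digits cannot survive the cast to Double intact.
   ```
   from decimal import Decimal

   # A DECIMAL(27,9)-style value with 27 significant digits; a double keeps ~16.
   d = Decimal("123456789012345678.123456789")
   print(float(d))  # 1.2345678901234568e+17 -- the tail digits are gone
   ```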
   
   ## Types of changes
   
   What types of changes does your code introduce to Doris?
   _Put an `x` in the boxes that apply_
   
   - [x] Bugfix (non-breaking change which fixes an issue)
   
   ## Checklist
   
   - [x] I have created an issue (Fix #3127) and have described the bug/feature there in detail
   - [x] Compiling and unit tests pass locally with my changes
   
   ## Further comments
   
   Before fixing the variance calculation by casting Decimal to Double, I implemented a method that calculates the variance directly on Decimal, using DecimalV2Value as the type of the intermediate variables. See commit: https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066
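
   For reference, below is a rough Python sketch of that direct-on-Decimal idea (the actual commit is C++ and may structure the computation differently): accumulate the count, sum, and sum of squares in an exact decimal type, the role DecimalV2Value plays there, and only divide at the end. The `decimal_variance` helper is hypothetical.
   ```
   from decimal import Decimal, getcontext

   # Assumption: 80 digits gives generous head-room so the squared sums of
   # DECIMAL(27,9)-sized inputs stay exact.
   getcontext().prec = 80

   def decimal_variance(values, sample=True):
       """Sample (or population) variance with exact Decimal intermediates."""
       count = 0
       total = Decimal(0)      # running sum
       total_sq = Decimal(0)   # running sum of squares
       for v in values:
           d = Decimal(v)
           count += 1
           total += d
           total_sq += d * d
       denom = count - 1 if sample else count
       # Var = (sum(x^2) - sum(x)^2 / n) / denom
       return (total_sq - total * total / count) / denom

   print(decimal_variance(["1.5", "2.5", "3.5"]))  # 1.00 (sample variance)
   ```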
   
   Compared with casting to Double, calculating the variance directly on Decimal has the following considerations and advantages:
       - Less precision loss.
       - Better performance.
       - Consistency with other functions such as `avg()` and `sum()`: in Doris these functions calculate directly on Decimal and use DecimalV2Value as the type of their intermediate variables.
   
   However, calculating the variance directly on Decimal has an overflow problem. When `Variance > 9223372036854775807999999999/N` (where 9223372036854775807999999999 is the maximum value of DecimalV2Value and N is the number of data rows), the intermediate variable overflows during the calculation and the maximum value is returned.
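
   As a rough illustration of this bound (plain Python; the `would_overflow` helper is hypothetical and only models the magnitude of the accumulator): the intermediate value is on the order of `Variance * N`, so it exceeds the DecimalV2Value maximum once `Variance > max / N`.
   ```
   from decimal import Decimal

   # Maximum DecimalV2Value as quoted above.
   DECIMALV2_MAX = Decimal("9223372036854775807999999999")

   def would_overflow(variance, n_rows):
       # The accumulator is roughly variance * n_rows, so overflow happens
       # once that product exceeds the maximum representable value.
       return Decimal(variance) * n_rows > DECIMALV2_MAX

   print(would_overflow("1e20", 100000))  # False: 1e25 is below the maximum
   print(would_overflow("1e25", 100000))  # True: 1e30 exceeds the maximum
   ```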
   
   I expect that real business workloads will rarely compute a variance greater than `9223372036854775807999999999/N`, and casting to Double means lower accuracy and worse performance. However, since both Spark and MySQL use the cast-to-Double approach, I submitted this pull request with that approach.
   
   ### Precision comparison
       - When the first 16 significant digits of the data are the same, for example `12345678901234560` and `12345678901234561`, the cast-to-Double calculation returns 0 (precision is lost), while direct calculation on Decimal gives the correct result, 0.5 (reproduced in the snippet below).
       - When the variance is large, the result calculated directly on Decimal, `42981942961.913988699`, is more accurate than the cast-to-Double result, `42981942961.912369`.
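
   The first case above can be reproduced with a few lines of plain Python (an illustration only, not the Doris execution path):
   ```
   import statistics
   from decimal import Decimal

   values = ["12345678901234560", "12345678901234561"]

   # As doubles, both values round to the same 16-digit float, so the
   # computed variance collapses to 0.
   print(statistics.variance([float(v) for v in values]))    # 0.0

   # As exact decimals, the sample variance is recovered correctly.
   print(statistics.variance([Decimal(v) for v in values]))  # 0.5
   ```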
   
   ### Performance comparison
   Test results for `variance()`, `stddev()`, and related functions are as follows:
       - Cast-to-Double calculation: 0.054s
       - Direct Decimal calculation: 0.034s
   
   Table creation statement:
   ```
   CREATE TABLE stddev_samp_test29
   (event_day int, money DECIMAL(27,9) DEFAULT "0")
   DISTRIBUTED BY HASH(event_day) BUCKETS 5
   PROPERTIES("replication_num" = "1");
   ```
   
   100,000 rows of equal-length data, such as `99.00066`, are generated by the following script:
   ```
   import random

   f = open('data.txt', 'w')
   n = 100000
   while n:
       f.write('0,' + "%.5f" % (99 + random.random()) + '\n')
       n -= 1
   f.close()
   ```
   
   Lesson learned: doing the preliminary research up front is really important. So sad...

