mcvsubbu commented on PR #11943:
URL: https://github.com/apache/pinot/pull/11943#issuecomment-1795513639

   I wrote this program to test out some hypothesis. According to this, it 
takes slightly more number of iterations to stabilize to the right segment size 
if  we apply the algorithm for all partitions. I tried with numPartitions = 1 
and numPartitions=32.
   
   Assumptions made here:
   - The segment size varies with numrows in a logarithmic fashion (see 
program). This may be wrong, open to corrections.
   - Segment size reported has a noise of 5% 
   
   I think all partitions are somewhat same in behavior, so we should count 
each partition once so as to learn best from previous completions. One way we 
could improve is (with the assumptions that partitions complete roughly at the 
same time), pick the partition that completes first (may not be partition 0, of 
course). This was considered at the time, but was hard to do, given that 
partitions may complete within minutes in some cases.
   
   ```
   public class SegSizeAlgorithmTester {
     static Double sizeToRowsRatio = null;
   
     static final double ALPHA = 0.9;
     static final long optimalSegmentSize = 350*1024*1024;
   
     // Assume segment size varies with number of rows as:
     // segmentSize = A + B * ln (numRows/C), with a noise variation of 5%
     static final double A = 4096;
     static final double B = 9.297e+7;
     static final double C = 9.7e+4;
     static final double noisePercent = 5;
   
     final static int nPartitions = 32;
   
     public static void main(String args[]) {
       int nIterations = 0;  // One iteration is completion of all partitions 
(roughly at the same time)
   
       int reportedNumRows = computeNumRows(0, 0);
       for (int s = 0; s < 50; s++) {
         double reportedSegSize = 0;
         int numRows = 0;
         for (int j = 0; j < nPartitions; j++) {
           // All partitions report the same number of rows
           reportedSegSize = A + B * Math.log((double) reportedNumRows / C);
           reportedSegSize = addNoiseToReportedSegSize(reportedSegSize, 
noisePercent);
           numRows = computeNumRows(reportedNumRows, (long) reportedSegSize);
         }
         nIterations++;
         double deviationPercent = Math.abs(optimalSegmentSize-reportedSegSize) 
*100/optimalSegmentSize;
         System.out.println(nIterations + " " + deviationPercent + " " +  
reportedNumRows + " " + reportedSegSize);
         reportedNumRows = numRows;
       }
     }
   
     private static double addNoiseToReportedSegSize(double segSize, double 
noisePercent) {
       double noise = noisePercent * Math.random() * segSize/100.0;
       if (Math.random() > 0.5) {
         return segSize + noise;
       } else {
         return segSize - noise;
       }
     }
   
     private static int computeNumRows(int numRowsConsumed, long segmentSize) {
       if (segmentSize == 0) { // first segment of table
         return 100_000; // First guess on number of rows
       }
   
       double   currentRatio = (double)segmentSize/numRowsConsumed;
       if (sizeToRowsRatio != null) {
         sizeToRowsRatio = sizeToRowsRatio * (1 - ALPHA) + currentRatio * ALPHA;
       } else {
         sizeToRowsRatio = currentRatio;
       }
   
       int newNumRows;
   
       if (segmentSize <= 0.5 * optimalSegmentSize) {
         // Quicker ramp up
         newNumRows = numRowsConsumed * 3/2;
       } else if (segmentSize >= 2 * optimalSegmentSize) {
         // Even quicker ramp down
         newNumRows = numRowsConsumed/2;
       } else {
         newNumRows = (int) (optimalSegmentSize / sizeToRowsRatio);
       }
   
       return newNumRows;
     }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to