mcvsubbu commented on PR #11943:
URL: https://github.com/apache/pinot/pull/11943#issuecomment-1795513639
I wrote this program to test out some hypothesis. According to this, it
takes slightly more number of iterations to stabilize to the right segment size
if we apply the algorithm for all partitions. I tried with numPartitions = 1
and numPartitions=32.
Assumptions made here:
- The segment size varies with numrows in a logarithmic fashion (see
program). This may be wrong, open to corrections.
- Segment size reported has a noise of 5%
I think all partitions are somewhat same in behavior, so we should count
each partition once so as to learn best from previous completions. One way we
could improve is (with the assumptions that partitions complete roughly at the
same time), pick the partition that completes first (may not be partition 0, of
course). This was considered at the time, but was hard to do, given that
partitions may complete within minutes in some cases.
```
public class SegSizeAlgorithmTester {
static Double sizeToRowsRatio = null;
static final double ALPHA = 0.9;
static final long optimalSegmentSize = 350*1024*1024;
// Assume segment size varies with number of rows as:
// segmentSize = A + B * ln (numRows/C), with a noise variation of 5%
static final double A = 4096;
static final double B = 9.297e+7;
static final double C = 9.7e+4;
static final double noisePercent = 5;
final static int nPartitions = 32;
public static void main(String args[]) {
int nIterations = 0; // One iteration is completion of all partitions
(roughly at the same time)
int reportedNumRows = computeNumRows(0, 0);
for (int s = 0; s < 50; s++) {
double reportedSegSize = 0;
int numRows = 0;
for (int j = 0; j < nPartitions; j++) {
// All partitions report the same number of rows
reportedSegSize = A + B * Math.log((double) reportedNumRows / C);
reportedSegSize = addNoiseToReportedSegSize(reportedSegSize,
noisePercent);
numRows = computeNumRows(reportedNumRows, (long) reportedSegSize);
}
nIterations++;
double deviationPercent = Math.abs(optimalSegmentSize-reportedSegSize)
*100/optimalSegmentSize;
System.out.println(nIterations + " " + deviationPercent + " " +
reportedNumRows + " " + reportedSegSize);
reportedNumRows = numRows;
}
}
private static double addNoiseToReportedSegSize(double segSize, double
noisePercent) {
double noise = noisePercent * Math.random() * segSize/100.0;
if (Math.random() > 0.5) {
return segSize + noise;
} else {
return segSize - noise;
}
}
private static int computeNumRows(int numRowsConsumed, long segmentSize) {
if (segmentSize == 0) { // first segment of table
return 100_000; // First guess on number of rows
}
double currentRatio = (double)segmentSize/numRowsConsumed;
if (sizeToRowsRatio != null) {
sizeToRowsRatio = sizeToRowsRatio * (1 - ALPHA) + currentRatio * ALPHA;
} else {
sizeToRowsRatio = currentRatio;
}
int newNumRows;
if (segmentSize <= 0.5 * optimalSegmentSize) {
// Quicker ramp up
newNumRows = numRowsConsumed * 3/2;
} else if (segmentSize >= 2 * optimalSegmentSize) {
// Even quicker ramp down
newNumRows = numRowsConsumed/2;
} else {
newNumRows = (int) (optimalSegmentSize / sizeToRowsRatio);
}
return newNumRows;
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]