walterddr opened a new issue, #11921: URL: https://github.com/apache/pinot/issues/11921
When we send data over the mailboxes, we estimate the data size and cut the inbound messages into chunks. However, the per-row estimate is just a fixed constant per column:

```java
// Use estimated row size, this estimate is not accurate and is used to estimate numRowsPerChunk only.
int estimatedRowSizeInBytes = block.getDataSchema().getColumnNames().length * MEDIAN_COLUMN_SIZE_BYTES;
int numRowsPerChunk = maxBlockSize / estimatedRowSizeInBytes;
while (currentRow < totalNumRows) {
  List<Object[]> chunk = allRows.subList(currentRow, Math.min(currentRow + numRowsPerChunk, allRows.size()));
```

This is not an accurate estimate when there are high-cardinality string/bytes columns whose values can be very large. A simple solution is to use the first row to estimate the row size when variable-length columns are found (see the sketch below), but:
- there's no easy way to tell the cardinality
- computing the size of an `Object[]` row is expensive, since it requires looping through every value
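A minimal sketch of the first-row idea, for illustration only: `estimateRowSizeInBytes` is a hypothetical helper, not existing Pinot code, and `MEDIAN_COLUMN_SIZE_BYTES` is assumed to be the same fixed fallback constant used in the snippet above.

```java
import java.nio.charset.StandardCharsets;

public final class RowSizeEstimator {
  // Assumed fallback for fixed-width columns, standing in for Pinot's MEDIAN_COLUMN_SIZE_BYTES.
  private static final int MEDIAN_COLUMN_SIZE_BYTES = 8;

  // Estimate the serialized size of one row by inspecting each value: variable-length
  // types (String / byte[]) use their actual length, everything else falls back to the
  // fixed per-column constant. Note the full loop over the row, which is exactly the
  // cost concern raised above.
  static int estimateRowSizeInBytes(Object[] row) {
    int size = 0;
    for (Object value : row) {
      if (value instanceof String) {
        size += ((String) value).getBytes(StandardCharsets.UTF_8).length;
      } else if (value instanceof byte[]) {
        size += ((byte[]) value).length;
      } else {
        size += MEDIAN_COLUMN_SIZE_BYTES;
      }
    }
    // Avoid divide-by-zero when this feeds into numRowsPerChunk.
    return Math.max(size, 1);
  }
}
```

With a helper like that, the chunking code could do something like `int numRowsPerChunk = Math.max(1, maxBlockSize / RowSizeEstimator.estimateRowSizeInBytes(allRows.get(0)));`, trading one extra row scan for a better estimate whenever the first row is representative; it still misreads blocks whose value sizes vary a lot from row to row.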