gortiz commented on code in PR #13303: URL: https://github.com/apache/pinot/pull/13303#discussion_r1670135123
########## pinot-common/src/main/java/org/apache/pinot/common/datablock/BaseDataBlock.java: ########## @@ -426,168 +401,202 @@ public Map<Integer, String> getExceptions() { return _errCodeToExceptionMap; } - /** - * Serialize this data block to a byte array. - * <p> - * In order to deserialize it, {@link DataBlockUtils#getDataBlock(ByteBuffer)} should be used. - */ @Override - public byte[] toBytes() + public List<ByteBuffer> serialize() throws IOException { - ThreadResourceUsageProvider threadResourceUsageProvider = new ThreadResourceUsageProvider(); - - UnsynchronizedByteArrayOutputStream byteArrayOutputStream = new UnsynchronizedByteArrayOutputStream(8192); - DataOutputStream dataOutputStream = new DataOutputStream(byteArrayOutputStream); - writeLeadingSections(dataOutputStream); - - // Write metadata: length followed by actual metadata bytes. - // NOTE: We ignore metadata serialization time in "responseSerializationCpuTimeNs" as it's negligible while - // considering it will bring a lot code complexity. - serializeMetadata(dataOutputStream); - - return byteArrayOutputStream.toByteArray(); - } - - private void writeLeadingSections(DataOutputStream dataOutputStream) - throws IOException { - dataOutputStream.writeInt(getDataBlockVersionType()); - dataOutputStream.writeInt(_numRows); - dataOutputStream.writeInt(_numColumns); - int dataOffset = HEADER_SIZE; - - // Write exceptions section offset(START|SIZE). - dataOutputStream.writeInt(dataOffset); - byte[] exceptionsBytes; - exceptionsBytes = serializeExceptions(); - dataOutputStream.writeInt(exceptionsBytes.length); - dataOffset += exceptionsBytes.length; - - // Write dictionary map section offset(START|SIZE). - dataOutputStream.writeInt(dataOffset); - byte[] dictionaryBytes = null; - if (_stringDictionary != null) { - dictionaryBytes = serializeStringDictionary(); - dataOutputStream.writeInt(dictionaryBytes.length); - dataOffset += dictionaryBytes.length; - } else { - dataOutputStream.writeInt(0); - } - - // Write data schema section offset(START|SIZE). - dataOutputStream.writeInt(dataOffset); - byte[] dataSchemaBytes = null; - if (_dataSchema != null) { - dataSchemaBytes = _dataSchema.toBytes(); - dataOutputStream.writeInt(dataSchemaBytes.length); - dataOffset += dataSchemaBytes.length; - } else { - dataOutputStream.writeInt(0); + if (_serialized == null) { + _serialized = DataBlockUtils.serialize(this); Review Comment: I think the opposite. Having serdes as separate classes: 1. Simplifies the class. BaseDataBlock is very complex, by removing the serialization/deserialization code we simplify this class. 2. Separates responsibilities. * DataBlock defines how to use the block in the OpChain. On a cluster running different versions of Pinot, the memory layout of a data block may be different. * DataBlockSerde defines the layout used to send the blocks through the network. Different Pinot instances may support different block versions, but in order to be able to talk to each other they have to have at least one version in common. 3. Opens the possibility of having different implementations. * Imagine BaseDataBlock includes the serde code. If we want to try a new implementation, we need to modify the code, destroying the older code. That means it is more difficult to test or benchmark the new version. * In the first commits of this PR I included a class that implemented the V1_V2 protocol version but using almost the same code we use right now in master. I also kept the old `toBytes` code. It was easy to benchmark each implementation by having these options. I decided to remove these classes to make this PR simpler, but during development it was useful. * Imagine we create a new serde for the current V1_V2 version that is more efficient. The code does not support it right now but it is very easy to be able to configure which serde we want to use for each version with PinotConfiguration. This means we can try to use one serde by default but change to the other in case we find a bug. > I think keeping it within the class is easier for version control. I don't understand this. Obviously this PR destroys git history. But once we start using this mechanism there will be no difference in terms of git. Imagine that we add a V3 that serializes blocks using Apache Arrow or Parquet. V1_V2 and this theoretical V3 would be super different. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org