siddharthteotia commented on pull request #6710: URL: https://github.com/apache/incubator-pinot/pull/6710#issuecomment-806259732
> @mcvsubbu Just found a defect of using enum value as key and encode trailer as `(int, int, bytes/blob in utf-8) `: > > * We are able to add new key into the enum, without bumping up version > * We are able to not include a key into trailer, without bumping up version > * **However, we are unable to remove a key from the enum (if the key is no long used in a future version)** > > Namely, say we now have three keys: > > ``` > // old version: > enum { > key1, > key2, > key3, > } > ``` > > Now if we remove key2 from the enum since it's no longer been used. > > ``` > // new version > enum { > key1, > key3, > } > ``` > > Then, when new broker receive bytes from old server, it will interpret value of k2 as value of k3. > > So a better solution is using string as key and encode trailer as `(int of key length, bytes of key in utf-8, int of value length, bytes of value in utf-8)`. Which is exactly how we encode metadata in V2. > > However, if we do it in his way, it's equivalent to just moving metadata section to the end of datatable, which does not make too much sense to bump up a version just for rearranging sections in datatable. > > Let's take a step back to what we wanner solve: > > * we wanner add serialization_cost to datatable, but serialization_cost is not available before serialization. > * we wanner keep back-comp > > To add serialization_cost to datatable after serialization, basically we have two options: > > * append it to the end of bytes. > * put a temporary value of serialization_cost when serialization, after serialization is done, replace it as the actual value. > > So, here is another approach: > > * don't add a trailer section > * put serialization_cost into metedata > * we serialize metedata, in V2 we encode it as `(int of key length, bytes of key in utf-8, int of value length, bytes of value in utf-8)`. Encoding in this way makes value replacement after serialization impossible, since `String.valueOf("1000").length() != String.valueOf("100000").length()`. > * In V3, keep all existing logic. However, if the value is long, we should encode it as `(int of key length, bytes of key in utf-8, toBigEndian(longValue))`. And the the function of `serializaMetadata()`, we can have a variable to record the start offset of serialization_cost. > > ``` > bytes[] bytes; > int serialization_cost_value_start_offset; > > offset = 0; > for (String key: metadata.keySet()) { > keybytes[] = to-utf8(key); > bytes.append(keybytes.length()) > bytes.append(keybytes) > > offset += 4; > offset += keybytes.length > > if (key.equals("erialization_cost")) { > serialization_cost_value_start_offset = offset; > valuebytes = toBigEndian(value); > bytes.append(valuebytes) > offset += 8; > } else { > valuebytes = to-utf8(value); > bytes.append(valuebytes.length()) > bytes.append(valuebytes) > offset += 4 > offset += keybytes.length > } > } > ``` > > So after serialization, we are able to replace the value of serialization_cost (`toBigEndian(longValue)` is always 8 bytes, which makes replacement possible): > > ``` > offset = metadataStartOffset+serialization_cost_value_start_offset > bytes[offset:offset+8] = toBigEndian(actualValue) > ``` @mqliang @mcvsubbu I don't think we should worry about or even allow removal of enums. It is complicating the design plus it's something that is typically not allowed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org