One correction to Jake's comment about PDX: the cost to access a field does not change if it is the last field. Accessing any variable-width field costs the same, except the first, which is slightly cheaper. PDX stores an index at the end of the blob so that it can look up any field without reading the preceding fields.
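To make the trailing-index point concrete, here is a minimal sketch (class name and layout are hypothetical, purely to illustrate the idea; this is not Geode's actual PDX wire format):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy model of the layout described above: variable-width field bytes are
// written first, then a table of int offsets is appended at the end of the
// blob. Looking up field i reads only the trailing index, never the
// preceding fields.
public class OffsetIndexedBlob {

    // Serialize variable-width string fields, then append their offsets.
    static byte[] write(String... fields) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        int[] offsets = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            offsets[i] = data.size();
            byte[] bytes = fields[i].getBytes(StandardCharsets.UTF_8);
            data.write(bytes, 0, bytes.length);
        }
        ByteBuffer out = ByteBuffer.allocate(data.size() + 4 * fields.length);
        out.put(data.toByteArray());
        for (int offset : offsets) {
            out.putInt(offset);  // the trailing index
        }
        return out.array();
    }

    // Constant-cost lookup: jump straight to field i via the index.
    static String read(byte[] blob, int i, int fieldCount) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        int indexStart = blob.length - 4 * fieldCount;
        int start = buf.getInt(indexStart + 4 * i);
        int end = (i + 1 < fieldCount)
                ? buf.getInt(indexStart + 4 * (i + 1))
                : indexStart;
        return new String(blob, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Reading the last field only touches the offset table, which is why a field's position in the blob does not change its access cost.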
I agree with Jake that using the type with the most fields could lead to blob bloat. If you stored every JSON number as a BigDecimal, it would be stored in PDX as an Object field. Since PDX allows an Object field to contain any object, instead of always using BigDecimal you could use a mix of the other Number implementations and still have just one PDX type with an Object field. A field of "0" could be serialized as a Byte, which would end up using 3 bytes most of the time. BigDecimal is optimized by Geode serialization, so a field of "0" would only take an additional 2 bytes.

On Thu, Jan 5, 2017 at 10:31 AM, Jacob Barrett <jbarr...@pivotal.io> wrote:
> If we are simply looking at ways to avoid the PDX type bloat then some
> quick wins would be:
> Presort JSON field names or remove the ordering dependency in PDX type
> matching. I looked into removing or working around the ordering a while
> ago when dealing with GPDB integration.
> Stop this silliness of trying to conserve space by putting small numbers
> into smaller int fields and force all JSON numbers to be serialized as
> BigDecimal. JSON does not define any other type of number, so why are we
> trying to?
> Don't parse time in JSON. There is no standard or type for time in JSON.
>
> This won't solve all bloat, like that resulting from added or removed
> fields, but having a superset of those fields defined in the PDX metadata
> will only cause memory bloat in storage. If your PDX type defines 100
> fields but your PDX instance only populates 1, the serialized form still
> records the null values for the other 99 fields. If your 1 field happens
> to be the last field then your performance goes to crap too, since a
> getValue call has to walk the entire structure from the first
> variable-length field to find the field you are looking for in the stream.
> You are way better off with more types that define these smaller subsets
> of the document than one superset.
> Optimizations should be made in the lookup in the PDX registry.
>
> -Jake
>
> On Wed, Jan 4, 2017 at 12:12 PM Udo Kohlmeyer <u...@apache.org> wrote:
> >
> > I like the idea of the self-describing type, and that if you are
> > willing to take the memory hit, then storing the type definition as
> > part of the data entry works. Although it WILL cause huge headaches:
> >
> > * significant increase in memory
> > * performance degradation, not only in the processing of the entry but
> >   in every network hop it would incur
> >
> > My proposal, even if incredibly veiled under the JSON banner, was a type
> > catalogue that would replace the current PDXTypeRegistry and that could
> > form the basis of a greater data type service. One that does not only
> > help with the serialization but also in converting from one type to
> > another (JSON->PDX) and formatting (import/export) of decimal and date
> > fields. I even had the idea that it could store things like secure
> > fields (masking and obscuring of non-authorized access). But I see that
> > this falls squarely in the realm of security and should not be put in
> > the catalogue.
> >
> > I believe that we could live in the "best-of-both-worlds", where you
> > could define (or have it automatically define) a type definition. Then
> > the current logic would continue working as is. IF one then decides
> > that it is too much effort, or the structure cannot be concretely
> > defined, or one just does not care, then the "self-describing" entry
> > type can be used, with the added memory footprint, etc.
> >
> > Serialization has always been something that was supposed to be
> > pluggable. The serialization framework would just take the data entry
> > and (de)serialize it. The serialization framework would use the type
> > catalogue to (de)serialize the data, like we currently do with PDX. The
> > order of fields for the data is specified and we know how to
> > (de)serialize the data.
> >
> > Improving the current JSONFormatter was really a start, a way to
> > improve the definition and usage of other, non-POJO-like, structures.
> > We are currently facing the following problems:
> >
> > * Too many types created due to inconsistently structured JSON documents
> > * DateTime fields incorrectly processed due to missing the time
> >   component on import (import/export)
> > * Decimal field formatting
> >
> > @Jake, I like the idea of having a FieldReadable interface (we can work
> > on the naming though). Then we can start getting some conformity around
> > how we access data regardless of what type of object is stored.
> >
> > On 1/4/17 07:14, William Markito Oliveira wrote:
> > > I think BSON already stores the field names within the serialized
> > > data values, which is indeed more generic but would of course take
> > > more space.
> > >
> > > These conversations are very interesting, especially considering how
> > > many popular serialization formats exist out there (Parquet, Avro,
> > > Protobuf, etc.), but I'm not sure the serialization itself was the
> > > main thing with Udo's proposal, and more the problem that today
> > > JSONFormatter + PDXTypes is the only way to do it and it could cause
> > > the "explosion of types" on unstructured data.
> > >
> > > Seems to me that fixing the JSONFormatter to be smarter about it is a
> > > quick path, but it would not address the whole picture of making
> > > serialization options modular in Geode, which could be its own new
> > > proposal as well. Just a thought.
> > >
> > > On Tue, Jan 3, 2017 at 7:21 PM, Jacob Barrett <jbarr...@pivotal.io> wrote:
> > >
> > >> I don't know that I would be concerned with optimization of
> > >> unstructured data from the start. Given that the data is
> > >> unstructured, it means that it can be restructured at a later time.
> >> You could have a lazy task running on the server that restructures
> >> unstructured data to be more uniform and compact.
> >>
> >> I also don't think there are many good reasons to try to wedge this
> >> into PDX. The only reason I see for wedging this into PDX is to avoid
> >> progress on modularizing and extending Geode.
> >>
> >> If all the places where we access fields on a stored object (query,
> >> indexing, etc.) were made a bit more generic, then any object that
> >> supports a simple getValue(field)-like interface could be accessed
> >> without deserialization or specialization.
> >>
> >> Consider:
> >> public interface FieldReadable {
> >>   public Object getValue(String field);
> >> }
> >>
> >> You could have an implementation that can getValue on PDX, POJO, JSON,
> >> BSON, XML, etc. There is no concern at this level with the underlying
> >> storage type or the original unserialized form of the object (if any).
> >>
> >> -Jake
> >>
> >> On Tue, Jan 3, 2017 at 4:46 PM Dan Smith <dsm...@pivotal.io> wrote:
> >>
> >>> Hi Hitesh,
> >>>
> >>> There are a few different ways to store self-describing data. One way
> >>> might be to just store the JSON string, or convert it to BSON, and
> >>> then enhance the query engine to handle those formats. Another way
> >>> might be to extend PDX to support self-describing serialized values.
> >>> We could add a selfDescribing boolean flag to
> >>> RegionService.createPdxInstanceFactory. If that flag is set, we will
> >>> not register the PDX type in the type registry but instead store it
> >>> as part of the value. The JSONFormatter could set that flag to true
> >>> or expose it as an option.
> >>>
> >>> Storing self-describing documents is a different approach than Udo's
> >>> original proposal.
> >>> I do agree there is value in being able to store consistently
> >>> structured JSON documents the way we do now to save memory. I think
> >>> maybe I would be happier if the original proposal was more of an
> >>> external tool or wrapper focused on sanitizing JSON documents,
> >>> without being concerned with type ids or a central registry service.
> >>> I could picture just having a single sanitize method that takes a
> >>> JSON string and a standard JSON Schema <http://json-schema.org/> and
> >>> returns a cleaned-up JSON document. That seems like it would be a lot
> >>> easier to implement and wouldn't require the user to add typeIds to
> >>> their JSON documents.
> >>>
> >>> I still feel like storing self-describing values might serve more
> >>> users. It is probably more work than a simple sanitize method like
> >>> the above, though.
> >>>
> >>> -Dan
> >>>
> >>> On Tue, Jan 3, 2017 at 4:07 PM, Hitesh Khamesra
> >>> <hitesh...@yahoo.com.invalid> wrote:
> >>>>
> >>>> >> If we give people the option to store and query self describing
> >>>> >> values, then users with inconsistent json could just use that
> >>>> >> option and pay the extra storage cost.
> >>>>
> >>>> Dan, are you saying we should expose some interface to
> >>>> serialize/deserialize and to query a field in the data, i.e.
> >>>> getFieldValue(fieldname)? Some sort of ExternalSerializer with
> >>>> getFieldValue() capability.
> >>>>
> >>>> From: Dan Smith <dsm...@pivotal.io>
> >>>> To: dev@geode.apache.org
> >>>> Sent: Wednesday, December 21, 2016 6:20 PM
> >>>> Subject: Re: New proposal for type definitions
> >>>>
> >>>> I'm assuming the type ids here are a different set than the type ids
> >>>> used with regular PDX serialization, so they won't conflict if the
> >>>> PDX registry assigns 1 to some class and a user puts @typeId: 1 in
> >>>> their JSON?
> >>>>
> >>>> I'm concerned that this won't really address the type explosion
> >>>> issue. Users that are able to go to the effort of adding these
> >>>> typeIds to all of their JSON are probably users that can produce
> >>>> consistently formatted JSON in the first place. Users that have
> >>>> inconsistently formatted JSON are probably not going to want or be
> >>>> able to add these type ids.
> >>>>
> >>>> It might be better for us to pursue a way to store arbitrary
> >>>> documents that are self-describing. Our current approach for JSON
> >>>> documents assumes that the documents are all consistently formatted.
> >>>> We infer a schema for the documents, store the field names in the
> >>>> type registry, and store the field values in the serialized data. If
> >>>> we give people the option to store and query self-describing values,
> >>>> then users with inconsistent JSON could just use that option and pay
> >>>> the extra storage cost.
> >>>>
> >>>> -Dan
> >>>>
> >>>> On Tue, Dec 20, 2016 at 4:53 PM, Udo Kohlmeyer <ukohlme...@gmail.com> wrote:
> >>>>
> >>>>> Hey there,
> >>>>>
> >>>>> I've just completed a new proposal on the wiki for a new mechanism
> >>>>> that could be used to define a type definition for an object.
> >>>>> https://cwiki.apache.org/confluence/display/GEODE/Custom+
> >>>>> External+Type+Definition+Proposal+for+JSON
> >>>>>
> >>>>> Primarily, the new type definition proposal will hopefully help
> >>>>> with the "structuring" of JSON document definitions in a manner
> >>>>> that will allow users to submit JSON documents for data types
> >>>>> without the need to provide every field of the whole domain object
> >>>>> type.
> >>>>>
> >>>>> Please review and comment as required.
> >>>>>
> >>>>> --Udo
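The idea in the reply at the top of this thread, of keeping one PDX type with a single Object field and serializing each JSON number as the narrowest Number implementation that holds it, could be sketched roughly like this (class and method names are hypothetical, for illustration only; Geode's JSONFormatter does not work exactly this way):

```java
import java.math.BigDecimal;

// Sketch: choose the narrowest boxed Number for a JSON numeric literal,
// falling back to BigDecimal for fractional or out-of-range values. One
// PDX type with a single Object field can then hold any of these.
public class JsonNumberNarrower {

    static Number narrow(String jsonNumber) {
        BigDecimal value = new BigDecimal(jsonNumber);
        try {
            // longValueExact() throws ArithmeticException for fractional
            // values and for values outside the long range.
            long asLong = value.longValueExact();
            if (asLong >= Byte.MIN_VALUE && asLong <= Byte.MAX_VALUE) {
                return (byte) asLong;   // e.g. "0" fits in a single byte
            }
            if (asLong >= Short.MIN_VALUE && asLong <= Short.MAX_VALUE) {
                return (short) asLong;
            }
            if (asLong >= Integer.MIN_VALUE && asLong <= Integer.MAX_VALUE) {
                return (int) asLong;
            }
            return asLong;
        } catch (ArithmeticException notAnExactLong) {
            return value;  // keep full precision as BigDecimal
        }
    }
}
```

Because every result is stored in the same Object field, mixing Byte, Short, Integer, Long, and BigDecimal values this way would not multiply PDX types.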