Re: New proposal for type definitons

Hitesh Khamesra Wed, 04 Jan 2017 10:48:01 -0800

Hi Dan:
If data is self-describing, then it is not necessary to map that data to pdx. 
That translation would just add an unnecessary mapping layer. The primary 
purpose of the Pdx format is that a field value can be read without 
deserializing the whole value. We just need to expose some API to "get field 
value" and let users implement this interface for any format (as you pointed 
out, it can be any format, such as bson, google-protobuf, etc.).
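
As a rough illustration of that "get field value" idea, here is a minimal Python sketch. Everything in it is hypothetical: `FieldReader`, `JsonFieldReader`, `LengthPrefixedFieldReader` and the toy binary layout are invented for this note and are not Geode APIs. It only shows how one interface could sit in front of several formats, and how a binary format can skip unrelated fields without decoding them.

```python
import json
import struct
from abc import ABC, abstractmethod
from typing import Any, Dict


class FieldReader(ABC):
    """Hypothetical plug-in interface: one implementation per storage format."""

    @abstractmethod
    def get_field_value(self, data: bytes, field_name: str) -> Any:
        """Return the value of field_name from the serialized bytes, or None."""


class JsonFieldReader(FieldReader):
    # JSON carries no offsets, so this simple version parses the whole document.
    def get_field_value(self, data: bytes, field_name: str) -> Any:
        return json.loads(data.decode("utf-8")).get(field_name)


class LengthPrefixedFieldReader(FieldReader):
    """Toy self-describing format (an assumption, not a real wire format):
    repeated records of (2-byte name length, name, 4-byte value length, value),
    where every value is a UTF-8 string."""

    @staticmethod
    def encode(fields: Dict[str, str]) -> bytes:
        out = bytearray()
        for name, value in fields.items():
            n, v = name.encode("utf-8"), value.encode("utf-8")
            out += struct.pack(">H", len(n)) + n + struct.pack(">I", len(v)) + v
        return bytes(out)

    def get_field_value(self, data: bytes, field_name: str) -> Any:
        target = field_name.encode("utf-8")
        pos = 0
        while pos < len(data):
            (name_len,) = struct.unpack_from(">H", data, pos)
            pos += 2
            name = data[pos:pos + name_len]
            pos += name_len
            (value_len,) = struct.unpack_from(">I", data, pos)
            pos += 4
            if name == target:
                return data[pos:pos + value_len].decode("utf-8")
            pos += value_len  # skip this value without decoding it
        return None
```

A query layer would then only need to call `get_field_value` and would not have to know which format is behind it.
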
As for JSONFormatter, it maps JSON data to pdx and it works very well with 
geode. But there are some issues with pdx typeid generation which we need to 
tackle separately. Here are some of them:

1. One String field can generate three typeids, as it can be in three 
states (fieldExist, null-value, fieldNotExist). So if one JSON document has 10 
such fields, then theoretically it can generate 3^10 (about 59,000) pdx 
typeids. We are now planning to fix this issue.

2. We create a pdx typeid for each integer value by its range (byte, short, 
int). Changing this can help to reduce the number of pdx typeids. Btw, this 
also creates a problem with the query engine, as the query engine cares about 
type.

3. Field ordering in JSON document can create different pdx typeids. Possibly 
we can solve this issue by sorting the fields.
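
The effect of issues 2 and 3 can be sketched with a toy "type signature" in Python. This is not how Geode computes pdx typeids; the function names are invented for illustration. Sorting fields is the fix suggested above, while widening every integer to one type is only one possible normalization, assumed here.

```python
from typing import Any, Dict, Tuple


def int_type_by_range(value: int) -> str:
    """Narrowest integer type name that can hold value (the behaviour in issue 2)."""
    if -2**7 <= value < 2**7:
        return "byte"
    if -2**15 <= value < 2**15:
        return "short"
    if -2**31 <= value < 2**31:
        return "int"
    return "long"


def field_type(value: Any, widen_ints: bool) -> str:
    if isinstance(value, bool):  # bool is a subclass of int in Python, so check it first
        return "boolean"
    if isinstance(value, int):
        return "long" if widen_ints else int_type_by_range(value)
    if isinstance(value, str):
        return "String"
    if value is None:
        return "null"
    return type(value).__name__


def type_signature(doc: Dict[str, Any], sort_fields: bool = False,
                   widen_ints: bool = False) -> Tuple[Tuple[str, str], ...]:
    """Toy stand-in for a typeid: the tuple of (field name, field type) pairs."""
    fields = [(name, field_type(value, widen_ints)) for name, value in doc.items()]
    if sort_fields:
        fields.sort()
    return tuple(fields)


docs = [
    {"id": 1, "name": "a"},
    {"name": "b", "id": 2},      # same fields, different order (issue 3)
    {"id": 70000, "name": "c"},  # id no longer fits in a short (issue 2)
]
naive = {type_signature(d) for d in docs}
normalized = {type_signature(d, sort_fields=True, widen_ints=True) for d in docs}
print(len(naive), len(normalized))  # prints: 3 1
```

The sketch does not address issue 1; a null or missing field still yields a separate signature.
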
Thanks.
Hitesh.

      From: Dan Smith <[email protected]>
 To: [email protected]; Hitesh Khamesra <[email protected]> 
 Sent: Tuesday, January 3, 2017 4:46 PM
 Subject: Re: New proposal for type definitons

Hi Hitesh,

There are a few different ways to store self describing data. One way might be 
to just store the json string, or convert it to bson, and then enhance the 
query engine to handle those formats. Another way might be to extend PDX to 
support self describing serialized values. We could add a selfDescribing 
boolean flag to RegionService.createPdxInstanceFactory. If that flag is set, we 
will not register the PDX type in the type registry but instead store it as 
part of the value. The JSONFormatter could set that flag to true or expose it 
as an option. 
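
To make the trade-off concrete, here is a toy Python model of the two storage modes. It is only a sketch: `ToyTypeRegistry`, `serialize` and `deserialize` are hypothetical names, JSON is used as the wire format purely for readability, and none of this reflects the real PDX encoding or the signature the flag would actually have.

```python
import json
from typing import Any, Dict, Tuple


class ToyTypeRegistry:
    """Stand-in for a central type registry: maps field-name tuples to ids."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, ...], int] = {}
        self._by_id: Dict[int, Tuple[str, ...]] = {}

    def id_for(self, field_names: Tuple[str, ...]) -> int:
        if field_names not in self._ids:
            type_id = len(self._ids) + 1
            self._ids[field_names] = type_id
            self._by_id[type_id] = field_names
        return self._ids[field_names]

    def fields_for(self, type_id: int) -> Tuple[str, ...]:
        return self._by_id[type_id]

    def __len__(self) -> int:
        return len(self._ids)


def serialize(doc: Dict[str, Any], registry: ToyTypeRegistry,
              self_describing: bool = False) -> bytes:
    names = tuple(doc)
    values = list(doc.values())
    if self_describing:
        # Field names travel with the value; the registry is never touched.
        payload = {"names": list(names), "values": values}
    else:
        # Only a type id is stored; the field names live once in the registry.
        payload = {"type_id": registry.id_for(names), "values": values}
    return json.dumps(payload).encode("utf-8")


def deserialize(data: bytes, registry: ToyTypeRegistry) -> Dict[str, Any]:
    payload = json.loads(data.decode("utf-8"))
    if "names" in payload:
        names = payload["names"]
    else:
        names = registry.fields_for(payload["type_id"])
    return dict(zip(names, payload["values"]))
```

In this model, documents written with `self_describing=True` never grow the registry, at the cost of repeating the field names in every stored value.
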

Storing self describing documents is a different approach than Udo's original 
proposal. I do agree there is value in being able to store consistently 
structured json documents the way we do now to save memory. I think maybe I 
would be happier if the original proposal was more of an external tool or 
wrapper focused on sanitizing json documents without being concerned with type 
ids or a central registry service. I could picture just having a single 
sanitize method that takes a json string and a standard json schema and returns 
a cleaned up json document. That seems like it would be a lot easier to 
implement and wouldn't require the user to add typeIds to their json documents.
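
A minimal sketch of what such a sanitize method could look like, in Python with only the standard library. What "cleaned up" means is an assumption here: unknown fields are dropped, missing or mistyped fields become null, and fields are emitted in schema order. Only the `properties` and simple string-valued `type` keywords of JSON Schema are handled.

```python
import json
from typing import Any, Dict

# Maps JSON Schema "type" keywords to Python types (a small subset).
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def _matches(value: Any, json_type: str) -> bool:
    if isinstance(value, bool) and json_type in ("integer", "number"):
        return False  # bool is an int subclass in Python but not a JSON number
    return isinstance(value, _JSON_TYPES[json_type])


def sanitize(json_text: str, schema: Dict[str, Any]) -> str:
    """Return a cleaned-up copy of a JSON object, shaped by the schema."""
    doc = json.loads(json_text)
    cleaned = {}
    # Iterating the schema, not the document, fixes the field order and
    # drops fields the schema does not mention.
    for name, spec in schema.get("properties", {}).items():
        value = doc.get(name)
        json_type = spec.get("type")
        known_type = isinstance(json_type, str) and json_type in _JSON_TYPES
        if value is not None and known_type and not _matches(value, json_type):
            value = None  # assumption: mistyped values are nulled, not rejected
        cleaned[name] = value
    return json.dumps(cleaned)
```

Because every sanitized document then has the same fields in the same order, documents that share a schema would map to far fewer distinct shapes.
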

I still feel like storing self describing values might serve more users. It is 
probably more work than a simple sanitize method like above though.

-Dan

On Tue, Jan 3, 2017 at 4:07 PM, Hitesh Khamesra <[email protected]> 
wrote:

>> If we give people the option to store
>> and query self describing values, then users with inconsistent json could
>> just use that option and pay the extra storage cost.

Dan, are you saying we should expose some interface to serialize/deserialize 
the data and to query a field in it, e.g. getFieldValue(fieldname)? Some sort 
of ExternalSerializer with a getFieldValue() capability?

      From: Dan Smith <[email protected]>
 To: [email protected]
 Sent: Wednesday, December 21, 2016 6:20 PM
 Subject: Re: New proposal for type definitons

I'm assuming the type ids here are a different set than the type ids used
with regular PDX serialization so they won't conflict if the pdx registry
assigns 1 to some class and a user puts @typeId: 1 in their json?

I'm concerned that this won't really address the type explosion issue.
Users that are able to go to the effort of adding these typeIds to all of
their json are probably users that can produce consistently formatted json
in the first place. Users that have inconsistently formatted json are
probably not going to want or be able to add these type ids.

It might be better for us to pursue a way to store arbitrary documents that
are self describing. Our current approach for json documents is assuming
that the documents are all consistently formatted. We infer a schema
for their documents and store the field names in the type registry and the
field values in the serialized data. If we give people the option to store
and query self describing values, then users with inconsistent json could
just use that option and pay the extra storage cost.

-Dan

On Tue, Dec 20, 2016 at 4:53 PM, Udo Kohlmeyer <[email protected]> wrote:

> Hey there,
>
> I've just completed a new proposal on the wiki for a new mechanism that
> could be used to define a type definition for an object.
> https://cwiki.apache.org/confluence/display/GEODE/Custom+External+Type+Definition+Proposal+for+JSON
>
> Primarily the new type definition proposal will hopefully help with the
> "structuring" of JSON document definitions in a manner that will allow
> users to submit JSON documents for data types without the need to provide
> every field of the whole domain object type.
>
> Please review and comment as required.
>
> --Udo
>
>
