Hi Dan:
If data is self-describing, then it is not necessary to map that data to PDX;
the translation just adds an unnecessary mapping layer. The primary purpose of
the PDX format is that a field value can be read without deserializing the
whole value. So we just need to expose an API to get a field value and let the
user implement that interface for any format (as you pointed out, it can be any
format, such as BSON or Google protobuf).
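
To make the "get FieldValue" idea concrete, here is a minimal sketch. The
names FieldReader and KeyValueFieldReader are hypothetical and are not part of
Geode, and the "name=value;name=value" encoding is a toy format chosen only to
keep the example self-contained. A real implementation would wrap BSON,
protobuf, or another format instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical interface (assumption, not a Geode API): reads one field from
// a serialized value without building a domain object.
interface FieldReader {
    Optional<String> getFieldValue(byte[] serialized, String fieldName);
}

// Toy implementation for a "name=value;name=value" UTF-8 encoding.
final class KeyValueFieldReader implements FieldReader {
    @Override
    public Optional<String> getFieldValue(byte[] serialized, String fieldName) {
        String text = new String(serialized, StandardCharsets.UTF_8);
        for (String pair : text.split(";")) {
            int eq = pair.indexOf('=');
            // Compare the part before '=' with the requested field name.
            if (eq > 0 && pair.substring(0, eq).equals(fieldName)) {
                return Optional.of(pair.substring(eq + 1));
            }
        }
        return Optional.empty(); // field not present in this value
    }
}
```

With something like this, the query engine would only depend on FieldReader,
and each format would supply its own implementation.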
JSONFormatter maps JSON data to PDX, and that works very well with Geode.
But there are some issues with PDX typeid generation which we need to tackle
separately. Here are some of them:
1. One String field can generate three typeids, as it can be in three
states (fieldExist, null-value, fieldNotExist). So if one JSON document has 10
fields, then theoretically it can generate up to 3^10 = 59,049 pdx typeids. We
are now planning to fix this issue.
2. We create a pdx type id for each integer value by its range (byte, short,
int). This can help to reduce the number of pdx typeids. Btw, this creates a
problem with the query engine as well, as the query engine cares about type.
3. Field ordering in a JSON document can create different pdx typeids. Possibly
we can solve this issue by sorting the fields.
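
As a rough illustration of item 3, the sketch below derives a type signature
from field names, once in document order and once after sorting. The class
TypeSignatures and its methods are hypothetical helpers for this email, not
how Geode computes PDX type ids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (assumption): a "signature" is just the joined field names.
final class TypeSignatures {
    // Signature in document order: reordering the fields changes the result.
    static String ordered(List<String> fieldNames) {
        return String.join(",", fieldNames);
    }

    // Signature after sorting: field order in the document no longer matters.
    static String sorted(List<String> fieldNames) {
        List<String> copy = new ArrayList<>(fieldNames);
        Collections.sort(copy);
        return String.join(",", copy);
    }
}
```

Two documents with fields (name, id) and (id, name) get different ordered
signatures but the same sorted one, so they would share a single type.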
Thanks,
Hitesh
From: Dan Smith <[email protected]>
To: [email protected]; Hitesh Khamesra <[email protected]>
Sent: Tuesday, January 3, 2017 4:46 PM
Subject: Re: New proposal for type definitons
Hi Hitesh,
There are a few different ways to store self describing data. One way might be
to just store the json string, or convert it to bson, and then enhance the
query engine to handle those formats. Another way might be to extend PDX to
support self describing serialized values. We could add a selfDescribing
boolean flag to RegionService.createPdxInstanceFactory. If that flag is set, we
will not register the PDX type in the type registry but instead store it as
part of the value. The JSONFormatter could set that flag to true or expose it
as an option.
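
Roughly, the difference between the two storage modes could look like the
sketch below. TypeRegistrySketch and StoredValueSketch are hypothetical
stand-ins written for this email; the selfDescribing flag is only the proposal
above and does not exist on RegionService.createPdxInstanceFactory today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a central type registry; not Geode's implementation.
final class TypeRegistrySketch {
    private final List<List<String>> fieldLists = new ArrayList<>();

    // Returns the id of an identical field list, registering it if it is new.
    int idFor(List<String> fieldNames) {
        int existing = fieldLists.indexOf(fieldNames);
        if (existing >= 0) {
            return existing;
        }
        fieldLists.add(new ArrayList<>(fieldNames));
        return fieldLists.size() - 1;
    }

    int size() {
        return fieldLists.size();
    }
}

// Hypothetical stored form of one document.
final class StoredValueSketch {
    final Integer typeId;           // non-null when field names live in the registry
    final List<String> fieldNames;  // non-null when the value is self describing
    final List<Object> fieldValues;

    private StoredValueSketch(Integer typeId, List<String> names, List<Object> values) {
        this.typeId = typeId;
        this.fieldNames = names;
        this.fieldValues = values;
    }

    static StoredValueSketch store(Map<String, Object> doc, boolean selfDescribing,
                                   TypeRegistrySketch registry) {
        List<String> names = new ArrayList<>(doc.keySet());
        List<Object> values = new ArrayList<>(doc.values());
        if (selfDescribing) {
            // Field names travel with the value; the registry is never touched.
            return new StoredValueSketch(null, names, values);
        }
        // Field names go to the registry; the value keeps only a type id.
        return new StoredValueSketch(registry.idFor(names), null, values);
    }
}
```

The self describing form costs more bytes per value, but inconsistent
documents no longer add entries to the registry.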
Storing self describing documents is a different approach than Udo's original
proposal. I do agree there is value in being able to store consistently
structured json documents the way we do now to save memory. I think maybe I
would be happier if the original proposal was more of an external tool or
wrapper focused on sanitizing json documents without being concerned with type
ids or a central registry service. I could picture just having a single
sanitize method that takes a json string and a standard json schema and returns
a cleaned up json document. That seems like it would be a lot easier to
implement and wouldn't require the user to add typeIds to their json documents.
I still feel like storing self describing values might serve more users. It is
probably more work than a simple sanitize method like above though.
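
For a sense of scale, a single sanitize method might be as small as the sketch
below. It makes several assumptions that are not in the proposal: the JSON
document is already parsed into a Map, the "schema" is just a map from field
name to expected Java type rather than a real JSON Schema, and missing or
mistyped fields are replaced with null so every output has the same fields in
the same order. JsonSanitizerSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sanitizer (assumption): operates on already-parsed documents.
final class JsonSanitizerSketch {
    static Map<String, Object> sanitize(Map<String, Object> doc,
                                        Map<String, Class<?>> schema) {
        Map<String, Object> cleaned = new LinkedHashMap<>();
        // Iterate the schema, not the document: fields outside the schema are
        // dropped and the output follows the schema's field order.
        for (Map.Entry<String, Class<?>> field : schema.entrySet()) {
            Object value = doc.get(field.getKey());
            boolean matches = value != null && field.getValue().isInstance(value);
            cleaned.put(field.getKey(), matches ? value : null);
        }
        return cleaned;
    }
}
```

Because the output shape depends only on the schema, no typeIds would need to
be added to the documents themselves.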
-Dan
On Tue, Jan 3, 2017 at 4:07 PM, Hitesh Khamesra <[email protected]>
wrote:
>>If we give people the option to store
and query self describing values, then users with inconsistent json could
just use that option and pay the extra storage cost.
Dan, are you saying we should expose some interface to serialize/deserialize
the data and to query a field in it, e.g. getFieldValue(fieldname)? Some sort
of ExternalSerializer with getFieldValue() capability.
From: Dan Smith <[email protected]>
To: [email protected]
Sent: Wednesday, December 21, 2016 6:20 PM
Subject: Re: New proposal for type definitons
I'm assuming the type ids here are a different set than the type ids used
with regular PDX serialization so they won't conflict if the pdx registry
assigns 1 to some class and a user puts @typeId: 1 in their json?
I'm concerned that this won't really address the type explosion issue.
Users that are able to go to the effort of adding these typeIds to all of
their json are probably users that can produce consistently formatted json
in the first place. Users that have inconsistently formatted json are
probably not going to want or be able to add these type ids.
It might be better for us to pursue a way to store arbitrary documents that
are self describing. Our current approach for json documents assumes
that the documents are all consistently formatted. We infer a schema
for their documents, then store the field names in the type registry and the
field values in the serialized data. If we give people the option to store
and query self describing values, then users with inconsistent json could
just use that option and pay the extra storage cost.
-Dan
On Tue, Dec 20, 2016 at 4:53 PM, Udo Kohlmeyer <[email protected]> wrote:
> Hey there,
>
> I've just completed a new proposal on the wiki for a new mechanism that
> could be used to define a type definition for an object.
> https://cwiki.apache.org/confluence/display/GEODE/Custom+External+Type+Definition+Proposal+for+JSON
>
> Primarily the new type definition proposal will hopefully help with the
> "structuring" of JSON document definitions in a manner that will allow
> users to submit JSON documents for data types without the need to provide
> every field of the whole domain object type.
>
> Please review and comment as required.
>
> --Udo
>
>