Re: Geode - store and query JSON documents

2020-11-22 Thread ankit Soni
Hello geode-devs, please provide guidance on this.

Ankit.

On Sat, 21 Nov 2020 at 10:23, ankit Soni  wrote:

> Hello team,
>
> I am *evaluating Geode (1.12) for storing and querying JSON documents*. I
> am able to store the JSON records in Geode successfully, but I am seeking
> guidance on how to query them.
> Details of the code and a sample JSON document follow:
>
>
> *Sample client-code*
>
> import org.apache.geode.cache.client.ClientCache;
> import org.apache.geode.cache.client.ClientCacheFactory;
> import org.apache.geode.cache.client.ClientRegionShortcut;
> import org.apache.geode.pdx.JSONFormatter;
> import org.apache.geode.pdx.PdxInstance;
>
> public class MyTest {
>
> // NOTE: Truncated JSON; a single document can contain at most an array
> // of col1...col30 (30 different attributes) within data.
> public static final String jsonDoc_2 = "{" +
> "\"data\":[{" +
> "\"col1\": {" +
> "\"k11\": \"aaa\"," +
> "\"k12\":true," +
> "\"k13\": null," + // value elided in the original; null used as a placeholder
> "\"k14\": \"2020-12-31:00:00:00\"" +
> "}," +
> "\"col2\":[{" +
> "\"k21\": \"22\"," +
> "\"k22\": true" +
> "}]" +
> "}]" +
> "}";
>
> // NOTE: col1...col30 are a mix of JSONObject ({}) and JSONArray ([])
> // as shown above in jsonDoc_2.
>
> public static void main(String[] args){
>
> //create client-cache
> ClientCache cache = new
> ClientCacheFactory().addPoolLocator(LOCATOR_HOST, PORT).create();
> Region<String, PdxInstance> region = cache.<String,
> PdxInstance>createClientRegionFactory(ClientRegionShortcut.CACHING_PROXY)
> .create(REGION_NAME);
>
> //store json document
> region.put("key", JSONFormatter.fromJSON(jsonDoc_2));
>
> //How to query json document like,
>
> // 1. select col2.k21, col1, col20 from /REGION_NAME where 
> data.col2.k21 = '22' OR data.col2.k21 = '33'
>
> // 2. select col2.k21, col1.k11, col1 from /REGION_NAME where 
> data.col1.k11 in ('aaa', 'xxx', 'yyy')
> }
> }
>
> *Server: Region-creation*
>
> gfsh> create region --name=REGION_NAME --type=PARTITION --redundant-copies=1 
> --total-num-buckets=61
>
>
> *Setup: distributed cluster of 3 nodes*
>
> *My Observations/Problems*
> - The put operation takes excessive time: region.put("key",
> JSONFormatter.fromJSON(jsonDoc_2)); reading a single record from a file
> and storing it in Geode takes approx. 3 seconds.
> Are there any suggestions or configuration options, related to the
> JSONFormatter API or otherwise, to optimize this?
>
> *Looking forward to guidance on querying this JSON with the above sample
> queries.*
>
> *Thanks*
> *Ankit*
>
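The two questions above can be sketched in OQL. As far as I know, Geode's OQL navigates nested collections by iterating them in the FROM clause rather than with wildcards; the query below is an untested sketch against the jsonDoc_2 shape, not a verified answer:

```sql
-- Hedged sketch: iterate the nested "data" array and its "col2" array in
-- the FROM clause. Whether this matches the PdxInstance produced by
-- JSONFormatter is an assumption, not verified against a running cluster.
SELECT c.k21, d.col1
FROM /REGION_NAME e, e.data d, d.col2 c
WHERE c.k21 = '22' OR c.k21 = '33'
```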


Re: Geode - store and query JSON documents

2020-11-22 Thread Mario Salazar de Torres
Hi @ankit Soni,

I would say the kind of request you want to execute can't be done (or at 
least not easily). Let me explain what I mean.
JSON objects are encapsulated as objects called PdxInstances, and there are 
certain restrictions when it comes to querying these types of objects:

You can't make queries that iterate over the elements of an array, e.g.:

  *   SELECT data[*].col1 FROM /REGION_NAME WHERE data[*].col1
  *   SELECT data[*].col1 FROM /REGION_NAME WHERE data[*].col2[*].k21 = '22'
  *   SELECT * FROM /REGION_NAME WHERE data[*].col2[*].k21 = '22'

Elasticsearch-style query syntax is not available in Geode, as far as I am 
aware.
Maybe someone else knows if there is a way to execute these queries with Lucene?

Sorry not to be of more help :S
BR,
Mario.


Re: Requests taking too long if one member of the cluster fails

2020-11-22 Thread John Blum
Hi Mario-


1) Regarding why writes go only to the primary (bucket) of a PR... again, it 
has to do with consistency.

Fundamentally, a distributed system is constrained by CAP.  The system can 
either be consistent or available in the face of network partitions.  You 
cannot have your cake and eat it too, 😉.

By design, Apache Geode favors consistency over availability.  However, it 
doesn't mean Geode becomes unavailable when a node or nodes, or the network, 
fails. With the ability to configure redundant copies, it is more like "limited 
availability" when a member or portion of the cluster is severed from the rest 
of the system, until the member(s) or network recovers.

But, to guarantee consistency, a single node (i.e. the "primary") hosting the 
PR must be "the source of truth".  If writes are allowed to go to secondaries, 
then you need a sophisticated consensus algorithm (e.g. Paxos, Raft) to resolve 
conflicts when 2 or more writes in different copies change the same logical 
object but differ in value.  Therefore, writes go to the primary and are then 
distributed to the secondaries (which require an ACK) while holding a lock.
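The primary-then-secondaries flow described above can be sketched with a toy (non-Geode) class: writes are serialized under a single lock on the primary copy and applied to each secondary before the call returns, which stands in for the ACK. Class and method names here are illustrative assumptions, not Geode APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch (not Geode code): a "primary" that serializes writes under a
// lock and only returns once every "secondary" copy has applied them,
// mirroring the primary-then-secondaries flow described above.
public class PrimaryCopy {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    private final List<Map<String, String>> secondaries = new ArrayList<>();
    private final Object writeLock = new Object();

    public void addSecondary(Map<String, String> copy) {
        secondaries.add(copy);
    }

    public void put(String key, String value) {
        synchronized (writeLock) {           // single control plane for writes
            data.put(key, value);            // primary is the source of truth
            for (Map<String, String> s : secondaries) {
                s.put(key, value);           // distribute; return acts as ACK
            }
        }
    }

    public String get(String key) {
        return data.get(key);
    }

    public static void main(String[] args) {
        PrimaryCopy primary = new PrimaryCopy();
        Map<String, String> secondary = new ConcurrentHashMap<>();
        primary.addSecondary(secondary);
        primary.put("k", "v");
        System.out.println(primary.get("k") + " " + secondary.get("k")); // prints "v v"
    }
}
```

Real Geode additionally versions entries and handles concurrent/conflicting updates; this only illustrates why a single write path avoids the need for a consensus algorithm.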

If you think about this in object-oriented terms, the safest object in a highly 
concurrent system is an immutable one.  However, if an object can be modified 
by multiple Threads, then it is safer if all Threads access the object through 
the same control plane to uphold the invariants of the object.

NOTE: For an object, serializing access through synchronization does increase 
contention.  However, keep in mind that a PR does not have just 1 primary.  
Each bucket of the PR (113 by default; tunable) has a primary, thereby 
reducing contention on writes.
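For reference, the per-region bucket count is fixed at region creation time, as in the create-region command earlier in this thread; a minimal gfsh sketch (region name illustrative, 113 is the default):

```
gfsh> create region --name=EXAMPLE --type=PARTITION --total-num-buckets=113
```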

Finally, Geode's consistency guarantees are much more sophisticated than what I 
described above. You can read more about Geode's consistency here [1] (an 
entire chapter has been dedicated to this very important topic).



2) Regarding member-timeout...

Can this setting be too low?  Yes, of course; you must be careful.

Setting too low of a member-timeout could result in the system thrashing 
between the member being kicked out and the member rejoining the system.

This is costly because, after a member is kicked out, the system must "restore 
redundancy".  When the member rejoins, a "fence & merge" process occurs, then 
the system may need to "rebalance" the data.
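To make the knob concrete, member-timeout lives in gemfire.properties and defaults to 5000 ms; the value below is purely illustrative, not a recommendation:

```
# gemfire.properties
# Default is 5000 (ms). A larger value tolerates longer network blips,
# at the cost of slower failure detection; tune and measure for your setup.
member-timeout=10000
```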

Why would a node bounce between being a member, and part of the system, and 
getting kicked out?

Well, it depends on your infrastructure, for one.  If you have an unreliable 
network (more common in cloud environments in certain cases), then minor but 
frequent network blips that sever 1 or more members could cause the member(s) 
to bounce between being kicked out and rejoining.  If enough members are 
severed from the system, then the system might need to decide on a quorum.

If a member is sick (e.g. running low on memory) thereby making the member 
seemingly unresponsive when in fact the member is just overloaded, this can 
cause issues.

There are many factors to consider when configuring Geode.  Don't simply set a 
property thinking it solved your immediate problem when in fact it might have 
shifted the problem somewhere else.

The setting for member-timeout may very well be what you need, or you may need 
to consider other factors (e.g. the size of your system, both number of nodes 
as well as the size of the data, level of redundancy, you mention collocated 
data (this also is a factor), the environment, etc, etc).

This is the trickiest part of using any system like Geode.  You typically must 
tune it properly to your use case and requirements over several iterations to 
meet your SLAs.

This chapter [2] in the User Guide will be your friend.

I will let others chime in with their expertise/experience now.  Hopefully, 
this has given you some thoughts and things to consider.  Just remember, always 
test and measure, 🙂

Cheers,
John


[1] 
https://geode.apache.org/docs/guide/113/developing/distributed_regions/region_entry_versions.html
[2] 
https://geode.apache.org/docs/guide/113/managing/monitor_tune/chapter_overview.html

From: Mario Salazar de Torres 
Sent: Saturday, November 21, 2020 1:40 PM
To: dev@geode.apache.org 
Cc: miguel.g.gar...@ericsson.com 
Subject: Re: Requests taking too long if one member of the cluster fails

Thanks @John Blum for your detailed explanation! It 
helped me better understand how redundancy works.

The thing is that all our use cases require a really low response time when 
performing operations.
Under normal conditions a "put" takes a few milliseconds, but in the case of a 
cluster member going down, in the described scenario it might take up to 30 
seconds,