Jason, thanks for the help.

I added a new commit to the pull request that solves the issue without 
(apparently) breaking anything.

The problem was that when adding an index entry we need to distinguish between 
the case where the Map does not contain the key from the case where the Map 
contains the key but the value for the key is null. If we use Map.get() we get 
in both cases null but we should only add the index entry in the latter case 
(when the map contains the key but the value corresponding to it is null).

I am not particularly proud of the solution because I use of an arbitrary 
exception to be able to distinguish both cases. Anyway, could you please check 
if we are in the right direction?

Thanks,

Alberto


________________________________
From: Jason Huynh <jhu...@vmware.com>
Sent: Thursday, February 11, 2021 10:57 PM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: Question about Map indexes

Hi Alberto,

I haven't checked the PR yet, just read through the email.  The first thought 
that comes to mind is when someone does a != query.  The index still has to 
supply the correct answer to the query (all entries with null or undefined 
values possibly)

I'll try to think of other cases where it might matter.  There may be other 
ways to execute the query but it would probably take a bit of reworking.. (ill 
check your pr to see if this is already addressed.   Sorry if it is!)

-Jason

On 2/11/21, 8:28 AM, "Alberto Gomez" <alberto.go...@est.tech> wrote:

    Hi,

    We have observed that creating an index on a Map field causes the creation 
of an index entry for every entry created in the region containing the Map, no 
matter if the Map field contained the key used in the index.
    Nevertheless, we would expect that only entries whose Map field contain the 
key used in the index would have the corresponding index entry. With this 
behavior, the memory consumed by the index could be much higher than needed 
depending on the percentage of entries whose Map field contained the key in the 
index.

    ---------------------------------------------------
    Example:
    We have a region with entries whose key type is a String and the value type 
is an object with a field called "field1" of Map type.

    We expect to run queries on the region like the following:

    SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

    We create a Map index to speed up the above queries:

    gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" 
--region="/example-region1 r"

    We do the following puts:
    - Put entry with key="key1" and with value=<Object whose field "field1" is 
a Map that contains the key 'mapkey1'>
    - Put entry with key="key2" and with value=<Object whose field "field1" is 
a Map that does not contain the key 'mapkey1'>

    The observation is that Geode creates two index entries for each entry. For 
the first entry, the internal indexKey is "key1" and for the second one, the 
internal indexKey is null.

    These are the stats shown by gfsh after doing the above puts:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |  
 Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses 
| Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | 
-------- | ----- | --------------------------------- | ------------------ | 
----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1   
 | 1       | 0           | 1    | 1
    server2     | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1   
 | 1       | 0           | 1    | 1
    ---------------------------------------------------

    Is there any reason why Geode would create an index entry for the second 
entry given that the Map field does not contain the key in the Map index?

    I have created a draft pull request changing the behavior of Geode to not 
create the index entry when the Map field does not contain the key used in the 
index. Only two Unit test cases had to be adjusted. Please see: 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6028&amp;data=04%7C01%7Cjhuynh%40vmware.com%7C0957cc0ef91b4b23116408d8ceaa0a8d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637486577011301177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2WDUj6NPEnfX3AXH72MTZYx%2FbXMPJQlVZeKq7KsJDTw%3D&amp;reserved=0

    With this change and the same scenario as the one in the example, only one 
index entry is created. The stats shown by gfsh after the change are the 
following:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |  
 Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses 
| Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | 
-------- | ----- | --------------------------------- | ------------------ | 
----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2   
 | 1       | 0           | 0    | 0
    server2     | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2   
 | 1       | 0           | 1    | 1


    Could someone tell if the current behavior is not correct or if I am 
missing something and with the change I am proposing something else will stop 
working?

    Thanks in advance,

    /Alberto G.

Reply via email to