Hi again,

After investigating a bit more how Map indexes work, I have seen that they are 
also used to support queries that use "!=" as Jason pointed out.

I would have expected that queries using != or NOT would not make use of 
indexes as it is the common practice in databases ([1]) and also as the Geode 
documentation seems to suggest ([2]):

"Indexes are not used in expressions that contain NOT, so in a WHERE clause of 
a query, qty >= 10 could have an index on qty applied for efficiency. However, 
NOT(qty < 10) could not have the same index applied."

Could somebody please confirm or deny if what the documentation states above is 
true or false and also if the conclusion can also be extended to the use of the 
!= operator?

I also think that the documentation about indexes could be improved at least in 
two areas:

  *   Information about range indexes. While there is a section for the 
deprecated Hash Indexes, there is no specific section for Range Indexes.
  *   Information about Map indexes. The information about these indexes lacks 
a bit of detail. For example, how does the index work when the entry does not 
contain the Map field for which there is an index? How does it behave when the 
Map field does not have the key in the index? How does it behave when the key 
is null or when the value is null?

Does anyone have plans to extend the information about indexes in Geode?

Thanks,

Alberto G.

[1] 
https://stackoverflow.com/questions/1759476/database-index-not-used-if-the-where-criteria-is
[2] 
https://geode.apache.org/docs/guide/19/developing/query_index/indexing_guidelines.html


________________________________
From: Alberto Gomez <alberto.go...@est.tech>
Sent: Saturday, February 13, 2021 5:40 PM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: Question about Map indexes

Jason, thanks for the help.

I added a new commit to the pull request that solves the issue without 
(apparently) breaking anything.

The problem was that when adding an index entry we need to distinguish between 
the case where the Map does not contain the key from the case where the Map 
contains the key but the value for the key is null. If we use Map.get() we get 
in both cases null but we should only add the index entry in the latter case 
(when the map contains the key but the value corresponding to it is null).

I am not particularly proud of the solution because I use of an arbitrary 
exception to be able to distinguish both cases. Anyway, could you please check 
if we are in the right direction?

Thanks,

Alberto


________________________________
From: Jason Huynh <jhu...@vmware.com>
Sent: Thursday, February 11, 2021 10:57 PM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: Question about Map indexes

Hi Alberto,

I haven't checked the PR yet, just read through the email.  The first thought 
that comes to mind is when someone does a != query.  The index still has to 
supply the correct answer to the query (all entries with null or undefined 
values possibly)

I'll try to think of other cases where it might matter.  There may be other 
ways to execute the query but it would probably take a bit of reworking.. (ill 
check your pr to see if this is already addressed.   Sorry if it is!)

-Jason

On 2/11/21, 8:28 AM, "Alberto Gomez" <alberto.go...@est.tech> wrote:

    Hi,

    We have observed that creating an index on a Map field causes the creation 
of an index entry for every entry created in the region containing the Map, no 
matter if the Map field contained the key used in the index.
    Nevertheless, we would expect that only entries whose Map field contain the 
key used in the index would have the corresponding index entry. With this 
behavior, the memory consumed by the index could be much higher than needed 
depending on the percentage of entries whose Map field contained the key in the 
index.

    ---------------------------------------------------
    Example:
    We have a region with entries whose key type is a String and the value type 
is an object with a field called "field1" of Map type.

    We expect to run queries on the region like the following:

    SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

    We create a Map index to speed up the above queries:

    gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" 
--region="/example-region1 r"

    We do the following puts:
    - Put entry with key="key1" and with value=<Object whose field "field1" is 
a Map that contains the key 'mapkey1'>
    - Put entry with key="key2" and with value=<Object whose field "field1" is 
a Map that does not contain the key 'mapkey1'>

    The observation is that Geode creates two index entries for each entry. For 
the first entry, the internal indexKey is "key1" and for the second one, the 
internal indexKey is null.

    These are the stats shown by gfsh after doing the above puts:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |  
 Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses 
| Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | 
-------- | ----- | --------------------------------- | ------------------ | 
----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1   
 | 1       | 0           | 1    | 1
    server2     | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1   
 | 1       | 0           | 1    | 1
    ---------------------------------------------------

    Is there any reason why Geode would create an index entry for the second 
entry given that the Map field does not contain the key in the Map index?

    I have created a draft pull request changing the behavior of Geode to not 
create the index entry when the Map field does not contain the key used in the 
index. Only two Unit test cases had to be adjusted. Please see: 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6028&amp;data=04%7C01%7Cjhuynh%40vmware.com%7C0957cc0ef91b4b23116408d8ceaa0a8d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637486577011301177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2WDUj6NPEnfX3AXH72MTZYx%2FbXMPJQlVZeKq7KsJDTw%3D&amp;reserved=0

    With this change and the same scenario as the one in the example, only one 
index entry is created. The stats shown by gfsh after the change are the 
following:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |  
 Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses 
| Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | 
-------- | ----- | --------------------------------- | ------------------ | 
----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2   
 | 1       | 0           | 0    | 0
    server2     | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | 
mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2   
 | 1       | 0           | 1    | 1


    Could someone tell if the current behavior is not correct or if I am 
missing something and with the change I am proposing something else will stop 
working?

    Thanks in advance,

    /Alberto G.

Reply via email to