Jason, thanks for the help. I added a new commit to the pull request that solves the issue without (apparently) breaking anything.
The problem was that when adding an index entry we need to distinguish between the case where the Map does not contain the key from the case where the Map contains the key but the value for the key is null. If we use Map.get() we get in both cases null but we should only add the index entry in the latter case (when the map contains the key but the value corresponding to it is null). I am not particularly proud of the solution because I use of an arbitrary exception to be able to distinguish both cases. Anyway, could you please check if we are in the right direction? Thanks, Alberto ________________________________ From: Jason Huynh <jhu...@vmware.com> Sent: Thursday, February 11, 2021 10:57 PM To: dev@geode.apache.org <dev@geode.apache.org> Subject: Re: Question about Map indexes Hi Alberto, I haven't checked the PR yet, just read through the email. The first thought that comes to mind is when someone does a != query. The index still has to supply the correct answer to the query (all entries with null or undefined values possibly) I'll try to think of other cases where it might matter. There may be other ways to execute the query but it would probably take a bit of reworking.. (ill check your pr to see if this is already addressed. Sorry if it is!) -Jason On 2/11/21, 8:28 AM, "Alberto Gomez" <alberto.go...@est.tech> wrote: Hi, We have observed that creating an index on a Map field causes the creation of an index entry for every entry created in the region containing the Map, no matter if the Map field contained the key used in the index. Nevertheless, we would expect that only entries whose Map field contain the key used in the index would have the corresponding index entry. With this behavior, the memory consumed by the index could be much higher than needed depending on the percentage of entries whose Map field contained the key in the index. --------------------------------------------------- Example: We have a region with entries whose key type is a String and the value type is an object with a field called "field1" of Map type. We expect to run queries on the region like the following: SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1" We create a Map index to speed up the above queries: gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" --region="/example-region1 r" We do the following puts: - Put entry with key="key1" and with value=<Object whose field "field1" is a Map that contains the key 'mapkey1'> - Put entry with key="key2" and with value=<Object whose field "field1" is a Map that does not contain the key 'mapkey1'> The observation is that Geode creates two index entries for each entry. For the first entry, the internal indexKey is "key1" and for the second one, the internal indexKey is null. These are the stats shown by gfsh after doing the above puts: gfsh>list indexes --with-stats=yes Member Name | Member ID | Region Path | Name | Type | Indexed Expression | From Clause | Valid Index | Uses | Updates | Update Time | Keys | Values ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------ server1 | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true | 1 | 1 | 0 | 1 | 1 server2 | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true | 1 | 1 | 0 | 1 | 1 --------------------------------------------------- Is there any reason why Geode would create an index entry for the second entry given that the Map field does not contain the key in the Map index? I have created a draft pull request changing the behavior of Geode to not create the index entry when the Map field does not contain the key used in the index. Only two Unit test cases had to be adjusted. Please see: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6028&data=04%7C01%7Cjhuynh%40vmware.com%7C0957cc0ef91b4b23116408d8ceaa0a8d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637486577011301177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=2WDUj6NPEnfX3AXH72MTZYx%2FbXMPJQlVZeKq7KsJDTw%3D&reserved=0 With this change and the same scenario as the one in the example, only one index entry is created. The stats shown by gfsh after the change are the following: gfsh>list indexes --with-stats=yes Member Name | Member ID | Region Path | Name | Type | Indexed Expression | From Clause | Valid Index | Uses | Updates | Update Time | Keys | Values ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------ server1 | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true | 2 | 1 | 0 | 0 | 0 server2 | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true | 2 | 1 | 0 | 1 | 1 Could someone tell if the current behavior is not correct or if I am missing something and with the change I am proposing something else will stop working? Thanks in advance, /Alberto G.