Susheel,

Just a guess, but carrot2.org might be useful. But it might be overkill.

Cheers -- Rick
On August 30, 2017 7:40:08 AM MDT, Susheel Kumar <susheel2...@gmail.com> wrote:
> Hello,
>
> I am looking for different ideas/suggestions to solve the use case I am
> working on.
>
> We have a couple of fields in the schema along with id: business_email and
> personal_email. We need to return all records based on unique business and
> personal emails.
>
> The criterion for unique records is that neither the business nor the
> personal email is repeated in any other record.
> The criterion for non-unique records is that if any business or personal
> email occurs in other records, then all of those records are non-unique.
>
> E.g., considering the documents below:
> - for unique records, only id=1 should be returned (since john.doe is
>   not present in any other record's personal or business email)
> - for non-unique records, id=2,3 should be returned (since isabel.dora is
>   present in multiple records; it doesn't matter whether it is present in
>   the business or personal email)
>
> Documents
> ===
> {id:1,business_email_s:john....@abc.com,personal_email_s:john....@abc.com}
> {id:2,business_email_s:isabel.d...@abc.com}
> {id:3,personal_email_s:isabel.d...@abc.com}
>
> I am able to solve this using a streaming expression query, but I am not
> sure whether performance will become a bottleneck, as the streaming
> expression is quite big. So I am looking for different ideas, like using
> de-dupe or handling it during ingestion/pre-processing, without impacting
> performance much.
>
> Thanks,
> Susheel

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
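
For the ingestion/pre-process idea raised in the quoted message, here is a minimal sketch of the unique/non-unique rule in plain Python. It is not Solr code and not necessarily how the streaming expression works. The function name `split_unique`, the `EMAIL_FIELDS` tuple, and the example.com addresses are hypothetical, and lowercasing addresses before comparing is an assumption. The idea is to map each address to the set of document ids that mention it in either field; any address seen in more than one document makes all of those documents non-unique.

```python
from collections import defaultdict

# Field names taken from the sample documents in the quoted message.
EMAIL_FIELDS = ("business_email_s", "personal_email_s")


def split_unique(docs):
    """Return (unique_ids, non_unique_ids) for a list of dict documents."""
    # Map each email address to the set of document ids that mention it,
    # regardless of which of the two fields it appears in.
    ids_by_email = defaultdict(set)
    for doc in docs:
        for field in EMAIL_FIELDS:
            email = doc.get(field)
            if email:
                # Assumption: addresses are compared case-insensitively.
                ids_by_email[email.lower()].add(doc["id"])

    # An address shared by two or more documents marks all of them non-unique.
    non_unique = set()
    for ids in ids_by_email.values():
        if len(ids) > 1:
            non_unique.update(ids)

    unique = [doc["id"] for doc in docs if doc["id"] not in non_unique]
    return unique, sorted(non_unique)


# Hypothetical data shaped like the three sample documents.
docs = [
    {"id": 1, "business_email_s": "a@example.com", "personal_email_s": "a@example.com"},
    {"id": 2, "business_email_s": "b@example.com"},
    {"id": 3, "personal_email_s": "b@example.com"},
]
unique, non_unique = split_unique(docs)
print(unique)      # [1]
print(non_unique)  # [2, 3]
```

Because ids are collected in a set per address, a document that repeats its own address in both fields (like id=1) still counts as one occurrence. The resulting flag could be stored on each document at index time so the query side only has to filter on it, though whether that fits depends on how often the data changes.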