Hello,

I am going through a few use cases where we have multiple disparate data
sources which, in general, don't have many fields in common. I was thinking
of designing a different schema/index/collection for each of them, querying
each separately, and providing different result sets to the client.

I have seen one implementation where all the fields from these disparate
data sources are put together in a single schema/index/collection so that
it can be searched easily using a catch-all field, but it ended up with
200+ fields including copy fields. The problem I see with this design is
that ingestion (and scaling) will be slower, since many of the fields for
one data source are not applicable when ingesting documents from another
data source. Basically everything is being dumped into one huge
schema/index/collection.
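
To illustrate what I mean by the catch-all approach (a Solr-style schema
sketch; the exact setup there may differ, and field names here are just
placeholders), it is presumably something along these lines:

    <!-- every field from every source copied into one default search field -->
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="*" dest="_text_"/>

so all 200+ fields get analyzed and indexed a second time into _text_.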

Having seen the above, I am wondering how we can design this better in
another implementation where we have a requirement to search across
disparate sources (each having 10-15 searchable fields and 10-15 stored
fields) with only one common field, such as description, in each of the
data sources. Most of the time users will search on description, and the
rest of the time on a combination of different fields. It is similar to a
Google-like search where you search for "coffee" and it searches across
various data sources (websites, maps, images, places, etc.).

My thought is to make separate indexes for each search scenario. For
example, for the single search box, we would index description, the other
key fields that can be searched together, and the data source type into
one index/schema, so that we don't end up with a huge index/schema, and
use a catch-all field for search.
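
As a rough sketch of that (again Solr-style; field names and types are
hypothetical), the single-search-box collection could be kept to a handful
of fields plus a source type:

    <field name="id"          type="string"       indexed="true" stored="true" required="true"/>
    <field name="source_type" type="string"       indexed="true" stored="true" docValues="true"/>
    <field name="description" type="text_general" indexed="true" stored="true"/>
    <field name="title"       type="text_general" indexed="true" stored="true"/>

    <!-- catch-all used as the default search field -->
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="description" dest="_text_"/>
    <copyField source="title"       dest="_text_"/>

A filter like fq=source_type:maps could then narrow the single-box search
to one source without carrying every field from every source.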

And for the advanced (field-specific) search scenario, we would create a
separate index/schema for each data source.
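
For that, the field-specific queries would go straight to the per-source
collection, e.g. (collection and field names purely illustrative):

    /solr/websites/select?q=title:coffee+AND+language:en
    /solr/places/select?q=name:coffee+AND+city:chicago

with the application routing the advanced search form to whichever
collection the user picked.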

Any suggestions/guidelines on how we can better address this in terms of
responsiveness and scaling? Each data source may have 50-100+ million
documents.

Thanks,
Susheel