Step one is to refine and state the requirements more clearly. Sure, sometimes (most of the time?) the end user really doesn't know exactly what they expect or want beyond "Gee, I want to search for everything, isn't that obvious?!", but that simply means an analyst needs to step in before you leap to implementation. An analyst is someone who knows how to interview all the relevant parties (not just the approving manager) to understand their true needs. Who knows, maybe all they really need is basic keyword search. Or maybe they actually need a full-blown data warehouse with precise access to each specific field of each data source. Without knowing how refined user queries need to be, there is little to go on here.
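To make that contrast concrete, here is a rough sketch of the two extremes as Solr queries (the field names "text", "title", "source_type" and "release_dt" are invented for illustration, and a catchall field named "text" is assumed):

    # basic keyword search: everything goes against one catchall field
    q=coffee&df=text

    # "data warehouse" style: precise, fielded access per data source
    q=title:coffee AND source_type:product AND release_dt:[2015-01-01T00:00:00Z TO *]

If nobody will ever type (or be given a UI for) the second kind of query, a lot of schema design effort can be skipped.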
My other advice is to be careful not to overthink the problem, imagining that some complex solution is needed when the end users really only need to run very basic queries. In general, managers are very poor at analysis and requirement specification. Do the users need date searches on a variety of date fields? Numeric or range queries on specific numeric fields? Exact-match queries on raw character fields (as opposed to tokenized text)? Do they have fields like product names or numbers in addition to free-form text? Do they need to distinguish or weight titles differently from detailed descriptions? You could have catchall fields for categories of field types such as titles, bodies, authors/names, locations, dates, and numeric values. But, who knows, this may be more than what an average user really needs. As for the concern about fields from different sources going unused: Lucene only stores and indexes fields which actually have values, so no storage or performance is consumed when many fields are simply not present for a particular data source (see the rough schema sketch at the bottom of this message).

-- Jack Krupansky

On Tue, Dec 22, 2015 at 11:25 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> Hello,
>
> I am going through a few use cases where we have multiple disparate data
> sources which in general don't share many common fields, and I was thinking
> of designing a different schema/index/collection for each of them, querying
> each separately, and providing different result sets to the client.
>
> I have seen one implementation where all the different fields from these
> disparate data sources were put together in a single schema/index/collection
> so that it could be searched easily using a catch-all field, but it ended up
> with 200+ fields including copy fields. The problem I see with this design
> is that ingestion (and scaling) will be slower, since many of the fields for
> one data source will not be applicable when ingesting another data source.
> Basically everything is being dumped into one huge schema/index/collection.
>
> Given the above, I am wondering how we can design this better in another
> implementation where we have a requirement to search across disparate
> sources (each having 10-15 searchable fields and 10-15 stored fields) with
> only one common field, like description, in each of the data sources. Most
> of the time the user may search on description, and the rest of the time on
> a combination of different fields. Similar to a Google-like search where you
> search for "coffee" and it searches various data sources (websites, maps,
> images, places, etc.).
>
> My thought is to make separate indexes for each search scenario. For
> example, for the single search box, we index description, the other key
> fields which can be searched together, and their data source type into one
> index/schema, so that we don't build a huge index/schema, and we use the
> catch-all field for search.
>
> And for the other, advanced (field-specific) search scenario, we create a
> separate index/schema for each data source.
>
> Any suggestions/guidelines on how we can better address this in terms of
> responsiveness and scaling? Each data source may have documents in the
> 50-100+ million range.
>
> Thanks,
> Susheel
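P.S. In case it helps make the catchall idea concrete, here is a rough, hypothetical schema.xml fragment along the lines described above. The field names are invented and the type names (text_general, string, tdate) are just the ones from the stock example schema, so adjust to your own config; the point is only that per-source fields can stay sparse at no cost, while copyField feeds the single search box:

    <!-- shared free-form text present in every source -->
    <field name="description" type="text_general" indexed="true" stored="true"/>
    <!-- per-source fields; documents that lack them consume nothing -->
    <field name="title"       type="text_general" indexed="true" stored="true"/>
    <field name="product_num" type="string"       indexed="true" stored="true"/>
    <field name="created_dt"  type="tdate"        indexed="true" stored="true"/>
    <!-- catchall field for the single-search-box case -->
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

    <copyField source="description" dest="text"/>
    <copyField source="title"       dest="text"/>

    <!-- weighting titles over descriptions then becomes a query-time choice,
         e.g. defType=edismax&qf=title^5 description text -->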