Thanks, Jack, for the various points. A question: when you have hundreds of
fields from different sources and also a lot of copyField instructions (for
facets, sorting, a catch-all field, etc.), don't you suffer some performance
hit during ingestion, since many of the copy instructions would be executed
but do nothing because their source fields have no data? Do you agree?
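
For context, the kind of setup I am describing looks roughly like the sketch
below (Python against the Schema API; the collection, field names, and copy
rules are placeholders I made up, not our actual schema):

# Register a handful of copyField rules of the kind I mean: a catch-all for
# the search box, plus string copies used only for faceting/sorting.
# (Assumes text_all, category_facet, and price_sort already exist.)
import requests

SCHEMA_URL = "http://localhost:8983/solr/combined/schema"

copy_rules = {
    "add-copy-field": [
        {"source": "title_src_a",    "dest": "text_all"},
        {"source": "title_src_b",    "dest": "text_all"},
        {"source": "category_src_a", "dest": "category_facet"},
        {"source": "price_src_b",    "dest": "price_sort"},
    ]
}

requests.post(SCHEMA_URL, json=copy_rules).raise_for_status()

# For a document coming from source B, the source fields of the source-A
# rules above are simply absent, which is where my question about
# per-document overhead during ingestion comes from.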

Assuming keyword search is required across the different data sources, with
results from each data source presented as the user types (instant /
autocomplete) in a single search box, and a very field-specific advanced
search is required in a separate advanced-search option, how would you
suggest designing the index/schema?

Let me know if I am missing any other info needed to get your thoughts.

On Tue, Dec 22, 2015 at 11:53 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> Step one is to refine and more clearly state the requirements. Sure,
> sometimes (most of the time?) the end user really doesn't know exactly what
> they expect or want other than "Gee, I want to search for everything, isn't
> that obvious??!!", but that simply means that an analyst is needed to
> intervene before you leap to implementation. An analyst is someone who
> knows how to interview all relevant parties (not just the approving
> manager) to understand their true needs. I mean, who knows, maybe all they
> really need is basic keyword search. Or... maybe they actually need a
> full-blown data warehouse with precise access to each specific field of
> each data source. Without knowing how refined user queries need to get,
> there is little to go on here.
>
> My other advice is to be careful not to overthink the problem - to imagine
> that some complex solution is needed when the end users really only need to
> do super basic queries. In general, managers are very poor when it comes to
> analysis and requirement specification.
>
> Do they need to do date searches on a variety of date fields?
>
> Do they need to do numeric or range queries on specific numeric fields?
>
> Do they need to do any exact match queries on raw character fields (as
> opposed to tokenized text)?
>
> Do they have fields like product names or numbers in addition to free-form
> text?
>
> Do they need to distinguish or weight titles from detailed descriptions?
>
> You could have catchall fields for categories of field types like titles,
> bodies, authors/names, locations, dates, numeric values. But... who
> knows... this may be more than what an average user really needs.
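
(If I follow, querying those category catch-all fields would look something
like this sketch; the field names and boosts below are placeholders, not
from any real schema.)

import requests

# Weight the category catch-alls differently, e.g. titles over bodies.
params = {
    "q": "coffee",
    "defType": "edismax",
    "qf": "title_all^3 names_all^2 body_all location_all",
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/combined/select", params=params)
print(resp.json()["response"]["numFound"])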
>
> As far as the concern about fields from different sources that are not
> used, Lucene only stores and indexes fields which have values, so no
> storage or performance is consumed when you have a lot of fields which are
> not present for a particular data source.
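
(Right, so for example two documents from different sources with disjoint
fields can go into the same collection, and only the fields actually present
are indexed or stored for each one; a quick sketch with made-up names:)

import requests

docs = [
    {"id": "a-1", "source_type": "products", "title": "Espresso machine",
     "price": 129.99},
    {"id": "b-1", "source_type": "articles", "headline": "Coffee history",
     "body": "..."},
]
# Fields absent from a document cost nothing for that document.
requests.post("http://localhost:8983/solr/combined/update?commit=true",
              json=docs).raise_for_status()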
>
> -- Jack Krupansky
>
> On Tue, Dec 22, 2015 at 11:25 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
>
> > Hello,
> >
> > I am going through a few use cases where we have multiple disparate data
> > sources which in general don't have many common fields, and I was thinking
> > of designing a different schema/index/collection for each of them, querying
> > each of them separately, and providing a different result set to the client.
> >
> > I have seen one implementation where all the different fields from these
> > disparate data sources are put together in a single schema/design/collection
> > so that it can be searched easily using a catch-all field, but it ended up
> > with 200+ fields including copy fields. The problem I see with this design
> > is that ingestion will be slower (and harder to scale), as many of the
> > fields for one data source will not be applicable when ingesting another
> > data source. Basically everything is being dumped into one huge
> > schema/index/collection.
> >
> > After looking at the above, I am wondering how we can design this better in
> > another implementation where we have the requirement to search across
> > disparate sources (each having 10-15 searchable fields and 10-15 stored
> > fields) with only one common field, like description, in each of the data
> > sources. Most of the time the user may search on description, and the rest
> > of the time on a combination of different fields, similar to a Google-like
> > search where you search for "coffee" and it searches various data sources
> > (websites, maps, images, places, etc.).
> >
> > My thought is to make separate indexes for each search scenario. For
> > example, for the single search box, we index the description, the other key
> > fields which can be searched together, and the data source type into one
> > index/schema, so that we don't end up with a huge index/schema, and we use
> > the catch-all field for search.
> >
> > And for the advanced search (field-specific) scenario, we create a separate
> > index/schema for each data source.
> >
> > Any suggestions/guidelines on how we can better address this in terms of
> > responsiveness and scaling? Each data source may have 50-100+ million
> > documents.
> >
> > Thanks,
> > Susheel
> >
>
