Hi,

I was just after some advice on how to map some relational metadata to a Solr index. The web application I'm working on is based around people, and the searching is based around properties of these people. Several properties are more complex - for example, a person's occupations have a place, from/to dates and other descriptive text; texts about a person have authors, sources and publication dates. Despite the usefulness of facets and search-based navigation, an advanced search feature is a non-negotiable requirement of the application.

An advanced search needs to be able to query a person on any set of attributes (e.g. gender, birth date, death date, place of birth), including the more complex search criteria described above (occupations, texts). Taking occupation as an example: because an occupation has its own metadata and a person could have held an arbitrary number of occupations throughout their lifetime, I was wondering how/if this information can be denormalised into a single person index document to support such a search. I can't use text concatenation in a multivalued field, as I need to be able to run date-based range queries (e.g. publication dates, occupation dates). And I'm not sure that resorting to a fixed number of repeated fields (e.g. occ1, occ1startdate, occ1enddate, occ1place, occ2, etc.) is a good approach, although it would work.
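To make the repeated-fields idea concrete, here is a rough sketch of what it might look like in schema.xml, assuming the suffix-based dynamic field conventions from Solr's example schema (the occ* field names are just illustrative, not an existing schema):

```xml
<!-- Sketch only: suffix-based dynamic fields, as in Solr's example schema -->
<dynamicField name="*_s"  type="string" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date"   indexed="true" stored="true"/>

<!-- Person documents would then carry fields such as:
     occ1_s, occ1_place_s, occ1_start_dt, occ1_end_dt,
     occ2_s, occ2_place_s, occ2_start_dt, occ2_end_dt, ...
     which keeps range queries possible (e.g. occ1_start_dt:[1950-01-01T00:00:00Z TO *])
     but caps the number of occupations at however many slots are defined/queried. -->
```

The drawback, as noted, is that an advanced search over "any occupation" would have to OR across every occN slot.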

If there isn't a sensible way to denormalise this, what is the best approach? For example, should I have an occupation document type, a person document type and a text/source document type, each containing the relevant person id, and (in the advanced search context) run a query against each document type, then use the intersecting set of person ids as the result for the application's display/pagination? If so, how do I ensure I capture all records? For example, if there are 100,000 hits on someone having worked in Australia in 1956, is there any way to ensure all 100,000 are returned in a query (similar to facet.limit=-1), other than specifying an arbitrarily high number in the "rows" parameter and hoping a query doesn't hit more than 100,000, which would exclude those above the limit from the "intersect" processing?
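For clarity, the intersection step I have in mind would look something like the sketch below (assuming each document type has already been queried separately, e.g. via /select with fl=person_id and a high rows value, and the person ids collected into sets - the function and data here are hypothetical):

```python
def intersect_person_ids(*id_sets):
    """Return the person ids present in every per-document-type result set,
    sorted so the application can paginate over a stable ordering."""
    if not id_sets:
        return []
    # Intersect the first set with all the others.
    common = set(id_sets[0]).intersection(*map(set, id_sets[1:]))
    return sorted(common)

# Example: ids matching the occupation query and the text/source query.
occupation_hits = {"p1", "p2", "p3"}
text_hits = {"p2", "p3", "p4"}
print(intersect_person_ids(occupation_hits, text_hits))  # ['p2', 'p3']
```

The obvious weakness is the one raised above: the intersection is only correct if each component query really returned all of its hits.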

Or is there a single query solution?

Any advice/hints welcome.

Scott.
