: I've been talking with other papers about Solr and I think what bothers many
: is that there is a deposit of information in a structured database here
: [named A], then we have another set of basically the same data over here
: [named B] and they don't understand why they have to manage two different
: sets of data [A & B] that are virtually the same thing. Many foresee a
The big issue is that while "SQL Schemas" may be fairly consistent, uses of those schemas can be very different ... there is no clear-cut way to look at an arbitrary schema and know how far down a chain of foreign key relationships you should go and still consider the data you find relevant to the item you started with (from a search perspective). ORM tools tend to get around this by lazy loading: if your front end application starts with a single jobPostId and then asks for the name of the city it's mapped to, or the name of the company it's mapped to, it will dynamically fetch the "Company" object from the company table, or maybe it will only fetch the single companyName field. But when building a search index you can't get that lazy evaluation -- you have to proactively fetch that data in advance, which means you have to know in advance how far down the rabbit hole you want to go.

Not all relationships are equal either: you might have a "Skills" table and a many-to-many relationship between JobPosting and Skills, with a "mappingType" on the mapping indicating which skills are required and which are just desirable -- those should probably go in separate fields of your index, but some code somewhere needs to know that.

Once you've solved that problem -- once you've got a function that you can point at your DB, give it a primary key, and get back a "flattened" view of the data that can represent your "Solr/Lucene Document" -- you're 80% done ... the problem is that the 80% isn't a generically solvable problem; there aren't simple rules you can apply to any DB schema to drive that function.

Even the last 20% isn't really generic: knowing when to re-index a particular "document". The needs of a system where individual people update JobPostings one at a time are very different from a system where JobPostings are bulk imported thousands at a time, and it's hard to write a useful "indexer" that can function efficiently in both cases. Even in the first case, dealing with individual document updates where the primary JobPosting data changes is only the "common" problem; there are still the less-common situations where a Company name changes and *all* of the associated Job Postings need to be reindexed. For small indexes it might be worthwhile to just rebuild the index from scratch; for bigger indexes you might need a more complex solution for dealing with this situation.

The advice I give people at CNET when they need to build a Solr index is:

1) start by deciding what the minimum "freshness" is for your data ... ie: what is the absolute longest you can live with needing to wait for data to be added/deleted/updated in your Solr index once it's been added/deleted/modified in your DB.

2) write a function that can generate a Solr Document from an instance of your data (be it a bean, a DB row, whatever you've got) -- a rough sketch of one way this might look is below.

3) write a simple wrapper program that iterates over all of your data, and calls the function from #2.

If #3 takes less time to run than the freshness window from #1, cron it to rebuild the index from scratch over and over again and use snapshooter and snappuller to expose it to the world ... if #3 takes longer than #1, then look at ways to more systematically decide which docs should be updated, and how.

-Hoss
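
For concreteness, here is a rough sketch of what step #2 might look like for the hypothetical JobPosting/Company/Skills example above, using plain JDBC to pull the flattened view and SolrJ's SolrInputDocument to hold it. All of the table, column, and index field names here are invented for illustration, and the choice of which joins to follow (posting -> company -> city, plus the skills mapping) is just one answer to the "how far down the rabbit hole" question -- treat it as a starting point, not the one right way:

import java.sql.*;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical flattener for the JobPosting example; schema names are made up.
public class JobPostingFlattener {

    private final Connection conn;

    public JobPostingFlattener(Connection conn) {
        this.conn = conn;
    }

    /** Build one flattened Solr document for a single jobPostId. */
    public SolrInputDocument flatten(long jobPostId) throws SQLException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", jobPostId);

        // Primary row plus the one-to-one hops we decided are search-relevant:
        // job posting -> company -> city.  We deliberately stop there.
        String sql =
            "SELECT j.title, j.description, c.name AS company_name, ci.name AS city_name " +
            "FROM job_posting j " +
            "JOIN company c ON j.company_id = c.id " +
            "JOIN city ci ON j.city_id = ci.id " +
            "WHERE j.id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, jobPostId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    doc.addField("title", rs.getString("title"));
                    doc.addField("description", rs.getString("description"));
                    doc.addField("companyName", rs.getString("company_name"));
                    doc.addField("cityName", rs.getString("city_name"));
                }
            }
        }

        // The many-to-many skills mapping: the mappingType column decides which
        // (multiValued) field each skill lands in -- this is exactly the kind of
        // schema-specific knowledge that "some code somewhere needs to know".
        String skillSql =
            "SELECT s.name, m.mapping_type " +
            "FROM job_skill m JOIN skill s ON m.skill_id = s.id " +
            "WHERE m.job_posting_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(skillSql)) {
            ps.setLong(1, jobPostId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String field = "REQUIRED".equals(rs.getString("mapping_type"))
                        ? "requiredSkill" : "desiredSkill";
                    doc.addField(field, rs.getString("name"));
                }
            }
        }
        return doc;
    }
}

The #3 wrapper is then just a loop over all the primary keys in job_posting that calls flatten() on each one and posts the resulting documents to Solr (via SolrJ or the XML update handler), followed by a commit.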