: I've been talking with other papers about Solr and I think what bothers many
: is that there is a deposit of information in a structured database here
: [named A], then we have another set of basically the same data over here
: [named B] and they don't understand why they have to manage two different
: sets of data [A & B] that are virtually the same thing. Many foresee a
The big issue is that while "SQL Schemas" may be fairly consistent, uses of those schemas can be very different ... there is no clear-cut way to look at an arbitrary schema and know how far down a chain of foreign key relationships you should go and still consider the data you find relevant to the item you started with (from a search perspective). ORM tools tend to get around this by lazy loading: if your front end application starts with a single jobPostId and then asks for the name of the city it's mapped to, or the name of the company it's mapped to, it will dynamically fetch the "Company" object from the company table, or maybe it will only fetch the single companyName field. But when building a search index you can't get that lazy evaluation -- you have to proactively fetch that data in advance, which means you have to know in advance how far down the rabbit hole you want to go.

Not all relationships are equal either: you might have a "Skills" table and a many-to-many relationship between JobPosting and Skills, with a "mappingType" on the mapping indicating which skills are required and which are just desirable -- those should probably go in separate fields of your index, but some code somewhere needs to know that.

Once you've solved that problem -- once you've got a function that you can point at your DB, give it a primary key, and get back a "flattened" view of the data that can represent your "Solr/Lucene Document" -- you're 80% done ... the problem is that the 80% isn't a generically solvable problem; there aren't simple rules you can apply to any DB schema to drive that function.

Even the last 20% isn't really generic: knowing when to re-index a particular "document". The needs of a system where individual people update JobPostings one at a time are very different from a system where JobPostings are bulk imported thousands at a time, and it's hard to write a useful "indexer" that can function efficiently in both cases. Even in the first case, dealing with individual document updates where the primary JobPosting data changes is only the "common" problem; there are still the less-common situations where a Company name changes and *all* of the associated Job Postings need to be reindexed. For small indexes it might be worthwhile to just rebuild the index from scratch; for bigger indexes you might need a more complex solution for dealing with this situation.

The advice I give people at CNET when they need to build a Solr index is:

1) start by deciding what the minimum "freshness" is for your data ... ie: what is the absolute longest you can live with needing to wait for data to be added/deleted/updated in your Solr index once it's been added/deleted/modified in your DB.

2) write a function that can generate a Solr Document from an instance of your data (be it a bean, a DB row, whatever you've got) -- a rough sketch of one way this might look is below.

3) write a simple wrapper program that iterates over all of your data, and calls the function from #2.

If #3 takes less time to run than the freshness window from #1, cron it to rebuild the index from scratch over and over again and use snapshooter and snappuller to expose it to the world ... if #3 takes longer than #1, then look at ways to more systematically decide which docs should be updated, and how.

-Hoss
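
For concreteness, here is a rough sketch of what step #2 might look like for the hypothetical JobPosting/Company/Skills example above, using plain JDBC to pull the flattened view and SolrJ's SolrInputDocument to hold it. All of the table, column, and index field names here are invented for illustration, and the choice of which joins to follow (posting -> company -> city, plus the skills mapping) is just one answer to the "how far down the rabbit hole" question -- treat it as a starting point, not the one right way:

import java.sql.*;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical flattener for the JobPosting example; schema names are made up.
public class JobPostingFlattener {

    private final Connection conn;

    public JobPostingFlattener(Connection conn) {
        this.conn = conn;
    }

    /** Build one flattened Solr document for a single jobPostId. */
    public SolrInputDocument flatten(long jobPostId) throws SQLException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", jobPostId);

        // Primary row plus the one-to-one hops we decided are search-relevant:
        // job posting -> company -> city.  We deliberately stop there.
        String sql =
            "SELECT j.title, j.description, c.name AS company_name, ci.name AS city_name " +
            "FROM job_posting j " +
            "JOIN company c ON j.company_id = c.id " +
            "JOIN city ci ON j.city_id = ci.id " +
            "WHERE j.id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, jobPostId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    doc.addField("title", rs.getString("title"));
                    doc.addField("description", rs.getString("description"));
                    doc.addField("companyName", rs.getString("company_name"));
                    doc.addField("cityName", rs.getString("city_name"));
                }
            }
        }

        // The many-to-many skills mapping: the mappingType column decides which
        // (multiValued) field each skill lands in -- this is exactly the kind of
        // schema-specific knowledge that "some code somewhere needs to know".
        String skillSql =
            "SELECT s.name, m.mapping_type " +
            "FROM job_skill m JOIN skill s ON m.skill_id = s.id " +
            "WHERE m.job_posting_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(skillSql)) {
            ps.setLong(1, jobPostId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String field = "REQUIRED".equals(rs.getString("mapping_type"))
                        ? "requiredSkill" : "desiredSkill";
                    doc.addField(field, rs.getString("name"));
                }
            }
        }
        return doc;
    }
}

The #3 wrapper is then just a loop over all the primary keys in job_posting that calls flatten() on each one and posts the resulting documents to Solr (via SolrJ or the XML update handler), followed by a commit.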