On 17.01.2008 16:53 Erick Erickson wrote:

I would *strongly* encourage you to store them together
as one document. There's no real method of doing
DB-like joins in the underlying Lucene search engine.

Thanks, that was also my preference.

But that's generic advice. The question I have for you is
"What's the big deal about coordinating the sources?"
That is, you have to have something that allows you to
make a 1:1 correspondence between your data sources
or you couldn't relate them in the first place. Is it really
that onerous to check?

I don't have an index to check. Both sources come in huge text files, one of them daily, the other irregularly. One has the ID; the other has a different ID that must first be mapped to the ID of the first source. So there is no easy way of saying: "Give me the record for this ID from the other set of records". It is all buried in plain text files.

If it is, why not build an index and search it when you
want to know?

That is what I will do now: build a SQLite database with just two columns, ID and contents, with an index on the ID. Then, when I rebuild the Solr index by processing the other data, I will look up in the SQLite DB whether there is a corresponding record from the other source.
My hope was that I could avoid this intermediate database.
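
A minimal sketch of that intermediate lookup, assuming Python's standard sqlite3 module. The table name, the function names, the map_id hook and the output field names (id, text, extra) are all hypothetical and only illustrate the plan described above; parsing the text files and posting the merged documents to Solr are left out.

```python
import sqlite3


def build_lookup(db_path, records, map_id=lambda other_id: other_id):
    """Load (other_id, contents) pairs into a two-column SQLite table.

    map_id is a hypothetical hook that translates the second source's ID
    into the ID used by the first source. PRIMARY KEY creates the index on id.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lookup (id TEXT PRIMARY KEY, contents TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO lookup (id, contents) VALUES (?, ?)",
            ((map_id(other_id), contents) for other_id, contents in records),
        )
    return conn


def merge_records(conn, primary_records):
    """Yield one combined document per (id, text) record of the first source.

    The 'extra' field is None when the other source has no matching record.
    """
    for rec_id, text in primary_records:
        row = conn.execute(
            "SELECT contents FROM lookup WHERE id = ?", (rec_id,)
        ).fetchone()
        yield {"id": rec_id, "text": text, "extra": row[0] if row else None}
```

Each yielded dictionary is a single combined document, so the two sources end up stored together as recommended, and the nightly rebuild only needs the SQLite file refreshed whenever the irregular source changes.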

You haven't described enough of your problem
space for me to render any opinion of whether
this is premature optimization or not, but it
sure smells like it from a distance <G>...

I don't think it was premature optimization. It was just an attempt to keep the nightly rebuild of the index as simple as possible and to avoid unnecessary complexity. But if it is necessary, I will go this way.

-Michael
