See below:

On Jan 17, 2008 11:42 AM, Michael Lackhoff <[EMAIL PROTECTED]> wrote:

> On 17.01.2008 16:53 Erick Erickson wrote:
>
> > I would *strongly* encourage you to store them together
> > as one document. There's no real method of doing
> > DB like joins in the underlying Lucene search engine.
>
> Thanks, that was also my preference.
>
> > But that's generic advice. The question I have for you is
> > "What's the big deal about coordinating the sources?"
> > That is, you have to have something that allows you to
> > make a 1:1 correspondence between your data sources
> > or you couldn't relate them in the first place. Is it really
> > that onerous to check?
>
> I don't have an index to check. Both sources come in huge text files,
> one of them daily, the other irregular. One has the ID, the other has a
> different ID that must be mapped first to the ID of the first source. So
> there is no easy way of saying: "Give me the record to this ID from the
> other set of records". It is all buried in plain text files.
>

I didn't explain this well; it's really what you say below.
You already do this.


>
> > If it is, why not build an index and search it when you
> > want to know?
>
> That is what I will do now: Build a SQLite database with just two
> columns: ID and contents with an index on the ID. Then when I rebuild
> the SOLR index by processing the other data I will look up in the SQLite DB
> whether there is a corresponding record from the other source.
> My hope was that I could avoid this intermediate database.
>

I don't see a good way of avoiding this. It's important to keep clearly
in mind that Lucene doesn't do DB things, and trying to force it
to is usually a bad idea. Except sometimes <G>.
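
For reference, the two-column lookup table described above could be
sketched roughly as follows. This is only a minimal illustration in
Python: the table name "records", the helper names build_lookup and
lookup, and the sample row are all assumptions, not anything from the
actual data.

```python
import sqlite3

def build_lookup(records, path=":memory:"):
    """Create a two-column table (id, contents) and fill it from (id, contents) pairs."""
    conn = sqlite3.connect(path)
    # Declaring id as PRIMARY KEY makes SQLite index it, so lookups by id
    # do not have to scan the whole table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, contents TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO records (id, contents) VALUES (?, ?)", records
    )
    conn.commit()
    return conn

def lookup(conn, record_id):
    """Return the contents for record_id, or None if there is no such record."""
    row = conn.execute(
        "SELECT contents FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    return row[0] if row else None

# Hypothetical sample data standing in for the parsed text file.
conn = build_lookup([("a1", "text from the irregular source")])
print(lookup(conn, "a1"))
print(lookup(conn, "zz"))
```

During the nightly rebuild, each record from the daily source would
call lookup() with its (already mapped) ID and merge the result into
the document before it is sent to the index.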

If you wanted to, you could use a Lucene index in place of your
SQLite DB and search on your ID (or use TermEnum/TermDocs
to find it). The only argument for doing it this way is that, if you
do NOT otherwise need the SQLite DB, you would have
one less tool. And it wouldn't even have to be a separate index
since there's no requirement that all documents in Lucene
have the same fields. You could store your meta-data in one
or more documents with fields orthogonal to your "real" data.

True, this trades off complexity of understanding the various
parts of the index against tracking two Lucene indexes, or
a Lucene index and a DB. I can't really argue convincingly
for one or the other approach, except I like as much as
possible to be self-contained.....


>
> > You haven't described enough of your problem
> > space for me to render any opinion of whether
> > this is premature optimization or not, but it
> > sure smells like it from a distance <G>...
>
> I don't think it was premature optimization. It was just the attempt to
> keep the nightly rebuild of the index as easy as possible and to avoid
> unnecessary complexity. But if it is necessary I will go this way.
>

Well, avoiding complexity is good <G>.

There's another thing to consider if (and only if) your data is
stored (as opposed to indexed). Let's say you use the one
document approach. If you both index AND store your non-meta
data, your update process for your meta data could be:
1> find the old document and store away all the
    non-meta data.
2> delete the old document
3> construct a new document with the new meta-data and the
    data from <1>.
4> re-index the document.

This won't work if you only index (but not store) the non-meta
data, since the original text can't be read back from the index.
Whether storing it is practical depends on how big your data set is,
which I sure don't know. If you choose to do this,
be aware that you probably want to lazy-load the non-meta
data or loading the document may get expensive.
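
The four update steps above can be sketched with a toy in-memory
stand-in. This is NOT Lucene API code: the "index" here is just a
Python dict mapping a document ID to its stored fields, and the
"meta_" field-name prefix is a made-up convention for telling
meta-data apart from the rest.

```python
def update_meta(index, doc_id, new_meta):
    """Replace the meta-data of one document by deleting and re-adding it.

    index: dict mapping doc id -> dict of *stored* fields (toy stand-in
    for an index in which the non-meta data is stored, not only indexed).
    """
    old = index[doc_id]                      # 1> find the old document
    data = {k: v for k, v in old.items()     #    and keep its non-meta data
            if not k.startswith("meta_")}
    del index[doc_id]                        # 2> delete the old document
    new_doc = {**data, **new_meta}           # 3> new meta-data + data from <1>
    index[doc_id] = new_doc                  # 4> re-index (add) the document
    return new_doc

# Hypothetical example document.
index = {"42": {"body": "full text of the record", "meta_status": "old"}}
update_meta(index, "42", {"meta_status": "new"})
print(index["42"])
```

If "body" had only been indexed and not stored, step <1> would have
nothing to read back, which is why this approach depends on storing
the non-meta data.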

I suppose you could also consider a variant on the two-index
model.

Index (but don't store) the non-meta data in your primary
index. This reduces the size significantly.

Store (but don't index) the non-meta data in your "update"
index along with an indexed ID.

Updates then become
1> look up the non-meta data from your update index.
2> construct the new document by combining things.
3> delete the doc from the primary index.
4> add the new doc to your primary index.

There's some cost here, and I don't know how this
all plays with the sizes of your indexes. It may be
totally impractical.
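
A rough sketch of this two-index variant, again using toy in-memory
structures rather than real Lucene calls. The class PrimaryIndex, the
function update, and the plain dict used as the "update" index are all
hypothetical names for illustration.

```python
class PrimaryIndex:
    """Toy primary index: text is searchable (indexed) but not stored."""

    def __init__(self):
        self.postings = {}   # term -> set of doc ids (indexed, not stored)
        self.meta = {}       # doc id -> stored meta-data

    def add(self, doc_id, text, meta):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)
        self.meta[doc_id] = dict(meta)

    def delete(self, doc_id):
        for ids in self.postings.values():
            ids.discard(doc_id)
        self.meta.pop(doc_id, None)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def update(primary, update_store, doc_id, new_meta):
    """update_store: dict id -> stored text (toy stand-in for the "update" index)."""
    text = update_store[doc_id]            # 1> look up the non-meta data
    # 2> the new document is that text combined with the new meta-data
    primary.delete(doc_id)                 # 3> delete the doc from the primary index
    primary.add(doc_id, text, new_meta)    # 4> add the new doc to the primary index


# Hypothetical example data.
store = {"7": "lucene is not a database"}
primary = PrimaryIndex()
primary.add("7", store["7"], {"status": "old"})
update(primary, store, "7", {"status": "new"})
print(primary.search("database"), primary.meta["7"])
```

The primary index never holds the raw text, so it stays small; the
price is one extra lookup in the update index per changed document.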

Anyway, back to work.

Erick


> -Michael
>
