On 11/1/06, Mike Klaas <[EMAIL PROTECTED]> wrote:
> DUH2.doDeletions() would also highly benefit from sorting the id terms
> before looking them up in these types of cases (as it would trigger
> optimizations in Lucene as well as being kinder to the OS's read-ahead
> buffers).

Hmmm, good point.  I wonder how simply using a TreeMap instead of a
HashMap would work.
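
Something along these lines, maybe (untested sketch, not the actual
doDeletions() code -- the "id" field name, the class, and the map's value
type are just placeholders).  A TreeMap iterates its keys in natural String
order, which is the same order Lucene keeps terms within a field, so the
lookups sweep the term dictionary instead of seeking randomly around it:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

class SortedDeletesSketch {
  // pending deletes keyed by uniqueKey; TreeMap keeps the keys sorted
  private final Map<String,Integer> pendingDeletes = new TreeMap<String,Integer>();

  void doDeletions(IndexReader reader) throws IOException {
    // ids come out in sorted (term) order, so successive lookups land
    // close together in the term dictionary
    for (Map.Entry<String,Integer> e : pendingDeletes.entrySet()) {
      reader.deleteDocuments(new Term("id", e.getKey()));  // "id" field assumed
    }
    pendingDeletes.clear();
  }
}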

> > If you have a multi-CPU server, you could increase indexing
> > performance by using a multithreaded client to keep all the CPUs on
> > the server busy.

> I thought so, too, but it turns out that there isn't a huge amount of
> concurrent updating that can occur, if I am reading the code
> correctly.  DUH2.addDoc() calls exactly one of addConditionally,
> overwriteBoth, or allowDups, each of which adds the document in a
> synchronized(this) block.

Good catch.
And with the way that deletes are deferred, moving the add outside of
the sync block should work OK, I think... then the analysis of
documents can be done in parallel.
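
Roughly what I'm picturing (hand-wavy sketch, not the real
DirectUpdateHandler2 code; the class and method names are made up): only the
deferred-delete bookkeeping stays under the handler lock, and the add itself,
where the analysis happens, runs outside it, since IndexWriter.addDocument()
is already thread-safe:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

class ParallelAddSketch {
  private final IndexWriter writer;
  // ids whose older copies should be removed at the next deferred-delete pass
  private final Set<String> pendingDeletes = new HashSet<String>();

  ParallelAddSketch(IndexWriter writer) { this.writer = writer; }

  void overwrite(String id, Document doc) throws IOException {
    synchronized (this) {
      pendingDeletes.add(id);  // cheap bookkeeping stays inside the lock
    }
    // expensive part (analysis/inversion) can now run concurrently across threads
    writer.addDocument(doc);
  }
}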

Hmmm, but it may not work well in a mixed-overwriting environment.
Say Thread 1 overwrites doc 100 while Thread 2 adds doc 100 (allowing duplicates).
With the adds synchronized, the index ends up in one of two possible states:
  index contains only doc_from_thread1  OR  index contains both docs
Without sync around the adds, a third state becomes possible:
  index contains only doc_from_thread2

Even though synchronized behavior != unsynchronized behavior, this is
only a problem if someone actually wants to mix overwriting &
non-overwriting adds on the same document ids, and was counting on
getting only the two states possible in the synchronized case.

I'm tempted to say "mixing overwriting & non-overwriting adds for the
same documents has undefined behavior".  Thoughts?

-Yonik
