Hello Hoss and the list,

We are currently using Lucene payloads to store per-document-per-keyword
scores for our dataset. Our dataset consists of photos with keywords
assigned (only once each) to them. The index is about 90 GB, running on
24-core machines with dedicated 10k SAS drives, and 16/32 GB allocated to
the JVM.

When searching the payloads field, our 98 percentile query time is at 2
seconds even with trivially low queries per second. I have asked several
Lucene committers about this and it's believed that the implementation of
payloads being so general is the cause of the slowness.

Hoss guessed that we could override Term Frequency with PreAnalyzedField[1]
for the per-keyword scores, since keywords (tags) always have a Term
Frequency of 1 and the TF calculation is very fast. However it turns out
that you can't[2] specify TF in the PreAnalyzedField.

Is there any other way to override Term Frequency during index time? If
not, where in the code could this be implemented?

An obvious option is to repeat the keyword as many times as its payload
score, but that would drastically increase the amount of data per document
sent during index time.

I'd welcome any other per-document-per-keyword score solutions, or some way
to speed up searching a payload field.

Thanks,

- Neil

[1] https://issues.apache.org/jira/browse/SOLR-1535
[2]
https://issues.apache.org/jira/browse/SOLR-1535?focusedCommentId=13273501#comment-13273501

Reply via email to