Slow Highlighter Performance Even Using FastVectorHighlighter

2013-05-20 Thread Andy Brown
I'm providing a search feature in a web app that searches for documents
that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
etc). Currently there are about 3000 documents and this will continue to
grow. I'm providing full word search and partial word search. For each
document, there are three source fields that I'm interested in searching
and highlighting on: name, description, and content. Since I'm providing
both full and partial word search, I've created additional fields that
get tokenized differently: name_par, description_par, and content_par.
Those are indexed and stored as well for querying and highlighting. As
suggested in the Solr wiki, I've got two catch-all fields, text and
text_par, for faster querying.
 
An average search results page displays 25 results and I provide paging.
I'm just returning the doc ID in my Solr search results and response
times have been quite good (1 to 10 ms). The problem in performance
occurs when I turn on highlighting. I'm already using the
FastVectorHighlighter and depending on the query, it has taken as long
as 15 seconds to get the highlight snippets. However, this isn't always
the case. Certain query terms result in 1 sec or less response time. In
any case, 15 seconds is way too long. 
 
I'm fairly new to Solr but I've spent days coming up with what I've got
so far. Feel free to correct any misconceptions I have. Can anyone
advise me on what I'm doing wrong or offer a better way to set up my core
to improve highlighting performance?
 
A typical query would look like:
/select?q=foo&start=0&rows=25&fl=id&hl=true 
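
For reference, a request like the one above can be assembled programmatically. This is only a minimal sketch: the base URL and the function name are hypothetical, and passing `hl.fl` per request (a standard Solr highlighting parameter) is an assumption about how one might narrow the highlighted fields; the post itself only passes `hl=true` and relies on the handler defaults.

```python
from urllib.parse import urlencode

# Hypothetical core URL; adjust host, port, and core name for a real setup.
BASE_URL = "http://localhost:8983/solr/docs/select"

def build_query(term, page=0, rows=25, highlight=True, hl_fields=None):
    """Return a Solr select URL for one page of results, returning only doc IDs."""
    params = [("q", term), ("start", page * rows), ("rows", rows), ("fl", "id")]
    if highlight:
        params.append(("hl", "true"))
        if hl_fields:
            # Assumption: override the handler's default hl.fl for this request.
            params.append(("hl.fl", " ".join(hl_fields)))
    return BASE_URL + "?" + urlencode(params)

print(build_query("foo"))
print(build_query("foo", hl_fields=["name", "description", "content"]))
```

Restricting `hl.fl` to fewer fields per request is one cheap way to compare highlighting cost field by field without touching the core configuration.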
 
I'm using Solr 4.1. Below are the relevant core schema and config details:
 
   explicit 
   10 
   text 
   edismax 
   text^2 text_par^1
   true 
   true 
   true 
   true 
   breakIterator 
   2 
   name name_par description description_par content content_par
   162 
   simple 
   default 



   
  
  


Cheers!

- Andy



RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-05-22 Thread Andy Brown
After taking your advice on profiling, I didn't see any memory issues. I
wanted to verify this with a small set of data. So I created a new
sandbox core with the exact same schema and config file settings. I
indexed only 25 PDF documents with an average size of 2.8 MB; the
largest is approx. 5 MB (39 pages). I ran the exact same query on that
core and I'm seeing response times of 7 secs or more. Without
highlighting the response is usually 1 ms.
 
I don't understand why it's taking 7 secs to return highlights. The size
of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
1024 for this verification purpose and that should be more than enough.
The processor is plenty powerful as well.
 
Running VisualVM shows all my CPU time being taken by mainly these 3
methods: 
 
org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
 
My guess is that this has something to do with how I'm handling partial
word matches/highlighting. I have set up another request handler that
only searches the whole word fields and it returns in 850 ms with
highlighting.
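
A rough way to see why the partial-word fields might be so much heavier: if they are built with an n-gram filter, every token expands into many indexed terms, each with its own positions and offsets in the term vector. The sketch below just counts that expansion. The schema fragment quoted later in this thread shows only maxGramSize="7"; the filter type and the minimum gram size are not visible, so the plain (non-edge) n-gram behaviour and `min_size=3` are assumptions here.

```python
def ngrams(token, min_size, max_size):
    """Return every substring of token whose length is in [min_size, max_size]."""
    return [token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)]

# min_size=3 is an assumption; max_size=7 follows the quoted maxGramSize.
grams = ngrams("highlighter", 3, 7)
print(len(grams), "indexed terms for a single 11-letter token")
```

If the real filter behaves like this, a short query term can match a large number of overlapping grams in a long document, which would be consistent with time being spent in overlap checks such as addIfNoOverlap().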
 
Any ideas? 

- Andy


-Original Message-
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com] 
Sent: Monday, May 20, 2013 1:39 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using
FastVectorHighlighter

My guess is that the problem is those 200MB documents.
FastVectorHighlighter is fast at deciding whether a match, especially a
phrase, appears in a document, but it still starts out by walking the
entire list of term vectors, and ends by breaking the document into
candidate-snippet fragments, both processes that are proportional to the
length of the document.

It's hard to do much about the first, but for the second you could choose
to expose FastVectorHighlighter's FieldPhraseList representation, and
return offsets to the caller rather than fragments, building up your own
snippets from a separate store of indexed files. This would also permit
you to set stored="false", improving your memory/core size ratio, which
I'm guessing could use some improving. It would require some work, and it
would require you to store a representation of what was indexed outside
the Solr core, in some constant-bytes-to-character representation that you
can use offsets with (e.g. UTF-16, or ASCII+entity references).
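
To illustrate the offsets idea, here is a minimal sketch of building a snippet from a start/end offset pair against text stored as UTF-16-LE, where each UTF-16 code unit is exactly two bytes, so a character offset maps directly to a byte position. The function name, the `<em>` markers, and the context width are hypothetical choices for the example; nothing here is Solr or Lucene API, and a real store would seek into a file rather than slice an in-memory buffer.

```python
def snippet_from_offsets(utf16_bytes, start, end, context=40):
    """Cut a highlighted snippet around [start, end) from UTF-16-LE text.

    Offsets are counted in UTF-16 code units (2 bytes each), so the byte
    position of an offset is simply 2 * offset.
    """
    total_units = len(utf16_bytes) // 2
    lo = max(0, start - context)
    hi = min(total_units, end + context)

    def cut(a, b):
        # errors="replace" guards against a context edge splitting a surrogate pair.
        return utf16_bytes[2 * a:2 * b].decode("utf-16-le", errors="replace")

    return cut(lo, start) + "<em>" + cut(start, end) + "</em>" + cut(end, hi)

stored = "The quick brown fox jumps over the lazy dog".encode("utf-16-le")
print(snippet_from_offsets(stored, 16, 19, context=6))
```

Because only the bytes around each match are read, the cost of building a snippet no longer depends on the length of the document.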

However, you may not need to do this -- it may be that you just need more
memory for your search machine. Not JVM memory, but memory that the O/S
can use as a file cache. What do you have now? That is, how much memory do
you have that is not used by the JVM or other apps, and how big is your
Solr core?

One way to start getting a handle on where time is being spent is to set
up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
queries, and look at where the time is being spent. If it's mostly in
methods that are just reading from disk, buy more memory. If you're on
Linux, look at what top is telling you. If the CPU usage is low and the
"wa" number is above 1% more often than not, buy more memory (I don't know
why that wa number makes sense, I just know that it has been a good rule
of thumb for us).
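
The "wa" figure in top is the share of CPU time spent waiting on I/O, which on Linux is derived from the aggregate "cpu" line of /proc/stat. As a sketch of checking the rule of thumb above without watching top by hand, the function below computes that percentage from two samples of the line. The function name and the sample values are made up for illustration; the 1% threshold is the one from this message.

```python
def iowait_percent(before_line, after_line):
    """Percent of CPU time spent in iowait between two 'cpu' lines of /proc/stat.

    Fields after the 'cpu' label: user nice system idle iowait irq softirq steal ...
    Only the first eight are summed, since guest time is already part of user.
    """
    before = [int(x) for x in before_line.split()[1:9]]
    after = [int(x) for x in after_line.split()[1:9]]
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait

# Two made-up samples taken a few seconds apart.
first = "cpu  100 0 50 800 10 0 0 0 0 0"
second = "cpu  140 0 60 1240 20 0 0 0 0 0"
wa = iowait_percent(first, second)
print("wa = %.1f%% -> %s" % (wa, "check file cache" if wa > 1.0 else "ok"))
```

On a real machine the two lines would come from reading the first line of /proc/stat twice with a short sleep in between, while the slow highlight queries are running.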

-- Bryan


RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-14 Thread Andy Brown
> > -Original Message-
> > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > Sent: Monday, May 20, 2013 9:53 AM
> > To: solr-user@lucene.apache.org
> > Subject: Slow Highlighter Performance Even Using
FastVectorHighlighter
> >
> > I'm providing a search feature in a web app that searches for documents
> > that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
> > etc). Currently there are about 3000 documents and this will continue to
> > grow. I'm providing full word search and partial word search. For each
> > document, there are three source fields that I'm interested in searching
> > and highlighting on: name, description, and content. Since I'm providing
> > both full and partial word search, I've created additional fields that
> > get tokenized differently: name_par, description_par, and content_par.
> > Those are indexed and stored as well for querying and highlighting. As
> > suggested in the Solr wiki, I've got two catch all fields text and
> > text_par for faster querying.
> >
> > An average search results page displays 25 results and I provide paging.
> > I'm just returning the doc ID in my Solr search results and response
> > times have been quite good (1 to 10 ms). The problem in performance
> > occurs when I turn on highlighting. I'm already using the
> > FastVectorHighlighter and depending on the query, it has taken as long
> > as 15 seconds to get the highlight snippets. However, this isn't always
> > the case. Certain query terms result in 1 sec or less response time. In
> > any case, 15 seconds is way too long.
> >
> > I'm fairly new to Solr but I've spent days coming up with what I've got
> > so far. Feel free to correct any misconceptions I have. Can anyone
> > advise me on what I'm doing wrong or offer a better way to setup my core
> > to improve highlighting performance?
> >
> > A typical query would look like:
> > /select?q=foo&start=0&rows=25&fl=id&hl=true
> >
> > I'm using Solr 4.1. Below the relevant core schema and config details:
> >
> > 
> > 
> >  > required="true" multiValued="false"/>
> >
> >
> > 
> >  > multiValued="true" termPositions="true" termVectors="true"
> > termOffsets="true"/>
> >  > stored="true" multiValued="true" termPositions="true"
> termVectors="true"
> > termOffsets="true"/>
> >  > multiValued="true" termPositions="true" termVectors="true"
> > termOffsets="true"/>
> >  > multiValued="true"/>
> >
> > 
> >  > stored="true" multiValued="true" termPositions="true"
> termVectors="true"
> > termOffsets="true"/>
> >  indexed="true"
> > stored="true" multiValued="true" termPositions="true"
> termVectors="true"
> > termOffsets="true"/>
> >  > stored="true" multiValued="true" termPositions="true"
> termVectors="true"
> > termOffsets="true"/>
> >  > stored="false" multiValued="true"/>
> >
> >
> > 
> > 
> > 
> > 
> >
> > 
> > 
> > 
> > 
> >
> > 
> > 
> > 
> > 
> >
> > 
> >  > positionIncrementGap="100">
> >   
> > 
> >  > words="stopwords.txt" enablePositionIncrements="true" />
> > 
> >   
> >   
> > 
> >  > words="stopwords.txt" enablePositionIncrements="true" />
> >  > ignoreCase="true" expand="true"/>
> > 
> >
> >  
> >
> > 
> >  > positionIncrementGap="100">
> >   
> > 
> >  > words="stopwords.txt" enablePositionIncrements="true" />
> > 
> >  > maxGramSize="7"/>
> >   
> >   
> > 
> >  > words="stopwords.txt" enablePositionIncrements="true" />
> >  > ignoreCase="true" expand="true"/>
> > 
> >   
> > 
> >
> >
> >
> > 
> > 
> >  
> >explicit
> >10
> >text
> >edismax
> >text^2 text_par^1   
> >true
> >true
> >true
> >true
> >breakIterator
> >2
> >name name_par description description_par
> > content content_par
> >162
> >simple
> >default
> >
> >
> >
> >
> >  
> >  
> >
> >
> > Cheers!
> >
> > - Andy