RE: Please any idea? Highlighting exact phrases with solr

2013-10-14 Thread Bryan Loofbourrow
Sil,

When you switched over to using the Fast Vector Highlighter, did you
change your schema so that the fields that you want to highlight provide
term vector information, and reindex your documents? Term vectors are
necessary when using the Fast Vector Highlighter. Posting your schema may
show valuable clues to the problem you're seeing. For example, the wiki
HighlightingParameters page says you should have 'termVectors=on' etc, but
if you actually put termVectors="on" in your field definition, I don't
think that would work; it's actually termVectors="true".
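
For reference, a field wired up for the Fast Vector Highlighter would look
something like this (the field name and type below are placeholders; the
attribute names are the ones the schema actually uses):

<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>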

-- Bryan

> -Original Message-
> From: Silvia Suárez [mailto:s...@anpro21.com]
> Sent: Monday, October 14, 2013 1:17 AM
> To: solr-user@lucene.apache.org; Koji Sekiguchi
> Subject: Re: Please any idea? Highlighting exact phrases with solr
>
> Good morning,
>
> Please, could anyone offer an idea or solution for the problem below?
>
> Thanks a lot in advance
>
> Sil,
>
> Silvia Suárez Barón
> R&D
> 972 989 470 / s...@anpro21.com
>
> *Technologies and SaaS for commercial brand analysis.*
>
>
>
>
> 2013/10/11 Silvia Suárez 
>
> > Dear Koji,
> >
> > Thanks a lot for your answer, and sorry about my English.
> >
> > I tried to configure the FastVectorHighlighter
> > (hl.useFastVectorHighlighter).
> >
> > However, I have this error:
> >
> >
> > fragCharSize(1) is too small. It must be 18 or higher.
> >
> > java.lang.IllegalArgumentException: fragCharSize(1) is too small. It must be 18 or higher.
> >   at org.apache.lucene.search.vectorhighlight.BaseFragListBuilder.createFieldFragList(BaseFragListBuilder.java:51)
> >   at org.apache.lucene.search.vectorhighlight.WeightedFragListBuilder.createFieldFragList(WeightedFragListBuilder.java:38)
> >   at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getFieldFragList(FastVectorHighlighter.java:195)
> >   at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragments(FastVectorHighlighter.java:184)
> >   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByFastVectorHighlighter(DefaultSolrHighlighter.java:588)
> >   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:413)
> >   at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:139)
> >   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
> >   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> >   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> >   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> >   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> >   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
> >   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
> >   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.jav

RE: Highlighting externally stored text

2013-07-16 Thread Bryan Loofbourrow
> I'm trying to find a way to best highlight search results even though
> those
> results are not stored in my index.  Has anyone been successful in
reusing
> the SOLR highlighting logic on non-stored data?

I was able to do this by slightly modifying the FastVectorHighlighter so
that it returned before computing snippets, instead returning the term
match offsets in the FieldPhraseList class. Of course you need to make
sure that your files are encoded in such a way that a character always has
the same byte width.
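
To give a rough idea of the shape of that change, here is a minimal sketch
(the class and method names are mine, not stock Lucene; it sits in the
org.apache.lucene.search.vectorhighlight package so it can read
FieldPhraseList's non-public phrase list):

package org.apache.lucene.search.vectorhighlight;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList.WeightedPhraseInfo;

public class MatchOffsetExtractor {
  // Return [start, end) character offsets of the query's matches in one
  // field of one document, instead of letting FVH cut snippet fragments.
  // The field must have termVectors/termPositions/termOffsets enabled.
  public static List<int[]> matchOffsets(FastVectorHighlighter fvh,
      Query query, IndexReader reader, int docId, String fieldName)
      throws IOException {
    FieldQuery fieldQuery = fvh.getFieldQuery(query);
    FieldTermStack termStack =
        new FieldTermStack(reader, docId, fieldName, fieldQuery);
    FieldPhraseList fieldPhraseList = new FieldPhraseList(termStack, fieldQuery);
    List<int[]> offsets = new ArrayList<int[]>();
    for (WeightedPhraseInfo info : fieldPhraseList.phraseList) {
      offsets.add(new int[] { info.getStartOffset(), info.getEndOffset() });
    }
    return offsets;
  }
}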

-- Bryan


RE: Highlighting externally stored text

2013-07-31 Thread Bryan Loofbourrow
> Hey Bryan, Thanks for the response!  To make use of the
> FastVectorHighlighter
> you need to enable termVectors, termPositions, and termOffsets correct?
> Which takes a considerable amount of space, but is good to know and I
may
> possibly pursue this solution as well.  Just starting to look at the
code
> now, do you remember how substantial the change was?
>
> Are there any other options?

John,

Yes, you do need to enable those, and yes, it takes a considerable amount
of space.

It has been a while, but the change itself was not too bad, mostly at the
top level, isolating an interface that returns the structure you need, and
transposing that into something for Solr to return.

The only other issues are around queries. If FVH supports all the queries
you use, great. If it's just missing something simple to deal with, like
DisjunctionMaxQuery, then it's just adding another rewrite call.
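
For instance, a hypothetical pre-pass along these lines (names are mine)
would hand FVH a plain boolean OR in place of each DisjunctionMaxQuery,
which is equivalent for highlighting purposes, since the highlighter only
needs to know what can match, not how it scores:

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;

public final class HighlightQueryRewriter {
  // Replace a DisjunctionMaxQuery (and any DisjunctionMaxQuery it directly
  // contains) with an OR of its disjuncts before handing it to FVH.
  public static Query flattenForHighlighting(Query q) {
    if (q instanceof DisjunctionMaxQuery) {
      BooleanQuery bq = new BooleanQuery();
      for (Query disjunct : (DisjunctionMaxQuery) q) {
        bq.add(flattenForHighlighting(disjunct), Occur.SHOULD);
      }
      bq.setBoost(q.getBoost());
      return bq;
    }
    return q;
  }
}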

But if you are using the SpanQuery hierarchy, it's much trickier. I did in
fact do an implementation for that, but it was not very satisfactory --
transposing unordered SpanNearQuery into the representation used by FVH
was an O(n!) operation, and the complexity of the implementation was quite
high, for a number of reasons including lack of FVH representation for
mixed-slop phrases.

I don't know of other options -- except for the one I finally wound up
doing, which was writing my own highlighter, which unfortunately I am not
in a position to share for reasons not my own. But the main reason for
that was the SpanNearQuery support, which may not be a problem you have.

It's possible that something similar could be done with the Postings
highlighter, but I did not look too deeply into that, because the lack of
phrase support was a blocker.

-- Bryan


RE: Solr highlighting fragment issue

2013-09-04 Thread Bryan Loofbourrow
>> I’m having some issues with Solr search results (using Solr 1.4). I
have enabled highlighting of searched text (hl=true) and set the fragment
size as 500 (hl.fragsize=500) in the search query.

Below is a screenshot of the results shown when I searched for the term
‘grandfather’ (2 results are displayed).

Now I have couple of problems in this.

1.   In the search results the keyword is appearing inconsistently
towards the start/end of the text. I’d like to control the number of
characters appearing before and after the keyword match (highlighted term).
More specifically I’d like to get the keyword match somewhere around the
middle of the resultant text.

2.   The total number of characters appearing in the search result never
equals the fragment size I specified (500 characters). It varies to a fair
extent (for example 408 or 520).

Please share your thoughts on achieving the above 2 results. <<

I can’t see your screenshot, but it doesn’t really matter.



If I remember correctly how this stuff works, I think you’re going to have
a challenge getting where you want to get. In your position, I would push
back on both of those requirements rather than try to solve the problem.



For (1), the issue is that, IIRC, the highlighter breaks up your documents
into fragments BEFORE it knows where the matches are. I’d think you’d have
to pretty seriously recast the algorithm to get the result you want.



For (2), it may well be that you could tune the fragmenter to get closer to
your desired number of characters, either writing your own, or using the
available regexes and whatnot. But getting an exact number of characters
does not seem reasonable, because I’m pretty sure that there is a
constraint that a matching term must appear in its entirety in one fragment
– and also that sometimes fragments are concatenated. Imagine, for example,
a matched phrase where the start of the phrase is in one fragment, and the
end is in another. Which goes back to the first point.
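
That said, if you do want to experiment, the regex fragmenter of the
standard highlighter exposes knobs along these lines (the parameter names
are real; the pattern itself is only an illustration):

hl.fragmenter=regex&hl.fragsize=500&hl.regex.slop=0.2&hl.regex.pattern=[-\w ,/\n\"']{200,600}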



So if you absolutely must have both of these (and the second one is
strange, since it implies that your fragments will often start and end in
the middles of words), then I guess you would need to rewrite the
fragmenting algorithm to drive fragmenting from the matches.



-- Bryan


RE: Some highlighted snippets aren't being returned

2013-09-08 Thread Bryan Loofbourrow
Eric,

Your example document is quite long. Are you setting hl.maxAnalyzedChars?
If you don't, the highlighter you appear to be using will not look past
the first 51,200 characters of the document for snippet candidates.

http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
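
For example (the value below is just Integer.MAX_VALUE, i.e. effectively
unlimited; size it to your longest documents):

/select?q=unangan&hl=true&hl.fl=contents&hl.maxAnalyzedChars=2147483647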

-- Bryan


> -Original Message-
> From: Eric O'Hanlon [mailto:elo2...@columbia.edu]
> Sent: Sunday, September 08, 2013 2:01 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Some highlighted snippets aren't being returned
>
> Hi again Everyone,
>
> I didn't get any replies to this, so I thought I'd re-send in case anyone
> missed it and has any thoughts.
>
> Thanks,
> Eric
>
> On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon  wrote:
>
> > Hi Everyone,
> >
> > I'm facing an issue in which my solr query is returning highlighted
> > snippets for some, but not all results. For reference, I'm searching
> > through an index that contains web crawls of human-rights-related
> > websites. I'm running solr as a webapp under Tomcat and I've included the
> > query's solr params from the Tomcat log:
> >
> > ...
> > webapp=/solr-4.2
> > path=/select
> >
> > params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan
> > &f.mimetype_code.facet.limit=7&hl.simple.pre=&q.alt=*:*
> > &f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6
> > &hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url
> > &hl.simple.post=&facet.field=domain&facet.field=date_of_capture_
> > &facet.field=mimetype_code&facet.field=geographic_focus__facet
> > &facet.field=organization_based_in__facet&facet.field=organization_type__facet
> > &facet.field=language__facet&facet.field=creator_name__facet
> > &hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1
> > &qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby
> > &f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10
> > &f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6
> > &q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8
> > status=0 QTime=108
> > ...
> >
> > For the query above (which can be simplified to say: find all documents
> > that contain the word "unangan" and return facets, highlights, etc.), I
> > get five search results. Only three of these are returning highlighted
> > snippets. Here's the "highlighting" portion of the solr response (note:
> > printed in ruby notation because I'm receiving this response in a Rails
> > app):
> >
> > 
> > "highlighting"=>
> >
>
{"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%
> 202002%20tentang%20Perlindungan%20Anak.pdf"=>
> >{},
> >
>
"20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%2
> 02002%20tentang%20Perlindungan%20Anak.pdf"=>
> >{},
> >
>
"20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%2
> 02002%20tentang%20Perlindungan%20Anak.pdf"=>
> >{},
> >   "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
> >{"contents"=>
> >  ["...actual snippet is returned here..."]},
> >   "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
> >{"contents"=>
> >  ["...actual snippet is returned here..."]},
> >   "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-
> uu-no-39-tahun-1999"=>
> >{"contents"=>
> >  ["...actual snippet is returned here..."]},
> >
"20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-
> 39-tahun-1999?tmpl=component&format=raw"=>
> >{"contents"=>
> >  ["...actual snippet is returned here..."]},
> >
>
"20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_U
> timut_heritage.pdf"=>
> >{}}
> > 
> >
> > I have eight (as opposed to five) results above because I'm also doing a
> > grouped query, grouping by a field called "original_url", and this leads
> > to five grouped results.
> >
> > I've confirmed that my highlight-lacking results DO contain the word
> > "unangan", as expected, and this term is appearing in a text field that's
> > indexed and stored, and being searched for all text searches. For
> > example, one of the search results is for a crawl of this document:
> > http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf
> >
> > And if you view that document on the web, you'll see that it does
> > contain "unangan".
> >
> > Has anyone seen this before? And does anyone have any good suggestions
> > for troubleshooting/fixing the problem?
> >
> > Thanks!
> >
> > - Eric


RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-05-20 Thread Bryan Loofbourrow
My guess is that the problem is those 200M documents.
FastVectorHighlighter is fast at deciding whether a match, especially a
phrase, appears in a document, but it still starts out by walking the
entire list of term vectors, and ends by breaking the document into
candidate-snippet fragments, both processes that are proportional to the
length of the document.

It's hard to do much about the first, but for the second you could choose
to expose FastVectorHighlighter's FieldPhraseList representation, and
return offsets to the caller rather than fragments, building up your own
snippets from a separate store of indexed files. This would also permit
you to set stored="false", improving your memory/core size ratio, which
I'm guessing could use some improving. It would require some work, and it
would require you to store a representation of what was indexed outside
the Solr core, in some constant-bytes-to-character representation that you
can use offsets with (e.g. UTF-16, or ASCII+entity references).
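
As a minimal sketch of that offsets-into-an-external-store idea, assuming
each document's text is kept in a BOM-less big-endian UTF-16 file (so one
Java char is always exactly two bytes):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

public final class ExternalTextStore {
  // Pull the text between two character offsets (as reported by the
  // highlighter) out of a file stored as UTF-16BE without a BOM. Java
  // char offsets are UTF-16 code units, so byte position = 2 * offset.
  public static String slice(RandomAccessFile file, int startChar, int endChar)
      throws IOException {
    byte[] buf = new byte[(endChar - startChar) * 2];
    file.seek(2L * startChar);
    file.readFully(buf);
    return new String(buf, Charset.forName("UTF-16BE"));
  }
}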

However, you may not need to do this -- it may be that you just need more
memory for your search machine. Not JVM memory, but memory that the O/S
can use as a file cache. What do you have now? That is, how much memory do
you have that is not used by the JVM or other apps, and how big is your
Solr core?

One way to start getting a handle on where time is being spent is to set
up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
queries, and look at where the time is being spent. If it's mostly in
methods that are just reading from disk, buy more memory. If you're on
Linux, look at what top is telling you. If the CPU usage is low and the
"wa" number is above 1% more often than not, buy more memory (I don't know
why that wa number makes sense, I just know that it has been a good rule
of thumb for us).

-- Bryan

> -Original Message-
> From: Andy Brown [mailto:andy_br...@rhoworld.com]
> Sent: Monday, May 20, 2013 9:53 AM
> To: solr-user@lucene.apache.org
> Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
>
> I'm providing a search feature in a web app that searches for documents
> that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
> etc). Currently there are about 3000 documents and this will continue to
> grow. I'm providing full word search and partial word search. For each
> document, there are three source fields that I'm interested in searching
> and highlighting on: name, description, and content. Since I'm providing
> both full and partial word search, I've created additional fields that
> get tokenized differently: name_par, description_par, and content_par.
> Those are indexed and stored as well for querying and highlighting. As
> suggested in the Solr wiki, I've got two catch all fields text and
> text_par for faster querying.
>
> An average search results page displays 25 results and I provide paging.
> I'm just returning the doc ID in my Solr search results and response
> times have been quite good (1 to 10 ms). The problem in performance
> occurs when I turn on highlighting. I'm already using the
> FastVectorHighlighter and depending on the query, it has taken as long
> as 15 seconds to get the highlight snippets. However, this isn't always
> the case. Certain query terms result in 1 sec or less response time. In
> any case, 15 seconds is way too long.
>
> I'm fairly new to Solr but I've spent days coming up with what I've got
> so far. Feel free to correct any misconceptions I have. Can anyone
> advise me on what I'm doing wrong or offer a better way to setup my core
> to improve highlighting performance?
>
> A typical query would look like:
> /select?q=foo&start=0&rows=25&fl=id&hl=true
>
> I'm using Solr 4.1. Below the relevant core schema and config details:
>
> <fields>
> <field name="id" type="string" indexed="true" stored="true"
> required="true" multiValued="false"/>
>
> <field name="name" type="text_general" indexed="true" stored="true"
> multiValued="true" termPositions="true" termVectors="true"
> termOffsets="true"/>
> <field name="description" type="text_general" indexed="true"
> stored="true" multiValued="true" termPositions="true" termVectors="true"
> termOffsets="true"/>
> <field name="content" type="text_general" indexed="true" stored="true"
> multiValued="true" termPositions="true" termVectors="true"
> termOffsets="true"/>
> <field name="text" type="text_general" indexed="true" stored="false"
> multiValued="true"/>
>
> <field name="name_par" type="text_general_partial" indexed="true"
> stored="true" multiValued="true" termPositions="true" termVectors="true"
> termOffsets="true"/>
> <field name="description_par" type="text_general_partial" indexed="true"
> stored="true" multiValued="true" termPositions="true" termVectors="true"
> termOffsets="true"/>
> <field name="content_par" type="text_general_partial" indexed="true"
> stored="true" multiValued="true" termPositions="true" termVectors="true"
> termOffsets="true"/>
> <field name="text_par" type="text_general_partial" indexed="true"
> stored="false" multiValued="true"/>
> </fields>
>
> <copyField source="name" dest="text"/>
> <copyField source="description" dest="text"/>
> <copyField source="content" dest="text"/>
>
> <copyField source="name_par" dest="text_par"/>
> <copyField source="description_par" dest="text_par"/>
> <copyField source="content_par" dest="text_par"/>
>
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_general_partial" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> maxGramSize="7"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncr

RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-05-29 Thread Bryan Loofbourrow
Andy,

> I don't understand why it's taking 7 secs to return highlights. The size
> of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
> 1024 for this verification purpose and that should be more than enough.
> The processor is plenty powerful enough as well.
>
> Running VisualVM shows all my CPU time being taken by mainly these 3
> methods:
>
> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

That is a strange and interesting set of things to be spending most of
your CPU time on. The implication, I think, is that the number of term
matches in the document for terms in your query (or, at least, terms
matching exact words or the beginning of phrases in your query) is
extremely high. Perhaps that's coming from this "partial word match" you
mention -- how does that work?

-- Bryan

> My guess is that this has something to do with how I'm handling partial
> word matches/highlighting. I have setup another request handler that
> only searches the whole word fields and it returns in 850 ms with
> highlighting.
>
> Any ideas?
>
> - Andy
>
>
> -Original Message-
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Monday, May 20, 2013 1:39 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Slow Highlighter Performance Even Using
> FastVectorHighlighter
>
> My guess is that the problem is those 200M documents.
> FastVectorHighlighter is fast at deciding whether a match, especially a
> phrase, appears in a document, but it still starts out by walking the
> entire list of term vectors, and ends by breaking the document into
> candidate-snippet fragments, both processes that are proportional to the
> length of the document.
>
> It's hard to do much about the first, but for the second you could
> choose
> to expose FastVectorHighlighter's FieldPhraseList representation, and
> return offsets to the caller rather than fragments, building up your own
> snippets from a separate store of indexed files. This would also permit
> you to set stored="false", improving your memory/core size ratio, which
> I'm guessing could use some improving. It would require some work, and
> it
> would require you to store a representation of what was indexed outside
> the Solr core, in some constant-bytes-to-character representation that
> you
> can use offsets with (e.g. UTF-16, or ASCII+entity references).
>
> However, you may not need to do this -- it may be that you just need
> more
> memory for your search machine. Not JVM memory, but memory that the O/S
> can use as a file cache. What do you have now? That is, how much memory
> do
> you have that is not used by the JVM or other apps, and how big is your
> Solr core?
>
> One way to start getting a handle on where time is being spent is to set
> up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
> queries, and look at where the time is being spent. If it's mostly in
> methods that are just reading from disk, buy more memory. If you're on
> Linux, look at what top is telling you. If the CPU usage is low and the
> "wa" number is above 1% more often than not, buy more memory (I don't
> know
> why that wa number makes sense, I just know that it has been a good rule
> of thumb for us).
>
> -- Bryan
>
> > -Original Message-
> > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > Sent: Monday, May 20, 2013 9:53 AM
> > To: solr-user@lucene.apache.org
> > Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
> >
> > I'm providing a search feature in a web app that searches for documents
> > that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
> > etc). Currently there are about 3000 documents and this will continue to
> > grow. I'm providing full word search and partial word search. For each
> > document, there are three source fields that I'm interested in searching
> > and highlighting on: name, description, and content. Since I'm providing
> > both full and partial word search, I've created additional fields that
> > get tokenized differently: name_par, description_par, and content_par.
> > Those are indexed and stored as well for querying and highlighting. As
> > suggested in the Solr wiki, I've got two catch all fields text and
> > text_par for faster querying.
> >
> > An average search results page display

RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-18 Thread Bryan Loofbourrow
Andy,

OK, I get what you're doing. As far as alternate paths, you could index
normally and use WildcardQuery, but that wouldn't get you the boost on
exact word matches. That makes me wonder whether there's a way to use
edismax to combine the results of a wildcard search and a non-wildcard
search against the same field, boosting the latter. I haven't looked into
it, but it seems possible that it might be done.
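
For instance, something along these lines might work (untested; the boost
values are arbitrary):

q=(foo^10 OR foo*)&defType=edismax&qf=text

so the exact token and its wildcard expansion hit the same field, with the
exact form boosted.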

I am perplexed at this point by the poor highlight performance you're
seeing, but we do have your profiling data that suggests that you have a
very large number of matches to contend with, so that's interesting.

At this point, faced with your issue, I would step my way through the
FastVectorHighlighter code. About the first thing it does for each field
is walk the terms in the document, and retain only those that matched some
terms in the query. It may be interesting to see this set of terms it ends
up with -- is it excessively large for some reason?

-- Bryan

> -Original Message-
> From: Andy Brown [mailto:andy_br...@rhoworld.com]
> Sent: Friday, June 14, 2013 1:52 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
>
> Bryan,
>
> For specifics, I'll refer you back to my original email where I
> specified all the fields/field types/handlers I use. Here's a general
> overview.
>
> I really only have 3 fields that I index and search against: "name",
> "description", and "content". All of which are just general text
> (string) fields. I have a catch-all field called "text" that is only
> used for querying. It's indexed but not stored. The "name",
> "description", and "content" fields are copied into the "text" field.
>
> For partial word matching, I have 4 more fields: "name_par",
> "description_par", "content_par", and "text_par". The "text_par" field
> has the same relationship to the "*_par" fields as "text" does to the
> others (only used for querying). Those partial word matching fields are
> of type "text_general_partial" which I created. That field type is
> analyzed different than the regular text field in that it goes through
> an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> at index time.
>
> I query against both "text" and "text_par" fields using edismax deftype
> with my qf set to "text^2 text_par^1" to give full word matches a higher
> score. This part returns back very fast as previously stated. It's when
> I turn on highlighting that I take the huge performance hit.
>
> Again, I'm using the FastVectorHighlighting. The hl.fl is set to "name
> name_par description description_par content content_par" so that it
> returns highlights for full and partial word matches. All of those
> fields have indexed, stored, termPositions, termVectors, and termOffsets
> set to "true".
>
> It all seems redundant just to allow for partial word
> matching/highlighting but I didn't know of a better way. Does anything
> stand out to you that could be the culprit? Let me know if you need any
> more clarification.
>
> Thanks!
>
> - Andy
>
> -Original Message-
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Wednesday, May 29, 2013 5:44 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Slow Highlighter Performance Even Using
> FastVectorHighlighter
>
> Andy,
>
> > I don't understand why it's taking 7 secs to return highlights. The size
> > of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
> > 1024 for this verification purpose and that should be more than enough.
> > The processor is plenty powerful enough as well.
> >
> > Running VisualVM shows all my CPU time being taken by mainly these 3
> > methods:
> >
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
>
> That is a strange and interesting set of things to be spending most of
> your CPU time on. The implication, I think, is that the number of term
> matches in the document for terms in your query (or, at least, terms
> matching exact words or the beginning of phrases in your query) is
> extremely high. Perhaps that's coming from this "partial word match" you
> mention -- how does that work?

RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-18 Thread Bryan Loofbourrow
Also, in your position, I would be very curious what would happen to
highlighting performance, if I just took the EdgeNGramFilter out of the
analysis chain and reindexed. That would immediately tell you that the
problem lives there (or not).

-- Bryan

> -Original Message-
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Tuesday, June 18, 2013 5:16 PM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
>
> Andy,
>
> OK, I get what you're doing. As far as alternate paths, you could index
> normally and use WildcardQuery, but that wouldn't get you the boost on
> exact word matches. That makes me wonder whether there's a way to use
> edismax to combine the results of a wildcard search and a non-wildcard
> search against the same field, boosting the latter. I haven't looked into
> it, but it seems possible that it might be done.
>
> I am perplexed at this point by the poor highlight performance you're
> seeing, but we do have your profiling data that suggests that you have a
> very large number of matches to contend with, so that's interesting.
>
> At this point, faced with your issue, I would step my way through the
> FastVectorHighlighter code. About the first thing it does for each field
> is walk the terms in the document, and retain only those that matched some
> terms in the query. It may be interesting to see this set of terms it ends
> up with -- is it excessively large for some reason?
>
> -- Bryan
>
> > -Original Message-
> > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > Sent: Friday, June 14, 2013 1:52 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
> >
> > Bryan,
> >
> > For specifics, I'll refer you back to my original email where I
> > specified all the fields/field types/handlers I use. Here's a general
> > overview.
> >
> > I really only have 3 fields that I index and search against: "name",
> > "description", and "content". All of which are just general text
> > (string) fields. I have a catch-all field called "text" that is only
> > used for querying. It's indexed but not stored. The "name",
> > "description", and "content" fields are copied into the "text" field.
> >
> > For partial word matching, I have 4 more fields: "name_par",
> > "description_par", "content_par", and "text_par". The "text_par" field
> > has the same relationship to the "*_par" fields as "text" does to the
> > others (only used for querying). Those partial word matching fields are
> > of type "text_general_partial" which I created. That field type is
> > analyzed different than the regular text field in that it goes through
> > an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> > at index time.
> >
> > I query against both "text" and "text_par" fields using edismax deftype
> > with my qf set to "text^2 text_par^1" to give full word matches a higher
> > score. This part returns back very fast as previously stated. It's when
> > I turn on highlighting that I take the huge performance hit.
> >
> > Again, I'm using the FastVectorHighlighting. The hl.fl is set to "name
> > name_par description description_par content content_par" so that it
> > returns highlights for full and partial word matches. All of those
> > fields have indexed, stored, termPositions, termVectors, and termOffsets
> > set to "true".
> >
> > It all seems redundant just to allow for partial word
> > matching/highlighting but I didn't know of a better way. Does anything
> > stand out to you that could be the culprit? Let me know if you need any
> > more clarification.
> >
> > Thanks!
> >
> > - Andy
> >
> > -Original Message-
> > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > Sent: Wednesday, May 29, 2013 5:44 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using
> > FastVectorHighlighter
> >
> > Andy,
> >
> > > I don't understand why it's taking 7 secs to return highlights. The size
> > > of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
> > > 1024 for this verification purpose and that should be 

Improving proximity search performance

2012-02-16 Thread Bryan Loofbourrow
Here’s my use case. I expect to set up a Solr index that is approximately
1.4GB (this is a real number from the proof-of-concept using the real data,
which consists of about 10 million documents, many of significant size, and
making use of the FastVectorHighlighter to do highlighting on the body text
field, which is of course stored, and with termVectors, termPositions, and
termOffsets on).



I no longer have the proof-of-concept Solr core available (our live site
uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical
answer to this question: Will storing that extra information about the
location of terms help the performance of proximity searches?



A significant and important subset of my users make extensive use of
proximity searches. These sophisticated users have found that they are best
able to locate what they want by doing searches about THISWORD within 5
words of THATWORD, or much more sophisticated variants on that theme,
including plenty of booleans and wildcards. The problem I’m facing is
performance. Some of these searches, when common words are used, can take
many minutes, even with the index on an SSD.
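
For concreteness, in standard Lucene/Solr query syntax such a search is
written as, e.g., "THISWORD THATWORD"~5, where the ~5 is the positional
slop, and it can be combined with booleans and wildcards:
("foo bar"~5 AND baz*).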



The question is, how to improve the performance. It occurred to me as
possible that all of that term vector information, stored for the benefit
of the FastVectorHighlighter, might be a significant aid to the performance
of these searches.



First question: is that already the case? Will storing this extra
information automatically improve my proximity search performance?



Second question: If not, I’m very willing to dive into the code and come up
with a patch that would do this. Can someone with knowledge of the
internals comment on whether this is a plausible strategy for improving
performance, and, if so, give tips about the outlines of what a successful
approach to the problem might look like?



Third question: Any tips in general for improving the performance of these
proximity searches? I have explored the question of whether the customers
might be weaned off of them, and that does not appear to be an option.



Thanks,



-- Bryan Loofbourrow


RE: Frequent garbage collections after a day of operation

2012-02-16 Thread Bryan Loofbourrow
A couple of thoughts:

We wound up doing a bunch of tuning on the Java garbage collection.
However, the pattern we were seeing was periodic very extreme slowdowns,
because we were then using the default garbage collector, which blocks
when it has to do a major collection. This doesn't sound like your
problem, but it's something to be aware of.

One thing that could fit the pattern you describe would be Solr caches
filling up and getting you too close to your JVM or memory limit. For
example, if you have large documents, and have defined a large document
cache, that might do it.

I found it useful to point jconsole (free with the JDK) at my JVM, and
watch the pattern of memory usage. If the troughs at the bottom of the GC
cycles keep rising, you know you've got something that is continuing to
grab more memory and not let go of it. Now that our JVM is running
smoothly, we just see a sawtooth pattern, with the troughs approximately
level. When the system is under load, the frequency of the wave rises. Try
it and see what sort of pattern you're getting.
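
If you'd rather capture that unattended than sit watching jconsole, the
classic HotSpot GC log flags (JDK 6/7 era) record the same sawtooth, e.g.:

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar start.jar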

-- Bryan

> -Original Message-
> From: Matthias Käppler [mailto:matth...@qype.com]
> Sent: Thursday, February 16, 2012 7:23 AM
> To: solr-user@lucene.apache.org
> Subject: Frequent garbage collections after a day of operation
>
> Hey everyone,
>
> we're running into some operational problems with our SOLR production
> setup here and were wondering if anyone else is affected or has even
> solved these problems before. We're running a vanilla SOLR 3.4.0 in
> several Tomcat 6 instances, so nothing out of the ordinary, but after
> a day or so of operation we see increased response times from SOLR, up
> to 3 times increases on average. During this time we see increased CPU
> load due to heavy garbage collection in the JVM, which bogs down the
> the whole system, so throughput decreases, naturally. When restarting
> the slaves, everything goes back to normal, but that's more like a
> brute force solution.
>
> The thing is, we don't know what's causing this and we don't have that
> much experience with Java stacks since we're for most parts a Rails
> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
> seeing this, or can you think of a reason for this? Most of our
> queries to SOLR involve the DismaxHandler and the spatial search query
> components. We don't use any custom request handlers so far.
>
> Thanks in advance,
> -Matthias
>
> --
> Matthias Käppler
> Lead Developer API & Mobile
>
> Qype GmbH
> Großer Burstah 50-52
> 20457 Hamburg
> Telephone: +49 (0)40 - 219 019 2 - 160
> Skype: m_kaeppler
> Email: matth...@qype.com
>
> Managing Director: Ian Brotherston
> Amtsgericht Hamburg
> HRB 95913


RE: Improving proximity search performance

2012-02-17 Thread Bryan Loofbourrow
Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little
wonder that no one thought the question was interesting, or figured I must
be using Sneakernet to run my searches.



-- Bryan Loofbourrow


  --

*From:* Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
*Sent:* Thursday, February 16, 2012 7:07 PM
*To:* 'solr-user@lucene.apache.org'
*Subject:* Improving proximity search performance





Exception using distributed field-collapsing

2012-06-20 Thread Bryan Loofbourrow
I am doing a search on three shards with identical schemas (I
double-checked!), using the group feature, and Solr/Lucene 3.5. Solr is
giving me back the exception listed at the bottom of this email:



Other information:



My schema uses the following field types: StrField, DateField,
TrieDateField, TextField, SortableInt, SortableLong, BoolField



My query looks like this (I’ve messed with it to anonymize but, I hope,
kept the essentials):



http://[solr core2]/select/?&start=0&rows=25&q={!qsol}machines&sort=[sort field]
&fl=[list of fields]&shards=[solr core1]%2c[solr core2]%2c[solr core3]
&group=true&group.field=[group field]



java.lang.ClassCastException: java.util.Date cannot be cast to java.lang.String
  at org.apache.lucene.search.FieldComparator$StringOrdValComparator.compareValues(FieldComparator.java:844)
  at org.apache.lucene.search.grouping.SearchGroup$GroupComparator.compare(SearchGroup.java:180)
  at org.apache.lucene.search.grouping.SearchGroup$GroupComparator.compare(SearchGroup.java:154)
  at java.util.TreeMap.put(TreeMap.java:547)
  at java.util.TreeSet.add(TreeSet.java:255)
  at org.apache.lucene.search.grouping.SearchGroup$GroupMerger.updateNextGroup(SearchGroup.java:222)
  at org.apache.lucene.search.grouping.SearchGroup$GroupMerger.merge(SearchGroup.java:285)
  at org.apache.lucene.search.grouping.SearchGroup.merge(SearchGroup.java:340)
  at org.apache.solr.search.grouping.distributed.responseprocessor.SearchGroupShardResponseProcessor.process(SearchGroupShardResponseProcessor.java:77)
  at org.apache.solr.handler.component.QueryComponent.handleGroupedResponses(QueryComponent.java:565)
  at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:548)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:289)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
  at java.lang.Thread.run(Thread.java:679)



Any thoughts or advice?



Thanks,



-- Bryan


RE: Exception using distributed field-collapsing

2012-06-20 Thread Bryan Loofbourrow
> Hi Bryan,
>
> What is the fieldtype of the groupField? You can only group by field
> that is of type string as is described in the wiki:
> http://wiki.apache.org/solr/FieldCollapsing#Request_Parameters
>
> When you group by another field type a http 400 should be returned
> instead if this error. At least that what I'd expect.
>
> Martijn

Martijn,

The group-by field is a string. I have been unable to figure out how a date
comes into the picture at all, and have basically been wondering if there
is some problem in the grouping code that misaligns the field values from
different results in the group, so that it is not comparing like with
like. Not a strong theory, just the only thing I can think of.

-- Bryan


RE: Exception using distributed field-collapsing

2012-06-21 Thread Bryan Loofbourrow
Cody,

> Does it work in the non distributed case?

Yes.

>
> Is the field you're grouping on stored? What is the type on the uniqueKey
> field? Is it stored and indexed?

The field I'm grouping on is a string, stored and indexed. The unique key
field is a string, stored and indexed.

> I've had a problem with distributed not working when the uniqueKey field
> was indexed but not stored.

Was it the same exception I'm seeing?

-- Bryan

>
> -Original Message-
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Wednesday, June 20, 2012 1:54 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Exception using distributed field-collapsing
>
> > Hi Bryan,
> >
> > What is the fieldtype of the groupField? You can only group by field
> > that is of type string as is described in the wiki:
> > http://wiki.apache.org/solr/FieldCollapsing#Request_Parameters
> >
> > When you group by another field type a http 400 should be returned
> > instead if this error. At least that what I'd expect.
> >
> > Martijn
>
> Martijn,
>
> The group-by field is a string. I have been unable to figure out how a date
> comes into the picture at all, and have basically been wondering if there
> is some problem in the grouping code that misaligns the field values from
> different results in the group, so that it is not comparing like with
> like. Not a strong theory, just the only thing I can think of.
>
> -- Bryan


RE: Exception using distributed field-collapsing

2012-06-21 Thread Bryan Loofbourrow
> Does a *:* query with no sorting work?

Well, this is interesting. Leaving q= as it was, but removing the sort,
makes the whole thing work.

And if you were thinking of asking whether the sort field is a date, the
answer is yes, it's an indexed and stored DateField. It's also on the list
of fields whose values I am requesting with fl=.

So I guess this is likely to be the date that is somehow turning up in the
ClassCastException. Great suggestion. Thanks, Cody.

Now I'm wondering if anyone familiar with the Field Collapsing code can
see a possible vector for a bug, given this fleshing out of the bug
conditions.

-- Bryan

> -Original Message-
> From: Young, Cody [mailto:cody.yo...@move.com]
> Sent: Thursday, June 21, 2012 11:04 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Exception using distributed field-collapsing
>
> No, I believe it was a different exception, just brainstorming. (it was a
> null reference iirc)
>
> Does a *:* query with no sorting work?
>
> Cody
>
> -Original Message-
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Thursday, June 21, 2012 10:33 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Exception using distributed field-collapsing
>
> Cody,
>
> > Does it work in the non distributed case?
>
> Yes.
>
> >
> > Is the field you're grouping on stored? What is the type on the uniqueKey
> > field? Is it stored and indexed?
>
> The field I'm grouping on is a string, stored and indexed. The unique key
> field is a string, stored and indexed.
>
> > I've had a problem with distributed not working when the uniqueKey
> > field was indexed but not stored.
>
> Was it the same exception I'm seeing?
>
> -- Bryan
>
> >
> > -Original Message-
> > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > Sent: Wednesday, June 20, 2012 1:54 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Exception using distributed field-collapsing
> >
> > > Hi Bryan,
> > >
> > > What is the fieldtype of the groupField? You can only group by field
> > > that is of type string as is described in the wiki:
> > > http://wiki.apache.org/solr/FieldCollapsing#Request_Parameters
> > >
> > > When you group by another field type a http 400 should be returned
> > > instead if this error. At least that what I'd expect.
> > >
> > > Martijn
> >
> > Martijn,
> >
> > The group-by field is a string. I have been unable to figure out how a
> > date comes into the picture at all, and have basically been wondering if
> > there is some problem in the grouping code that misaligns the field
> > values from different results in the group, so that it is not comparing
> > like with like. Not a strong theory, just the only thing I can think of.
> >
> > -- Bryan


Displaying highlights in formatted HTML document

2011-06-08 Thread Bryan Loofbourrow
Here is my use case:



I have a large number of HTML documents, sizes in the 0.5K-50M range, most
around, say, 10M.



I want to be able to present the user with the formatted HTML document, with
the hits tagged, so that he may iterate through them, and see them in the
context of the document, with the document looking as it would be presented
by a browser; that is, fully formatted, with its tables and italics and font
sizes and all.



This is something that the user would explicitly request from within a set
of search results, not something I’d expect to have returned from an initial
search – the initial search merely returns the snippets around the hits. But
if the user wants to dive into one of the returned results and see them in
context, I need to be able to go get that.



We are currently solving this problem by using an entirely separate search
engine (dtSearch), which performs the tagging of the hits in the HTML just
fine. But the solution is unsatisfactory because there are Solr searches
that dtSearch’s capabilities cannot reasonably match.



Can anyone suggest a good way to use Solr/Lucene for this instead? I’m
thinking a separate core for this purpose might make sense, so as not to
burden the primary search core with the full contents of the document. But
after that, I’m stuck. How can I get Solr to express the highlighting in the
context of the formatted HTML document?



If Solr does not do this currently, and anyone can suggest ways to add the
feature, any tips on how this might best be incorporated into the
implementation would be welcome.



Thanks,



-- Bryan


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
Ludovic,

>> how do you index your html files? I mean do you create fields for
different parts of your document (for different stop words lists, stemming,
etc)? with DIH or solrj or something else? <<

We are sending them over http, and using Tika to strip the HTML, at
present.

We do not split the document itself into separate fields, but what we
index includes a bunch of metadata that has been extracted by processes
earlier in the pipeline. These fields don't enter into the
HTML-hit-highlighting question.

>> I developed this week a new highlighter module which transfers the fields
highlighting to the original document (xml in my case) (I use payloads to
store offsets and lengths of fields in the index). This way, I use the good
analyzers to do the highlighting correctly and then, I replace the different
field parts in the document by the highlighted parts. It is not finished
yet, but I already have some good results. <<

Yes, I have been thinking along very similar lines. If you arrive at
something you're happy with, I encourage you to share it.

>> This is a client request too. Let me know if the iorixxx's solution is
not enought for your particular use case.<<

I'm enough of a Solr newb that I'll need to study his suggestion for a
bit, to figure out what it does and does not do. When I've done so, I'll
respond to his message.

Thanks,

-- Bryan


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, June 08, 2011 11:56 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Displaying highlights in formatted HTML document
>
>
>
> --- On Thu, 6/9/11, Bryan Loofbourrow 
> wrote:
>
> > From: Bryan Loofbourrow 
> > Subject: Displaying highlights in formatted HTML document
> > To: solr-user@lucene.apache.org
> > Date: Thursday, June 9, 2011, 2:14 AM
> > Here is my use case:
> >
> >
> >
> > I have a large number of HTML documents, sizes in the
> > 0.5K-50M range, most
> > around, say, 10M.
> >
> >
> >
> > I want to be able to present the user with the formatted
> > HTML document, with
> > the hits tagged, so that he may iterate through them, and
> > see them in the
> > context of the document, with the document looking as it
> > would be presented
> > by a browser; that is, fully formatted, with its tables and
> > italics and font
> > sizes and all.
> >
> >
> >
> > This is something that the user would explicitly request
> > from within a set
> > of search results, not something I'd expect to have
> > returned from an initial
> > search - the initial search merely returns the snippets
> > around the hits. But
> > if the user wants to dive into one of the returned results
> > and see them in
> > context, I need to be able to go get that.
> >
> >
> >
> > We are currently solving this problem by using an entirely
> > separate search
> > engine (dtSearch), which performs the tagging of the hits
> > in the HTML just
> > fine. But the solution is unsatisfactory because there are
> > Solr searches
> > that dtSearch's capabilities cannot reasonably match.
> >
> >
> >
> > Can anyone suggest a good way to use Solr/Lucene for this
> > instead? I'm
> > thinking a separate core for this purpose might make sense,
> > so as not to
> > burden the primary search core with the full contents of
> > the document. But
> > after that, I'm stuck. How can I get Solr to express the
> > highlighting in the
> > context of the formatted HTML document?
> >
> >
> >
> > If Solr does not do this currently, and anyone can suggest
> > ways to add the
> > feature, any tips on how this might best be incorporated
> > into the
> > implementation would be welcome.
>
> I am doing the same thing (solr trunk) using the following field type:
>
> <fieldType name="text_html" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.HTMLStripCharFilterFactory"
> mapping="mappings.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> In your separate core - which is queried when the user wants to dive
> into one of the returned results - feed your html files into this field.
>
> You may want to increase max analyzed chars too:
> <int name="hl.maxAnalyzedChars">2147483647</int>

OK, I think I see what you're up to. Might be pretty viable for me as well.
Can you talk about anything in your mappings.txt files that is an
important part of the solution?

Also, isn't there another piece? Don't you need to force it to return the
whole document, rather than its usual context chunks? Or are you somehow
able to map the returned chunks into the separately-stored documents?

We have another requirement I forgot to mention, about wanting to
associate a sequence number with each hit, but I imagine I can deal with
that by putting some sort of identifiable char sequence in a custom prefix
for the highlighting, then replacing that with a sequence number in
postprocessing.
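
A rough sketch of that postprocessing, assuming the request sets
hl.simple.pre=[[HIT]] and hl.simple.post=[[/HIT]] (the marker strings and
the class name are arbitrary):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class HitNumberer {
  private static final Pattern OPEN = Pattern.compile(Pattern.quote("[[HIT]]"));

  // Replace each opening marker with an anchor carrying a running
  // sequence number, and each closing marker with the matching end tag.
  public static String numberHits(String highlighted) {
    Matcher m = OPEN.matcher(highlighted);
    StringBuffer sb = new StringBuffer();
    int seq = 0;
    while (m.find()) {
      m.appendReplacement(sb, "<span class=\"hit\" id=\"hit-" + seq++ + "\">");
    }
    m.appendTail(sb);
    return sb.toString().replace("[[/HIT]]", "</span>");
  }
}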

I'm also wondering about the performance of this approach with large
documents, vs. something like what Ludovic is talking about, where you
would just get positions back from Solr, and fetch the document separately
from a filestore.

-- Bryan


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
> > OK, I think I see what you're up to. Might be pretty viable
> > for me as well.
> > Can you talk about anything in your mappings.txt files that
> > is an
> > important part of the solution?
>
> It is not important. I just copied it. Plus html strip char filter does
> not have mappings parameter. It was a copy paste mistake.

Yes, I asked the wrong question. What I was subconsciously getting at is
this: how are you avoiding the possibility of getting hits in the HTML
elements? Is that accomplished by putting tag names in your stopwords, or
by some other mechanism?

-- Bryan


RE: solr java.lang.NullPointerException on select queries

2012-06-26 Thread Bryan Loofbourrow
Regarding the large number of files even after optimize: when rebuilding a
large, experimental 1.7TB index on Solr 3.5 instead of Solr 1.4.1, we found
thousands of index files in 3.5, where there used to be just 10 (or 11?)
segments' worth in 1.4.1 (as expected with mergeFactor set to 10).

The apparent cause was a Solr switch to use TieredMergePolicy by default
somewhere in the 3.x version series. TieredMergePolicy has a default
segment size limit of 5GB, so if your index goes over 50GB, a mergeFactor
of 10 effectively gets ignored. We remedied this by explicitly configuring
TieredMergePolicy's segment size (and some other things that may or may
not be making a difference) in solrconfig.xml:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <!-- segment size limit raised well above the 5GB default -->
  <double name="maxMergedSegmentMB">300000</double>
</mergePolicy>

-- Bryan

> -Original Message-
> From: avenka [mailto:ave...@gmail.com]
> Sent: Tuesday, June 26, 2012 8:46 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr java.lang.NullPointerException on select queries
>
> So, I tried 'optimize', but it failed because of lack of space on the
> first machine. I then moved the whole thing to a different machine where
> the index was pretty much the only thing and was using about 37% of the
> disk, but it still failed because of a "No space left on device"
> IOException. Also, the size of the index has since doubled to roughly
> 74% of the disk on this second machine, and the number of files has
> increased from 3289 to 3329. Actually, even the 3289 files on the first
> machine were after I tried optimize on it once, so the "original" size
> must have been even smaller.
>
> I don't think I can afford any more space and am close to giving up and
> reclaiming the space on the two machines. A couple more questions before
> that:
>
> 1) I am tempted to try editing the binary--the "magnetic needle"
> option. Could you elaborate on this? Would there be a way to go back
> to an index that is the original size from its super-sized current
> form(s)?
>
> 2) Will CheckIndex also need more than twice the space? Would there be
> a way to bring the index back down to its original size without running
> 'optimize' if I try that route? Also, how exactly do I run CheckIndex -
> e.g., what is the exact URL I need to hit?
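>
> (From the javadocs, CheckIndex appears to be a command-line tool rather
> than a URL. I think the invocation is something along these lines,
> though I may have the jar path wrong:
>
> java -ea:org.apache.lucene... -cp lucene-core-3.5.0.jar \
>     org.apache.lucene.index.CheckIndex /path/to/index
>
> with an optional -fix flag that drops unreadable segments - which, as I
> understand it, loses the documents in them.)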
>
>


RE: Using Solr 3.4 running on tomcat7 - very slow search

2012-07-16 Thread Bryan Loofbourrow
5 min is ridiculously long for a query that used to take 65 ms. That ought
to be a great clue. The only two things I've seen that could cause that
are thrashing or GC. It's hard to see how it could be thrashing, given
your hardware, so I'd initially suspect GC.

Aim VisualVM at the JVM. It shows how much CPU goes to GC over time, in a
nice blue line. And if it's not GC, try out its Sampler tab, and see where
the CPU is spending its time.
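
(If attaching VisualVM to the production JVM is awkward, GC logging tells
much the same story. Assuming a HotSpot JVM, flags along these lines - the
log path is just an example:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/solr-gc.log

Long pauses will show up in that log as full GCs taking seconds or more.)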

FWIW, when asked at what point one would want to split JVMs and shard on
the same machine, Grant Ingersoll mentioned 16GB, precisely for GC cost
reasons. You're way above that. Maybe multiple JVMs and sharding, even on
the same machine, would serve you better than a monster 70GB JVM.

-- Bryan

> -Original Message-
> From: Mou [mailto:mouna...@gmail.com]
> Sent: Monday, July 16, 2012 7:43 PM
> To: solr-user@lucene.apache.org
> Subject: Using Solr 3.4 running on tomcat7 - very slow search
>
> Hi,
>
> Our index is divided into two shards, and each of them has 120M docs,
> for a total size of 75G in each core.
> The server is a pretty good one; the JVM is given 70G of memory, and
> about the same is left for the OS (SLES 11).
>
> We use all dynamic fields except the unique id, and we use long
> queries, but almost all of them are filter queries. Each query may have
> 10-30 fq parameters.
>
> When I tested the index (same size) but with a max heap size of 40G,
> queries were blazing fast. I used solrmeter to load test, and it was
> happily serving 12000 queries or more per minute with an average qtime
> of 65 ms. We had an excellent filtercache hit ratio.
>
> This index is only used for searching and is replicated every 7 sec
> from the master.
>
> But now, on the production server, it is horribly slow, taking 5 min
> (qtime) to return the same query.
> What could go wrong?
>
> Really appreciate your suggestions on debugging this thing.
>
>
>


RE: Using Solr 3.4 running on tomcat7 - very slow search

2012-07-16 Thread Bryan Loofbourrow
Another thing you may wish to ponder is this blog entry from Mike
McCandless:
http://blog.mikemccandless.com/2011/04/just-say-no-to-swapping.html

In it, he discusses the poor interaction between OS swapping and
long-neglected allocations in a JVM. You're on Linux, which has decent
control over swapping decisions, so you may find that a tweak is in order,
especially if you can find evidence that the hard drive is being worked
hard during GC. If the problem exists, it might be especially pronounced
in your large JVM.
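
(The knob in question on Linux is vm.swappiness. Following that blog post,
something like

sysctl -w vm.swappiness=0

- or the equivalent "vm.swappiness = 0" line in /etc/sysctl.conf, to make
it stick across reboots - tells the kernel to avoid swapping except under
real memory pressure.)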

I have no direct evidence of thrashing during GC (I am not sure how to go
about gathering such evidence), but I have seen, on a Windows machine, a
Tomcat running Solr refuse to shut down for many minutes, while a Resource
Monitor session reported that the same Tomcat process was frantically
reading from the page file the whole time. So there is something besides
plausibility to the idea.

-- Bryan

> -Original Message-
> From: Mou [mailto:mouna...@gmail.com]
> Sent: Monday, July 16, 2012 9:09 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Solr 3.4 running on tomcat7 - very slow search
>
> Thanks Bryan. Excellent suggestion.
>
> I haven't used VisualVM before, but I am going to use it to see where
> the CPU is going. I saw that the CPU is heavily used; I didn't see so
> much CPU use in testing.
> Although I think GC is not the problem, splitting into one JVM per
> shard would be a good idea.

A strange Solr NullPointerException while shutting down Tomcat, possible connection to messed-up index files

2012-09-18 Thread Bryan Loofbourrow
I’m using Solr/Lucene 3.6 under Tomcat 6.



When shutting down an indexing server after much indexing activity, I
occasionally see the following NullPointerException trace from Tomcat:



INFO: Stopping Coyote HTTP/1.1 on http-1800
Exception in thread "Lucene Merge Thread #1" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:509)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
Caused by: java.lang.NullPointerException
        at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:349)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3915)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)



Here’s where the strange part comes in. The code at the line of the “caused
by” looks like this:

  if (filter.accept(null, fileName) &&
      (segmentName == null || fileName.startsWith(segmentPrefix1) ||
       fileName.startsWith(segmentPrefix2)) &&
      !refCounts.containsKey(fileName) &&
      !fileName.equals(IndexFileNames.SEGMENTS_GEN)) {



So if the line number is accurate, filter must be null. But filter is set
like this:

  IndexFileNameFilter filter = IndexFileNameFilter.getFilter();

And the method in IndexFileNameFilter looks like:

  public static IndexFileNameFilter getFilter() {
    return singleton;
  }



And the singleton field is initialized like so:

  private static IndexFileNameFilter singleton = new IndexFileNameFilter();



So, that shouldn’t ever be null, right? Well, I can’t exactly repro this
rare situation in a debugger, but it did make me wonder whether something
about the behavior of the Tomcat classloader against a previously unloaded
class during shutdown might somehow lead to a null value here.



Or is that too fanciful? Should I instead be figuring that the line number
is off because of an if statement extending over several lines, and that
either fileName or refCounts must be null? The code doesn’t make that look
too likely either, though.



Does anyone have any advice about tracking this down or mitigating it? Also
wondering whether this issue is related to occasional “missing fnm file”
errors we’ve seen when restarting Tomcat on this indexing server. That
error means I have to go in and make the index coherent again, generally by
aligning it with one of the slaves. I can see how a failure to delete files
might have that effect.



Thanks,



-- Bryan