Design and Usage Questions

2010-10-31 Thread getagrip

Hi,

I've got some basic usage / design questions.

1. The SolrJ wiki recommends using the same CommonsHttpSolrServer
   instance for all requests to avoid connection leaks.
   So if I create a singleton instance at application startup, can I
   safely use this instance for ALL queries/updates throughout my
   application without running into performance issues?

2. My system's documents are stored in a Subversion repository.
   For fast search results I want to periodically index new documents
   from the repository.

   What I get from the repository is a ByteArrayOutputStream. How can I
   pass this stream to Solr?

   I only see ways to pass Files, but in my case it makes no sense to
   write the ByteArrayOutputStream to disk again, as this would only
   cost performance.

3. Are there any disadvantages to using SolrJ over some other HTTP-based
   solution, e.g. creating & sending my own HTTP requests? Do I even
   have to use HTTP?
   I see that EmbeddedSolrServer exists. Any drawbacks to using that?

Any hints are welcome, Thanks!


Re: Design and Usage Questions

2010-11-01 Thread getagrip

OK, so if I did NOT use SolrJ, I could PUSH a stream to Solr somehow?
I do not depend on SolrJ; any connection method would suffice.

On 11/01/2010 03:23 AM, Lance Norskog wrote:

2.
The SolrJ library handling of content streams is "pull", not "push".
That is, you give it a reader and it pulls content when it feels like
it. If your software to feed the connection wants to write the data,
you have to either buffer the whole thing or do a dual-thread
writer/reader pair.
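
The two options named above (buffer the whole thing, or run a writer/reader pair on two threads) can be sketched with plain java.io classes. This is only a minimal illustration, not SolrJ code: the class name PushToPull, the Producer interface, and the thread name are hypothetical, and how the resulting InputStream is handed to a Solr client is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class PushToPull {

    /** Hypothetical callback for code that wants to write ("push") its data. */
    interface Producer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Option 1: the data is already fully buffered; expose it for reading. */
    static InputStream buffered(ByteArrayOutputStream alreadyWritten) {
        // toByteArray() copies the buffer, so memory use is roughly doubled.
        return new ByteArrayInputStream(alreadyWritten.toByteArray());
    }

    /** Option 2: a writer thread pushes into a pipe, the caller pulls from it. */
    static InputStream piped(final Producer producer) {
        try {
            final PipedInputStream in = new PipedInputStream();
            final PipedOutputStream out = new PipedOutputStream(in);
            Thread writer = new Thread(() -> {
                // Closing the output end signals end-of-stream to the reader.
                try (OutputStream o = out) {
                    producer.writeTo(o);
                } catch (IOException e) {
                    // The pipe is closed anyway; the reader may see truncated data.
                }
            }, "pipe-writer");
            writer.setDaemon(true);
            writer.start();
            return in;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains a stream completely; stands in for whatever "pulls" the content. */
    static byte[] readAll(InputStream in) {
        try (InputStream i = in) {
            ByteArrayOutputStream copy = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = i.read(chunk)) != -1) {
                copy.write(chunk, 0, n);
            }
            return copy.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "document body".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream fromSvn = new ByteArrayOutputStream();
        fromSvn.write(data, 0, data.length);
        System.out.println(readAll(buffered(fromSvn)).length);
        System.out.println(readAll(piped(o -> o.write(data))).length);
    }
}
```

Buffering is simpler and is usually fine for documents that already sit in a ByteArrayOutputStream; the piped variant only pays off when the documents are too large to hold in memory twice.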

The easiest way to pull stuff from SVN is to use one of the web server
apps. Solr takes a "stream.url" parameter. (Also stream.file.) Note
that there is no outbound authentication supported; your web server
has to be open (at least to the Solr instance).
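
As a rough illustration of the stream.url approach, the request URL could be assembled as below. Everything concrete here is an assumption for the sake of the example: the Solr base URL, the SVN web URL, the document id, and that an extracting handler is mapped at /update/extract with remote streaming enabled in solrconfig.xml. The class name StreamUrlRequest is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class StreamUrlRequest {

    /** Builds an update URL that asks Solr to fetch the document itself. */
    static String build(String solrBase, String documentUrl, String id) {
        return solrBase + "/update/extract"          // assumed handler path
                + "?stream.url=" + encode(documentUrl) // Solr pulls from this URL
                + "&literal.id=" + encode(id)          // id assigned to the document
                + "&commit=true";
    }

    /** URL-encodes a query parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical hosts and paths.
        System.out.println(build("http://localhost:8983/solr",
                "http://svn.example.org/repo/trunk/doc.pdf", "doc-1"));
    }
}
```

Sending a GET or POST to the resulting URL makes Solr fetch the file from the web server, which is why that server has to be reachable from the Solr instance without authentication.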


highlighting encoding issue

2010-12-07 Thread getagrip

Hi,
when I query Solr (trunk), I get "numeric character references" instead
of regular UTF-8 strings for special characters in the highlighting
section; in the result section the characters are presented fine.


e.g. instead of the German umlaut ä I get &#228;

Example:



Vielfachmessgerät






Vielfachmessgerät


Any hints are welcome.
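
For readers unfamiliar with the term: a numeric character reference writes a character as its Unicode code point, decimal or hexadecimal, between "&#" and ";". The sketch below shows a client-side decoder as one possible workaround when the highlighted text is shown as plain text. It is an assumption-laden illustration (the class name NumericCharRefs is hypothetical) and does not change what Solr returns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericCharRefs {

    // Matches decimal (&#228;) and hexadecimal (&#xE4;) references.
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Replaces every numeric character reference with the character it names. */
    static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave references to non-existent code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Vielfachmessger&#228;t"));
    }
}
```

If the snippets are rendered in an HTML page instead, the references can be left as they are, since browsers decode them on display.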


Special Character & Highlighting issues after 3.1.0 update

2011-04-14 Thread getagrip

After updating from Solr 1.4.1 to 3.1.0, some documents are no longer
parsed correctly:


1. Both the result's id field and the highlighting's header no longer
display special characters, e.g. German umlauts.


2. The highlighting section is garbled: words appear in random order
instead of forming readable sentences.


Please see both versions (3.1.0 & 1.4.1) below:

###
query =>




### solr 3.1.0 (not working) =>






 Netzqualitätsmessungen gemäß Klasse ASeit Einführung der Norm IEC






 derAnwendungsberichtSchleifenimpedanzDie Messung der 
Erdschleifenimpedanz und die Bestimmung desunbeeinflussten 
Kurzschlussstroms (PFC







### solr 1.4.1 (works well) =>







 die elektrische Anlage eines Unternehmens den zuverlässigen Betrieb 
der Ver- braucher gewährleistet







 wurde die Messung oft gar nicht erst durchgeführt aus Angst 
der FI könnte auslösen. Diese Befürch- tung







Re: Special Character & Highlighting issues after 3.1.0 update

2011-04-17 Thread getagrip

This works with NEITHER HtmlEncoder NOR DefaultEncoder.

1. Special characters like öäüß are simply returned as question marks.
   This goes for ALL document types.

2. The index is built in a way that randomly concatenates words and
   puts them into the highlighting section in a way that does not
   mirror the original text. This goes for GERMAN PDFs.

On 04/14/2011 05:51 PM, Yonik Seeley wrote:

On Thu, Apr 14, 2011 at 11:27 AM, Koji Sekiguchi  wrote:

I'm not sure, but is it due to HtmlEncoder?

  
  

It is set as the default in the example config.


Thanks Koji,

So it looks like the problems here are either in Tika (and PDFBox), or
the Tika-Solr integration.

-Yonik


token exceeding provided text size error since Solr 3.2

2011-06-30 Thread getagrip

A bug was introduced between Solr 3.1 and 3.2.

With Solr 3.2 we are now getting the following error when querying
several PDF and Word documents:


SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 17 exceeds length of provided text sized 168
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:474)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 17 exceeds length of provided text sized 168
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:467)
    ... 24 more
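
The wording of the exception suggests a bounds check: a token's character offsets point past the end of the text handed to the highlighter. The sketch below is a simplified stand-in for such a check, written only to make the message easier to read. It is not the Lucene source; the class OffsetCheckSketch, its Token type, and the sample offsets are hypothetical, and reading "17" as the token's text rather than its position is an assumption.

```java
final class OffsetCheckSketch {

    /** Minimal stand-in for an analyzed token with its character offsets. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Rejects tokens whose offsets lie beyond the text being highlighted. */
    static void checkOffsets(Token token, String text) {
        if (token.endOffset > text.length() || token.startOffset > text.length()) {
            throw new IllegalStateException("Token " + token.term
                    + " exceeds length of provided text sized " + text.length());
        }
    }

    /** Builds a filler string of the given length for the demonstration. */
    static String filler(int length) {
        return new String(new char[length]).replace('\0', 'x');
    }

    public static void main(String[] args) {
        String storedText = filler(168);
        checkOffsets(new Token("ok", 10, 12), storedText);  // within bounds
        try {
            checkOffsets(new Token("17", 170, 172), storedText);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that reading, the error would mean the token offsets and the text being highlighted do not describe the same string, which would be worth checking against the affected PDF and Word documents.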