Please help - Solr Cell using 'stream.url'
I'm batching documents into Solr using Solr Cell with the 'stream.url' parameter. Everything works fine until I get about 5K documents in, and then it starts issuing 'read timeout 500' errors on every document. The sysadmin says there's plenty of CPU and memory and no paging, so it doesn't look like the OS is the problem. I can curl the documents that Solr is trying to index (and failing on) just fine, so it seems to be a Solr issue. There are only about 35K documents total, so Solr shouldn't even blink. Can anyone help me diagnose this problem? I'd be happy to provide any more detail that is needed. Thanks - Tod
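For reference, the request pattern being described - pointing Solr Cell at a remote document via stream.url and committing every so often - looks roughly like the following SolrJ sketch. The host, field names, document URLs, and commit interval are assumptions for illustration (the poster is actually driving this from Perl), and exact SolrJ behavior varies by version.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamUrlBatch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed host/port
        String[] urls = {
            "http://cms.example.com/doc1.pdf",   // placeholder document URLs
            "http://cms.example.com/doc2.doc"
        };
        int count = 0;
        for (String url : urls) {
            UpdateRequest req = new UpdateRequest("/update/extract"); // Solr Cell handler
            ModifiableSolrParams p = new ModifiableSolrParams();
            p.set("stream.url", url);                            // Solr fetches the document itself
            p.set("literal.content_id", String.valueOf(count));  // hypothetical unique key field
            req.setParams(p);
            solr.request(req);
            if (++count % 100 == 0) {
                solr.commit();   // commit every 100 docs, as in the workflow described above
            }
        }
        solr.commit();
    }
}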
Re: Please help - Solr Cell using 'stream.url'
On 10/07/2011 6:21 PM, wrote: Hi, What Solr version? Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42. It's running on a Suse Linux VM. How often do you do commits, or do you use autocommit? I had been doing commits every 100 documents (the entire set is about 35K docs, so it's relatively small). Since that wasn't working, and I read that commits are expensive, I decided to experiment and wait until all documents were indexed before committing. I haven't been able to successfully index all the documents yet to try the manual commit because of this problem. What kind and size of docs? Mostly MS Office and PDFs, some straight HTML pages. I can't give a specific answer to size but nothing alarmingly large - typical 2-5 page office documents. Do you feed from a Java program? Where is the read timeout occurring? Can you paste in some logs? I'd love to but I could never get it to work. I'm using Perl right now, getting rows from an Oracle database and using LWP to perform the calls to Solr's REST interface. How much RAM on your server, and how much did you give to the JVM? RAM to JVM: export CATALINA_OPTIONS="-Xms1024m -Xmx3072m" Top output on the VM: cpu(s): 64.1%us, 11.4%sy, 0.0%ni, 24.0%id, 0.2%wa, 0.2%hi, 0.2%si, 0.0%st mem: 3980384k total, 3803300k used, 177084k free, 393924k buffers swap: 4194296k total, 512k used, 4193784k free, 1518156k cached pid user pr ni virt res shr s %cpu %mem time+command 16243 solr 19 0 642m 322m 6256 s 119 8.3 73:16.49 java Thanks.
Re: Please help - Solr Cell using 'stream.url'
On 10/10/2011 3:39 PM, � wrote: Hi, If you have 4Gb on your server total, try giving about 1Gb to Solr, leaving 3Gb for OS, OS caching and mem-allocation outside the JVM. Also, add 'ulimit -v unlimited' and 'ulimit -s 10240' to /etc/profile to increase virtual memory and stack limit. I will try this - thanks. And you should also consider upgrading to latest Solr... Is there a clearly defined migration path? - Tod
Instructions for Multiple Server Webapps Configuring with JNDI
I'm following the instructions here: http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat ...under the heading "Multiple Solr Webapps". I have configured the context fragment as instructed, placed the apache-solr-3.4.0.war in the directory pointed to by the docBase variable, and modified the solr/home accordingly. I have an empty directory under tomcat/webapps named after the solr home directory in the context fragment. The context fragment contains:

<Context docBase="..." crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr/solr0" override="true"/>
</Context>

An empty /tomcat/webapps/solr0 directory exists. I expected to fire up Tomcat and have it unpack the war file contents into the solr home directory specified in the context fragment, but it's empty, as is the webapps directory. What am I doing wrong? I'm running Apache Tomcat/6.0.29. TIA - Tod
Re: Instructions for Multiple Server Webapps Configuring with JNDI
On 10/14/2011 2:44 PM, Chris Hostetter wrote: : modified the solr/home accordingly. I have an empty directory under : tomcat/webapps named after the solr home directory in the context fragment. if that empty directory has the same base name as your context fragment (ie: "tomcat/webapps/solr0" and "solr0.xml") that may give you problems ... the entire point of using context fragment files is to define webapps independently of a simple directory based hierarchy in tomcat/webapps ... if you have a directory there with the same name you create a conflict -- which webapp should it use, the empty one, or the one specified by your context file? Looks like that was the problem; once I removed the ./webapps/solr0 directory and started Tomcat back up it was recreated correctly. : I expected to fire up tomcat and have it unpack the war file contents into the : solr home directory specified in the context fragment, but it's empty, as is : the webapps directory. that's not what the "solr/home" env variable is for at all. tomcat will put the unpacked war wherever it needs/wants to (in theory it could just load it in memory) ... the point of the solr/home env variable is for you to tell the solr.war where to find the configuration files for this context. Sorry, my mistake. I wasn't referring to "solr/home", I was referring literally to the new solr home under tomcat - in this instance ./webapps/solr0. One more question: is there a particular advantage of multiple solr instances vs. multiple solr cores? Thanks.
java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log
I'm working on upgrading to Solr 3.4.0 and am seeing this error in my tomcat log. I'm using the following slf4j jars: slf4j-api-1.6.1.jar slf4j-jdk14-1.6.1.jar Has anybody run into this? I can reproduce it by doing curl calls to the Solr ExtractingRequestHandler at /solr/update/extract. TIA - Tod
can solr follow and index hyperlinks embedded in rich text documents (pdf, doc, etc)?
I have a feeling the answer is "no" since you wouldn't want to start indexing a large volume of office documents containing hyperlinks that could lead all over the internet. But, since there might be a use case like "a customer just asked me if it could be done?", I thought I would make sure. Thanks - Tod
Re: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log
On 10/19/2011 2:58 PM, wrote: Hi Tod, I had a similar issue with slf4j, but it was NoClassDefFound. Do you have some other dependencies in your application that use some other version of slf4j? You can use mvn dependency:tree to get all dependencies in your application. Or maybe there's some other version already in your tomcat or application server. /Tim I had to start over from scratch, but I believe that's exactly what it was. Things are working now. Thanks.
Batch indexing documents using ContentStreamUpdateRequest
This is a code fragment of how I am doing a ContentStreamUpdateRequest using CommonsHttpSolrServer:

ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest("/update/extract");
ContentStreamBase.URLStream csbu = new ContentStreamBase.URLStream(url);
InputStream is = csbu.getStream();
FastInputStream fis = new FastInputStream(is);
csur.addContentStream(csbu);
csur.setParam("literal.content_id", "00");
csur.setParam("literal.contentitle", "This is a test");
csur.setParam("literal.title", "This is a test");
server.request(csur);
server.commit();
fis.close();

This works fine for one document (a pdf in this case). When I surround this with a while loop and try adding multiple documents I get:

org.apache.solr.client.solrj.SolrServerException: java.io.IOException: stream is closed

I've tried commenting out the fis.close(), and also using just a plain InputStream with and without a .close() call - neither works. Is there a way to do this that I'm missing? Thanks - Tod
Re: Batch indexing documents using ContentStreamUpdateRequest
Answering my own question. ContentStreamUpdateRequest (csur) needs to be within the while loop not outside as I had it. Still not seeing any dramatic performance improvements over perl though (the point of this exercise). Indexing locks after about 30-45 minutes of activity, even a commit won't budge it. On 11/04/2011 12:36 PM, Tod wrote: This is a code fragment of how I am doing a ContentStreamUpdateRequest using CommonHTTPSolrServer: ContentStreamBase.URLStream csbu = new ContentStreamBase.URLStream(url); InputStream is = csbu.getStream(); FastInputStream fis = new FastInputStream(is); csur.addContentStream(csbu); csur.setParam("literal.content_id","00"); csur.setParam("literal.contentitle","This is a test"); csur.setParam("literal.title","This is a test"); server.request(csur); server.commit(); fis.close(); This works fine for one document (a pdf in this case). When I surround this with a while loop and try adding multiple documents I get: org.apache.solr.client.solrj.SolrServerException: java.io.IOException: stream is closed I've tried commenting out the fis.close, and also using just a plain InputStream with and without a .close() call - neither work. Is there a way to do this that I'm missing? Thanks - Tod
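A minimal sketch of the corrected loop, with a new ContentStreamUpdateRequest (and a new stream) created per document as described above; the Solr URL, document URLs, and literal field names are placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class BatchExtract {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        String[] urls = { "http://example.com/a.pdf", "http://example.com/b.pdf" };  // placeholders
        int id = 0;
        for (String url : urls) {
            // New request object for every document; reusing one request across
            // iterations is what produced the "stream is closed" error.
            ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest("/update/extract");
            csur.addContentStream(new ContentStreamBase.URLStream(new java.net.URL(url)));
            csur.setParam("literal.content_id", String.valueOf(id++)); // hypothetical unique key
            server.request(csur);
        }
        server.commit();
    }
}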
Help! - ContentStreamUpdateRequest
Could someone take a look at this page: http://wiki.apache.org/solr/ContentStreamUpdateRequestExample ... and tell me what code changes I would need to make to be able to stream a LOT of files at once rather than just one? It has to be something simple like a collection of some sort but I just can't get it figured out. Maybe I'm using the wrong class altogether? TIA
Re: Help! - ContentStreamUpdateRequest
Otis, The files are only part of the payload. The supporting metadata exists in a database. I'm pulling that information, as well as the name and location of the file, from the database and then sending it to a remote Solr instance to be indexed. I've heard Solr would prefer to get documents it needs to index in chunks rather than one at a time as I'm doing now. The one at a time approach is locking up the Solr server at around 700 entries. My thought was if I could chunk them in a batch at a time the lockup will stop and indexing performance would improve. Thanks - Tod On 11/15/2011 12:13 PM, Otis Gospodnetic wrote: Hi, How about just concatenating your files into one? Would that work for you? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ ____ From: Tod To: solr-user@lucene.apache.org Sent: Monday, November 14, 2011 4:24 PM Subject: Help! - ContentStreamUpdateRequest Could someone take a look at this page: http://wiki.apache.org/solr/ContentStreamUpdateRequestExample ... and tell me what code changes I would need to make to be able to stream a LOT of files at once rather than just one? It has to be something simple like a collection of some sort but I just can't get it figured out. Maybe I'm using the wrong class altogether? TIA
Re: Help! - ContentStreamUpdateRequest
Erick, Autocommit is commented out in solrconfig.xml. I have avoided commits until after the indexing process is complete. As an experiment I tried committing every n records processed to see if varying n would make a difference; it really didn't change much. My original use case had the client running from the Solr server and streaming the document content over from a web server based on the URL gathered by a query from a backend database. The locking problem appeared there first so I tried moving the client code to the web server to be closer to the documents' origin. That helped a little but ended up locking, which is where I am now. Solr should be able to index way more documents than the 35K I'm trying to index. It seems from others' accounts they are able to do what I'm trying to do successfully. Therefore I believe I must be doing something extraordinarily dumb. I'll be happy to share any information about my environment or configuration if it will help find my error. Thanks for all of your help. - Tod On 11/15/2011 8:08 PM, Erick Erickson wrote: That's odd. What are your autocommit parameters? And are you either committing or optimizing as part of your program? I'd bump the autocommit parameters up and NOT commit (or optimize) from your client if you are Best Erick On Tue, Nov 15, 2011 at 2:17 PM, Tod wrote: Otis, The files are only part of the payload. The supporting metadata exists in a database. I'm pulling that information, as well as the name and location of the file, from the database and then sending it to a remote Solr instance to be indexed. I've heard Solr would prefer to get documents it needs to index in chunks rather than one at a time as I'm doing now. The one at a time approach is locking up the Solr server at around 700 entries. My thought was if I could chunk them in a batch at a time the lockup will stop and indexing performance would improve. Thanks - Tod On 11/15/2011 12:13 PM, Otis Gospodnetic wrote: Hi, How about just concatenating your files into one? Would that work for you? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Tod To: solr-user@lucene.apache.org Sent: Monday, November 14, 2011 4:24 PM Subject: Help! - ContentStreamUpdateRequest Could someone take a look at this page: http://wiki.apache.org/solr/ContentStreamUpdateRequestExample ... and tell me what code changes I would need to make to be able to stream a LOT of files at once rather than just one? It has to be something simple like a collection of some sort but I just can't get it figured out. Maybe I'm using the wrong class altogether? TIA
Indexing Using XML Message
I have a local data store containing a host of different document types. This data store is separate from a remote Solr install, making streaming not an option. Instead I'd like to generate an XML file that contains all of the documents, including content and metadata. What would be the most appropriate way to accomplish this? I could use the Tika CLI to generate XML, but I'm not sure it would work or that it's the most efficient way to handle things. Can anyone offer some suggestions? Thanks - Tod
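One way this could be done, sketched under assumptions (files are local to the client, Tika runs client-side, and SolrJ's ClientUtils serializes the update XML); the field names are illustrative, not from the original post:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class BuildUpdateXml {
    public static void main(String[] args) throws Exception {
        StringBuilder xml = new StringBuilder("<add>");
        for (String path : args) {                               // local documents to include
            InputStream in = new FileInputStream(new File(path));
            BodyContentHandler text = new BodyContentHandler(-1); // -1 = no size limit on extracted text
            Metadata meta = new Metadata();
            new AutoDetectParser().parse(in, text, meta, new ParseContext());
            in.close();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", path);                            // hypothetical unique key
            doc.addField("title", meta.get(Metadata.TITLE));     // document metadata from Tika
            doc.addField("content", text.toString());            // extracted body text
            xml.append(ClientUtils.toXML(doc));                  // one <doc> element per file
        }
        xml.append("</add>");
        System.out.println(xml);                                 // this XML can then be posted to /update
    }
}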
Data Import Handler Rich Format Documents
I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document, which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc. I'm already indexing the Metadata and that provides a lot of value. The customer however would like the content pointed to by the URL to also be indexed for more discrete searching. This article at Lucid: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too. What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might be ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr. Thanks in advance. - Tod
Re: Data Import Handler Rich Format Documents
On 6/18/2010 9:12 AM, Otis Gospodnetic wrote: Tod, You didn't mention Tika, which makes me think you are not aware of it... You could implement a custom Transformer that uses Tika to perform rich doc text extraction, just like ExtractingRequestHandler does it (see http://wiki.apache.org/solr/ExtractingRequestHandler ). Maybe you could even just call ERH from your Transformer, though that wouldn't be the most efficient. You're right, sorry. I have looked at Tika, which I believe is used by Nutch too - no? Implementing a transformer is fine. I guess I'm being lazy and trying to see if a method of doing this has been incorporated into the latest Solr release so I can avoid coding for it. - Original Message From: Tod To: solr-user@lucene.apache.org Sent: Fri, June 18, 2010 8:51:02 AM Subject: Data Import Handler Rich Format Documents I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc. I'm already indexing the Metadata and that provides a lot of value. The customer however would like that the content pointed to by the URL also be indexed for more discrete searching. This article at Lucid: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too. What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might be ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr. Thanks in advance. - Tod
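A rough sketch of the kind of custom DIH Transformer being discussed here; it assumes the row carries a URL column named CONTENT_URL and writes the extracted text into a hypothetical 'content' field (the class would still need to be registered on the DIH entity via its transformer attribute):

import java.io.InputStream;
import java.net.URL;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaUrlTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object url = row.get("CONTENT_URL");             // column name is an assumption
        if (url != null) {
            try {
                InputStream in = new URL(url.toString()).openStream();
                BodyContentHandler text = new BodyContentHandler(-1);
                new AutoDetectParser().parse(in, text, new Metadata(), new ParseContext());
                in.close();
                row.put("content", text.toString());     // hypothetical target field
            } catch (Exception e) {
                // Skip documents Tika can't fetch or parse rather than failing the whole import.
            }
        }
        return row;
    }
}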
Re: Data Import Handler Rich Format Documents
On 6/18/2010 11:24 AM, Otis Gospodnetic wrote: Tod, I don't think DIH can do that, but who knows, let's see what others say. Yes, Nutch uses Tika, too. Otis Looks like the ExtractingRequestHandler uses Tika as well. I might just use this, but I'm wondering if there will be a large performance difference between using it to batch content in and rolling my own Transformer? - Tod
Re: Data Import Handler Rich Format Documents
On 6/18/2010 2:42 PM, Chris Hostetter wrote: : > I don't think DIH can do that, but who knows, let's see what others say. : Looks like the ExtractingRequestHandler uses Tika as well. I might just use : this but I'm wondering if there will be a large performance difference between : using it to batch content in over rolling my own Transformer? I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika? Why would you need a custom Transformer? I started this thread after reading a Lucid article suggesting a custom Transformer might be the way to go when using DIH. My initial question was if there was an alternative. My database contains only Metadata and a reference to the actual content (HTML,Office Documents, PDF...) as a URL - not blobs as the Lucid article focused on. What I would like to do is use DIH somehow to index the Metadata but also the actual content pointed to by the URL column. I might be able to do this instead with Nutch (who uses Tika), haven't thoroughly researched this yet, or I can write a job to pull all the URL's out of the database and utilize cURL and the Solr ExtractingRequestHandler to push everything into the index. I just wanted to see what everybody else is doing and what my other options might be. Thanks - Tod Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
Re: Data Import Handler Rich Format Documents
On 6/18/2010 2:42 PM, Chris Hostetter wrote: : > I don't think DIH can do that, but who knows, let's see what others say. : Looks like the ExtractingRequestHandler uses Tika as well. I might just use : this but I'm wondering if there will be a large performance difference between : using it to batch content in over rolling my own Transformer? I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika? Why would you need a custom Transformer? http://wiki.apache.org/solr/DataImportHandler#Tika_Integration http://wiki.apache.org/solr/TikaEntityProcessor -Hoss Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource curl -s http://test.html|curl http://localhost:9080/solr/update/extract?extractOnly=true --data-binary @- -H 'Content-type:text/html' ... works fine so presumably my Tika processor is working. My data-config.xml looks like this:

<entity name="my_database_url" query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
  <entity processor="TikaEntityProcessor" url="http://www.mysite.com/${my_database.content_url}" ... />
</entity>

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url. Is there anything obviously wrong with what I've tried so far? Thanks - Tod
Indexing Rich Format Documents using Data Import Handler (DIH) and the TikaEntityProcessor
Please refer to this thread for history: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201006.mbox/%3c4c1b6bb6.7010...@gmail.com%3e I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource curl -s http://test.html|curl http://localhost:9080/solr/update/extract?extractOnly=true --data-binary @- -H 'Content-type:text/html' ... works fine so presumably my Tika processor is working. My data-config.xml looks like this:

<entity name="my_database_url" query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
  <entity processor="TikaEntityProcessor" url="http://www.mysite.com/${my_database.content_url}" ... />
</entity>

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url. Is there anything obviously wrong with what I've tried so far? It keeps rolling back with the error above. Thanks - Tod
Re: Data Import Handler Rich Format Documents
On 6/28/2010 8:28 AM, Alexey Serba wrote: Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource It seems that DIH-Tika integration is not a part of the Solr 1.4.0/1.4.1 release. You should use trunk / nightly builds. https://issues.apache.org/jira/browse/SOLR-1583 Thanks, that would explain things - I'm using a stock 1.4.0 download. My data-config.xml looks like this:

<entity name="my_database_url" query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
  <entity processor="TikaEntityProcessor" url="http://www.mysite.com/${my_database.content_url}" ... />
</entity>

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url. Is there anything obviously wrong with what I've tried so far? I think you should move the Tika entity into the my_database entity and simplify the whole configuration, with the Tika entity reading url="http://www.mysite.com/${my_database.content_url}" directly inside my_database. This, I guess, would be after I checked out and built from trunk? Thanks - Tod
Supplementing already indexed data
I'm getting metadata from a RDB but the actual content is stored somewhere else. I'd like to index the content too but I don't want to overlay the already indexed metadata. I know this can be done but I just can't seem to dig up the correct docs, can anyone point me in the right direction? Thanks.
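One common workaround, sketched here only as an illustration and assuming every field you care about is stored: read-modify-write - fetch the existing document, copy its stored fields, add the new content, and re-add it under the same unique key (field names and the Solr URL are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SupplementDoc {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        String id = "1234";                                                        // hypothetical key

        SolrDocument existing = solr.query(new SolrQuery("id:" + id)).getResults().get(0);
        SolrInputDocument updated = new SolrInputDocument();
        for (String field : existing.getFieldNames()) {
            updated.addField(field, existing.getFieldValues(field)); // only stored fields survive this copy
        }
        updated.setField("content", "text extracted from the external store"); // the supplement
        solr.add(updated);     // same unique key, so this replaces the old document
        solr.commit();
    }
}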
Solrj ContentStreamUpdateRequest Slow
I'm running a slight variation of the example code referenced below and it takes a real long time to finally execute. In fact it hangs for a long time at solr.request(up) before finally executing. Is there anything I can look at or tweak to improve performance? I am also indexing a local pdf file, there are no firewall issues, solr is running on the same machine, and I tried the actual host name in addition to localhost but nothing helps. Thanks - Tod http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
Re: Solrj ContentStreamUpdateRequest Slow
On 8/4/2010 11:11 PM, jayendra patil wrote: ContentStreamUpdateRequest seems to read the file contents and transfer it over http, which slows down the indexing. Try Using StreamingUpdateSolrServer with stream.file param @ http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post e.g.

SolrServer server = new StreamingUpdateSolrServer("Solr Server URL", 20, 8);
UpdateRequest req = new UpdateRequest("/update/extract");
ModifiableSolrParams params = new ModifiableSolrParams();
params.add("stream.file", new String[]{"local file path"});
params.set("literal.id", value);
req.setParams(params);
server.request(req);
server.commit();

Thanks for your suggestions. Unfortunately, I'm still seeing poor performance. To be clear, I am trying to have Solr index multiple documents that exist on a remote server. I'd prefer that Solr stream the documents after I pass a pointer to them rather than me retrieving and pushing them, so I can avoid network overhead. When I do this:

curl 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'

It returns in around a second. When I execute the attached code it takes just over three minutes. The optimal for me would be to get closer to the performance I'm seeing with curl using Solrj. To be fair the Solr server I am using is really a workstation class machine, plus I am still learning. I have a feeling I'm doing something dumb but just can't seem to pinpoint the exact problem. Thanks - Tod

code---

import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

/**
 * @author EDaniel
 */
public class SolrExampleTests {

  public static void main(String[] args) {
    System.out.println("main...");
    try {
      // String fileName = "/test/test.pdf";
      String fileName = "http://remoteserver/test/test.pdf";
      String solrId = "1234";
      indexFilesSolrCell(fileName, solrId);
    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  /**
   * Method to index all types of files into Solr.
   * @param fileName
   * @param solrId
   * @throws IOException
   * @throws SolrServerException
   */
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {
    System.out.println("indexFilesSolrCell...");
    String urlString = "http://localhost:8080/solr";
    System.out.println("getting connection...");
    //SolrServer solr = new CommonsHttpSolrServer(urlString);
    SolrServer solr = new StreamingUpdateSolrServer(urlString, 100, 5);
    System.out.println("getting updaterequest handle...");
    //ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    UpdateRequest up = new UpdateRequest("/update/extract");
    ModifiableSolrParams params = new ModifiableSolrParams();
    //params.add("stream.file", fileName);
    params.add("stream.url", fileName);
    params.set("literal.content_id", solrId);
    up.setParams(params);
    System.out.println("making request...");
    solr.request(up);
    System.out.println("committing...");
    solr.commit();
    System.out.println("done...");
  }
}
Re: Solrj ContentStreamUpdateRequest Slow
On 8/12/2010 8:02 PM, Chris Hostetter wrote: : It returns in around a second. When I execute the attached code it takes just : over three minutes. The optimal for me would be able get closer to the : performance I'm seeing with curl using Solrj. I think your problem may be that StreamingUpdateSolrServer buffers up commands and sends them in batches in a background thread. if you want to send individual updates in real time (and time them) you should just use CommonsHttpSolrServer -Hoss My goal is to batch updates. My content lives somewhere else so I was trying to find a way to tell Solr where the document lived so it could go out and stream it into the index for me. That's where I thought StreamingUpdateSolrServer would help. - Tod
Re: Solrj ContentStreamUpdateRequest Slow
On 8/16/2010 6:12 PM, Chris Hostetter wrote: : > I think your problem may be that StreamingUpdateSolrServer buffers up : > commands and sends them in batches in a background thread. if you want to : > send individual updates in real time (and time them) you should just use : > CommonsHttpSolrServer : : My goal is to batch updates. My content lives somewhere else so I was trying : to find a way to tell Solr where the document lived so it could go out and : stream it into the index for me. That's where I thought : StreamingUpdateSolrServer would help. If your content lives on a machine which is not your "client" nor your "server" and you want your client to tell your server to go fetch it directly then the "stream.url" param is what you need -- that is unrelated to whether you use StreamingUpdateSolrServer or not. Do you happen to have a code fragment laying around that demonstrates using CommonsHttpSolrServer and "stream.url"? I've tried it in conjunction with ContentStreamUpdateRequest and I keep getting an annoying null pointer exception. In the meantime I will check the examples... Thinking about it some more, i suspect the reason you might be seeing a delay when using StreamingUpdateSolrServer is because of this bug... https://issues.apache.org/jira/browse/SOLR-1990 ...if there are no actual documents in your UpdateRequest (because you are using the stream.url param) then the StreamingUpdateSolrServer blocks until all other requests are done, then delegates to the super class (so it never actually puts your indexing requests in a buffered queue, it just delays and then does them immediately) Not sure of a good way around this off the top of my head, but i'll note it in SOLR-1990 as another problematic use case that needs dealt with. Perhaps I can execute an initial update request using a benign file before making the "stream.url" call? Also, to beat a dead horse, this: 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true' ... works fine - I just want to do it a LOT and as efficiently as possible. If I have to I can wrap it in a perl script and run a cURL or LWP loop but I'd prefer to use SolrJ if I can. Thanks for all your help. - Tod
Re: Solrj ContentStreamUpdateRequest Slow
On 8/19/2010 1:45 AM, Lance Norskog wrote: 'stream.url' is just a simple parameter. You should be able to just add it directly. I agree (code excluding imports):

public class CommonTest {

  public static void main(String[] args) {
    System.out.println("main...");
    try {
      String fileName = "http://remoteserver/test/test.pdf";
      String solrId = "1234";
      indexFilesSolrCell(fileName, solrId);
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }

  /**
   * Method to index all types of files into Solr.
   * @param fileName
   * @param solrId
   * @throws IOException
   * @throws SolrServerException
   */
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {
    System.out.println("indexFilesSolrCell...");
    String urlString = "http://localhost:9080/solr";
    System.out.println("getting connection...");
    SolrServer solr = new CommonsHttpSolrServer(urlString);
    System.out.println("getting updaterequest handle...");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    System.out.println("setting params...");
    req.setParam("stream.url", fileName);
    req.setParam("literal.content_id", solrId);
    System.out.println("making request...");
    solr.request(req);
    System.out.println("committing...");
    solr.commit();
    System.out.println("done...");
  }
}

At "making request" I get:

java.lang.NullPointerException
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:381)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
  at CommonTest.indexFilesSolrCell(CommonTest.java:59)
  at CommonTest.main(CommonTest.java:26)

... which is pointing to the solr.request(req) line. Thanks - Tod
Re: Data Import Handler Rich Format Documents
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote: Hi, I have exactly the same problem as the one you submitted in this link http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html and I would like to ask you if you got a solution for it. I started to have a look at Tika and DataImportHandler but I haven't managed to find the right way of writing the syntax. So can you please give an example if you succeeded in finding the right syntax. Thanks. Bumping this to the list... Unfortunately I could never get DIH to work correctly. My suspicion is that I was using a stock 1.4.0 Solr but attempting to perform a task that was only available on the latest build. My customer requirements demand a pretty well vetted GA release so experimenting was not an option. I attempted an upgrade (quickly, sloppily) to 1.4.1 but no luck. I believe the next GA release might be my solution. I tried getting around that bump by trying SolrJ ContentStreamUpdateRequest @ http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927. After floundering for a while I decided to put that on hold. I ended up writing a Perl script that emulates the command line cURL that I referenced in the above thread. It took about 72 hours to index ~850,000 entries (if anyone is interested). I plan on looping back to try the suggestions Hoss last made, just haven't had the time to respond. I'm sure things will work; I just needed something quickly and don't have the seasoned experience the other developers do. - Tod
UpdateXmlMessage
I can do this using GET: http://localhost:8983/solr/update?stream.body=%3Cdelete%3E%3Cquery%3Eoffice:Bridgewater%3C/query%3E%3C/delete%3E http://localhost:8983/solr/update?stream.body=%3Ccommit/%3E ... but can I pass a stream.url parameter using an UpdateXmlMessage? I looked at the schema and I think the answer is no but just wanted to check. TIA
Re: UpdateXmlMessage
On 10/1/2010 11:33 PM, Lance Norskog wrote: Yes. stream.file and stream.url are independent of the request handler. They do their magic at the very top level of the request. However, there are no unit tests for these features, but they are widely used. Sorry Lance, are you agreeing that I can't or that I can? If I can, I'm doing something wrong. I'm specifying stream.url as its own field in the XML like:

<add>
  <doc>
    <field name="author">I am the author</field>
    <field name="title">I am the title</field>
    <field name="stream.url">http://www.test.com/myOfficeDoc.doc</field>
    . . .
  </doc>
</add>

The wiki docs were a little sparse on this one. - Tod Tod wrote: I can do this using GET: http://localhost:8983/solr/update?stream.body=%3Cdelete%3E%3Cquery%3Eoffice:Bridgewater%3C/query%3E%3C/delete%3E http://localhost:8983/solr/update?stream.body=%3Ccommit/%3E ... but can I pass a stream.url parameter using an UpdateXmlMessage? I looked at the schema and I think the answer is no but just wanted to check. TIA
Overriding Tika's field processing
I'm reading my document data from a CMS and indexing it using calls to curl. The curl call includes 'stream.url' so Tika will also index the actual document pointed to by the CMS' stored url. This works fine. Presentation side I have a dropdown with the title of all the indexed documents such that when a user clicks one of them it opens in a new window. Using js, I've been parsing the json returned from Solr to create the dropdown. The problem is I can't get the titles sorted alphabetically. If I use a facet.sort on the title field I get back ALL the sorted titles in the facet block, but that doesn't include the associated URL's. A sorted query won't work because title is a multivalued field. The one option I can think of is to make the title single valued so that I have a one to one relationship to the returned url. To do that I'd need to be able to *not* index the Tika returned values. If I read right, my understanding was that I could use 'literal.title' in the curl call to limit what would be included in the index from Tika. That doesn't seem to be working as a test facet query returns more than I have in the CMS. Am I understanding the 'literal.title' processing correctly? Does anybody have experience/suggestions on how to handle this? Thanks - Tod
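If title were made single-valued as described, one way to get sorted title/URL pairs is a plain sorted query rather than a facet. A sketch, with the Solr URL and field names assumed rather than taken from the actual schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SortedTitles {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        SolrQuery q = new SolrQuery("*:*");
        q.setFields("title", "url");                      // hypothetical field names
        q.setSortField("title", SolrQuery.ORDER.asc);     // sorting only works on a single-valued field
        q.setRows(1000);
        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            // Each title comes back with its associated URL, already alphabetized.
            System.out.println(doc.getFieldValue("title") + " -> " + doc.getFieldValue("url"));
        }
    }
}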
Facet count of zero
I'm trying to exclude certain facet results from a facet query. It seems to work, but rather than being excluded from the facet list it's returned with a count of zero. Ex: q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true This returns bar with a count of zero. All the other foo's show up with valid counts. Can I do this? Is my syntax incorrect? Thanks - Tod
Re: Facet count of zero
On 11/1/2010 1:03 PM, Yonik Seeley wrote: On Mon, Nov 1, 2010 at 12:55 PM, Tod wrote: I'm trying to exclude certain facet results from a facet query. It seems to work but rather than being excluded from the facet list it's returned with a count of zero. If you don't want to see 0 counts, use facet.mincount=1 http://wiki.apache.org/solr/SimpleFacetParameters -Yonik http://www.lucidimagination.co Ex: q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true This returns bar with a count of zero. All the other foo's show up with valid counts. Can I do this? Is my syntax incorrect? Thanks - Tod Excellent, I completely missed it - thanks!
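The SolrJ equivalent of Yonik's suggestion, for anyone doing this from Java; the query, field name, and Solr URL are just placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class FacetMinCount {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        SolrQuery q = new SolrQuery("-foo:bar");   // exclude bar from the result set
        q.setFacet(true);
        q.addFacetField("foo");
        q.setFacetMinCount(1);                     // drops the zero-count "bar" bucket
        q.set("facet.sort", "index");              // same as facet.sort=index
        for (FacetField.Count c : solr.query(q).getFacetField("foo").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }
}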
Phrase Query Problem?
I have a number of fields I need to do an exact match on. I've defined them as 'string' in my schema.xml. I've noticed that I get back query results that don't have all of the words I'm using to search with. For example: q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json With an exact match this should return only one entry, but it returns five, some of which don't have any of the fields I've specified. I've tried this both with and without quotes. What could I be doing wrong? Thanks - Tod
Re: Phrase Query Problem?
On 11/1/2010 11:14 PM, Ken Stanley wrote: On Mon, Nov 1, 2010 at 10:26 PM, Tod wrote: I have a number of fields I need to do an exact match on. I've defined them as 'string' in my schema.xml. I've noticed that I get back query results that don't have all of the words I'm using to search with. For example: q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json Should, with an exact match, return only one entry but it returns five some of which don't have any of the fields I've specified. I've tried this both with and without quotes. What could I be doing wrong? Thanks - Tod Tod, Without knowing your exact field definition, my first guess would be your first boolean query; because it is not quoted, what SOLR typically does is to transform that type of query into something like (assuming your uniqueKey is "id"): (mykeywords:Compliance id:With id:Conduct id:Standards). If you do (mykeywords:"Compliance+With+Conduct+Standards") you might see different (better?) results. Otherwise, append &debugQuery=on to your URL and you can see exactly how SOLR is parsing your query. If none of that helps, what is your field definition in your schema.xml? - Ken The field definition is:

<field name="mykeywords" type="string" multiValued="true"/>

The request:

select?q=(((mykeywords:"Compliance+With+Attorney+Conduct+Standards")OR(mykeywords:All)OR(mykeywords:ALL)))&fl=mykeywords&start=0&indent=true&wt=json&debugQuery=on

The response looks like this:

"responseHeader":{
  "status":0,
  "QTime":8,
  "params":{
    "wt":"json",
    "q":"(((mykeywords:Compliance With Attorney Conduct Standards)OR(mykeywords:All)OR(mykeywords:ALL)))",
    "start":"0",
    "indent":"true",
    "fl":"mykeywords",
    "debugQuery":"on"}},
"response":{"numFound":6,"start":0,"docs":[
  { "mykeywords":["Compliance With Attorney Conduct Standards"]},
  { "mykeywords":["Anti-Bribery","Bribes"]},
  { "mykeywords":["Marketing Guidelines","Marketing"]},
  {},
  { "mykeywords":["Anti-Bribery","Due Diligence"]},
  { "mykeywords":["Anti-Bribery","AntiBribery"]}]
},
"debug":{
  "rawquerystring":"(((mykeywords:Compliance With Attorney Conduct Standards)OR(mykeywords:All)OR(mykeywords:ALL)))",
  "querystring":"(((mykeywords:Compliance With Attorney Conduct Standards)OR(mykeywords:All)OR(mykeywords:ALL)))",
  "parsedquery":"(mykeywords:Compliance text:attorney text:conduct text:standard) mykeywords:All mykeywords:ALL",
  "parsedquery_toString":"(mykeywords:Compliance text:attorney text:conduct text:standard) mykeywords:All mykeywords:ALL",
  "explain":{ ...

As you mentioned, looking at the parsed query it's breaking the request up on word boundaries rather than on the entire phrase. The goal is to return only the very first entry. Any ideas? Thanks - Tod
Re: Phrase Query Problem?
On 11/2/2010 9:21 AM, Ken Stanley wrote: On Tue, Nov 2, 2010 at 8:19 AM, Erick Erickson wrote: That's not the response I get when I try your query, so I suspect something's not quite right with your test... But you could also try putting parentheses around the words, like mykeywords:(Compliance+With+Conduct+Standards) Best Erick I agree with Erick, your query string showed quotes, but your parsed query did not. Using quotes, or parenthesis, would pretty much leave your query alone. There is one exception that I've found: if you use a stopword analyzer, any stop words would be converted to ? in the parsed query. So if you absolutely need every single word to match, regardless, you cannot use a field type that uses the stop word analyzer. For example, I have two dynamic field definitions: df_text_* that does the default text transformations (including stop words), and df_text_exact_* that does nothing (field type is string). When I run the query df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America", the following is shown as my query/parsed query when debugQuery is on: df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America" df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America" df_text_exact_company_name:Bank of America PhraseQuery(df_text_company_name:"bank ? america") df_text_exact_company_name:Bank of America df_text_company_name:"bank ? america" The difference is subtle, but important. If I were to do df_text_company_name:"Bank and America", I would still match "Bank of America". These are things that you should keep in mind when you are creating fields for your indices. A useful tool for seeing what SOLR does to your query terms is the Analysis tool found in the admin panel. You can do an analysis on either a specific field, or by a field type, and you will see a breakdown by Analyzer for either the index, query, or both of any query that you put in. This would definitely be useful when trying to determine why SOLR might return what it does. - Ken What it turned out to be was escaping the spaces. q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL))) became q=(((mykeywords:Compliance\+With\+Conduct\+Standards)OR(mykeywords:All)OR(mykeywords:ALL))) If I tried q=(((mykeywords:"Compliance+With+Conduct+Standards")OR(mykeywords:All)OR(mykeywords:ALL))) ... it didn't work. Once I removed the quotes and escaped spaces it worked as expected. This seems odd since I would have expected the quotes to have triggered a phrase query. Thanks for your help. - Tod
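A tiny sketch of the escaping fix described above, building the query string before URL encoding; the field name is taken from the thread, everything else is illustrative:

public class EscapeExample {
    public static void main(String[] args) {
        String phrase = "Compliance With Conduct Standards";
        // Backslash-escape the spaces so the whole phrase reaches the string field as one term,
        // which is what the \+ escapes in the URL-encoded query above amount to.
        String escaped = phrase.replace(" ", "\\ ");
        String q = "mykeywords:" + escaped + " OR mykeywords:All OR mykeywords:ALL";
        System.out.println(q);
        // Newer SolrJ versions also ship ClientUtils.escapeQueryChars(...) as a general-purpose helper.
    }
}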
Chinese characters - a little OT
Sorry, OT but it's driving me nuts. I've indexed a document with chinese characters in its title. When I perform the search (that returns json) I get back the title and using Javascript place it into a variable that ultimately ends up as a dropdown of titles to choose from. The problem is the title contains the literal unicode representation of the chinese characters (中 for example). Here's the javascript:

var optionObj = document.createElement('option');
menuItem = titleArray[1].title;
menuVal = titleArray[1].url;
if ((menuItem != " ") && (menuItem != "") && (menuItem != null)) {
  optionObj.appendChild(document.createTextNode(menuItem));
  optionObj.setAttribute('id', "optId" + optCnt);
  optionObj.setAttribute('target', "_blank");
  optionObj.setAttribute('value', menuVal);
  optCnt++;
  selectObj.appendChild(optionObj);
}

My hunch is I should utf-8 encode the title and then try and display the result, but it's not working. I still am seeing the unicode characters. Does anyone see what I could be doing wrong? TIA - Tod
Re: Any Copy Field Caveats?
I've noticed that using camelCase in field names causes problems. On 11/5/2010 11:02 AM, Will Milspec wrote: Hi all, we're moving from an old lucene version to solr and plan to use the "Copy Field" functionality. Previously we had "rolled our own" implementation, sticking title, description, etc. in a field called 'content'. We lose some flexibility (i.e. java layer can no longer control what gets in the new copied field), at the expense of simplicity. A fair tradeoff IMO. My question: has anyone found any subtle issues or "gotchas" with copy fields? (from the subject line "caveat"--pronounced 'kah-VEY-AT' is Latin as in "Caveat Emptor"..."let the buyer beware"). thanks, will will
Retrieving indexed content containing multiple languages
My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out of the box Solr schema. The search results are presented via an AJAX enabled HTML page. When I perform a search the document title (for example) has a mix of english and chinese characters. Everything there is fine - I can see the english and chinese returned from a facet query on title. I can search against the title using english words it contains and I get back an expected result. I asked a chinese friend to perform the same search using chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language, I'll probably need to support more in the future. My thought is that the chinese characters are indexed as their unicode equivalent so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in english. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
Upgrading Tika "in place"
I'm running an older version of Solr - 3.4.0.2011.09.09.09.06.17. It seems the version of Tika that came with it has trouble with some PDF files and newer Office documents. I've checked the latest Tika release and it solves these problems. I'd like to just drop in the necessary Tika jars without needing to rebuild or upgrade Solr. Is that a possibility, and if so how would I go about accomplishing it? I see tika-core and tika-parsers in the 3.6.2 Solr build distro; are those the only two files I need? Thanks - Tod
Solr 3.6 parsing and extraction files
Could someone possibly provide me with a list of jars that I need to extract from the apache-solr-3.6.0.tgz file to enable the parsing and remote streaming of office style documents? I assume (for a multicore configuration) they would go into ./tomcat/webapps/solr/WEB-INF/lib - correct? Thanks - Tod
Re: Retrieving indexed content containing multiple languages
On 11/11/2010 3:24 PM, Dennis Gearon wrote: I look forward to the answers to this one. Well, it seems it was as easy as adding the CJKTokenizerFactory:

<fieldType ... positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

Once I did that and reindexed I could search for both english and chinese using the default 'text' field. The next hurdle was getting the javascript to cooperate. The chinese characters were getting corrupted on the way to the AJAX call against the Solr server. As it turned out I was performing a POST to Solr using the jQuery .ajax api call. Apparently when executing a POST you need to make sure the characters entered into the input field of the form are converted to unicode (\u7968 for example) prior to the AJAX call to Solr. Conversely, if executing a GET you need to convert the characters to UTF8 (%E7%A5%A8). So now my customers are happily finding the appropriate document using english and chinese. If someone could check my math I would appreciate it. If it looks reasonable and there is nothing else written about it on the wiki I'll create a tutorial to give everybody else a leg up. - Tod - Original Message From: Tod To: solr-user@lucene.apache.org Sent: Thu, November 11, 2010 11:35:23 AM Subject: Retrieving indexed content containing multiple languages My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out of the box Solr schema. The search results are presented via an AJAX enabled HTML page. When I perform a search the document title (for example) has a mix of english and chinese characters. Everything there is fine - I can see the english and chinese returned from a facet query on title. I can search against the title using english words it contains and I get back an expected result. I asked a chinese friend to perform the same search using chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language, I'll probably need to support more in the future. My thought is that the chinese characters are indexed as their unicode equivalent so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in english. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
Opensearch Format Support
Does Solr support the OpenSearch format? If so, could someone point me to the correct documentation? Thanks - Tod
Term Vector Query on Single Document
I have a couple of semi-related questions regarding the use of the Term Vector Component: - Using curl is there a way to query a specific document (maybe using Tika when required?) to get a distribution of the terms it contains? - When I set the termVector on a field do I need to reindex? I'm thinking 'yes' - How expensive is setting the termVector on a field? Thanks - Tod
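On the first question: assuming the example solrconfig's term vector request handler (TermVectorComponent, usually registered at /tvrh) is enabled and the field has termVectors enabled, a single document's term distribution can be requested roughly like this; the handler name, field, and id below are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TermVectorQuery {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        SolrQuery q = new SolrQuery("content_id:1234");   // limit the result to one document
        q.setQueryType("/tvrh");                          // term vector handler from the example config
        q.set("tv", true);
        q.set("tv.fl", "content");                        // field with termVectors enabled
        q.set("tv.tf", true);                             // ask for per-term frequencies
        System.out.println(solr.query(q).getResponse());  // term vectors come back in the raw response
    }
}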
Can ExtractingRequestHandler ignore documents metadata
I'm indexing content from a CMS' database of metadata. The client would prefer that Solr exclude the properties (metadata) of any documents being indexed. Is there a way to tell Tika to only index a document's text and not its properties? Thanks - Tod
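One documented knob that gets close to this: the extract handler's uprefix parameter can route every field Tika emits that isn't already in the schema to a prefix such as ignored_, and the example schema ships an ignored_* dynamicField that silently drops those values. A SolrJ sketch, with host, document URL, and literals as placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class IgnoreTikaMetadata {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        UpdateRequest req = new UpdateRequest("/update/extract");
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("stream.url", "http://cms.example.com/some.doc"); // placeholder document
        p.set("literal.content_id", "1234");  // metadata you do want comes in as literals from the CMS
        p.set("uprefix", "ignored_");         // unknown Tika-emitted fields map to ignored_* and are dropped
        req.setParams(p);
        solr.request(req);
        solr.commit();
    }
}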
Indexing Mediawiki
I have a need to index an internal instance of Mediawiki. I'd like to use DIH if I can since I have access to the database but the example provided on the Solr wiki uses a Mediawiki dump XML file. Does anyone have any experience using DIH in this manner? Am I barking up the wrong tree and would be better off dumping and indexing the wiki instead? Thanks - Tod
Tika JAX-RS and DIH
Mattmann, Chris A (388J jpl.nasa.gov> writes: > > Hi Jo, > > You may consider checking out Tika trunk, where we recently have a Tika JAX-RS web service [1] committed as > part of the tika-server module. You could probably wire DIH into it and accomplish the same thing. > > Cheers, > Chris > > [1] https://issues.apache.org/jira/browse/TIKA-593 Chris - could you elaborate on using Tika JAX-RS and DIH? How production ready is it? Could you summarize the steps necessary to get it to work? Any examples yet? I'd be happy to work with you to get something out to the group. Thanks - Tod
Default schema - 'keywords' not multivalued
I noticed that the 'keywords' field in the default schema isn't multivalued. This was a little curious to me, and I wondered what the thought process was behind it before I decide to change it. Thanks - Tod
Re: Default schema - 'keywords' not multivalued
On 06/27/2011 11:23 AM, lee carroll wrote: Hi Tod, A list of keywords would be fine in a non multi valued field: keywords : "xxx yyy sss aaa " A multi valued field would allow you to repeat the field when indexing: keywords: "xxx" keywords: "yyy" keywords: "sss" etc Thanks Lee. The problem is I'm manually pushing a document (via stream.url) and its metadata from a database with the Solr /update/extract REST service, HTTP GET, using Perl. I'm streaming over the document content (presumably via tika) and it's gathering the document's metadata, which includes the keywords metadata field. Since I'm also passing that field from the DB to the REST call as a list (as you suggested) there is a collision because the keywords field is single valued. I can change this behavior using a copy field. What I wanted to know is if there was a specific reason the default schema defined a field like keywords single valued, so I could make sure I wasn't missing something before I changed things. While I'm at it, I'd REALLY like to know how to use DIH to index the metadata from the database while simultaneously streaming over the document content and indexing it. I've never quite figured it out yet but I have to believe it is a possibility. - Tod
Re: Default schema - 'keywords' not multivalued
On 06/28/2011 12:04 PM, Chris Hostetter wrote: : I'm streaming over the document content (presumably via tika) and its : gathering the document's metadata which includes the keywords metadata field. : Since I'm also passing that field from the DB to the REST call as a list (as : you suggested) there is a collision because the keywords field is single : valued. : : I can change this behavior using a copy field. What I wanted to know is if : there was a specific reason the default schema defined a field like keywords : single valued so I could make sure I wasn't missing something before I changed : things. That file is just an example, you're absolutely free to change it to meet your use case. I'm not very familiar with Tika, but based on the comment in the example config... ...i suspect it was intentional that that field is *not* multiValued (i guess Tika always returns a single delimited value?) but if you have multiple discrete values you want to send for your DB backed data there is no downside to changing that. : While I'm at it, I'd REALLY like to know how to use DIH to index the metadata : from the database while simultaneously streaming over the document content and : indexing it. I've never quite figured it out yet but I have to believe it is : a possibility. There's a TikaEntityProcessor that can be used to have Tika crunch the data that comes from an "entity" and extract out specific fields, and it can be used in combination with a JdbcDataSource and a BinFileDataSource so that a field in your db data specifies the name of a file on disk to use as the TikaEntity -- but i've personally never tried it Here's a simple example someone posted last year that they got working... http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html -Hoss Thanks Hoss, I'll just change the schema then. The problem with TikaEntityProcessor is this installation is still running v1.4.1 so I'll need to upgrade. Any short and sweet instructions for upgrading to 3.2? I have a pretty straight forward Tomcat install, would just dropping in the new war suffice? - Tod
multiple webapps vs multi-core vs distributed
Currently I'm working with a group implementing Solr on an enterprise level. Their initial toe dipping into Solr consists of running multiple (two) webapps on Tomcat using identical schemas. Content is dispersed among a variety of repositories, from CMS, DMS, and WCMS to file systems and RDBMSs. The expectation is that this implementation is going to get very popular very quickly. With that in mind there is also a very large, very diverse set of business groups spanning the entire organization, all of which want to participate. This participation is based mostly on marketing their wares, not making sure a unified enterprise taxonomy exists that can ultimately facilitate search relevancy at an enterprise level. Therefore a unified taxonomy most likely can't be completed within the time frame the customer wants to have the search up and running. So it's up to us to figure out how to satisfy the immediate needs of each individual business entity, without the benefit of a unified enterprise wide taxonomy, and with advance knowledge there is a likelihood that each unit's search index may be based on a different schema dependent on their individual business drivers. At an enterprise level users should be able to search the entire set of individual indexes returning a merged result, with a desire to provide a high level of relevancy to individual business groups along with the enterprise audience both internal and external. From what I've been reading I think the current configuration may not stand up to the long term demand, both from a usability and administrative standpoint, but I'm not completely sure. That leaves multi-core and distributed search as possibilities. I'm leaning towards multi-core. Part of this decision is based on my perceived performance and administrative gains over the current configuration. Distributed search is a possibility, but in the short to medium term I don't see the number of indexed documents increasing to a size that would require it. Plus I think the lack of a unified schema might throw a monkey wrench into the mix, limiting the available solutions. Does anyone have a similar experience they would be willing to share? It's early enough in the project life cycle that alternative ideas can be considered. I'd be interested to hear others' opinions. TIA - Tod
tika.parser.AutoDetectParser
I'm working on upgrading to v3.2 from v1.4.1. I think I've got everything working, but when I try to do a data import using dataimport.jsp the import rolls back and I get a class-not-found exception on the above referenced class. I thought that Tika was packaged up with the base Solr build now, but this message seems to contradict that unless I'm missing a jar somewhere. I've got both dataimporthandler jar files in my WEB-INF/lib dir so I'm not sure what I could be missing. Any ideas? Thanks - Tod
Re: tika.parser.AutoDetectParser
On 07/01/2011 12:59 PM, Shawn Heisey wrote:

On 7/1/2011 9:23 AM, Tod wrote:
I'm working on upgrading to v3.2 from v1.4.1. I think I've got everything working, but when I try to do a data import using dataimport.jsp the import rolls back and I get a class-not-found exception on the above referenced class. I thought that Tika was packaged up with the base Solr build now, but this message seems to contradict that unless I'm missing a jar somewhere. I've got both dataimporthandler jar files in my WEB-INF/lib dir so I'm not sure what I could be missing. Any ideas?

Tika is included in the Solr download, but it's not included in the .war or any of the other files in the dist directory. You may have noticed that you now have to include one or more jars for the dataimport handler. If you copy the following files from the Solr download to the same place you have apache-solr-dataimporthandler-3.2.0.jar, you should be OK.

contrib/extraction/lib/tika-core-0.8.jar
contrib/extraction/lib/tika-parsers-0.8.jar

Thanks,
Shawn

Got them, thanks Shawn.
ContentStreamLoader Problem
I'm getting this error testing Solr V3.3.0 using the ExtractingRequestHandler. I'm taking advantage of the REST interface and batching my documents in using stream.url. It happens for every document I try to index. It works fine under Solr 1.4.1. I'm running everything under Tomcat. I already have an existing 1.4.1 instance running, could that be causing the problem?

Thanks - Tod

Jul 12, 2011 1:11:31 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 1
Jul 12, 2011 1:11:31 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.AbstractMethodError: org/apache/solr/handler/ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:811)
Re: ContentStreamLoader Problem
On 07/12/2011 6:52 PM, Erick Erickson wrote:

This is a shot in the dark, but this smells like a classpath issue, and since you have a 1.4.1 installation on the machine, I'm *guessing* that you're getting a mix of old and new jars. What happens if you try this on a machine that doesn't have 1.4.1 on it? If that works, then it's likely a classpath issue.

Best,
Erick

I'll give it a shot and report back.

Thanks - Tod
Most current Tika jar files that work with Solr 1.4.1
What is the latest version of Tika that I can use with Solr 1.4.1? It comes packaged with 0.4. I tried 0.8 and it didn't work.
Solr read timeout
I'm using Perl to indirectly call the Solr ExtractingRequestHandler to stream remote documents into a Solr index instance. Every 100 URLs I process I do a commit. I've got about 30K documents to be indexed. I'm using a stock, out-of-the-box version of Solr 1.4.1 with the necessary schema changes for the fields I'm indexing.

I seem to be running into performance problems about 40 documents in. I start getting "Failed: 500 read timeout" errors that last about 4 minutes each, slowing processing down to a crawl. I've tried a later version of Tika (0.8) and that didn't seem to help; I'm also not sure it's the problem. Given that I'm using a pretty much unaltered version of Solr, could the problem be on my end? I'm running everything under a typical Tomcat install on a Linux VM.

I understand there are performance tweaks I can make to the Solr config, but I'd like to focus first on resolving this problem rather than blanket-tweaking the entire config. Is there anything in particular I should look at? Can I provide any more information?

Thanks - Tod
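The post uses Perl and LWP, but the request shape is the same from any HTTP client. Below is a minimal Java sketch of the pattern described above, stream.url per document plus a commit every 100 documents; the Solr host, handler path, document URLs, and the choice of the URL as literal.id are assumptions for illustration, not taken from the thread:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.Arrays;
    import java.util.List;

    public class StreamUrlIndexer {
        // Assumed Solr location; adjust for the actual Tomcat install.
        private static final String SOLR = "http://localhost:8080/solr";

        private static int get(String path) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(SOLR + path).openConnection();
            conn.setReadTimeout(120000);          // generous read timeout for slow extractions
            int status = conn.getResponseCode();  // issues the request
            conn.disconnect();
            return status;
        }

        public static void main(String[] args) throws Exception {
            // Stand-ins for the ~30K document URLs pulled from the database.
            List<String> docUrls = Arrays.asList(
                    "http://cms.example.com/docs/a.pdf",
                    "http://cms.example.com/docs/b.doc");
            int count = 0;
            for (String docUrl : docUrls) {
                String enc = URLEncoder.encode(docUrl, "UTF-8");
                // Solr Cell fetches the remote document itself via stream.url;
                // here the document URL doubles as the unique key.
                int status = get("/update/extract?stream.url=" + enc + "&literal.id=" + enc);
                if (status != 200) {
                    System.err.println("Failed: " + status + " for " + docUrl);
                }
                if (++count % 100 == 0) {
                    get("/update?commit=true");   // commit every 100 documents, as in the post
                }
            }
            get("/update?commit=true");           // final commit for any remainder
        }
    }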
JSON formatted response from SOLR question....
I apologize, this is really a JSON/JavaScript question, but I'm stuck and am not finding any resources that address this specifically.

I'm doing a faceted search and getting back in my facet_counts.facet_fields response an array of countries. I'm gathering the count of the array elements returned using this notation:

rsp.facet_counts.facet_fields.country.length

... where rsp is the eval'ed JSON response from SOLR. From there I just loop through, listing the individual country with its associated count. The problem I am having is trying to automate this to loop through any one of a number of facets contained in my JSON response, not just country. So instead of the above I would have something like:

rsp.facet_counts.facet_fields.VARIABLE.length

... where VARIABLE would be the name of one of the facets passed into a JavaScript function to perform the loop. None of the JavaScript examples I can find seem to address this.

Has anyone run into this? Is there a better list to ask this question? Thanks in advance.
Re: JSON formatted response from SOLR question....
Jon,

Yes!!! Changed rsp.facet_counts.facet_fields.['var'].length to rsp.facet_counts.facet_fields[var].length and voila. Tripped up on a syntax error, how special. Just needed another set of eyes - thanks. VelocityResponseWriter duly noted, it will come in handy later.

- Tod

On 5/10/2010 4:55 PM, Jon Baer wrote:

IIRC, I think what we ended up doing in a project was to use the VelocityResponseWriter to write the JSON and set the echoParams to read the handler setup (and looping through the variables). In the template you can grab it w/ something like $request.params.get("facet_fields") ... I don't remember the exact hack here but basically you should also be able to do something like:

rsp.facet_counts.facet_fields['var'].length

In the end w/ some of the nice stuff from the Velocity tools .jar it was easier to work w/ the layout needed for plugins.

- Jon

On May 10, 2010, at 10:18 AM, Tod wrote:

I apologize, this is really a JSON/JavaScript question, but I'm stuck and am not finding any resources that address this specifically. I'm doing a faceted search and getting back in my facet_counts.facet_fields response an array of countries. I'm gathering the count of the array elements returned using this notation: rsp.facet_counts.facet_fields.country.length ... where rsp is the eval'ed JSON response from SOLR. From there I just loop through, listing the individual country with its associated count. The problem I am having is trying to automate this to loop through any one of a number of facets contained in my JSON response, not just country. So instead of the above I would have something like: rsp.facet_counts.facet_fields.VARIABLE.length ... where VARIABLE would be the name of one of the facets passed into a JavaScript function to perform the loop. None of the JavaScript examples I can find seem to address this. Has anyone run into this? Is there a better list to ask this question? Thanks in advance.
Compile problems with anonymous SimpleCollector in custom request handler
Hi everyone,

I'm modifying an existing custom request handler for an open source project, and am looking for some help with a compile error around an anonymous SimpleCollector. The build failure message from ant and the source of the specific method are below. I am compiling on a Mac with Java 1.8 and Solr 6.4.2. There are two things I do not understand.

First:

[javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: is not abstract and does not override abstract method setNextReader(AtomicReaderContext) in Collector
[javac] db.search(q, new SimpleCollector() {

Based on the javadoc, neither SimpleCollector nor Collector define a setNextReader(AtomicReaderContext) method. Grepping through the Lucene 6.4.2 source reveals neither a setNextReader method (though maybe a couple archaic comments), nor an AtomicReaderContext class or interface.

Second:

[javac] method IndexSearcher.search(Query,Collector) is not applicable
[javac] (argument mismatch; cannot be converted to Collector)

How is it that SimpleCollector cannot be converted to Collector? Perhaps this is just a consequence of the first error.

Any help getting past this compile problem would be most welcome!

-Tod

Build failure message:

build-handler:
[mkdir] Created dir: /Users/tod/src/vufind-browse-handler/build/browse-handler
[javac] Compiling 1 source file to /Users/tod/src/vufind-browse-handler/build/browse-handler
[javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: is not abstract and does not override abstract method setNextReader(AtomicReaderContext) in Collector
[javac] db.search(q, new SimpleCollector() {
[javac] ^
[javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: no suitable method found for search(TermQuery,)
[javac] db.search(q, new SimpleCollector() {
[javac] ^
[javac] method IndexSearcher.search(Query,int) is not applicable
[javac] (argument mismatch; cannot be converted to int)
[javac] method IndexSearcher.search(Query,Filter,int) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,Filter,Collector) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,Collector) is not applicable
[javac] (argument mismatch; cannot be converted to Collector)
[javac] method IndexSearcher.search(Query,Filter,int,Sort) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,Filter,int,Sort,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,int,Sort) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Weight,ScoreDoc,int) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(List,Weight,ScoreDoc,int) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Weight,int,Sort,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Weight,FieldDoc,int,Sort,boolean,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(List,Weight,FieldDoc,int,Sort,boolean,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(List,Weight,Collector) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] 2 errors

Problem method:

/**
 *
 * Function to retrieve the doc ids when there is a building limit
 * This retrieves the doc ids for an individual heading
 *
 * Need to add a filter query to limit the results from Solr
 *
 * Includes functionality to retrieve additional info
 * like titles for call numbers, possibly ISBNs
 *
 * @param heading        string of the heading to use for finding matching docs
 * @param fields         colon-separated string of Solr fields to return for use in the browse display
 * @param maxBibListSize maximum number of records to check for fields
 * @return return a map of Solr ids and extra bib info
 */
publi
Re: Compile problems with anonymous SimpleCollector in custom request handler
Shawn,

Thanks for the response! Yes, that was it, an older version unexpectedly in the classpath.

And for the benefit of anyone who searches the list archive with a similar debugging need, it's pretty easy to print out the classpath from ant's build.xml, e.g. by converting the compile classpath to a property with <pathconvert> and echoing it: <echo>Classpath: ${classpathProp}</echo>

-Tod

On Nov 29, 2017, at 6:00 PM, Shawn Heisey <apa...@elyograg.org> wrote:

On 11/29/2017 2:27 PM, Tod Olson wrote:
I'm modifying an existing custom request handler for an open source project, and am looking for some help with a compile error around an anonymous SimpleCollector. The build failure message from ant and the source of the specific method are below. I am compiling on a Mac with Java 1.8 and Solr 6.4.2. There are two things I do not understand. First: [javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: is not abstract and does not override abstract method setNextReader(AtomicReaderContext) in Collector [javac] db.search(q, new SimpleCollector() { Based on the javadoc, neither SimpleCollector nor Collector define a setNextReader(AtomicReaderContext) method. Grepping through the Lucene 6.4.2 source reveals neither a setNextReader method (though maybe a couple archaic comments), nor an AtomicReaderContext class or interface. Second: [javac] method IndexSearcher.search(Query,Collector) is not applicable [javac] (argument mismatch; cannot be converted to Collector) How is it that SimpleCollector cannot be converted to Collector? Perhaps this is just a consequence of the first error.

For the first error: What version of Solr/Lucene are you compiling against? I have found that Collector *did* have a setNextReader method up through Lucene 4.10.4, but in 5.0, that method was gone. I suspect that what's causing your first problem is that you have older Lucene jars (4.x or earlier) on your classpath, in addition to a newer version that you actually want to use for the compile.

I think that can also explain the second problem. It looks like SimpleCollector didn't exist in Lucene 4.10, which is the last version where Collector had setNextReader. SimpleCollector is mentioned in the javadoc for Collector as of 5.0, though.

Thanks,
Shawn
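For anyone landing here from a search: once the stale 4.x jars are off the classpath, the old setNextReader(AtomicReaderContext) override still has to be replaced with the post-5.0 API. A minimal sketch against the Lucene 6.x Collector API follows; the searcher, query, and doc-id list are hypothetical stand-ins for the handler's own fields, not code from this thread:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.SimpleCollector;

    public class CollectDocIds {
        // Collect the index-wide doc ids matching q; searcher and q are assumed
        // to come from the surrounding request handler.
        public static List<Integer> collect(IndexSearcher searcher, Query q) throws IOException {
            final List<Integer> docIds = new ArrayList<>();
            searcher.search(q, new SimpleCollector() {
                private int docBase;   // offset of the current leaf reader

                @Override
                protected void doSetNextReader(LeafReaderContext context) throws IOException {
                    // replaces the pre-5.0 setNextReader(AtomicReaderContext)
                    docBase = context.docBase;
                }

                @Override
                public void collect(int doc) throws IOException {
                    docIds.add(docBase + doc);   // doc is relative to the current leaf
                }

                @Override
                public boolean needsScores() {
                    return false;                // required abstract method in Lucene 5/6
                }
            });
            return docIds;
        }
    }

collect(int) receives a doc id relative to the current leaf reader, so the docBase recorded in doSetNextReader() has to be added to get an index-wide id.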
Debugging custom RequestHandler: spinning up a core for debugging
Hi everyone,

I need to do some step-wise debugging on a custom RequestHandler. I'm trying to spin up a core in a JUnit test, with the idea of running it inside of Eclipse for debugging. (If there's an easier way, I'd like to see a walkthrough!) Problem is the core fails to spin up with:

java.io.IOException: Break Iterator Rule Data Magic Number Incorrect, or unsupported data version

Here's the code, just trying to load (cribbed and adapted from https://stackoverflow.com/questions/45506381/how-to-debug-solr-plugin):

public class BrowseHandlerTest {
    private static CoreContainer container;
    private static SolrCore core;

    private static final Logger logger = Logger.getGlobal();

    @BeforeClass
    public static void prepareClass() throws Exception {
        String solrHomeProp = "solr.solr.home";
        System.out.println(solrHomeProp + "= " + System.getProperty(solrHomeProp));
        // create the core container from the solr.solr.home system property
        container = new CoreContainer();
        container.load();
        core = container.getCore("biblio");
        logger.info("Solr core loaded!");
    }

    @AfterClass
    public static void cleanUpClass() {
        core.close();
        container.shutdown();
        logger.info("Solr core shut down!");
    }
}

The test, run through ant, fails as follows:

[junit] solr.solr.home= /Users/tod/src/vufind/solr/vufind
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] Tests run: 0, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.299 sec
[junit]
[junit] - Standard Error -
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] - ---
[junit] Testcase: org.vufind.solr.handler.tests.BrowseHandlerTest: Caused an ERROR
[junit] SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] org.apache.solr.common.SolrException: SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1066)
[junit] at org.vufind.solr.handler.tests.BrowseHandlerTest.prepareClass(BrowseHandlerTest.java:45)
[junit] Caused by: org.apache.solr.common.SolrException: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:833)
[junit] at org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:87)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:467)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:458)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[junit] at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
[junit] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[junit] at java.lang.Thread.run(Thread.java:745)
[junit] Caused by: java.lang.ExceptionInInitializerError
[junit] at org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory.inform(ICUTokenizerFactory.java:107)
[junit] at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:721)
[junit] at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:160)
[junit] at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:56)
[junit] at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:70)
[junit] at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:108)
[junit] at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:79)
[junit] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:812)
[junit] Caused by: java.lang.R
Re: Debugging custom RequestHandler: spinning up a core for debugging
Thanks, that pointed me in the right direction! The problem was an ancient ICU library in the distributed code. (A quick way to check which jar a class is actually coming from is sketched after the quoted message below.)

-Tod

On Dec 15, 2017, at 5:15 PM, Erick Erickson <erickerick...@gmail.com> wrote:

My guess is this isn't a Solr issue at all; you are somehow using an old Java. RBBIDataWrapper is from com.ibm.icu.text; I saw on a quick Google that this was cured by re-installing Eclipse, but that was from 5 years ago. You say your Java and IDE skills are a bit rusty, maybe you haven't updated your Java JDK or Eclipse in a while? I don't know if Eclipse somehow has its own Java (I haven't used Eclipse for quite a while).

I take it this runs outside Eclipse OK? (Well, with problems, otherwise you wouldn't be stepping through it.)

Best,
Erick

On Fri, Dec 15, 2017 at 1:16 PM, Tod Olson <t...@uchicago.edu> wrote:

Hi everyone,

I need to do some step-wise debugging on a custom RequestHandler. I'm trying to spin up a core in a JUnit test, with the idea of running it inside of Eclipse for debugging. (If there's an easier way, I'd like to see a walkthrough!) Problem is the core fails to spin up with:

java.io.IOException: Break Iterator Rule Data Magic Number Incorrect, or unsupported data version

Here's the code, just trying to load (cribbed and adapted from https://stackoverflow.com/questions/45506381/how-to-debug-solr-plugin):

public class BrowseHandlerTest {
    private static CoreContainer container;
    private static SolrCore core;

    private static final Logger logger = Logger.getGlobal();

    @BeforeClass
    public static void prepareClass() throws Exception {
        String solrHomeProp = "solr.solr.home";
        System.out.println(solrHomeProp + "= " + System.getProperty(solrHomeProp));
        // create the core container from the solr.solr.home system property
        container = new CoreContainer();
        container.load();
        core = container.getCore("biblio");
        logger.info("Solr core loaded!");
    }

    @AfterClass
    public static void cleanUpClass() {
        core.close();
        container.shutdown();
        logger.info("Solr core shut down!");
    }
}

The test, run through ant, fails as follows:

[junit] solr.solr.home= /Users/tod/src/vufind/solr/vufind
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] Tests run: 0, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.299 sec
[junit]
[junit] - Standard Error -
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] - ---
[junit] Testcase: org.vufind.solr.handler.tests.BrowseHandlerTest: Caused an ERROR
[junit] SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] org.apache.solr.common.SolrException: SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1066)
[junit] at org.vufind.solr.handler.tests.BrowseHandlerTest.prepareClass(BrowseHandlerTest.java:45)
[junit] Caused by: org.apache.solr.common.SolrException: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:833)
[junit] at org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:87)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:467)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:458)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[junit] at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
[junit] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit] at java.util.conc
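Not from the thread above, but a general JVM trick related to the resolution (the ancient ICU library): asking the class loader where a suspect class actually comes from exposes a stale bundled jar immediately. A small sketch; the class name is the ICU break-iterator class mentioned in the error, so swap in whatever class is misbehaving:

    public class WhichJar {
        public static void main(String[] args) throws Exception {
            // Prints the jar (or directory) the class was loaded from.
            Class<?> c = Class.forName("com.ibm.icu.text.RuleBasedBreakIterator");
            System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
        }
    }

Note that getCodeSource() can return null for classes loaded by the bootstrap loader, but a third-party library like ICU4J will normally report the jar it was loaded from.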