Please help - Solr Cell using 'stream.url'
I'm batching documents into Solr using Solr Cell with the 'stream.url' parameter. Everything works fine until I get about 5K documents in, and then it starts issuing 'read timeout 500' errors on every document. The sysadmin says there's plenty of CPU and memory and no paging, so it doesn't look like the OS is the problem. I can curl the documents that Solr is trying to index (and failing on) just fine, so it seems to be a Solr issue. There are only about 35K documents total, so Solr shouldn't even blink. Can anyone help me diagnose this problem? I'd be happy to provide any more detail that is needed. Thanks - Tod
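For reference, the request pattern being described - pointing Solr Cell at a remote document via stream.url and committing every so often - looks roughly like the following SolrJ sketch. The host, field names, document URLs, and commit interval are assumptions for illustration (the poster is actually driving this from Perl), and exact SolrJ behavior varies by version.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamUrlBatch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed host/port
        String[] urls = {
            "http://cms.example.com/doc1.pdf",   // placeholder document URLs
            "http://cms.example.com/doc2.doc"
        };
        int count = 0;
        for (String url : urls) {
            UpdateRequest req = new UpdateRequest("/update/extract"); // Solr Cell handler
            ModifiableSolrParams p = new ModifiableSolrParams();
            p.set("stream.url", url);                            // Solr fetches the document itself
            p.set("literal.content_id", String.valueOf(count));  // hypothetical unique key field
            req.setParams(p);
            solr.request(req);
            if (++count % 100 == 0) {
                solr.commit();   // commit every 100 docs, as in the workflow described above
            }
        }
        solr.commit();
    }
}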
Re: Please help - Solr Cell using 'stream.url'
On 10/07/2011 6:21 PM, wrote: Hi, What Solr version? Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42. It's running on a Suse Linux VM. How often do you do commits, or do you use autocommit? I had been doing commits every 100 documents (the entire set is about 35K docs, so it's relatively small). Since that wasn't working, and I read that commits are expensive, I decided to experiment and wait until all documents were indexed before committing. I haven't been able to successfully index all the documents yet to try the manual commit because of this problem. What kind and size of docs? Mostly MS Office and PDFs, some straight HTML pages. I can't give a specific answer to size but nothing alarmingly large - typical 2-5 page office documents. Do you feed from a Java program? Where is the read timeout occurring? Can you paste in some logs? I'd love to but I could never get it to work. I'm using Perl right now, getting rows from an Oracle database and using LWP to perform the calls to Solr's REST interface. How much RAM on your server, and how much did you give to the JVM? RAM to JVM: export CATALINA_OPTIONS="-Xms1024m -Xmx3072m" Top output on the VM: cpu(s): 64.1%us, 11.4%sy, 0.0%ni, 24.0%id, 0.2%wa, 0.2%hi, 0.2%si, 0.0%st mem: 3980384k total, 3803300k used, 177084k free, 393924k buffers swap: 4194296k total, 512k used, 4193784k free, 1518156k cached pid user pr ni virt res shr s %cpu %mem time+command 16243 solr 19 0 642m 322m 6256 s 119 8.3 73:16.49 java Thanks.
Re: Please help - Solr Cell using 'stream.url'
On 10/10/2011 3:39 PM, � wrote: Hi, If you have 4Gb on your server total, try giving about 1Gb to Solr, leaving 3Gb for OS, OS caching and mem-allocation outside the JVM. Also, add 'ulimit -v unlimited' and 'ulimit -s 10240' to /etc/profile to increase virtual memory and stack limit. I will try this - thanks. And you should also consider upgrading to latest Solr... Is there a clearly defined migration path? - Tod
Instructions for Multiple Server Webapps Configuring with JNDI
I'm following the instructions here: http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat ...under the heading "Multiple Solr Webapps". I have configured the context fragment as instructed, placed the apache-solr-3.4.0.war in the directory pointed to by the docBase variable, and modified the solr/home accordingly. I have an empty directory under tomcat/webapps named after the solr home directory in the context fragment. The context fragment contains:

<Context docBase="..." crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr/solr0" override="true"/>
</Context>

An empty /tomcat/webapps/solr0 directory exists. I expected to fire up Tomcat and have it unpack the war file contents into the solr home directory specified in the context fragment, but it's empty, as is the webapps directory. What am I doing wrong? I'm running Apache Tomcat/6.0.29. TIA - Tod
Re: Instructions for Multiple Server Webapps Configuring with JNDI
On 10/14/2011 2:44 PM, Chris Hostetter wrote: : modified the solr/home accordingly. I have an empty directory under : tomcat/webapps named after the solr home directory in the context fragment. if that empty directory has the same base name as your context fragment (ie: "tomcat/webapps/solr0" and "solr0.xml") that may give you problems ... the entire point of using context fragment files is to define webapps independently of a simple directory based hierarchy in tomcat/webapps ... if you have a directory there with the same name you create a conflict -- which webapp should it use, the empty one, or the one specified by your context file? Looks like that was the problem; once I removed the ./webapps/solr0 directory and started Tomcat back up it was recreated correctly. : I expected to fire up tomcat and have it unpack the war file contents into the : solr home directory specified in the context fragment, but it's empty, as is : the webapps directory. that's not what the "solr/home" env variable is for at all. tomcat will put the unpacked war wherever it needs/wants to (in theory it could just load it in memory) ... the point of the solr/home env variable is for you to tell the solr.war where to find the configuration files for this context. Sorry, my mistake. I wasn't referring to "solr/home", I was referring literally to the new solr home under tomcat - in this instance ./webapps/solr0. One more question: is there a particular advantage of multiple solr instances vs. multiple solr cores? Thanks.
java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log
I'm working on upgrading to Solr 3.4.0 and am seeing this error in my tomcat log. I'm using the following slf4j jars: slf4j-api-1.6.1.jar slf4j-jdk14-1.6.1.jar Has anybody run into this? I can reproduce it by doing curl calls to the Solr ExtractingRequestHandler at /solr/update/extract. TIA - Tod
can solr follow and index hyperlinks embedded in rich text documents (pdf, doc, etc)?
I have a feeling the answer is "no" since you wouldn't want to start indexing a large volume of office documents containing hyperlinks that could lead all over the internet. But, since there might be a use case like "a customer just asked me if it could be done?", I thought I would make sure. Thanks - Tod
Re: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log
On 10/19/2011 2:58 PM, wrote: Hi Tod, I had a similar issue with slf4j, but it was NoClassDefFound. Do you have some other dependencies in your application that use some other version of slf4j? You can use mvn dependency:tree to get all dependencies in your application. Or maybe there's some other version already in your tomcat or application server. /Tim I had to start over from scratch, but I believe that's exactly what it was. Things are working now. Thanks.
Batch indexing documents using ContentStreamUpdateRequest
This is a code fragment of how I am doing a ContentStreamUpdateRequest using CommonsHttpSolrServer:

ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest("/update/extract");
ContentStreamBase.URLStream csbu = new ContentStreamBase.URLStream(url);
InputStream is = csbu.getStream();
FastInputStream fis = new FastInputStream(is);
csur.addContentStream(csbu);
csur.setParam("literal.content_id", "00");
csur.setParam("literal.contentitle", "This is a test");
csur.setParam("literal.title", "This is a test");
server.request(csur);
server.commit();
fis.close();

This works fine for one document (a pdf in this case). When I surround this with a while loop and try adding multiple documents I get:

org.apache.solr.client.solrj.SolrServerException: java.io.IOException: stream is closed

I've tried commenting out the fis.close(), and also using just a plain InputStream with and without a .close() call - neither works. Is there a way to do this that I'm missing? Thanks - Tod
Re: Batch indexing documents using ContentStreamUpdateRequest
Answering my own question. ContentStreamUpdateRequest (csur) needs to be within the while loop not outside as I had it. Still not seeing any dramatic performance improvements over perl though (the point of this exercise). Indexing locks after about 30-45 minutes of activity, even a commit won't budge it. On 11/04/2011 12:36 PM, Tod wrote: This is a code fragment of how I am doing a ContentStreamUpdateRequest using CommonHTTPSolrServer: ContentStreamBase.URLStream csbu = new ContentStreamBase.URLStream(url); InputStream is = csbu.getStream(); FastInputStream fis = new FastInputStream(is); csur.addContentStream(csbu); csur.setParam("literal.content_id","00"); csur.setParam("literal.contentitle","This is a test"); csur.setParam("literal.title","This is a test"); server.request(csur); server.commit(); fis.close(); This works fine for one document (a pdf in this case). When I surround this with a while loop and try adding multiple documents I get: org.apache.solr.client.solrj.SolrServerException: java.io.IOException: stream is closed I've tried commenting out the fis.close, and also using just a plain InputStream with and without a .close() call - neither work. Is there a way to do this that I'm missing? Thanks - Tod
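A minimal sketch of the corrected loop, with a new ContentStreamUpdateRequest (and a new stream) created per document as described above; the Solr URL, document URLs, and literal field names are placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class BatchExtract {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        String[] urls = { "http://example.com/a.pdf", "http://example.com/b.pdf" };  // placeholders
        int id = 0;
        for (String url : urls) {
            // New request object for every document; reusing one request across
            // iterations is what produced the "stream is closed" error.
            ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest("/update/extract");
            csur.addContentStream(new ContentStreamBase.URLStream(new java.net.URL(url)));
            csur.setParam("literal.content_id", String.valueOf(id++)); // hypothetical unique key
            server.request(csur);
        }
        server.commit();
    }
}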
Help! - ContentStreamUpdateRequest
Could someone take a look at this page: http://wiki.apache.org/solr/ContentStreamUpdateRequestExample ... and tell me what code changes I would need to make to be able to stream a LOT of files at once rather than just one? It has to be something simple like a collection of some sort but I just can't get it figured out. Maybe I'm using the wrong class altogether? TIA
Re: Help! - ContentStreamUpdateRequest
Otis, The files are only part of the payload. The supporting metadata exists in a database. I'm pulling that information, as well as the name and location of the file, from the database and then sending it to a remote Solr instance to be indexed. I've heard Solr would prefer to get documents it needs to index in chunks rather than one at a time as I'm doing now. The one at a time approach is locking up the Solr server at around 700 entries. My thought was if I could chunk them in a batch at a time the lockup will stop and indexing performance would improve. Thanks - Tod On 11/15/2011 12:13 PM, Otis Gospodnetic wrote: Hi, How about just concatenating your files into one? Would that work for you? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ ____ From: Tod To: solr-user@lucene.apache.org Sent: Monday, November 14, 2011 4:24 PM Subject: Help! - ContentStreamUpdateRequest Could someone take a look at this page: http://wiki.apache.org/solr/ContentStreamUpdateRequestExample ... and tell me what code changes I would need to make to be able to stream a LOT of files at once rather than just one? It has to be something simple like a collection of some sort but I just can't get it figured out. Maybe I'm using the wrong class altogether? TIA
Re: Help! - ContentStreamUpdateRequest
Erick, Autocommit is commented out in solrconfig.xml. I have avoided commits until after the indexing process is complete. As an experiment I tried committing every n records processed to see if varying n would make a difference; it really didn't change much. My original use case had the client running from the Solr server and streaming the document content over from a web server based on the URL gathered by a query from a backend database. The locking problem appeared there first so I tried moving the client code to the web server to be closer to the documents' origin. That helped a little but ended up locking, which is where I am now. Solr should be able to index way more documents than the 35K I'm trying to index. It seems from others' accounts they are able to do what I'm trying to do successfully. Therefore I believe I must be doing something extraordinarily dumb. I'll be happy to share any information about my environment or configuration if it will help find my error. Thanks for all of your help. - Tod On 11/15/2011 8:08 PM, Erick Erickson wrote: That's odd. What are your autocommit parameters? And are you either committing or optimizing as part of your program? I'd bump the autocommit parameters up and NOT commit (or optimize) from your client if you are Best Erick On Tue, Nov 15, 2011 at 2:17 PM, Tod wrote: Otis, The files are only part of the payload. The supporting metadata exists in a database. I'm pulling that information, as well as the name and location of the file, from the database and then sending it to a remote Solr instance to be indexed. I've heard Solr would prefer to get documents it needs to index in chunks rather than one at a time as I'm doing now. The one at a time approach is locking up the Solr server at around 700 entries. My thought was if I could chunk them in a batch at a time the lockup will stop and indexing performance would improve. Thanks - Tod On 11/15/2011 12:13 PM, Otis Gospodnetic wrote: Hi, How about just concatenating your files into one? Would that work for you? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Tod To: solr-user@lucene.apache.org Sent: Monday, November 14, 2011 4:24 PM Subject: Help! - ContentStreamUpdateRequest Could someone take a look at this page: http://wiki.apache.org/solr/ContentStreamUpdateRequestExample ... and tell me what code changes I would need to make to be able to stream a LOT of files at once rather than just one? It has to be something simple like a collection of some sort but I just can't get it figured out. Maybe I'm using the wrong class altogether? TIA
Indexing Using XML Message
I have a local data store containing a host of different document types. This data store is separate from a remote Solr install, making streaming not an option. Instead I'd like to generate an XML file that contains all of the documents, including content and metadata. What would be the most appropriate way to accomplish this? I could use the Tika CLI to generate XML, but I'm not sure it would work or that it's the most efficient way to handle things. Can anyone offer some suggestions? Thanks - Tod
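One way this could be done, sketched under assumptions (files are local to the client, Tika runs client-side, and SolrJ's ClientUtils serializes the update XML); the field names are illustrative, not from the original post:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class BuildUpdateXml {
    public static void main(String[] args) throws Exception {
        StringBuilder xml = new StringBuilder("<add>");
        for (String path : args) {                               // local documents to include
            InputStream in = new FileInputStream(new File(path));
            BodyContentHandler text = new BodyContentHandler(-1); // -1 = no size limit on extracted text
            Metadata meta = new Metadata();
            new AutoDetectParser().parse(in, text, meta, new ParseContext());
            in.close();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", path);                            // hypothetical unique key
            doc.addField("title", meta.get(Metadata.TITLE));     // document metadata from Tika
            doc.addField("content", text.toString());            // extracted body text
            xml.append(ClientUtils.toXML(doc));                  // one <doc> element per file
        }
        xml.append("</add>");
        System.out.println(xml);                                 // this XML can then be posted to /update
    }
}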
Data Import Handler Rich Format Documents
I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document, which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc. I'm already indexing the Metadata and that provides a lot of value. The customer however would like the content pointed to by the URL to also be indexed for more discrete searching. This article at Lucid: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too. What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might be ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr. Thanks in advance. - Tod
Re: Data Import Handler Rich Format Documents
On 6/18/2010 9:12 AM, Otis Gospodnetic wrote: Tod, You didn't mention Tika, which makes me think you are not aware of it... You could implement a custom Transformer that uses Tika to perform rich doc text extraction, just like ExtractingRequestHandler does it (see http://wiki.apache.org/solr/ExtractingRequestHandler ). Maybe you could even just call ERH from your Transformer, though that wouldn't be the most efficient. You're right, sorry. I have looked at Tika, which I believe is used by Nutch too - no? Implementing a transformer is fine. I guess I'm being lazy and trying to see if a method of doing this has been incorporated into the latest Solr release so I can avoid coding for it. - Original Message From: Tod To: solr-user@lucene.apache.org Sent: Fri, June 18, 2010 8:51:02 AM Subject: Data Import Handler Rich Format Documents I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc. I'm already indexing the Metadata and that provides a lot of value. The customer however would like that the content pointed to by the URL also be indexed for more discrete searching. This article at Lucid: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too. What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might be ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr. Thanks in advance. - Tod
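A rough sketch of the kind of custom DIH Transformer being discussed here; it assumes the row carries a URL column named CONTENT_URL and writes the extracted text into a hypothetical 'content' field (the class would still need to be registered on the DIH entity via its transformer attribute):

import java.io.InputStream;
import java.net.URL;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaUrlTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object url = row.get("CONTENT_URL");             // column name is an assumption
        if (url != null) {
            try {
                InputStream in = new URL(url.toString()).openStream();
                BodyContentHandler text = new BodyContentHandler(-1);
                new AutoDetectParser().parse(in, text, new Metadata(), new ParseContext());
                in.close();
                row.put("content", text.toString());     // hypothetical target field
            } catch (Exception e) {
                // Skip documents Tika can't fetch or parse rather than failing the whole import.
            }
        }
        return row;
    }
}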
Re: Data Import Handler Rich Format Documents
On 6/18/2010 11:24 AM, Otis Gospodnetic wrote: Tod, I don't think DIH can do that, but who knows, let's see what others say. Yes, Nutch uses Tika, too. Otis Looks like the ExtractingRequestHandler uses Tika as well. I might just use this, but I'm wondering if there will be a large performance difference between using it to batch content in and rolling my own Transformer? - Tod
Re: Data Import Handler Rich Format Documents
On 6/18/2010 2:42 PM, Chris Hostetter wrote: : > I don't think DIH can do that, but who knows, let's see what others say. : Looks like the ExtractingRequestHandler uses Tika as well. I might just use : this but I'm wondering if there will be a large performance difference between : using it to batch content in over rolling my own Transformer? I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika? Why would you need a custom Transformer? I started this thread after reading a Lucid article suggesting a custom Transformer might be the way to go when using DIH. My initial question was if there was an alternative. My database contains only Metadata and a reference to the actual content (HTML,Office Documents, PDF...) as a URL - not blobs as the Lucid article focused on. What I would like to do is use DIH somehow to index the Metadata but also the actual content pointed to by the URL column. I might be able to do this instead with Nutch (who uses Tika), haven't thoroughly researched this yet, or I can write a job to pull all the URL's out of the database and utilize cURL and the Solr ExtractingRequestHandler to push everything into the index. I just wanted to see what everybody else is doing and what my other options might be. Thanks - Tod Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
Re: Data Import Handler Rich Format Documents
On 6/18/2010 2:42 PM, Chris Hostetter wrote: : > I don't think DIH can do that, but who knows, let's see what others say. : Looks like the ExtractingRequestHandler uses Tika as well. I might just use : this but I'm wondering if there will be a large performance difference between : using it to batch content in over rolling my own Transformer? I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika? Why would you need a custom Transformer? http://wiki.apache.org/solr/DataImportHandler#Tika_Integration http://wiki.apache.org/solr/TikaEntityProcessor -Hoss Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource curl -s http://test.html|curl http://localhost:9080/solr/update/extract?extractOnly=true --data-binary @- -H 'Content-type:text/html' ... works fine so presumably my Tika processor is working. My data-config.xml looks like this:

<entity name="my_database_url" query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
  <entity processor="TikaEntityProcessor" url="http://www.mysite.com/${my_database.content_url}" ... />
</entity>

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url. Is there anything obviously wrong with what I've tried so far? Thanks - Tod
Indexing Rich Format Documents using Data Import Handler (DIH) and the TikaEntityProcessor
Please refer to this thread for history: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201006.mbox/%3c4c1b6bb6.7010...@gmail.com%3e I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource curl -s http://test.html|curl http://localhost:9080/solr/update/extract?extractOnly=true --data-binary @- -H 'Content-type:text/html' ... works fine so presumably my Tika processor is working. My data-config.xml looks like this:

<entity name="my_database_url" query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
  <entity processor="TikaEntityProcessor" url="http://www.mysite.com/${my_database.content_url}" ... />
</entity>

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url. Is there anything obviously wrong with what I've tried so far? It keeps rolling back with the error above. Thanks - Tod
Re: Data Import Handler Rich Format Documents
On 6/28/2010 8:28 AM, Alexey Serba wrote: Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource It seems that DIH-Tika integration is not a part of the Solr 1.4.0/1.4.1 release. You should use trunk / nightly builds. https://issues.apache.org/jira/browse/SOLR-1583 Thanks, that would explain things - I'm using a stock 1.4.0 download. My data-config.xml looks like this:

<entity name="my_database_url" query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
  <entity processor="TikaEntityProcessor" url="http://www.mysite.com/${my_database.content_url}" ... />
</entity>

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url. Is there anything obviously wrong with what I've tried so far? I think you should move the Tika entity into the my_database entity and simplify the whole configuration, with the Tika entity reading url="http://www.mysite.com/${my_database.content_url}" directly inside my_database. This, I guess, would be after I checked out and built from trunk? Thanks - Tod
Supplementing already indexed data
I'm getting metadata from a RDB but the actual content is stored somewhere else. I'd like to index the content too but I don't want to overlay the already indexed metadata. I know this can be done but I just can't seem to dig up the correct docs, can anyone point me in the right direction? Thanks.
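One common workaround, sketched here only as an illustration and assuming every field you care about is stored: read-modify-write - fetch the existing document, copy its stored fields, add the new content, and re-add it under the same unique key (field names and the Solr URL are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SupplementDoc {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        String id = "1234";                                                        // hypothetical key

        SolrDocument existing = solr.query(new SolrQuery("id:" + id)).getResults().get(0);
        SolrInputDocument updated = new SolrInputDocument();
        for (String field : existing.getFieldNames()) {
            updated.addField(field, existing.getFieldValues(field)); // only stored fields survive this copy
        }
        updated.setField("content", "text extracted from the external store"); // the supplement
        solr.add(updated);     // same unique key, so this replaces the old document
        solr.commit();
    }
}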
Solrj ContentStreamUpdateRequest Slow
I'm running a slight variation of the example code referenced below and it takes a real long time to finally execute. In fact it hangs for a long time at solr.request(up) before finally executing. Is there anything I can look at or tweak to improve performance? I am also indexing a local pdf file, there are no firewall issues, solr is running on the same machine, and I tried the actual host name in addition to localhost but nothing helps. Thanks - Tod http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
Re: Solrj ContentStreamUpdateRequest Slow
On 8/4/2010 11:11 PM, jayendra patil wrote: ContentStreamUpdateRequest seems to read the file contents and transfer it over http, which slows down the indexing. Try Using StreamingUpdateSolrServer with stream.file param @ http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post e.g.

SolrServer server = new StreamingUpdateSolrServer("Solr Server URL", 20, 8);
UpdateRequest req = new UpdateRequest("/update/extract");
ModifiableSolrParams params = new ModifiableSolrParams();
params.add("stream.file", new String[]{"local file path"});
params.set("literal.id", value);
req.setParams(params);
server.request(req);
server.commit();

Thanks for your suggestions. Unfortunately, I'm still seeing poor performance. To be clear, I am trying to have Solr index multiple documents that exist on a remote server. I'd prefer that Solr stream the documents after I pass a pointer to them rather than me retrieving and pushing them, so I can avoid network overhead. When I do this:

curl 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'

It returns in around a second. When I execute the attached code it takes just over three minutes. The optimal for me would be to get closer to the performance I'm seeing with curl using Solrj. To be fair the Solr server I am using is really a workstation class machine, plus I am still learning. I have a feeling I'm doing something dumb but just can't seem to pinpoint the exact problem. Thanks - Tod

code---

import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

/**
 * @author EDaniel
 */
public class SolrExampleTests {

  public static void main(String[] args) {
    System.out.println("main...");
    try {
      // String fileName = "/test/test.pdf";
      String fileName = "http://remoteserver/test/test.pdf";
      String solrId = "1234";
      indexFilesSolrCell(fileName, solrId);
    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  /**
   * Method to index all types of files into Solr.
   * @param fileName
   * @param solrId
   * @throws IOException
   * @throws SolrServerException
   */
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {
    System.out.println("indexFilesSolrCell...");
    String urlString = "http://localhost:8080/solr";
    System.out.println("getting connection...");
    //SolrServer solr = new CommonsHttpSolrServer(urlString);
    SolrServer solr = new StreamingUpdateSolrServer(urlString, 100, 5);
    System.out.println("getting updaterequest handle...");
    //ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    UpdateRequest up = new UpdateRequest("/update/extract");
    ModifiableSolrParams params = new ModifiableSolrParams();
    //params.add("stream.file", fileName);
    params.add("stream.url", fileName);
    params.set("literal.content_id", solrId);
    up.setParams(params);
    System.out.println("making request...");
    solr.request(up);
    System.out.println("committing...");
    solr.commit();
    System.out.println("done...");
  }
}
Re: Solrj ContentStreamUpdateRequest Slow
On 8/12/2010 8:02 PM, Chris Hostetter wrote: : It returns in around a second. When I execute the attached code it takes just : over three minutes. The optimal for me would be able get closer to the : performance I'm seeing with curl using Solrj. I think your problem may be that StreamingUpdateSolrServer buffers up commands and sends them in batches in a background thread. if you want to send individual updates in real time (and time them) you should just use CommonsHttpSolrServer -Hoss My goal is to batch updates. My content lives somewhere else so I was trying to find a way to tell Solr where the document lived so it could go out and stream it into the index for me. That's where I thought StreamingUpdateSolrServer would help. - Tod
Re: Solrj ContentStreamUpdateRequest Slow
On 8/16/2010 6:12 PM, Chris Hostetter wrote: : > I think your problem may be that StreamingUpdateSolrServer buffers up : > commands and sends them in batches in a background thread. if you want to : > send individual updates in real time (and time them) you should just use : > CommonsHttpSolrServer : : My goal is to batch updates. My content lives somewhere else so I was trying : to find a way to tell Solr where the document lived so it could go out and : stream it into the index for me. That's where I thought : StreamingUpdateSolrServer would help. If your content lives on a machine which is not your "client" nor your "server" and you want your client to tell your server to go fetch it directly then the "stream.url" param is what you need -- that is unrelated to whether you use StreamingUpdateSolrServer or not. Do you happen to have a code fragment laying around that demonstrates using CommonsHttpSolrServer and "stream.url"? I've tried it in conjunction with ContentStreamUpdateRequest and I keep getting an annoying null pointer exception. In the meantime I will check the examples... Thinking about it some more, i suspect the reason you might be seeing a delay when using StreamingUpdateSolrServer is because of this bug... https://issues.apache.org/jira/browse/SOLR-1990 ...if there are no actual documents in your UpdateRequest (because you are using the stream.url param) then the StreamingUpdateSolrServer blocks until all other requests are done, then delegates to the super class (so it never actually puts your indexing requests in a buffered queue, it just delays and then does them immediately) Not sure of a good way around this off the top of my head, but i'll note it in SOLR-1990 as another problematic use case that needs dealt with. Perhaps I can execute an initial update request using a benign file before making the "stream.url" call? Also, to beat a dead horse, this: 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true' ... works fine - I just want to do it a LOT and as efficiently as possible. If I have to I can wrap it in a perl script and run a cURL or LWP loop but I'd prefer to use SolrJ if I can. Thanks for all your help. - Tod
Re: Solrj ContentStreamUpdateRequest Slow
On 8/19/2010 1:45 AM, Lance Norskog wrote: 'stream.url' is just a simple parameter. You should be able to just add it directly. I agree (code excluding imports):

public class CommonTest {

  public static void main(String[] args) {
    System.out.println("main...");
    try {
      String fileName = "http://remoteserver/test/test.pdf";
      String solrId = "1234";
      indexFilesSolrCell(fileName, solrId);
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }

  /**
   * Method to index all types of files into Solr.
   * @param fileName
   * @param solrId
   * @throws IOException
   * @throws SolrServerException
   */
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {
    System.out.println("indexFilesSolrCell...");
    String urlString = "http://localhost:9080/solr";
    System.out.println("getting connection...");
    SolrServer solr = new CommonsHttpSolrServer(urlString);
    System.out.println("getting updaterequest handle...");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    System.out.println("setting params...");
    req.setParam("stream.url", fileName);
    req.setParam("literal.content_id", solrId);
    System.out.println("making request...");
    solr.request(req);
    System.out.println("committing...");
    solr.commit();
    System.out.println("done...");
  }
}

At "making request" I get:

java.lang.NullPointerException
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:381)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
  at CommonTest.indexFilesSolrCell(CommonTest.java:59)
  at CommonTest.main(CommonTest.java:26)

... which is pointing to the solr.request(req) line. Thanks - Tod
Re: Data Import Handler Rich Format Documents
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote: Hi, I have exactly the same problem as the one you submitted in this link http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html and I would like to ask you if you got a solution for it. I started to have a look at Tika and DataImportHandler but I haven't managed to find the right way of writing the syntax. So can you please give an example if you succeeded in finding the right syntax. Thanks. Bumping this to the list... Unfortunately I could never get DIH to work correctly. My suspicion is that I was using a stock 1.4.0 Solr but attempting to perform a task that was only available on the latest build. My customer requirements demand a pretty well vetted GA release so experimenting was not an option. I attempted an upgrade (quickly, sloppily) to 1.4.1 but no luck. I believe the next GA release might be my solution. I tried getting around that bump by trying SolrJ ContentStreamUpdateRequest @ http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927. After floundering for a while I decided to put that on hold. I ended up writing a Perl script that emulates the command line cURL that I referenced in the above thread. It took about 72 hours to index ~850,000 entries (if anyone is interested). I plan on looping back to try the suggestions Hoss last made, just haven't had the time to respond. I'm sure things will work; I just needed something quickly and don't have the seasoned experience the other developers do. - Tod
UpdateXmlMessage
I can do this using GET: http://localhost:8983/solr/update?stream.body=%3Cdelete%3E%3Cquery%3Eoffice:Bridgewater%3C/query%3E%3C/delete%3E http://localhost:8983/solr/update?stream.body=%3Ccommit/%3E ... but can I pass a stream.url parameter using an UpdateXmlMessage? I looked at the schema and I think the answer is no but just wanted to check. TIA
Re: UpdateXmlMessage
On 10/1/2010 11:33 PM, Lance Norskog wrote: Yes. stream.file and stream.url are independent of the request handler. They do their magic at the very top level of the request. However, there are no unit tests for these features, but they are widely used. Sorry Lance, are you agreeing that I can't or that I can? If I can, I'm doing something wrong. I'm specifying stream.url as its own field in the XML like:

<add>
  <doc>
    <field name="author">I am the author</field>
    <field name="title">I am the title</field>
    <field name="stream.url">http://www.test.com/myOfficeDoc.doc</field>
    . . .
  </doc>
</add>

The wiki docs were a little sparse on this one. - Tod Tod wrote: I can do this using GET: http://localhost:8983/solr/update?stream.body=%3Cdelete%3E%3Cquery%3Eoffice:Bridgewater%3C/query%3E%3C/delete%3E http://localhost:8983/solr/update?stream.body=%3Ccommit/%3E ... but can I pass a stream.url parameter using an UpdateXmlMessage? I looked at the schema and I think the answer is no but just wanted to check. TIA
Overriding Tika's field processing
I'm reading my document data from a CMS and indexing it using calls to curl. The curl call includes 'stream.url' so Tika will also index the actual document pointed to by the CMS' stored url. This works fine. Presentation side I have a dropdown with the title of all the indexed documents such that when a user clicks one of them it opens in a new window. Using js, I've been parsing the json returned from Solr to create the dropdown. The problem is I can't get the titles sorted alphabetically. If I use a facet.sort on the title field I get back ALL the sorted titles in the facet block, but that doesn't include the associated URL's. A sorted query won't work because title is a multivalued field. The one option I can think of is to make the title single valued so that I have a one to one relationship to the returned url. To do that I'd need to be able to *not* index the Tika returned values. If I read right, my understanding was that I could use 'literal.title' in the curl call to limit what would be included in the index from Tika. That doesn't seem to be working as a test facet query returns more than I have in the CMS. Am I understanding the 'literal.title' processing correctly? Does anybody have experience/suggestions on how to handle this? Thanks - Tod
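If title were made single-valued as described, one way to get sorted title/URL pairs is a plain sorted query rather than a facet. A sketch, with the Solr URL and field names assumed rather than taken from the actual schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SortedTitles {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        SolrQuery q = new SolrQuery("*:*");
        q.setFields("title", "url");                      // hypothetical field names
        q.setSortField("title", SolrQuery.ORDER.asc);     // sorting only works on a single-valued field
        q.setRows(1000);
        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            // Each title comes back with its associated URL, already alphabetized.
            System.out.println(doc.getFieldValue("title") + " -> " + doc.getFieldValue("url"));
        }
    }
}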
Facet count of zero
I'm trying to exclude certain facet results from a facet query. It seems to work, but rather than being excluded from the facet list it's returned with a count of zero. Ex: q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true This returns bar with a count of zero. All the other foo's show up with valid counts. Can I do this? Is my syntax incorrect? Thanks - Tod
Re: Facet count of zero
On 11/1/2010 1:03 PM, Yonik Seeley wrote: On Mon, Nov 1, 2010 at 12:55 PM, Tod wrote: I'm trying to exclude certain facet results from a facet query. It seems to work but rather than being excluded from the facet list it's returned with a count of zero. If you don't want to see 0 counts, use facet.mincount=1 http://wiki.apache.org/solr/SimpleFacetParameters -Yonik http://www.lucidimagination.co Ex: q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true This returns bar with a count of zero. All the other foo's show up with valid counts. Can I do this? Is my syntax incorrect? Thanks - Tod Excellent, I completely missed it - thanks!
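The SolrJ equivalent of Yonik's suggestion, for anyone doing this from Java; the query, field name, and Solr URL are just placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class FacetMinCount {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        SolrQuery q = new SolrQuery("-foo:bar");   // exclude bar from the result set
        q.setFacet(true);
        q.addFacetField("foo");
        q.setFacetMinCount(1);                     // drops the zero-count "bar" bucket
        q.set("facet.sort", "index");              // same as facet.sort=index
        for (FacetField.Count c : solr.query(q).getFacetField("foo").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }
}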
Phrase Query Problem?
I have a number of fields I need to do an exact match on. I've defined them as 'string' in my schema.xml. I've noticed that I get back query results that don't have all of the words I'm using to search with. For example: q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json With an exact match this should return only one entry, but it returns five, some of which don't have any of the fields I've specified. I've tried this both with and without quotes. What could I be doing wrong? Thanks - Tod
Re: Phrase Query Problem?
On 11/1/2010 11:14 PM, Ken Stanley wrote: On Mon, Nov 1, 2010 at 10:26 PM, Tod wrote: I have a number of fields I need to do an exact match on. I've defined them as 'string' in my schema.xml. I've noticed that I get back query results that don't have all of the words I'm using to search with. For example: q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json Should, with an exact match, return only one entry but it returns five some of which don't have any of the fields I've specified. I've tried this both with and without quotes. What could I be doing wrong? Thanks - Tod Tod, Without knowing your exact field definition, my first guess would be your first boolean query; because it is not quoted, what SOLR typically does is to transform that type of query into something like (assuming your uniqueKey is "id"): (mykeywords:Compliance id:With id:Conduct id:Standards). If you do (mykeywords:"Compliance+With+Conduct+Standards") you might see different (better?) results. Otherwise, append &debugQuery=on to your URL and you can see exactly how SOLR is parsing your query. If none of that helps, what is your field definition in your schema.xml? - Ken The field definition is:

<field name="mykeywords" type="string" multiValued="true"/>

The request:

select?q=(((mykeywords:"Compliance+With+Attorney+Conduct+Standards")OR(mykeywords:All)OR(mykeywords:ALL)))&fl=mykeywords&start=0&indent=true&wt=json&debugQuery=on

The response looks like this:

"responseHeader":{
  "status":0,
  "QTime":8,
  "params":{
    "wt":"json",
    "q":"(((mykeywords:Compliance With Attorney Conduct Standards)OR(mykeywords:All)OR(mykeywords:ALL)))",
    "start":"0",
    "indent":"true",
    "fl":"mykeywords",
    "debugQuery":"on"}},
"response":{"numFound":6,"start":0,"docs":[
  { "mykeywords":["Compliance With Attorney Conduct Standards"]},
  { "mykeywords":["Anti-Bribery","Bribes"]},
  { "mykeywords":["Marketing Guidelines","Marketing"]},
  {},
  { "mykeywords":["Anti-Bribery","Due Diligence"]},
  { "mykeywords":["Anti-Bribery","AntiBribery"]}]
},
"debug":{
  "rawquerystring":"(((mykeywords:Compliance With Attorney Conduct Standards)OR(mykeywords:All)OR(mykeywords:ALL)))",
  "querystring":"(((mykeywords:Compliance With Attorney Conduct Standards)OR(mykeywords:All)OR(mykeywords:ALL)))",
  "parsedquery":"(mykeywords:Compliance text:attorney text:conduct text:standard) mykeywords:All mykeywords:ALL",
  "parsedquery_toString":"(mykeywords:Compliance text:attorney text:conduct text:standard) mykeywords:All mykeywords:ALL",
  "explain":{ ...

As you mentioned, looking at the parsed query it's breaking the request up on word boundaries rather than on the entire phrase. The goal is to return only the very first entry. Any ideas? Thanks - Tod
Re: Phrase Query Problem?
On 11/2/2010 9:21 AM, Ken Stanley wrote: On Tue, Nov 2, 2010 at 8:19 AM, Erick Erickson wrote: That's not the response I get when I try your query, so I suspect something's not quite right with your test... But you could also try putting parentheses around the words, like mykeywords:(Compliance+With+Conduct+Standards) Best Erick I agree with Erick, your query string showed quotes, but your parsed query did not. Using quotes, or parenthesis, would pretty much leave your query alone. There is one exception that I've found: if you use a stopword analyzer, any stop words would be converted to ? in the parsed query. So if you absolutely need every single word to match, regardless, you cannot use a field type that uses the stop word analyzer. For example, I have two dynamic field definitions: df_text_* that does the default text transformations (including stop words), and df_text_exact_* that does nothing (field type is string). When I run the query df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America", the following is shown as my query/parsed query when debugQuery is on: df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America" df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America" df_text_exact_company_name:Bank of America PhraseQuery(df_text_company_name:"bank ? america") df_text_exact_company_name:Bank of America df_text_company_name:"bank ? america" The difference is subtle, but important. If I were to do df_text_company_name:"Bank and America", I would still match "Bank of America". These are things that you should keep in mind when you are creating fields for your indices. A useful tool for seeing what SOLR does to your query terms is the Analysis tool found in the admin panel. You can do an analysis on either a specific field, or by a field type, and you will see a breakdown by Analyzer for either the index, query, or both of any query that you put in. This would definitely be useful when trying to determine why SOLR might return what it does. - Ken What it turned out to be was escaping the spaces. q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL))) became q=(((mykeywords:Compliance\+With\+Conduct\+Standards)OR(mykeywords:All)OR(mykeywords:ALL))) If I tried q=(((mykeywords:"Compliance+With+Conduct+Standards")OR(mykeywords:All)OR(mykeywords:ALL))) ... it didn't work. Once I removed the quotes and escaped spaces it worked as expected. This seems odd since I would have expected the quotes to have triggered a phrase query. Thanks for your help. - Tod
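A tiny sketch of the escaping fix described above, building the query string before URL encoding; the field name is taken from the thread, everything else is illustrative:

public class EscapeExample {
    public static void main(String[] args) {
        String phrase = "Compliance With Conduct Standards";
        // Backslash-escape the spaces so the whole phrase reaches the string field as one term,
        // which is what the \+ escapes in the URL-encoded query above amount to.
        String escaped = phrase.replace(" ", "\\ ");
        String q = "mykeywords:" + escaped + " OR mykeywords:All OR mykeywords:ALL";
        System.out.println(q);
        // Newer SolrJ versions also ship ClientUtils.escapeQueryChars(...) as a general-purpose helper.
    }
}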
Chinese characters - a little OT
Sorry, OT but it's driving me nuts. I've indexed a document with chinese characters in its title. When I perform the search (that returns json) I get back the title and using Javascript place it into a variable that ultimately ends up as a dropdown of titles to choose from. The problem is the title contains the literal unicode representation of the chinese characters (中 for example). Here's the javascript:

var optionObj = document.createElement('option');
menuItem = titleArray[1].title;
menuVal = titleArray[1].url;
if ((menuItem != " ") && (menuItem != "") && (menuItem != null)) {
  optionObj.appendChild(document.createTextNode(menuItem));
  optionObj.setAttribute('id', "optId" + optCnt);
  optionObj.setAttribute('target', "_blank");
  optionObj.setAttribute('value', menuVal);
  optCnt++;
  selectObj.appendChild(optionObj);
}

My hunch is I should utf-8 encode the title and then try and display the result, but it's not working. I still am seeing the unicode characters. Does anyone see what I could be doing wrong? TIA - Tod
Re: Any Copy Field Caveats?
I've noticed that using camelCase in field names causes problems. On 11/5/2010 11:02 AM, Will Milspec wrote: Hi all, we're moving from an old lucene version to solr and plan to use the "Copy Field" functionality. Previously we had "rolled our own" implementation, sticking title, description, etc. in a field called 'content'. We lose some flexibility (i.e. java layer can no longer control what gets in the new copied field), at the expense of simplicity. A fair tradeoff IMO. My question: has anyone found any subtle issues or "gotchas" with copy fields? (from the subject line "caveat"--pronounced 'kah-VEY-AT' is Latin as in "Caveat Emptor"..."let the buyer beware"). thanks, will will
Retrieving indexed content containing multiple languages
My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out of the box Solr schema. The search results are presented via an AJAX enabled HTML page. When I perform a search the document title (for example) has a mix of english and chinese characters. Everything there is fine - I can see the english and chinese returned from a facet query on title. I can search against the title using english words it contains and I get back an expected result. I asked a chinese friend to perform the same search using chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language, I'll probably need to support more in the future. My thought is that the chinese characters are indexed as their unicode equivalent so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in english. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
Upgrading Tika "in place"
I'm running an older version of Solr - 3.4.0.2011.09.09.09.06.17. It seems the version of Tika that came with it has trouble with some PDF files and newer Office documents. I've checked the latest Tika release and it solves these problems. I'd like to just drop in the necessary Tika jars without needing to rebuild or upgrade Solr. Is that a possibility, and if so how would I go about accomplishing it? I see tika-core and tika-parsers in the 3.6.2 Solr build distro; are those the only two files I need? Thanks - Tod
Solr 3.6 parsing and extraction files
Could someone possibly provide me with a list of jars that I need to extract from the apache-solr-3.6.0.tgz file to enable the parsing and remote streaming of office style documents? I assume (for a multicore configuration) they would go into ./tomcat/webapps/solr/WEB-INF/lib - correct? Thanks - Tod
Re: Retrieving indexed content containing multiple languages
On 11/11/2010 3:24 PM, Dennis Gearon wrote: I look forward to the answers to this one. Well, it seems it was as easy as adding the CJKTokenizerFactory:

<fieldType ... positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

Once I did that and reindexed I could search for both english and chinese using the default 'text' field. The next hurdle was getting the javascript to cooperate. The chinese characters were getting corrupted on the way to the AJAX call against the Solr server. As it turned out I was performing a POST to Solr using the jQuery .ajax api call. Apparently when executing a POST you need to make sure the characters entered into the input field of the form are converted to unicode (\u7968 for example) prior to the AJAX call to Solr. Conversely, if executing a GET you need to convert the characters to UTF8 (%E7%A5%A8). So now my customers are happily finding the appropriate document using english and chinese. If someone could check my math I would appreciate it. If it looks reasonable and there is nothing else written about it on the wiki I'll create a tutorial to give everybody else a leg up. - Tod - Original Message From: Tod To: solr-user@lucene.apache.org Sent: Thu, November 11, 2010 11:35:23 AM Subject: Retrieving indexed content containing multiple languages My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out of the box Solr schema. The search results are presented via an AJAX enabled HTML page. When I perform a search the document title (for example) has a mix of english and chinese characters. Everything there is fine - I can see the english and chinese returned from a facet query on title. I can search against the title using english words it contains and I get back an expected result. I asked a chinese friend to perform the same search using chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language, I'll probably need to support more in the future. My thought is that the chinese characters are indexed as their unicode equivalent so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in english. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
Opensearch Format Support
Does Solr support the OpenSearch format? If so, could someone point me to the correct documentation? Thanks - Tod
Term Vector Query on Single Document
I have a couple of semi-related questions regarding the use of the Term Vector Component: - Using curl is there a way to query a specific document (maybe using Tika when required?) to get a distribution of the terms it contains? - When I set the termVector on a field do I need to reindex? I'm thinking 'yes' - How expensive is setting the termVector on a field? Thanks - Tod
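On the first question: assuming the example solrconfig's term vector request handler (TermVectorComponent, usually registered at /tvrh) is enabled and the field has termVectors enabled, a single document's term distribution can be requested roughly like this; the handler name, field, and id below are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TermVectorQuery {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        SolrQuery q = new SolrQuery("content_id:1234");   // limit the result to one document
        q.setQueryType("/tvrh");                          // term vector handler from the example config
        q.set("tv", true);
        q.set("tv.fl", "content");                        // field with termVectors enabled
        q.set("tv.tf", true);                             // ask for per-term frequencies
        System.out.println(solr.query(q).getResponse());  // term vectors come back in the raw response
    }
}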
Can ExtractingRequestHandler ignore documents metadata
I'm indexing content from a CMS' database of metadata. The client would prefer that Solr exclude the properties (metadata) of any documents being indexed. Is there a way to tell Tika to only index a document's text and not its properties? Thanks - Tod
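One documented knob that gets close to this: the extract handler's uprefix parameter can route every field Tika emits that isn't already in the schema to a prefix such as ignored_, and the example schema ships an ignored_* dynamicField that silently drops those values. A SolrJ sketch, with host, document URL, and literals as placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class IgnoreTikaMetadata {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr"); // assumed URL
        UpdateRequest req = new UpdateRequest("/update/extract");
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("stream.url", "http://cms.example.com/some.doc"); // placeholder document
        p.set("literal.content_id", "1234");  // metadata you do want comes in as literals from the CMS
        p.set("uprefix", "ignored_");         // unknown Tika-emitted fields map to ignored_* and are dropped
        req.setParams(p);
        solr.request(req);
        solr.commit();
    }
}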
Indexing Mediawiki
I have a need to index an internal instance of Mediawiki. I'd like to use DIH if I can since I have access to the database but the example provided on the Solr wiki uses a Mediawiki dump XML file. Does anyone have any experience using DIH in this manner? Am I barking up the wrong tree and would be better off dumping and indexing the wiki instead? Thanks - Tod
Tika JAX-RS and DIH
Mattmann, Chris A (388J jpl.nasa.gov> writes: > > Hi Jo, > > You may consider checking out Tika trunk, where we recently have a Tika JAX-RS web service [1] committed as > part of the tika-server module. You could probably wire DIH into it and accomplish the same thing. > > Cheers, > Chris > > [1] https://issues.apache.org/jira/browse/TIKA-593 Chris - could you elaborate on using Tika JAX-RS and DIH? How production ready is it? Could you summarize the steps necessary to get it to work? Any examples yet? I'd be happy to work with you to get something out to the group. Thanks - Tod
Default schema - 'keywords' not multivalued
I noticed that the 'keywords' field in the default schema isn't multivalued. This was a little curious to me, and I wondered what the thought process was behind it before I decide to change it. Thanks - Tod
Re: Default schema - 'keywords' not multivalued
On 06/27/2011 11:23 AM, lee carroll wrote: Hi Tod, A list of keywords would be fine in a non multi valued field: keywords : "xxx yyy sss aaa " A multi valued field would allow you to repeat the field when indexing: keywords: "xxx" keywords: "yyy" keywords: "sss" etc Thanks Lee. The problem is I'm manually pushing a document (via stream.url) and its metadata from a database with the Solr /update/extract REST service, HTTP GET, using Perl. I'm streaming over the document content (presumably via tika) and it's gathering the document's metadata, which includes the keywords metadata field. Since I'm also passing that field from the DB to the REST call as a list (as you suggested) there is a collision because the keywords field is single valued. I can change this behavior using a copy field. What I wanted to know is if there was a specific reason the default schema defined a field like keywords single valued, so I could make sure I wasn't missing something before I changed things. While I'm at it, I'd REALLY like to know how to use DIH to index the metadata from the database while simultaneously streaming over the document content and indexing it. I've never quite figured it out yet but I have to believe it is a possibility. - Tod
Re: Default schema - 'keywords' not multivalued
On 06/28/2011 12:04 PM, Chris Hostetter wrote: : I'm streaming over the document content (presumably via tika) and its : gathering the document's metadata which includes the keywords metadata field. : Since I'm also passing that field from the DB to the REST call as a list (as : you suggested) there is a collision because the keywords field is single : valued. : : I can change this behavior using a copy field. What I wanted to know is if : there was a specific reason the default schema defined a field like keywords : single valued so I could make sure I wasn't missing something before I changed : things. That file is just an example, you're absolutely free to change it to meet your use case. I'm not very familiar with Tika, but based on the comment in the example config... ...i suspect it was intentional that that field is *not* multiValued (i guess Tika always returns a single delimited value?) but if you have multiple discrete values you want to send for your DB backed data there is no downside to changing that. : While I'm at it, I'd REALLY like to know how to use DIH to index the metadata : from the database while simultaneously streaming over the document content and : indexing it. I've never quite figured it out yet but I have to believe it is : a possibility. There's a TikaEntityProcessor that can be used to have Tika crunch the data that comes from an "entity" and extract out specific fields, and it can be used in combination with a JdbcDataSource and a BinFileDataSource so that a field in your db data specifies the name of a file on disk to use as the TikaEntity -- but i've personally never tried it Here's a simple example someone posted last year that they got working... http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html -Hoss Thanks Hoss, I'll just change the schema then. The problem with TikaEntityProcessor is this installation is still running v1.4.1 so I'll need to upgrade. Any short and sweet instructions for upgrading to 3.2? I have a pretty straight forward Tomcat install, would just dropping in the new war suffice? - Tod
multiple webapps vs multi-core vs distributed
Currently I'm working with a group implementing Solr on an enterprise level. Their initial toe dipping into Solr consists of running multiple (two) webapps on Tomcat using identical schemas. Content is dispersed among a variety of repositories, from CMS, DMS, and WCMS to file systems and RDBMSs. The expectation is that this implementation is going to get very popular very quickly. With that in mind there is also a very large, very diverse set of business groups spanning the entire organization, all of which want to participate. This participation is based mostly on marketing their wares, not making sure a unified enterprise taxonomy exists that can ultimately facilitate search relevancy at an enterprise level. Therefore a unified taxonomy most likely can't be completed within the time frame the customer wants to have the search up and running. So it's up to us to figure out how to satisfy the immediate needs of each individual business entity, without the benefit of a unified enterprise wide taxonomy, and with advance knowledge there is a likelihood that each unit's search index may be based on a different schema dependent on their individual business drivers. At an enterprise level users should be able to search the entire set of individual indexes returning a merged result, with a desire to provide a high level of relevancy to individual business groups along with the enterprise audience both internal and external. From what I've been reading I think the current configuration may not stand up to the long term demand, both from a usability and administrative standpoint, but I'm not completely sure. That leaves multi-core and distributed search as possibilities. I'm leaning towards multi-core. Part of this decision is based on my perceived performance and administrative gains over the current configuration. Distributed search is a possibility, but in the short to medium term I don't see the number of indexed documents increasing to a size that would require it. Plus I think the lack of a unified schema might throw a monkey wrench into the mix, limiting the available solutions. Does anyone have a similar experience they would be willing to share? It's early enough in the project life cycle that alternative ideas can be considered. I'd be interested to hear others' opinions. TIA - Tod
tika.parser.AutoDetectParser
I'm working on upgrading to v3.2 from v1.4.1. I think I've got everything working, but when I try to do a data import using dataimport.jsp the import rolls back and I get a class-not-found exception on the above referenced class. I thought that Tika was packaged up with the base Solr build now, but this message seems to contradict that unless I'm missing a jar somewhere. I've got both dataimporthandler jar files in my WEB-INF/lib dir so I'm not sure what I could be missing. Any ideas? Thanks - Tod
Re: tika.parser.AutoDetectParser
On 07/01/2011 12:59 PM, Shawn Heisey wrote:

On 7/1/2011 9:23 AM, Tod wrote:
I'm working on upgrading to v3.2 from v1.4.1. I think I've got everything working, but when I try to do a data import using dataimport.jsp the import rolls back and I get a class-not-found exception on the above referenced class. I thought that Tika was packaged up with the base Solr build now, but this message seems to contradict that unless I'm missing a jar somewhere. I've got both dataimporthandler jar files in my WEB-INF/lib dir so I'm not sure what I could be missing. Any ideas?

Tika is included in the Solr download, but it's not included in the .war or any of the other files in the dist directory. You may have noticed that you now have to include one or more jars for the dataimport handler. If you copy the following files from the Solr download to the same place you have apache-solr-dataimporthandler-3.2.0.jar, you should be OK.

contrib/extraction/lib/tika-core-0.8.jar
contrib/extraction/lib/tika-parsers-0.8.jar

Thanks,
Shawn

Got them, thanks Shawn.
ContentStreamLoader Problem
I'm getting this error testing Solr V3.3.0 using the ExtractingRequestHandler. I'm taking advantage of the REST interface and batching my documents in using stream.url. It happens for every document I try to index. It works fine under Solr 1.4.1. I'm running everything under Tomcat. I already have an existing 1.4.1 instance running, could that be causing the problem?

Thanks - Tod

Jul 12, 2011 1:11:31 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 1
Jul 12, 2011 1:11:31 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.AbstractMethodError: org/apache/solr/handler/ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:811)
Re: ContentStreamLoader Problem
On 07/12/2011 6:52 PM, Erick Erickson wrote:

This is a shot in the dark, but this smells like a classpath issue, and since you have a 1.4.1 installation on the machine, I'm *guessing* that you're getting a mix of old and new jars. What happens if you try this on a machine that doesn't have 1.4.1 on it? If that works, then it's likely a classpath issue.

Best,
Erick

I'll give it a shot and report back.

Thanks - Tod
Most current Tika jar files that work with Solr 1.4.1
What is the latest version of Tika that I can use with Solr 1.4.1? It comes packaged with 0.4. I tried 0.8 and it didn't work.
Solr read timeout
I'm using Perl to indirectly call the Solr ExtractingRequestHandler to stream remote documents into a Solr index instance. Every 100 URLs I process I do a commit. I've got about 30K documents to be indexed. I'm using a stock, out-of-the-box version of Solr 1.4.1 with the necessary schema changes for the fields I'm indexing.

I seem to be running into performance problems about 40 documents in. I start getting "Failed: 500 read timeout" errors that last about 4 minutes each, slowing processing down to a crawl. I've tried a later version of Tika (0.8) and that didn't seem to help; I'm also not sure it's the problem. Given that I'm using a pretty much unaltered version of Solr, could the problem be on my end? I'm running everything under a typical Tomcat install on a Linux VM.

I understand there are performance tweaks I can make to the Solr config, but I'd like to focus first on resolving this problem rather than blanket-tweaking the entire config. Is there anything in particular I should look at? Can I provide any more information?

Thanks - Tod
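The post uses Perl and LWP, but the request shape is the same from any HTTP client. Below is a minimal Java sketch of the pattern described above, stream.url per document plus a commit every 100 documents; the Solr host, handler path, document URLs, and the choice of the URL as literal.id are assumptions for illustration, not taken from the thread:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.Arrays;
    import java.util.List;

    public class StreamUrlIndexer {
        // Assumed Solr location; adjust for the actual Tomcat install.
        private static final String SOLR = "http://localhost:8080/solr";

        private static int get(String path) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(SOLR + path).openConnection();
            conn.setReadTimeout(120000);          // generous read timeout for slow extractions
            int status = conn.getResponseCode();  // issues the request
            conn.disconnect();
            return status;
        }

        public static void main(String[] args) throws Exception {
            // Stand-ins for the ~30K document URLs pulled from the database.
            List<String> docUrls = Arrays.asList(
                    "http://cms.example.com/docs/a.pdf",
                    "http://cms.example.com/docs/b.doc");
            int count = 0;
            for (String docUrl : docUrls) {
                String enc = URLEncoder.encode(docUrl, "UTF-8");
                // Solr Cell fetches the remote document itself via stream.url;
                // here the document URL doubles as the unique key.
                int status = get("/update/extract?stream.url=" + enc + "&literal.id=" + enc);
                if (status != 200) {
                    System.err.println("Failed: " + status + " for " + docUrl);
                }
                if (++count % 100 == 0) {
                    get("/update?commit=true");   // commit every 100 documents, as in the post
                }
            }
            get("/update?commit=true");           // final commit for any remainder
        }
    }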
JSON formatted response from SOLR question....
I apologize, this is really a JSON/JavaScript question, but I'm stuck and am not finding any resources that address this specifically.

I'm doing a faceted search and getting back in my facet_counts.facet_fields response an array of countries. I'm gathering the count of the array elements returned using this notation:

rsp.facet_counts.facet_fields.country.length

... where rsp is the eval'ed JSON response from SOLR. From there I just loop through, listing the individual country with its associated count. The problem I am having is trying to automate this to loop through any one of a number of facets contained in my JSON response, not just country. So instead of the above I would have something like:

rsp.facet_counts.facet_fields.VARIABLE.length

... where VARIABLE would be the name of one of the facets passed into a JavaScript function to perform the loop. None of the JavaScript examples I can find seem to address this.

Has anyone run into this? Is there a better list to ask this question? Thanks in advance.
Re: JSON formatted response from SOLR question....
Jon,

Yes!!! Changed rsp.facet_counts.facet_fields.['var'].length to rsp.facet_counts.facet_fields[var].length and voila. Tripped up on a syntax error, how special. Just needed another set of eyes - thanks. VelocityResponseWriter duly noted, it will come in handy later.

- Tod

On 5/10/2010 4:55 PM, Jon Baer wrote:

IIRC, I think what we ended up doing in a project was to use the VelocityResponseWriter to write the JSON and set the echoParams to read the handler setup (and looping through the variables). In the template you can grab it w/ something like $request.params.get("facet_fields") ... I don't remember the exact hack here but basically you should also be able to do something like:

rsp.facet_counts.facet_fields['var'].length

In the end w/ some of the nice stuff from the Velocity tools .jar it was easier to work w/ the layout needed for plugins.

- Jon

On May 10, 2010, at 10:18 AM, Tod wrote:

I apologize, this is really a JSON/JavaScript question, but I'm stuck and am not finding any resources that address this specifically. I'm doing a faceted search and getting back in my facet_counts.facet_fields response an array of countries. I'm gathering the count of the array elements returned using this notation: rsp.facet_counts.facet_fields.country.length ... where rsp is the eval'ed JSON response from SOLR. From there I just loop through, listing the individual country with its associated count. The problem I am having is trying to automate this to loop through any one of a number of facets contained in my JSON response, not just country. So instead of the above I would have something like: rsp.facet_counts.facet_fields.VARIABLE.length ... where VARIABLE would be the name of one of the facets passed into a JavaScript function to perform the loop. None of the JavaScript examples I can find seem to address this. Has anyone run into this? Is there a better list to ask this question? Thanks in advance.
Compile problems with anonymous SimpleCollector in custom request handler
Hi everyone,

I'm modifying an existing custom request handler for an open source project, and am looking for some help with a compile error around an anonymous SimpleCollector. The build failure message from ant and the source of the specific method are below. I am compiling on a Mac with Java 1.8 and Solr 6.4.2. There are two things I do not understand.

First:

[javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: is not abstract and does not override abstract method setNextReader(AtomicReaderContext) in Collector
[javac] db.search(q, new SimpleCollector() {

Based on the javadoc, neither SimpleCollector nor Collector define a setNextReader(AtomicReaderContext) method. Grepping through the Lucene 6.4.2 source reveals neither a setNextReader method (though maybe a couple archaic comments), nor an AtomicReaderContext class or interface.

Second:

[javac] method IndexSearcher.search(Query,Collector) is not applicable
[javac] (argument mismatch; cannot be converted to Collector)

How is it that SimpleCollector cannot be converted to Collector? Perhaps this is just a consequence of the first error.

Any help getting past this compile problem would be most welcome!

-Tod

Build failure message:

build-handler:
[mkdir] Created dir: /Users/tod/src/vufind-browse-handler/build/browse-handler
[javac] Compiling 1 source file to /Users/tod/src/vufind-browse-handler/build/browse-handler
[javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: is not abstract and does not override abstract method setNextReader(AtomicReaderContext) in Collector
[javac] db.search(q, new SimpleCollector() {
[javac] ^
[javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: no suitable method found for search(TermQuery,)
[javac] db.search(q, new SimpleCollector() {
[javac] ^
[javac] method IndexSearcher.search(Query,int) is not applicable
[javac] (argument mismatch; cannot be converted to int)
[javac] method IndexSearcher.search(Query,Filter,int) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,Filter,Collector) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,Collector) is not applicable
[javac] (argument mismatch; cannot be converted to Collector)
[javac] method IndexSearcher.search(Query,Filter,int,Sort) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,Filter,int,Sort,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Query,int,Sort) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Weight,ScoreDoc,int) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(List,Weight,ScoreDoc,int) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Weight,int,Sort,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(Weight,FieldDoc,int,Sort,boolean,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(List,Weight,FieldDoc,int,Sort,boolean,boolean,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method IndexSearcher.search(List,Weight,Collector) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] 2 errors

Problem method:

/**
 *
 * Function to retrieve the doc ids when there is a building limit
 * This retrieves the doc ids for an individual heading
 *
 * Need to add a filter query to limit the results from Solr
 *
 * Includes functionality to retrieve additional info
 * like titles for call numbers, possibly ISBNs
 *
 * @param heading        string of the heading to use for finding matching docs
 * @param fields         colon-separated string of Solr fields to return for use in the browse display
 * @param maxBibListSize maximum number of records to check for fields
 * @return return a map of Solr ids and extra bib info
 */
publi
Re: Compile problems with anonymous SimpleCollector in custom request handler
Shawn,

Thanks for the response! Yes, that was it, an older version unexpectedly in the classpath.

And for the benefit of anyone who searches the list archive with a similar debugging need, it's pretty easy to print out the classpath from ant's build.xml, e.g. by converting the compile classpath to a property with <pathconvert> and echoing it: <echo>Classpath: ${classpathProp}</echo>

-Tod

On Nov 29, 2017, at 6:00 PM, Shawn Heisey <apa...@elyograg.org> wrote:

On 11/29/2017 2:27 PM, Tod Olson wrote:
I'm modifying an existing custom request handler for an open source project, and am looking for some help with a compile error around an anonymous SimpleCollector. The build failure message from ant and the source of the specific method are below. I am compiling on a Mac with Java 1.8 and Solr 6.4.2. There are two things I do not understand. First: [javac] /Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445: error: is not abstract and does not override abstract method setNextReader(AtomicReaderContext) in Collector [javac] db.search(q, new SimpleCollector() { Based on the javadoc, neither SimpleCollector nor Collector define a setNextReader(AtomicReaderContext) method. Grepping through the Lucene 6.4.2 source reveals neither a setNextReader method (though maybe a couple archaic comments), nor an AtomicReaderContext class or interface. Second: [javac] method IndexSearcher.search(Query,Collector) is not applicable [javac] (argument mismatch; cannot be converted to Collector) How is it that SimpleCollector cannot be converted to Collector? Perhaps this is just a consequence of the first error.

For the first error: What version of Solr/Lucene are you compiling against? I have found that Collector *did* have a setNextReader method up through Lucene 4.10.4, but in 5.0, that method was gone. I suspect that what's causing your first problem is that you have older Lucene jars (4.x or earlier) on your classpath, in addition to a newer version that you actually want to use for the compile.

I think that can also explain the second problem. It looks like SimpleCollector didn't exist in Lucene 4.10, which is the last version where Collector had setNextReader. SimpleCollector is mentioned in the javadoc for Collector as of 5.0, though.

Thanks,
Shawn
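For anyone landing here from a search: once the stale 4.x jars are off the classpath, the old setNextReader(AtomicReaderContext) override still has to be replaced with the post-5.0 API. A minimal sketch against the Lucene 6.x Collector API follows; the searcher, query, and doc-id list are hypothetical stand-ins for the handler's own fields, not code from this thread:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.SimpleCollector;

    public class CollectDocIds {
        // Collect the index-wide doc ids matching q; searcher and q are assumed
        // to come from the surrounding request handler.
        public static List<Integer> collect(IndexSearcher searcher, Query q) throws IOException {
            final List<Integer> docIds = new ArrayList<>();
            searcher.search(q, new SimpleCollector() {
                private int docBase;   // offset of the current leaf reader

                @Override
                protected void doSetNextReader(LeafReaderContext context) throws IOException {
                    // replaces the pre-5.0 setNextReader(AtomicReaderContext)
                    docBase = context.docBase;
                }

                @Override
                public void collect(int doc) throws IOException {
                    docIds.add(docBase + doc);   // doc is relative to the current leaf
                }

                @Override
                public boolean needsScores() {
                    return false;                // required abstract method in Lucene 5/6
                }
            });
            return docIds;
        }
    }

collect(int) receives a doc id relative to the current leaf reader, so the docBase recorded in doSetNextReader() has to be added to get an index-wide id.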
Debugging custom RequestHandler: spinning up a core for debugging
Hi everyone,

I need to do some step-wise debugging on a custom RequestHandler. I'm trying to spin up a core in a JUnit test, with the idea of running it inside of Eclipse for debugging. (If there's an easier way, I'd like to see a walkthrough!) Problem is the core fails to spin up with:

java.io.IOException: Break Iterator Rule Data Magic Number Incorrect, or unsupported data version

Here's the code, just trying to load (cribbed and adapted from https://stackoverflow.com/questions/45506381/how-to-debug-solr-plugin):

public class BrowseHandlerTest {
    private static CoreContainer container;
    private static SolrCore core;

    private static final Logger logger = Logger.getGlobal();

    @BeforeClass
    public static void prepareClass() throws Exception {
        String solrHomeProp = "solr.solr.home";
        System.out.println(solrHomeProp + "= " + System.getProperty(solrHomeProp));
        // create the core container from the solr.solr.home system property
        container = new CoreContainer();
        container.load();
        core = container.getCore("biblio");
        logger.info("Solr core loaded!");
    }

    @AfterClass
    public static void cleanUpClass() {
        core.close();
        container.shutdown();
        logger.info("Solr core shut down!");
    }
}

The test, run through ant, fails as follows:

[junit] solr.solr.home= /Users/tod/src/vufind/solr/vufind
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] Tests run: 0, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.299 sec
[junit]
[junit] - Standard Error -
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] - ---
[junit] Testcase: org.vufind.solr.handler.tests.BrowseHandlerTest: Caused an ERROR
[junit] SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] org.apache.solr.common.SolrException: SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1066)
[junit] at org.vufind.solr.handler.tests.BrowseHandlerTest.prepareClass(BrowseHandlerTest.java:45)
[junit] Caused by: org.apache.solr.common.SolrException: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:833)
[junit] at org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:87)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:467)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:458)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[junit] at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
[junit] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[junit] at java.lang.Thread.run(Thread.java:745)
[junit] Caused by: java.lang.ExceptionInInitializerError
[junit] at org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory.inform(ICUTokenizerFactory.java:107)
[junit] at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:721)
[junit] at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:160)
[junit] at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:56)
[junit] at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:70)
[junit] at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:108)
[junit] at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:79)
[junit] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:812)
[junit] Caused by: java.lang.R
Re: Debugging custom RequestHandler: spinning up a core for debugging
Thanks, that pointed me in the right direction! The problem was an ancient ICU library in the distributed code. (A quick way to check which jar a class is actually coming from is sketched after the quoted message below.)

-Tod

On Dec 15, 2017, at 5:15 PM, Erick Erickson <erickerick...@gmail.com> wrote:

My guess is this isn't a Solr issue at all; you are somehow using an old Java. RBBIDataWrapper is from com.ibm.icu.text; I saw on a quick Google that this was cured by re-installing Eclipse, but that was from 5 years ago. You say your Java and IDE skills are a bit rusty, maybe you haven't updated your Java JDK or Eclipse in a while? I don't know if Eclipse somehow has its own Java (I haven't used Eclipse for quite a while).

I take it this runs outside Eclipse OK? (Well, with problems, otherwise you wouldn't be stepping through it.)

Best,
Erick

On Fri, Dec 15, 2017 at 1:16 PM, Tod Olson <t...@uchicago.edu> wrote:

Hi everyone,

I need to do some step-wise debugging on a custom RequestHandler. I'm trying to spin up a core in a JUnit test, with the idea of running it inside of Eclipse for debugging. (If there's an easier way, I'd like to see a walkthrough!) Problem is the core fails to spin up with:

java.io.IOException: Break Iterator Rule Data Magic Number Incorrect, or unsupported data version

Here's the code, just trying to load (cribbed and adapted from https://stackoverflow.com/questions/45506381/how-to-debug-solr-plugin):

public class BrowseHandlerTest {
    private static CoreContainer container;
    private static SolrCore core;

    private static final Logger logger = Logger.getGlobal();

    @BeforeClass
    public static void prepareClass() throws Exception {
        String solrHomeProp = "solr.solr.home";
        System.out.println(solrHomeProp + "= " + System.getProperty(solrHomeProp));
        // create the core container from the solr.solr.home system property
        container = new CoreContainer();
        container.load();
        core = container.getCore("biblio");
        logger.info("Solr core loaded!");
    }

    @AfterClass
    public static void cleanUpClass() {
        core.close();
        container.shutdown();
        logger.info("Solr core shut down!");
    }
}

The test, run through ant, fails as follows:

[junit] solr.solr.home= /Users/tod/src/vufind/solr/vufind
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] Tests run: 0, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.299 sec
[junit]
[junit] - Standard Error -
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
[junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
[junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
[junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[junit] - ---
[junit] Testcase: org.vufind.solr.handler.tests.BrowseHandlerTest: Caused an ERROR
[junit] SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] org.apache.solr.common.SolrException: SolrCore 'biblio' is not available due to init failure: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1066)
[junit] at org.vufind.solr.handler.tests.BrowseHandlerTest.prepareClass(BrowseHandlerTest.java:45)
[junit] Caused by: org.apache.solr.common.SolrException: JVM Error creating core [biblio]: null
[junit] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:833)
[junit] at org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:87)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:467)
[junit] at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:458)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[junit] at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
[junit] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit] at java.util.conc
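Not from the thread above, but a general JVM trick related to the resolution (the ancient ICU library): asking the class loader where a suspect class actually comes from exposes a stale bundled jar immediately. A small sketch; the class name is the ICU break-iterator class mentioned in the error, so swap in whatever class is misbehaving:

    public class WhichJar {
        public static void main(String[] args) throws Exception {
            // Prints the jar (or directory) the class was loaded from.
            Class<?> c = Class.forName("com.ibm.icu.text.RuleBasedBreakIterator");
            System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
        }
    }

Note that getCodeSource() can return null for classes loaded by the bootstrap loader, but a third-party library like ICU4J will normally report the jar it was loaded from.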