Re: A Newbie Question

2010-11-15 Thread Lance Norskog
"There is no current feature" is what I meant. Yes, it would be very handy to do this. I handled this problem in the DIH by creating two documents, both with the same unique ID. The first doc just had the metadata. The second document parsed the input with Tika, but had 'skip doc on error' set

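The two-document approach described above can be sketched in plain Python. This is only an illustration of the idea, not DIH configuration: the `index` dict stands in for a Solr index keyed on the unique ID, and `index_with_fallback` and `extract` are hypothetical names that do not appear in this thread.

```python
def index_with_fallback(index, doc_id, metadata, extract):
    """Store a metadata-only document, then try to replace it with a full one.

    index    -- dict standing in for a Solr index keyed on the unique ID
    metadata -- dict of metadata fields for the document
    extract  -- zero-argument callable that returns the parsed body text
                (a stand-in for the Tika parsing step; it may raise)
    Returns True if the full document was stored, False if only metadata was.
    """
    # First document: metadata only. This one always goes in.
    index[doc_id] = dict(metadata, id=doc_id)
    try:
        body = extract()
    except Exception:
        # Same spirit as 'skip doc on error': drop the second document
        # and keep the metadata-only one already stored.
        return False
    # Second document: same unique ID, so it replaces the first.
    index[doc_id] = dict(metadata, id=doc_id, text=body)
    return True
```

With this shape, a file that fails to parse is still findable by its metadata, and a file that parses cleanly ends up as a single document with both metadata and text.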
Re: A Newbie Question

2010-11-14 Thread Ken Krugler
On Nov 14, 2010, at 3:02pm, Lance Norskog wrote: Yes, the ExtractingRequestHandler uses Tika to parse many file formats. Solr 1.4.1 uses a previous version of Tika (0.6 or 0.7). Here's the problem with Tika and extraction utilities in general: they are not perfect. They will fail on some

Re: A Newbie Question

2010-11-14 Thread Lance Norskog
Yes, the ExtractingRequestHandler uses Tika to parse many file formats. Solr 1.4.1 uses a previous version of Tika (0.6 or 0.7). Here's the problem with Tika and extraction utilities in general: they are not perfect. They will fail on some files. In the ExtractingRequestHandler's case, there i

Re: A Newbie Question

2010-11-14 Thread K. Seshadri Iyer
Thanks for all the responses. Govind: To answer your question, yes, all I want to search is plain text files. They are located in NFS directories across multiple Solaris/Linux storage boxes. The total storage is in hundreds of terabytes. I have just got started with Solr and my understanding is t

Re: A Newbie Question

2010-11-13 Thread Govind Kanshi
Another point of view you might want to think about: what kind of search do you want? Just plain full-text search, or is there something more to those text files? Are they grouped in folders? Do the folders imply a certain kind of grouping/hierarchy/tagging? I recently was trying to help somebody who had files

Re: A Newbie Question

2010-11-12 Thread Lance Norskog
About web servers: Solr is a servlet WAR file and needs a Java web server "container" to run. The example/ folder in the Solr distribution uses 'Jetty', and this is fine for small production-quality projects. You can just copy the example/ directory somewhere to set up your own running Solr; th

Re: A Newbie Question

2010-11-12 Thread Erick Erickson
Think of the data import handler (DIH) as Solr pulling data to index from some source based on configuration. So, once you set up your DIH config to point to your file system, you issue a command to solr like "OK, do your data import thing". See the FileListEntityProcessor. http://wiki.apache.org/s

Re: A Newbie Question

2010-11-12 Thread K. Seshadri Iyer
Hi Lance, Thank you very much for responding (not sure how I reply to the group, so, writing to you). Can you please expand on your suggestion? I am not a web guy and so, don't know where to start. What is the difference between SolrJ and DataImportHandler? Do I need to set up web servers on all

Re: A Newbie Question

2010-11-12 Thread Lance Norskog
Using 'curl' is fine. There is a library called SolrJ for Java and other libraries for other scripting languages that let you upload with more control. There is a thing in Solr called the DataImportHandler that lets you script walking a file system. On Thu, Nov 11, 2010 at 8:38 PM, K. Seshadri Iye
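For readers who want something between 'curl' and the DataImportHandler, here is a minimal Python sketch of walking a file system and posting each text file to Solr's XML update handler. It uses only the standard library. The update URL, the `.txt` suffix, and the field names `id` and `text` are assumptions for illustration, not details from this thread; adjust them to match your own install and schema.

```python
import os
import urllib.request
from xml.sax.saxutils import escape

# Assumption: a hypothetical local Solr update URL; change to your own.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def find_text_files(root, suffix=".txt"):
    """Yield paths of files under root whose names end with suffix."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


def build_add_xml(doc_id, text):
    """Build a Solr XML <add> message for one document.

    Assumption: the schema has fields named 'id' and 'text'.
    """
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="text">{escape(text)}</field>'
        "</doc></add>"
    )


def post_xml(xml, url=SOLR_UPDATE_URL):
    """POST an XML update message to Solr and return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_tree(root, send=post_xml):
    """Send every .txt file under root, then a commit; return the file count."""
    count = 0
    for path in find_text_files(root):
        # The file path doubles as the unique ID in this sketch.
        with open(path, encoding="utf-8", errors="replace") as fh:
            send(build_add_xml(path, fh.read()))
        count += 1
    send("<commit/>")
    return count
```

Calling `index_tree("/some/dir")` would post one `<add>` per file followed by a `<commit/>`. The `send` parameter is there so the walk can be dry-run (for example with `print`) before pointing it at a live server. Files containing control characters that are illegal in XML would need extra cleaning, which this sketch does not attempt.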