On Nov 14, 2010, at 3:02pm, Lance Norskog wrote:

Yes, the ExtractingRequestHandler uses Tika to parse many file formats.

Solr 1.4.1 uses a previous version of Tika (0.6 or 0.7).

Here's the problem with Tika and extraction utilities in general: they are not perfect. They will fail on some files. In the ExtractingRequestHandler's case, there is no way to let it fail in parsing but save the document's metadata anyway with a notation: "sorry not parsed".
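Any fallback therefore has to live outside Solr today. A client-side sketch of the idea (purely hypothetical; the "parsed" field name and URLs are made up for illustration): try the extracting handler first, and if it fails, post a stub document marked as not parsed.

```shell
#!/bin/sh
# Hypothetical client-side fallback, since ExtractingRequestHandler has no
# "index the metadata anyway" mode: try Tika extraction first; if that
# fails, post a stub document flagged as unparsed. SOLR_URL and the
# "parsed" field are assumptions for illustration.
index_file() {
  f="$1"
  if curl -sf "$SOLR_URL/update/extract?literal.id=$f" -F "myfile=@$f" \
        >/dev/null 2>&1; then
    echo "indexed: $f"
  else
    # Extraction failed: save just the document's identity with a
    # "sorry, not parsed" notation instead of losing it entirely.
    stub="<add><doc><field name=\"id\">$f</field><field name=\"parsed\">false</field></doc></add>"
    echo "$stub" | curl -sf "$SOLR_URL/update" -H 'Content-Type: text/xml' \
        --data-binary @- >/dev/null 2>&1 || true
    echo "stub: $f"
  fi
}
```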

By "there is no way" do you mean in configuring the current ExtractingRequestHandler? Or is there some fundamental issue with how Solr uses Tika that prevents ExtractingRequestHandler from being modified to work this way (which seems like a useful configuration setting)?


Regards,

-- Ken

I would rather have the unix 'strings' command parse my documents (thanks to a co-worker for this).
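For what it's worth, 'strings' really can serve as a crude last-resort extractor: it pulls runs of printable characters out of any byte stream, binary or not. A quick illustration:

```shell
# Crude text extraction with the standard unix 'strings' utility:
# printable runs of 4 or more characters are emitted, everything else
# (NUL bytes, short fragments like "ab") is skipped.
printf 'hello\0\0\0world searchable-token\0ab\0' > /tmp/blob.bin
strings /tmp/blob.bin
# prints "hello" and "world searchable-token"
```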

K. Seshadri Iyer wrote:
Thanks for all the responses.

Govind: To answer your question, yes, all I want to search is plain text files. They are located in NFS directories across multiple Solaris/Linux storage boxes. The total storage is in hundreds of terabytes.

I have just got started with Solr, and my understanding is that I will somehow need Tika to help stream/upload files to Solr. I don't know anything about Java programming, being a system admin. So far, I have read that the autodetect parser in Tika will somehow detect the file type and I can use the stream to populate Solr. How, exactly, is still a mystery to me; I'm working on it. Any tips appreciated; thanks in advance.

Sesh



On 13 November 2010 15:24, Govind Kanshi <govind.kan...@gmail.com> wrote:


Another pov you might want to think about: what kind of search do you want? Just plain full-text search, or is there something more to those text files? Are they grouped in folders? Do the folders imply a certain kind of grouping/hierarchy/tagging?

I recently was trying to help somebody who had files across a lot of places, grouped by date/subject/author; he wanted to ensure these are "fields" which can also act as filters/navigators.

Just an input - ignore it if you just want plain full text search.

On Sat, Nov 13, 2010 at 11:25 AM, Lance Norskog <goks...@gmail.com> wrote:


About web servers: Solr is a servlet WAR file and needs a Java web server "container" to run. The example/ folder in the Solr distribution uses Jetty, and this is fine for small production-quality projects. You can just copy the example/ directory somewhere to set up your own running Solr; that's what I always do.

About indexing programs: if you know Unix scripting, it may be easiest to walk the file system yourself with the 'find' program and create Solr input XML files.
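A sketch of that find-based approach (the field names "id" and "text" are assumptions; adjust them to your schema, and the final curl step is left commented out):

```shell
#!/bin/sh
# Sketch: walk a directory tree with 'find' and emit Solr <add> XML for
# each plain-text file. Filenames containing newlines are not handled;
# this is a starting point, not production code.
xml_escape() { sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'; }

make_add_xml() {
  dir="$1"
  printf '<add>\n'
  find "$dir" -type f -name '*.txt' | while read -r f; do
    printf '  <doc>\n'
    printf '    <field name="id">%s</field>\n' "$(printf %s "$f" | xml_escape)"
    printf '    <field name="text">%s</field>\n' "$(xml_escape < "$f")"
    printf '  </doc>\n'
  done
  printf '</add>\n'
}

# Then, with Solr running:
# make_add_xml /mnt/nfs/docs > batch.xml
# curl 'http://localhost:8983/solr/update?commit=true' \
#     -H 'Content-Type: text/xml' --data-binary @batch.xml
```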

But yes, you definitely want the Solr 1.4 Enterprise manual. I spent months learning this stuff very slowly, and the book would have been great back then.

Lance


Erick Erickson wrote:


Think of the data import handler (DIH) as Solr pulling data to index from some source based on configuration. So, once you set up your DIH config to point to your file system, you issue a command to Solr like "OK, do your data import thing". See the FileListEntityProcessor.
http://wiki.apache.org/solr/DataImportHandler
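As a rough idea of what that config looks like (paths and field names here are placeholders, and details vary by Solr version; check the wiki page above):

```xml
<!-- Hypothetical data-config.xml sketch: walk *.txt files under a base
     directory and index each file's contents. Adjust to your schema. -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/mnt/nfs/docs" fileName=".*\.txt$"
            recursive="true" rootEntity="false">
      <entity name="content" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}">
        <field column="plainText" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```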

SolrJ is a client library you'd use to push data to Solr. Basically, you write a Java program that uses SolrJ to walk the file system, find documents, create a Solr document, and send that to Solr. It's not nearly as complex as it sounds <G>. See:
http://wiki.apache.org/solr/Solrj

It's probably worth your while to get a copy of "Solr 1.4 Enterprise Search Server" by Eric Pugh and David Smiley.

Best
Erick

On Fri, Nov 12, 2010 at 8:37 AM, K. Seshadri Iyer <seshadri...@gmail.com> wrote:




Hi Lance,

Thank you very much for responding (not sure how I reply to the group, so writing to you).

Can you please expand on your suggestion? I am not a web guy and so don't know where to start.

What is the difference between SolrJ and DataImportHandler? Do I need to set up web servers on all my storage boxes?

Apologies for the basic level of questions, but I hope I can get started and implement this before the year end (you know why :o)

Thanks,

Sesh

On 12 November 2010 13:31, Lance Norskog <goks...@gmail.com> wrote:




Using 'curl' is fine. There is a library called SolrJ for Java and other libraries for other scripting languages that let you upload with more control. There is a thing in Solr called the DataImportHandler
that lets you script walking a file system.
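Since curl keeps coming up: the usual curl recipe against the XML update handler looks roughly like this (the URL and the "id"/"text" field names are assumptions; adapt to your schema):

```shell
#!/bin/sh
# Sketch: post one document to a local Solr 1.4 instance with curl, then
# commit in the same request. Field names are assumptions for illustration.
SOLR=${SOLR:-http://localhost:8983/solr}
DOC='<add><doc>
  <field name="id">doc1</field>
  <field name="text">hello from a plain text file</field>
</doc></add>'
echo "$DOC" | curl -s "$SOLR/update?commit=true" \
    -H 'Content-Type: text/xml' --data-binary @- \
  || echo "(no Solr reachable at $SOLR)"
```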

On Thu, Nov 11, 2010 at 8:38 PM, K. Seshadri Iyer <seshadri...@gmail.com> wrote:



Hi,

Pardon me if this sounds very elementary, but I have a very basic question regarding Solr search. I have about 10 storage devices running Solaris with hundreds of thousands of text files (there are other files as well, but my target is these text files). The directories on the Solaris boxes are exported and are available as NFS mounts.

I have installed Solr 1.4 on a Linux box and have tested the installation, using curl to post documents. However, the manual says that curl is not the recommended way of posting documents to Solr. Could someone please tell me what is the preferred approach in such an environment? I am not a programmer and would appreciate some hand-holding here :o)

Thanks in advance,

Sesh





--
Lance Norskog
goks...@gmail.com


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




