On 3/29/2018 3:59 PM, Terry Steichen wrote:
> First question: When indexing content in a directory, Solr's normal
> behavior is to recursively index all the files found in that directory
> and its subdirectories.  However, turns out that when the files are of
> the form *.eml (email), solr won't do that.  I can use a wildcard to get
> it to index the current directory, but it won't recurse.

At first I had no idea what program you were using.  I may have figured
it out; see below.

> I note this message that's displayed when I begin indexing: "Entering
> auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log

That looks like the simple post tool included with Solr.  If it is, type
"bin/post -help" and you will see that there is a -filetypes option that
lets you change the list of extensions that are considered valid.

Note that the post tool included with Solr is a SIMPLE post tool.  It's
designed as a way to get your feet wet, not for heavy production usage. 
It does not have extensive capability.  We strongly recommend that you
graduate to a better indexing program.  Usually that means that you're
going to have to write one yourself, to be sure that it does everything
YOU want it to do.  The one included with Solr probably can't do some of
the things that you want it to do.

Also, indexing files using the post tool is going to run Tika extraction
inside Solr.  Tika is a separate Apache project.  Solr happens to
include a subset of Tika's capability that can run inside Solr.  That
program is known to sometimes behave explosively when it processes
documents.  If an explosion happens in Tika and it's running inside
Solr, then Solr itself might crash.  Running Tika outside Solr, usually
in a program that you write yourself, is highly recommended.  Doing this
will also give you access to the full range of Tika's capabilities.
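
As a rough illustration of what "a program that you write yourself" can
look like, here is a minimal sketch in Python.  It is not Tika: it swaps
in Python's standard email module to parse the .eml files, and it only
shows the overall shape (recurse through a directory, filter by
extension, extract outside Solr, send the result to Solr's JSON update
handler).  The field names (id, subject, sender, body) are hypothetical
and would have to match your own schema.

```python
import email
import json
import urllib.request
from email import policy
from pathlib import Path


def find_files(root, extensions=("eml",)):
    """Recursively yield files under root whose extension is in extensions."""
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path


def eml_to_doc(path):
    """Parse one .eml file into a flat dict (hypothetical field names)."""
    with open(path, "rb") as handle:
        msg = email.message_from_binary_file(handle, policy=policy.default)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "id": str(path),
        "subject": str(msg["subject"] or ""),
        "sender": str(msg["from"] or ""),
        "body": body.get_content() if body is not None else "",
    }


def post_docs(docs, update_url):
    """Send a list of documents to a Solr JSON update handler and commit."""
    request = urllib.request.Request(
        update_url + "?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

To use it, you would build the list with something like
[eml_to_doc(p) for p in find_files("/path/to/mail")] and pass it to
post_docs along with your collection's update URL, for example
http://localhost:8983/solr/mycollection/update (the path and collection
name here are made up).  Because the parsing happens in your own
process, a bad message can only take down the indexing program, not
Solr.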

Here's an example of a program that uses both JDBC and Tika to index to
Solr:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

If you search Google for "tika index solr" (without the quotes), you'll
find some other examples of custom programs that use Tika to index to
Solr.  There may be better searches you can do on Google as well.

Thanks,
Shawn
