Hi all: I've had a quick look through the archives but am struggling to find a decent search query (a bad start to my Solr career), so apologies if this has been asked multiple times before, as I'm sure it has.

We've got several Windows file servers across several locations and we'd like to index their contents using Solr. So far we've come up with this setup:

- One Solr server with several collections, segregated by file-security needs or line of business.
- At each remote site, a Linux machine mounts the relevant local file server's filesystem via SMB/CIFS.
- That machine runs a Perl script, written by yours truly, that creates an XML index of all the files and submits them to Solr for indexing. The content of files with certain extensions is extracted using Tika. Happy to post this script.

The script is fairly mature and has a few smarts in it, like being able to do delta updates (not in the Solr sense of the word: it does a full scan of the filesystem, then writes out a timestamp; the next run only grabs files modified since that timestamp).
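In outline, the delta pass boils down to something like this (a cut-down illustration rather than the real script; the mount point and stamp file are made up, and the Tika/Solr steps are stubbed out):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $mount      = '/mnt/fileserver';   # the SMB/CIFS mount (placeholder)
my $stamp_file = '/var/tmp/last_run'; # where the previous run's timestamp lives

# Read the timestamp written at the end of the previous run (0 = full scan).
my $last_run = 0;
if (open my $fh, '<', $stamp_file) {
    chomp($last_run = <$fh> || 0);
    close $fh;
}

# Remember when this scan started, so files modified mid-scan get
# picked up by the next run rather than silently skipped.
my $scan_started = time;

# Collect everything modified since the last run.
my @changed;
find(sub {
    return unless -f $_;
    push @changed, $File::Find::name if (stat _)[9] > $last_run;
}, $mount);

# ... build the <add> XML for @changed, run Tika on the known extensions,
# ... POST the batch to Solr's /update handler, then commit.

# Only write the new timestamp once the whole pass has succeeded.
open my $fh, '>', $stamp_file or die "can't write $stamp_file: $!";
print $fh $scan_started;
close $fh;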
This works... to a point. There are these problems:

---------------------------------------------------------------------------------------------------------------------------------------

Time:
-----
On some servers we're dealing with something in the region of a million or more files, and indexing that many files takes upwards of 48 hours. While the script is now fairly stable and fault tolerant, that is still a pretty long time. Part of the reason for the slowness is the content extraction by Tika, but I've been unable to find a satisfactory alternative. We could drop the whole content thing, but then what's the point? Half the beauty of Solr/Tika is that we >can< do it. Projecting from some averages, it'd take the better part of a week to index one of our file servers.

Deletes:
--------
As explained above, once the initial scan takes place, all activity thereafter is limited to files that have changed since $last_run_time. This presents a problem: if a file gets deleted from the file server, we're still going to see it in the search results. There are a few ways that I can see to get rid of these stale files, but they either won't work or are evil:

- Re-index the server. Evil because it'll take half a week.
- Use some filesystem watcher to watch for deletes. Won't work because we're using an SMB/CIFS share mount.
- Periodically list all the files on the file server, diff that against all the files stored in Solr, and delete the differences from Solr, thereby syncing the two (a rough sketch follows this list). Evil because... well, it just is. I'd be asking Solr for every record it has, which'll be a doozy of a return variable.

Surely there has to be a more elegant way?
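For reference, the diff-and-delete option would look something like the following. Asking Solr for only the id field, a page at a time, at least keeps each response manageable (the core URL and mount point are made up, and I'm assuming the uniqueKey id is the file's full path):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON qw(decode_json);
use File::Find;

my $base = 'http://solr:8983/solr/mycollection'; # placeholder core URL
my $ua   = LWP::UserAgent->new;

# 1. Page through Solr asking for ids only, so each response stays small.
my %in_solr;
my ($start, $rows) = (0, 10000);
while (1) {
    my $res = $ua->get("$base/select?q=*:*&fl=id&wt=json&start=$start&rows=$rows");
    die $res->status_line unless $res->is_success;
    my $docs = decode_json($res->content)->{response}{docs};
    last unless @$docs;
    $in_solr{ $_->{id} } = 1 for @$docs;
    $start += $rows;
}

# 2. Walk the mounted share and cross off every path that still exists.
find(sub { delete $in_solr{$File::Find::name} if -f $_ }, '/mnt/fileserver');

# 3. Anything left in %in_solr is gone from disk: delete it from Solr.
my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
for my $stale (keys %in_solr) {
    (my $esc = $stale) =~ s/([&<>])/$ent{$1}/g;   # XML-escape the path
    $ua->post("$base/update",
              Content_Type => 'text/xml',
              Content      => "<delete><id>$esc</id></delete>");
}
$ua->get("$base/update?commit=true");

Even then, I gather deep paging with big start offsets has its own costs, and holding a million ids in memory isn't free either, which is part of why this feels evil.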
Security:
---------
We've worked around this by not indexing some files and by separating things out into various collections. As such it is not a huge problem, but has anyone figured out how to integrate Solr with LDAP?

---------------------------------------------------------------------------------------------------------------------------------------

DIH:
----
Someone will reasonably ask why we're not using the DIH. I tried using it but found the following:

- It would crash.
- When I stopped it crashing by using the on-error settings, both in the Tika subsection and the main part of the DIH config, it still crashed with a Java out-of-memory error.
- I gave Java more memory, but it still crashed.

At that point I gave up, for the following reasons:

- DIH and I were not getting along.
- Java and I were not getting along.
- Java and DIH were not getting along.
- All the doco I could find was either really basic or really advanced... there was no intermediate stuff as far as I could find.
- I realised that I could do what I wanted to do better using Perl than I could with DIH, and this seemed a better solution.

The Perl script has, by and large, been a success. However, we've run up against the problems above, which leads me to my ultimate question: surely other people have been in this same situation. How did they solve these issues? Is the slow indexing time simply a function of the large dataset we're wanting to index? Do we need to throw more oomph at the servers?

The more I play with Solr, the more I realise I need to learn, and the more I realise I'm way out of my depth, hence this email.

Thanks
Anthony