On 10/3/2013 11:29 PM, Sadler, Anthony wrote:
> Time:
> -----
> On some servers we're dealing with something in the region of a million
> or more files. Indexing that many files takes upwards of 48 hours. While
> the script is now fairly stable and fault tolerant, that is still a
> pretty long time. Part of the reason for the slowness is the content
> indexing by Tika, but I've been unable to find a satisfactory
> alternative. We could drop the whole content thing, but then what's the
> point? Half the beauty of solr/tika is that we >can< do it.
>
> Projecting from some averages, it'd take the better part of a week to
> index one of our file servers.

You might have already thought of this, and even done testing that proved
it wasn't much of a problem, but some of the slowness might be network
latency in dealing with the SMB filesystem.  Not necessarily data transfer
time, but the latency of finding files and navigating the directory
structure.  I'll be getting back to this in a moment.

> Deletes:
> --------
> As explained above, once the initial scan takes place, all activity
> thereafter is limited to files that have changed since $last_run_time.
> However, this presents a problem: if a file gets deleted from the file
> server, we're still going to see it in the search results. There are a
> few ways that I can see to get rid of these stale files, but they either
> won't work or are evil:
>
> - Re-index the server. Evil because it'll take half a week.
> - Use some filesystem watcher to watch for deletes. Won't work because
>   we're using an SMB/CIFS share mount.
> - Periodically list all the files on the fileserver, diff that against
>   all the files stored in Solr, and delete the differences from Solr,
>   thereby syncing the two. Evil because... well, it just is. I'd be
>   asking Solr for every record it has, which'll be a doozy of a return
>   variable.
>
> Surely there has to be a more elegant way?

I would recommend removing all unix-isms from your perl script so that it
is written in pure perl, and then running it directly on each Windows
fileserver rather than on your Solr server.  What gets indexed for adds,
changes, and deletes could be driven by a locally running service that
watches the filesystem.  This would pretty much eliminate SMB network
latency for filesystem navigation.  You also would not need the SMB
mounts; the only traffic would be relatively short-lived HTTP transactions
on the Solr port.
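To make that concrete, here is a rough sketch of such a watcher loop in
perl.  Filesys::Notify::Simple and LWP::UserAgent are real CPAN modules
(the former uses Win32::ChangeNotify on Windows when it is installed and
falls back to polling otherwise), but the Solr URL, the watched path, and
the assumption that your uniqueKey field is the file path are placeholders
for your own setup:

#!/usr/bin/perl
# Sketch only: watch a local folder on the fileserver and push deletes
# (and re-indexes) to Solr over HTTP.  URL, path, and id scheme are
# hypothetical.
use strict;
use warnings;
use Filesys::Notify::Simple;   # uses Win32::ChangeNotify when available
use LWP::UserAgent;

my $solr_update = 'http://solrserver:8983/solr/collection1/update';
my $ua      = LWP::UserAgent->new(timeout => 30);
my $watcher = Filesys::Notify::Simple->new([ 'D:/shares/documents' ]);

while (1) {
    # wait() blocks until something changes, then hands us the paths.
    $watcher->wait(sub {
        for my $event (@_) {
            my $path = $event->{path};
            if (-e $path) {
                reindex($path);          # new or changed: re-extract and post
            }
            else {
                delete_from_solr($path); # gone from disk: purge the stale doc
            }
        }
    });
}

sub delete_from_solr {
    my ($id) = @_;
    # minimal XML escaping for the id
    $id =~ s/&/&amp;/g;  $id =~ s/</&lt;/g;  $id =~ s/>/&gt;/g;
    # committing on every delete is wasteful; batch them or use
    # commitWithin in anything real
    my $res = $ua->post($solr_update . '?commit=true',
        Content_Type => 'text/xml',
        Content      => "<delete><id>$id</id></delete>",
    );
    warn 'delete failed: ' . $res->status_line unless $res->is_success;
}

sub reindex {
    # placeholder: call whatever your existing script does for one file
}

One caveat: Filesys::Notify::Simple only reports *which* path changed, not
what happened to it, which is why the sketch stats the file to decide
between a re-index and a delete.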
> Security:
> ---------
> We've worked around this by not indexing some files or separating out
> into various collections. As such it is not a huge problem, but has
> anyone figured out how to integrate Solr with LDAP?

Earlier today I responded to a post on this list asking about security.
The context of the question is very different from yours, but the spirit
of the reply applies here too.  Basically, Solr doesn't do security.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3C524DB8E5.3090903%40elyograg.org%3E

If you were running the perl script on the Windows machines directly, you
would probably have more options to control what can be accessed.

Thanks,
Shawn