Hi all: I've had a quick look through the archives but am struggling to find a decent search query (a bad start to my Solr career), so apologies if this has been asked multiple times before, as I'm sure it has.

We've got several Windows file servers across several locations and we'd like to index their contents using Solr. So far we've come up with this setup:

- One Solr server with several collections, segregated by file-security needs or line of business.
- At each remote site, a Linux machine mounts the relevant local file server's filesystem via SMB/CIFS.
- That machine runs a Perl script, written by yours truly, that creates an XML index of all the files and submits them to Solr for indexing. The content of files with certain extensions is extracted using Tika. Happy to post this script.

The script is fairly mature and has a few smarts in it, like being able to do delta updates (not in the Solr sense of the word: it does a full scan of the filesystem, then writes out a timestamp; the next run only grabs files modified since that timestamp).
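In outline, the delta pass boils down to something like this (a cut-down illustration rather than the real script; the mount point and stamp file are made up, and the Tika/Solr steps are stubbed out):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $mount      = '/mnt/fileserver';   # the SMB/CIFS mount (placeholder)
my $stamp_file = '/var/tmp/last_run'; # where the previous run's timestamp lives

# Read the timestamp written at the end of the previous run (0 = full scan).
my $last_run = 0;
if (open my $fh, '<', $stamp_file) {
    chomp($last_run = <$fh> || 0);
    close $fh;
}

# Remember when this scan started, so files modified mid-scan get
# picked up by the next run rather than silently skipped.
my $scan_started = time;

# Collect everything modified since the last run.
my @changed;
find(sub {
    return unless -f $_;
    push @changed, $File::Find::name if (stat _)[9] > $last_run;
}, $mount);

# ... build the <add> XML for @changed, run Tika on the known extensions,
# ... POST the batch to Solr's /update handler, then commit.

# Only write the new timestamp once the whole pass has succeeded.
open my $fh, '>', $stamp_file or die "can't write $stamp_file: $!";
print $fh $scan_started;
close $fh;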
This works... to a point. There are these problems:

---------------------------------------------------------------------------------------------------------------------------------------

Time:
-----
On some servers we're dealing with something in the region of a million or more files, and indexing that many files takes upwards of 48 hours. While the script is now fairly stable and fault tolerant, that is still a pretty long time. Part of the reason for the slowness is the content extraction by Tika, but I've been unable to find a satisfactory alternative. We could drop the whole content thing, but then what's the point? Half the beauty of Solr/Tika is that we >can< do it. Projecting from some averages, it'd take the better part of a week to index one of our file servers.

Deletes:
--------
As explained above, once the initial scan takes place, all activity thereafter is limited to files that have changed since $last_run_time. This presents a problem: if a file gets deleted from the file server, we're still going to see it in the search results. There are a few ways that I can see to get rid of these stale files, but they either won't work or are evil:

- Re-index the server. Evil because it'll take half a week.
- Use some filesystem watcher to watch for deletes. Won't work because we're using an SMB/CIFS share mount.
- Periodically list all the files on the file server, diff that against all the files stored in Solr, and delete the differences from Solr, thereby syncing the two (a rough sketch follows this list). Evil because... well, it just is. I'd be asking Solr for every record it has, which'll be a doozy of a return variable.

Surely there has to be a more elegant way?
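For reference, the diff-and-delete option would look something like the following. Asking Solr for only the id field, a page at a time, at least keeps each response manageable (the core URL and mount point are made up, and I'm assuming the uniqueKey id is the file's full path):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON qw(decode_json);
use File::Find;

my $base = 'http://solr:8983/solr/mycollection'; # placeholder core URL
my $ua   = LWP::UserAgent->new;

# 1. Page through Solr asking for ids only, so each response stays small.
my %in_solr;
my ($start, $rows) = (0, 10000);
while (1) {
    my $res = $ua->get("$base/select?q=*:*&fl=id&wt=json&start=$start&rows=$rows");
    die $res->status_line unless $res->is_success;
    my $docs = decode_json($res->content)->{response}{docs};
    last unless @$docs;
    $in_solr{ $_->{id} } = 1 for @$docs;
    $start += $rows;
}

# 2. Walk the mounted share and cross off every path that still exists.
find(sub { delete $in_solr{$File::Find::name} if -f $_ }, '/mnt/fileserver');

# 3. Anything left in %in_solr is gone from disk: delete it from Solr.
my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
for my $stale (keys %in_solr) {
    (my $esc = $stale) =~ s/([&<>])/$ent{$1}/g;   # XML-escape the path
    $ua->post("$base/update",
              Content_Type => 'text/xml',
              Content      => "<delete><id>$esc</id></delete>");
}
$ua->get("$base/update?commit=true");

Even then, I gather deep paging with big start offsets has its own costs, and holding a million ids in memory isn't free either, which is part of why this feels evil.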
Security:
---------
We've worked around this by not indexing some files and by separating things out into various collections. As such it is not a huge problem, but has anyone figured out how to integrate Solr with LDAP?

---------------------------------------------------------------------------------------------------------------------------------------

DIH:
----
Someone will reasonably ask why we're not using the DIH. I tried using it but found the following:

- It would crash.
- When I stopped it crashing by using the on-error settings, both in the Tika subsection and the main part of the DIH config, it still crashed with a Java out-of-memory error.
- I gave Java more memory, but it still crashed.

At that point I gave up, for the following reasons:

- DIH and I were not getting along.
- Java and I were not getting along.
- Java and DIH were not getting along.
- All the doco I could find was either really basic or really advanced... there was no intermediate stuff as far as I could find.
- I realised that I could do what I wanted to do better using Perl than I could with DIH, and this seemed a better solution.

The Perl script has, by and large, been a success. However, we've run up against the problems above, which leads me to my ultimate question: surely other people have been in this same situation. How did they solve these issues? Is the slow indexing time simply a function of the large dataset we're wanting to index? Do we need to throw more oomph at the servers?

The more I play with Solr, the more I realise I need to learn, and the more I realise I'm way out of my depth, hence this email.

Thanks
Anthony