Specific replies are below, but what I'd seriously consider is writing my own filesystem-aware hook that pushes documents to known Solr servers rather than using DIH to pull them. You could use the code from FileListEntityProcessor as a base and go from there. FileListEntityProcessor isn't really intended to do anything very complex.
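For what it's worth, here is a rough sketch of what such a push-style hook might look like. None of this is DIH or Solr code: the DropFolderPusher class, the Sender interface and the processOnce method are names made up for illustration. It assumes Java 11+, that each dropped file is already in Solr's XML update format, and that each server string is a Solr base URL whose /update handler accepts posted XML. Files are sorted by path so the processing order is predictable, and a file is deleted only after every server has accepted it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

class DropFolderPusher {

    /** Hypothetical abstraction: sends one file to one server, returns true on success. */
    interface Sender {
        boolean send(Path file, String server) throws IOException;
    }

    /**
     * Makes one pass over dropDir: pushes every regular file to every server,
     * in sorted path order, and deletes a file once all servers accepted it.
     * Returns the number of files fully processed.
     */
    static int processOnce(Path dropDir, Sender sender, List<String> servers) throws IOException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dropDir)) {
            entries.filter(Files::isRegularFile).forEach(files::add);
        }
        Collections.sort(files); // explicit ordering instead of directory-listing order

        int processed = 0;
        for (Path file : files) {
            boolean acceptedEverywhere = true;
            for (String server : servers) {
                if (!sender.send(file, server)) {
                    acceptedEverywhere = false; // keep the file so a later pass retries it
                    break;
                }
            }
            if (acceptedEverywhere) {
                Files.delete(file);
                processed++;
            }
        }
        return processed;
    }

    /** A Sender that POSTs the file to <server>/update?commit=true (assumed URL layout). */
    static Sender httpSender() {
        HttpClient client = HttpClient.newHttpClient();
        return (file, server) -> {
            HttpRequest request = HttpRequest.newBuilder(URI.create(server + "/update?commit=true"))
                    .header("Content-Type", "application/xml")
                    .POST(HttpRequest.BodyPublishers.ofFile(file))
                    .build();
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while posting " + file, e);
            }
        };
    }
}
```

You would call processOnce(dropDir, DropFolderPusher.httpSender(), servers) from cron or a scheduled executor. If one server fails partway through, the file stays in place and is re-sent to all servers on the next pass, which is only harmless if your documents have a unique key so that re-adding overwrites them.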

1> I don't think this is possible out of the box. Nothing built into DIH hooks into the filesystem and automatically indexes new files.

2> Nope. DIH is pretty simple that way, at least as far as FileListEntityProcessor goes.

3> I'm pretty sure this is irrelevant to FileListEntityProcessor; that timestamp is really used for database imports.

4> "Whatever order Java returns them in". Take a look at the FileListEntityProcessor code; the relevant bit is below. The ordering is whatever Java's File.list() does, and I don't know what guarantees, if any, are made.

    private void getFolderFiles(File dir, final List<Map<String, Object>> fileDetails) {
      // Fetch an array of file objects that pass the filter, however the
      // returned array is never populated; accept() always returns false.
      // Rather we make use of the fileDetails list which is populated as
      // a side effect of the accept method.
      dir.list(new FilenameFilter() {
        public boolean accept(File dir, String name) {
          File fileObj = new File(dir, name);
          if (fileObj.isDirectory()) {
            if (recursive) getFolderFiles(fileObj, fileDetails);
          } else if (fileNamePattern == null) {
            addDetails(fileDetails, dir, name);
          } else if (fileNamePattern.matcher(name).find()) {
            if (excludesPattern != null && excludesPattern.matcher(name).find())
              return false;
            addDetails(fileDetails, dir, name);
          }
          return false;
        }
      });
    }

On Tue, Sep 27, 2011 at 4:51 PM, Gabriel Cooper <inanutshel...@gmail.com> wrote:
> I'm researching using DataImportHandler to import my data files utilizing
> FileDataSource with FileListEntityProcessor and have a couple questions
> before I get started that I'm hoping you guys can assist with.
>
> 1) I would like to put a file on the local filesystem in the configured
> location and have Solr see and process the file without additional effort on
> my part.
> 1a) Is this doable in any way? From what I've seen, this is not supported
> and I must manually call a URL (e.g.
> http://foo/solr/dataimport?command=full-import).
> 1b) The manual, URL-based invocation method seems perfectly logical in a
> database-oriented world, where one might schedule an update to run regularly
> but in my case I have a couple identical indexes I load balance between and
> don't want to run the same hefty query multiple times in parallel. As such,
> I'm doing one query, writing the results to an XML file, pushing that file
> to each box, and then wanting that file processed. I'd like the process to
> be as automated as possible.
>
> 2) I would like any files processed by Solr to be deleted after they've been
> imported. I haven't seen any way to do this currently. I thought I might be
> able to subclass something, but FileListEntityProcessor, for example,
> doesn't seem to give any handles at the right time in the workflow to delete
> a file.
>
> 3) When reading the DIH documentation, I ran across this statement: "When
> delta-import command is executed, it reads the start time stored in
> *conf/dataimport.properties*. It uses that timestamp to run delta queries and
> after completion, updates the timestamp in *conf/dataimport.properties*." If
> it really does update the date to the completion date, what happens to any
> files added between the start and end dates? Are they lost?
>
> 4) For delta imports, I don't see mention of how processed files are ordered
> other than that it tries not to re-import files older than that mentioned in
> the conf/dataimport.properties file. In cases where order matters, does it
> order the files by name or creation date or ...?
>
> Thanks for any help,
>
> Gabriel.