Re file vs. URL - can't both be hidden behind a URL object (file:// vs. http:// scheme)?
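
To illustrate the point, here is a minimal Python sketch (DIH itself is Java, so this is only an analogy, not DIH code). It assumes the manifest is a plain UTF-8 text file; `read_manifest_lines` and `demo` are made-up names. The same call accepts a file:// or an http:// location, so the caller never needs to know which one it got.

```python
import tempfile
import urllib.request
from pathlib import Path


def read_manifest_lines(manifest_url):
    """Return the non-empty lines of a manifest addressed by URL.

    urlopen handles both file:// and http:// URLs, so a local file
    and a remote list are read through the same code path.
    """
    with urllib.request.urlopen(manifest_url) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def demo():
    """Write a tiny made-up manifest to disk and read it back via file://."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "manifest.txt"
        path.write_text("ADD docs/a.xml\nDEL docs/b.xml\n", encoding="utf-8")
        # as_uri() turns the absolute path into a file:// URL.
        return read_manifest_lines(path.as_uri())
```

Passing an http:// address to `read_manifest_lines` instead would go through the same function unchanged.
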
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Fergus McMenemie <fer...@twig.me.uk>
> To: solr-user@lucene.apache.org
> Sent: Monday, March 9, 2009 7:00:43 PM
> Subject: Re: DIH with a list of changed documents?
>
> >On 09-Mar-09 at 22:29, Fergus McMenemie wrote:
> >>> how would I implement entity-processor if I were able to get the list
> >>> of recently changed documents of our sites?
> >>
> >> Hmmmm, this sounds like a job for my manifestEnityProcessor
> >> see if you can find the thread titled:-
> >>
> >> "a new DIH manifestEnityProcessor"
> >>
> >> is your list of changed documents a list of additions and
> >> updates only, or does it contain deletes as well?
> >
> >Fergus,
> >
> >I think you should then rename it... Manifest is not the right name to
> >me (manifest refers to something such as the manifest of a jar or of
> >an IMS-content-package, both are a metadata of the data).
>
> It's all in the jargon, I guess. Our content repositories are changed
> by update kits, some of the kits come with manifests or in other cases
> we capture the output from un-tar or un-zip commands and we call these
> manifests. The name is up for grabs if a better suggestion comes along;
> I would have used FileListEntityProcessor except the name was taken;-)
>
> >I looked at your original description and I could not read anything
> >about the changed files.
> >The regex approach is a nice one for sure...
>
> Yep, our "manifest"s quite often include jpegs, avis etc which we
> do not want indexed. And if it's a tar output it will contain
> directory stubs as well.
>
> >I think a useful DIH Entity-processor that would maintain its deltas
> >well would have as parameters, url to a list of recently updated urls,
> >url to a list of recently deleted urls. Is this yours?
>
> urls hu! Never thought of that, i was just assuming it would be a local
> file. However I guess that could be added... so "manifestFileName" would
> become "manifestURL"? In my use cases some of the "manifests" are along
> the lines of
>
>     ADD xxxx-checksum-xxx --pathname_1--
>     DEL --pathname_b--
>
> Hence "manifestAddRegex" and "manifestDelRegex". I also, in other
> cases, have separate files, one for adding another for deleting.
> This I was going to deal with as two separate DIH imports.
>
> >I would have one for URLs with the list of recent things basically
> >from an RSS; the transformer is custom in all cases.
>
> The output from my manifestEnityProcessor is fed to an
> XPathEntityProcessor
>
> >paul
>
> Fergus.
> --
>
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
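
For readers following the ADD/DEL manifest format quoted above, here is a rough Python sketch of how an add regex and a delete regex could split such a file into two lists. The two patterns are invented stand-ins for "manifestAddRegex" and "manifestDelRegex" and assume that only .xml paths should be indexed. The sample lines are made up, and this is not the actual manifestEnityProcessor code.

```python
import re

# Hypothetical patterns standing in for "manifestAddRegex" and
# "manifestDelRegex"; group 1 captures the pathname. Restricting the
# match to .xml is an assumption made for this example.
ADD_REGEX = re.compile(r"^ADD\s+\S+\s+(\S+\.xml)$")
DEL_REGEX = re.compile(r"^DEL\s+(\S+\.xml)$")


def split_manifest(lines):
    """Sort manifest lines into paths to add and paths to delete.

    Lines that match neither pattern (images, directory stubs, blank
    lines) are skipped.
    """
    to_add, to_delete = [], []
    for raw in lines:
        line = raw.strip()
        add_match = ADD_REGEX.match(line)
        if add_match:
            to_add.append(add_match.group(1))
            continue
        del_match = DEL_REGEX.match(line)
        if del_match:
            to_delete.append(del_match.group(1))
    return to_add, to_delete


# Made-up sample: one indexable add, one jpeg, one delete, one directory stub.
sample = [
    "ADD 9f2c-checksum-01 docs/report.xml",
    "ADD 77ab-checksum-02 images/photo.jpeg",
    "DEL docs/old.xml",
    "docs/",
]
adds, deletes = split_manifest(sample)
```

Keeping the two patterns as separate parameters also covers the case of separate add and delete files: each file can be run through the same function with only one of the patterns expected to match.
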