Re file vs. URL: can't both be hidden behind a URL object (file:// vs. 
http:// scheme)?
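
To illustrate the idea, here is a minimal sketch in Python rather than DIH's Java. The function name read_manifest_lines and the sample manifest are hypothetical, not part of DIH. It relies on urllib.request.urlopen choosing its handler from the URL scheme, similar in spirit to opening a stream from a java.net.URL, so a local file and a remote list go through the same code path.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_manifest_lines(manifest_url: str) -> list[str]:
    """Return the non-empty lines of the manifest found at manifest_url."""
    # urlopen dispatches on the URL scheme, so file:// and http://
    # locations are read by exactly the same code.
    with urlopen(manifest_url) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Demo with a local file; an http:// URL would be passed the same way.
    with tempfile.TemporaryDirectory() as tmp:
        manifest = Path(tmp) / "manifest.txt"
        manifest.write_text("ADD docs/a.xml\nDEL docs/b.xml\n", encoding="utf-8")
        print(read_manifest_lines(manifest.as_uri()))
```

With this shape, a "manifestURL" setting could carry either kind of location without the caller needing to know which one it is.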


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Fergus McMenemie <fer...@twig.me.uk>
> To: solr-user@lucene.apache.org
> Sent: Monday, March 9, 2009 7:00:43 PM
> Subject: Re: DIH with a list of changed documents?
> 
> >Le 09-mars-09 à 22:29, Fergus McMenemie a écrit :
> >>> how would I implement entity-processor if I were able to get the list
> >>> of recently changed documents of our sites?
> >>
> >> Hmmmm, this sounds like a job for my manifestEnityProcessor
> >> see if you can find the thread titled:-
> >>
> >>   "a new DIH manifestEnityProcessor"
> >>
> >> is your list of changed documents a list of additions and
> >> updates only, or does it contain deletes as well?
> >
> >Fergus,
> >
> >I think you should then rename it... Manifest is not the right name to
> >me (manifest refers to something such as the manifest of a jar or of
> >an IMS-content-package, both of which are metadata about the data).
> 
> It's all in the jargon, I guess. Our content repositories are changed
> by update kits. Some of the kits come with manifests; in other cases
> we capture the output from un-tar or un-zip commands and call these
> manifests. The name is up for grabs if a better suggestion comes along;
> I would have used FileListEntityProcessor except the name was taken ;-)
> 
> 
> >I looked at your original description and I could not read anything  
> >about the changed files.
> >The regex approach is a nice one for sure...
> 
> Yep, our "manifests" quite often include JPEGs, AVIs, etc., which we
> do not want indexed. And if it's tar output it will contain
> directory stubs as well.
> 
> >I think a useful DIH Entity-processor that would maintain its deltas  
> >well would have as parameters, url to a list of recently updated urls,  
> >url to a list of recently deleted urls. Is this yours?
> 
> URLs, huh! Never thought of that; I was just assuming it would be a local
> file. However, I guess that could be added... so "manifestFileName" would
> become "manifestURL"? In my use cases some of the "manifests" are along
> the lines of
> 
>    ADD xxxx-checksum-xxx  --pathname_1--
>    DEL --pathname_b--
> 
> Hence "manifestAddRegex" and "manifestDelRegex". In other cases I also
> have separate files, one for adding and another for deleting.
> These I was going to deal with as two separate DIH imports.
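
As a rough sketch of how two such regexes could split a combined manifest, again in Python rather than Java: the patterns, the constant names, and the function split_manifest below are assumptions made for illustration, not the processor's actual code or defaults. Each pattern captures the pathname in group 1, and lines matching neither (directory stubs, for instance) are skipped.

```python
import re

# Hypothetical patterns for the two line shapes quoted above;
# group 1 captures the pathname in both cases.
MANIFEST_ADD_REGEX = re.compile(r"^\s*ADD\s+\S+\s+(\S.*?)\s*$")
MANIFEST_DEL_REGEX = re.compile(r"^\s*DEL\s+(\S.*?)\s*$")


def split_manifest(lines):
    """Sort manifest lines into (paths_to_add, paths_to_delete)."""
    adds, deletes = [], []
    for line in lines:
        add_match = MANIFEST_ADD_REGEX.match(line)
        if add_match:
            adds.append(add_match.group(1))
            continue
        del_match = MANIFEST_DEL_REGEX.match(line)
        if del_match:
            deletes.append(del_match.group(1))
        # Lines matching neither pattern are ignored.
    return adds, deletes


if __name__ == "__main__":
    sample = [
        "   ADD xxxx-checksum-xxx  --pathname_1--",
        "   DEL --pathname_b--",
        "docs/",
    ]
    print(split_manifest(sample))
```

Tightening the add pattern (for example, to accept only pathnames ending in .xml) would be one way to keep images and video out of the index.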
> 
> >I would have one for URLs with the list of recent things basically  
> >from an RSS; the transformer is custom in all cases.
> 
> The output from my manifestEnityProcessor is fed to an
> XPathEntityProcessor.
> 
> >
> >paul
> >
> Fergus.
> -- 
> 
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
> 
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
