>Hi Fergus,
>The idea is that we have something generic which can be applicable to
>a large set of users. If the manifest is a text file it can be read in
>some standard way (say line by line). So we can have an EntityProcessor
>which reads a text file line by line and filters it by a regex, the way
>'grep' works.

Yes. That is what I have written. It is just an alternate form of the
FileListEntityProcessor, except that rather than walking the file system
it reads from a file, line by line, and identifies the portion of the
line containing the filename using a regexp.
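
As a rough sketch of that line-by-line approach (illustrative Python, not
the processor's actual code): the function name manifest_paths and its
parameters are hypothetical. It assumes the add pattern carries one capture
group around the file name and the allow pattern then filters the captured
names, which is only one reading of the attributes described further down.
Relative names are joined to the base directory.

```python
import posixpath
import re


def manifest_paths(lines, base_dir, add_regex, allow_regex):
    """Yield the file paths named in a manifest, one per matching line.

    Assumption: add_regex has exactly one capture group around the file
    name, and allow_regex filters the captured names.
    """
    add_re = re.compile(add_regex)
    allow_re = re.compile(allow_regex)
    for line in lines:
        match = add_re.search(line.rstrip("\r\n"))
        if match is None:
            continue  # not a line that adds a document
        name = match.group(1)
        if not allow_re.search(name):
            continue  # file name is not one we want to index
        # Relative names are taken to be relative to base_dir.
        if posixpath.isabs(name):
            yield name
        else:
            yield posixpath.join(base_dir, name)


if __name__ == "__main__":
    report = ["x docs/a.xml\n", "x docs/readme.txt\n", "x /abs/b.xml\n"]
    for path in manifest_paths(report, "/data", r"^x (.*)$", r"^.*\.xml$"):
        print(path)
```

If each line of the report has the form "x path", a pattern such as
^x (.*)$ picks out the path, and ^.*\.xml$ keeps only the XML files.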
>
>On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>>manifest processing has a very limited usecase. Why can't it be
>>>processed using a PlainTextEntityProcessor and write a Transformer to
>>>read lines using regex?
>>>
>> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
>> insight to see how this could be used to index each of the files
>> listed by a 'tar xvf' report. Can you explain further?
>>
>> About the limited usecase. Verity thought it was useful enough
>> to have their own "bulk insert file" or bif file format that
>> did the same and was far less flexible.
>>
>> In my experience we generally start off with some kind of
>> file walker or crawler looking after file repositories. But
>> these always proved slow and unreliable and over time they
>> were always replaced with some kind of manifest based
>> control of the indexer. Where we could get a report of changes
>> we always used it, and only relied on walkers or crawlers
>> where we had to.
>>
>> Fergus
>>
>>>
>>>--Noble
>>>
>>>On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>>> Hello,
>>>>
>>>> I have almost finished a new DIH EntityProcessor which
>>>> I am calling the ManifestEntityProcessor. It is designed
>>>> around the idea that whatever daemon is used to maintain
>>>> your set of a few 100,000 xml documents it is likely to
>>>> drop a report or log file explaining what has been changed
>>>> within your content store. This assumes a file based
>>>> content repository.
>>>>
>>>> The ManifestEntityProcessor is used as follows
>>>>
>>>> <entity name="jc"
>>>>         processor="ManifestEntityProcessor"
>>>>         baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>>>         rootEntity="false"
>>>>         dataSource="null"
>>>>
>>>>         allowRegex="^.*\.xml$"
>>>>         manifestFileName="/Volumes/ts/man-find.txt"
>>>>         manifestAddRegex="(.*)$"
>>>>         >
>>>>
>>>> The idea is you have a log file or other report, perhaps
>>>> from tar or zip, and you wish to use this to control the
>>>> indexing of the new content. The new entity fields are as
>>>> follows.
>>>>
>>>> manifestFileName  is the name of the manifest file. If
>>>>                   this value is relative, it is assumed to
>>>>                   be relative to baseDir. Required.
>>>>
>>>> manifestAddRegex  is a required regex to identify lines
>>>>                   which when matched should cause docs to
>>>>                   be added to the index.
>>>>
>>>> manifestDelRegex  is an optional value of a regex to
>>>>                   identify documents which when matched should
>>>>                   be deleted from the index **PLANNED**
>>>>
>>>> allowRegex        a required regex to identify the portion
>>>>                   of the ADD/DELete line identified above
>>>>                   which contains the file or pathname to
>>>>                   be ADDed or DELeted. If the resulting value
>>>>                   is relative, it is assumed to be relative to
>>>>                   baseDir.
>>>>
>>>> What do I do next?
>>>>    Raise a JIRA issue and add the code?
>>>>    Is DIH the right place to add this?
>>>>    Suggestions for a different name?
>>>>    Suggestions on how to do the delete bitty from within an entity?
>>>>
>>>> Regards Fergus.
>>>--Noble Paul
>>
>> --
>>
>> ===============================================================
>> Fergus McMenemie               Email:fer...@twig.me.uk
>> Techmore Ltd                   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets             Analyst Programmer
>> ===============================================================
>>
>
>
>
>--
>--Noble Paul

--
===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================