>manifest processing has a very limited use case. Why can't it be
>processed using a PlainTextEntityProcessor, with a Transformer
>written to read lines using regex?
>
Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
insight to see how this could be used to index each of the files
listed by a 'tar xvf' report. Can you explain further?

About the limited use case: Verity thought it was useful enough
to have their own "bulk insert file" or bif file format that
did the same and was far less flexible.

In my experience we generally start off with some kind of
file walker or crawler looking after file repositories. But
these always proved slow and unreliable, and over time they
were always replaced with some kind of manifest-based
control of the indexer. Where we could get a report of changes
we always used it, and only relied on walkers or crawlers
where we had to.
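
To make the manifest idea concrete, here is a rough standalone
sketch in Python (not DIH code) of the add path described in my
original mail below. It rests on assumptions that are mine and not
confirmed behaviour: the first capture group of manifestAddRegex is
taken as the file name, allowRegex is then used as a filter on that
name, and relative names are resolved against baseDir. The function
name paths_to_add and the sample lines are hypothetical.

```python
import re
from pathlib import PurePosixPath


def paths_to_add(manifest_lines, base_dir, manifest_add_regex, allow_regex):
    """Yield the paths a manifest says should be (re)indexed.

    Assumption: the first capture group of manifest_add_regex holds the
    file name; if the regex has no group, the whole match is used.
    """
    add_re = re.compile(manifest_add_regex)
    allow_re = re.compile(allow_regex)
    for raw in manifest_lines:
        line = raw.rstrip("\n")
        match = add_re.search(line)
        if match is None:
            continue  # not an "add" line, ignore it
        name = match.group(1) if add_re.groups else match.group(0)
        if not allow_re.search(name):
            continue  # e.g. not an .xml file
        # Joining with an absolute name discards base_dir, so absolute
        # paths pass through and relative ones land under base_dir.
        yield str(PurePosixPath(base_dir) / name)


if __name__ == "__main__":
    # Hypothetical manifest content, one path per line.
    sample = ["a/one.xml\n", "a/readme.txt\n", "/abs/two.xml\n", "\n"]
    for path in paths_to_add(sample, "/data", r"(.*)$", r"^.*\.xml$"):
        print(path)
```

For a report that prefixes each line (BSD tar prints "x name"), only
the add regex would need to change, for example "^x (.*)$".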

Fergus

>
>--Noble
>
>On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>> Hello,
>>
>> I have almost finished a new DIH EntityProcessor which
>> I am calling the ManifestEntityProcessor. It is designed
>> around the idea that whatever daemon is used to maintain
>> your set of a few 100,000 xml documents, it is likely to
>> drop a report or log file explaining what has been changed
>> within your content store. This assumes a file based
>> content repository.
>>
>> The ManifestEntityProcessor is used as follows
>>
>>       <entity name="jc"
>>               processor="ManifestEntityProcessor"
>>               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>               rootEntity="false"
>>               dataSource="null"
>>
>>               allowRegex="^.*\.xml$"
>>               manifestFileName="/Volumes/ts/man-find.txt"
>>               manifestAddRegex="(.*)$"
>>               >
>>
>> The idea is you have a log file or other report, perhaps
>> from tar or zip, and you wish to use this to control the
>> indexing of the new content. The new entity fields are as
>> follows.
>>
>> manifestFileName is the name of the manifest file. If
>>                 this value is relative, it is assumed to
>>                 be relative to baseDir. Required.
>>
>> manifestAddRegex is a required regex to identify lines
>>                 which when matched should cause docs to
>>                 be added to the index.
>>
>> manifestDelRegex is an optional regex to identify lines
>>                 which when matched should cause docs to
>>                 be deleted from the index **PLANNED**
>>
>> allowRegex       a required regex to identify the portion
>>                 of the ADD/DELete line identified above
>>                 which contains the file or pathname to
>>                 be ADDed or DELeted. If the resulting value
>>                 is relative, it is assumed to be relative to
>>                 baseDir.
>>
>> What do I do next?
>>   Raise a JIRA issue and add the code?
>>   Is DIH the right place to add this?
>>   Suggestions for a different name?
>>   Suggestions on how to do the delete bitty from within an entity?
>>
>> Regards Fergus.
>--Noble Paul

-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
