Hi Fergus, open a JIRA issue anyway. Put in your thoughts and we can
refine the requirements as part of the discussion.

Basically the requirements are:
1) read a file line by line
2) filter lines (include or exclude) based on a regex
3) extract parts (named parts) from each line using another regex
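
To make the three steps concrete, here is a rough sketch in Python. It is
only an illustration of the requirements, not the DIH implementation: the
names process_lines, process_file, accept_regex, skip_regex and
extract_regex are made up for this example, and the real EntityProcessor
would be written in Java.

```python
import re


def process_lines(lines, accept_regex=None, skip_regex=None, extract_regex=None):
    """Yield one dict per surviving line (all parameter names are hypothetical).

    accept_regex:  keep only lines that match it (include filter).
    skip_regex:    drop lines that match it (exclude filter).
    extract_regex: regex with named groups; the groups become the row fields.
    """
    accept = re.compile(accept_regex) if accept_regex else None
    skip = re.compile(skip_regex) if skip_regex else None
    extract = re.compile(extract_regex) if extract_regex else None

    for raw in lines:                      # 1) read line by line
        line = raw.rstrip("\n")
        if accept and not accept.search(line):
            continue                       # 2) include filter failed
        if skip and skip.search(line):
            continue                       # 2) exclude filter matched
        if extract is None:
            yield {"rawLine": line}        # no extraction requested
            continue
        match = extract.search(line)
        if match:
            yield match.groupdict()        # 3) named parts of the line


def process_file(path, **kwargs):
    """Same as process_lines, but reads the lines from a text file."""
    with open(path, encoding="utf-8") as handle:
        yield from process_lines(handle, **kwargs)


# Example: a made-up 'tar xvf' style report.
report = ["x data/a.xml\n", "x data/b.txt\n", "# comment\n", "x data/c.xml\n"]
rows = list(process_lines(report,
                          accept_regex=r"\.xml$",
                          skip_regex=r"^#",
                          extract_regex=r"^x (?P<fileName>.+)$"))
```

Each surviving line becomes one row whose keys are the named groups, so a
nested entity could then pick up fileName and index the file it points to.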

Noble


On Tue, Mar 10, 2009 at 1:50 AM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>Hi Fergus,
>>The idea is that we have something generic which can be applicable to
>>a large set of users. If the manifest is a text file it can be read in
>>some standard way (say line by line). So we can have an EntityProcessor
>>which reads a text file line by line and filters it by a regex, the way
>>'grep' works.
> Yes. That is what I have written. It is just an alternate form of the
> FileListEntityProcessor except that rather than walking the file system
> it reads from a file, line by line, and identifies the portion of the
> line containing the filename using a regexp.
>
>
>>
>>On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>>>Manifest processing has a very limited use case. Why can't it be
>>>>processed using a PlainTextEntityProcessor, with a Transformer
>>>>written to read the lines using a regex?
>>>>
>>> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
>>> insight to see how this could be used to index each of the files
>>> listed by a 'tar xvf' report. Can you explain further?
>>>
>>> About the limited use case. Verity thought it was useful enough
>>> to have their own "bulk insert file" or bif file format that
>>> did the same and was far less flexible.
>>>
>>> In my experience we generally start off with some kind of
>>> file walker or crawler looking after file repositories. But
>>> these always proved slow and unreliable, and over time they
>>> were always replaced with some kind of manifest-based
>>> control of the indexer. Where we could get a report of changes
>>> we always used it, and only relied on walkers or crawlers
>>> where we had to.
>>>
>>> Fergus
>>>
>>>>
>>>>--Noble
>>>>
>>>>On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>>>> Hello,
>>>>>
>>>>> I have almost finished a new DIH EntityProcessor which
>>>>> I am calling the ManifestEntityProcessor. It is designed
>>>>> around the idea that whatever daemon is used to maintain
>>>>> your set of a few hundred thousand XML documents, it is likely to
>>>>> drop a report or log file explaining what has been changed
>>>>> within your content store. This assumes a file based
>>>>> content repository.
>>>>>
>>>>> The ManifestEntityProcessor is used as follows:
>>>>>
>>>>>       <entity name="jc"
>>>>>               processor="ManifestEntityProcessor"
>>>>>               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>>>>               rootEntity="false"
>>>>>               dataSource="null"
>>>>>
>>>>>               allowRegex="^.*\.xml$"
>>>>>               manifestFileName="/Volumes/ts/man-find.txt"
>>>>>               manifestAddRegex="(.*)$"
>>>>>               >
>>>>>
>>>>> The idea is you have a log file or other report, perhaps
>>>>> from tar or zip, and you wish to use this to control the
>>>>> indexing of the new content. The new entity fields are as
>>>>> follows.
>>>>>
>>>>> manifestFileName is the name of the manifest file. If
>>>>>                 this value is relative, it is assumed to
>>>>>                 be relative to baseDir. Required.
>>>>>
>>>>> manifestAddRegex is a required regex to identify lines
>>>>>                 which when matched should cause docs to
>>>>>                 be added to the index.
>>>>>
>>>>> manifestDelRegex is an optional value of a regex to
>>>>>                 identify documents which when matched should
>>>>>                 be deleted from the index **PLANNED**
>>>>>
>>>>> allowRegex       a required regex to identify the portion
>>>>>                 of the ADD/DELete line identified above
>>>>>                 which contains the file or pathname to be
>>>>>                 ADDed or DELeted. If the resulting value is
>>>>>                 relative, it is assumed to be relative to
>>>>>                 baseDir.
>>>>>
>>>>> What do I do next?
>>>>>   Raise a JIRA issue and add the code?
>>>>>   Is DIH the right place to add this?
>>>>>   Suggestions for a different name?
>>>>>   Suggestions on how to do the delete bitty from within an entity?
>>>>>
>>>>> Regards Fergus.
>>>>--Noble Paul
>>>
>>> --
>>>
>>> ===============================================================
>>> Fergus McMenemie               Email:fer...@twig.me.uk
>>> Techmore Ltd                   Phone:(UK) 07721 376021
>>>
>>> Unix/Mac/Intranets             Analyst Programmer
>>> ===============================================================
>>>
>>
>>
>>
>>--
>>--Noble Paul
>



-- 
--Noble Paul
