Well,

the idea is that the solr engine indexes the contents of a web platform.

Each document is a user-side-URL out of which several fields would be fetched through various URL-get-documents (e.g. the full-text-view, e.g. the future openmath representation, e.g. the topics (URIs in an ontology), ...).

Would the alternate (and maybe equivalent) way be to stream all documents into one XML document and let XPath triage pick out all the fields? That would also work, and would take advantage of the XPathEntityProcessor's nice configuration.

What bothers me with the HttpDataSource example is that, for now at least, it is configured to pull a single URL, while what is needed (and would provide delta ability) is really to index a list of URLs (for which one would regularly pull the list of recently updated URLs, or simply use GET-with-If-Modified-Since on all of them).
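To sketch the If-Modified-Since half of that (plain java.net, outside of DIH; the URL in the usage example is made up):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class DeltaFetch {
    // Configure a conditional GET: the server answers 304 Not Modified
    // when the document has not changed since the last indexing run,
    // so only updated URLs need to be fetched and re-indexed.
    public static HttpURLConnection conditionalGet(String url, long lastIndexedMillis)
            throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setIfModifiedSince(lastIndexedMillis);
        return conn;
    }
}
```

One would then re-index only the URLs whose response code is not 304.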

I didn't think that modifying the XPathEntityProcessor was the right thing since it seems based on a single stream.
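For what it's worth, one configuration I could imagine (an untested sketch — the element names, field names, and URLs are all made up) is to nest two XPathEntityProcessor entities: an outer one over the regularly pulled list of recently updated URLs, and an inner one that fetches each document URL taken from the outer entity:

```xml
<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <!-- outer entity: the regularly pulled list of recently updated URLs -->
    <entity name="urlList" processor="XPathEntityProcessor"
            url="http://example.org/recent-updates.xml"
            forEach="/updates/doc">
      <field column="docUrl" xpath="/updates/doc/url"/>
      <!-- inner entity: one HTTP fetch per document URL -->
      <entity name="doc" processor="XPathEntityProcessor"
              url="${urlList.docUrl}"
              forEach="/document">
        <field column="fulltext" xpath="/document/fulltext"/>
        <field column="topic" xpath="/document/topics/topic"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```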

Hints for alternatives are eagerly welcome.

paul


On 23 Jan 2009, at 05:45, Noble Paul നോബിള്‍ नोब्ळ् wrote:

Where is this URL coming from? What is the content type of the stream?
Is it plain text or HTML?

If yes, this is a possible enhancement to DIH.



On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht <p...@activemath.org> wrote:

Hello list,

after searching around for quite a while, including in the DataImportHandler documentation on the wiki (which looks amazing), I couldn't find a way to indicate to Solr that the tokens of a field should be the result of analyzing the content of the stream at URL-xxx.

I know I was able to imitate that in plain Lucene by crafting a particular analyzer filter which was given only the URL as content and which then yielded the tokens of the stream behind it.
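A minimal sketch of that trick, with the Lucene plumbing left out (the lowercase/split tokenization only stands in for a real analyzer chain; the class and method names are made up):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class UrlContentTokenizerSketch {
    // The field value is only a URL; open the stream behind it so the
    // analyzer sees the remote content instead of the URL string itself.
    public static Reader openStreamBehind(String fieldValue) throws Exception {
        return new InputStreamReader(new URL(fieldValue.trim()).openStream(), "UTF-8");
    }

    // Stand-in for the analyzer chain: lowercase and split on non-word
    // characters, emitting the tokens of the fetched stream.
    public static List<String> tokenize(Reader content) throws Exception {
        List<String> tokens = new ArrayList<>();
        BufferedReader br = new BufferedReader(content);
        String line;
        while ((line = br.readLine()) != null)
            for (String t : line.toLowerCase().split("\\W+"))
                if (!t.isEmpty())
                    tokens.add(t);
        return tokens;
    }
}
```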

Is this the right way in solr?

thanks in advance.

paul



--
--Noble Paul
