Re: URL-import field type?

Noble Paul നോബിള്‍ नोब्ळ् Fri, 23 Jan 2009 01:11:30 -0800

On Fri, Jan 23, 2009 at 2:28 PM, Paul Libbrecht <p...@activemath.org> wrote:
> Well,
>
> the idea is that the solr engine indexes the contents of a web platform.
>
> Each document is a user-side-URL out of which several fields would be
> fetched through various URL-get-documents (e.g. the full-text-view, e.g. the
> future openmath representation, e.g. the topics (URIs in an ontology), ...).


If the responses from these URLs are well-formed XML, each one can be
channeled to an XPathEntityProcessor (one per field) and processed
there.

If the response is not XML, then there is no EntityProcessor that can
consume it. We may need to add one.
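
To illustrate the one-XPath-per-field idea above, here is a minimal sketch in plain Java using the JDK's javax.xml.xpath. It is not DIH code and does not show how XPathEntityProcessor is implemented. The two XML responses, their element names, and the field names "text" and "topic" are made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class PerFieldXPath {
    // Evaluate one XPath expression against one XML response
    // and return the string value of the first match.
    static String extract(String xml, String xpathExpr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(xpathExpr, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical responses, standing in for what two per-field URLs might return.
        String fullText = "<view><body>Pythagoras theorem</body></view>";
        String topics = "<topics><topic>http://example.org/onto#geometry</topic></topics>";

        // One XPath per field; the results are merged into a single document.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("text", extract(fullText, "/view/body"));
        doc.put("topic", extract(topics, "/topics/topic"));
        System.out.println(doc);
    }
}
```

A response that is not well-formed XML would make extract() throw, which is the case where a different kind of processor would be needed.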
>
> Would the alternative (and maybe equivalent) way be to stream all documents
> into one XML document and let the XPath triage act across all fields? That
> would also work and would take advantage of the XPathEntityProcessor's nice
> configuration.

>
> What bothers me with the HttpDataSource example is that, for now, at least,
> it is configured to pull a single URL while what is needed (and would
> provide delta ability) is really to index a list of URLs (for which one
> would regularly pull the list of recently updated URLs or simply use
> GET-if-modified-since on all of them).
If-Modified-Since is not supported by HttpDataSource. However, you
can write a transformer which pings the URL with an If-Modified-Since
header and skips the document using the $skipDoc option.
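
A rough sketch of that check, under some assumptions. In DIH a custom transformer extends org.apache.solr.handler.dataimport.Transformer and implements transformRow(Map<String, Object> row, Context context). To stay self-contained, the sketch below does not depend on Solr and only shows what the body of such a method might do. The column name "url", the lastIndexedMillis parameter, and all class and method names here are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

public class IfModifiedSinceCheck {
    // Send a conditional HEAD request and return the HTTP status code.
    // A 304 status means the resource has not changed since sinceMillis.
    static int conditionalStatus(String url, long sinceMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");        // only the status is needed, not the body
        conn.setIfModifiedSince(sinceMillis); // sets the If-Modified-Since request header
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Flag the row with $skipDoc when the server reported "not modified".
    static Map<String, Object> markRow(Map<String, Object> row, int status) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            row.put("$skipDoc", "true");
        }
        return row;
    }

    // What a transformRow body might look like. The "url" column and the
    // lastIndexedMillis value are assumptions made for this sketch.
    static Map<String, Object> transformRow(Map<String, Object> row, long lastIndexedMillis)
            throws IOException {
        int status = conditionalStatus((String) row.get("url"), lastIndexedMillis);
        return markRow(row, status);
    }
}
```

Keeping the HTTP call separate from the row-marking step makes the skip logic easy to check without a network connection.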
>
> I didn't think that modifying the XPathEntityProcessor was the right thing
> since it seems based on a single stream.
>
> Hints for alternatives eagerly welcome.
>
> paul
>
>
> On 23 Jan 09 at 05:45, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> where is this url coming from? what is the content type of the stream?
>> is it plain text or html?
>>
>> if yes, this is a possible enhancement to DIH
>>
>>
>>
>> On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht <p...@activemath.org>
>> wrote:
>>>
>>> Hello list,
>>>
>>> after searching around for quite a while, including in the
>>> DataImportHandler
>>> documentation on the wiki (which looks amazing), I couldn't find a way to
>>> indicate to solr that the tokens of that field should be the result of
>>> analyzing the tokens of the stream at URL-xxx.
>>>
>>> I know I was able to imitate that in plain Lucene by crafting a
>>> particular analyzer-filter which was given only the URL as content
>>> and which passed on the tokens of the stream.
>>>
>>> Is this the right way in solr?
>>>
>>> thanks in advance.
>>>
>>> paul
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul
