Re: storing the document URI in the index

Walter Underwood Tue, 12 Jun 2007 07:16:27 -0700

Solr doesn't have the URL of the document. The document is given
to Solr in an HTTP POST.


Solr is not a web spider, it is a search web service.
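
In practice that means the client has to put the URI into the document
before posting it. A rough sketch of building such an add message in
Python follows. The field name "uri", the helper build_add_message, and
the example URL and field values are all assumptions for illustration;
the "uri" field would have to be declared in your schema.xml, and the
resulting XML would then be POSTed to Solr's update handler by whatever
client you already use.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc_uri, fields):
    """Build a Solr <add> XML message that carries the document URI as a field.

    The field name "uri" is an assumption; use whatever name schema.xml declares.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    # The client knows where the document came from, so it sets the URI itself.
    uri_field = ET.SubElement(doc, "field", name="uri")
    uri_field.text = doc_uri

    # Remaining fields (id, title, ...) are passed through unchanged.
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value

    # ElementTree escapes characters such as "&" in field values.
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the URL and field values are made up.
message = build_add_message(
    "http://example.com/docs/report.xml",
    {"id": "report-1", "title": "Quarterly report"},
)
print(message)
```

At query time the stored "uri" value comes back with each hit, and the
application can fetch the original document from there.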

wunder


On 6/12/07 6:23 AM, "Ard Schrijvers" <[EMAIL PROTECTED]> wrote:

> Hello Otis, 
> 
> thanks for the info. Would it be an improvement to be able to specify in the
> schema.xml whether the URI should be stored, in a field whose name
> you can also specify in the schema? It might be very well possible that you do
> not "own" the xml documents you index over http, and at the same time, you do
> not want to store its contents in the index. Since at indexing time the uri is
> known, adding it to the index is trivial.
> 
> Regards Ard
> 
> 
> 
> 
> You have to store the URI in a Field yourself.  That means you need to define
> that field in the schema and you have to set its value when adding documents.
> 
> Otis
>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> 
> ----- Original Message ----
> From: Ard Schrijvers <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 12, 2007 9:02:25 AM
> Subject: RE: storing the document URI in the index
> 
> Hello Erik, 
> 
> thanks for the fast answer (sorry for my mail not indenting, but I must use webmail
> :-( ), but the problem I am facing is that I do not see solr storing the
> location of the documents it indexed. So, I need to store the location of a
> document in a field, but I do not see where solr would do this. Fetching the
> document will be done with the simple cocoon generator, so that is no problem,
> but of course, I need the url/uri to be in the index. I know I need it as a
> UN_TOKENIZED STORED field, but I can see with Luke that the location is not
> present in the Lucene index when Solr "crawls" a directory of XML files.
> 
> Regards Ard Schrijvers
> 
> 
> Yes.  Set the field to be stored and non-indexed, field type "string"
> is what I use.
> 
>> Or is everybody used to storing the contents of a document in the
>> lucene index (doesn't this imply a much larger index though?), so
>> instead of retrieving the document's content through a separate
>> fetch over http/filesystem just show the result from the stored
>> content field?
> 
> This all depends on the needs of your project.  It's perfectly fine to
> store the text outside of the index, and that is the way it really
> has to be done for very large indexes where as few fields as possible
> are "stored".
> 
> If you're also asking about Solr fetching the remote resource, that
> is a different story altogether, and no it does not do that.  [though
> with the streaming capability you can feed in a document entirely
> from a URL, but I haven't experimented with that feature yet myself]
> 
>     Erik
> 
