Re: URL search and indexing

Erik Hatcher Tue, 25 Jun 2013 07:59:19 -0700

If you want to query by domain, then index the domain (or just the last piece 
of it).  I'd suggest you somehow (either in your indexer code or via clever 
analysis tricks) peel off the last piece of the domain as its own string field 
so you get "com", "it", "edu", "gov", etc all as indexed values in a single 
field.
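
As a minimal sketch of the "indexer code" route, assuming a Python indexer: the
function below derives the host and its last label from a URL before the
document is sent to Solr. The function name and the field names "host" and
"tld" are hypothetical, chosen only for illustration; the schema would need
matching string fields for them.

```python
from urllib.parse import urlsplit


def domain_fields(url):
    """Build the extra fields for one document from its URL.

    The field names "host" and "tld" are illustrative assumptions,
    not names taken from the schema discussed in this thread.
    """
    # urlsplit lowercases the hostname and drops any port or credentials.
    host = (urlsplit(url).hostname or "").rstrip(".")
    # Keep only the last dot-separated label: "it", "com", "edu", ...
    # Hosts without a dot (e.g. "localhost") get an empty value.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {"url": url, "host": host, "tld": tld}


doc = domain_fields("http://lucene.apache.org/solr/")
print(doc["host"], doc["tld"])
```

With such a field in place, a domain filter becomes a plain term match (for
example fq=tld:it) rather than a wildcard or regular expression over the whole
URL. Note that this sketch does not special-case IP addresses, whose last label
would be a number.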


        Erik

On Jun 25, 2013, at 10:37, Flavio Pompermaier wrote:

> Basically, I have to design the Solr document, and I think users will mostly
> be interested in filtering by domain (*.it or *.com). However, I cannot rule
> out more site-specific queries (like 'http://lucene.apache.org/solr/*').
> From what I understood, I should configure my schema.xml like this:
> 
> <fields>
>   <field name="url" type="string" indexed="true" stored="true"
> required="true" multiValued="false" />
>   <field name="tokenized-url" type="text_general" indexed="true"
> stored="false" multiValued="false"/>
>   ...
> </fields>
> <uniqueKey>url</uniqueKey>
>   <copyField source="url" dest="tokenized-url"/>
> 
>    <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> Does the text_general field type fit my needs for URLs, or should I use a
> more specific tokenizer?
> 
> 
> On Tue, Jun 25, 2013 at 2:00 PM, Jack Krupansky 
> <j...@basetechnology.com> wrote:
> 
>> As Jan indicates, your users could perform regular expression queries on a
>> URL string field, but maybe you should tell us more about your use case and
>> how your users really want to search.
>> 
>> One technique is to copy the URL to a tokenized text field. Then, users
>> can search for names and sub-sequences that occur in the URL without the
>> need for wildcards or regular expressions.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Jan Høydahl
>> Sent: Tuesday, June 25, 2013 6:28 AM
>> 
>> To: solr-user@lucene.apache.org
>> Subject: Re: URL search and indexing
>> 
>> Probably a good match for the RegExp feature of Solr (given that your url
>> is not tokenized)
>> e.g. q=url:/.*\.it$/
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>> On 25 June 2013 at 12:17, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>> 
>>> Hi everybody,
>>> I'm quite new to Solr, so my question may be trivial for you.
>>> In my use case I have to index content found at some URLs, so I use the URL
>>> as the key of my document and treat it as a string.
>>>
>>> However, I'd like to be able to query by domain name, like *.it or
>>> *.somesite.com. What's the best strategy? I thought of transforming the URL
>>> into a path and indexing it with solr.PathHierarchyTokenizerFactory, but
>>> maybe there's a simpler solution, isn't there?
>>> 
>>> Best,
>>> Flavio
>>> 
>>> --
>>> 
>>> Flavio Pompermaier
>>> *Development Department
>>> *_______________________________________________
>>> *OKKAM**Srl **- www.okkam.it*
>>> 
>>> *Phone:* +(39) 0461 283 702
>>> *Fax:* + (39) 0461 186 6433
>>> *Email:* f.pomperma...@okkam.it
>>> *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
>>> *Registered office:* Trento (Italy), via Segantini 23
>>> 
>>> Confidentiality notice. This e-mail transmission may contain legally
>>> privileged and/or confidential information. Please do not read it if you
>>> are not the intended recipient(s). Any use, distribution, reproduction or
>>> disclosure by any other person is strictly prohibited. If you have received
>>> this e-mail in error, please notify the sender and destroy the original
>>> transmission and its attachments without reading or saving it in any manner.
>>> 
>> 
>> 
> 
> 
