Re: URL search and indexing

Erick Erickson Thu, 27 Jun 2013 10:32:25 -0700

Right, string fields are a little tricky; they're easy to confuse with
fields that actually _do_ something.


By default, norms and term frequencies are turned off for types based on
'class="solr.StrField"'. So field length normalization (i.e. terms that
appear in shorter fields count more) and term frequency calculations are
_not_ included in the score calculation.

Try blowing your index away (the change only takes effect when documents
are reindexed) and adding this to your fields to see the difference:

omitNorms="false" omitTermFreqAndPositions="false"
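
As a minimal sketch, this is how those two attributes might look on the
multi-valued "itemid" field from the schema quoted below. The attribute
names are the ones given above; applying them only to "itemid" (and not to
the "url" key) is an assumption about which field you want scored this way.

```xml
<!-- Sketch only: the "itemid" field from the quoted schema, with norms and
     term frequencies/positions switched back on for this string field. -->
<field name="itemid" type="string" indexed="true" stored="true"
       multiValued="true"
       omitNorms="false" omitTermFreqAndPositions="false"/>
```

After reindexing, the debug output for a query such as q=itemid:1000 should
then show a different term frequency for each of the two documents.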

You probably want to either turn these on explicitly for your string types
or use a type based on 'class="solr.TextField"', since these options
default to "false" for text fields. If you use something like
KeywordTokenizerFactory, your URL also won't get split up into pieces.
And in that case you can also normalize the values with something like
LowerCaseFilterFactory, which you can't do with "string" types since they're
completely unanalyzed.
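
A rough sketch of what such a text-based type could look like follows. The
names "url_keyword" and "url_search" are hypothetical, and keeping the
uniqueKey as a plain string while copying it into a second, analyzed field
is my assumption rather than something you have to do.

```xml
<!-- Goes inside <types>. Hypothetical type name "url_keyword": the whole
     URL stays a single token, but is lower-cased at index and query time.
     Norms and term frequencies are left at the text-field defaults. -->
<fieldType name="url_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Goes inside <fields>. Hypothetical field name "url_search": a
     searchable copy of "url", so the key itself stays unanalyzed. -->
<field name="url_search" type="url_keyword" indexed="true" stored="false"/>

<!-- Goes at the top level of <schema>. -->
<copyField source="url" dest="url_search"/>
```

The admin/analysis page is a quick way to check that a URL typed into
"url_search" comes out as one lower-cased token.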

Best
Erick


On Wed, Jun 26, 2013 at 11:34 AM, Flavio Pompermaier
<pomperma...@okkam.it> wrote:

> Obviously I messed up the email thread... however, I found a problem
> indexing my document via post.sh.
> This is basically my schema.xml:
>
> <schema name="dopa-schema" version="1.5">
>   <fields>
>     <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
>     <field name="itemid" type="string" indexed="true" stored="true" multiValued="true"/>
>     <field name="_version_" type="long" indexed="true" stored="true"/>
>   </fields>
>   <uniqueKey>url</uniqueKey>
>   <types>
>     <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
>     <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
>   </types>
> </schema>
>
> and this is the document I tried to upload via post.sh:
>
> <add>
> <doc>
>   <field name="url">http://test.example.org/first.html</field>
>   <field name="itemid">1000</field>
>   <field name="itemid">1000</field>
>   <field name="itemid">1000</field>
>   <field name="itemid">5000</field>
> </doc>
> <doc>
>   <field name="url">http://test.example.org/second.html</field>
>   <field name="itemid">1000</field>
>   <field name="itemid">5000</field>
> </doc>
> </add>
>
> When playing with the administration and debugging tools I discovered that
> searching for q=itemid:5000 gave me the same score for those docs, while I
> was expecting different term frequencies between the first and the second.
> In fact, using Java to upload the documents led to correct results (3
> occurrences of item 1000 in the first doc and 1 in the second), e.g.:
> document1.addField("itemid", "1000");
> document1.addField("itemid", "1000");
> document1.addField("itemid", "1000");
>
> Am I right or am I missing something else?
>
>
> On Wed, Jun 26, 2013 at 5:18 PM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>
> > If there is a bug... we should identify it. What's a sample post command
> > that you issued?
> >
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Flavio Pompermaier
> > Sent: Wednesday, June 26, 2013 10:53 AM
> >
> > To: solr-user@lucene.apache.org
> > Subject: Re: URL search and indexing
> >
> > I was doing exactly that and, thanks to the administration page and
> > explanation/debugging, I checked whether the results were those expected.
> > Unfortunately, the results were not correct when submitting updates through
> > the post.sh script (which uses curl in the end).
> > Probably, if it finds the same tag (same value for the same field name),
> > it collapses them.
> > Rewriting the same document in Java and submitting the updates made
> > things work correctly.
> >
> > In my opinion this is a bug (of the entire process; I don't know if
> > this is a problem of curl or of the script itself).
> >
> > Best,
> > Flavio
> >
> > On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >  Flavio:
> >>
> >> You mention that you're new to Solr, so I thought I'd make sure
> >> you know that the admin/analysis page is your friend! I flat
> >> guarantee that as you try to index/search following the suggestions
> >> you'll scratch your head at your results and.... you'll discover that
> >> the analysis process isn't doing quite what you expect. The
> >> admin/analysis page shows you the transformation of the input
> >> at each stage, i.e. how the input is tokenized, what transformations
> >> are applied to each token etc. It's invaluable!
> >>
> >> Best
> >> Erick
> >>
> >> P.S. Feel free to un-check the "verbose" box, it provides lots
> >> of information but can be overwhelming, especially at first!
> >>
> >> On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
> >> <pomperma...@okkam.it> wrote:
> >> > Ok thank you all for the great help!
> >> > Now I'm ready to start playing with my index!
> >> >
> >> > Best,
> >> > Flavio
> >> >
> >> >
> >> > On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >> >
> >> >> Yeah, URL Classify does only do so much. That's why you need to combine
> >> >> multiple methods.
> >> >>
> >> >> As a fourth method, you could code up a short JavaScript
> >> >> "StatelessScriptUpdateProcessor" that did something like take a full
> >> >> domain name (such as output by URL Classify) and turn it into multiple
> >> >> values, each with more of the prefix removed, so that "lucene.apache.org"
> >> >> would index as:
> >> >>
> >> >> lucene.apache.org
> >> >> apache.org
> >> >> apache
> >> >> .org
> >> >> org
> >> >>
> >> >> And then the user could query by any of those partial domain names.
> >> >>
> >> >> But, if you simply tokenize the URL (copy the URL string to a text field),
> >> >> you automatically get most of that. The user can query by a URL fragment,
> >> >> such as "apache.org", ".org", "lucene.apache.org", etc. and the
> >> >> tokenization will strip out the punctuation.
> >> >>
> >> >> I'll add this script to my list of examples to add in the next rev of my
> >> >> book.
> >> >>
> >> >>
> >> >> -- Jack Krupansky
> >> >>
> >> >> -----Original Message----- From: Flavio Pompermaier
> >> >> Sent: Tuesday, June 25, 2013 10:06 AM
> >> >>
> >> >> To: solr-user@lucene.apache.org
> >> >> Subject: Re: URL search and indexing
> >> >>
> >> >> I bought the book and, looking at the example, I still don't understand if
> >> >> it is possible to query all sub-urls of my URL.
> >> >> For example, if the URLClassifyProcessorFactory takes as input
> >> >> "url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> >> >> and produces outputs like
> >> >> - "url_domain_s":"lucene.apache.org"
> >> >> - "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> >> >> How should I configure url_domain_s in order to be able to make queries
> >> >> like '*.apache.org'?
> >> >> How should I configure url_canonical_s in order to be able to make queries
> >> >> like 'http://lucene.apache.org/solr/*'?
> >> >> Is it better to have two different fields for the two queries or could I
> >> >> create just one field for the two kinds of queries (obviously for the
> >> >> former case I should then query something like *://.apache.org/*)?
> >> >>
> >> >>
> >> >> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >> >>
> >> >>> There are examples in my book:
> >> >>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
> >> >>>
> >> >>> But... I still think you should use a tokenized text field as well - use
> >> >>> all three: raw string, tokenized text, and URL classification fields.
> >> >>>
> >> >>> -- Jack Krupansky
> >> >>>
> >> >>> -----Original Message----- From: Flavio Pompermaier
> >> >>> Sent: Tuesday, June 25, 2013 9:02 AM
> >> >>> To: solr-user@lucene.apache.org
> >> >>> Subject: Re: URL search and indexing
> >> >>>
> >> >>>
> >> >>> That sounds exactly like what I'm looking for! However, I cannot find an
> >> >>> example of how to use it... could you help me please?
> >> >>> Moreover, about the id field, isn't it true that the id field shouldn't be
> >> >>> analyzed, as suggested in
> >> >>> http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document
> >> >>> ?
> >> >>>
> >> >>>
> >> >>> On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl <jan....@cominvent.com> wrote:
> >> >>>
> >> >>>> Sure you can query the url directly. Or if you choose you can split it up
> >> >>>> in multiple components, e.g. using
> >> >>>> http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Jan Høydahl, search solution architect
> >> >>>> Cominvent AS - www.cominvent.com
> >> >>>>
> >> >>>> On 25 June 2013 at 14:10, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> >> >>>>
> >> >>>> > Sorry, but maybe I'm missing something here... could I declare url as the
> >> >>>> > key field and query it too?
> >> >>>> > At the moment, my schema.xml looks like:
> >> >>>> >
> >> >>>> > <fields>
> >> >>>> >   <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> >> >>>> >
> >> >>>> >   <field name="category" type="string" indexed="true" stored="true"/>
> >> >>>> >   <field name="language" type="string" indexed="true" stored="true"/>
> >> >>>> >   ...
> >> >>>> >   <field name="_version_" type="long" indexed="true" stored="true"/>
> >> >>>> >
> >> >>>> > </fields>
> >> >>>> > <uniqueKey>url</uniqueKey>
> >> >>>> >
> >> >>>> > Is it ok? Or should I add a "baseurl" field of some kind to be able to
> >> >>>> > query all urls coming from a certain domain (1st or 2nd level as well)?
> >> >>>> >
> >> >>>> > Best,
> >> >>>> > Flavio
> >> >>>> >
> >> >>>> >
> >> >>>> > On Tue, Jun 25, 2013 at 12:28 PM, Jan Høydahl <jan....@cominvent.com> wrote:
> >> >>>> >
> >> >>>> >> Probably a good match for the RegExp feature of Solr (given that your
> >> >>>> >> url is not tokenized)
> >> >>>> >> e.g. q=url:/.*\.it$/
> >> >>>> >>
> >> >>>> >> --
> >> >>>> >> Jan Høydahl, search solution architect
> >> >>>> >> Cominvent AS - www.cominvent.com
> >> >>>> >>
> >> >>>> >> On 25 June 2013 at 12:17, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> >> >>>> >>
> >> >>>> >>> Hi to everybody,
> >> >>>> >>> I'm quite new to Solr so maybe my question could be trivial for you...
> >> >>>> >>> In my use case I have to index stuff contained in some URLs, so I use the
> >> >>>> >>> url as the key of my document and I treat it like a string.
> >> >>>> >>>
> >> >>>> >>> However, I'd like to be able to query by domain name, like *.it or
> >> >>>> >>> *.somesite.com. What's the best strategy? I thought of doing a URL-to-path
> >> >>>> >>> transformation and indexing it using solr.PathHierarchyTokenizerFactory,
> >> >>>> >>> but maybe there's a simpler solution, isn't there?
> >> >>>> >>>
> >> >>>> >>> Best,
> >> >>>> >>> Flavio
> >> >>>> >>>
> >> >>>> >>> --
> >> >>>> >>>
> >> >>>> >>> Flavio Pompermaier
> >> >>>> >>> Development Department
> >> >>>> >>> OKKAM Srl - www.okkam.it
> >> >>>> >>>
> >> >>>> >>> Phone: +(39) 0461 283 702
> >> >>>> >>> Fax: + (39) 0461 186 6433
> >> >>>> >>> Email: f.pomperma...@okkam.it
> >> >>>> >>> Headquarters: Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> >> >>>> >>> Registered office: Trento (Italy), via Segantini 23
> >> >>>> >>>
> >> >>>> >>> Confidentiality notice. This e-mail transmission may contain legally
> >> >>>> >>> privileged and/or confidential information. Please do not read it if you
> >> >>>> >>> are not the intended recipient(s). Any use, distribution, reproduction or
> >> >>>> >>> disclosure by any other person is strictly prohibited. If you have received
> >> >>>> >>> this e-mail in error, please notify the sender and destroy the original
> >> >>>> >>> transmission and its attachments without reading or saving it in any manner.
> >> >>>> >>
> >> >>>> >>
> >> >>>> >
> >> >>>> >
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >
>
