Re: URL search and indexing

Flavio Pompermaier Fri, 28 Jun 2013 01:26:00 -0700

Thanks for the explanation, I was missing exactly that!
Now things work correctly when using the post script as well.
However, I don't think I need norms if I use ids of the same length (UUIDs), right?
I think I just need strings with omitTermFreqAndPositions="false".
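
To make the effect of these two options concrete, here is a minimal sketch of the two scoring factors involved, assuming the classic Lucene TF-IDF similarity (tf = sqrt(freq), length norm = 1/sqrt(number of terms in the field)). It is not Solr code: the class and method names are hypothetical, and real Lucene norms are additionally stored with reduced precision. Note that the length norm depends on how many terms the field holds, not on how many characters each value has, so a single-valued string id always gets the same norm whatever its length.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Simplified sketch of two factors of Lucene's classic TF-IDF similarity.
 * Hypothetical helper class for illustration; not part of Solr or Lucene.
 */
final class ClassicScoreSketch {

    /** Term-frequency factor: sqrt(freq), or 1 for any match when frequencies are omitted. */
    static double tf(int freq, boolean omitTermFreq) {
        if (freq <= 0) {
            return 0.0;
        }
        return omitTermFreq ? 1.0 : Math.sqrt(freq);
    }

    /** Length norm: 1/sqrt(number of terms in the field), or 1 when norms are omitted. */
    static double lengthNorm(int numTerms, boolean omitNorms) {
        return omitNorms ? 1.0 : 1.0 / Math.sqrt(numTerms);
    }

    /** Counts how often a value occurs in a multiValued field. */
    static int freq(List<String> values, String term) {
        int n = 0;
        for (String v : values) {
            if (v.equals(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The itemid values of the two documents from the thread.
        List<String> doc1 = Arrays.asList("1000", "1000", "1000", "5000");
        List<String> doc2 = Arrays.asList("1000", "5000");

        // Norms stay omitted here; only the term-frequency option is toggled.
        for (boolean omitTf : new boolean[] {true, false}) {
            double s1 = tf(freq(doc1, "1000"), omitTf) * lengthNorm(doc1.size(), true);
            double s2 = tf(freq(doc2, "1000"), omitTf) * lengthNorm(doc2.size(), true);
            System.out.println("omitTermFreq=" + omitTf + " doc1=" + s1 + " doc2=" + s2);
        }
    }
}
```

With frequencies omitted, both documents get the same tf factor for itemid:1000; with frequencies enabled, the document holding the value three times gets the larger factor.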



On Thu, Jun 27, 2013 at 7:31 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Right, string fields are a little tricky, they're easy to confuse with
> fields that actually _do_ something.
>
> By default, norms and term frequencies are turned off for types based on
> 'class="solr.StrField"'. So field length normalization (i.e. terms that
> appear in shorter fields count more) and term frequency calculations are
> _not_ included in the score calculation.
>
> Try blowing your index away and adding this to your fields to see the
> difference....
>
> omitNorms="false" omitTermFreqAndPositions="false"
>
> You probably want to either turn these on explicitly for your string types
> or use a type based on 'class="solr.TextField"', since these options
> default to "false" for text fields. If you use something like
> KeywordTokenizerFactory, your URL also won't get split up into pieces.
> And in that case you can also normalize the values with something like
> LowerCaseFilterFactory, which you can't do with "string" types since they're
> completely unanalyzed.
>
> Best
> Erick
>
>
> On Wed, Jun 26, 2013 at 11:34 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
> > Obviously I messed up the email thread... However, I found a problem
> > indexing my documents via post.sh.
> > This is basically my schema.xml:
> >
> > <schema name="dopa-schema" version="1.5">
> >   <fields>
> >     <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> >     <field name="itemid" type="string" indexed="true" stored="true" multiValued="true"/>
> >     <field name="_version_" type="long" indexed="true" stored="true"/>
> >   </fields>
> >   <uniqueKey>url</uniqueKey>
> >   <types>
> >     <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
> >     <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
> >   </types>
> > </schema>
> >
> > and this is the document I tried to upload via post.sh:
> >
> > <add>
> > <doc>
> >   <field name="url">http://test.example.org/first.html</field>
> >   <field name="itemid">1000</field>
> >   <field name="itemid">1000</field>
> >   <field name="itemid">1000</field>
> >   <field name="itemid">5000</field>
> > </doc>
> > <doc>
> >   <field name="url">http://test.example.org/second.html</field>
> >   <field name="itemid">1000</field>
> >   <field name="itemid">5000</field>
> > </doc>
> > </add>
> >
> > When playing with the administration and debugging tools, I discovered that
> > searching for q=itemid:5000 gave me the same score for those docs, while I
> > was expecting different term frequencies between the first and the second.
> > In fact, using Java to upload the documents led to correct results (3
> > occurrences of item 1000 in the first doc and 1 in the second), e.g.:
> > document1.addField("itemid", "1000");
> > document1.addField("itemid", "1000");
> > document1.addField("itemid", "1000");
> >
> > Am I right or am I missing something else?
> >
> >
> > On Wed, Jun 26, 2013 at 5:18 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> > > If there is a bug... we should identify it. What's a sample post command
> > > that you issued?
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message----- From: Flavio Pompermaier
> > > Sent: Wednesday, June 26, 2013 10:53 AM
> > >
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: URL search and indexing
> > >
> > > I was doing exactly that and, thanks to the administration page and
> > > explanation/debugging, I checked whether the results were those expected.
> > > Unfortunately, the results were not correct when submitting updates through
> > > the post.sh script (which uses curl in the end).
> > > Probably, if it finds the same tag (same value for the same field name),
> > > it collapses them.
> > > Rewriting the same document in Java and submitting the updates made
> > > things work correctly.
> > >
> > > In my opinion this is a bug (of the entire process; I don't know whether
> > > this is a problem of curl or of the script itself).
> > >
> > > Best,
> > > Flavio
> > >
> > > On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > >  Flavio:
> > >>
> > >> You mention that you're new to Solr, so I thought I'd make sure
> > >> you know that the admin/analysis page is your friend! I flat
> > >> guarantee that as you try to index/search following the suggestions
> > >> you'll scratch your head at your results and.... you'll discover that
> > >> the analysis process isn't doing quite what you expect. The
> > >> admin/analysis page shows you the transformation of the input
> > >> at each stage, i.e. how the input is tokenized, what transformations
> > >> are applied to each token etc. It's invaluable!
> > >>
> > >> Best
> > >> Erick
> > >>
> > >> P.S. Feel free to un-check the "verbose" box, it provides lots
> > >> of information but can be overwhelming, especially at first!
> > >>
> > >> On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
> > >> <pomperma...@okkam.it> wrote:
> > >> > Ok thank you all for the great help!
> > >> > Now I'm ready to start playing with my index!
> > >> >
> > >> > Best,
> > >> > Flavio
> > >> >
> > >> >
> > >> > On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> > >> >
> > >> >> Yeah, URL Classify does only do so much. That's why you need to combine
> > >> >> multiple methods.
> > >> >>
> > >> >> As a fourth method, you could code up a short JavaScript
> > >> >> "StatelessScriptUpdateProcessor" that did something like take a full
> > >> >> domain name (such as output by URL Classify) and turn it into multiple
> > >> >> values, each with more of the prefix removed, so that "lucene.apache.org"
> > >> >> would index as:
> > >> >>
> > >> >> lucene.apache.org
> > >> >> apache.org
> > >> >> apache
> > >> >> .org
> > >> >> org
> > >> >>
> > >> >> And then the user could query by any of those partial domain names.
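
The suffix expansion described above can be sketched as follows. This is plain Java rather than the JavaScript update processor script suggested in the message, and the class and method names are hypothetical. It only produces the chain of dot-separated suffixes (lucene.apache.org, apache.org, org); the extra variants in the list above, such as "apache" and ".org", are left out.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper that expands a domain name into its suffixes,
 * dropping one leading label at a time.
 */
final class DomainSuffixes {

    static List<String> expand(String domain) {
        List<String> out = new ArrayList<>();
        String rest = domain;
        while (!rest.isEmpty()) {
            out.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) {
                break; // last label reached
            }
            rest = rest.substring(dot + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // Each value could be indexed into a multiValued string field.
        for (String s : expand("lucene.apache.org")) {
            System.out.println(s);
        }
    }
}
```

Indexing each returned value into a multiValued string field would let an exact-match query on "apache.org" find every document whose domain ends in that suffix, without wildcards.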
> > >> >>
> > >> >> But, if you simply tokenize the URL (copy the URL string to a text field),
> > >> >> you automatically get most of that. The user can query by a URL fragment,
> > >> >> such as "apache.org", ".org", "lucene.apache.org", etc. and the
> > >> >> tokenization will strip out the punctuation.
> > >> >>
> > >> >> I'll add this script to my list of examples to add in the next rev of my
> > >> >> book.
> > >> >>
> > >> >>
> > >> >> -- Jack Krupansky
> > >> >>
> > >> >> -----Original Message----- From: Flavio Pompermaier
> > >> >> Sent: Tuesday, June 25, 2013 10:06 AM
> > >> >>
> > >> >> To: solr-user@lucene.apache.org
> > >> >> Subject: Re: URL search and indexing
> > >> >>
> > >> >> I bought the book, but looking at the example I still don't understand
> > >> >> if it is possible to query all sub-URLs of my URL.
> > >> >> For example, if the URLClassifyProcessorFactory takes as input
> > >> >> "url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> > >> >> and produces outputs like
> > >> >> - "url_domain_s":"lucene.apache.org"
> > >> >> - "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> > >> >> How should I configure url_domain_s in order to be able to make queries
> > >> >> like '*.apache.org'?
> > >> >> How should I configure url_canonical_s in order to be able to make
> > >> >> queries like 'http://lucene.apache.org/solr/*'?
> > >> >> Is it better to have two different fields for the two queries, or could
> > >> >> I create just one field for both kinds of queries (obviously for the
> > >> >> former case I should then query something like *://.apache.org/*)?
> > >> >>
> > >> >>
> > >> >> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> > >> >>
> > >> >>> There are examples in my book:
> > >> >>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
> > >> >>>
> > >> >>> But... I still think you should use a tokenized text field as well - use
> > >> >>> all three: raw string, tokenized text, and URL classification fields.
> > >> >>> -- Jack Krupansky
> > >> >>>
> > >> >>> -----Original Message----- From: Flavio Pompermaier
> > >> >>> Sent: Tuesday, June 25, 2013 9:02 AM
> > >> >>> To: solr-user@lucene.apache.org
> > >> >>> Subject: Re: URL search and indexing
> > >> >>>
> > >> >>>
> > >> >>> That sounds like exactly what I'm looking for! However, I cannot find an
> > >> >>> example of how to use it... Could you help me, please?
> > >> >>> Moreover, about the id field: isn't it true that the id field shouldn't be
> > >> >>> analyzed, as suggested in
> > >> >>> http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document ?
> > >> >>>
> > >> >>> On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl <jan....@cominvent.com> wrote:
> > >> >>>
> > >> >>>> Sure you can query the url directly. Or if you choose you can split it up
> > >> >>>> into multiple components, e.g. using
> > >> >>>> http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html
> > >> >>>>
> > >> >>>>
> > >> >>>> --
> > >> >>>> Jan Høydahl, search solution architect
> > >> >>>> Cominvent AS - www.cominvent.com
> > >> >>>>
> > >> >>>> On 25 June 2013 at 14:10, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> > >> >>>>
> > >> >>>> > Sorry, but maybe I'm missing something here... Could I declare url as the
> > >> >>>> > key field and query it too?
> > >> >>>> > At the moment, my schema.xml looks like:
> > >> >>>> >
> > >> >>>> > <fields>
> > >> >>>> >   <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> > >> >>>> >   <field name="category" type="string" indexed="true" stored="true"/>
> > >> >>>> >   <field name="language" type="string" indexed="true" stored="true"/>
> > >> >>>> >   ...
> > >> >>>> >   <field name="_version_" type="long" indexed="true" stored="true"/>
> > >> >>>> > </fields>
> > >> >>>> > <uniqueKey>url</uniqueKey>
> > >> >>>> >
> > >> >>>> > Is it ok? Or should I add a "baseurl" field of some kind to be able to
> > >> >>>> > query all URLs coming from a certain domain (1st or 2nd level as well)?
> > >> >>>> >
> > >> >>>> > Best,
> > >> >>>> > Flavio
> > >> >>>> >
> > >> >>>> >
> > >> >>>> > On Tue, Jun 25, 2013 at 12:28 PM, Jan Høydahl <jan....@cominvent.com> wrote:
> > >> >>>> >
> > >> >>>> >> Probably a good match for the RegExp feature of Solr (given that your
> > >> >>>> >> url is not tokenized)
> > >> >>>> >> e.g. q=url:/.*\.it$/
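
As a rough client-side analogy to the regular expression query above (a sketch with hypothetical names and example URLs, not Solr code): as far as I know, Lucene regular expressions are matched against the whole indexed term, much like Java's String.matches, so a suffix pattern on the raw URL only matches URLs that end right after the domain. Extracting the host first, for instance into a separate field, avoids that limitation.

```java
import java.net.URI;

/**
 * Hypothetical helper contrasting a whole-string pattern on the full URL
 * with a suffix check on the extracted host.
 */
final class DomainMatch {

    /** Whole-string match: true only if the entire URL ends in ".it". */
    static boolean wholeUrlEndsWithIt(String url) {
        return url.matches(".*\\.it");
    }

    /** Extracts the host part, e.g. "www.example.it". */
    static String host(String url) {
        return URI.create(url).getHost();
    }

    /** True if the URL's host lies under the given top-level domain. */
    static boolean hostInTld(String url, String tld) {
        String h = host(url);
        return h != null && h.endsWith("." + tld);
    }

    public static void main(String[] args) {
        String bare = "http://www.example.it";
        String page = "http://www.example.it/page.html";
        System.out.println(wholeUrlEndsWithIt(bare));
        System.out.println(wholeUrlEndsWithIt(page));
        System.out.println(hostInTld(page, "it"));
    }
}
```

The pattern on the full URL accepts the bare domain but rejects the same site once a path is appended, while the host-based check accepts both.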
> > >> >>>> >>
> > >> >>>> >> --
> > >> >>>> >> Jan Høydahl, search solution architect
> > >> >>>> >> Cominvent AS - www.cominvent.com
> > >> >>>> >>
> > >> >>>> >> On 25 June 2013 at 12:17, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> > >> >>>> >>
> > >> >>>> >>> Hi everybody,
> > >> >>>> >>> I'm quite new to Solr, so maybe my question is trivial for you.
> > >> >>>> >>> In my use case I have to index content found at some URLs, so I use the
> > >> >>>> >>> url as the key of my document and I treat it like a string.
> > >> >>>> >>>
> > >> >>>> >>> However, I'd like to be able to query by domain name, like *.it or
> > >> >>>> >>> *.somesite.com. What's the best strategy? I thought of doing a URL-to-path
> > >> >>>> >>> transformation and indexing it using solr.PathHierarchyTokenizerFactory,
> > >> >>>> >>> but maybe there's a simpler solution, isn't there?
> > >> >>>> >>>
> > >> >>>> >>> Best,
> > >> >>>> >>> Flavio
> > >> >>>> >>>
> > >> >>>> >>> --
> > >> >>>> >>>
> > >> >>>> >>> Flavio Pompermaier
> > >> >>>> >>> Development Department
> > >> >>>> >>> OKKAM Srl - www.okkam.it
> > >> >>>> >>>
> > >> >>>> >>> Phone: +(39) 0461 283 702
> > >> >>>> >>> Fax: +(39) 0461 186 6433
> > >> >>>> >>> Email: f.pomperma...@okkam.it
> > >> >>>> >>> Headquarters: Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> > >> >>>> >>> Registered office: Trento (Italy), via Segantini 23
> > >> >>>> >>>
> > >> >>>> >>> Confidentiality notice. This e-mail transmission may contain legally
> > >> >>>> >>> privileged and/or confidential information. Please do not read it if you
> > >> >>>> >>> are not the intended recipient(s). Any use, distribution, reproduction or
> > >> >>>> >>> disclosure by any other person is strictly prohibited. If you have received
> > >> >>>> >>> this e-mail in error, please notify the sender and destroy the original
> > >> >>>> >>> transmission and its attachments without reading or saving it in any manner.
> > >> >>>> >>
> > >> >>>> >>
> > >> >>>> >
> > >> >>>> >
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>
> > >> >>>
> > >> >>
> > >> >>
> > >> >>
> > >> >
> > >> >
> > >> >
> > >>
> > >>
> > >
> >
>
