Field length normalisation is based on the number of terms in a field, not the number of characters in a term. With multivalued string fields, I guess that means a field with lots of values (but only one match) would score lower than a field with just the one matching value.
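To put rough numbers on that, here is a minimal Java sketch. It assumes Lucene's classic TF-IDF similarity, where the length norm is 1/sqrt(number of terms in the field) and the term-frequency factor is sqrt(freq). The class and method names are made up for illustration. The real index stores norms in a lossy one-byte encoding, so the values Solr reports in debug output will not match these exactly. The term counts are those of the two example documents quoted further down (four itemid values in the first, two in the second), and none of this applies unless norms and term frequencies are enabled on the field.

```java
public class NormSketch {
    // Classic TF-IDF length norm, before lossy encoding: 1 / sqrt(terms in field).
    // For an unanalyzed multivalued string field, each value counts as one term.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Classic TF-IDF term-frequency factor: sqrt(occurrences of the term).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // first.html:  itemid = 1000, 1000, 1000, 5000 (4 terms)
        // second.html: itemid = 1000, 5000             (2 terms)
        System.out.printf("norm first=%.3f second=%.3f%n", lengthNorm(4), lengthNorm(2));
        // itemid:1000 occurs three times in the first doc and once in the second.
        System.out.printf("tf(1000) first=%.3f second=%.3f%n", tf(3), tf(1));
    }
}
```

Under these assumptions the first document gets a norm of 0.5 and the second about 0.707, so for a term that occurs once in each (itemid:5000) the document with fewer values comes out ahead.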
Upayavira

On Fri, Jun 28, 2013, at 09:24 AM, Flavio Pompermaier wrote:
> Thanks for the explanation, I was missing exactly that!
> Now things also work correctly using the post script.
> However, I don't think I need norms if I use ids of the same length (UUID),
> right?
> I just need strings with omitTermFreqAndPositions="false", I think.
>
>
> On Thu, Jun 27, 2013 at 7:31 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
> > Right, string fields are a little tricky, they're easy to confuse with
> > fields that actually _do_ something.
> >
> > By default, norms and term frequencies are turned off for types based on
> > 'class="solr.StrField"'. So any field length normalization (i.e. terms that
> > appear in shorter fields count more) and term frequency calculations are
> > _not_ included in the score calculation.
> >
> > Try blowing your index away and adding this to your fields to see the
> > difference....
> >
> > omitNorms="false" omitTermFreqAndPositions="false"
> >
> > You probably want to either turn these on explicitly for your string types
> > or use a type based on 'class="solr.TextField"', since these options
> > default to "false" for text fields. If you use something like
> > KeywordTokenizerFactory you also won't get your URL split up into pieces.
> > And in that case you can also normalize the values with something like
> > LowerCaseFilterFactory, which you can't do with "string" types since they're
> > completely unanalyzed.
> >
> > Best
> > Erick
> >
> >
> > On Wed, Jun 26, 2013 at 11:34 AM, Flavio Pompermaier
> > <pomperma...@okkam.it> wrote:
> >
> > > Obviously I messed up the email thread... however, I found a problem
> > > indexing my document via post.sh.
> > > This is basically my schema.xml:
> > >
> > > <schema name="dopa-schema" version="1.5">
> > >   <fields>
> > >     <field name="url" type="string" indexed="true" stored="true"
> > >            required="true" multiValued="false" />
> > >     <field name="itemid" type="string" indexed="true" stored="true"
> > >            multiValued="true"/>
> > >     <field name="_version_" type="long" indexed="true" stored="true"/>
> > >   </fields>
> > >   <uniqueKey>url</uniqueKey>
> > >   <types>
> > >     <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
> > >     <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
> > >                positionIncrementGap="0"/>
> > >   </types>
> > > </schema>
> > >
> > > and this is the document I tried to upload via post.sh:
> > >
> > > <add>
> > >   <doc>
> > >     <field name="url">http://test.example.org/first.html</field>
> > >     <field name="itemid">1000</field>
> > >     <field name="itemid">1000</field>
> > >     <field name="itemid">1000</field>
> > >     <field name="itemid">5000</field>
> > >   </doc>
> > >   <doc>
> > >     <field name="url">http://test.example.org/second.html</field>
> > >     <field name="itemid">1000</field>
> > >     <field name="itemid">5000</field>
> > >   </doc>
> > > </add>
> > >
> > > When playing with the administration and debugging tools I discovered that
> > > searching for q=itemid:5000 gave me the same score for both docs, while I
> > > was expecting different term frequencies between the first and the second.
> > > In fact, using Java to upload the documents led to correct results (3
> > > occurrences of item 1000 in the first doc and 1 in the second), e.g.:
> > > document1.addField("itemid", "1000");
> > > document1.addField("itemid", "1000");
> > > document1.addField("itemid", "1000");
> > >
> > > Am I right or am I missing something else?
> > >
> > >
> > > On Wed, Jun 26, 2013 at 5:18 PM, Jack Krupansky <j...@basetechnology.com>
> > > wrote:
> > >
> > > > If there is a bug... we should identify it.
> > > > What's a sample post command
> > > > that you issued?
> > > >
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > -----Original Message----- From: Flavio Pompermaier
> > > > Sent: Wednesday, June 26, 2013 10:53 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: URL search and indexing
> > > >
> > > > I was doing exactly that and, thanks to the administration page and
> > > > explanation/debugging, I checked whether the results were those expected.
> > > > Unfortunately, the results were not correct when submitting updates through
> > > > the post.sh script (which uses curl in the end).
> > > > Probably, if it finds the same tag (same value for the same field name),
> > > > it collapses them.
> > > > Rewriting the same document in Java and submitting the updates made
> > > > things work correctly.
> > > >
> > > > In my opinion this is a bug (of the entire process; I don't know whether
> > > > it is a problem of curl or of the script itself).
> > > >
> > > > Best,
> > > > Flavio
> > > >
> > > > On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson <erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > >> Flavio:
> > > >>
> > > >> You mention that you're new to Solr, so I thought I'd make sure
> > > >> you know that the admin/analysis page is your friend! I flat
> > > >> guarantee that as you try to index/search following the suggestions
> > > >> you'll scratch your head at your results and.... you'll discover that
> > > >> the analysis process isn't doing quite what you expect. The
> > > >> admin/analysis page shows you the transformation of the input
> > > >> at each stage, i.e. how the input is tokenized, what transformations
> > > >> are applied to each token etc. It's invaluable!
> > > >>
> > > >> Best
> > > >> Erick
> > > >>
> > > >> P.S. Feel free to un-check the "verbose" box, it provides lots
> > > >> of information but can be overwhelming, especially at first!
> > > >>
> > > >> On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
> > > >> <pomperma...@okkam.it> wrote:
> > > >> > Ok thank you all for the great help!
> > > >> > Now I'm ready to start playing with my index!
> > > >> >
> > > >> > Best,
> > > >> > Flavio
> > > >> >
> > > >> >
> > > >> > On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky
> > > >> > <j...@basetechnology.com> wrote:
> > > >> >
> > > >> >> Yeah, URL Classify does only do so much. That's why you need to combine
> > > >> >> multiple methods.
> > > >> >>
> > > >> >> As a fourth method, you could code up a short JavaScript
> > > >> >> "StatelessScriptUpdateProcessor" that did something like take a full
> > > >> >> domain name (such as output by URL Classify) and turn it into multiple
> > > >> >> values, each with more of the prefix removed, so that "lucene.apache.org"
> > > >> >> would index as:
> > > >> >>
> > > >> >> lucene.apache.org
> > > >> >> apache.org
> > > >> >> apache
> > > >> >> .org
> > > >> >> org
> > > >> >>
> > > >> >> And then the user could query by any of those partial domain names.
> > > >> >>
> > > >> >> But, if you simply tokenize the URL (copy the URL string to a text field),
> > > >> >> you automatically get most of that. The user can query by a URL fragment,
> > > >> >> such as "apache.org", ".org", "lucene.apache.org", etc. and the
> > > >> >> tokenization will strip out the punctuation.
> > > >> >>
> > > >> >> I'll add this script to my list of examples to add in the next rev of my
> > > >> >> book.
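The prefix-stripping idea described above can be sketched in a few lines. This is not the script from the book and it is not wired into Solr: it is a standalone Java illustration (the thread suggests JavaScript inside an update processor), and the class and method names are hypothetical. It is also a simplification that emits only the label-aligned suffixes (lucene.apache.org, apache.org, org), not the extra "apache" and ".org" variants listed in the message.

```java
import java.util.ArrayList;
import java.util.List;

public class DomainSuffixes {
    // Returns the domain followed by each shorter suffix obtained by
    // dropping one leading label at a time.
    static List<String> suffixes(String domain) {
        List<String> out = new ArrayList<>();
        String rest = domain;
        while (!rest.isEmpty()) {
            out.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) {
                break; // last label reached, nothing left to strip
            }
            rest = rest.substring(dot + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // Each returned value would be added to a multivalued string field.
        for (String s : suffixes("lucene.apache.org")) {
            System.out.println(s);
        }
    }
}
```

Indexing these values into a multivalued string field would let an exact-match query on "apache.org" find every document under that domain without wildcards.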
> > > >> >>
> > > >> >>
> > > >> >> -- Jack Krupansky
> > > >> >>
> > > >> >> -----Original Message----- From: Flavio Pompermaier
> > > >> >> Sent: Tuesday, June 25, 2013 10:06 AM
> > > >> >> To: solr-user@lucene.apache.org
> > > >> >> Subject: Re: URL search and indexing
> > > >> >>
> > > >> >> I bought the book and, looking at the example, I still don't understand
> > > >> >> whether it is possible to query all sub-URLs of my URL.
> > > >> >> For example, if the URLClassifyProcessorFactory takes as input
> > > >> >> "url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> > > >> >> and produces outputs like
> > > >> >> - "url_domain_s":"lucene.apache.org"
> > > >> >> - "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> > > >> >> How should I configure url_domain_s in order to be able to make queries
> > > >> >> like '*.apache.org'?
> > > >> >> How should I configure url_canonical_s in order to be able to make queries
> > > >> >> like 'http://lucene.apache.org/solr/*'?
> > > >> >> Is it better to have two different fields for the two queries, or could I
> > > >> >> create just one field for the two kinds of queries (obviously for the
> > > >> >> former case I should then query something like *://.apache.org/*)?
> > > >> >>
> > > >> >>
> > > >> >> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky
> > > >> >> <j...@basetechnology.com> wrote:
> > > >> >>
> > > >> >>> There are examples in my book:
> > > >> >>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
> > > >> >>>
> > > >> >>> But... I still think you should use a tokenized text field as well - use
> > > >> >>> all three: raw string, tokenized text, and URL classification fields.
> > > >> >>>
> > > >> >>> -- Jack Krupansky
> > > >> >>>
> > > >> >>> -----Original Message----- From: Flavio Pompermaier
> > > >> >>> Sent: Tuesday, June 25, 2013 9:02 AM
> > > >> >>> To: solr-user@lucene.apache.org
> > > >> >>> Subject: Re: URL search and indexing
> > > >> >>>
> > > >> >>>
> > > >> >>> That sounds exactly like what I'm looking for!
> > > >> >>> However, I cannot find an example
> > > >> >>> of how to use it... could you help me please?
> > > >> >>> Moreover, about the id field: isn't it true that the id field shouldn't be
> > > >> >>> analyzed, as suggested in
> > > >> >>> http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document
> > > >> >>> ?
> > > >> >>>
> > > >> >>>
> > > >> >>> On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl <jan....@cominvent.com>
> > > >> >>> wrote:
> > > >> >>>
> > > >> >>>> Sure you can query the url directly. Or if you choose you can split it up
> > > >> >>>> in multiple components, e.g.
> > > >> >>>> using
> > > >> >>>> http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html
> > > >> >>>>
> > > >> >>>> --
> > > >> >>>> Jan Høydahl, search solution architect
> > > >> >>>> Cominvent AS - www.cominvent.com
> > > >> >>>>
> > > >> >>>> On 25 June 2013 at 14:10, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> > > >> >>>>
> > > >> >>>> > Sorry, but maybe I'm missing something here... could I declare url as the
> > > >> >>>> > key field and query it too?
> > > >> >>>> > At the moment, my schema.xml looks like:
> > > >> >>>> >
> > > >> >>>> > <fields>
> > > >> >>>> >   <field name="url" type="string" indexed="true" stored="true"
> > > >> >>>> >          required="true" multiValued="false" />
> > > >> >>>> >
> > > >> >>>> >   <field name="category" type="string" indexed="true" stored="true"/>
> > > >> >>>> >   <field name="language" type="string" indexed="true" stored="true"/>
> > > >> >>>> >   ...
> > > >> >>>> >   <field name="_version_" type="long" indexed="true" stored="true"/>
> > > >> >>>> >
> > > >> >>>> > </fields>
> > > >> >>>> > <uniqueKey>url</uniqueKey>
> > > >> >>>> >
> > > >> >>>> > Is it ok? Or should I add a "baseurl" field of some kind to be able to
> > > >> >>>> > query all urls coming from a certain domain (1st or 2nd level as well)?
> > > >> >>>> >
> > > >> >>>> > Best,
> > > >> >>>> > Flavio
> > > >> >>>> >
> > > >> >>>> >
> > > >> >>>> > On Tue, Jun 25, 2013 at 12:28 PM, Jan Høydahl <jan....@cominvent.com>
> > > >> >>>> > wrote:
> > > >> >>>> >
> > > >> >>>> >> Probably a good match for the RegExp feature of Solr (given that your
> > > >> >>>> >> url is not tokenized)
> > > >> >>>> >> e.g. q=url:/.*\.it$/
> > > >> >>>> >>
> > > >> >>>> >> --
> > > >> >>>> >> Jan Høydahl, search solution architect
> > > >> >>>> >> Cominvent AS - www.cominvent.com
> > > >> >>>> >>
> > > >> >>>> >> On 25 June 2013 at 12:17, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> > > >> >>>> >>
> > > >> >>>> >>> Hi to everybody,
> > > >> >>>> >>> I'm quite new to Solr so maybe my question could be trivial for you..
> > > >> >>>> >>> In my use case I have to index stuff contained in some URLs, so I use
> > > >> >>>> >>> the url as the key of my document and I treat it like a string.
> > > >> >>>> >>>
> > > >> >>>> >>> However, I'd like to be able to query by domain name, like *.it or
> > > >> >>>> >>> *.somesite.com. What's the best strategy? I thought of making a
> > > >> >>>> >>> URL-to-path transformation and indexing it using
> > > >> >>>> >>> solr.PathHierarchyTokenizerFactory, but maybe there's a simpler
> > > >> >>>> >>> solution, isn't there?
> > > >> >>>> >>>
> > > >> >>>> >>> Best,
> > > >> >>>> >>> Flavio
> > > >> >>>> >>>
> > > >> >>>> >>> --
> > > >> >>>> >>>
> > > >> >>>> >>> Flavio Pompermaier
> > > >> >>>> >>> *Development Department*
> > > >> >>>> >>> _______________________________________________
> > > >> >>>> >>> *OKKAM Srl - www.okkam.it*
> > > >> >>>> >>>
> > > >> >>>> >>> *Phone:* +(39) 0461 283 702
> > > >> >>>> >>> *Fax:* + (39) 0461 186 6433
> > > >> >>>> >>> *Email:* f.pomperma...@okkam.it
> > > >> >>>> >>> *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> > > >> >>>> >>> *Registered office:* Trento (Italy), via Segantini 23
> > > >> >>>> >>>
> > > >> >>>> >>> Confidentially notice. This e-mail transmission may contain legally
> > > >> >>>> >>> privileged and/or confidential information. Please do not read it if you
> > > >> >>>> >>> are not the intended recipient(S). Any use, distribution, reproduction or
> > > >> >>>> >>> disclosure by any other person is strictly prohibited. If you have received
> > > >> >>>> >>> this e-mail in error, please notify the sender and destroy the original
> > > >> >>>> >>> transmission and its attachments without reading or saving it in any manner.