Thanks for the explanation, that was exactly what I was missing! Now things work correctly with the post script as well. However, I don't think I need norms if my ids all have the same length (UUIDs), right? I think I just need string fields with omitTermFreqAndPositions="false".
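
For illustration, that setting might look like the following in schema.xml. This is only a sketch: it assumes the itemid field from the schema quoted further down, and it spells out omitNorms="true" explicitly even though, per Erick's note, norms are already off by default for types based on solr.StrField.

```xml
<!-- Sketch only: keep norms off, turn term frequencies and positions on.
     Field name and the other attributes are taken from the quoted schema. -->
<field name="itemid" type="string" indexed="true" stored="true"
       multiValued="true" omitNorms="true" omitTermFreqAndPositions="false"/>
```

As Erick's reply suggests, the index would need to be cleared and rebuilt to see the effect of the change.
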
On Thu, Jun 27, 2013 at 7:31 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Right, string fields are a little tricky, they're easy to confuse with
> fields that actually _do_ something.
>
> By default, norms and term frequencies are turned off for types based on
> 'class="solr.StrField"'. So any field length normalization (i.e. terms that
> appear in shorter fields count more) and term frequency calculations are
> _not_ included in the score calculation.
>
> Try blowing your index away and adding this to your fields to see the
> difference....
>
> omitNorms="false" omitTermFreqAndPositions="false"
>
> You probably want to either turn these on explicitly for your string types
> or use a type based on 'class="solr.TextField"' since these options
> default to "false" for text fields. If you use something like
> "keywordTokenizerFactory" you also won't get your URL split up into pieces.
> And in that case you can also normalize the values with something like
> lowerCaseFilter which you can't do with "string" types since they're
> completely unanalyzed.
>
> Best
> Erick
>
> On Wed, Jun 26, 2013 at 11:34 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
> > Obviously I messed up the email thread... however, I found a problem
> > indexing my document via post.sh.
> > This is basically my schema.xml:
> >
> > <schema name="dopa-schema" version="1.5">
> >   <fields>
> >     <field name="url" type="string" indexed="true" stored="true"
> >            required="true" multiValued="false" />
> >     <field name="itemid" type="string" indexed="true" stored="true"
> >            multiValued="true"/>
> >     <field name="_version_" type="long" indexed="true" stored="true"/>
> >   </fields>
> >   <uniqueKey>url</uniqueKey>
> >   <types>
> >     <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
> >     <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
> >                positionIncrementGap="0"/>
> >   </types>
> > </schema>
> >
> > and this is the document I tried to upload via post.sh:
> >
> > <add>
> >   <doc>
> >     <field name="url">http://test.example.org/first.html</field>
> >     <field name="itemid">1000</field>
> >     <field name="itemid">1000</field>
> >     <field name="itemid">1000</field>
> >     <field name="itemid">5000</field>
> >   </doc>
> >   <doc>
> >     <field name="url">http://test.example.org/second.html</field>
> >     <field name="itemid">1000</field>
> >     <field name="itemid">5000</field>
> >   </doc>
> > </add>
> >
> > When playing with the administration and debugging tools I discovered that
> > searching for q=itemid:5000 gave me the same score for those docs, while I
> > was expecting different term frequencies between the first and the second.
> > In fact, using Java to upload the documents led to correct results (3
> > occurrences of item 1000 in the first doc and 1 in the second), e.g.:
> > document1.addField("itemid", "1000");
> > document1.addField("itemid", "1000");
> > document1.addField("itemid", "1000");
> >
> > Am I right or am I missing something else?
> >
> > On Wed, Jun 26, 2013 at 5:18 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> > > If there is a bug... we should identify it. What's a sample post command
> > > that you issued?
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message----- From: Flavio Pompermaier
> > > Sent: Wednesday, June 26, 2013 10:53 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: URL search and indexing
> > >
> > > I was doing exactly that and, thanks to the administration page and
> > > explanation/debugging, I checked whether the results were those expected.
> > > Unfortunately, the results were not correct when submitting updates through
> > > the post.sh script (which uses curl in the end).
> > > Probably, if it finds the same tag (same value for the same field name),
> > > it collapses them.
> > > Rewriting the same document in Java and submitting the updates made
> > > things work correctly.
> > >
> > > In my opinion this is a bug (of the entire process; I don't know whether
> > > this is a problem of curl or of the script itself).
> > >
> > > Best,
> > > Flavio
> > >
> > > On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > Flavio:
> > >>
> > >> You mention that you're new to Solr, so I thought I'd make sure
> > >> you know that the admin/analysis page is your friend! I flat
> > >> guarantee that as you try to index/search following the suggestions
> > >> you'll scratch your head at your results and.... you'll discover that
> > >> the analysis process isn't doing quite what you expect. The
> > >> admin/analysis page shows you the transformation of the input
> > >> at each stage, i.e. how the input is tokenized, what transformations
> > >> are applied to each token etc. It's invaluable!
> > >>
> > >> Best
> > >> Erick
> > >>
> > >> P.S. Feel free to un-check the "verbose" box, it provides lots
> > >> of information but can be overwhelming, especially at first!
> > >>
> > >> On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
> > >> <pomperma...@okkam.it> wrote:
> > >> > Ok thank you all for the great help!
> > >> > Now I'm ready to start playing with my index!
> > >> >
> > >> > Best,
> > >> > Flavio
> > >> >
> > >> > On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> > >> >
> > >> >> Yeah, URL Classify does only do so much. That's why you need to combine
> > >> >> multiple methods.
> > >> >>
> > >> >> As a fourth method, you could code up a short JavaScript
> > >> >> "StatelessScriptUpdateProcessor" that did something like take a full
> > >> >> domain name (such as output by URL Classify) and turn it into multiple
> > >> >> values, each with more of the prefix removed, so that "lucene.apache.org"
> > >> >> would index as:
> > >> >>
> > >> >> lucene.apache.org
> > >> >> apache.org
> > >> >> apache
> > >> >> .org
> > >> >> org
> > >> >>
> > >> >> And then the user could query by any of those partial domain names.
> > >> >>
> > >> >> But, if you simply tokenize the URL (copy the URL string to a text field),
> > >> >> you automatically get most of that. The user can query by a URL fragment,
> > >> >> such as "apache.org", ".org", "lucene.apache.org", etc. and the
> > >> >> tokenization will strip out the punctuation.
> > >> >>
> > >> >> I'll add this script to my list of examples to add in the next rev of
> > >> >> my book.
> > >> >>
> > >> >> -- Jack Krupansky
> > >> >>
> > >> >> -----Original Message----- From: Flavio Pompermaier
> > >> >> Sent: Tuesday, June 25, 2013 10:06 AM
> > >> >> To: solr-user@lucene.apache.org
> > >> >> Subject: Re: URL search and indexing
> > >> >>
> > >> >> I bought the book and, looking at the example, I still don't understand
> > >> >> whether it is possible to query all sub-urls of my URL.
> > >> >> For example, if the URLClassifyProcessorFactory takes as input
> > >> >> "url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> > >> >> and produces outputs like
> > >> >> - "url_domain_s":"lucene.apache.org"
> > >> >> - "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> > >> >> How should I configure url_domain_s in order to be able to make queries
> > >> >> like '*.apache.org'?
> > >> >> How should I configure url_canonical_s in order to be able to make queries
> > >> >> like 'http://lucene.apache.org/solr/*'?
> > >> >> Is it better to have two different fields for the two queries, or could I
> > >> >> create just one field for the two kinds of queries (obviously for the
> > >> >> former case I should then query something like *://.apache.org/*)?
> > >> >>
> > >> >> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> > >> >>
> > >> >>> There are examples in my book:
> > >> >>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
> > >> >>>
> > >> >>> But... I still think you should use a tokenized text field as well - use
> > >> >>> all three: raw string, tokenized text, and URL classification fields.
> > >> >>>
> > >> >>> -- Jack Krupansky
> > >> >>>
> > >> >>> -----Original Message----- From: Flavio Pompermaier
> > >> >>> Sent: Tuesday, June 25, 2013 9:02 AM
> > >> >>> To: solr-user@lucene.apache.org
> > >> >>> Subject: Re: URL search and indexing
> > >> >>>
> > >> >>> That sounds like exactly what I'm looking for! However, I cannot find an
> > >> >>> example of how to use it... could you help me please?
> > >> >>> Moreover, about the id field, isn't it true that the id field shouldn't be
> > >> >>> analyzed, as suggested in
> > >> >>> http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document
> > >> >>> ?
> > >> >>>
> > >> >>> On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl <jan....@cominvent.com> wrote:
> > >> >>>
> > >> >>>> Sure you can query the url directly. Or if you choose you can split it up
> > >> >>>> in multiple components, e.g.
> > >> >>>> using
> > >> >>>> http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html
> > >> >>>>
> > >> >>>> --
> > >> >>>> Jan Høydahl, search solution architect
> > >> >>>> Cominvent AS - www.cominvent.com
> > >> >>>>
> > >> >>>> On 25 June 2013 at 14:10, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> > >> >>>>
> > >> >>>> > Sorry, but maybe I'm missing something here... could I declare url as
> > >> >>>> > the key field and query it too?
> > >> >>>> > At the moment, my schema.xml looks like:
> > >> >>>> >
> > >> >>>> > <fields>
> > >> >>>> >   <field name="url" type="string" indexed="true" stored="true"
> > >> >>>> >          required="true" multiValued="false" />
> > >> >>>> >
> > >> >>>> >   <field name="category" type="string" indexed="true" stored="true"/>
> > >> >>>> >   <field name="language" type="string" indexed="true" stored="true"/>
> > >> >>>> >   ...
> > >> >>>> >   <field name="_version_" type="long" indexed="true" stored="true"/>
> > >> >>>> >
> > >> >>>> > </fields>
> > >> >>>> > <uniqueKey>url</uniqueKey>
> > >> >>>> >
> > >> >>>> > Is it ok?
> > >> >>>> > Or should I add a "baseurl" field of some kind to be able to
> > >> >>>> > query all urls coming from a certain domain (1st or 2nd level as well)?
> > >> >>>> >
> > >> >>>> > Best,
> > >> >>>> > Flavio
> > >> >>>> >
> > >> >>>> > On Tue, Jun 25, 2013 at 12:28 PM, Jan Høydahl <jan....@cominvent.com> wrote:
> > >> >>>> >
> > >> >>>> >> Probably a good match for the RegExp feature of Solr (given that your
> > >> >>>> >> url is not tokenized)
> > >> >>>> >> e.g. q=url:/.*\.it$/
> > >> >>>> >>
> > >> >>>> >> --
> > >> >>>> >> Jan Høydahl, search solution architect
> > >> >>>> >> Cominvent AS - www.cominvent.com
> > >> >>>> >>
> > >> >>>> >> On 25 June 2013 at 12:17, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> > >> >>>> >>
> > >> >>>> >>> Hi everybody,
> > >> >>>> >>> I'm quite new to Solr so maybe my question could be trivial for you...
> > >> >>>> >>> In my use case I have to index stuff contained in some URLs, so I use
> > >> >>>> >>> the url as the key of my document and I treat it like a string.
> > >> >>>> >>>
> > >> >>>> >>> However, I'd like to be able to query by domain name, like *.it or
> > >> >>>> >>> *.somesite.com. What's the best strategy? I thought of doing a
> > >> >>>> >>> URL-to-path transformation and indexing it using
> > >> >>>> >>> solr.PathHierarchyTokenizerFactory, but maybe there's a simpler
> > >> >>>> >>> solution, isn't there?
> > >> >>>> >>>
> > >> >>>> >>> Best,
> > >> >>>> >>> Flavio
> > >> >>>> >>>
> > >> >>>> >>> --
> > >> >>>> >>>
> > >> >>>> >>> Flavio Pompermaier
> > >> >>>> >>> *Development Department*
> > >> >>>> >>> _______________________________________________
> > >> >>>> >>> *OKKAM Srl - www.okkam.it*
> > >> >>>> >>>
> > >> >>>> >>> *Phone:* +(39) 0461 283 702
> > >> >>>> >>> *Fax:* + (39) 0461 186 6433
> > >> >>>> >>> *Email:* f.pomperma...@okkam.it
> > >> >>>> >>> *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> > >> >>>> >>> *Registered office:* Trento (Italy), via Segantini 23
> > >> >>>> >>>
> > >> >>>> >>> Confidentiality notice. This e-mail transmission may contain legally
> > >> >>>> >>> privileged and/or confidential information. Please do not read it if
> > >> >>>> >>> you are not the intended recipient(s). Any use, distribution,
> > >> >>>> >>> reproduction or disclosure by any other person is strictly prohibited.
> > >> >>>> >>> If you have received this e-mail in error, please notify the sender
> > >> >>>> >>> and destroy the original transmission and its attachments without
> > >> >>>> >>> reading or saving it in any manner.