impact of omitTermFreqAndPositions="true"

2012-01-10 Thread Samarendra Pratap
Hi,
 I understand that setting omitTermFreqAndPositions="true" for a field in
schema.xml stores less information in the index, at the cost of some
restrictions, e.g. phrase searches no longer work.

 But does setting this property to "true" for a field which is of type
"string" or "int", or is analyzed by KeywordAnalyzer, make any difference in
index size, performance, etc.?
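For concreteness, the setting under discussion looks like this in schema.xml (the field name is illustrative):

```xml
<field name="sku" type="string" indexed="true" stored="true"
       omitTermFreqAndPositions="true"/>
```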

Thanks

-- 
Regards,
Samar


Question about updating index with custom field types

2012-01-10 Thread 罗赛
Hello everyone,

I have a question about how to update the index using XML messages when there
are some complex custom field types in my index... like:

And field offer has some attributes in it...

I've read the page http://wiki.apache.org/solr/UpdateXmlMessages and the
example shows that the XML should be like:


  
<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>



So, could you tell me how to write the XML, or is there any other method to
update the index with custom field types?

Thanks,

-- 
Best wishes

Sai


Re: best way to force substitutions in data

2012-01-10 Thread Dmitry Kan
how about using regular expressions:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory
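A sketch of how that could look for the part-number substitution described below (the field type name is illustrative; with 130 known mappings you would need one charFilter per pattern, or a single alternation):

```xml
<fieldType name="text_partnum" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- rewrite the old part number to the new one before tokenization;
         part numbers are from the example below -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\bM123b\b" replacement="C987"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```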


On Tue, Jan 10, 2012 at 1:14 AM, geeky2  wrote:

> Hello all,
>
> i have been reading the solr book as well as searching the archives of this
> list to learn how the various processors and transformers work, but have
> not
> found anything that looks like a good fit.
>
> i have a database with approximately 7Million rows that i am bringing in to
> solr.
>
> for a very small sub-set of these 7Million rows (about 130 rows), i need to
> substitute an old part number for a new part number.  i know ahead of time
> all 130 part numbers that need to be substituted and their new values.
>
> example:
>
> old part#   new part#
> M123b   C987
>
> thank you,
> mark
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/best-way-to-force-substitutions-in-data-tp3646195p3646195.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,

Dmitry Kan


Re: best way to force substitutions in data

2012-01-10 Thread Gora Mohanty
On Tue, Jan 10, 2012 at 4:44 AM, geeky2  wrote:
[...]
> i have a database with approximately 7Million rows that i am bringing in to
> solr.
>
> for a very small sub-set of these 7Million rows (about 130 rows), i need to
> substitute an old part number for a new part number.  i know ahead of time
> all 130 part numbers that need to be substituted and their new values.
>
> example:
>
> old part#       new part#
> M123b           C987
[...]

* Is it possible to do these replacements in the database, once and for
  all?
* Do you want the part numbers to be replaced, or for both to exist? In
  the latter case, a user can find the part by searching for either the old
  or the new part number, so I believe that this should be the preferred
  solution.
  - For replacement while indexing, use a transformer. A RegexTransformer,
or ScriptTransformer should do the trick. Please see
http://wiki.apache.org/solr/DataImportHandler#Transformer
  - For keeping both values, use synonyms.
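For the replacement route, a DIH entity using RegexTransformer might look like this (a sketch; the SQL query and column name are illustrative, part numbers are from the example above):

```xml
<entity name="parts" transformer="RegexTransformer"
        query="select part_number, description from parts">
  <!-- map the old part number to the new one at import time -->
  <field column="part_number" sourceColName="part_number"
         regex="^M123b$" replaceWith="C987"/>
</entity>
```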

Regards,
Gora


Facet Query using Dates

2012-01-10 Thread Mauro Asprea
Hi, I'm having issues using the "new" way of faceting dates with Query 
Facets. 

The issue is that it returns wrong counts. I tested using a Date Facet 
instead, and the Date Facet did return correct counts. I'm using the Sunspot 
RSolr client, and I'm also using the new grouping (field collapsing) feature.

This is my query tested against the Solr admin web interface:

/select?wt=ruby&fq=type:Movie&fq=event_id_i:[1%20TO%20*]&sort=location_weight_i%20desc&q="Actividad%20paranormal%203"&fl=*%20score&qf=name_texts%20location_name_text&defType=dismax&start=0&rows=12&group=true&group.field=event_id_str_s&group.field=location_name_str_s&group.sort=date_start_dt%20asc&group.limit=10&group.limit=1&facet=true&f.date_start_facet_dt.facet.mincount=1&facet.date=date_start_facet_dt&f.date_start_facet_dt.facet.date.start=2012-01-10T09:44:22Z&f.date_start_facet_dt.facet.date.end=2012-01-11T08:59:59Z&f.date_start_facet_dt.facet.date.gap=%2B86400SECONDS&facet.query=-date_start_facet_dt:[2012\-01\-10T09\:44\:22Z%20TO%202012\-01\-11T08\:59\:59Z]

The important parts here are:

The Query Facet
facet.query=-date_start_facet_dt:[2012\-01\-10T09\:44\:22Z TO 
2012\-01\-11T08\:59\:59Z]


The Date Facet
f.date_start_facet_dt.facet.mincount=1
facet.date=date_start_facet_dt
f.date_start_facet_dt.facet.date.start=2012-01-10T09:44:22Z
f.date_start_facet_dt.facet.date.end=2012-01-11T08:59:59Z
f.date_start_facet_dt.facet.date.gap=%2B86400SECONDS

As you can see both facets have the same "range"

Now the important part of the results:
'facet_counts'=>{
  'facet_queries'=>{
'-date_start_facet_dt:[2012\\-01\\-10T09\\:44\\:22Z TO 
2012\\-01\\-11T08\\:59\\:59Z]' => 26},
  'facet_fields'=>{},
  'facet_dates'=> {
'date_start_facet_dt'=>{
'2012-01-10T09:44:22Z'=>4,
'gap'=>'+86400SECONDS',
'start'=>'2012-01-10T09:44:22Z',
'end'=>'2012-01-11T09:44:22Z'}}
  ,'facet_ranges'=>{}}}


As you can see, for the same range I'm getting different counts. The CORRECT 
one is the facet_dates count.

BTW I'm using Solr Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 
14:54:38 

Is this a known bug? Please help! :D

-- 
Mauro Asprea

E-Mail: mauroasp...@gmail.com
Mobile: +34 654297582
Skype: mauro.asprea




Re: Facet Query using Dates

2012-01-10 Thread Mauro Asprea
I think I solved it... It seems the problem was the leading "-" just before 
the query facet field name: it negates the range query, so the facet was 
counting documents outside the range.
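For reference, if the negation was unintentional, dropping the leading "-" makes the query facet count the same documents as the date facet:

```text
facet.query=date_start_facet_dt:[2012\-01\-10T09\:44\:22Z TO 2012\-01\-11T08\:59\:59Z]
```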

-- 
Mauro Asprea

E-Mail: mauroasp...@gmail.com
Mobile: +34 654297582
Skype: mauro.asprea



On Tuesday, January 10, 2012 at 11:33 AM, Mauro Asprea wrote:

> Hi, I'm having issues using the "new" way of faceting dates with Query 
> Facets. 
> 
> The issue is that it returns wrong counts. I tested using a Date 
> Facet instead, and the Date Facet did return correct counts. I'm using 
> the Sunspot RSolr client, and I'm also using the new grouping (field 
> collapsing) feature.
> 
> This is my query tested against the Solr admin web interface:
> 
> /select?wt=ruby&fq=type:Movie&fq=event_id_i:[1%20TO%20*]&sort=location_weight_i%20desc&q="Actividad%20paranormal%203"&fl=*%20score&qf=name_texts%20location_name_text&defType=dismax&start=0&rows=12&group=true&group.field=event_id_str_s&group.field=location_name_str_s&group.sort=date_start_dt%20asc&group.limit=10&group.limit=1&facet=true&f.date_start_facet_dt.facet.mincount=1&facet.date=date_start_facet_dt&f.date_start_facet_dt.facet.date.start=2012-01-10T09:44:22Z&f.date_start_facet_dt.facet.date.end=2012-01-11T08:59:59Z&f.date_start_facet_dt.facet.date.gap=%2B86400SECONDS&facet.query=-date_start_facet_dt:[2012\-01\-10T09\:44\:22Z%20TO%202012\-01\-11T08\:59\:59Z]
> 
> The important parts here are:
> 
> The Query Facet
> facet.query=-date_start_facet_dt:[2012\-01\-10T09\:44\:22Z TO 
> 2012\-01\-11T08\:59\:59Z]
> 
> 
> The Date Facet
> f.date_start_facet_dt.facet.mincount=1
> facet.date=date_start_facet_dt
> f.date_start_facet_dt.facet.date.start=2012-01-10T09:44:22Z
> f.date_start_facet_dt.facet.date.end=2012-01-11T08:59:59Z
> f.date_start_facet_dt.facet.date.gap=%2B86400SECONDS
> 
> As you can see both facets have the same "range"
> 
> Now the important part of the results:
> 'facet_counts'=>{
>   'facet_queries'=>{
> '-date_start_facet_dt:[2012\\-01\\-10T09\\:44\\:22Z TO 
> 2012\\-01\\-11T08\\:59\\:59Z]' => 26},
>   'facet_fields'=>{},
>   'facet_dates'=> {
> 'date_start_facet_dt'=>{
> '2012-01-10T09:44:22Z'=>4,
> 'gap'=>'+86400SECONDS',
> 'start'=>'2012-01-10T09:44:22Z',
> 'end'=>'2012-01-11T09:44:22Z'}}
>   ,'facet_ranges'=>{}}}
> 
> 
> As you can see, for the same range I'm getting different counts. The CORRECT 
> one is the facet_dates count.
> 
> BTW I'm using Solr Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 
> 14:54:38 
> 
> Is this a known bug? Please help! :D
> 
> -- 
> Mauro Asprea
> 
> E-Mail: mauroasp...@gmail.com (mailto:mauroasp...@gmail.com)
> Mobile: +34 654297582
> Skype: mauro.asprea
> 
> 
> 
> 




Re: Solr core as a dispatcher

2012-01-10 Thread Hector Castro
In my case the cores are populated with different records that adhere to the 
same schema. The question about randomly distributing requests comes up because 
each core has the `shards` parameter populated so that it can hit the other 
cores' indexes.

My question is more about the advantages (if any) of utilizing a dispatcher 
core vs. simply querying the populated cores. 

--
Hector

On Jan 10, 2012, at 1:57 AM, shlomi java  wrote:

> If you want to randomly distribute requests across shards, then I think
> it's a case of Replication.
> 
> In a Replication setup, all cores have the same schema AND data, so querying
> any core should return the same result. It is used to support heavy load. Of
> course, such a setup will require some kind of load balancer.
> 
> In Distributed Search the shards have the same schema, but NOT the same
> data. So there is no point in randomly querying a shard, because we would
> get randomly different results.
> 
> ShlomiJ
> 
> On Tue, Jan 10, 2012 at 2:15 AM, Hector Castro  wrote:
> 
>> Hi,
>> 
>> Has anyone had success with multicore single node Solr configurations that
>> have one core acting solely as a dispatcher for the other cores?  For
>> example, say you had 4 populated Solr cores – configure a 5th to be the
>> definitive endpoint with `shards` containing cores 1-4.
>> 
>> Is there any advantage to this setup over simply having requests
>> distributed randomly across the 4 populated cores (all with `shards` equal
>> to cores 1-4)?  Is it even worth distributing requests across the cores
>> over always hitting the same one?
>> 
>> Thanks,
>> 
>> --
>> Hector
>> 
>> 


Two documents with same ID but different hash

2012-01-10 Thread Hyttinen Lauri

Hello,

I sent some data into the solr/lucene index but when I query
the data I see weird results.

There are documents with identical id fields but they have different 
hash values.

Apart from the hash values the results are the same.

I thought it was impossible to have documents with same uniqueKey in the 
index?

Evidently this is not the case? Could the index be corrupt somehow?

from schema.xml:
<uniqueKey>id</uniqueKey>


Best regards,
Lauri Hyttinen


RE: Match raw query string

2012-01-10 Thread McCarroll, Robert
Thank you for your patience and assistance.  XML is not my forte, but layoffs 
and attrition have reduced IT staff well below minimum functional levels here.  
Thanks to your help, the exact title matches have made it to the first page of 
results.


Robert McCarroll 
Systems Administration 
NYS Department of Civil Service



-Original Message-
From: Emmanuel Espina [mailto:espinaemman...@gmail.com] 
Sent: Monday, January 09, 2012 3:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Match raw query string

No, the omitTermFreqAndPositions and omitNorms parameters must be set in
the definition of the field in schema.xml (this is shown in the example
config).
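In schema.xml that means something like the following on the field definition (a sketch; field name and type are illustrative):

```xml
<field name="title" type="text" indexed="true" stored="true"
       omitNorms="false" omitTermFreqAndPositions="false"/>
```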

Have you analyzed the scoring information produced by debugQuery=true?
Add that to the query parameters. That will produce information for
each document. Compare how the score is calculated in the case of the
(undesired) first document, and in the case of the relevant document
that should be the first. That should show you why the score is not
being calculated correctly.

Also, it may not apply to your case, but qs should be higher than 0, and
mm can be interfering; remove it for testing if the problem persists.

Thanks

http://wiki.apache.org/solr/CommonQueryParameters#debugQuery

2012/1/9 McCarroll, Robert :
> The query comes off of the search page looking like:
>
> :/solr_/select?q=Budget%20Examiner%2FBudget%20Examiner%20%28Public%20Finance%29&hl=true&hl.fragsize=200&wt=json&start=0
>
> And the solrconfig section for the parser in use looks like:
>
> <requestHandler name="dismax" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="defType">dismax</str>
>     <str name="echoParams">explicit</str>
>     <float name="tie">0.01</float>
>     <str name="qf">
>        title^5000.0 content id^2000.0
>     </str>
>     <str name="mm">
>        2&lt;-1 5&lt;-2 6&lt;90%
>     </str>
>     <int name="ps">3</int>
>     <str name="pf">title^5000.0 id^2000.0 content</str>
>     <int name="qs">0</int>
>     <str name="hl.fl">content</str>
>     <int name="f.content.hl.fragsize">0</int>
>     <str name="f.content.hl.alternateField">title</str>
>     <str name="f.content.hl.fragmenter">regex</str>
>   </lst>
> </requestHandler>
>
> The high boost numbers were set as part of an attempt to see if we could get 
> the title match to come up to the first page of results.  It started out at 
> about 5 and 2 for id.  (id in this case being the URL of the page).
>
> If I wanted to add the omitTermFreqAndPositions to qf, would it look
> something like:
>
> title^5000.0 id^2000.0 
> content
>
> ?
>
>
> Robert McCarroll
> Systems Administration
> NYS Department of Civil Service
>
>
>
>
> -Original Message-
> From: Emmanuel Espina [mailto:espinaemman...@gmail.com]
> Sent: Monday, January 09, 2012 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Match raw query string
>
> How are you building your query? For your case it appears that the
> edismax query parser should solve it
>
> A good solution to this kind of problem involves:
> Storing norms (omitNorms=false) in the fields to search
> Storing the position of the terms (omitTermFreqAndPositions=false) in
> the fields to search
> Configuring dismax so the title gets a higher boost (qf parameter
> configured to something like "title^3 body")
> Configuring dismax so phrase queries in the title gets an even higher
> boost (pf parameter configured to something like "title^10 body")
>
> References:
> http://wiki.apache.org/solr/DisMaxQParserPlugin
> http://wiki.apache.org/solr/SchemaXml
>
>
> 2012/1/9 McCarroll, Robert :
>>  We're in the process of implementing solr to search our web site, and
>> have run into a response tuning issue.  When a user searches for a
>> string which is an exact match of a document title, for example "Budget
>> Examiner/Budget Examiner(Public Finance)", the number of hits in the
>> body of much longer pages on words stemming from the same roots drowns
>> out the exact title match so that it is deeply buried in the search
>> results, regardless of how much weight is given to the title field.  Is
>> there a way to configure solr so that raw query string matches for query
>> strings of more than two or three words appear before all other search
>> results, followed by non-exact title matches and have content matches
>> sort last?
>>
>>
>> Robert McCarroll
>> Systems Administration
>> NYS Department of Civil Service
>>
>>
>>


Re: how to rebuild snowball lib in solr

2012-01-10 Thread Erick Erickson
On a very quick glance, it looks like the source is at:
./lucene/contrib/analyzers/common/src/java/org/tartarus/snowball

and from there just compile Lucene and/or Solr as you normally would.

See: http://wiki.apache.org/solr/HowToContribute

Best
Erick

On Mon, Jan 9, 2012 at 2:13 PM,   wrote:
> Hello,
>
> I know solr uses snowball libs, but I was unable to locate the jar file in 
> the lib dirs. I need to modify one of the language stemmers and put it back 
> into solr.
>
> Any ideas, how can it be done?
>
> Thanks in advance.
> Alex.
>


Re: Two documents with same ID but different hash

2012-01-10 Thread Hyttinen Lauri

Hello again,

Well, after further review the IDs are different. The difference was 
just so small that I missed it after staring at it for a few hours.


BR,
Lauri

On 01/10/2012 02:20 PM, Hyttinen Lauri wrote:

Hello,

I sent some data into the solr/lucene index but when I query
the data I see weird results.

There are documents with identical id fields but they have different 
hash values.

Apart from the hash values the results are the same.

I thought it was impossible to have documents with same uniqueKey in 
the index?

Evidently this is not the case? Could the index be corrupt somehow?

from schema.xml:
<uniqueKey>id</uniqueKey>


Best regards,
Lauri Hyttinen


Re: Do Hignlighting + proximity using surround query parser

2012-01-10 Thread Ahmet Arslan
> I am not able to do highlighting with the surround query parser on the
> returned results.
> I have tried the highlighting component but it does not return
> highlighted results.

The highlighter does not recognize the Surround Query. The query must be 
re-written to enable highlighting, in the 
o.a.s.search.QParser#getHighlightQuery() method.

Not sure whether this functionality should be added in SOLR-2703 or in a 
separate JIRA issue. 



Re: best way to force substitutions in data

2012-01-10 Thread geeky2
Thank you both for the information.

Gora, when you mentioned:

>>
- For keeping both values, use synonyms. 
<<

What did you mean, exactly?

mark

--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-to-force-substitutions-in-data-tp3646195p3647920.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Doing url search in solr is slow

2012-01-10 Thread yu shen
Hi Erick,

I changed all my url fields into text (they were string fields before), and
added a WordDelimiterFilterFactory, so that url fields can be tokenized
into several words. But I still get around 15 seconds response time
measured using debugQuery=on, and most of the time is still spent in the
DebugComponent. The query I use did not have any prepended asterisk.
(Excuse me if the context description is still not complete enough.)
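For reference, a tokenized URL field type along the lines described might look like this (a sketch; the type name and filter options are illustrative):

```xml
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split URLs on punctuation and alphanumeric boundaries -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```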

Is there any other room to improve the query performance?

Spark

2012/1/10 yu shen 

> Hi Erick,
>
> I only added debugQuery=on to the url, and did not do any configuration
> with regard to DebugComponent. Seems like the 'string' type should be
> substituted with the 'text' type.
>
> I will paste the result here after I did some experiments.
>
> Spark
>
>
> 2012/1/9 Erick Erickson 
>
>> Do you by chance have the debugQuery on by default?
>> Because if you look down in the "timing" section,
>> you can see the times the various components took to do
>> their work, there are two sections "prepare" and "process".
>>
>> The cumulative time is 17.156 seconds, of which 17.156
>> seconds is reported to be in the DebugComponent.
>>
>> So what happens if you just turn that component off? Because
>> I don't see anything in your output that really looks like it is
>> taking any time. Of course if you've changed your code from
>> *url* to url*, that will account for time too, since the infix case
>> requires that every term in the fields in question be examined.
>>
>> About WordDelimiterFilterFactory: that is irrelevant for a "string"
>> type. It's an open question whether a string type is what you
>> want, but that is determined by your problem space. You might
>> spend some time with admin/analysis to see the effects of
>> various analysis chains. "string" is used when you want no
>> tokenization, no case transformations, etc.
>>
>> Best
>> Erick
>>
>> On Mon, Jan 9, 2012 at 10:04 AM, yu shen  wrote:
>> > Hi Erick,
>> >
>> > Thanks for you reply. Actually I did the following search:
>> > survey_url:http\://www.someurl.com/sch/i.html* referal_url:http\://
>> > www.someurl.com/sch/i.html* page_url:http\://
>> www.someurl.com/sch/i.html*
>> >
>> > I did not prepend any asterisk to the field value, but only append to
>> them.
>> >
>> > I analyzed the url field on the solr admin page, and it gives me this,
>> > meaning the url is not tokenized. I notice you mentioned a
>> WordDelimiterFilterFactory.
>> > Do I need to configure it in schema.xml or some place else?
>> >
>> >   term position:    1
>> >   term text:        http://www.someurl.com/sch/i.html*
>> >   term type:        word
>> >   source start,end: 0,31
>> > I added debugQuery=on to the query url, and got this (sorry to paste
>> such
>> > long, cryptic output here; it is really mysterious to me)
>> > 
>> >survey_url:http\://
>> > www.someurl.com/sch/i.html*
>> > referal_url:http\://www.someurl.com/sch/i.html*page_url:http\://
>> > www.someurl.com/sch/i.html*
>> >survey_url:http\://
>> www.someurl.com/sch/i.html*referal_url:http\://
>> > www.someurl.com/sch/i.html* page_url:http\://
>> www.someurl.com/sch/i.html*
>> > 
>> >survey_url:
>> http://www.someurl.com/sch/i.html*referal_url:
>> > http://www.someurl.com/sch/i.html* page_url:
>> > http://www.someurl.com/sch/i.html*
>> >survey_url:
>> > http://www.someurl.com/sch/i.html* referal_url:
>> > http://www.someurl.com/sch/i.html* page_url:
>> > http://www.someurl.com/sch/i.html*
>> >
>> >
>> > 0.76980036 = (MATCH) product of:
>> >  1.1547005 = (MATCH) sum of:
>> >0.57735026 = (MATCH) ConstantScoreQuery(referal_url:
>> > http://www.someurl.com/sch/i.html*), product of:
>> >  1.0 = boost
>> >  0.57735026 = queryNorm
>> >0.57735026 = (MATCH) ConstantScoreQuery(page_url:
>> > http://www.someurl.com/sch/i.html*), product of:
>> >  1.0 = boost
>> >  0.57735026 = queryNorm
>> >  0.667 = coord(2/3)
>> >
>> >
>> > 0.76980036 = (MATCH) product of:
>> >  1.1547005 = (MATCH) sum of:
>> >0.57735026 = (MATCH) ConstantScoreQuery(referal_url:
>> > http://www.someurl.com/sch/i.html*), product of:
>> >  1.0 = boost
>> >  0.57735026 = queryNorm
>> >0.57735026 = (MATCH) ConstantScoreQuery(page_url:
>> > http://www.someurl.com/sch/i.html*), product of:
>> >  1.0 = boost
>> >  0.57735026 = queryNorm
>> >  0.667 = coord(2/3)
>> >
>> >
>> > 0.76980036 = (MATCH) product of:
>> >  1.1547005 = (MATCH) sum of:
>> >0.57735026 = (MATCH) ConstantScoreQuery(referal_url:
>> > http://www.someurl.com/sch/i.html*), product of:
>> >  1.0 = boost
>> >  0.57735026 = queryNorm
>> >0.57735026 = (MATCH) ConstantScoreQuery(page_url:
>> > http://www.someurl.com/sch/i.html*), product of:
>> >  1.0 = boost
>> >  0.57735026 = queryNorm
>> >  0.667 = coord(2/3)
>> >
>> >
>> > 0.76980036 = (MATCH) product of:
>> >  1.1547

Re: best way to force substitutions in data

2012-01-10 Thread Gora Mohanty
On Tue, Jan 10, 2012 at 9:04 PM, geeky2  wrote:
> thank you both for the information.
>
> Gora, when you mentioned:
>
>>>
> - For keeping both values, use synonyms.
> <<
>
> what did you mean exactly.
[...]

Please take a look at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
and the default Solr configuration for an example of how to use
synonyms.
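Concretely, keeping both part numbers searchable can be as simple as one line in synonyms.txt (part numbers are from the earlier example):

```text
# either part number matches documents containing the other
M123b,C987
```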

Searching Google for "Solr synonyms" also turns up several
examples.

Regards,
Gora


Re: Two documents with same ID but different hash

2012-01-10 Thread Erick Erickson
I have no idea what you mean by "different hash", and you
haven't provided much information to go on here.

What is your evidence that the document is in the index
twice? If you're inspecting the index at a low level
that's expected, since documents are just marked
as deleted not immediately removed from the index.

Are you using shards? If so, is it possible that you've indexed
documents with the same ID to different shards?

Best
Erick

On Tue, Jan 10, 2012 at 7:20 AM, Hyttinen Lauri  wrote:
> Hello,
>
> I sent some data into the solr/lucene index but when I query
> the data I see weird results.
>
> There are documents with identical id fields but they have different hash
> values.
> Apart from the hash values the results are the same.
>
> I thought it was impossible to have documents with same uniqueKey in the
> index?
> Evidently this is not the case? Could the index be corrupt somehow?
>
> from schema.xml:
> <uniqueKey>id</uniqueKey>
>
>
> Best regards,
> Lauri Hyttinen


Re: Solr core as a dispatcher

2012-01-10 Thread Shawn Heisey

On 1/9/2012 5:15 PM, Hector Castro wrote:

Hi,

Has anyone had success with multicore single node Solr configurations that have 
one core acting solely as a dispatcher for the other cores?  For example, say 
you had 4 populated Solr cores – configure a 5th to be the definitive endpoint 
with `shards` containing cores 1-4.

Is there any advantage to this setup over simply having requests distributed 
randomly across the 4 populated cores (all with `shards` equal to cores 1-4)?  
Is it even worth distributing requests across the cores over always hitting the 
same one?


I've got a setup where a single index chain consists of seven cores 
across two servers.  Those seven cores do not have the shards parameter 
in them.  I have what you call a dispatcher core (I call it a broker 
core) that contains the shards parameter, but has no index data.


If you use a dispatcher core, your application does not need to be 
concerned with the makeup of your index, so you don't need to include a 
shards parameter with your request.  You can change the index 
distribution and not have to worry about your application 
configuration.  This is the major advantage to doing it this way.  
Distributed search has some overhead and not all Solr features work with 
it, so if your application already knows which core will contain the 
data it is trying to find, it is better to query the right core directly.


Be careful that you aren't adding a shards parameter to a core 
configuration that points at itself.  Solr will, as of the last time I 
checked, try to complete the recursion and will fail.
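The broker core's handler then needs little more than a shards default, along these lines (a sketch; host and core names are illustrative):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- the broker core itself must NOT appear in this list -->
    <str name="shards">host1:8983/solr/core1,host1:8983/solr/core2,host2:8983/solr/core3,host2:8983/solr/core4</str>
  </lst>
</requestHandler>
```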


Thanks,
Shawn



How to debug DIH with MySQL?

2012-01-10 Thread Walter Underwood
I see a missing required "title" field for every document when I'm using DIH. 
Yes, these documents have titles in the database. Is there a way to see what 
exact queries are sent to MySQL or received by MySQL?

Here is a relevant chunk of the dataConfig:

  <entity name="book" query="...">
    ...
    <entity name="title"
            query="select title from biblio_title where biblio_id='${biblio.id}'">
      ...
    </entity>
  </entity>

wunder
--
Walter Underwood
wun...@wunderwood.org
Search Guy, Chegg



Re: How to debug DIH with MySQL?

2012-01-10 Thread dan whelan

just a guess but this might need to change from

${biblio.id}

to

${book.id}

Since the entity name is book instead of biblio
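That is, the ${...} prefix must match the parent entity's name attribute (a sketch; the outer query is elided as in the original):

```xml
<entity name="book" query="...">
  <entity name="title"
          query="select title from biblio_title where biblio_id='${book.id}'"/>
</entity>
```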



On 1/10/12 10:37 AM, Walter Underwood wrote:

I see a missing required "title" field for every document when I'm using DIH. 
Yes, these documents have titles in the database. Is there a way to see what exact 
queries are sent to MySQL or received by MySQL?

Here is a relevant chunk of the dataConfig:

  <entity name="book" query="...">
    ...
    <entity name="title"
            query="select title from biblio_title where biblio_id='${biblio.id}'">
      ...
    </entity>
  </entity>

wunder
--
Walter Underwood
wun...@wunderwood.org
Search Guy, Chegg





Re: How to debug DIH with MySQL?

2012-01-10 Thread Walter Underwood
Thanks! That looks like it fixed the problem. This list continues to be awesome.

Is the function of the name attribute actually described in the docs? I could 
not figure out what it was for.

wunder

On Jan 10, 2012, at 10:41 AM, dan whelan wrote:

> just a guess but this might need to change from
> 
> ${biblio.id}
> 
> to
> 
> ${book.id}
> 
> Since the entity name is book instead of biblio
> 
> 
> 
> On 1/10/12 10:37 AM, Walter Underwood wrote:
>> I see a missing required "title" field for every document when I'm using 
>> DIH. Yes, these documents have titles in the database. Is there a way to see 
>> what exact queries are sent to MySQL or received by MySQL?
>> 
>> Here is a relevant chunk of the dataConfig:
>> 
>>   <entity name="book" query="...">
>>     ...
>>     <entity name="title"
>>             query="select title from biblio_title where biblio_id='${biblio.id}'">
>>       ...
>>     </entity>
>>   </entity>
>> 
>> wunder
>> --
>> Walter Underwood
>> wun...@wunderwood.org
>> Search Guy, Chegg
>> 
> 

--
Walter Underwood
wun...@wunderwood.org





Re: How to debug DIH with MySQL?

2012-01-10 Thread Gora Mohanty
On Wed, Jan 11, 2012 at 12:37 AM, Walter Underwood
 wrote:
> Thanks! That looks like it fixed the problem. This list continues to be 
> awesome.
>
> Is the function of the name attribute actually described in the docs? I could 
> not figure out what it was for.

Yes, it is, though maybe not very prominently.

From under http://wiki.apache.org/solr/DataImportHandler,
Configuration in data-config.xml > Schema for the data config :

The default attributes for an entity are:

* name (required) : A unique name used to identify an entity
...

Regards,
Gora


Re: How to debug DIH with MySQL?

2012-01-10 Thread Walter Underwood
Right, but that says exactly nothing about how that identifier is used. --wunder

On Jan 10, 2012, at 11:23 AM, Gora Mohanty wrote:

> On Wed, Jan 11, 2012 at 12:37 AM, Walter Underwood
>  wrote:
>> Thanks! That looks like it fixed the problem. This list continues to be 
>> awesome.
>> 
>> Is the function of the name attribute actually described in the docs? I 
>> could not figure out what it was for.
> 
> Yes, it is, though maybe not very prominently.
> 
> From under http://wiki.apache.org/solr/DataImportHandler,
> Configuration in data-config.xml > Schema for the data config :
> 
> The default attributes for an entity are:
> 
>* name (required) : A unique name used to identify an entity
> ...
> 
> Regards,
> Gora






SpellCheck Help

2012-01-10 Thread Donald Organ
I am trying to get the IndexBasedSpellChecker to work.  I believe I have
everything setup properly and the spellcheck component seems to be running
but the suggestions list is empty.

I am using SOLR 3.5 with Jetty.

My solrconfig.xml and schema.xml are as follows:

solrconfig.xml:  http://pastie.org/private/z7sharm0ajlmm9hpy41v7g
schema.xml: http://pastie.org/private/ykim99unbqfhumxxzbs6g


RE: SpellCheck Help

2012-01-10 Thread Dyer, James
Three things to check:

1. Use a higher spellcheck.count than 1.   Try 10.  IndexBasedSpellChecker 
pre-filters the possibilities in a first pass of a 2-pass process.  If 
spellcheck.count is too low, all the good suggestions might get filtered on the 
first pass and then it won't find anything on the second.

2. Be sure you're building the dictionary.  Try adding "spellcheck.build=true" 
to your first query.  You need to do this every time you start the solr core.

3. Try a lower spellcheck.accuracy.  Maybe the default .5 instead of the .7 
you've got.

One other thing to consider:

- If the "misspelled" word exists in your index, the spellchecker won't try to 
correct it.  This is true even if you're omitting words from the dictionary 
(for instance, by using "thresholdTokenFrequency").
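Putting the first three suggestions together, a test query might look like this (the handler path and the misspelled term are illustrative):

```text
/select?q=documnet&spellcheck=true&spellcheck.build=true&spellcheck.count=10&spellcheck.accuracy=0.5
```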

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Donald Organ [mailto:dor...@donaldorgan.com] 
Sent: Tuesday, January 10, 2012 1:26 PM
To: solr-user@lucene.apache.org
Subject: SpellCheck Help

I am trying to get the IndexBasedSpellChecker to work.  I believe I have
everything setup properly and the spellcheck component seems to be running
but the suggestions list is empty.

I am using SOLR 3.5 with Jetty.

My solrconfig.xml and schema.xml are as follows:

solrconfig.xml:  http://pastie.org/private/z7sharm0ajlmm9hpy41v7g
schema.xml: http://pastie.org/private/ykim99unbqfhumxxzbs6g


Re: SpellCheck Help

2012-01-10 Thread Donald Organ
my copyField was defined as copyfield   <--- notice the lowercase f
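For anyone hitting the same thing: the element name is case-sensitive, and the lowercase spelling is silently ignored, so the copy never happens (field names are illustrative):

```xml
<!-- silently ignored: -->
<copyfield source="name" dest="spell"/>
<!-- correct: -->
<copyField source="name" dest="spell"/>
```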




On Tue, Jan 10, 2012 at 2:50 PM, Dyer, James wrote:

> Three things to check:
>
> 1. Use a higher spellcheck.count than 1.   Try 10.  IndexBasedSpellChecker
> pre-filters the possibilities in a first pass of a 2-pass process.  If
> spellcheck.count is too low, all the good suggestions might get filtered on
> the first pass and then it won't find anything on the second.
>
> 2. Be sure you're building the dictionary.  Try adding
> "spellcheck.build=true" to your first query.  You need to do this every
> time you start the solr core.
>
> 3. Try a lower spellcheck.accuracy.  Maybe the default .5 instead of the
> .7 you've got.
>
> One other thing to consider:
>
> - If the "misspelled" word exists in your index, the spellchecker won't
> try to correct it.  This is true even if you're omitting words from the
> dictionary (for instance, by using "thresholdTokenFrequency").
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -Original Message-
> From: Donald Organ [mailto:dor...@donaldorgan.com]
> Sent: Tuesday, January 10, 2012 1:26 PM
> To: solr-user@lucene.apache.org
> Subject: SpellCheck Help
>
> I am trying to get the IndexBasedSpellChecker to work.  I believe I have
> everything setup properly and the spellcheck component seems to be running
> but the suggestions list is empty.
>
> I am using SOLR 3.5 with Jetty.
>
> My solrconfig.xml and schema.xml are as follows:
>
> solrconfig.xml:  http://pastie.org/private/z7sharm0ajlmm9hpy41v7g
> schema.xml: http://pastie.org/private/ykim99unbqfhumxxzbs6g
>


Stemming numbers

2012-01-10 Thread Tanner Postert
We've had some issues with people searching for a document with the
search term '200 movies'. The document is actually titled 'two hundred
movies'.

Do we need to add every number to our synonyms dictionary to
accomplish this? Is it best done at index or search time?


RE: ignoreTikaException value

2012-01-10 Thread TRAN-NGOC Minh
Thanks for your reply.

I added the argument in the solrconfig.xml and it worked like a charm.

Thanks again

Minh

-Original Message-
From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] 
Sent: mardi 10 janvier 2012 01:25
To: solr-user@lucene.apache.org
Subject: Re: ignoreTikaException value

(12/01/10 6:31), TRAN-NGOC Minh wrote:
> Last year a patch adding an ignoreTikaException flag was developed.
>
> My question is: how can I change the ignoreTikaException flag value?

Just setting the ignoreTikaException=true request parameter should work when you
call ExtractingRequestHandler. Or you can set it as a default in
solrconfig.xml:

   <requestHandler name="/update/extract"
                   class="solr.extraction.ExtractingRequestHandler">
     <lst name="defaults">
       :
       <bool name="ignoreTikaException">true</bool>
       :
     </lst>
   </requestHandler>

koji
--
http://www.rondhuit.com/en/


Re: Stemming numbers

2012-01-10 Thread Ted Dunning
On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert wrote:

> We've had some issues with people searching for a document with the
> search term '200 movies'. The document is actually title 'two hundred
> movies'.
>
> Do we need to add every number to our  synonyms dictionary to
> accomplish this?


That is one way to deal with this.

But it depends on a lot of hand engineering of special cases.  That is good
to have for the low hanging fruit, but it only takes you so far.  You can
also automate the discovery of such cases to a certain degree by analyzing
query logs.
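
Concretely, the hand-engineered route for the example above would be an index-time synonyms.txt entry along these lines (illustrative, not exhaustive):

```
# synonyms.txt (illustrative): treat the numeral and the spelled-out
# form as equivalent, with expand=true in the SynonymFilterFactory
200, two hundred
```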


> Is it best done at index or search time?
>

I would say that opinion is divided on this and in the end, you probably
have to do versions of this at both times.  This is especially true if you
want to include secondary information like inferred query purpose
(obviously only available at query time) and inferred document
characteristics (best known at indexing time).  Partly the choice about
when to do this is driven by which trade-offs you are OK making.  For
instance, some people are driven by index size but not query response time.
 They would probably opt for pushing load to the query.  Others may be
bound by response time or query throughput.  They may wish to minimize
query complexity and size.


Re: Stemming numbers

2012-01-10 Thread Tanner Postert
You mention "that is one way to deal with this". Is there another I'm not seeing?

On Jan 10, 2012, at 4:34 PM, Ted Dunning  wrote:

> On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert 
> wrote:
>
>> We've had some issues with people searching for a document with the
>> search term '200 movies'. The document is actually title 'two hundred
>> movies'.
>>
>> Do we need to add every number to our  synonyms dictionary to
>> accomplish this?
>
>
> That is one way to deal with this.
>
> But it depends on a lot of hand engineering of special cases.  That is good
> to have for the low hanging fruit, but it only takes you so far.  You can
> also automate the discovery of such cases to a certain degree by analyzing
> query logs.
>
>
>> Is it best done at index or search time?
>>
>
> I would say that opinion is divided on this and in the end, you probably
> have to do versions of this at both times.  This is especially true if you
> want to include secondary information like inferred query purpose
> (obviously only available at query time) and inferred document
> characteristics (best known at indexing time).  Partly the choice about
> when to do this is driven by which trade-offs you are OK making.  For
> instance, some people are driven by index size but not query response time.
> They would probably opt for pushing load to the query.  Others may be
> bound by response time or query throughput.  They may wish to minimize
> query complexity and size.


Re: Stemming numbers

2012-01-10 Thread Ted Dunning
I was afraid you would say that.

See http://fora.tv/2009/10/14/ACM_Data_Mining_SIG_Ted_Dunning#fullprogram,
click on the Recommendations section to skip to the good part.

The point is that cross recommendation can let you learn what sorts of
rewrites of this kind are needed.  The idea is that you let your clever
users teach you about what sort of rewrites are necessary so that your less
clever users will benefit.

The engineering effort is higher going in and I wouldn't recommend it if
you have no development budget, but the total effort to get a really high
performing system would be less than trying to engineer all possible
rewrites by hand.

On Tue, Jan 10, 2012 at 10:21 PM, Tanner Postert
wrote:

> You mention "that is one way to do it" is there another i'm not seeing?
>
> On Jan 10, 2012, at 4:34 PM, Ted Dunning  wrote:
>
> > On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert <
> tanner.post...@gmail.com>wrote:
> >
> >> We've had some issues with people searching for a document with the
> >> search term '200 movies'. The document is actually title 'two hundred
> >> movies'.
> >>
> >> Do we need to add every number to our  synonyms dictionary to
> >> accomplish this?
> >
> >
> > That is one way to deal with this.
> >
> > But it depends on a lot of hand engineering of special cases.  That is
> good
> > to have for the low hanging fruit, but it only takes you so far.  You can
> > also automate the discovery of such cases to a certain degree by
> analyzing
> > query logs.
> >
> >
> >> Is it best done at index or search time?
> >>
> >
> > I would say that opinion is divided on this and in the end, you probably
> > have to do versions of this at both times.  This is especially true if
> you
> > want to include secondary information like inferred query purpose
> > (obviously only available at query time) and inferred document
> > characteristics (best known at indexing time).  Partly the choice about
> > when to do this is driven by which trade-offs you are OK making.  For
> > instance, some people are driven by index size but not query response
> time.
> > They would probably opt for pushing load to the query.  Others may be
> > bound by response time or query throughput.  They may wish to minimize
> > query complexity and size.
>


Re: Stemming numbers

2012-01-10 Thread Otis Gospodnetic
Hi Tanner,

Here is another simple way: AutoComplete.
You know what your users are searching for, you can identify top queries and 
you can identify common queries that are not finding matches.  This all allows 
you to figure out what to feed in AutoComplete.  And hopefully your 
AutoComplete doesn't just perform a search with the selected suggestion (e.g. 
"200 movies"), but "translates" that to either a redirect to the specific 
associated item or a "translated query".

Another related approach is handling this with DYM or Related Searches type 
functionality.  Didn't check out Ted's link yet, but it sounds like that may be 
related.  We've had some luck building this from query logs, where we've 
examined query patterns and figured out that when people query for e.g. "200 
movies" they really wanted "two hundred movies".  Think about Google's query 
spelling suggestions.

Otis

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



- Original Message -
> From: Tanner Postert 
> To: "solr-user@lucene.apache.org" 
> Cc: 
> Sent: Tuesday, January 10, 2012 10:21 PM
> Subject: Re: Stemming numbers
> 
> You mention "that is one way to do it" is there another i'm not 
> seeing?
> 
> On Jan 10, 2012, at 4:34 PM, Ted Dunning  wrote:
> 
>>  On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert 
> wrote:
>> 
>>>  We've had some issues with people searching for a document with the
>>>  search term '200 movies'. The document is actually title 
> 'two hundred
>>>  movies'.
>>> 
>>>  Do we need to add every number to our  synonyms dictionary to
>>>  accomplish this?
>> 
>> 
>>  That is one way to deal with this.
>> 
>>  But it depends on a lot of hand engineering of special cases.  That is good
>>  to have for the low hanging fruit, but it only takes you so far.  You can
>>  also automate the discovery of such cases to a certain degree by analyzing
>>  query logs.
>> 
>> 
>>>  Is it best done at index or search time?
>>> 
>> 
>>  I would say that opinion is divided on this and in the end, you probably
>>  have to do versions of this at both times.  This is especially true if you
>>  want to include secondary information like inferred query purpose
>>  (obviously only available at query time) and inferred document
>>  characteristics (best known at indexing time).  Partly the choice about
>>  when to do this is driven by which trade-offs you are OK making.  For
>>  instance, some people are driven by index size but not query response time.
>>  They would probably opt for pushing load to the query.  Others may be
>>  bound by response time or query throughput.  They may wish to minimize
>>  query complexity and size.
>


Re: stopwords as privacy measure

2012-01-10 Thread Michael Lissner
It's a bit of a privacy through obscurity measure, unfortunately. The 
problem is that American courts do a lousy job of removing social 
security numbers from cases that I put on my site. I do anonymization 
before sending the cases to Solr, but if you're clever (and the 
stopwords weren't in place) you could search for evidence of my 
anonymization efforts and then backtrack to the original cases at the 
court sites, where you'd find the SSNs...


It's a boondoggle, but the stopwords should help.
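
For reference, the kind of substitution described above is a simple regex pass; this is an illustrative sketch, not the actual anonymizer used on the site:

```python
import re

# Matches the standard nnn-nn-nnnn social security number layout.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def anonymize(text: str) -> str:
    # Replace every SSN-shaped token with a fixed mask before indexing.
    return SSN_RE.sub("xxx-xx-xxxx", text)
```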

Mike



On Mon 09 Jan 2012 04:30:22 AM PST, Erik Hatcher wrote:

Mike -

Indeed users won't be able to *search* for things removed by the stop filter at 
index time (the terms literally aren't in the index then).  But be careful with 
the stored value.  Analysis does not affect stored content.

Are you anonymizing before sending to Solr (if so, why stop-word block?).  If 
not, if you're storing that content it could be returned to the searching 
client.   If you aren't anonymizing before sending to Solr, how are you using 
the stop word filtering to do this?

Erik

On Jan 8, 2012, at 23:08 , Michael Lissner wrote:


I've got them configured at index and query time, so sounds like I'm all set.

I'm doing anonymization of social security numbers, converting them to 
xxx-xx-xxxx. I don't *think* users can find a way of identifying these docs if 
the stopwords-based block works.

Thank you both for the confirmation.

Mike

On Sun 08 Jan 2012 09:32:53 PM PST, Gora Mohanty wrote:

On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner
   wrote:

I have a unique use case where I have words in my corpus that users
shouldn't ever be allowed to search for. My theory is that if I add these to
the stopwords list, that should do the trick.


Yes, that should work. Are you including the stop words at index-time,
query-time, or both? Normally, you should do both.

If done at the time of indexing, these terms will not even be in the
index, so I cannot think of any security issues.

Regards,
Gora




Re: Solr core as a dispatcher

2012-01-10 Thread shlomi java
Straying a bit from the subject,

don't you think it would be useful to have the shards parameter used at
indexing time as well, in order to maintain document uniqueness?
I mean as an out of the box feature of Solr.

Because the situation today is that a client working with a sharded Solr
is responsible for keeping document uniqueness across all shards.

*Solution* - let Solr decide which shard a document is indexed to, using a
pluggable hashing method.
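
Something along these lines (illustrative only; this is not an existing Solr API, just the kind of pluggable hashing I mean):

```python
import hashlib

def shard_for(unique_key: str, num_shards: int) -> int:
    """Deterministically map a document's unique key to a shard index.

    md5 is used instead of the builtin hash() so the mapping stays
    stable across processes and restarts; the same document therefore
    always lands on (and updates) the same shard.
    """
    digest = hashlib.md5(unique_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```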

What do you think?

ShlomiJ

On Tue, Jan 10, 2012 at 6:15 PM, Shawn Heisey  wrote:

> On 1/9/2012 5:15 PM, Hector Castro wrote:
>
>> Hi,
>>
>> Has anyone had success with multicore single node Solr configurations
>> that have one core acting solely as a dispatcher for the other cores?  For
>> example, say you had 4 populated Solr cores – configure a 5th to be the
>> definitive endpoint with `shards` containing cores 1-4.
>>
>> Is there any advantage to this setup over simply having requests
>> distributed randomly across the 4 populated cores (all with `shards` equal
>> to cores 1-4)?  Is it even worth distributing requests across the cores
>> over always hitting the same one?
>>
>
> I've got a setup where a single index chain consists of seven cores across
> two servers.  Those seven cores do not have the shards parameter in them.
>  I have what you call a dispatcher core (I call it a broker core) that
> contains the shards parameter, but has no index data.
>
> If you use a dispatcher core, your application does not need to be
> concerned with the makeup of your index, so you don't need to include a
> shards parameter with your request.  You can change the index distribution
> and not have to worry about your application configuration.  This is the
> major advantage to doing it this way.  Distributed search has some overhead
> and not all Solr features work with it, so if your application already
> knows which core will contain the data it is trying to find, it is better
> to query the right core directly.
>
> Be careful that you aren't adding a shards parameter to a core
> configuration that points at itself.  Solr will, as of the last time I
> checked, try to complete the recursion and will fail.
>
> Thanks,
> Shawn
>
>


Multiple Sort for Group/Folding

2012-01-10 Thread Mauro Asprea
Hi, I'm having some issues trying to sort my grouped results by more than one 
field. If I use just one, it works fine regardless of which field I use (I 
mean, it sorts). 

I have a case where the first sorting key is equal for all the head docs of each 
group, so I expect the groups to come back sorted by the second sorting key. But 
that's not the case: it only sorts by the first key, no matter what.

This is my query:
fq=type:Movie
&fq={!tag=cg01w4p3bcj3}date_start_dt:[2012\-01\-11T06\:47\:38Z TO 2012\-01\-12T08\:59\:59Z]
&fq=event_id_i:[1 TO *]
&sort=location_weight_i desc, weight_i desc
&q="Village Avellaneda"
&fl=* score
&qf=name_texts location_name_text
&defType=dismax
&start=0
&rows=12
&group=true
&group.field=event_id_str_s
&group.field=location_name_str_s
&group.sort=date_start_dt asc
&group.limit=10
&group.limit=1
&facet=true
&facet.query={!ex=cg01w4p3bcj3}date_start_dt:[2012\-01\-11T06\:47\:38Z TO 2012\-01\-12T08\:59\:59Z]
&facet.query={!ex=cg01w4p3bcj3}date_start_dt:[2012\-01\-12T09\:00\:00Z TO 2012\-01\-13T08\:59\:59Z]
&facet.query={!ex=cg01w4p3bcj3}date_start_dt:[2012\-01\-13T21\:00\:00Z TO 2012\-01\-16T08\:59\:59Z]
&facet.query={!ex=cg01w4p3bcj3}date_start_dt:[2012\-01\-11T06\:47\:38Z TO 2012\-01\-18T12\:47\:38Z]
&facet.query={!ex=cg01w4p3bcj3}date_start_dt:[2012\-01\-11T06\:47\:38Z TO 2012\-02\-10T12\:47\:38Z]

Using the last Solr release 3.5


Thanks!
-- 
Mauro Asprea

E-Mail: mauroasp...@gmail.com
Mobile: +34 654297582
Skype: mauro.asprea




Re: FastVectorHighlighter wiki corrections

2012-01-10 Thread Michael Lissner

Hi,

I didn't hear any responses here, so I went ahead and made a bunch of 
changes to the highlighting parameters wiki:
 - Highlighter is now known as Original Highlighter, so it's clearer that 
"Highlighter" doesn't just refer to the highlighting utilities generally.
 - I need help with fragsize. The wiki says to set it to either 0 or a 
huge number to disable fragmenting. Which is it?
 - the wiki says that hl.useFastVectorHighlighter is defaulted to 
false. I read somewhere that FVH is True when the data has been indexed 
with termVectors, termPositions and termOffsets. Is that correct?
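
(For context, the field definition that last point refers to would look something like this; field name and type are illustrative:)

```xml
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```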


Thanks,

Mike


On 01/07/2012 10:24 PM, Michael Lissner wrote:
I switched over to the FastVectorHighlighter, but I'm struggling with 
the highlighting wiki. For example, it took me a while to figure out 
that "Highlighter only" means that a parameter doesn't work for FVH.


Can somebody wise tell me if the following are valid corrections I can 
make:
 - fragSize=0 can be accomplished in FVH by creating a fragListBuilder 
in your config:
<fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>

   and then calling it with hl.fragListBuilder=single
 - fragListBuilder supports field level overrides (this isn't 
mentioned currently)
 - the wiki says that hl.useFastVectorHighlighter is defaulted to 
false. I read somewhere that FVH is True when the data has been 
indexed with termVectors, termPositions and termOffsets. Is that correct?


Thanks,

Mike