Re: Delete from Solr index...

2010-01-29 Thread vnchoudhary

I am looking for the following solutions in C#; please provide sample code if
possible:

1. Delete the entire index using a delete query.
2. Back up all of the old index before regenerating.
3. Write a negative (NOT) query on a field to delete stale documents.
4. How can I use a transaction around index generation (delete the old index and
regenerate), so that if any error occurs it will not affect the old
index?





ryantxu wrote:
> 
> escher2k wrote:
>> I am trying to remove documents from my index using "delete by query".
>> However when I did this, the deleted
>> items seem to remain. This is the format of the XML file I am using -
>> 
>> <delete>
>>   <query>load_id:20070424150841</query>
>>   <query>load_id:20070425145301</query>
>>   <query>load_id:20070426145301</query>
>>   <query>load_id:20070427145302</query>
>>   <query>load_id:20070428145301</query>
>>   <query>load_id:20070429145301</query>
>> </delete>
>> 
>> When I do the deletes individually, it seems to work (i.e. create each of
>> the above in a separate file). Does this
>> mean that each delete query request has to be executed separately ?
>> 
> 
> correct, delete (unlike <add>) only accepts one command.
> 
> Just to note, if "load_id" is your unique key, you could also use:
>   <id>20070424150841</id>
> 
> This will give you better performance and does not commit the changes 
> until you explicitly send <commit/>.
> 
> 
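The archive stripped the XML tags from the update messages discussed above. As a reference, here is a small plain-Java sketch that builds the three message shapes in question: delete-by-query, the faster delete-by-id for a unique key, and the explicit commit. The element names are standard Solr XML update syntax; the helper class itself is illustrative, not from the thread.

```java
// Sketch of the Solr XML update messages discussed in this thread.
public class SolrDeleteMessages {
    // One delete-by-query command per message, as ryantxu notes.
    static String deleteByQuery(String query) {
        return "<delete><query>" + query + "</query></delete>";
    }
    // Delete by unique key, which performs better.
    static String deleteById(String id) {
        return "<delete><id>" + id + "</id></delete>";
    }
    // Deletes are not visible until an explicit commit is sent.
    static String commit() {
        return "<commit/>";
    }
    public static void main(String[] args) {
        // Each delete query must be sent as its own command:
        for (String loadId : new String[]{"20070424150841", "20070425145301"}) {
            System.out.println(deleteByQuery("load_id:" + loadId));
        }
        System.out.println(commit());
    }
}
```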

-- 
View this message in context: 
http://old.nabble.com/Delete-from-Solr-index...-tp10264940p27369849.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Querying for multi-term phrases only . . .

2010-01-29 Thread Erik Hatcher
You can avoid one-word terms by setting outputUnigrams="false" on the
ShingleFilterFactory configuration.


Erik

On Jan 28, 2010, at 11:29 PM, Christopher Ball wrote:


I am curious how I can query for multi-term phrases using the
TermsComponent?



The field I am searching has been shingled so it contains 2 and 3 word
phrases.



For example, in the sample results below I want to get back only multi-word
phrases such as "table of contents" and "under the", but not single-word
terms such as "year" and "significant".



[The sample terms output was stripped by the archive; only the counts survived: 25302, 25162, 25097, 17501, 17359.]



Appreciate any ideas,



Christopher
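Erik's outputUnigrams="false" setting goes on the shingle filter in the field type. A hedged sketch of what such a type could look like (the type name, tokenizer choice, and shingle size are assumptions, not from the thread):

```xml
<!-- Illustrative field type; names and sizes are assumptions -->
<fieldType name="shingled" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Emit only 2- and 3-word shingles; suppress single-word terms -->
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```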





Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Erik Hatcher
dismax won't quite give you the same query result.  What you can do  
pretty easily, though, is create a QParser and QParserPlugin pair,  
register it in solrconfig.xml, and then use &defType= with the name you  
registered.  Pretty straightforward.  Have a look at Solr's various  
QParserPlugin implementations for details.


Erik
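The registration Erik describes looks roughly like this in solrconfig.xml (the parser name and class here are hypothetical placeholders):

```xml
<!-- Hypothetical names; point class at your QParserPlugin implementation -->
<queryParser name="myparser" class="com.example.MyQParserPlugin"/>
```

It can then be selected per request with `&defType=myparser`.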

On Jan 29, 2010, at 12:30 AM, Abin Mathew wrote:

Hi, I want to generate my own customized query from the input string
entered by the user. It should look something like this:

*Search field : Microsoft*
*
Generated Query*  :
description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0)
tags:microsoft^2.0 title:microsoft^3.5 functionalArea:microsoft

*The lucene code we used is like this*
BooleanQuery query = new BooleanQuery();
BooleanQuery must = new BooleanQuery();

addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
addToBooleanQuery(must, "company", inputData, standardAnalyzer);
addToBooleanQuery(must, "city", inputData, standardAnalyzer);
must.setBoost(5.0f);
query.add(must, Occur.MUST);
addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);

*In simple English:*
addToBooleanQuery adds the given field to the query after analyzing the
input with the analyzer mentioned and applying the boost specified.
So there "MUST" be a keyword match in one of the fields
tags, title, role, description, requirement, company, city, and it "SHOULD"
occur in the fields tags, title and functionalArea.

Hope you have got an idea of my requirement. I am not asking anyone to do it
for me. Please let me know where I can start and give me some useful tips to
move ahead with this. I believe it has to do with modifying the XML
configuration file and setting the parameters in the dismax handler, but I am
still not sure. Please help.

Thanks & Regards
Abin Mathew
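For reference, the boosts in the Lucene code above map roughly onto dismax parameters like the following. This is only a sketch (as Erik notes in this thread, dismax cannot reproduce the query exactly; field names and boosts are taken from the code, the rest is an assumption):

```xml
<!-- Rough dismax approximation; the SHOULD-only fields (tags^2.0,
     title^3.5, functionalArea) would need separate boost queries -->
<lst name="defaults">
  <str name="defType">dismax</str>
  <str name="qf">tags^1.5 title^3.0 role requirement company city description</str>
</lst>
```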




Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi,

 

I was wondering if anyone had come across this use case, and if this type of 
faceting is possible:

 

The requirement is to build a query such that an aggregated facet count of 
common (AND'ed) field values forms the basis of each returned facet count.

 

For example:

Let's say I have a number of documents in an index with, among others, the 
fields 'host' and 'user':

 

Doc1  host:machine_1   user:user_1

Doc2  host:machine_1   user:user_2

Doc3  host:machine_1   user:user_1

Doc3  host:machine_1   user:user_1

 

Doc4  host:machine_2   user:user_1

Doc5  host:machine_2   user:user_1

Doc6  host:machine_2   user:user_4

 

Doc7  host:machine_1   user:user_4

 

Is it possible to get facets back that would give the count of documents that 
have common host AND user values (preferably ordered - i.e. host then user for 
this example, so as not to create a factorial explosion)? Note that the caller 
wouldn't know what machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could work for 
this, but I believe facet queries work on a different plane than this 
requirement (narrowing the term count, as opposed to aggregating).

 

For the example above, the desired result would be:

 

machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)

 

machine_2/user_1 (2)

machine_2/user_4 (1)

 

Has anyone had a need for this type of faceting and found a way to achieve it?

 

Many thanks,

Peter

 

 
  
_
We want to hear all your funny, exciting and crazy Hotmail stories. Tell us now
http://clk.atdmt.com/UKM/go/195013117/direct/01/

Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
When faced with this type of situation where the data is entirely  
available at index-time, simply create an aggregated field that glues  
the two pieces together, and facet on that.


Erik
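Erik's index-time approach can be illustrated outside Solr with plain Java: glue host and user into a single key per document, then facet (count) on that key. Using the documents from Peter's example below, the counts come out exactly as desired:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of index-time field aggregation for faceting:
// each document gets one glued "host/user" value, and faceting on that
// single field yields the combined counts directly.
public class AggregatedFacet {
    static Map<String, Integer> facetCounts(String[][] docs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] doc : docs) {
            String key = doc[0] + "/" + doc[1];  // the aggregated field value
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
    public static void main(String[] args) {
        // (host, user) pairs from Peter's example documents
        String[][] docs = {
            {"machine_1", "user_1"}, {"machine_1", "user_2"},
            {"machine_1", "user_1"}, {"machine_1", "user_1"},
            {"machine_2", "user_1"}, {"machine_2", "user_1"},
            {"machine_2", "user_4"}, {"machine_1", "user_4"},
        };
        // Prints machine_1/user_1 (3), machine_1/user_2 (1), etc.
        facetCounts(docs).forEach((k, v) -> System.out.println(k + " (" + v + ")"));
    }
}
```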

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



Hi,



I was wondering if anyone had come across this use case, and if this  
type of faceting is possible:




The requirement is to build a query such that an aggregated facet  
count of common (and'ed) field values form the basis of each  
returned facet count.




For example:

Let's say I have a number of documents in an index with, among  
others, the fields 'host' and 'user':




Doc1  host:machine_1   user:user_1

Doc2  host:machine_1   user:user_2

Doc3  host:machine_1   user:user_1

Doc3  host:machine_1   user:user_1



Doc4  host:machine_2   user:user_1

Doc5  host:machine_2   user:user_1

Doc6  host:machine_2   user:user_4



Doc7  host:machine_1   user:user_4



Is it possible to get facets back that would give the count of  
documents that have common host AND user values (preferably ordered  
- i.e. host then user for this example, so as not to create a  
factorial explosion)? Note that the caller wouldn't know what  
machine and user values exist, only the field names.


I've tried using facet queries in various ways to see if they could  
work for this, but I believe facet queries work on a different plane  
than this requirement (narrowing the term count, a.o.t. aggregating).




For the example above, the desired result would be:



machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)



machine_2/user_1 (2)

machine_2/user_4 (1)



Has anyone had a need for this type of faceting and found a way to  
achieve it?




Many thanks,

Peter









RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi Erik,

 

Thanks for your reply. That's an interesting idea doing it at index-time, and a 
good idea for known field combinations.

The only thing is: how to handle arbitrary field combinations? i.e., to allow the 
caller to specify any combination of fields at query-time.

So, yes, the data is available at index-time, but the combination isn't (short 
of creating fields for every possible combination).

 

Peter


 
> From: erik.hatc...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Aggregated facet value counts?
> Date: Fri, 29 Jan 2010 06:30:27 -0500
> 
> When faced with this type of situation where the data is entirely 
> available at index-time, simply create an aggregated field that glues 
> the two pieces together, and facet on that.
> 
> Erik
> 
> On Jan 29, 2010, at 6:16 AM, Peter S wrote:
> 
> >
> > Hi,
> >
> >
> >
> > I was wondering if anyone had come across this use case, and if this 
> > type of faceting is possible:
> >
> >
> >
> > The requirement is to build a query such that an aggregated facet 
> > count of common (and'ed) field values form the basis of each 
> > returned facet count.
> >
> >
> >
> > For example:
> >
> > Let's say I have a number of documents in an index with, among 
> > others, the fields 'host' and 'user':
> >
> >
> >
> > Doc1 host:machine_1 user:user_1
> >
> > Doc2 host:machine_1 user:user_2
> >
> > Doc3 host:machine_1 user:user_1
> >
> > Doc3 host:machine_1 user:user_1
> >
> >
> >
> > Doc4 host:machine_2 user:user_1
> >
> > Doc5 host:machine_2 user:user_1
> >
> > Doc6 host:machine_2 user:user_4
> >
> >
> >
> > Doc7 host:machine_1 user:user_4
> >
> >
> >
> > Is it possible to get facets back that would give the count of 
> > documents that have common host AND user values (preferably ordered 
> > - i.e. host then user for this example, so as not to create a 
> > factorial explosion)? Note that the caller wouldn't know what 
> > machine and user values exist, only the field names.
> >
> > I've tried using facet queries in various ways to see if they could 
> > work for this, but I believe facet queries work on a different plane 
> > than this requirement (narrowing the term count, a.o.t. aggregating).
> >
> >
> >
> > For the example above, the desired result would be:
> >
> >
> >
> > machine_1/user_1 (3)
> >
> > machine_1/user_2 (1)
> >
> > machine_1/user_4 (1)
> >
> >
> >
> > machine_2/user_1 (2)
> >
> > machine_2/user_4 (1)
> >
> >
> >
> > Has anyone had a need for this type of faceting and found a way to 
> > achieve it?
> >
> >
> >
> > Many thanks,
> >
> > Peter
> >
> >
> >
> >
> > 
> 
  

loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Marc Sturlese

I am testing trunk and have seen a different behaviour when loading
updateProcessors which I don't know if it's normal (at least with multicore).
Previously I used an updateProcessorChain this way:



<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="myChain">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>






It does not work in the current trunk. I have debugged the code and have seen
that UpdateProcessorChain is now loaded via:

  public <T> T initPlugins(List<PluginInfo> pluginInfos, Map<String, T>
registry, Class<T> type, String defClassName) {
    T def = null;
    for (PluginInfo info : pluginInfos) {
      T o = createInitInstance(info, type, type.getSimpleName(),
defClassName);
      registry.put(info.name, o);
      if (info.isDefault()) {
        def = o;
      }
    }
    return def;
  }

As I don't have default="true" in the configuration, my custom
processorChain is not used. Setting default="true" makes it work:



<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="myChain" default="true">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>






As far as I understand, if you specify the chain you want to use in here:


<lst name="defaults">
  <str name="update.processor">myChain</str>
</lst>



it shouldn't be necessary to set it as default.
Is it going to be kept this way?

Thanks in advance



-- 
View this message in context: 
http://old.nabble.com/loading-an-updateProcessorChain-with-multicore-in-trunk-tp27371375p27371375.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Creating values for every possible combination is what you're asking  
Solr to do at query-time, and as far as I know there isn't really a  
way to accomplish that like you're asking.   Is the need really to be  
arbitrary here?


Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:



Hi Erik,



Thanks for your reply. That's an interesting idea doing it at index- 
time, and a good idea for known field combinations.


The only thing is

How to handle arbitrary field combinations? - i.e. to allow the  
caller to specify any combination of fields at query-time?


So, yes, the data is available at index-time, but the combination  
isn't (short of creating fields for every possible combination).




Peter




From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 06:30:27 -0500

When faced with this type of situation where the data is entirely
available at index-time, simply create an aggregated field that glues
the two pieces together, and facet on that.

Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



Hi,



I was wondering if anyone had come across this use case, and if this
type of faceting is possible:



The requirement is to build a query such that an aggregated facet
count of common (and'ed) field values form the basis of each
returned facet count.



For example:

Let's say I have a number of documents in an index with, among
others, the fields 'host' and 'user':



Doc1 host:machine_1 user:user_1

Doc2 host:machine_1 user:user_2

Doc3 host:machine_1 user:user_1

Doc3 host:machine_1 user:user_1



Doc4 host:machine_2 user:user_1

Doc5 host:machine_2 user:user_1

Doc6 host:machine_2 user:user_4



Doc7 host:machine_1 user:user_4



Is it possible to get facets back that would give the count of
documents that have common host AND user values (preferably ordered
- i.e. host then user for this example, so as not to create a
factorial explosion)? Note that the caller wouldn't know what
machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could
work for this, but I believe facet queries work on a different plane
than this requirement (narrowing the term count, a.o.t.  
aggregating).




For the example above, the desired result would be:



machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)



machine_2/user_1 (2)

machine_2/user_4 (1)



Has anyone had a need for this type of faceting and found a way to
achieve it?



Many thanks,

Peter













Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess default="true" should not be necessary if there is only one
updateRequestProcessorChain specified. Please open an issue.

On Fri, Jan 29, 2010 at 6:06 PM, Marc Sturlese  wrote:
>
> I am testing trunk and have seen a different behaviour when loading
> updateProcessors wich I don't know if it's normal (at least with multicore)
> Before I use to use an updateProcessorChain this way:
>
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>    <lst name="defaults">
>       <str name="update.processor">myChain</str>
>    </lst>
> </requestHandler>
> <updateRequestProcessorChain name="myChain">
>    <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> It does not work in current trunk. I have debuged the code and I have seen
> now UpdateProcessorChain is loaded via:
>
>  public <T> T initPlugins(List<PluginInfo> pluginInfos, Map<String, T>
> registry, Class<T> type, String defClassName) {
>    T def = null;
>    for (PluginInfo info : pluginInfos) {
>      T o = createInitInstance(info, type, type.getSimpleName(),
> defClassName);
>      registry.put(info.name, o);
>      if (info.isDefault()) {
>            def = o;
>      }
>    }
>    return def;
>  }
>
> As I don't have default="true" in the configuration, my custom
> processorChain is not used. Setting default="true" makes it work:
>
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>    <lst name="defaults">
>       <str name="update.processor">myChain</str>
>    </lst>
> </requestHandler>
> <updateRequestProcessorChain name="myChain" default="true">
>    <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> As far as I understand, if you specify the chain you want to use in here:
> <lst name="defaults">
>    <str name="update.processor">myChain</str>
> </lst>
>
> Shouldn't be necesary to set it as default.
> Is it going to be kept this way?
>
> Thanks in advance
>
>
>
> --
> View this message in context: 
> http://old.nabble.com/loading-an-updateProcessorChain-with-multicore-in-trunk-tp27371375p27371375.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Well, it wouldn't be 'every' combination - more of 'any' combination at 
query-time.
 
The 'arbitrary' part of the requirement is because it's not practical to 
predict every combination a user might ask for, although generally users would 
tend to search for similar/the same query combinations (but perhaps with 
different date ranges, for example).
 
If 'predicted aggregate fields' were calculated at index-time on, say, 10 
fields (the schema in question actually has 73 fields), that's 3,628,801 new 
fields. A large percentage of these would likely never be used (which ones 
would depend on the user, environment etc.).
 

Perhaps a more 'typical' use case than my network-based example would be a 
product search web page, where you want to show the number of products that are 
made by a manufacturer and within a certain price range (e.g. Sony [$600-$800] 
(15) ). To obtain the (15) facet count value, you would have to correlate the 
number of Sony products (say, (861)), and the products that fall into the [600 
TO 800] price range (say, (1226) ). The (15) would be the intersection of the 
Sony hits and the price range hits by 'manufacturer:Sony'. Am I right that 
filter queries could only do this for document hits if you know the field 
values ahead of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The 
facets could then be derived by simply counting the numFound for each result 
set.
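The intersection described above can be sketched in plain Java with sets of document ids: one set per filter, and the combined facet count is the size of their intersection. The document ids and hit counts below are illustrative only, not real numbers:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of correlating two filter results by set intersection:
// docs matching manufacturer:Sony, docs matching price:[600 TO 800],
// and the intersection count that would drive the combined facet.
public class FacetIntersection {
    static int intersectionCount(Set<Integer> a, Set<Integer> b) {
        Set<Integer> both = new HashSet<>(a);
        both.retainAll(b);  // keep only doc ids present in both result sets
        return both.size();
    }
    public static void main(String[] args) {
        Set<Integer> sony = new HashSet<>();
        Set<Integer> priced = new HashSet<>();
        for (int i = 0; i < 20; i++) sony.add(i);    // illustrative hit lists
        for (int i = 15; i < 40; i++) priced.add(i);
        System.out.println("Sony AND [600 TO 800]: "
                + intersectionCount(sony, priced));  // ids 15..19 overlap
    }
}
```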

 

If there were subsearch support in Solr (i.e. take the output of a query and 
use it as input into another) that included facets [perhaps there is such 
support?], it might be used to achieve this effect.


A custom query parser plugin could work, maybe? I suppose it would need to 
gather up all the separate facets and correlate them according to the input 
query (e.g. host and user, or manufacturer and price range). Such a mechanism 
would be crying out for caching, but perhaps it could leverage the existing 
field and query caches.
 

Peter

 


> From: erik.hatc...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Aggregated facet value counts?
> Date: Fri, 29 Jan 2010 07:39:44 -0500
> 
> Creating values for every possible combination is what you're asking 
> Solr to do at query-time, and as far as I know there isn't really a 
> way to accomplish that like you're asking. Is the need really to be 
> arbitrary here?
> 
> Erik
> 
> On Jan 29, 2010, at 7:25 AM, Peter S wrote:
> 
> >
> > Hi Erik,
> >
> >
> >
> > Thanks for your reply. That's an interesting idea doing it at index- 
> > time, and a good idea for known field combinations.
> >
> > The only thing is
> >
> > How to handle arbitrary field combinations? - i.e. to allow the 
> > caller to specify any combination of fields at query-time?
> >
> > So, yes, the data is available at index-time, but the combination 
> > isn't (short of creating fields for every possible combination).
> >
> >
> >
> > Peter
> >
> >
> >
> >> From: erik.hatc...@gmail.com
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Aggregated facet value counts?
> >> Date: Fri, 29 Jan 2010 06:30:27 -0500
> >>
> >> When faced with this type of situation where the data is entirely
> >> available at index-time, simply create an aggregated field that glues
> >> the two pieces together, and facet on that.
> >>
> >> Erik
> >>
> >> On Jan 29, 2010, at 6:16 AM, Peter S wrote:
> >>
> >>>
> >>> Hi,
> >>>
> >>>
> >>>
> >>> I was wondering if anyone had come across this use case, and if this
> >>> type of faceting is possible:
> >>>
> >>>
> >>>
> >>> The requirement is to build a query such that an aggregated facet
> >>> count of common (and'ed) field values form the basis of each
> >>> returned facet count.
> >>>
> >>>
> >>>
> >>> For example:
> >>>
> >>> Let's say I have a number of documents in an index with, among
> >>> others, the fields 'host' and 'user':
> >>>
> >>>
> >>>
> >>> Doc1 host:machine_1 user:user_1
> >>>
> >>> Doc2 host:machine_1 user:user_2
> >>>
> >>> Doc3 host:machine_1 user:user_1
> >>>
> >>> Doc3 host:machine_1 user:user_1
> >>>
> >>>
> >>>
> >>> Doc4 host:machine_2 user:user_1
> >>>
> >>> Doc5 host:machine_2 user:user_1
> >>>
> >>> Doc6 host:machine_2 user:user_4
> >>>
> >>>
> >>>
> >>> Doc7 host:machine_1 user:user_4
> >>>
> >>>
> >>>
> >>> Is it possible to get facets back that would give the count of
> >>> documents that have common host AND user values (preferably ordered
> >>> - i.e. host then user for this example, so as not to create a
> >>> factorial explosion)? Note that the caller wouldn't know what
> >>> machine and user values exist, only the field names.
> >>>
> >>> I've tried using facet queries in various ways to see if they could
> >>> work for this, but I believe facet queries work on a different plane
> >>> than this requirement (narrowing the term count, a.o.t. 
> >>> aggregating).
> >>>
> >>>
> >>>
> >>> For the example above, the desired result would be:
> >>>
> >>>
> >>>
> >>> machine_1/user_1 (3)
> >>>
> >>> machine_1/user_2 (1)
> >>>
> >>

multi term, multi field, auto suggest

2010-01-29 Thread Lukas Kahwe Smith
Hi,

So over the course of the last two weeks I have been trying to come up with an 
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can 
have German, English, Italian or French names; people have an additional 
firstname field. We also want to do auto-suggest on the street and city names 
as well as on emails and telephone numbers; as such we are treating phone 
numbers as text.

We do have the option for the user to use phonetic searches or to split words 
(especially the German compounds), but I guess we will leave that out of the 
auto-suggest.
We expect that some users will type properly cased strings, while others may 
type all lowercase.
We are using the dismax defType for our normal queries.

There will probably be less than 20M entities.

As such I guess the best approach is to copy all of the above-mentioned fields 
(name, firstname, city, street, email, telefon) into a new field called "all".
It seems the best approach is to use facet.prefix for our requirements. We will 
therefore split off the last term in the query and pass it in as the 
facet.prefix, while the rest is passed in as the q parameter.

Since facets are driven out of the index, we will use the following type 
definition for this "all" field:

  <fieldType name="text_all" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"/>
    </analyzer>
  </fieldType>


So essentially the idea is to just split on whitespace, remove stop words and 
word delimiters.

The query would then look something like the following if the user would enter 
"Kaltenreider Ver":
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver

Does this approach make sense so far?
Do you expect this to perform decently on a dual quad-core machine with 16GB of 
RAM, albeit all of that shared with Apache, a MySQL slave and a PHP app? 
Ah well, questions like that are impossible to answer, so I'm just asking whether 
you expect this to be really heavy. I noticed in my initial testing with 
2M documents on my laptop that facets seemed fine, though the first request was slow 
and memory use spiked to 300MB. But I presume it's just loading stuff into cache, 
and concurrent requests shouldn't cause memory use to go up linearly.

I am still a bit unsure how to handle both the lowercased and the 
case-preserved version:

So here are some examples:
UBS => ubs|UBS
Kreuzstrasse => kreuzstrasse|Kreuzstrasse

So when I type "Kreu" I would get a suggestion of "Kreuzstrasse" and with 
"kreu" I would get "kreuzstrasse".
Since I do not expect any words to start with a lowercase letter and still 
contain an uppercase letter, we should be fine with this approach.

As in, I doubt there would be words like "fooBar", which would lead to 
suggesting both "foobar" and "fooBar".

How can I achieve this?
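One way to get the "UBS => ubs|UBS" behaviour is to index the lowercased variant alongside the original token whenever the two differ. In Solr this would live in a custom TokenFilter; the sketch below shows only the expansion logic itself (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Expansion logic for case-preserving suggestions: index both the
// original token and, when different, its lowercased form, so that
// "Kreu" matches "Kreuzstrasse" and "kreu" matches "kreuzstrasse".
public class CaseVariants {
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        String lower = token.toLowerCase();
        if (!lower.equals(token)) out.add(lower);  // only when casing differs
        return out;
    }
    public static void main(String[] args) {
        System.out.println(expand("UBS"));           // [UBS, ubs]
        System.out.println(expand("kreuzstrasse"));  // [kreuzstrasse]
    }
}
```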

regards,
Lukas Kahwe Smith
m...@pooteeweet.org


Is optimizing always necessary?

2010-01-29 Thread Marcus Herou
If one only has additions, do I then need to optimize the index at all?

I thought that only updates/deletes created "holes" in the index. Or should
the index be sorted on disk at all times; is that the reason?

Cheers

//Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Sounds like what you're asking for is tree faceting.  A basic  
implementation is available in SOLR-792, but one that could also take  
facet.queries, numeric or date range buckets, to tree on would be a  
nice improvement.


Still, the underlying implementation will simply enumerate all the  
possible values (SOLR-792 has some short-circuiting when the top-level  
has zero, of course).  A client-side application could do this with  
multiple requests to Solr.


Subsearch - sure, just make more requests to Solr, rearranging the  
parameters.


I'd still say that in general for this type of need that it'll  
"generally" be less arbitrary and locking some things in during  
indexing will be the pragmatic way to go for most cases.


Erik



On Jan 29, 2010, at 9:28 AM, Peter S wrote:



Well, it wouldn't be 'every' combination - more of 'any' combination  
at query-time.


The 'arbitrary' part of the requirement is because it's not  
practical to predict every combination a user might ask for,  
although generally users would tend to search for similar/the same  
query combinations (but perhaps with different date ranges, for  
example).


If 'predicted aggregate fields' were calculated at index-time on,  
say, 10 fields (the schema in question actually as 73 fields),  
that's 3,628,801 new fields. A large percentage of these would  
likely never be used (which ones would depend on the user,  
environment etc.).



Perhaps a more 'typical' use case than my network-based example  
would be a product search web page, where you want to show the  
number of products that are made by a manufacturer and within a  
certain price range (e.g. Sony [$600-$800] (15) ). To obtain the  
(15) facet count value, you would have to correlate the number of  
Sony products (say, (861)), and the products that fall into the [600  
TO 800] price range (say, (1226) ). The (15) would be the  
intersection of the Sony hits and the price range hits by  
'manufacturer:Sony'. Am I right that filter queries could only do  
this for document hits if you know the field values ahead of time  
(e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could  
then be derived by simply counting the numFound for each result set.




If there were subsearch support in Solr (i.e. take the output of a  
query and use it as input into another) that included facets  
[perhaps there is such support?], it might be used to achieve this  
effect.



A custom query parser plugin could work, maybe? I suppose it would  
need to gather up all the separate facets and correlate them  
according to the input query (e.g. host and user, or manufacturer  
and price range). Such a mechanism would be crying out for caching,  
but perhaps it could leverage the existing field and query caches.



Peter





From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 07:39:44 -0500

Creating values for every possible combination is what you're asking
Solr to do at query-time, and as far as I know there isn't really a
way to accomplish that like you're asking. Is the need really to be
arbitrary here?

Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:



Hi Erik,



Thanks for your reply. That's an interesting idea doing it at index-
time, and a good idea for known field combinations.

The only thing is

How to handle arbitrary field combinations? - i.e. to allow the
caller to specify any combination of fields at query-time?

So, yes, the data is available at index-time, but the combination
isn't (short of creating fields for every possible combination).



Peter




From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 06:30:27 -0500

When faced with this type of situation where the data is entirely
available at index-time, simply create an aggregated field that  
glues

the two pieces together, and facet on that.

Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



Hi,



I was wondering if anyone had come across this use case, and if  
this

type of faceting is possible:



The requirement is to build a query such that an aggregated facet
count of common (and'ed) field values form the basis of each
returned facet count.



For example:

Let's say I have a number of documents in an index with, among
others, the fields 'host' and 'user':



Doc1 host:machine_1 user:user_1

Doc2 host:machine_1 user:user_2

Doc3 host:machine_1 user:user_1

Doc3 host:machine_1 user:user_1



Doc4 host:machine_2 user:user_1

Doc5 host:machine_2 user:user_1

Doc6 host:machine_2 user:user_4



Doc7 host:machine_1 user:user_4



Is it possible to get facets back that would give the count of
documents that have common host AND user values (preferably  
ordered

- i.e. host then user for this example, so as not to create a
factorial explosion)? Note that the caller wouldn't know what
machine and user values exist, only the field names

Re: Is optimizing always necessary?

2010-01-29 Thread Wangsheng Mei
In addition to removing the "holes" in the index, optimization also merges
multiple small index segments into a bigger one.
Although I have no specific performance data, I can imagine that this
leads to performance benefits: supposing you had thousands of small segments,
opening and closing them again and again would be time-consuming.
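The ongoing merge behaviour described above is governed by the mergeFactor setting in solrconfig.xml (an occasional explicit `<optimize/>` update message then collapses everything to a single segment). A hedged sketch of the relevant Solr 1.4-era config, with an illustrative value:

```xml
<!-- Illustrative setting; lower values mean fewer, larger segments
     at the cost of more merging work during indexing -->
<indexDefaults>
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```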

2010/1/30 Marcus Herou 

> If one only have additions do I then need to optimize the index at all ?
>
> I thought that only update/deletes created "holes" in the index. Or should
> the index be sorted on disk at all times, is that the reason ?
>
> Cheers
>
> //Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
>



-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Wangsheng Mei
What's the point of generating your own query?
Are you sure that Solr's query syntax cannot satisfy your need?

2010/1/29 Abin Mathew 

> Hi I want to generate my own customized query from the input string entered
> by the user. It should look something like this
>
> *Search field : Microsoft*
> *
> Generated Query*  :
> description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
> role:microsoft requi
> rement:microsoft company:microsoft city:microsoft)^5.0) tags:microsoft^2.0
> title:microsoft^3.5 functionalArea:microsoft
>
> *The lucene code we used is like this*
> BooleanQuery must = new BooleanQuery();
>
> addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
> addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
> addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
> addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
> addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
> addToBooleanQuery(must, "company", inputData, standardAnalyzer);
> addToBooleanQuery(must, "city", inputData, standardAnalyzer);
> must.setBoost(5.0f);
> query.add(must, Occur.MUST);
> addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
> addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
> addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);
> *
> In Simple english*
> addToBooleanQuery will add the particular field to the query after
> analysing
> using the analyser mentioned and setting a boost as specified
> So there "MUST" be a keyword match with any of the fields
> tags,title,role,description,requirement,company,city and it "SHOULD" occur
> in the fields tags,title and functionalArea.
>
> Hope you have got an idea of my requirement. I am not asking anyone to do
> it
> for me. Please let me know where can i start and give me some useful tips
> to
> move ahead with this. I believe that it has to do with modifying the XML
> configuration file and setting the parameters in Dismax handler. But I am
> still not sure. Please help
>
> Thanks & Regards
> Abin Mathew
>



-- 
梅旺生
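For what it's worth, a query structure like the one above can often be expressed declaratively with the dismax handler rather than hand-built BooleanQueries. A rough, untested sketch of such a handler definition (handler name and boost values are illustrative; field names are taken from the mail above):

```xml
<requestHandler name="/jobsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- the "MUST match at least one of these" fields, with per-field boosts -->
    <str name="qf">tags^1.5 title^3.0 role requirement company city description</str>
  </lst>
</requestHandler>
```

The SHOULD-style clauses (tags^2.0, title^3.5, functionalArea) could then be supplied as bq parameters, either in the defaults or per request.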


Re: boosting unexpired documents

2010-01-29 Thread Wangsheng Mei
I think you can combine several of the standard function queries that Solr
supplies to achieve this.

similar to:
&bf=map(map(div(ms(NOW, expiration),8640),-1,0,0), 1,1,1)

Alternatively, you could implement your own function and register it in
solrconfig.xml using the valueSourceParser tag.

2010/1/29 Andy 

> Ah, thank you!
>
>
>
> --- On Fri, 1/29/10, Lance Norskog  wrote:
>
> > From: Lance Norskog 
> > Subject: Re: boosting unexpired documents
> > To: solr-user@lucene.apache.org
> > Date: Friday, January 29, 2010, 12:32 AM
> > You add a range query on the date,
> > and boost documents within that
> > date range. Check out the 'boost query' feature of dismax.
> >
> > http://www.lucidimagination.com/search/document/CDRG_ch07_7.4.2.9
> >
> > It's also possible with the standard query parser but a
> > pain in the neck:
> >
> > (value)^2 OR (NOT value)
> >
> >
> >
> > On Thu, Jan 28, 2010 at 6:58 PM, Andy 
> > wrote:
> > > My documents have a field "expiration" that is the
> > expiration date of that doc.
> > >
> > > I want to give a boost to all documents that haven't
> > expired. I still want to have expired documents returned,
> > but unexpired documents should be given priority.
> > >
> > > Ideally the boost amount for all unexpired documents
> > should be the same. i.e. whether the expiration date is
> > tomorrow or a month from now wouldn't make a difference.
> > Likewise, all expired documents should be treated the same,
> > whether it expired yesterday or a year ago.
> > >
> > > Is that something possible? I read
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
> > but that's not quite what I want.
> > >
> > >
> > >
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>
>
>
>


-- 
梅旺生
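The range-query boost Lance describes can be sketched as dismax request parameters like the following; the flat ^2 boost treats every unexpired document identically, which matches the requirement above (parameter values are illustrative, untested):

```text
q=ipod&qt=dismax&bq=expiration:[NOW TO *]^2
```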


Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Document Duplication Detection

(Solr 1.4)

Contents

   1. Document Duplication Detection
   2. Overview
      1. Goals
      2. Design
   3. Notes
   4. Configuration
      1. solrconfig.xml
         1. Note
      2. Settings

 Overview

Preventing duplicate or near duplicate documents from entering an index or
tagging documents with a signature/fingerprint for duplicate field
collapsing can be efficiently achieved with a low collision or fuzzy hash
algorithm. Solr should natively support deduplication techniques of this
type and allow for the easy addition of new hash/signature implementations.

Goals

   - Efficient, hash based exact/near document duplication detection and
   blocking.
   - Allow for both duplicate collapsing in search results as well as
   deduplication on adding a document.

 Design

Signature

A class capable of generating a signature String from the concatenation of a
group of specified document fields.

public abstract class Signature {
  public void init(SolrParams nl) {
  }

  public abstract String calculate(String content);
}

Implementations:

MD5Signature

128 bit hash used for exact duplicate detection.

Lookup3Signature 

64 bit hash used for exact duplicate detection, much faster than MD5 and
smaller to index

TextProfileSignature 

Fuzzy hashing implementation from Nutch for near-duplicate detection. It's
tunable but works best on longer text.

There are other more sophisticated algorithms for fuzzy/near hashing that
could be added later.
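To make the contract above concrete, here is an illustrative, standalone sketch of an exact-hash signature in the spirit of MD5Signature. It deliberately does not extend the Solr Signature class so it can run on its own; the class name and standalone shape are assumptions for the example, not Solr code:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

// Illustrative sketch of an exact-duplicate signature, similar in spirit
// to MD5Signature. A real implementation would extend
// org.apache.solr.update.processor.Signature instead of standing alone.
public class Md5SignatureSketch {

    // Hash the concatenated field content down to a 128-bit hex string.
    public String calculate(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(content.getBytes("UTF-8"));
            // Left-pad to 32 hex chars so leading zero bytes are preserved.
            return String.format("%032x", new BigInteger(1, hash));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Md5SignatureSketch sig = new Md5SignatureSketch();
        // Identical input always yields the identical signature, which is
        // what makes it usable as an update Term for deduplication.
        System.out.println(sig.calculate("hello")); // 5d41402abc4b2a76b9719d911017c592
        System.out.println(sig.calculate("name,features,cat"));
    }
}
```

A fuzzy implementation such as TextProfileSignature differs mainly in that it quantizes token frequencies before hashing, so small textual changes can still map to the same signature.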

Notes

Adding in the dedupe process will change the allowDups setting so that it
applies to an update Term (with field signatureField in this case) rather
than the unique field Term (of course, the signatureField could be the unique
field, but generally you want the unique field to be unique).

When a document is added, a signature will automatically be generated and
attached to the document in the specified signatureField.

Configuration

solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in
solrconfig.xml as part of an UpdateRequestProcessorChain:

Accepting all defaults:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory" />
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Example settings:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">id</str>
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

 Note

Also be sure to change your update handlers to use the defined chain, i.e.

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

The update processor can also be specified per request with a parameter of
update.processor=dedupe

Settings

signatureClass (default: org.apache.solr.update.processor.Lookup3Signature)
A Signature implementation for generating a signature hash.

fields (default: all fields)
The fields to use to generate the signature hash, as a comma-separated list.
By default, all fields on the document will be used.

signatureField (default: signatureField)
The name of the field used to hold the fingerprint/signature. Be sure the
field is defined in schema.xml.

enabled (default: true)
Enable/disable dedupe factory processing.


-- 
梅旺生


Re: Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Sorry for sending the wrong message; this should have gone to my own mailbox  :(

2010/1/30 Wangsheng Mei 




-- 
梅旺生


Deleting spelll checker index

2010-01-29 Thread darniz

Hello all,
We are using an index-based spell checker.
I was wondering whether, with the help of any URL parameters, we can delete
the spell check index directory.
Please let me know.
Thanks,
darniz


-- 
View this message in context: 
http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27376823.html
Sent from the Solr - User mailing list archive at Nabble.com.



Auto Suggest with multiple space separated words

2010-01-29 Thread Nair, Manas
Hi Experts,
 
I need auto-suggest functionality using Solr that gives me the feel of 
using the Firefox browser. In short, if I type in a prefix, the results should 
drop down even if the prefix is not at the start of the drop-down items.
 
Example: If I search for Lin, then the results could be 
[Abe Lincoln, Lindsay Lohan, Sarah Palin, Gasoline, ...].
 
Any help is greatly appreciated.
 
Thankyou,
Manas Nair
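One common approach to this kind of infix matching is an n-gram analyzed field, so that a query for "lin" matches substrings of the indexed values. A rough schema.xml sketch (the field type name and gram sizes are illustrative, not from the original mail):

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index substrings so "lin" matches Lincoln, Lindsay, Palin, Gasoline -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The query-side analyzer intentionally omits the n-gram filter, so the user's prefix is matched as typed against the indexed grams.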


distributed search and failed core

2010-01-29 Thread Joe Calderon
hello *, in distributed search, when a shard goes down an error is
returned and the search fails. Is there a way to avoid the error and
return the results from the shards that are still up?

thx much

--joe


Re: Basic questions about Solr cost in programming time

2010-01-29 Thread Sven Maurmann

Hi!

Of course the answer depends (as usual) very much on the features
you want to realize. But Solr can be set up very fast. When we created
our first prototype, it took us about a week to get it running with
phonetic search, spell checking, and faceting, and even collapsing
(using the famous SOLR-236 patch).

It is definitely very nice that you can do a lot of things using the
available components and only configuring them inside solrconfig.xml
and schema.xml.

And you may well start with the standard distribution.

Cheers,
   Sven

--On Dienstag, 26. Januar 2010 12:00 -0800 Jeff Crump 
 wrote:



Hi,
I hope this message is OK for this list.

I'm looking into search solutions for an intranet site built with Drupal.
Eventually we'd like to scale to enterprise search, which would include
the Drupal site, a document repository, and Jive SBS (collaboration
software). I'm interested in Lucene/Solr because of its scalability,
faceted search and optimization features, and because it is free. Our
problem is that we are a non-profit organization with only three very
busy programmers/sys admins supporting our employees around the world.

To help me argue for Solr in terms of total cost, I'm hoping that members
of this list can share their insights about the following:

* About how many hours of programming did it take you to set up your
instance of Lucene/Solr (not counting time spent on optimization)?

* Are there any disadvantages of going with a certified distribution
rather than the standard distribution?


Thanks and best regards,
Jeff

Jeff Crump
jcr...@hq.mercycorps.org


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Tree faceting - that sounds very interesting indeed. I'll have a look into that 
and see how it fits, as well as any improvements for adding facet queries, 
cross-field aggregation, date range etc. There could be some very nice 
use-cases for such functionality. Just wondering how this would work with 
distributed shards/multi-core...


Many Thanks! 

Peter

 

 
> From: erik.hatc...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Aggregated facet value counts?
> Date: Fri, 29 Jan 2010 12:20:07 -0500
> 
> Sounds like what you're asking for is tree faceting. A basic 
> implementation is available in SOLR-792, but one that could also take 
> facet.queries, numeric or date range buckets, to tree on would be a 
> nice improvement.
> 
> Still, the underlying implementation will simply enumerate all the 
> possible values (SOLR-792 has some short-circuiting when the top-level 
> has zero, of course). A client-side application could do this with 
> multiple requests to Solr.
> 
> Subsearch - sure, just make more requests to Solr, rearranging the 
> parameters.
> 
> I'd still say that in general for this type of need that it'll 
> "generally" be less arbitrary and locking some things in during 
> indexing will be the pragmatic way to go for most cases.
> 
> Erik
> 
> 
> 
> On Jan 29, 2010, at 9:28 AM, Peter S wrote:
> 
> >
> > Well, it wouldn't be 'every' combination - more of 'any' combination 
> > at query-time.
> >
> > The 'arbitrary' part of the requirement is because it's not 
> > practical to predict every combination a user might ask for, 
> > although generally users would tend to search for similar/the same 
> > query combinations (but perhaps with different date ranges, for 
> > example).
> >
> > If 'predicted aggregate fields' were calculated at index-time on, 
> > say, 10 fields (the schema in question actually as 73 fields), 
> > that's 3,628,801 new fields. A large percentage of these would 
> > likely never be used (which ones would depend on the user, 
> > environment etc.).
> >
> >
> > Perhaps a more 'typical' use case than my network-based example 
> > would be a product search web page, where you want to show the 
> > number of products that are made by a manufacturer and within a 
> > certain price range (e.g. Sony [$600-$800] (15) ). To obtain the 
> > (15) facet count value, you would have to correlate the number of 
> > Sony products (say, (861)), and the products that fall into the [600 
> > TO 800] price range (say, (1226) ). The (15) would be the 
> > intersection of the Sony hits and the price range hits by 
> > 'manufacturer:Sony'. Am I right that filter queries could only do 
> > this for document hits if you know the field values ahead of time 
> > (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could 
> > then be derived by simply counting the numFound for each result set.
> >
> >
> >
> > If there were subsearch support in Solr (i.e. take the output of a 
> > query and use it as input into another) that included facets 
> > [perhaps there is such support?], it might be used to achieve this 
> > effect.
> >
> >
> > A custom query parser plugin could work, maybe? I suppose it would 
> > need to gather up all the separate facets and correlate them 
> > according to the input query (e.g. host and user, or manufacturer 
> > and price range). Such a mechanism would be crying out for caching, 
> > but perhaps it could leverage the existing field and query caches.
> >
> >
> > Peter
> >
> >
> >
> >
> >> From: erik.hatc...@gmail.com
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Aggregated facet value counts?
> >> Date: Fri, 29 Jan 2010 07:39:44 -0500
> >>
> >> Creating values for every possible combination is what you're asking
> >> Solr to do at query-time, and as far as I know there isn't really a
> >> way to accomplish that like you're asking. Is the need really to be
> >> arbitrary here?
> >>
> >> Erik
> >>
> >> On Jan 29, 2010, at 7:25 AM, Peter S wrote:
> >>
> >>>
> >>> Hi Erik,
> >>>
> >>>
> >>>
> >>> Thanks for your reply. That's an interesting idea doing it at index-
> >>> time, and a good idea for known field combinations.
> >>>
> >>> The only thing is
> >>>
> >>> How to handle arbitrary field combinations? - i.e. to allow the
> >>> caller to specify any combination of fields at query-time?
> >>>
> >>> So, yes, the data is available at index-time, but the combination
> >>> isn't (short of creating fields for every possible combination).
> >>>
> >>>
> >>>
> >>> Peter
> >>>
> >>>
> >>>
>  From: erik.hatc...@gmail.com
>  To: solr-user@lucene.apache.org
>  Subject: Re: Aggregated facet value counts?
>  Date: Fri, 29 Jan 2010 06:30:27 -0500
> 
>  When faced with this type of situation where the data is entirely
>  available at index-time, simply create an aggregated field that 
>  glues
>  the two pieces together, and facet on that.
> 
>  Erik
> 
>  On Jan 29, 2010, at 6:16 AM, Peter S wrote:
> 
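As a concrete sketch of the multiple-requests idea mentioned above, the Sony/price intersection count from earlier in the thread can also be fetched in a single request with a facet query over the conjunction (field names from the example; untested against this schema):

```text
q=*:*&rows=0&facet=true&facet.query=manufacturer:Sony AND price:[600 TO 800]
```

Each arbitrary field-value combination becomes one facet.query, and its count is exactly the intersection of the two constraints.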

sort items by whether the user has viewed it or not

2010-01-29 Thread a8910b-solr
hi,

I want to query for documents that have certain values, but I want the results
first sorted by documents that this person has viewed in the past.  I can't store
each user's view information in the document, so I want to pass that in to the
search.  Is it possible to do something like this:

http://solr?q=baseball&sort=doc_isbn("ABC" or "DEF" or "GHI") desc, title desc

any help is appreciated,
r
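The sort syntax sketched above does not exist in Solr, but a similar effect is often achieved by boosting the viewed documents and sorting by relevance; a rough sketch (ISBN values passed in per request, boost value illustrative):

```text
q=baseball&qt=dismax&bq=doc_isbn:(ABC OR DEF OR GHI)^10&sort=score desc, title desc
```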



Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Chris Hostetter


: I guess default=true should not be necessary if there is only one
: updateRequestProcessorChain specified. Open an issue.

No... that doesn't seem right.  If you declare your own chains, but you 
don't mark any of them as default="true", then it shouldn't matter how many 
of them you declare; SolrCore should create a default for you.


The real question here is: why isn't he getting his explicitly defined 
chain when he references it by name?


declaring that he wants his explicitly named chain to be the default is 
fine, and that should work, but w/o declaring it as the default he should 
still be able to ask for it by name ... why isn't that working? ...


: > <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
: >   <lst name="defaults">
: >     <str name="update.processor">myChain</str>
: >   </lst>
: > </requestHandler>

Marc, can you confirm that when you don't declare your chain as 
default="true" that...
1) an instance of your CustomUpdateProcessorFactory is actually getting 
instantiated by solr (via logging or running in a debugger)
2) whether your custom chain is used if you pass update.processor=myChain 
as a request param instead of relying on the configured defaults


(I wonder if some handler refactoring caused the default 
processing logic to no longer respect the defaults)




-Hoss

Re: update doc success, but could not find the new value

2010-01-29 Thread Chris Hostetter

: Subject: update doc success, but could not find the new value
: In-Reply-To: <449216.59315...@web56308.mail.re3.yahoo.com>
: References: <27335403.p...@talk.nabble.com>
: <449216.59315...@web56308.mail.re3.yahoo.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



RE: Solr wiki link broken

2010-01-29 Thread Chris Hostetter

: Why don't we change the links to have "FrontPage" explicitly?
: Wouldn't it be the easiest fix unless there are numerous
: other pages that references the default page w/o "FrontPage"?

I'm fairly confident that there are more links pointing to 
http://wiki.apache.org/solr/ than there are alternate versions in 
different languages ... particularly when you start factoring in all of 
the webpages in the world that we don't have the ability to edit 
directly.


-Hoss



Re: NullPointerException in ReplicationHandler.postCommit + question about compression

2010-01-29 Thread Chris Hostetter

: never keep a 0.
: 
: It is better to not mention the deletionPolicy at all. The
: defaults are usually fine.

if setting the "keep" values to 0 results in NPEs we should do one (if not 
both) of the following...

1) change the init code to warn/fail if the values are 0 (not sure if 
there is ever a legitimate use for 0 as a value)

2) change the code that's currently throwing an NPE to check its 
assumptions and log a more meaningful error if it can't function because of 
the existing config.


-Hoss



RE: How to Implement SpanQuery in Solr . . ?

2010-01-29 Thread Chris Hostetter

: and Solr. I was hoping to start by getting a simple example working in SOLR
: and then iterate towards the more complex, given this is my first attempt at
: extending Solr.

wise choice.

: For my first iteration of SpanQuery in Solr I am thinking of starting with a
: simple syntax to combine:

...honestly: since you already mentioned that you might eventually want to 
integrate Qsol, i would suggest you start with that directly.  that way 
you are taking an existing parser (that you evidently understand) and just 
hooking it in via the QParser abstraction (as opposed to writing a 
Lucene String->Query parser *and* learning the QParser/Solr internals).

: implementation on the Lucene side and the FooQParserPlugin as a reference
: implementation on the SOLR side?

The FooQParserPlugin is fairly primitive and doesn't really make obvious 
some of the things you can do with a QParser, so you may also want 
to skim the LuceneQParserPlugin as well.

: The other part of the riddle I would really appreciate some guidance on is
: how to get it to plug-in to SOLR correctly?

http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin


-Hoss



Re: Solr Cache Viewing/Browsing

2010-01-29 Thread Chris Hostetter
: used in a modified DisMaxHandler) and I was wondering if there is a way to
: get at this data from the JSP pages? I then thought that it might be nice to
: view more information about the respective caches like the current elements,
: recently evicted etc to help debug performance issues. Has anyone worked on
: this or have any ideas surrounding this?

I don't believe anyone has looked into this.

It would be hard to implement in a generic manner since the SolrCache API 
doesn't provide any mechanism for inspecting the contents, but you could 
write an implementation that exposes some of these things through the 
getStatistics method (or some other new introspection-based API)



-Hoss



Re: replication setup

2010-01-29 Thread Chris Hostetter

: Subject: replication setup
: In-Reply-To: <83ec2c9c1001260724t110d6595m5071e0a40e1b1...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists




-Hoss



Re: Analysis tool vs search query

2010-01-29 Thread Chris Hostetter

: I've run into this issue that I have no way of resolving, since the analysis
: tool doesn't show me there is an error. I copy the exact field value into
: the analysis tool and i type in the exact query request i'm issuing and the
: tool finds it a match. However running the query with that exact same

the analysis tool doesn't do query parsing .. so pasting a *query* 
string into the analysis tool isn't going to give you any meaningful 
information.

what the "query" section of the analysis tool lets you do is see what the 
"query time analyzer" (that is used by most query parsers at query time) 
will do with your input ... but the QueryParser is still in control, and 
it decides which input to pass to your analyser -- special characters 
(like whitespace) have meaning to most query parsers, before they ever 
have a chance of getting passed to the analyzer.

: 

A keyword tokenizer results in a single token for each input string, but 
the (default) query parser is going to chunk the input up on whitespace 
before the analyzer is ever invoked, unless you put it in a quoted string.


-Hoss



Re: using bq with standard request handler

2010-01-29 Thread Chris Hostetter

: I am using a query like:
: 
http://localhost:8080/solr/select?q=product_category:Grocery&fq=in_stock:true&debugQuery=true&;
: sort=display_priority+desc,prop_l_price+asc
...
: Is it possible to use display_priority/price fields in bq itself to achieve
: the same result? I tried forming some queries for that but was unable to get
: the desired results...

bf and bq are features of the dismax parser, so the default query parser 
won't use them -- it really wouldn't even make sense as a possible 
new feature, because the types of queries that might be specified using 
the lucene QParser are too broad to be able to define a consistent 
mechanism for knowing how/where to add the boosting queries to the 
structure.

if all of your queries have that identical structure, you might however consider 
something like...

http://localhost:8080/solr/select?qf=product_category&q=Grocery&fq=in_stock:true&bq=...


-Hoss



Re: Mail config

2010-01-29 Thread Chris Hostetter

: I do not want to receive all the emails from this mail list, I only want to
: receive the answers to my questions, is this possible?

That's not how mailing lists work.  If you want to participate in the 
community, you have to participate fully.

: If I am not mistaken when I unsubscribed I sent an email which did not reach
: the mail list at all (therefore there was of course no chance to get any
: replies).

The same mechanism that prevents you from posting when you are not 
subscribed is the mechanism that prevents thousands of spam messages from 
getting sent to the list every day .. you have to take the "bad" with the 
good.

: I am newbie for Solr and I doubt I can contribute much by answering to other
: posts.

But you can learn from those posts, and the discussion/responses they 
stimulate...
http://people.apache.org/~hossman/#private_q




-Hoss



Re: Lock problems: Lock obtain timed out

2010-01-29 Thread Chris Hostetter

: Can anyone think of a reason why these locks would hang around for more than
: 2 hours?
: 
: I have been monitoring them and they look like they are very short lived.

Typically the lock files are only left around for more than a few seconds 
when there was a fatal crash of some kind ... an OOM Error for example, or 
as already mentioned in this thread...

: >> > > SEVERE: java.io.IOException: No space left on device

...if you check your solr logs for messages in the immediate time frame 
following the the lastModified time of the lock file you'll probably find 
something interesting.


-Hoss



Re: scenario with FQ parameter

2010-01-29 Thread Chris Hostetter

:&qf=field1^10 field2^20 field^100&fq=*:9+OR+(field1:"xyz")
...
: I know I can use copy field (say 'text') to copy all the fields and then
...
: but doing so , the boost weights specified in the 'qf' field have no effect
: on the score.

An FQ never has any impact on the score, so your question is a bit 
confusing.

If you want to influence the scores, you'll need to use "bq" instead of 
"fq".

as discussed in another current thread on this list, it's possible to make 
the "bq" param use the dismax parser as well, but there are some tricky 
issues involved with that ... unless your use case is actually more 
complicated than you are describing, you should probably just use 
something like...

...&qf=field1^10+field2^20+field^100&bq=field1:9^10+field2:9^20+field:9^100+field1:xyz

-Hoss



Re: How can I boost bq in FieldQParserPlugin?

2010-01-29 Thread Chris Hostetter

: q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=}&qq=12345&qt=dismax&debugQuery=on
: 
: I try to debug the above query, it turned out to be as:
: +DisjunctionMaxQuery((content:ipod | title:ipod^4.0)~0.01) ()
: +DisjunctionMaxQuery((userId:12345^0.5)~0.01)

...hmmm, i'm not sure why that's happening, but it certainly seems like a 
bug -- i just have no idea what that bug is.  

the inner dismax parser should definitely be producing a query where the 
DisjunctionMaxQuery for "12345" is "mandatory" but that mandatory clause 
should be wrapped inside of another boolean query which should be added to 
the outermost query as an "optional" clause.

somewhere that BooleanQuery produced by the inner dismax parser is getting 
thrown away ... hmmm, actually this is a necessary behavior of the
DismaxQParser for some cases (it sheds its own outermost 
BooleanQuery when not needed), but in this case it's screwing you because 
it doesn't realize you really do need it.

does this work better? ...

q=ipod&bq={!dismax qf=userId^0.5 v=$qq 
bq=*:*^0}&qq=12345&qt=dismax&debugQuery=on

...it's kind of kludgy, but it should guarantee that the wrapping 
BooleanQuery is preserved.



-Hoss



Re: Large Query Strings and performance

2010-01-29 Thread Chris Hostetter

: I am using Solr 1.4 with large query strings with 20+ terms and faceting on
: a single multi-valued field in a 1 million record system. I am using Solr to
: categorize text; that's why the query strings are big.
: 
: The performance gets worse the more search terms are used.  Is there any

can you elaborate more on the types of query strings you are using? ... 
are they simply BooleanQueries consisting of many terms? ... are they all 
optional?

We have to understand your goal, what exactly you are currently doing, and 
what exactly you have already tried, before we can suggest ways of 
achieving your goal faster than the things you've already tried.



-Hoss



Re: request handler defaults

2010-01-29 Thread Chris Hostetter

: I have noticed that atm there doesnt seem to be a way to inherit request 
: handler definitions. This would be nice to be able to define some basic 
: requesthandlers (maybe even with the option of defining them "abstract") 

The idea has been discussed before...
   https://issues.apache.org/jira/browse/SOLR-112 ...as you note in the 
latter comments, we ultimately hit a wall in trying to determine a 
"generic" way to merge the init params for any arbitrary RequestHandler -- 
things like default/invariant/appends are conventions of the OOTB 
handlers, but not all handlers are guaranteed to support them.

It's the kind of thing that individual handlers could easily implement 
internally (especially now that they have a way of being SolrCoreAware and 
asking for other handlers by name, which they could then ask for whatever 
data they needed to initialize themselves).

: Furthermore it would be nice to be able to dynamically append things in 
: a request. For example I run a search on the companies dismax handler 
: and I find no (or just very few) result, then I want to also include a 
: field that has a doublemethaphone analyzer on the name. So I just want 
: to append that field to the qf setting of the request handler defaults.

...isn't qf a multiValued param? ... as long as you declare it in the 
"appends" init set instead of "defaults" it should work exactly the way 
you describe.
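
In solrconfig.xml terms, that would be something along these lines (an untested sketch; the handler name and field names here are made up for illustration):

```xml
<!-- untested sketch: the base qf lives under "appends", so a qf
     param sent with the request is added to it instead of
     replacing it -->
<requestHandler name="/companies" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
  </lst>
  <lst name="appends">
    <str name="qf">name^2.0 description</str>
  </lst>
</requestHandler>
```

A fallback request could then send qf=name_phonetic (a made-up field name) to widen the search without losing the base fields.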



-Hoss



Re: Master Read Timeout

2010-01-29 Thread Chris Hostetter

: Is there any way to increase the Slave's timeout value? Are there any 

http://wiki.apache.org/solr/SolrReplication?highlight=%28timeout%29


-Hoss



RE: matching exact/whole phrase

2010-01-29 Thread Chris Hostetter

: Is it safe to say in order to do exact matches the field should be a string.

It depends on your definition of "exact"

If you want exact matches, including unicode codepoints and 
leading/trailing whitespace, then StrField would probably make sense -- 
but you could equally use TextField with a KeywordTokenizer and nothing 
else.

If you want *some* normalization (ie: trim leading/trailing whitespace, 
map equivalent codepoints to a canonical representation, etc...) then you 
need TextField.
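
For the normalized flavor, a field type along these lines might work (an untested sketch; the type name is made up):

```xml
<!-- untested sketch: one token per value, trimmed and lowercased,
     so "exact" matching survives whitespace and case differences -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```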

: Now in my dismax handler if i have the qf defined as text field and run a
: phrase search on text field
: "my car is the best car in the world"
: i dont get back any results. looking with debugQuery=on this is the
: parsedQuery
: text:"my tire pressure warning light came my honda civic"
: This will not work since text was indexed by removing all stop words.

it *can* work if the query analyzer for your text field type is also 
configured to remove stopwords, and if you either: configure the 
StopFilter(s) to deal with token positions in the way the parser expects 
(i forget which one works, you have to play with it); OR use a "qs" (query 
slop) value that gives you enough slop to miss those empty stop word gaps.
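
The qs variant would look something like this in the handler's defaults (untested; the value 2 is just a starting point to experiment with):

```xml
<!-- untested sketch: phrase slop lets a phrase query match across
     the position gaps left by removed stopwords -->
<lst name="defaults">
  <str name="defType">dismax</str>
  <str name="qs">2</str>
</lst>
```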


-Hoss



Re: Deleting spelll checker index

2010-01-29 Thread Chris Hostetter

: We are using Index based spell checker.
: i was wondering with the help of any url parameters can we delete the spell
: check index directory.

I don't think so.

You might be able to configure two different spell check components that 
point at the same directory -- one that builds off of a real field, and one 
that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
then you could trigger a rebuild of an empty spell checking index using 
the second component.

But i've never tried it so i have no idea if it would work.
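
Spelled out in solrconfig.xml, the idea might look like this (again, an untested sketch; all names here are made up, and the empty source field would need to exist in the schema):

```xml
<!-- untested sketch: two spellcheckers share one index directory;
     rebuilding "reset" from an empty field effectively clears it -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">reset</str>
    <str name="field">empty_text</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>
```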


-Hoss



DataImportHandler multivalued field Collection not working

2010-01-29 Thread Jason Rutherglen
DataImportHandler multivalued field Collection isn't
working the way I'd expect, meaning not at all. I logged that the
collection is there, however the multivalued collection field
just isn't being indexed (according to the DIH web UI, and it's
not in the index).


Re: DataImportHandler multivalued field Collection not working

2010-01-29 Thread Wangsheng Mei
Did you correctly set multiValued (not multivalue) = "true" in schema.xml?
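
For reference, a minimal schema.xml declaration for a multi-valued field might look like this (the field name and type are made up for illustration):

```xml
<!-- the attribute name is case-sensitive: multiValued -->
<field name="tags" type="string" indexed="true" stored="true"
       multiValued="true"/>
```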

2010/1/30 Jason Rutherglen 

> DataImportHandler multivalued field Collection isn't
> working the way I'd expect, meaning not at all. I logged the
> collection is there, however the multivalue collection field
> just isn't being indexed (according to the DIH web UI and it's
> not in the index).
>



-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Abin Mathew
Hi, I realized the power of the Dismax Query Handler recently and now I
don't need to generate my own query since Dismax is giving better
results. Thanks a lot

2010/1/29 Wangsheng Mei :
> What's the point of generating your own query?
> Are you sure that solr query syntax cannot satisfy your need?
>
> 2010/1/29 Abin Mathew 
>
>> Hi I want to generate my own customized query from the input string entered
>> by the user. It should look something like this
>>
>> *Search field : Microsoft*
>> *
>> Generated Query*  :
>> description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
>> role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0)
>> tags:microsoft^2.0
>> title:microsoft^3.5 functionalArea:microsoft
>>
>> *The lucene code we used is like this*
>> BooleanQuery must = new BooleanQuery();
>>
>> addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
>> addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
>> addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
>> addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
>> addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
>> addToBooleanQuery(must, "company", inputData, standardAnalyzer);
>> addToBooleanQuery(must, "city", inputData, standardAnalyzer);
>> must.setBoost(5.0f);
>> query.add(must, Occur.MUST);
>> addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
>> addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
>> addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);
>> *
>> In Simple english*
>> addToBooleanQuery will add the particular field to the query after
>> analysing
>> using the analyser mentioned and setting a boost as specified
>> So there "MUST" be a keyword match with any of the fields
>> tags,title,role,description,requirement,company,city and it "SHOULD" occur
>> in the fields tags,title and functionalArea.
>>
>> Hope you have got an idea of my requirement. I am not asking anyone to do
>> it
>> for me. Please let me know where I can start and give me some useful tips
>> to
>> move ahead with this. I believe that it has to do with modifying the XML
>> configuration file and setting the parameters in Dismax handler. But I am
>> still not sure. Please help
>>
>> Thanks & Regards
>> Abin Mathew
>>
>
>
>
> --
> 梅旺生
>
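
For reference, the per-field boosts in the Lucene code above translate to a dismax configuration roughly like this (an untested sketch; dismax's qf is a per-field disjunction, so it won't reproduce the MUST/SHOULD grouping exactly, and the handler name is made up):

```xml
<!-- untested sketch: weights copied from the generated query above -->
<requestHandler name="/jobsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">
      tags^2.0 title^3.5 role requirement company city description
      functionalArea
    </str>
  </lst>
</requestHandler>
```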


Looking for a Solr volunteer for www.comics.org

2010-01-29 Thread Henry Andrews
Hi folks,
  I apologize if this isn't the right place to post this (alternate suggestions 
welcome alongside appropriate chastisement :-)

  I'm trying to recruit a volunteer to implement a Solr-based search system for 
the Grand Comic-Book Database (http://www.comics.org/).  We're a non-profit, 
non-commercial, international group researching and indexing comic books, and 
we have only two active programmers (we're both unpaid volunteers, as are all 
GCD personnel).  We'd love to have better search, and Solr looks like the right 
tool, but we're swamped with other technical work.

  So if anyone reading this would like to help out a comic book-related web 
site with their Solr experience, for absolutely no monetary compensation 
whatsoever, do please let me know :-D  It would help to be into comic books, 
but that's not strictly required.  Your work would be used quite heavily, and 
you could of course point that out to anyone you might wish to impress with 
your expertise.  Our technical work is open-source, and therefore available for 
inspection and showing off.

  To clarify:  I'm not looking for assistance with or pointers about setting 
Solr up myself (no matter how easy it is).  And I'm not trying to get the list 
as a whole to do our work for us.  I'm just trying to find if any individual 
feels like joining our tech team and volunteering for the project and couldn't 
think of a more likely place to find candidates than here.  If we don't find a 
volunteer, I'll end up doing it next year, and I'll be reading a lot more 
documentation before asking any questions here.

thanks,
-henry



Re: Deleting spelll checker index

2010-01-29 Thread darniz

Then I assume the easiest way is to delete the directory itself.

darniz


hossman wrote:
> 
> 
> : We are using Index based spell checker.
> : i was wondering with the help of any url parameters can we delete the
> spell
> : check index directory.
> 
> I don't think so.
> 
> You might be able to configure two different spell check components that 
> point at the same directory -- one that builds off of a real field, and one 
> that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
> then you could trigger a rebuild of an empty spell checking index using 
> the second component.
> 
> But i've never tried it so i have no idea if it would work.
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27381620.html
Sent from the Solr - User mailing list archive at Nabble.com.