Re: Delete from Solr index...

2010-01-29 Thread vnchoudhary

I am looking for the following solutions in C#; please provide sample code if
possible:

1. Delete the entire index using a delete query.
2. Back up all of the old index before regenerating.
3. Write a negative (NOT) query on a field to delete stale documents.
4. How can I use a transaction around index generation (delete the old index and
regenerate), so that if any error occurs it will not affect the old
index?





ryantxu wrote:
> 
> escher2k wrote:
>> I am trying to remove documents from my index using "delete by query".
>> However when I did this, the deleted
>> items seem to remain. This is the format of the XML file I am using -
>> 
>> <delete>
>>   <query>load_id:20070424150841</query>
>>   <query>load_id:20070425145301</query>
>>   <query>load_id:20070426145301</query>
>>   <query>load_id:20070427145302</query>
>>   <query>load_id:20070428145301</query>
>>   <query>load_id:20070429145301</query>
>> </delete>
>> 
>> When I do the deletes individually, it seems to work (i.e. create each of
>> the above in a separate file). Does this
>> mean that each delete query request has to be executed separately ?
>> 
> 
> correct, delete (unlike <add>) only accepts one command.
> 
> Just to note, if "load_id" is your unique key, you could also use:
>   <id>20070424150841</id>
> 
> This will give you better performance and does not commit the changes 
> until you explicitly send <commit/>.
> 
> 
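The archive stripped the XML tags from the update messages discussed above. As a reference, here is a small plain-Java sketch that builds the three message shapes in question: delete-by-query, the faster delete-by-id for a unique key, and the explicit commit. The element names are standard Solr XML update syntax; the helper class itself is illustrative, not from the thread.

```java
// Sketch of the Solr XML update messages discussed in this thread.
public class SolrDeleteMessages {
    // One delete-by-query command per message, as ryantxu notes.
    static String deleteByQuery(String query) {
        return "<delete><query>" + query + "</query></delete>";
    }
    // Delete by unique key, which performs better.
    static String deleteById(String id) {
        return "<delete><id>" + id + "</id></delete>";
    }
    // Deletes are not visible until an explicit commit is sent.
    static String commit() {
        return "<commit/>";
    }
    public static void main(String[] args) {
        // Each delete query must be sent as its own command:
        for (String loadId : new String[]{"20070424150841", "20070425145301"}) {
            System.out.println(deleteByQuery("load_id:" + loadId));
        }
        System.out.println(commit());
    }
}
```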

-- 
View this message in context: 
http://old.nabble.com/Delete-from-Solr-index...-tp10264940p27369849.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Querying for multi-term phrases only . . .

2010-01-29 Thread Erik Hatcher
You can avoid one-word terms by setting outputUnigrams="false" on the
ShingleFilterFactory configuration.


Erik

On Jan 28, 2010, at 11:29 PM, Christopher Ball wrote:


I am curious how I can query for multi-term phrases using the
TermsComponent?



The field I am searching has been shingled so it contains 2 and 3 word
phrases.



For example, in the sample results below I want to get back only multi-word
phrases such as "table of contents" and "under the", but not single-word
terms such as "year" and "significant".



[The sample terms output was stripped by the archive; only the counts survived: 25302, 25162, 25097, 17501, 17359.]



Appreciate any ideas,



Christopher
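Erik's outputUnigrams="false" setting goes on the shingle filter in the field type. A hedged sketch of what such a type could look like (the type name, tokenizer choice, and shingle size are assumptions, not from the thread):

```xml
<!-- Illustrative field type; names and sizes are assumptions -->
<fieldType name="shingled" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Emit only 2- and 3-word shingles; suppress single-word terms -->
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```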





Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Erik Hatcher
dismax won't quite give you the same query result.  What you can do  
pretty easily, though, is create a QParser and QParserPlugin pair,  
register it in solrconfig.xml, and then use &defType= with the name you  
registered.  Pretty straightforward.  Have a look at Solr's various  
QParserPlugin implementations for details.


Erik
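The registration Erik describes looks roughly like this in solrconfig.xml (the parser name and class here are hypothetical placeholders):

```xml
<!-- Hypothetical names; point class at your QParserPlugin implementation -->
<queryParser name="myparser" class="com.example.MyQParserPlugin"/>
```

It can then be selected per request with `&defType=myparser`.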

On Jan 29, 2010, at 12:30 AM, Abin Mathew wrote:

Hi, I want to generate my own customized query from the input string
entered by the user. It should look something like this:

*Search field : Microsoft*
*
Generated Query*  :
description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0)
tags:microsoft^2.0 title:microsoft^3.5 functionalArea:microsoft

*The lucene code we used is like this*
BooleanQuery query = new BooleanQuery();
BooleanQuery must = new BooleanQuery();

addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
addToBooleanQuery(must, "company", inputData, standardAnalyzer);
addToBooleanQuery(must, "city", inputData, standardAnalyzer);
must.setBoost(5.0f);
query.add(must, Occur.MUST);
addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);

*In simple English:*
addToBooleanQuery adds the given field to the query after analyzing the
input with the analyzer mentioned and applying the boost specified.
So there "MUST" be a keyword match in one of the fields
tags, title, role, description, requirement, company, city, and it "SHOULD"
occur in the fields tags, title and functionalArea.

Hope you have got an idea of my requirement. I am not asking anyone to do it
for me. Please let me know where I can start and give me some useful tips to
move ahead with this. I believe it has to do with modifying the XML
configuration file and setting the parameters in the dismax handler, but I am
still not sure. Please help.

Thanks & Regards
Abin Mathew
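For reference, the boosts in the Lucene code above map roughly onto dismax parameters like the following. This is only a sketch (as Erik notes in this thread, dismax cannot reproduce the query exactly; field names and boosts are taken from the code, the rest is an assumption):

```xml
<!-- Rough dismax approximation; the SHOULD-only fields (tags^2.0,
     title^3.5, functionalArea) would need separate boost queries -->
<lst name="defaults">
  <str name="defType">dismax</str>
  <str name="qf">tags^1.5 title^3.0 role requirement company city description</str>
</lst>
```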




Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi,

 

I was wondering if anyone had come across this use case, and if this type of 
faceting is possible:

 

The requirement is to build a query such that an aggregated facet count of 
common (AND'ed) field values forms the basis of each returned facet count.

 

For example:

Let's say I have a number of documents in an index with, among others, the 
fields 'host' and 'user':

 

Doc1  host:machine_1   user:user_1

Doc2  host:machine_1   user:user_2

Doc3  host:machine_1   user:user_1

Doc3  host:machine_1   user:user_1

 

Doc4  host:machine_2   user:user_1

Doc5  host:machine_2   user:user_1

Doc6  host:machine_2   user:user_4

 

Doc7  host:machine_1   user:user_4

 

Is it possible to get facets back that would give the count of documents that 
have common host AND user values (preferably ordered - i.e. host then user for 
this example, so as not to create a factorial explosion)? Note that the caller 
wouldn't know what machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could work for 
this, but I believe facet queries work on a different plane than this 
requirement (narrowing the term count, as opposed to aggregating).

 

For the example above, the desired result would be:

 

machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)

 

machine_2/user_1 (2)

machine_2/user_4 (1)

 

Has anyone had a need for this type of faceting and found a way to achieve it?

 

Many thanks,

Peter

 

 
  
_
We want to hear all your funny, exciting and crazy Hotmail stories. Tell us now
http://clk.atdmt.com/UKM/go/195013117/direct/01/

Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
When faced with this type of situation where the data is entirely  
available at index-time, simply create an aggregated field that glues  
the two pieces together, and facet on that.


Erik
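Erik's index-time approach can be illustrated outside Solr with plain Java: glue host and user into a single key per document, then facet (count) on that key. Using the documents from Peter's example below, the counts come out exactly as desired:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of index-time field aggregation for faceting:
// each document gets one glued "host/user" value, and faceting on that
// single field yields the combined counts directly.
public class AggregatedFacet {
    static Map<String, Integer> facetCounts(String[][] docs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] doc : docs) {
            String key = doc[0] + "/" + doc[1];  // the aggregated field value
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
    public static void main(String[] args) {
        // (host, user) pairs from Peter's example documents
        String[][] docs = {
            {"machine_1", "user_1"}, {"machine_1", "user_2"},
            {"machine_1", "user_1"}, {"machine_1", "user_1"},
            {"machine_2", "user_1"}, {"machine_2", "user_1"},
            {"machine_2", "user_4"}, {"machine_1", "user_4"},
        };
        // Prints machine_1/user_1 (3), machine_1/user_2 (1), etc.
        facetCounts(docs).forEach((k, v) -> System.out.println(k + " (" + v + ")"));
    }
}
```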

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



Hi,



I was wondering if anyone had come across this use case, and if this  
type of faceting is possible:




The requirement is to build a query such that an aggregated facet  
count of common (and'ed) field values form the basis of each  
returned facet count.




For example:

Let's say I have a number of documents in an index with, among  
others, the fields 'host' and 'user':




Doc1  host:machine_1   user:user_1

Doc2  host:machine_1   user:user_2

Doc3  host:machine_1   user:user_1

Doc3  host:machine_1   user:user_1



Doc4  host:machine_2   user:user_1

Doc5  host:machine_2   user:user_1

Doc6  host:machine_2   user:user_4



Doc7  host:machine_1   user:user_4



Is it possible to get facets back that would give the count of  
documents that have common host AND user values (preferably ordered  
- i.e. host then user for this example, so as not to create a  
factorial explosion)? Note that the caller wouldn't know what  
machine and user values exist, only the field names.


I've tried using facet queries in various ways to see if they could  
work for this, but I believe facet queries work on a different plane  
than this requirement (narrowing the term count, a.o.t. aggregating).




For the example above, the desired result would be:



machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)



machine_2/user_1 (2)

machine_2/user_4 (1)



Has anyone had a need for this type of faceting and found a way to  
achieve it?




Many thanks,

Peter









RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi Erik,

 

Thanks for your reply. That's an interesting idea doing it at index-time, and a 
good idea for known field combinations.

The only thing is: how to handle arbitrary field combinations? i.e., to allow the 
caller to specify any combination of fields at query-time.

So, yes, the data is available at index-time, but the combination isn't (short 
of creating fields for every possible combination).

 

Peter


 
> From: erik.hatc...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Aggregated facet value counts?
> Date: Fri, 29 Jan 2010 06:30:27 -0500
> 
> When faced with this type of situation where the data is entirely 
> available at index-time, simply create an aggregated field that glues 
> the two pieces together, and facet on that.
> 
> Erik
> 
> On Jan 29, 2010, at 6:16 AM, Peter S wrote:
> 
> >
> > Hi,
> >
> >
> >
> > I was wondering if anyone had come across this use case, and if this 
> > type of faceting is possible:
> >
> >
> >
> > The requirement is to build a query such that an aggregated facet 
> > count of common (and'ed) field values form the basis of each 
> > returned facet count.
> >
> >
> >
> > For example:
> >
> > Let's say I have a number of documents in an index with, among 
> > others, the fields 'host' and 'user':
> >
> >
> >
> > Doc1 host:machine_1 user:user_1
> >
> > Doc2 host:machine_1 user:user_2
> >
> > Doc3 host:machine_1 user:user_1
> >
> > Doc3 host:machine_1 user:user_1
> >
> >
> >
> > Doc4 host:machine_2 user:user_1
> >
> > Doc5 host:machine_2 user:user_1
> >
> > Doc6 host:machine_2 user:user_4
> >
> >
> >
> > Doc7 host:machine_1 user:user_4
> >
> >
> >
> > Is it possible to get facets back that would give the count of 
> > documents that have common host AND user values (preferably ordered 
> > - i.e. host then user for this example, so as not to create a 
> > factorial explosion)? Note that the caller wouldn't know what 
> > machine and user values exist, only the field names.
> >
> > I've tried using facet queries in various ways to see if they could 
> > work for this, but I believe facet queries work on a different plane 
> > than this requirement (narrowing the term count, a.o.t. aggregating).
> >
> >
> >
> > For the example above, the desired result would be:
> >
> >
> >
> > machine_1/user_1 (3)
> >
> > machine_1/user_2 (1)
> >
> > machine_1/user_4 (1)
> >
> >
> >
> > machine_2/user_1 (2)
> >
> > machine_2/user_4 (1)
> >
> >
> >
> > Has anyone had a need for this type of faceting and found a way to 
> > achieve it?
> >
> >
> >
> > Many thanks,
> >
> > Peter
> >
> >
> >
> >
> > 
> 
  

loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Marc Sturlese

I am testing trunk and have seen a different behaviour when loading
updateProcessors which I don't know if it's normal (at least with multicore).
Previously I used an updateProcessorChain this way:



<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="myChain">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>






It does not work in the current trunk. I have debugged the code and have seen
that UpdateProcessorChain is now loaded via:

  public <T> T initPlugins(List<PluginInfo> pluginInfos, Map<String, T>
registry, Class<T> type, String defClassName) {
    T def = null;
    for (PluginInfo info : pluginInfos) {
      T o = createInitInstance(info, type, type.getSimpleName(),
defClassName);
      registry.put(info.name, o);
      if (info.isDefault()) {
        def = o;
      }
    }
    return def;
  }

As I don't have default="true" in the configuration, my custom
processorChain is not used. Setting default="true" makes it work:



<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="myChain" default="true">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>






As far as I understand, if you specify the chain you want to use in here:


<lst name="defaults">
  <str name="update.processor">myChain</str>
</lst>



it shouldn't be necessary to set it as default.
Is it going to be kept this way?

Thanks in advance



-- 
View this message in context: 
http://old.nabble.com/loading-an-updateProcessorChain-with-multicore-in-trunk-tp27371375p27371375.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Creating values for every possible combination is what you're asking  
Solr to do at query-time, and as far as I know there isn't really a  
way to accomplish that like you're asking.   Is the need really to be  
arbitrary here?


Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:



Hi Erik,



Thanks for your reply. That's an interesting idea doing it at index- 
time, and a good idea for known field combinations.


The only thing is

How to handle arbitrary field combinations? - i.e. to allow the  
caller to specify any combination of fields at query-time?


So, yes, the data is available at index-time, but the combination  
isn't (short of creating fields for every possible combination).




Peter




From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 06:30:27 -0500

When faced with this type of situation where the data is entirely
available at index-time, simply create an aggregated field that glues
the two pieces together, and facet on that.

Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



Hi,



I was wondering if anyone had come across this use case, and if this
type of faceting is possible:



The requirement is to build a query such that an aggregated facet
count of common (and'ed) field values form the basis of each
returned facet count.



For example:

Let's say I have a number of documents in an index with, among
others, the fields 'host' and 'user':



Doc1 host:machine_1 user:user_1

Doc2 host:machine_1 user:user_2

Doc3 host:machine_1 user:user_1

Doc3 host:machine_1 user:user_1



Doc4 host:machine_2 user:user_1

Doc5 host:machine_2 user:user_1

Doc6 host:machine_2 user:user_4



Doc7 host:machine_1 user:user_4



Is it possible to get facets back that would give the count of
documents that have common host AND user values (preferably ordered
- i.e. host then user for this example, so as not to create a
factorial explosion)? Note that the caller wouldn't know what
machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could
work for this, but I believe facet queries work on a different plane
than this requirement (narrowing the term count, a.o.t.  
aggregating).




For the example above, the desired result would be:



machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)



machine_2/user_1 (2)

machine_2/user_4 (1)



Has anyone had a need for this type of faceting and found a way to
achieve it?



Many thanks,

Peter













Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess default="true" should not be necessary if there is only one
updateRequestProcessorChain specified. Please open an issue.

On Fri, Jan 29, 2010 at 6:06 PM, Marc Sturlese  wrote:
>
> I am testing trunk and have seen a different behaviour when loading
> updateProcessors wich I don't know if it's normal (at least with multicore)
> Before I use to use an updateProcessorChain this way:
>
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>    <lst name="defaults">
>       <str name="update.processor">myChain</str>
>    </lst>
> </requestHandler>
> <updateRequestProcessorChain name="myChain">
>    <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> It does not work in current trunk. I have debuged the code and I have seen
> now UpdateProcessorChain is loaded via:
>
>  public <T> T initPlugins(List<PluginInfo> pluginInfos, Map<String, T>
> registry, Class<T> type, String defClassName) {
>    T def = null;
>    for (PluginInfo info : pluginInfos) {
>      T o = createInitInstance(info, type, type.getSimpleName(),
> defClassName);
>      registry.put(info.name, o);
>      if (info.isDefault()) {
>            def = o;
>      }
>    }
>    return def;
>  }
>
> As I don't have default="true" in the configuration, my custom
> processorChain is not used. Setting default="true" makes it work:
>
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>    <lst name="defaults">
>       <str name="update.processor">myChain</str>
>    </lst>
> </requestHandler>
> <updateRequestProcessorChain name="myChain" default="true">
>    <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
>    <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> As far as I understand, if you specify the chain you want to use in here:
> <lst name="defaults">
>    <str name="update.processor">myChain</str>
> </lst>
>
> Shouldn't be necesary to set it as default.
> Is it going to be kept this way?
>
> Thanks in advance
>
>
>
> --
> View this message in context: 
> http://old.nabble.com/loading-an-updateProcessorChain-with-multicore-in-trunk-tp27371375p27371375.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Well, it wouldn't be 'every' combination - more of 'any' combination at 
query-time.
 
The 'arbitrary' part of the requirement is because it's not practical to 
predict every combination a user might ask for, although generally users would 
tend to search for similar/the same query combinations (but perhaps with 
different date ranges, for example).
 
If 'predicted aggregate fields' were calculated at index-time on, say, 10 
fields (the schema in question actually has 73 fields), that's 3,628,801 new 
fields. A large percentage of these would likely never be used (which ones 
would depend on the user, environment etc.).
 

Perhaps a more 'typical' use case than my network-based example would be a 
product search web page, where you want to show the number of products that are 
made by a manufacturer and within a certain price range (e.g. Sony [$600-$800] 
(15) ). To obtain the (15) facet count value, you would have to correlate the 
number of Sony products (say, (861)), and the products that fall into the [600 
TO 800] price range (say, (1226) ). The (15) would be the intersection of the 
Sony hits and the price range hits by 'manufacturer:Sony'. Am I right that 
filter queries could only do this for document hits if you know the field 
values ahead of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The 
facets could then be derived by simply counting the numFound for each result 
set.
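The intersection described above can be sketched in plain Java with sets of document ids: one set per filter, and the combined facet count is the size of their intersection. The document ids and hit counts below are illustrative only, not real numbers:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of correlating two filter results by set intersection:
// docs matching manufacturer:Sony, docs matching price:[600 TO 800],
// and the intersection count that would drive the combined facet.
public class FacetIntersection {
    static int intersectionCount(Set<Integer> a, Set<Integer> b) {
        Set<Integer> both = new HashSet<>(a);
        both.retainAll(b);  // keep only doc ids present in both result sets
        return both.size();
    }
    public static void main(String[] args) {
        Set<Integer> sony = new HashSet<>();
        Set<Integer> priced = new HashSet<>();
        for (int i = 0; i < 20; i++) sony.add(i);    // illustrative hit lists
        for (int i = 15; i < 40; i++) priced.add(i);
        System.out.println("Sony AND [600 TO 800]: "
                + intersectionCount(sony, priced));  // ids 15..19 overlap
    }
}
```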

 

If there were subsearch support in Solr (i.e. take the output of a query and 
use it as input into another) that included facets [perhaps there is such 
support?], it might be used to achieve this effect.


A custom query parser plugin could work, maybe? I suppose it would need to 
gather up all the separate facets and correlate them according to the input 
query (e.g. host and user, or manufacturer and price range). Such a mechanism 
would be crying out for caching, but perhaps it could leverage the existing 
field and query caches.
 

Peter

 


> From: erik.hatc...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Aggregated facet value counts?
> Date: Fri, 29 Jan 2010 07:39:44 -0500
> 
> Creating values for every possible combination is what you're asking 
> Solr to do at query-time, and as far as I know there isn't really a 
> way to accomplish that like you're asking. Is the need really to be 
> arbitrary here?
> 
> Erik
> 
> On Jan 29, 2010, at 7:25 AM, Peter S wrote:
> 
> >
> > Hi Erik,
> >
> >
> >
> > Thanks for your reply. That's an interesting idea doing it at index- 
> > time, and a good idea for known field combinations.
> >
> > The only thing is
> >
> > How to handle arbitrary field combinations? - i.e. to allow the 
> > caller to specify any combination of fields at query-time?
> >
> > So, yes, the data is available at index-time, but the combination 
> > isn't (short of creating fields for every possible combination).
> >
> >
> >
> > Peter
> >
> >
> >
> >> From: erik.hatc...@gmail.com
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Aggregated facet value counts?
> >> Date: Fri, 29 Jan 2010 06:30:27 -0500
> >>
> >> When faced with this type of situation where the data is entirely
> >> available at index-time, simply create an aggregated field that glues
> >> the two pieces together, and facet on that.
> >>
> >> Erik
> >>
> >> On Jan 29, 2010, at 6:16 AM, Peter S wrote:
> >>
> >>>
> >>> Hi,
> >>>
> >>>
> >>>
> >>> I was wondering if anyone had come across this use case, and if this
> >>> type of faceting is possible:
> >>>
> >>>
> >>>
> >>> The requirement is to build a query such that an aggregated facet
> >>> count of common (and'ed) field values form the basis of each
> >>> returned facet count.
> >>>
> >>>
> >>>
> >>> For example:
> >>>
> >>> Let's say I have a number of documents in an index with, among
> >>> others, the fields 'host' and 'user':
> >>>
> >>>
> >>>
> >>> Doc1 host:machine_1 user:user_1
> >>>
> >>> Doc2 host:machine_1 user:user_2
> >>>
> >>> Doc3 host:machine_1 user:user_1
> >>>
> >>> Doc3 host:machine_1 user:user_1
> >>>
> >>>
> >>>
> >>> Doc4 host:machine_2 user:user_1
> >>>
> >>> Doc5 host:machine_2 user:user_1
> >>>
> >>> Doc6 host:machine_2 user:user_4
> >>>
> >>>
> >>>
> >>> Doc7 host:machine_1 user:user_4
> >>>
> >>>
> >>>
> >>> Is it possible to get facets back that would give the count of
> >>> documents that have common host AND user values (preferably ordered
> >>> - i.e. host then user for this example, so as not to create a
> >>> factorial explosion)? Note that the caller wouldn't know what
> >>> machine and user values exist, only the field names.
> >>>
> >>> I've tried using facet queries in various ways to see if they could
> >>> work for this, but I believe facet queries work on a different plane
> >>> than this requirement (narrowing the term count, a.o.t. 
> >>> aggregating).
> >>>
> >>>
> >>>
> >>> For the example above, the desired result would be:
> >>>
> >>>
> >>>
> >>> machine_1/user_1 (3)
> >>>
> >>> machine_1/user_2 (1)
> >>>
> >>

multi term, multi field, auto suggest

2010-01-29 Thread Lukas Kahwe Smith
Hi,

So over the course of the last two weeks I have been trying to come up with an 
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can 
have German, English, Italian or French names; people have an additional 
firstname field. We also want to do auto-suggest on the street and city names 
as well as on emails and telephone numbers; as such we are treating phone 
numbers as text.

We do have the option for the user to use phonetic searches or to split words 
(especially the German compounds), but I guess we will leave that out of the 
auto-suggest.
We expect that some users will type properly cased strings, while others may 
type all lowercase.
We are using the dismax defType for our normal queries.

There will probably be less than 20M entities.

As such I guess the best approach is to copy all of the above-mentioned fields 
(name, firstname, city, street, email, telefon) into a new field called "all".
It seems the best approach is to use facet.prefix for our requirements. We will 
therefore split off the last term in the query and pass it in as the 
facet.prefix, while the rest is passed in as the q parameter.

Since facets are driven out of the index, we will use the following type 
definition for this "all" field:

  <fieldType name="text_all" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"/>
    </analyzer>
  </fieldType>


So essentially the idea is to just split on whitespace, remove stop words and 
word delimiters.

The query would then look something like the following if the user would enter 
"Kaltenreider Ver":
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver

Does this approach make sense so far?
Do you expect this to perform decently on a dual quad-core machine with 16GB of 
RAM, albeit all of that shared with Apache, a MySQL slave and a PHP app? 
Ah well, questions like that are impossible to answer, so I'm just asking whether 
you expect this to be really heavy. I noticed in my initial testing with 
2M documents on my laptop that facets seemed fine, though the first request was slow 
and memory use spiked to 300MB. But I presume it's just loading stuff into cache, 
and concurrent requests shouldn't cause memory use to go up linearly.

I am still a bit unsure how to handle both the lowercased and the 
case-preserved version:

So here are some examples:
UBS => ubs|UBS
Kreuzstrasse => kreuzstrasse|Kreuzstrasse

So when I type "Kreu" I would get a suggestion of "Kreuzstrasse" and with 
"kreu" I would get "kreuzstrasse".
Since I do not expect any words to start with a lowercase letter and still 
contain an uppercase letter, we should be fine with this approach.

As in, I doubt there would be words like "fooBar", which would lead to 
suggesting both "foobar" and "fooBar".

How can I achieve this?
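One way to get the "UBS => ubs|UBS" behaviour is to index the lowercased variant alongside the original token whenever the two differ. In Solr this would live in a custom TokenFilter; the sketch below shows only the expansion logic itself (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Expansion logic for case-preserving suggestions: index both the
// original token and, when different, its lowercased form, so that
// "Kreu" matches "Kreuzstrasse" and "kreu" matches "kreuzstrasse".
public class CaseVariants {
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        String lower = token.toLowerCase();
        if (!lower.equals(token)) out.add(lower);  // only when casing differs
        return out;
    }
    public static void main(String[] args) {
        System.out.println(expand("UBS"));           // [UBS, ubs]
        System.out.println(expand("kreuzstrasse"));  // [kreuzstrasse]
    }
}
```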

regards,
Lukas Kahwe Smith
m...@pooteeweet.org


Is optimizing always necessary?

2010-01-29 Thread Marcus Herou
If one only has additions, do I then need to optimize the index at all?

I thought that only updates/deletes created "holes" in the index. Or should
the index be sorted on disk at all times; is that the reason?

Cheers

//Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Sounds like what you're asking for is tree faceting.  A basic  
implementation is available in SOLR-792, but one that could also take  
facet.queries, numeric or date range buckets, to tree on would be a  
nice improvement.


Still, the underlying implementation will simply enumerate all the  
possible values (SOLR-792 has some short-circuiting when the top-level  
has zero, of course).  A client-side application could do this with  
multiple requests to Solr.


Subsearch - sure, just make more requests to Solr, rearranging the  
parameters.


I'd still say that in general for this type of need that it'll  
"generally" be less arbitrary and locking some things in during  
indexing will be the pragmatic way to go for most cases.


Erik



On Jan 29, 2010, at 9:28 AM, Peter S wrote:



Well, it wouldn't be 'every' combination - more of 'any' combination  
at query-time.


The 'arbitrary' part of the requirement is because it's not  
practical to predict every combination a user might ask for,  
although generally users would tend to search for similar/the same  
query combinations (but perhaps with different date ranges, for  
example).


If 'predicted aggregate fields' were calculated at index-time on,  
say, 10 fields (the schema in question actually as 73 fields),  
that's 3,628,801 new fields. A large percentage of these would  
likely never be used (which ones would depend on the user,  
environment etc.).



Perhaps a more 'typical' use case than my network-based example  
would be a product search web page, where you want to show the  
number of products that are made by a manufacturer and within a  
certain price range (e.g. Sony [$600-$800] (15) ). To obtain the  
(15) facet count value, you would have to correlate the number of  
Sony products (say, (861)), and the products that fall into the [600  
TO 800] price range (say, (1226) ). The (15) would be the  
intersection of the Sony hits and the price range hits by  
'manufacturer:Sony'. Am I right that filter queries could only do  
this for document hits if you know the field values ahead of time  
(e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could  
then be derived by simply counting the numFound for each result set.




If there were subsearch support in Solr (i.e. take the output of a  
query and use it as input into another) that included facets  
[perhaps there is such support?], it might be used to achieve this  
effect.



A custom query parser plugin could work, maybe? I suppose it would  
need to gather up all the separate facets and correlate them  
according to the input query (e.g. host and user, or manufacturer  
and price range). Such a mechanism would be crying out for caching,  
but perhaps it could leverage the existing field and query caches.



Peter





From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 07:39:44 -0500

Creating values for every possible combination is what you're asking
Solr to do at query-time, and as far as I know there isn't really a
way to accomplish that like you're asking. Is the need really to be
arbitrary here?

Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:



Hi Erik,



Thanks for your reply. That's an interesting idea doing it at index-
time, and a good idea for known field combinations.

The only thing is

How to handle arbitrary field combinations? - i.e. to allow the
caller to specify any combination of fields at query-time?

So, yes, the data is available at index-time, but the combination
isn't (short of creating fields for every possible combination).



Peter




From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 06:30:27 -0500

When faced with this type of situation where the data is entirely
available at index-time, simply create an aggregated field that  
glues

the two pieces together, and facet on that.

Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



Hi,



I was wondering if anyone had come across this use case, and if  
this

type of faceting is possible:



The requirement is to build a query such that an aggregated facet
count of common (and'ed) field values form the basis of each
returned facet count.



For example:

Let's say I have a number of documents in an index with, among
others, the fields 'host' and 'user':



Doc1 host:machine_1 user:user_1

Doc2 host:machine_1 user:user_2

Doc3 host:machine_1 user:user_1

Doc3 host:machine_1 user:user_1



Doc4 host:machine_2 user:user_1

Doc5 host:machine_2 user:user_1

Doc6 host:machine_2 user:user_4



Doc7 host:machine_1 user:user_4



Is it possible to get facets back that would give the count of
documents that have common host AND user values (preferably  
ordered

- i.e. host then user for this example, so as not to create a
factorial explosion)? Note that the caller wouldn't know what
machine and user values exist, only the field names

Re: Is optimizing always necessary?

2010-01-29 Thread Wangsheng Mei
In addition to removing the "holes" in the index, optimization also merges
multiple small index segments into a bigger one.
Although I have no specific performance data, I can imagine that this
leads to performance benefits: supposing you had thousands of small segments,
opening and closing them again and again would be time-consuming.
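The ongoing merge behaviour described above is governed by the mergeFactor setting in solrconfig.xml (an occasional explicit `<optimize/>` update message then collapses everything to a single segment). A hedged sketch of the relevant Solr 1.4-era config, with an illustrative value:

```xml
<!-- Illustrative setting; lower values mean fewer, larger segments
     at the cost of more merging work during indexing -->
<indexDefaults>
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```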

2010/1/30 Marcus Herou 

> If one only have additions do I then need to optimize the index at all ?
>
> I thought that only update/deletes created "holes" in the index. Or should
> the index be sorted on disk at all times, is that the reason ?
>
> Cheers
>
> //Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
>



-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Wangsheng Mei
What's the point of generating your own query?
Are you sure that Solr's query syntax cannot satisfy your need?

2010/1/29 Abin Mathew 

> Hi I want to generate my own customized query from the input string entered
> by the user. It should look something like this
>
> *Search field : Microsoft*
> *
> Generated Query*  :
> description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
> role:microsoft requi
> rement:microsoft company:microsoft city:microsoft)^5.0) tags:microsoft^2.0
> title:microsoft^3.5 functionalArea:microsoft
>
> *The lucene code we used is like this*
> BooleanQuery must = new BooleanQuery();
>
> addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
> addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
> addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
> addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
> addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
> addToBooleanQuery(must, "company", inputData, standardAnalyzer);
> addToBooleanQuery(must, "city", inputData, standardAnalyzer);
> must.setBoost(5.0f);
> query.add(must, Occur.MUST);
> addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
> addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
> addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);
> *
> In Simple english*
> addToBooleanQuery will add the particular field to the query after
> analysing
> using the analyser mentioned and setting a boost as specified
> So there "MUST" be a keyword match with any of the fields
> tags,title,role,description,requirement,company,city and it "SHOULD" occur
> in the fields tags,title and functionalArea.
>
> Hope you have got an idea of my requirement. I am not asking anyone to do
> it
> for me. Please let me know where can i start and give me some useful tips
> to
> move ahead with this. I believe that it has to do with modifying the XML
> configuration file and setting the parameters in Dismax handler. But I am
> still not sure. Please help
>
> Thanks & Regards
> Abin Mathew
>



-- 
梅旺生
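For what it's worth, a query structure like the one above can often be expressed declaratively with the dismax handler rather than hand-built BooleanQueries. A rough, untested sketch of such a handler definition (handler name and boost values are illustrative; field names are taken from the mail above):

```xml
<requestHandler name="/jobsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- the "MUST match at least one of these" fields, with per-field boosts -->
    <str name="qf">tags^1.5 title^3.0 role requirement company city description</str>
  </lst>
</requestHandler>
```

The SHOULD-style clauses (tags^2.0, title^3.5, functionalArea) could then be supplied as bq parameters, either in the defaults or per request.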


Re: boosting unexpired documents

2010-01-29 Thread Wangsheng Mei
I think you can combine several of the standard function queries that Solr
supplies to achieve this.

similar to:
&bf=map(map(div(ms(NOW, expiration),8640),-1,0,0), 1,1,1)

Alternatively, you could implement your own function and register it in
solrconfig.xml using the valueSourceParser tag.

2010/1/29 Andy 

> Ah, thank you!
>
>
>
> --- On Fri, 1/29/10, Lance Norskog  wrote:
>
> > From: Lance Norskog 
> > Subject: Re: boosting unexpired documents
> > To: solr-user@lucene.apache.org
> > Date: Friday, January 29, 2010, 12:32 AM
> > You add a range query on the date,
> > and boost documents within that
> > date range. Check out the 'boost query' feature of dismax.
> >
> > http://www.lucidimagination.com/search/document/CDRG_ch07_7.4.2.9
> >
> > It's also possible with the standard query parser but a
> > pain in the neck:
> >
> > (value)^2 OR (NOT value)
> >
> >
> >
> > On Thu, Jan 28, 2010 at 6:58 PM, Andy 
> > wrote:
> > > My documents have a field "expiration" that is the
> > expiration date of that doc.
> > >
> > > I want to give a boost to all documents that haven't
> > expired. I still want to have expired documents returned,
> > but unexpired documents should be given priority.
> > >
> > > Ideally the boost amount for all unexpired documents
> > should be the same. i.e. whether the expiration date is
> > tomorrow or a month from now wouldn't make a difference.
> > Likewise, all expired documents should be treated the same,
> > whether it expired yesterday or a year ago.
> > >
> > > Is that something possible? I read
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
> > but that's not quite what I want.
> > >
> > >
> > >
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>
>
>
>


-- 
梅旺生
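The range-query boost Lance describes can be sketched as dismax request parameters like the following; the flat ^2 boost treats every unexpired document identically, which matches the requirement above (parameter values are illustrative, untested):

```text
q=ipod&qt=dismax&bq=expiration:[NOW TO *]^2
```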


Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Document Duplication Detection

(Solr 1.4)

Contents

   1. Document Duplication Detection
   2. Overview
      1. Goals
      2. Design
   3. Notes
   4. Configuration
      1. solrconfig.xml
         1. Note
      2. Settings

 Overview

Preventing duplicate or near duplicate documents from entering an index or
tagging documents with a signature/fingerprint for duplicate field
collapsing can be efficiently achieved with a low collision or fuzzy hash
algorithm. Solr should natively support deduplication techniques of this
type and allow for the easy addition of new hash/signature implementations.

Goals

   - Efficient, hash based exact/near document duplication detection and
   blocking.
   - Allow for both duplicate collapsing in search results as well as
   deduplication on adding a document.

 Design

Signature

A class capable of generating a signature String from the concatenation of a
group of specified document fields.

public abstract class Signature {
  public void init(SolrParams nl) {
  }

  public abstract String calculate(String content);
}

Implementations:

MD5Signature

128 bit hash used for exact duplicate detection.

Lookup3Signature 

64 bit hash used for exact duplicate detection, much faster than MD5 and
smaller to index

TextProfileSignature 

Fuzzy hashing implementation from Nutch for near-duplicate detection. It's
tunable but works best on longer text.

There are other more sophisticated algorithms for fuzzy/near hashing that
could be added later.
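To make the contract above concrete, here is an illustrative, standalone sketch of an exact-hash signature in the spirit of MD5Signature. It deliberately does not extend the Solr Signature class so it can run on its own; the class name and standalone shape are assumptions for the example, not Solr code:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

// Illustrative sketch of an exact-duplicate signature, similar in spirit
// to MD5Signature. A real implementation would extend
// org.apache.solr.update.processor.Signature instead of standing alone.
public class Md5SignatureSketch {

    // Hash the concatenated field content down to a 128-bit hex string.
    public String calculate(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(content.getBytes("UTF-8"));
            // Left-pad to 32 hex chars so leading zero bytes are preserved.
            return String.format("%032x", new BigInteger(1, hash));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Md5SignatureSketch sig = new Md5SignatureSketch();
        // Identical input always yields the identical signature, which is
        // what makes it usable as an update Term for deduplication.
        System.out.println(sig.calculate("hello")); // 5d41402abc4b2a76b9719d911017c592
        System.out.println(sig.calculate("name,features,cat"));
    }
}
```

A fuzzy implementation such as TextProfileSignature differs mainly in that it quantizes token frequencies before hashing, so small textual changes can still map to the same signature.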

Notes

Adding in the dedupe process will change the allowDups setting so that it
applies to an update Term (with field signatureField in this case) rather
than the unique field Term (of course, the signatureField could be the unique
field, but generally you want the unique field to be unique).

When a document is added, a signature will automatically be generated and
attached to the document in the specified signatureField.

Configuration

solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in
solrconfig.xml as part of an UpdateRequestProcessorChain:

Accepting all defaults:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory" />
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Example settings:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">id</str>
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

 Note

Also be sure to change your update handlers to use the defined chain, i.e.

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

The update processor can also be specified per request with a parameter of
update.processor=dedupe

Settings

signatureClass (default: org.apache.solr.update.processor.Lookup3Signature)
A Signature implementation for generating a signature hash.

fields (default: all fields)
The fields to use to generate the signature hash, as a comma-separated list.
By default, all fields on the document will be used.

signatureField (default: signatureField)
The name of the field used to hold the fingerprint/signature. Be sure the
field is defined in schema.xml.

enabled (default: true)
Enable/disable dedupe factory processing.


-- 
梅旺生


Re: Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Sorry for sending the wrong message; this should have gone to my own mailbox  :(

2010/1/30 Wangsheng Mei 




-- 
梅旺生


Deleting spelll checker index

2010-01-29 Thread darniz

Hello all,
We are using an index-based spell checker.
I was wondering whether, with the help of any URL parameters, we can delete
the spell check index directory.
Please let me know.
Thanks,
darniz


-- 
View this message in context: 
http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27376823.html
Sent from the Solr - User mailing list archive at Nabble.com.



Auto Suggest with multiple space separated words

2010-01-29 Thread Nair, Manas
Hi Experts,
 
I need auto-suggest functionality using Solr that gives me the feel of 
using the Firefox browser. In short, if I type in a prefix, the results should 
drop down even if the prefix is not at the start of the drop-down items.
 
Example: If I search for Lin, then the results could be 
[Abe Lincoln, Lindsay Lohan, Sarah Palin, Gasoline, ...].
 
Any help is greatly appreciated.
 
Thankyou,
Manas Nair
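One common approach to this kind of infix matching is an n-gram analyzed field, so that a query for "lin" matches substrings of the indexed values. A rough schema.xml sketch (the field type name and gram sizes are illustrative, not from the original mail):

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index substrings so "lin" matches Lincoln, Lindsay, Palin, Gasoline -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The query-side analyzer intentionally omits the n-gram filter, so the user's prefix is matched as typed against the indexed grams.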


distributed search and failed core

2010-01-29 Thread Joe Calderon
hello *, in distributed search, when a shard goes down an error is
returned and the search fails. Is there a way to avoid the error and
return the results from the shards that are still up?

thx much

--joe


Re: Basic questions about Solr cost in programming time

2010-01-29 Thread Sven Maurmann

Hi!

Of course the answer depends (as usual) very much on the features
you want to realize. But Solr can be set up very fast. When we created
our first prototype, it took us about a week to get it running with
phonetic search, spell checking, and faceting, and even collapsing
(using the famous SOLR-236 patch).

It is definitely very nice that you can do a lot of things using the
available components and only configuring them inside solrconfig.xml
and schema.xml.

And you may well start with the standard distribution.

Cheers,
   Sven

--On Dienstag, 26. Januar 2010 12:00 -0800 Jeff Crump 
 wrote:



Hi,
I hope this message is OK for this list.

I'm looking into search solutions for an intranet site built with Drupal.
Eventually we'd like to scale to enterprise search, which would include
the Drupal site, a document repository, and Jive SBS (collaboration
software). I'm interested in Lucene/Solr because of its scalability,
faceted search and optimization features, and because it is free. Our
problem is that we are a non-profit organization with only three very
busy programmers/sys admins supporting our employees around the world.

To help me argue for Solr in terms of total cost, I'm hoping that members
of this list can share their insights about the following:

* About how many hours of programming did it take you to set up your
instance of Lucene/Solr (not counting time spent on optimization)?

* Are there any disadvantages of going with a certified distribution
rather than the standard distribution?


Thanks and best regards,
Jeff

Jeff Crump
jcr...@hq.mercycorps.org


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Tree faceting - that sounds very interesting indeed. I'll have a look into that 
and see how it fits, as well as any improvements for adding facet queries, 
cross-field aggregation, date range etc. There could be some very nice 
use-cases for such functionality. Just wondering how this would work with 
distributed shards/multi-core...


Many Thanks! 

Peter

 

 
> From: erik.hatc...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Aggregated facet value counts?
> Date: Fri, 29 Jan 2010 12:20:07 -0500
> 
> Sounds like what you're asking for is tree faceting. A basic 
> implementation is available in SOLR-792, but one that could also take 
> facet.queries, numeric or date range buckets, to tree on would be a 
> nice improvement.
> 
> Still, the underlying implementation will simply enumerate all the 
> possible values (SOLR-792 has some short-circuiting when the top-level 
> has zero, of course). A client-side application could do this with 
> multiple requests to Solr.
> 
> Subsearch - sure, just make more requests to Solr, rearranging the 
> parameters.
> 
> I'd still say that in general for this type of need that it'll 
> "generally" be less arbitrary and locking some things in during 
> indexing will be the pragmatic way to go for most cases.
> 
> Erik
> 
> 
> 
> On Jan 29, 2010, at 9:28 AM, Peter S wrote:
> 
> >
> > Well, it wouldn't be 'every' combination - more of 'any' combination 
> > at query-time.
> >
> > The 'arbitrary' part of the requirement is because it's not 
> > practical to predict every combination a user might ask for, 
> > although generally users would tend to search for similar/the same 
> > query combinations (but perhaps with different date ranges, for 
> > example).
> >
> > If 'predicted aggregate fields' were calculated at index-time on, 
> > say, 10 fields (the schema in question actually as 73 fields), 
> > that's 3,628,801 new fields. A large percentage of these would 
> > likely never be used (which ones would depend on the user, 
> > environment etc.).
> >
> >
> > Perhaps a more 'typical' use case than my network-based example 
> > would be a product search web page, where you want to show the 
> > number of products that are made by a manufacturer and within a 
> > certain price range (e.g. Sony [$600-$800] (15) ). To obtain the 
> > (15) facet count value, you would have to correlate the number of 
> > Sony products (say, (861)), and the products that fall into the [600 
> > TO 800] price range (say, (1226) ). The (15) would be the 
> > intersection of the Sony hits and the price range hits by 
> > 'manufacturer:Sony'. Am I right that filter queries could only do 
> > this for document hits if you know the field values ahead of time 
> > (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could 
> > then be derived by simply counting the numFound for each result set.
> >
> >
> >
> > If there were subsearch support in Solr (i.e. take the output of a 
> > query and use it as input into another) that included facets 
> > [perhaps there is such support?], it might be used to achieve this 
> > effect.
> >
> >
> > A custom query parser plugin could work, maybe? I suppose it would 
> > need to gather up all the separate facets and correlate them 
> > according to the input query (e.g. host and user, or manufacturer 
> > and price range). Such a mechanism would be crying out for caching, 
> > but perhaps it could leverage the existing field and query caches.
> >
> >
> > Peter
> >
> >
> >
> >
> >> From: erik.hatc...@gmail.com
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Aggregated facet value counts?
> >> Date: Fri, 29 Jan 2010 07:39:44 -0500
> >>
> >> Creating values for every possible combination is what you're asking
> >> Solr to do at query-time, and as far as I know there isn't really a
> >> way to accomplish that like you're asking. Is the need really to be
> >> arbitrary here?
> >>
> >> Erik
> >>
> >> On Jan 29, 2010, at 7:25 AM, Peter S wrote:
> >>
> >>>
> >>> Hi Erik,
> >>>
> >>>
> >>>
> >>> Thanks for your reply. That's an interesting idea doing it at index-
> >>> time, and a good idea for known field combinations.
> >>>
> >>> The only thing is
> >>>
> >>> How to handle arbitrary field combinations? - i.e. to allow the
> >>> caller to specify any combination of fields at query-time?
> >>>
> >>> So, yes, the data is available at index-time, but the combination
> >>> isn't (short of creating fields for every possible combination).
> >>>
> >>>
> >>>
> >>> Peter
> >>>
> >>>
> >>>
>  From: erik.hatc...@gmail.com
>  To: solr-user@lucene.apache.org
>  Subject: Re: Aggregated facet value counts?
>  Date: Fri, 29 Jan 2010 06:30:27 -0500
> 
>  When faced with this type of situation where the data is entirely
>  available at index-time, simply create an aggregated field that 
>  glues
>  the two pieces together, and facet on that.
> 
>  Erik
> 
>  On Jan 29, 2010, at 6:16 AM, Peter S wrote:
> 
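As a concrete sketch of the multiple-requests idea mentioned above, the Sony/price intersection count from earlier in the thread can also be fetched in a single request with a facet query over the conjunction (field names from the example; untested against this schema):

```text
q=*:*&rows=0&facet=true&facet.query=manufacturer:Sony AND price:[600 TO 800]
```

Each arbitrary field-value combination becomes one facet.query, and its count is exactly the intersection of the two constraints.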

sort items by whether the user has viewed it or not

2010-01-29 Thread a8910b-solr
hi,

I want to query for documents that have certain values, but I want the results
first sorted by documents that this person has viewed in the past.  I can't store
each user's view information in the document, so I want to pass that in to the
search.  Is it possible to do something like this:

http://solr?q=baseball&sort=doc_isbn("ABC" or "DEF" or "GHI") desc, title desc

any help is appreciated,
r
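The sort syntax sketched above does not exist in Solr, but a similar effect is often achieved by boosting the viewed documents and sorting by relevance; a rough sketch (ISBN values passed in per request, boost value illustrative):

```text
q=baseball&qt=dismax&bq=doc_isbn:(ABC OR DEF OR GHI)^10&sort=score desc, title desc
```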



Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Chris Hostetter


: I guess default=true should not be necessary if there is only one
: updateRequestProcessorChain specified. Open an issue.

No... that doesn't seem right.  If you declare your own chains, but you 
don't mark any of them as default="true", then it shouldn't matter how many 
of them you declare; SolrCore should create a default for you.


The real question here is: why isn't he getting his explicitly defined 
chain when he references it by name?


declaring that he wants his explicitly named chain to be the default is 
fine, and that should work, but w/o declaring it as the default he should 
still be able to ask for it by name ... why isn't that working? ...


: > <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
: >   <lst name="defaults">
: >     <str name="update.processor">myChain</str>
: >   </lst>
: > </requestHandler>

Marc, can you confirm that when you don't declare your chain as 
default="true" that...
1) an instance of your CustomUpdateProcessorFactory is actually getting 
instantiated by solr (via logging or running in a debugger)
2) whether your custom chain is used if you pass update.processor=myChain 
as a request param instead of relying on the configured defaults


(I wonder if some handler refactoring caused the default 
processing logic to no longer respect the defaults)




-Hoss

Re: update doc success, but could not find the new value

2010-01-29 Thread Chris Hostetter

: Subject: update doc success, but could not find the new value
: In-Reply-To: <449216.59315...@web56308.mail.re3.yahoo.com>
: References: <27335403.p...@talk.nabble.com>
: <449216.59315...@web56308.mail.re3.yahoo.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



RE: Solr wiki link broken

2010-01-29 Thread Chris Hostetter

: Why don't we change the links to have "FrontPage" explicitly?
: Wouldn't it be the easiest fix unless there are numerous
: other pages that references the default page w/o "FrontPage"?

I'm fairly confident that there are more links pointing to 
http://wiki.apache.org/solr/ than there are alternate versions in 
different languages ... particularly when you start factoring in all of 
the webpages in the world that we don't have the ability to edit 
directly.


-Hoss



Re: NullPointerException in ReplicationHandler.postCommit + question about compression

2010-01-29 Thread Chris Hostetter

: never keep a 0.
: 
: It is better to not mention the deletionPolicy at all. The
: defaults are usually fine.

if setting the "keep" values to 0 results in NPEs we should do one (if not 
both) of the following...

1) change the init code to warn/fail if the values are 0 (not sure if 
there is ever a legitimate use for 0 as a value)

2) change the code that's currently throwing an NPE to check its 
assumptions and log a more meaningful error if it can't function because of 
the existing config.


-Hoss



RE: How to Implement SpanQuery in Solr . . ?

2010-01-29 Thread Chris Hostetter

: and Solr. I was hoping to start by getting a simple example working in SOLR
: and then iterate towards the more complex, given this is my first attempt at
: extending Solr.

wise choice.

: For my first iteration of SpanQuery in Solr I am thinking of starting with a
: simple syntax to combine:

...honestly: since you already mentioned that you might eventually want to 
integrate Qsol, i would suggest you start with that directly.  that way 
you are taking an existing parser (that you evidently understand) and just 
hooking it in via the QParser abstraction (as opposed to writing a 
Lucene String->Query parser *and* learning the QParser/Solr internals).

: implementation on the Lucene side and the FooQParserPlugin as a reference
: implementation on the SOLR side?

The FooQParserPlugin is fairly primitive and doesn't really make obvious 
some of the things you can do with a QParser, so you may also want 
to skim the LuceneQParserPlugin as well.

: The other part of the riddle I would really appreciate some guidance on is
: how to get it to plug-in to SOLR correctly?

http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin


-Hoss



Re: Solr Cache Viewing/Browsing

2010-01-29 Thread Chris Hostetter
: used in a modified DisMaxHandler) and I was wondering if there is a way to
: get at this data from the JSP pages? I then thought that it might be nice to
: view more information about the respective caches like the current elements,
: recently evicted etc to help debug performance issues. Has anyone worked on
: this or have any ideas surrounding this?

I don't believe anyone has looked into this.

It would be hard to implement in a generic manner since the SolrCache API 
doesn't provide any mechanism for inspecting the contents, but you could 
write an implementation that exposes some of these things through the 
getStatistics method (or some other new introspection-based API)



-Hoss



Re: replication setup

2010-01-29 Thread Chris Hostetter

: Subject: replication setup
: In-Reply-To: <83ec2c9c1001260724t110d6595m5071e0a40e1b1...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists




-Hoss



Re: Analysis tool vs search query

2010-01-29 Thread Chris Hostetter

: I've run into this issue that I have no way of resolving, since the analysis
: tool doesn't show me there is an error. I copy the exact field value into
: the analysis tool and i type in the exact query request i'm issuing and the
: tool finds it a match. However running the query with that exact same

the analysis tool doesn't do query parsing .. so pasting a *query* 
string into the analysis tool isn't going to give you any meaningful 
information.

what the "query" section of the analysis tool lets you do is see what the 
"query time analyzer" (that is used by most query parsers at query time) 
will do with your input ... but the QueryParser is still in control, and 
it decides which input to pass to your analyser -- special characters 
(like whitespace) have meaning to most query parsers, before they ever 
have a chance of getting passed to the analyzer.

: 

A keyword tokenizer results in a single token for each input string, but 
the (default) query parser is going to chunk the input up on whitespace 
before the analyzer is ever invoked, unless you put it in a quoted string.


-Hoss



Re: using bq with standard request handler

2010-01-29 Thread Chris Hostetter

: I am using a query like:
: 
http://localhost:8080/solr/select?q=product_category:Grocery&fq=in_stock:true&debugQuery=true&;
: sort=display_priority+desc,prop_l_price+asc
...
: Is it possible to use display_priority/price fields in bq itself to achieve
: the same result? I tried forming some queries for that but was unable to get
: the desired results...

bf and bq are features of the dismax parser, so the default query parser 
won't use them -- it really wouldn't even make sense as a possible 
new feature, because the types of queries that might be specified using 
the lucene QParser are too broad to be able to define a consistent 
mechanism for knowing how/where to add the boosting queries to the 
structure.

if all of your queries have that identical structure, you might however consider 
something like...

http://localhost:8080/solr/select?qf=product_category&q=Grocery&fq=in_stock:true&bq=...


-Hoss



Re: Mail config

2010-01-29 Thread Chris Hostetter

: I do not want to receive all the emails from this mail list, I only want to
: receive the answers to my questions, is this possible?

That's not how mailing lists work.  If you want to participate in the 
community, you have to participate fully.

: If I am not mistaken when I unsubscribed I sent an email which did not reach
: the mail list at all (therefore there was of course no chance to get any
: replies).

The same mechanism that prevents you from posting when you are not 
subscribed is the mechanism that prevents thousands of spam messages from 
getting sent to the list every day .. you have to take the "bad" with the 
good.

: I am newbie for Solr and I doubt I can contribute much by answering to other
: posts.

But you can learn from those posts, and the discussion/responses they 
stimulate...
http://people.apache.org/~hossman/#private_q




-Hoss



Re: Lock problems: Lock obtain timed out

2010-01-29 Thread Chris Hostetter

: Can anyone think of a reason why these locks would hang around for more than
: 2 hours?
: 
: I have been monitoring them and they look like they are very short lived.

Typically the lock files are only left around for more than a few seconds 
when there was a fatal crash of some kind ... an OOM Error for example, or 
as already mentioned in this thread...

: >> > > SEVERE: java.io.IOException: No space left on device

...if you check your solr logs for messages in the immediate time frame 
following the the lastModified time of the lock file you'll probably find 
something interesting.


-Hoss



Re: scenario with FQ parameter

2010-01-29 Thread Chris Hostetter

:&qf=field1^10 field2^20 field^100&fq=*:9+OR+(field1:"xyz")
...
: I know I can use copy field (say 'text') to copy all the fields and then
...
: but doing so , the boost weights specified in the 'qf' field have no effect
: on the score.

An FQ never has any impact on the score, so your question is a bit 
confusing.

If you want to influence the scores, you'll need to use "bq" instead of 
"fq".

as discussed in another current thread on this list, it's possible to make 
the "bq" param use the dismax parser as well, but there are some tricky 
issues involved with that ... unless your use case is actually more 
complicated than you are describing, you should probably just use 
something like...

...&qf=field1^10+field2^20+field^100&bq=field1:9^10+field2:9^20+field:9^100+field1:xyz

-Hoss



Re: How can I boost bq in FieldQParserPlugin?

2010-01-29 Thread Chris Hostetter

: q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=}&qq=12345&qt=dismax&debugQuery=on
: 
: I try to debug the above query, it turned out to be as:
: +DisjunctionMaxQuery((content:ipod | title:ipod^4.0)~0.01) ()
: +DisjunctionMaxQuery((userId:12345^0.5)~0.01)

...hmmm, i'm not sure why that's happening, but it certainly seems like a 
bug -- i just have no idea what that bug is.  

the inner dismax parser should definitely be producing a query where the 
DisjunctionMaxQuery for "12345" is "mandatory" but that mandatory clause 
should be wrapped inside of another boolean query which should be added to 
the outermost query as an "optional" clause.

somewhere that BooleanQuery produced by the inner dismax parser is getting 
thrown away ... hmmm, actually this is a necessary behavior of the
DismaxQParser for some cases (it sheds its own outermost 
BooleanQuery when not needed), but in this case it's screwing you because 
it doesn't realize you really do need it.

does this work better? ...

q=ipod&bq={!dismax qf=userId^0.5 v=$qq 
bq=*:*^0}&qq=12345&qt=dismax&debugQuery=on

...it's kind of kludgy, but it should guarantee that the wrapping 
BooleanQuery is preserved.



-Hoss



Re: Large Query Strings and performance

2010-01-29 Thread Chris Hostetter

: I am using Solr 1.4 with large query strings with 20+ terms and faceting on
: a single multi-valued field in a 1 million record system. I am using Solr to
: categorize text; that's why the query strings are big.
: 
: The performance gets worse the more search terms are used.  Is there any

can you elaborate more on the types of query strings you are using? ... 
are they simply BooleanQueries consisting of many terms? ... are they all 
optional?

We have to understand your goal, what exactly you are currently doing, and 
what exactly you have already tried, before we can suggest ways of 
achieving your goal faster than the things you've already tried.



-Hoss



Re: request handler defaults

2010-01-29 Thread Chris Hostetter

: I have noticed that atm there doesnt seem to be a way to inherit request 
: handler definitions. This would be nice to be able to define some basic 
: requesthandlers (maybe even with the option of defining them "abstract") 

The idea has been discussed before...
   https://issues.apache.org/jira/browse/SOLR-112 ...as you note in the 
latter comments, we ultimately hit a wall in trying to determine a 
"generic" way to merge the init params for any arbitrary RequestHandler -- 
things like default/invariant/appends are conventions of the OOTB 
handlers, but not all handlers are guaranteed to support them.

It's the kind of thing that individual handlers could easily implement 
internally (especially now that they have a way of being SolrCoreAware and 
asking for other handlers by name, which they could then ask for whatever 
data they needed to initialize themselves).

: Furthermore it would be nice to be able to dynamically append things in 
: a request. For example I run a search on the companies dismax handler 
: and I find no (or just very few) result, then I want to also include a 
: field that has a doublemethaphone analyzer on the name. So I just want 
: to append that field to the qf setting of the request handler defaults.

...isn't qf a multiValued param? ... as long as you declare it in the 
"appends" init set instead of "defaults" it should work exactly the way 
you describe.
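
In solrconfig.xml terms, that would be something along these lines (an untested sketch; the handler name and field names here are made up for illustration):

```xml
<!-- untested sketch: the base qf lives under "appends", so a qf
     param sent with the request is added to it instead of
     replacing it -->
<requestHandler name="/companies" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
  </lst>
  <lst name="appends">
    <str name="qf">name^2.0 description</str>
  </lst>
</requestHandler>
```

A fallback request could then send qf=name_phonetic (a made-up field name) to widen the search without losing the base fields.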



-Hoss



Re: Master Read Timeout

2010-01-29 Thread Chris Hostetter

: Is there any way to increase the Slave's timeout value? Are there any 

http://wiki.apache.org/solr/SolrReplication?highlight=%28timeout%29


-Hoss



RE: matching exact/whole phrase

2010-01-29 Thread Chris Hostetter

: Is it safe to say in order to do exact matches the field should be a string.

It depends on your definition of "exact"

If you want exact matches, including unicode codepoints and 
leading/trailing whitespace, then StrField would probably make sense -- 
but you could equally use TextField with a KeywordTokenizer and nothing 
else.

If you want *some* normalization (ie: trim leading/trailing whitespace, 
map equivalent codepoints to a canonical representation, etc...) then you 
need TextField.
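
For the normalized flavor, a field type along these lines might work (an untested sketch; the type name is made up):

```xml
<!-- untested sketch: one token per value, trimmed and lowercased,
     so "exact" matching survives whitespace and case differences -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```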

: Now in my dismax handler if i have the qf defined as text field and run a
: phrase search on text field
: "my car is the best car in the world"
: i dont get back any results. looking with debugQuery=on this is the
: parsedQuery
: text:"my tire pressure warning light came my honda civic"
: This will not work since text was indexed by removing all stop words.

it *can* work if the query analyzer for your text field type is also 
configured to remove stopwords, and if you either: configure the 
StopFilter(s) to deal with token positions in the way the parser expects 
(i forget which one works, you have to play with it); OR use a "qs" (query 
slop) value that gives you enough slop to miss those empty stop word gaps.
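
The qs variant would look something like this in the handler's defaults (untested; the value 2 is just a starting point to experiment with):

```xml
<!-- untested sketch: phrase slop lets a phrase query match across
     the position gaps left by removed stopwords -->
<lst name="defaults">
  <str name="defType">dismax</str>
  <str name="qs">2</str>
</lst>
```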


-Hoss



Re: Deleting spelll checker index

2010-01-29 Thread Chris Hostetter

: We are using Index based spell checker.
: i was wondering with the help of any url parameters can we delete the spell
: check index directory.

I don't think so.

You might be able to configure two different spell check components that 
point at the same directory -- one that builds off of a real field, and one 
that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
then you could trigger a rebuild of an empty spell checking index using 
the second component.

But i've never tried it so i have no idea if it would work.
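
Spelled out in solrconfig.xml, the idea might look like this (again, an untested sketch; all names here are made up, and the empty source field would need to exist in the schema):

```xml
<!-- untested sketch: two spellcheckers share one index directory;
     rebuilding "reset" from an empty field effectively clears it -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">reset</str>
    <str name="field">empty_text</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>
```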


-Hoss



DataImportHandler multivalued field Collection not working

2010-01-29 Thread Jason Rutherglen
DataImportHandler multivalued field Collection isn't
working the way I'd expect, meaning not at all. I logged that the
collection is there, however the multivalued collection field
just isn't being indexed (according to the DIH web UI, and it's
not in the index).


Re: DataImportHandler multivalued field Collection not working

2010-01-29 Thread Wangsheng Mei
Did you correctly set multiValued (not multivalue) = "true" in schema.xml?
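
For reference, a minimal schema.xml declaration for a multi-valued field might look like this (the field name and type are made up for illustration):

```xml
<!-- the attribute name is case-sensitive: multiValued -->
<field name="tags" type="string" indexed="true" stored="true"
       multiValued="true"/>
```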

2010/1/30 Jason Rutherglen 

> DataImportHandler multivalued field Collection isn't
> working the way I'd expect, meaning not at all. I logged the
> collection is there, however the multivalue collection field
> just isn't being indexed (according to the DIH web UI and it's
> not in the index).
>



-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Abin Mathew
Hi, I realized the power of the Dismax Query Handler recently and now I
don't need to generate my own query since Dismax is giving better
results. Thanks a lot

2010/1/29 Wangsheng Mei :
> What's the point of generating your own query?
> Are you sure that solr query syntax cannot satisfy your need?
>
> 2010/1/29 Abin Mathew 
>
>> Hi I want to generate my own customized query from the input string entered
>> by the user. It should look something like this
>>
>> *Search field : Microsoft*
>> *
>> Generated Query*  :
>> description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
>> role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0)
>> tags:microsoft^2.0
>> title:microsoft^3.5 functionalArea:microsoft
>>
>> *The lucene code we used is like this*
>> BooleanQuery must = new BooleanQuery();
>>
>> addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
>> addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
>> addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
>> addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
>> addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
>> addToBooleanQuery(must, "company", inputData, standardAnalyzer);
>> addToBooleanQuery(must, "city", inputData, standardAnalyzer);
>> must.setBoost(5.0f);
>> query.add(must, Occur.MUST);
>> addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
>> addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
>> addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);
>> *
>> In Simple english*
>> addToBooleanQuery will add the particular field to the query after
>> analysing
>> using the analyser mentioned and setting a boost as specified
>> So there "MUST" be a keyword match with any of the fields
>> tags,title,role,description,requirement,company,city and it "SHOULD" occur
>> in the fields tags,title and functionalArea.
>>
>> Hope you have got an idea of my requirement. I am not asking anyone to do
>> it
>> for me. Please let me know where I can start and give me some useful tips
>> to
>> move ahead with this. I believe that it has to do with modifying the XML
>> configuration file and setting the parameters in Dismax handler. But I am
>> still not sure. Please help
>>
>> Thanks & Regards
>> Abin Mathew
>>
>
>
>
> --
> 梅旺生
>
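
For reference, the per-field boosts in the Lucene code above translate to a dismax configuration roughly like this (an untested sketch; dismax's qf is a per-field disjunction, so it won't reproduce the MUST/SHOULD grouping exactly, and the handler name is made up):

```xml
<!-- untested sketch: weights copied from the generated query above -->
<requestHandler name="/jobsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">
      tags^2.0 title^3.5 role requirement company city description
      functionalArea
    </str>
  </lst>
</requestHandler>
```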


Looking for a Solr volunteer for www.comics.org

2010-01-29 Thread Henry Andrews
Hi folks,
  I apologize if this isn't the right place to post this (alternate suggestions 
welcome alongside appropriate chastisement :-)

  I'm trying to recruit a volunteer to implement a Solr-based search system for 
the Grand Comic-Book Database (http://www.comics.org/).  We're a non-profit, 
non-commercial, international group researching and indexing comic books, and 
we have only two active programmers (we're both unpaid volunteers, as are all 
GCD personnel).  We'd love to have better search, and Solr looks like the right 
tool, but we're swamped with other technical work.

  So if anyone reading this would like to help out a comic book-related web 
site with their Solr experience, for absolutely no monetary compensation 
whatsoever, do please let me know :-D  It would help to be into comic books, 
but that's not strictly required.  Your work would be used quite heavily, and 
you could of course point that out to anyone you might wish to impress with 
your expertise.  Our technical work is open-source, and therefore available for 
inspection and showing off.

  To clarify:  I'm not looking for assistance with or pointers about setting 
Solr up myself (no matter how easy it is).  And I'm not trying to get the list 
as a whole to do our work for us.  I'm just trying to find if any individual 
feels like joining our tech team and volunteering for the project and couldn't 
think of a more likely place to find candidates than here.  If we don't find a 
volunteer, I'll end up doing it next year, and I'll be reading a lot more 
documentation before asking any questions here.

thanks,
-henry



Re: Deleting spelll checker index

2010-01-29 Thread darniz

Then I assume the easiest way is to delete the directory itself.

darniz


hossman wrote:
> 
> 
> : We are using Index based spell checker.
> : i was wondering with the help of any url parameters can we delete the
> spell
> : check index directory.
> 
> I don't think so.
> 
> You might be able to configure two different spell check components that 
> point at the same directory -- one that builds off of a real field, and one 
> that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
> then you could trigger a rebuild of an empty spell checking index using 
> the second component.
> 
> But i've never tried it so i have no idea if it would work.
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27381620.html
Sent from the Solr - User mailing list archive at Nabble.com.