Re: tomcat install

2006-09-18 Thread Nick Snels

Hi James,

the problem is most likely an XML error in either schema.xml or
solrconfig.xml. Go through your Tomcat logs; if it is an XML error, you
should find the line where the XML parsing went wrong.
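
For instance (an illustrative check; log file names vary by platform and
install), XML parse failures typically show up as SAXParseException
entries that give the line and column of the bad XML:

   grep -n SAXParseException $CATALINA_HOME/logs/catalina.*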

Kind regards,

Nick

On 9/18/06, James liu <[EMAIL PROTECTED]> wrote:


Thanks, Nick.

I did as you told me and I can now see the admin page.

But when I click search, I get this error:

java.lang.NullPointerException
at org.apache.solr.search.SolrQueryParser.<init>(SolrQueryParser.java:37)
at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:47)
at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:94)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:586)
at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:91)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:595)



Re: tomcat install

2006-09-18 Thread James liu

Hi Nick,

It is very odd: when I rebooted my PC it was OK, and I had done nothing.

My new question is how to add lucene-analyzers-2.0.0.jar to Tomcat or Jetty.

I added the classes I need to the solr.war at
"C:\cygwin\tmp\solr-nightly\example\webapps\solr.war", but it has no
effect...

Do you know how to solve it?


Regards,

JL

2006/9/18, Nick Snels <[EMAIL PROTECTED]>:


Hi James,

the problem is most likely an XML error in either schema.xml or
solrconfig.xml. Go through your Tomcat logs; if it is an XML error, you
should find the line where the XML parsing went wrong.

Kind regards,

Nick

On 9/18/06, James liu <[EMAIL PROTECTED]> wrote:
> [earlier message and stack trace snipped]




Re: acts_as_solr

2006-09-18 Thread Simon Peter Nicholls
Additionally, we should mock out for testing reasons, e.g. in file  
test/mocks/test/acts_as_solr.rb:


require "#{File.dirname(__FILE__)}/../../../vendor/plugins/ 
acts_as_solr/lib/acts_as_solr.rb"


module SolrMixin #:nodoc:

module Acts #:nodoc:

module ARSolr #:nodoc:

module InstanceMethods #:nodoc:

def solr_save #:nodoc:

end

def solr_destroy #:nodoc:

end

end

end#module ARSolr

end#module Acts

end#module SolrMixin

The class method find_by_solr is the tricky case, but it could be handled
by various means, from specifying mock return collections, through using
Procs, to using Solr (or even Ferret) as a local test engine if the query
*absolutely* needs testing. The Solr interface seems simple enough that
mocking return values is fine, though.


Any thoughts?

Simon

On 17 Sep 2006, at 17:29, Simon Peter Nicholls wrote:

I had a play around with acts_as_solr today (thanks Yonik for the
pointer). I had to fiddle around for a while to set up the schema.xml
(I'm totally new to Solr), but it worked well.


Well, I'm far from RoR savvy (new to Ruby/Rails also), but
concerning the commit question, probably using an around_filter to
extend the transaction scope would be best. That way, migrations,
script/console, and other ways to use the model directly will still work
using auto-commit. I'll have a stab at it to keep the ball rolling.


Add the filter from the plugin:

ActionController::Base.around_filter SolrFilter.new

With something like this pseudocode:

class SolrFilter
  def before(controller)
    # disable solr commits for the duration of the request
  end

  def after(controller)
    # issue a single solr commit
  end
end

(and check whether a commit is needed within the save callback, of course)

Anyway, acts_as_solr is looking good, thanks!

Simon

>>>


I think Ruby is very fertile ground for Solr to pick up
users/developers right now.



I fully agree.  Ferret is coming along very nicely as well, which  
is wonderful for pure Rubyists that don't need the additional  
dependency, skill set to manage, and different environment that  
Solr would require.  But Solr really shines for all its caching and  
index management, so I'm sure there will be many RoR folks that  
will embrace Solr.




Getting into some little details, it looks like a commit (which
actually does an optimize) is done on every .save, right?



That's true.  I'm not sure how one would avoid doing a commit for  
a .save.  There isn't, as far as I know, broader granularity for  
database operations.  An optimize wouldn't be necessary, but  
certainly swapping over the searcher would be desired after a save.




I also notice that the commit is asynchronous... so one could do a
save, then do an immediate search and not see the changes yet, right?



That is true.  But holding up a save for a new IndexSearcher would  
be a big hit, at least in my application that currently takes 30+  
seconds of warming up before a new searcher is ready.




I don't know anything about RoR and ActiveRecord, but hopefully there
is some way to avoid a commit on every operation.



It could certainly be made more manual such that a developer would
need to code in when a commit happens.  I'm not currently sure what
other options there would be for it to be automatic but not done
for every .save.  Within a RoR application, one could code in a
<commit/> in a controller after_filter such that it would occur at the
end of an HTTP request by the browser.  Any RoR savvy folks have
suggestions on this?


Erik




about sortMissingLast and sortMissingFirst

2006-09-18 Thread James liu

I looked at the explanation of these in schema.xml, but I am not clear on it.

My understanding: if sortMissingLast is true, the document comes last in a
descending sort. For example: for a field named pname, if sortMissingLast is
true, pname will be sorted desc and we can't find results with pname, only
other field information.

Am I wrong?



--
regards
jl


no example to CollectionDistribution?

2006-09-18 Thread James liu

I can't find an example in the wiki.

Does anyone know of one?

--
regards
jl


Re: about sortMissingLast and sortMissingFirst

2006-09-18 Thread Chris Hostetter

: My understanding: if sortMissingLast is true, the document comes last in a
: descending sort. For example: for a field named pname, if sortMissingLast
: is true, pname will be sorted desc and we can't find results with pname,
: only other field information.

sortMissingLast is used to determine how you want documents that don't
have a value for the field to appear -- if sortMissingLast="true" then it
doesn't matter whether you sort "asc" or "desc": documents that don't have
a value for that field will always come last.  If sortMissingLast="false"
then the default Lucene sort behavior is used, in which "missing" values
are sorted the same as empty strings -- that is the "lowest" possible
value, so they come first in asc sorts.
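
For example, a schema.xml sketch (the type and field definitions are
illustrative):

   <!-- documents with no value for pname come last, whether asc or desc -->
   <fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>
   <field name="pname" type="string" indexed="true" stored="true"/>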



-Hoss



Re: no example to CollectionDistribution?

2006-09-18 Thread Chris Hostetter

: Subject: no example to CollectionDistribution?
:
: not find example in wiki.
:
:
: anyone know?

you really need to be more specific James: what kinds of examples are you
looking for? ... the CollectionDistribution page describes in depth how the
replication/distribution works, and has examples of the only things that
need to be cron'ed: snapcleaner, snappuller, and snapinstaller.
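
A minimal crontab sketch of that setup on a slave (paths, schedules, and
flags are illustrative; check the wiki page for the options each script
actually takes):

   # pull and install a fresh snapshot every 30 minutes
   0,30 * * * * /opt/solr/bin/snappuller && /opt/solr/bin/snapinstaller
   # once a day, remove snapshots more than 7 days old
   0 4 * * * /opt/solr/bin/snapcleaner -D 7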


-Hoss



update partial document

2006-09-18 Thread Brian Lucas
Hi, I wanted to ask whether anybody would find an update flag useful that
only replaced a subset of the data (i.e. a certain field) being passed in,
instead of the whole record.

 

Pseudo-code for what I'm describing:

 

<add>
  <doc>
    <id>125125</id>
    <partial>true</partial>
    <field name="language">+ RU</field>
    <field name="language">- EN</field>
  </doc>
</add>

 

Instead of deleting and reinserting an entire document, which is essentially
what Solr does each time an update is performed, it's sometimes preferable
to simply replace a single field's value, as one does in a database.

 

Any thoughts on the feasibility or limitations of this?

 

Brian



Re: update partial document

2006-09-18 Thread Simon Willnauer

I'm not into the code of Solr at all, but I know that Solr is based on
the Lucene core, which has no update mechanism of its own. To update a
document using Lucene you have to delete and reinsert the document.
That is probably the reason for Solr's behaviour as well.

You should consider that lucene is not a database!
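
To make that concrete, a minimal sketch of the delete-then-reinsert cycle
with the Lucene 2.0-era API (the index path, the id value, and the
rebuiltDoc variable are assumptions for illustration):

   // org.apache.lucene.index.*, org.apache.lucene.analysis.standard.*
   // delete the old version of the document by its unique id term
   IndexReader reader = IndexReader.open("/path/to/index");
   reader.deleteDocuments(new Term("id", "125125"));
   reader.close();

   // reinsert a rebuilt document; it must carry ALL of its fields again,
   // not just the one field that changed
   IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
   writer.addDocument(rebuiltDoc);
   writer.close();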

best regards simon

On 9/18/06, Brian Lucas <[EMAIL PROTECTED]> wrote:

[original message snipped]





Re: no example to CollectionDistribution?

2006-09-18 Thread James liu

Maybe I should get cron through Cygwin; my system is Win2003, not Unix.

Today I tried ./snappuller, but something seems wrong, even though I set
the master port, directory, and snap directory.

Tomorrow I will try again.



2006/9/18, Chris Hostetter <[EMAIL PROTECTED]>:



: Subject: no example to CollectionDistribution?
:
: not find example in wiki.
:
:
: anyone know?

you really need to be more specific James: what kinds of examples are you
looking for? ... the CollectionDistribution page describes in depth how the
replication/distribution works, and has examples of the only things that
need to be cron'ed: snapcleaner, snappuller, and snapinstaller.


-Hoss





--
regards
jl


Re: tomcat install

2006-09-18 Thread Nick Snels

Hi James,

I also needed the DutchAnalyzer from Lucene in my Solr project. I did it
the following way, which is probably the hard way, because my Java
knowledge isn't that great.

1. I unzipped the solr-nightly build.
2. I downloaded the latest code from Lucene, preferably from svn:
http://svn.apache.org/viewvc/lucene/java/ and all necessary analyzers from
the Lucene sandbox.
3. I put it into c:\solr-nightly\src\java\org\apache\lucene.
4. I installed ant (unzip it and add ANT_HOME to your path).
5. Then open a DOS prompt, go to c:\solr-nightly, and run 'ant dist'; this
makes a new solr-1.0.war file in c:\solr-nightly\dist. That war file also
contains the Lucene code along with your analyzers.

This is how I did it; I don't know if it is the right or the easiest way to
do it.

Kind regards,

Nick


On 9/18/06, James liu <[EMAIL PROTECTED]> wrote:


Hi Nick,

It is very odd: when I rebooted my PC it was OK, and I had done nothing.

My new question is how to add lucene-analyzers-2.0.0.jar to Tomcat or
Jetty.

I added the classes I need to the solr.war at
"C:\cygwin\tmp\solr-nightly\example\webapps\solr.war", but it has no
effect...

Do you know how to solve it?


Regards,

JL

2006/9/18, Nick Snels <[EMAIL PROTECTED]>:
> [earlier messages and stack trace snipped]




Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault
I've been playing around with the new 'facet search' and it works very
well, but it's really slow for some particular applications. I've been
trying to use it to display the most frequent authors of articles; this
is from a huge (15 million articles) database, and author names are
rare and heterogeneous. On a query that takes 0.1 seconds without facets,
it jumps to ~20 seconds with just 1% of the documents indexed
(I've been getting java.lang.OutOfMemoryError with the full index), and ~40
seconds for a faceted search on 2 (string) fields. Range queries on a
slong field are more acceptable (even with a dozen of them, query time is
still in the subsecond range).


Am I trying to do something which isn't what faceted search was made
for? It would be understandable; after all, I guess the facet engine
has to check every doc in the index and sort... which shouldn't yield
good performance no matter what, sadly.


Is there any other way I could achieve what I'm trying to do? Just a
list of the most frequent (top 5) authors present in the results of a query.


Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault
Just a little follow-up: I did a little more testing, and the query
takes 20 seconds no matter what, whether there's one document in the result
set or the query returns all 130 000 documents.


It seems something isn't right... it looks like Solr is doing the faceted
search on the whole index no matter what the result set is, when faceting
on a string field. I must be doing something wrong?


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Michael Imbeault wrote:
[earlier message snipped]



Re: tomcat install

2006-09-18 Thread James liu

Thank you; with your steps, plus adding junit, it is OK.

Can you now analyze your language?

I modified the schema:

   [the <fieldtype> analyzer XML was stripped from the archive]

but nothing changed.



2006/9/19, Nick Snels <[EMAIL PROTECTED]>:


Hi James,

I also needed the DutchAnalyzer from Lucene in my Solr project. I did it
the following way, which is probably the hard way, because my Java
knowledge isn't that great.

1. I unzipped the solr-nightly build.
2. I downloaded the latest code from Lucene, preferably from svn:
http://svn.apache.org/viewvc/lucene/java/ and all necessary analyzers from
the Lucene sandbox.
3. I put it into c:\solr-nightly\src\java\org\apache\lucene.
4. I installed ant (unzip it and add ANT_HOME to your path).
5. Then open a DOS prompt, go to c:\solr-nightly, and run 'ant dist'; this
makes a new solr-1.0.war file in c:\solr-nightly\dist. That war file also
contains the Lucene code along with your analyzers.

This is how I did it; I don't know if it is the right or the easiest way to
do it.

Kind regards,

Nick


On 9/18/06, James liu <[EMAIL PROTECTED]> wrote:
> [earlier messages and stack trace snipped]





--
regards
jl


Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Yonik Seeley

On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

I've been playing around with the new 'facet search' and it works very
well, but it's really slow for some particular applications. I've been
trying to use it to display the most frequent authors of articles


I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that Lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a document
is either not possible, or not fast (assuming many documents
match a query).

For cases like "author", if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.
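
A minimal sketch of that single-valued field-cache idea (Lucene 2.0-era
FieldCache plus Solr's DocSet/DocIterator; the field name and the
queryDocSet variable are assumptions for illustration, not working Solr
code):

   // org.apache.lucene.search.FieldCache, org.apache.solr.search.DocIterator
   // one author String per document, memory-resident after first use
   String[] authors = FieldCache.DEFAULT.getStrings(reader, "last_author");

   // count authors only over the documents that matched the query
   Map<String,Integer> counts = new HashMap<String,Integer>();
   DocIterator iter = queryDocSet.iterator();   // DocSet of the query's matches
   while (iter.hasNext()) {
     String author = authors[iter.nextDoc()];   // doc id indexes the cache array
     if (author == null) continue;              // document has no author value
     Integer n = counts.get(author);
     counts.put(author, n == null ? 1 : n + 1);
   }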

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Yonik Seeley

On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

Just a little follow-up: I did a little more testing, and the query
takes 20 seconds no matter what, whether there's one document in the result
set or the query returns all 130 000 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filterCache, the effective cache hit
rate will be 0%.
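
The relevant knob is the filterCache in solrconfig.xml; a sketch (the
numbers are illustrative -- it needs room for roughly one entry per
distinct author before the strategy above starts getting cache hits):

   <filterCache
     class="solr.search.LRUCache"
     size="1000000"
     initialSize="100000"
     autowarmCount="0"/>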

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault

Yonik Seeley wrote:

I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that Lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a document
is either not possible, or not fast (assuming many documents
match a query).

Yeah, that's what I've been thinking; the index isn't built to handle
such searches, sadly :( It would be very nice to be able to rapidly
search by most frequent author, journal, etc.

For cases like "author", if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.

I have one value per document (I have fields for authors, last_author
and first_author, and I'm doing faceted search on the first_author and
last_author fields). How would I use the field cache to fix my problem?
Also, would it be better to store a unique number (for each possible
author) in an int field along with the string, and do the faceted
searching on the int field? Would this be faster / require less memory?
I guess yes, and I'll test that when I have the time.



Just a little follow-up: I did a little more testing, and the query
takes 20 seconds no matter what, whether there's one document in the result
set or the query returns all 130 000 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filterCache, the effective cache hit
rate will be 0%.

-Yonik
So more memory would fix the problem? Also, I was under the impression
that it was only searching / sorting the authors that it knows are in
the result set... in the case of only one document (1 result), it seems
strange that it takes the same time as for 130 000 results. It should
just check the results, see that there's only one author, and return
that? And in the case of 2 documents, just sort 2 authors (or 1 if
they're the same)? I understand your answer (it does intersections), but
I wonder why it's intersecting over the whole document set at first, and
not docs_matching_query like you said.


Thanks for the support,

Michael


Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault

Another follow-up: I bumped all the caches in solrconfig.xml to

   size="1600384"
   initialSize="400096"
   autowarmCount="400096"

It seemed to fix the problem on a very small index (facets on the last and
first author fields, plus 12 date range facets; sub 0.3 seconds for
queries). I'll check on the full index tomorrow (it's indexing right
now, 400 docs/sec!). However, I still don't know what these values
represent, or how I should estimate what to set them to. Originally I
thought it was the size of the cache in KB, and someone on the list told
me it was the number of items, but I don't quite get it. Better
documentation on that would be welcome :)


Also, are there any plans to add an option not to run a facet search if
the result set is too big? To avoid 40-second queries when the docset is
too large...


Thanks,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:

[quoted reply snipped]