Question: Pagination with multi index box

2007-05-14 Thread James liu

When using multiple index boxes, how do I paginate correctly with
results sorted by score?

For example, I want to query "search" across 60 index boxes and sort by
score.

I don't know the number of hits found on each index box, since they hold
different content.

To guarantee 10 pages sorted correctly by score, I think Solr's start
has to be 0 and rows 100 (10 results per page).

60*100=6000 results; sort them and cache the top 100.

That is very slow, although it does guarantee 10 pages sorted correctly
by score.


Any idea how to fix this?

It should be fast and correct.



--
regards
jl


NumberFormat exception when trying to use recip function query

2007-05-14 Thread Mekin Maheshwari

Hi,

I am getting the following exception when I try & run any query :

java.lang.NumberFormatException: empty String
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:994)
    at java.lang.Float.parseFloat(Float.java:394)
    at org.apache.solr.search.QueryParsing$StrParser.getFloat(QueryParsing.java:478)
    at org.apache.solr.search.QueryParsing.parseValSource(QueryParsing.java:526)
    at org.apache.solr.search.QueryParsing.parseFunction(QueryParsing.java:579)
    at org.apache.solr.util.SolrPluginUtils.parseFuncs(SolrPluginUtils.java:519)
    at org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:321)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)


The function query field is -

   recip(popularityRank, 1, 1000,
1000)^0.5recip(rord(creationDate),1,1000,1000)^
0.3


relevant field definitions from schema are:
  
  


I have checked that there is no document that has popularityRank as empty or
null.
(I ran an update query to set it to a large number when popularityRank was
empty or null)


If I change the function to rord(popularityRank) - the queries start working


Any clue what else I could do to debug this.

Thanks,
mekin

--
My company - http://ugenie.com
My Blog - http://mekin.livejournal.com/
My linkedin URL - http://www.linkedin.com/in/mekin


Feature Request: Multiple default search fields

2007-05-14 Thread Jack L

The default search field is really handy. It helps simplify
the query, and thus simplify the application using solr.
My understanding is that solr only allows one default search field.
It would be useful to allow multiple default fields, and maybe
also specify a global field boost in the schema file, as opposed
to on a per-document basis at post time. For example, article
title can be given a higher boost factor than article content.

-- 
Best regards,
Jack



Re: Feature Request: Multiple default search fields

2007-05-14 Thread Brian Whitman


On May 14, 2007, at 12:38 PM, Jack L wrote:


> The default search field is really handy. It helps simplify
> the query, and thus simplify the application using solr.
> My understand is that solr only allows one default search field.
> It would be useful to allow multiple default fields, and maybe
> also specify a global field boost in the schema file, as opposed
> to on a per document bases during post time. For example, article
> title can be given a higher boost factor than article content.



For your first issue, use a copyField to copy all the text you want
as default to a default search field.
For the second, have you looked at
http://wiki.apache.org/solr/DisMaxRequestHandler ? You set up boosts
per field in solrconfig.xml that way.







RE: Requests per second/minute monitor?

2007-05-14 Thread Will Johnson
I've needed similar logged information recently and I looked at the code
and had a few questions:

Why does SolrCore.setResponseHeaderValues(...) set the QTime (and other
response header options) instead of having it as a function of
RequestHandlerBase?  If things were tracked in the RequestHandlers you
could add timing information there: avg query time, etc.  I know some
people have argued that you can do that with logs but being able to pull
that info live via JMX/stats.jsp would make monitoring much cleaner in
environments with multiple machines on different networks.  If things
are tracked in the handlers then people can add more statistics easily
to both query response headers and overall via custom handlers.

I'm happy to make the changes and supply a patch to move the logic as
well as adding a few simple metrics unless enough people on this thread
really feel that it's always better to do it with log files and
postmortem math.

- will


Re[3]: Multiple fq fields in URL

2007-05-14 Thread Chris Hostetter

:   q=samsung+camera
:
: And if samsung is mandatory, the query will be like this: (or not:)
:
:   q=+samsung+camera
:
: And the first + will be interpreted as mandatory flag?

No.  bottom line, forget all about URLs and URL escape.  step #1:
understand the Lucene query syntax...

   http://lucene.apache.org/java/docs/queryparsersyntax.html

in that syntax, this says samsung is mandatory and camera is optional...

+samsung camera

Step #2: use the admin form in Solr to type in queries, and enable the
debug option to see exactly what query structures you are
getting at the bottom of your results...

   http://localhost:8983/solr/admin/form.jsp

Step #3: only after you are sure you understand the syntax, and what
results you are getting, should you look at the URL to see how the
Lucene query syntax is being URL escaped.

Solr doesn't do anything magic with the URL, it doesn't do any special
Solr specific parsing ... the URL must be legal, and it must be valid, it
will be parsed/unescaped just like any other CGI/form style URL .. and
then the args will be interpreted.
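
To make the escaping point concrete, here is a minimal Python sketch (not part of Solr; the /solr/select path is only an assumption for illustration). It shows how the Lucene query "+samsung camera" is form-escaped, and what a server decodes when the "+" is left unescaped.

```python
from urllib.parse import parse_qs, quote_plus

# Lucene syntax: "samsung" is mandatory, "camera" is optional.
lucene_query = "+samsung camera"

# Form-style escaping: the literal '+' becomes %2B, the space becomes '+'.
escaped = quote_plus(lucene_query)

# Hypothetical request URL, assuming the select handler lives at this path.
url = "http://localhost:8983/solr/select?q=" + escaped

# If the '+' is NOT escaped, standard form decoding turns it into a space,
# so the mandatory flag never reaches the query parser.
unescaped_arg = parse_qs("q=+samsung+camera")["q"][0]

# With proper escaping, the original Lucene query is recovered intact.
escaped_arg = parse_qs("q=" + escaped)["q"][0]
```

Here unescaped_arg decodes to " samsung camera" (a leading space, no plus sign), while escaped_arg round-trips to the original query.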

I've updated the wiki page that started this thread to try and eliminate
any ambiguity about URL escaping...

http://wiki.apache.org/solr/CommonQueryParameters#fq


-Hoss



Re: Question: Pagination with multi index box

2007-05-14 Thread Mike Klaas

On 14-May-07, at 1:35 AM, James liu wrote:

> if use multi index box, how to pagination with sort by score
> correctly?
>
> for example, i wanna query "search" with 60 index box and sort by
> score.
>
> i don't know the num found from every index box which have different
> content.
>
> if promise 10 page with sort score correctly, i think solr 's start
> is 0,
> and rows is 100.(10 result per page)
>
> 60*100=6000, sort it and get top 100 to cache.
>
> it is very slove although it promise 10 page with sort score
> correctly.


With few index partitions, it is sufficient to ask for
startAt+numNeeded docs from each partition and sort globally.  Normally,
if you wanted 10 for the first page, you would ask for 10 from each
server and cache the remainder.  It is better to ask for more later,
if the user asks for page ten.
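
A minimal Python sketch of the simple approach described above, under stated assumptions: each partition has already been queried with start=0 and rows=start+rows, and returned (score, doc_id) pairs sorted by descending score. The function name and data shape are hypothetical, not a Solr API.

```python
import heapq
import itertools
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, doc_id); an assumed shape for illustration


def merge_page(partition_hits: Dict[str, List[Hit]], start: int, rows: int) -> List[Hit]:
    """Merge per-partition hit lists and return the global page.

    Every list must be sorted by descending score and must hold the
    partition's top (start + rows) hits, or all of its hits if it has fewer.
    """
    merged = heapq.merge(*partition_hits.values(), key=lambda h: h[0], reverse=True)
    top = list(itertools.islice(merged, start + rows))
    return top[start:start + rows]
```

The start offset is applied only after the global merge, never in the per-partition requests.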



When you get up to 60 partitions, you should make it a multi-stage
process.  Assuming your partitions are disjoint and evenly
distributed, estimate the number of documents that will appear in the
final result from each.  Double or triple that (and set a minimum
threshold), try to assemble the number of documents you require, and
if one partition "runs out" of docs before you are done, request a new
round.
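
The estimate for the first round can be sketched as follows. This is only an illustration of the arithmetic; the function name, the default factor of 3 and the default minimum of 5 are assumptions, not values from Solr.

```python
import math


def per_partition_rows(needed: int, partitions: int,
                       factor: float = 3.0, minimum: int = 5) -> int:
    """Rows to request from each partition in the first round.

    With evenly distributed partitions, each one is expected to contribute
    about needed / partitions docs to the final result; that estimate is
    multiplied by a safety factor and clamped to a minimum.
    """
    return max(minimum, math.ceil(needed * factor / partitions))
```

For 100 needed docs over 60 partitions this asks each partition for 5 docs, so the first round moves 300 docs instead of 6000.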


-Mike


Re: solr for corpus?

2007-05-14 Thread Huib Verweij

Hi matej,

since I didn't see anyone answering your question yet, I'll have a go at 
it, but I'm not one of the Solr developers, I've just used it so far and 
am very happy with it. I use it for searching literary texts, storing 
information from a SQL database in the Solr documents as metadata for 
the texts.



[EMAIL PROTECTED] wrote:
> i test solr as one of potential tools for the purpose of building a
> linguistic corpus.
>
> i'd like to have your opinion, to which extent it would be a good choice.
>
> the specifics, which i find deviate from the typical use of solr, are:
>
> 1. basic unit is a text (of a book, of a newspaper-article, etc.),
> with a bibliographic header
> looking at the examples of the solr-tutorial and the central concept
> of the "field",
> i am a bit confused how to map these on one another, ie would the
> whole text be one field "text"
> and the bibheader-items individual fields?

Yes, you could do that. What I did was: add the text as a whole in one
field, add each chapter in its own field, and add metadata fields from a
SQL database for each title (e.g. year=1966, author.name=Some one,
author.placeofbirth=Somewhere). Basically, everything you want to
explicitly search for/in you put in a separate field.

> 2. real full-text (no stop-words, really every word is indexed,
> possibly even all the punctuation)

Shouldn't be a problem, I think.

> 3. Lemma and PoS-tag (Part-of-Speech tag) or generally additional keys
> on the word-level,
> ie for every word we also have its lemma-value and its PoS.
> In dedicated systems, this is implemented either as verticale (each
> word in one line):
> word   lemma   pos
> ...
> or in newer systems with xml-attributes:
> trees
>
> Important is, that it has to be possible to mix this various layers in
> one query, eg:
> "word(some) lemma(nice) pos(Noun)"
>
> This seems to me to be the biggest challenge for solr.

I'm not 100% sure what you are trying to do here, sorry.

> 4. indexing/searching-ratio
> corpus is very static: the selection fo texts changes perhaps once a
> year (in production environment),
> so it doesnt really matter how long the indexing takes. Regarding the
> speed the emphasis is on the searches,
> which have to be "fast", exact and the results have to be further
> processable (kwic-view

Possible (though it cuts off searching the text for keywords after 50 KB.
Actually, Lucene does that and it is configurable, but it can be
annoying, so you might have to hack that if you find that Solr doesn't
return a kwic-index for a hit. But maybe I'm not using Solr the right
way ;-) ).

> , thinning the solution,

Possible.

> sorting,

Possible.

> export,

Not sure what you mean here, but Solr just returns an XML document that
you can process any way you like.

> etc.). "Fast" is important also for more complex queries (ranges,
> boolean operators and prefixes mixed)
> and i say 10-15 seconds is the upper limit, which should be rather an
> exception to the rule of ~ 1 second.
>
> 5. also to regard the size: we are talking of multiples of 100
> millions of tokens.
> the canonic example British National Corpus is 100 million, there are
> corpora with 2 billions tokens

That's a lot of text. I find Solr performs very well, but I can't
guarantee you that Solr will work in your case; other, more
knowledgeable people might be able to, though.


Good luck with your decision making!

Kind regards,

Huib Verweij.



RE: solr for corpus?

2007-05-14 Thread Binkley, Peter
Regarding the Lemma and PoS-tag requirement: you might handle this by
inserting each word as its own document, with "lemma", "pos", and "word"
fields, thereby allowing you lots of search flexibility. You could also
include ID fields for the item and (if necessary) part (chapter etc.)
and use these as facets, allowing you to group results by the items that
contain them. Your application would have to know how to use the item ID
value to retrieve the full item-level record.
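
As a rough illustration of the word-per-document idea, here is a Python sketch that turns one tagged text into word-level records. The "word", "lemma" and "pos" field names come from the suggestion above; the "id" and "item_id" field names and the function itself are hypothetical.

```python
from typing import Dict, List, Tuple


def word_documents(item_id: str,
                   tagged_words: List[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    """Build one record per word from (word, lemma, pos) triples.

    item_id ties each word-level record back to the item-level record,
    so it can be used for faceting or for fetching the full item.
    """
    return [
        {
            "id": f"{item_id}-{position}",  # hypothetical unique key
            "item_id": item_id,
            "word": word,
            "lemma": lemma,
            "pos": pos,
        }
        for position, (word, lemma, pos) in enumerate(tagged_words)
    ]
```

Any item-level field you want to filter on in word-level queries (a date, for example) would have to be copied into each of these records as well.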

These word-level records could live in a separate index or in the main
index (since there are no required fields in Solr, you can have entirely
different record structures in a single index; you just have to
structure your queries accordingly). The problem will be that because
your word-level entries are separate from your item-level entries,
you'll have to include in the word-level entries any item-level fields
that you want to be able to use in word-level queries (e.g. if you
wanted to be able to limit a lemma search by date).  

The alternative would be to insert the lemma/pos/word entries in a
multivalued string field and come up with more complex wildcard query
structures to get at them. Apparently you can now get queries with
leading and trailing wildcards to work, so you should be able to do
everything you need, but I don't know how the performance will be.

All the best,

Peter

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 12, 2007 11:28 AM
To: solr-user@lucene.apache.org
Subject: solr for corpus?

i test solr as one of potential tools for the purpose of building a
linguistic corpus.

i'd like to have your opinion, to which extent it would be a good
choice.

the specifics, which i find deviate from the typical use of solr, are:

1. basic unit is a text (of a book, of a newspaper-article, etc.), with
a bibliographic header looking at the examples of the solr-tutorial and
the central concept of the "field", i am a bit confused how to map these
on one another, ie would the whole text be one field "text"
and the bibheader-items individual fields?

2. real full-text (no stop-words, really every word is indexed, possibly
even all the punctuation)

3. Lemma and PoS-tag (Part-of-Speech tag) or generally additional keys
on the word-level, ie for every word we also have its lemma-value and
its PoS.
In dedicated systems, this is implemented either as verticale (each word
in one line):
word   lemma   pos
...
or in newer systems with xml-attributes:
trees

Important is, that it has to be possible to mix this various layers in
one query, eg:
"word(some) lemma(nice) pos(Noun)"

This seems to me to be the biggest challenge for solr.

4. indexing/searching-ratio
corpus is very static: the selection fo texts changes perhaps once a
year (in production environment), so it doesnt really matter how long
the indexing takes. Regarding the speed the emphasis is on the searches,
which have to be "fast", exact and the results have to be further
processable (kwic-view, thinning the solution, sorting, export, etc.). 
"Fast" is important also for more complex queries (ranges, boolean
operators and prefixes mixed) and i say 10-15 seconds is the upper
limit, which should be rather an exception to the rule of ~ 1 second.

5. also to regard the size: we are talking of multiples of 100 millions
of tokens.
the canonic example British National Corpus is 100 million, there are
corpora with 2 billions tokens



thank you in advance

regards
matej


Re: NumberFormat exception when trying to use recip function query

2007-05-14 Thread Chris Hostetter

: The function query field is -
:  
: recip(popularityRank, 1, 1000,
: 1000)^0.5recip(rord(creationDate),1,1000,1000)^
: 0.3
:  

off the top of my head, i'd suggest you:
  1) verify there is some whitespace between the boost of the
popularity recip function and the date recip function
  2) eliminate the spaces inside the recip functions
  3) verify that there is no space between either recip function and its
boost

...and see if that works...

  
 recip(popularityRank,1,1000,1000)^0.5
 recip(rord(creationDate),1,1000,1000)^0.3
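
For anyone choosing the constants: as far as I know, Solr documents recip(x,m,a,b) as a/(m*x+b). The Python sketch below only mirrors that formula so the curve can be inspected offline; it is not Solr code.

```python
def recip(x: float, m: float, a: float, b: float) -> float:
    """Reciprocal function a / (m*x + b), mirroring the documented formula."""
    return a / (m * x + b)


# Values of recip(x,1,1000,1000) at a few ranks: the boost starts at 1.0
# and has halved by the time x reaches 1000.
samples = {x: recip(x, 1, 1000, 1000) for x in (0, 1000, 9000)}
```

With m=1 and a=b=1000, the function gives 1.0 at x=0, 0.5 at x=1000 and 0.1 at x=9000, so larger a and b values flatten the curve.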
  



-Hoss



Re: Null pointer exception

2007-05-14 Thread Chris Hostetter
: I have tried indexing from the exampledocs which is just sitting in my
: user home directory but now I get a null pointer exception after
: running:

just to clarify: are you using solr 1.1 or a nightly build? did you check
the log file to ensure that there are no exceptions when you start tomcat?
are you using the example solrconfig.xml and schema.xml?  have you tried
doing a search first without indexing any docs to see if that executes and
(correctly) returns 0 docs?

If i had to guess, i'd speculate that you aren't correctly using a system
prop or JNDI to point Solr at your solr home dir, so it's not finding the
configs; either that, or you've modified the configs and there is a
syntax error -- either way there should be an exception when the server
starts up, well before you update any docs.


-Hoss



Re: solr for corpus?

2007-05-14 Thread Chris Hostetter

: 3. Lemma and PoS-tag (Part-of-Speech tag) or generally additional keys
: on the word-level,
: ie for every word we also have its lemma-value and its PoS.

: or in newer systems with xml-attributes:
: trees
:
: Important is, that it has to be possible to mix this various layers in
: one query, eg:
: "word(some) lemma(nice) pos(Noun)"

the best way to approach this would probably be to preprocess the data and
use a custom analyzer ... send it to solr with all of the info encoded in
each word (ie: trees__tree_Noun) and then have a custom indexing analyzer
create multiple tokens in each position with an easy way to distinguish
whether a token is a word, the Lemma for a word, or the POS for a word (ie:
the regular word plain, the Lemma prefixed by two underscores, and the POS
prefixed by a single underscore). then at query time if you know you are
looking for the phrase "some nice trees" you would search for "some nice
trees" but if you are looking for the word "some" followed by a word whose
lemma is "nice" followed by any Noun, you would search for "some __nice _Noun"
: This seems to me to be the biggest challenge for solr.

yeah ... neither Solr nor Lucene really attempts to tackle complex query
forms like this ... but Lucene has recently added a Token Payload
mechanism in an attempt to make queries like this easier (allowing
annotation of the actual terms that can be queried instead of needing to
create artificial terms in identical positions)

: corpus is very static: the selection fo texts changes perhaps once a
: year (in production environment),
: so it doesnt really matter how long the indexing takes. Regarding the
: speed the emphasis is on the searches,
: which have to be "fast", exact and the results have to be further
: processable (kwic-view, thinning the solution, sorting, export, etc.).
: "Fast" is important also for more complex queries (ranges, boolean
: operators and prefixes mixed)

these things should all be decent, especially since your index will be
fairly static so you don't have to worry about 'warming' FieldCaches for
sorting etc.  something you might want to consider if you find query
speeds unacceptable on your full corpus with stop words left in would be
to sacrifice disk for speed by creating another field where the stop words
are removed and using it as much as possible (ie: anytime a query doesn't
care about stop words) ... but i wouldn't worry about that unless you
find it's actually a problem.  i've yet to see a complaint from anyone
that Solr isn't fast enough unless they are doing heavy faceting, or
updating their index so frequently that the caches can't be used.


-Hoss



RE: Null pointer exception

2007-05-14 Thread Gary Browne
Thanks a lot for your reply Chris

I am running v1.1.0. If I do a search (from the admin page), it throws
the following exception:

java.lang.RuntimeException: java.io.IOException:
/var/www/html/solr/data/index not a directory

There are no exceptions on starting Tomcat, only one warning regarding
JMS client lib not found (related to Cocoon). I have named a file
solr.xml in my $TOMCAT_HOME/conf/Catalina/localhost directory containing
the following:





I am using the example configs (unmodified).

Thanks again
Gary


Gary Browne
Development Programmer
Library IT Services
University of Sydney
Australia
ph: 61-2-9351 5946 
-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 15 May 2007 7:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Null pointer exception

: I have tried indexing from the exampledocs which is just sitting in my
: user home directory but now I get a null pointer exception after
: running:

just to clarify: are you using solr 1.1 or a nightly build? did you
check
the log file to ensure thatthere are no exceptions when you start
tomcat?
are you using the example solrconfig.xml and schema.xml?  have you tried
doing a search first without indexing any docs to see if that executs
and
(correctly) returns 0 docs?

If i had to guess, i'd speculate that you aren't correctly using a
system
prop or JNDI to point Solr at your solr home dir, so it's not finding
the
configs; either that, or you've modified the configs and there is a
syntax error -- either way there should be an exception when the server
starts up, well before you update any docs.


-Hoss



RE: Null pointer exception

2007-05-14 Thread Chris Hostetter
: I am running v1.1.0. If I do a search (from the admin page), it throws
: the following exception:
:
: java.lang.RuntimeException: java.io.IOException:
: /var/www/html/solr/data/index not a directory

does /var/www/html/solr/data/ exist? ... if so does the effective userID
for tomcat have permission to write to it?  if not does the effective
userID for tomcat have permission to write to /var/www/html/solr/ ?
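
These checks can be scripted. The Python sketch below is a hypothetical helper, not anything shipped with Solr, and its answers are only meaningful when it is run as the same user that Tomcat runs as.

```python
import os
from typing import Dict, Union


def check_data_dir(data_dir: str) -> Dict[str, Union[str, bool]]:
    """Report whether data_dir exists and whether the current user can write
    to it or, failing that, to its parent (so the directory can be created)."""
    parent = os.path.dirname(data_dir.rstrip("/")) or "/"
    return {
        "parent": parent,
        "exists": os.path.isdir(data_dir),
        "writable": os.access(data_dir, os.W_OK),
        "parent_writable": os.access(parent, os.W_OK),
    }
```

For example, check_data_dir("/var/www/html/solr/data/") reports on the data directory and on /var/www/html/solr as its parent.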



-Hoss



Re: Question: Pagination with multi index box

2007-05-14 Thread James liu

2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:


> On 14-May-07, at 1:35 AM, James liu wrote:
>
>> if use multi index box, how to pagination with sort by score
>> correctly?
>>
>> for example, i wanna query "search" with 60 index box and sort by
>> score.
>>
>> i don't know the num found from every index box which have different
>> content.
>>
>> if promise 10 page with sort score correctly, i think solr 's start
>> is 0,
>> and rows is 100.(10 result per page)
>>
>> 60*100=6000, sort it and get top 100 to cache.
>>
>> it is very slove although it promise 10 page with sort score
>> correctly.
>
> With few index partitions, you it is sufficient to ask for startAt
> +numNeeded docs from each partition and sort globally.  Normally if
> you wanted 10 for the first page, you would ask for 10 from each
> server and cache the remainder.  It is better to ask for more later
> if the user asks for page ten.
>
> When you get up to 60 partitions, you should make it a multi stage
> process.  Assuming your partitions are disjoint and evenly
> distributed, estimate the number of documents that will appear in the
> final result from each.

Yes, the partitions are distributed.

> Double or triple that (and put a minimum
> threshold), try to assemble the number of documents you require, and
> if one partition "runs out" of docs before it is done, request a new
> round.

I don't know what you mean by "runs out".

One user request generates 60 partition requests.

They work in parallel.

So I don't know each partition's status before they are all done.

To guarantee 10 pages of results sorted correctly by score, the only way
seems to be to get 100 results (rows=100) from each partition, but that
is very slow.

Now I want to find a way to get results sorted correctly by score and
still search fast.

> -Mike

Thanks Mike, but that is not what I want.


--
regards
jl


Re: Question: Pagination with multi index box

2007-05-14 Thread James liu

If I set rows=(page-1)*10, I will miss more results that match the query.

How should I set start when paginating?








--
regards
jl


RE: Null pointer exception

2007-05-14 Thread Gary Browne
Hi Chris

The /var/www/html/solr/data/ directory did exist. I tried opening up
permissions completely for testing but no luck (the tomcat user had
write permissions).

I decided to trash the whole installation and start again. I downloaded
last night's build and untarred it, put the .war into
$TOMCAT_HOME/webapps, and copied the example/solr directory to
/var/www/html/solr. No JNDI file this time; I just updated solrconfig to
read /var/www/html/solr as my data.dir.

I can access the admin page but when I try an index action from the
commandline, or a search from the admin page, I get something like:

"The requested resource (/solr/select/) is not available"

I have other apps running under tomcat okay, seems like it can't find
the lib .jars or can't access the classes within them?

Stuck...

Cheers
Gary



Gary Browne
Development Programmer
Library IT Services
University of Sydney
Australia
ph: 61-2-9351 5946 

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 15 May 2007 9:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Null pointer exception

: I am running v1.1.0. If I do a search (from the admin page), it throws
: the following exception:
:
: java.lang.RuntimeException: java.io.IOException:
: /var/www/html/solr/data/index not a directory

does /var/www/html/solr/data/ exist? ... if so does the effective userID
for tomcat have permission to write to it?  if not does the effective
userID for tomcat have permission to write to /var/www/html/solr/ ?



-Hoss



Re: Question: Pagination with multi index box

2007-05-14 Thread Mike Klaas

On 14-May-07, at 6:49 PM, James liu wrote:


> 2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>>
>> On 14-May-07, at 1:35 AM, James liu wrote:
>>
>> When you get up to 60 partitions, you should make it a multi stage
>> process.  Assuming your partitions are disjoint and evenly
>> distributed, estimate the number of documents that will appear in the
>> final result from each.
>
> yes, partitions distrbuted.
>
>> Double or triple that (and put a minimum
>> threshold), try to assemble the number of documents you require, and
>> if one partition "runs out" of docs before it is done, request a new
>> round.
>
> i dont' know what u mean "runs out"


Say you request 5 docs from each of 60 partitions, and are interested  
in docs 1-10.  If, sorted by score, the docs come from:


p1, p2, p1, p1, p3, p4, p1, p1

Then p1 has "run out" at n=8, and there is no way to be sure if the  
remaining two needed docs come from p1 or somewhere else.  So you  
have to now request at least two additional documents from p1.
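
A minimal Python sketch of that check, under stated assumptions: every partition was asked for the same number of docs (requested), each returned (score, doc_id) pairs, and score ties are ignored. The function name and data shape are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, doc_id); an assumed shape for illustration


def partitions_to_refetch(partition_hits: Dict[str, List[Hit]],
                          requested: int, needed: int) -> List[str]:
    """Return partitions that have "run out" within the merged top `needed`.

    A partition has run out when it returned the full `requested` count
    (so it may hold more docs) and every doc it returned falls inside the
    merged top `needed`; its unseen docs could still belong in the result.
    """
    tagged = [(score, name)
              for name, hits in partition_hits.items()
              for score, _doc in hits]
    tagged.sort(key=lambda t: t[0], reverse=True)
    used = Counter(name for _score, name in tagged[:needed])
    return sorted(name for name, hits in partition_hits.items()
                  if len(hits) == requested and used[name] == requested)
```

If the returned list is empty, the merged top `needed` docs are final; otherwise request more docs from the listed partitions and merge again.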



> one user request will generate 60 partitions request.
>
> they work in parallel。
>
> so i don't know every partion's status before they done.


Normally, you would wait for them to finish, and execute a subsequent  
request if more docs are needed.


-Mike

Re: Question: Pagination with multi index box

2007-05-14 Thread Mike Klaas


On 14-May-07, at 7:15 PM, James liu wrote:


> if i set rows=(page-1)*10,,,it will lose more result which fits query.
>
> how to set start when pagination.


I'm not sure I understand the question.

When combining results from partitions, you can't use startAt.  You  
must always assemble the docs from 0 to N for each partition (whether  
through one request or multiple).


-Mike








Re: Question: Pagination with multi index box

2007-05-14 Thread James liu

Thanks for your detailed answer.

But you are ignoring "sorted by score".

p1, p2, p1, p1, p3, p4, p1, p1

Maybe their max score is lower than that of docs from p19 or p20.

Then the results will not be sorted by score correctly.

And if the user clicks page 2, how do I show the data?

Does p1 start from 10, or do I query the other partitions?


2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:


On 14-May-07, at 6:49 PM, James liu wrote:

> 2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>>
>> On 14-May-07, at 1:35 AM, James liu wrote:
>>
>> When you get up to 60 partitions, you should make it a multi stage
>> process.  Assuming your partitions are disjoint and evenly
>> distributed, estimate the number of documents that will appear in the
>> final result from each.
>
>
> yes, partitions distrbuted.
>
>
> Double or triple that (and put a minimum
>> threshold), try to assemble the number of documents you require, and
>> if one partition "runs out" of docs before it is done, request a new
>> round.
>
>
> i dont' know what u mean "runs out"

Say you request 5 docs from each of 60 partitions, and are interested
in docs 1-10.  If, sorted by score, the docs come from:

p1, p2, p1, p1, p3, p4, p1, p1

Then p1 has "run out" at n=8, and there is no way to be sure if the
remaining two needed docs come from p1 or somewhere else.  So you
have to now request at least two additional documents from p1.

> one user request will generate 60 partitions request.
>
> they work in parallel。
>
> so i don't know every partion's status before they done.

Normally, you would wait for them to finish, and execute a subsequent
request if more docs are needed.

-Mike





--
regards
jl


Documenting function queries [was Re: NumberFormat exception when trying to use recip function query]

2007-05-14 Thread Mekin Maheshwari

  2) eliminate the space inside the recip functions


This solved it :)

I would like to document this along with a little detail about function
queries & may be if I get enough time, simple graphs that I created to help
people choose the right values for using in the function queries.

I dont see a link from the wiki at the top level -
http://wiki.apache.org/solr/

I do see a stub for - http://wiki.apache.org/solr/FunctionQuery

Which I can start filling up.

The other options are - http://wiki.apache.org/solr/SolrRelevancyCookbook
and
http://wiki.apache.org/solr/DisMaxRequestHandler

I am inclined to create the FunctionQuery page and add links to it from
the other 2 pages.

Let me know if you think of a more appropriate place to put this stuff.


Thanks,
mekin


On 5/15/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: The function query field is -
:  
: recip(popularityRank, 1, 1000,
: 1000)^0.5recip(rord(creationDate),1,1000,1000)^
: 0.3
:  

off the top of my head, i'd suggest you:
  1) verify there is some whitespace between the boost of the
popularity recip function and the date recip function
  2) eliminate the space inside the recip functions
  3) verify that there is no space between either recip function and its
boost

...and see if that works...

  
 recip(popularityRank,1,1000,1000)^0.5
 recip(rord(creationDate),1,1000,1000)^0.3
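
One way to avoid the whitespace problems listed above is to assemble the parameter value in code. This is only a sketch: build_bf is a hypothetical helper, not part of Solr or any client library.

```python
def build_bf(functions):
    """Build a boost-function string from (function_text, boost) pairs."""
    parts = []
    for text, boost in functions:
        compact = "".join(text.split())     # no whitespace inside a function
        parts.append(f"{compact}^{boost}")  # no space between function and boost
    return " ".join(parts)                  # whitespace between the functions


bf = build_bf([
    ("recip(popularityRank, 1, 1000,\n1000)", 0.5),
    ("recip(rord(creationDate),1,1000,1000)", 0.3),
])
print(bf)
```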
  



-Hoss





--


Re: Question: Pagination with multi index box

2007-05-14 Thread James liu

For example, I want to query "lucene", and its numFound is 234300.

The results should be sorted by score.

In that case, how would you paginate and sort by score?


2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:



On 14-May-07, at 7:15 PM, James liu wrote:

> if i set rows=(page-1)*10,,,it will lose more result which fits query.
>
> how to set start when pagination.

I'm not sure I understand the question.

When combining results from partitions, you can't use startAt.




If I can't use startAt, how should I set rows so that the user can still find results?


You must always assemble the docs from 0 to N for each partition (whether
through one request or multiple).
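
A minimal sketch of that merge, assuming each partition has already returned its docs 0..N as (score, doc_id) pairs in descending score order. The function name and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def merged_page(partition_results, start, rows):
    """Merge per-partition lists (each sorted by descending score) and
    return the docs at positions start .. start+rows-1 of the merged list."""
    merged = heapq.merge(*partition_results, key=lambda doc: doc[0], reverse=True)
    return list(islice(merged, start, start + rows))


p1 = [(0.9, "a"), (0.7, "b"), (0.2, "c")]
p2 = [(0.8, "d"), (0.3, "e")]
print(merged_page([p1, p2], start=2, rows=2))  # [(0.7, 'b'), (0.3, 'e')]
```

The start offset is applied only to the merged list; every partition is still queried from 0.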



If rows is bigger it will be slow; if it is smaller it will lose data and the
score order will not be correct.

-Mike


>
>
> 2007/5/15, James liu <[EMAIL PROTECTED]>:
>>
>>
>>
>> 2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>> >
>> > On 14-May-07, at 1:35 AM, James liu wrote:
>> >
>> > > if use multi index box, how to pagination with sort by score
>> > > correctly?
>> > >
>> > > for example, i wanna query "search" with 60 index box and sort by
>> > > score.
>> > >
>> > > i don't know the num found from every index box which have different
>> > > content.
>> > >
>> > > if promise 10 page with sort score correctly, i think solr 's start is 0,
>> > > and rows is 100.(10 result per page)
>> > >
>> > > 60*100=6000, sort it and get top 100 to cache.
>> >
>> > > it is very slove although it promise 10 page with sort score
>> > > correctly.
>> >
>> > With few index partitions, it is sufficient to ask for startAt
>> > +numNeeded docs from each partition and sort globally.  Normally if
>> > you wanted 10 for the first page, you would ask for 10 from each
>> > server and cache the remainder.  It is better to ask for more later
>> > if the user asks for page ten.
>> >
>> >
>> > When you get up to 60 partitions, you should make it a multi stage
>> > process.  Assuming your partitions are disjoint and evenly
>> > distributed, estimate the number of documents that will appear in the
>> > final result from each.
>>
>>
>> yes, partitions distrbuted.
>>
>>
>>  Double or triple that (and put a minimum
>> > threshold), try to assemble the number of documents you require, and
>> > if one partition "runs out" of docs before it is done, request a new
>> > round.
>>
>>
>> i dont' know what u mean "runs out"
>>
>> one user request will generate 60 partitions request.
>>
>> they work in parallel。
>>
>> so i don't know every partion's status before they done.
>>
>>
>> To promise 10 pages of results sorted by score correctly, the only way seems
>> to be to get 100 results (rows=100) from each partition, but it is very slow.
>>
>> Now I want to find a way to get results sorted by score correctly and search
>> fast.
>>
>>
>> -Mike
>> >
>>
>> Thks Mike. But it not i want.
>>
>>
>> --
>> regards
>> jl
>
>
>
>
> --
> regards
> jl





--
regards
jl


Re: Documenting function queries [was Re: NumberFormat exception when trying to use recip function query]

2007-05-14 Thread Chris Hostetter

: I would like to document this along with a little detail about function
: queries & may be if I get enough time, simple graphs that I created to help
: people choose the right values for using in the function queries.

that would be *awesome*


: I do see a stub for - http://wiki.apache.org/solr/FunctionQuery

there's no actual stub article, but the wiki probably shows you a link
there from somewhere that someone types FunctionQuery (since java class
names look like wikiwords) so there's no particular reason to fill up that
page ... but it's pretty much the best possible name, so by all means
start using it.

: I am inclined to creating the FunctionQuery page and adding links to it from
: the other 2 pages.

sounds like a good plan to me.


-Hoss



Re: Question: Pagination with multi index box

2007-05-14 Thread Mike Klaas

On 14-May-07, at 8:55 PM, James liu wrote:


thks for your detail answer.

but u ignore "sorted by score"

p1, p2,p1,p1,p3,p4,p1,p1

maybe their max score is lower than from p19,p20.



I'm not ignoring it: I'm implying that the above is the correct  
descending score-sorted order.  You have to perform that sort manually.



so it will not sorted by score correctly.

and if user click page 2 to see, how to show data?

p1 start from 10 or query other partitions?


Assemble results 1 through 20, then display 11-20 to the user.
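
The page arithmetic behind that answer, as a small sketch. It assumes 10 results per page and 1-based positions; page_window is a hypothetical name.

```python
def page_window(page, per_page=10):
    """Return (assemble_up_to, first_shown, last_shown), all 1-based."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    last = page * per_page
    return last, last - per_page + 1, last


merged = list(range(1, 21))  # stand-in for merged results 1..20
need, first, last = page_window(2)
print(need, merged[first - 1:last])  # 20, then positions 11..20
```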

-Mike



2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:


On 14-May-07, at 6:49 PM, James liu wrote:

> 2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>>
>> On 14-May-07, at 1:35 AM, James liu wrote:
>>
>> When you get up to 60 partitions, you should make it a multi stage
>> process.  Assuming your partitions are disjoint and evenly
>> distributed, estimate the number of documents that will appear in the
>> final result from each.
>
>
> yes, partitions distrbuted.
>
>
> Double or triple that (and put a minimum
>> threshold), try to assemble the number of documents you require, and
>> if one partition "runs out" of docs before it is done, request a new
>> round.
>
>
> i dont' know what u mean "runs out"

Say you request 5 docs from each of 60 partitions, and are interested
in docs 1-10.  If, sorted by score, the docs come from:

p1, p2, p1, p1, p3, p4, p1, p1

Then p1 has "run out" at n=8, and there is no way to be sure if the
remaining two needed docs come from p1 or somewhere else.  So you
have to now request at least two additional documents from p1.

> one user request will generate 60 partitions request.
>
> they work in parallel。
>
> so i don't know every partion's status before they done.

Normally, you would wait for them to finish, and execute a subsequent
request if more docs are needed.

-Mike





--
regards
jl




Re: Question: Pagination with multi index box

2007-05-14 Thread James liu

2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:


On 14-May-07, at 8:55 PM, James liu wrote:

> thks for your detail answer.
>
> but u ignore "sorted by score"
>
> p1, p2,p1,p1,p3,p4,p1,p1
>
> maybe their max score is lower than from p19,p20.
>

I'm not ignoring it: I'm implying that the above is the correct
descending score-sorted order.  You have to perform that sort manually.



I mean merging the results (from the 60 partitions) and sorting them myself, not Solr's sort.
The results from each box have already been sorted by score.



so it will not sorted by score correctly.
>
> and if user click page 2 to see, how to show data?
>
> p1 start from 10 or query other partitions?

Assemble results 1 through 20, then display 11-20 to the user.



For example, I want to query "solr".

p1 has 100 results whose scores are higher than 80.

p2 has 100 results whose scores are lower than 20.

So if I use rows=10, the score order is not correct.

If I want to promise 10 pages that are sorted by score correctly,

I have to get 100 results (rows=100) from every box,

then merge the results, sort them, and finally take the top 100.

But that will be very slow.
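
The two-partition example can be checked with made-up scores. The sketch below assumes p1 holds scores 81..180 and p2 holds scores 0.0..9.9; these numbers are invented to match the description. It suggests that rows=10 is already enough for page 1, and that only later pages need more docs, and only from the partition that ran out.

```python
import heapq

# Hypothetical scores: every p1 doc is above 80, every p2 doc is below 20.
p1 = sorted(((81 + i, "p1") for i in range(100)), reverse=True)
p2 = sorted(((i / 10, "p2") for i in range(100)), reverse=True)


def top(n, rows):
    """Top n docs after fetching `rows` docs from each partition."""
    return heapq.nlargest(n, p1[:rows] + p2[:rows])


truth = heapq.nlargest(100, p1 + p2)

print(top(10, rows=10) == truth[:10])  # page 1 with rows=10
print(top(20, rows=10) == truth[:20])  # page 2 with rows=10
print(top(20, rows=20) == truth[:20])  # page 2 with rows=20
```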


I don't know how other search engines solve this. Maybe they do not sort by
score entirely correctly.




-Mike


>
> 2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>>
>> On 14-May-07, at 6:49 PM, James liu wrote:
>>
>> > 2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>> >>
>> >> On 14-May-07, at 1:35 AM, James liu wrote:
>> >>
>> >> When you get up to 60 partitions, you should make it a multi stage
>> >> process.  Assuming your partitions are disjoint and evenly
>> >> distributed, estimate the number of documents that will appear in the
>> >> final result from each.
>> >
>> >
>> > yes, partitions distrbuted.
>> >
>> >
>> > Double or triple that (and put a minimum
>> >> threshold), try to assemble the number of documents you require, and
>> >> if one partition "runs out" of docs before it is done, request a new
>> >> round.
>> >
>> >
>> > i dont' know what u mean "runs out"
>>
>> Say you request 5 docs from each of 60 partitions, and are interested
>> in docs 1-10.  If, sorted by score, the docs come from:
>>
>> p1, p2, p1, p1, p3, p4, p1, p1
>>
>> Then p1 has "run out" at n=8, and there is no way to be sure if the
>> remaining two needed docs come from p1 or somewhere else.  So you
>> have to now request at least two additional documents from p1.
>>
>> > one user request will generate 60 partitions request.
>> >
>> > they work in parallel。
>> >
>> > so i don't know every partion's status before they done.
>>
>> Normally, you would wait for them to finish, and execute a subsequent
>> request if more docs are needed.
>>
>> -Mike
>
>
>
>
> --
> regards
> jl





--
regards
jl


Re: Question: Pagination with multi index box

2007-05-14 Thread James liu

Maybe sorting exactly correctly is not very important for full-text search.


2007/5/15, James liu <[EMAIL PROTECTED]>:




2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>
> On 14-May-07, at 8:55 PM, James liu wrote:
>
> > thks for your detail answer.
> >
> > but u ignore "sorted by score"
> >
> > p1, p2,p1,p1,p3,p4,p1,p1
> >
> > maybe their max score is lower than from p19,p20.
> >
>
> I'm not ignoring it: I'm implying that the above is the correct
> descending score-sorted order.  You have to perform that sort manually.


i mean merged results(from 60 p) and sort it, not solr's sort.
every result from box have been  sorted by score.


> so it will not sorted by score correctly.
> >
> > and if user click page 2 to see, how to show data?
> >
> > p1 start from 10 or query other partitions?
>
> Assemble results 1 through 20, then display 11-20 to the user.


for example, i wanna query "solr"

p1 have 100 results which score is bigger than 80

p2 have 100 results which score is smaller than 20

so if i use rows=10, score not correct.

if i wanna promise 10 pages which sort by score correctly.

so i have to get 100(rows=100) results from every box.

and merge results, sort it, finallay get top 100 results.

but it will very slow.


i don't know other search how to solve it? maybe they not sort by score
very correctly.




-Mike
>
> >
> > 2007/5/15, Mike Klaas <[EMAIL PROTECTED] >:
> >>
> >> On 14-May-07, at 6:49 PM, James liu wrote:
> >>
> >> > 2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
> >> >>
> >> >> On 14-May-07, at 1:35 AM, James liu wrote:
> >> >>
> >> >> When you get up to 60 partitions, you should make it a multi stage
> >> >> process.  Assuming your partitions are disjoint and evenly
> >> >> distributed, estimate the number of documents that will appear in the
> >> >> final result from each.
> >> >
> >> >
> >> > yes, partitions distrbuted.
> >> >
> >> >
> >> > Double or triple that (and put a minimum
> >> >> threshold), try to assemble the number of documents you require, and
> >> >> if one partition "runs out" of docs before it is done, request a new
> >> >> round.
> >> >
> >> >
> >> > i dont' know what u mean "runs out"
> >>
> >> Say you request 5 docs from each of 60 partitions, and are interested
> >> in docs 1-10.  If, sorted by score, the docs come from:
> >>
> >> p1, p2, p1, p1, p3, p4, p1, p1
> >>
> >> Then p1 has "run out" at n=8, and there is no way to be sure if the
> >> remaining two needed docs come from p1 or somewhere else.  So you
> >> have to now request at least two additional documents from p1.
> >>
> >> > one user request will generate 60 partitions request.
> >> >
> >> > they work in parallel。
> >> >
> >> > so i don't know every partion's status before they done.
> >>
> >> Normally, you would wait for them to finish, and execute a subsequent
> >> request if more docs are needed.
> >>
> >> -Mike
> >
> >
> >
> >
> > --
> > regards
> > jl
>
>


--
regards
jl





--
regards
jl


Re: NumberFormat exception when trying to use recip function query

2007-05-14 Thread Mekin Maheshwari

Done.

Please check - http://wiki.apache.org/solr/FunctionQuery

and send me your comments (or improve the wiki )

Right now it's more of an aggregation of all the relevant information.
I hope people will be able to add notes on what values to use, pitfalls to
avoid, and behaviour in special cases as well.

For example, I don't know how these functions deal with missing values.

-mekin


On 5/15/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: The function query field is -
:  
: recip(popularityRank, 1, 1000,
: 1000)^0.5recip(rord(creationDate),1,1000,1000)^
: 0.3
:  

off the top of my head, i'd suggest you:
  1) verify there is some whitespace between the boost of the
popularity recip function and the date recip function
  2) eliminate the space inside the recip functions
  3) verify that there is no space between either recip function and its
boost

...and see if that works...

  
 recip(popularityRank,1,1000,1000)^0.5
 recip(rord(creationDate),1,1000,1000)^0.3
  



-Hoss





--
My company - http://ugenie.com
My Blog - http://mekin.livejournal.com/
My linkedin URL - http://www.linkedin.com/in/mekin