Re: Implementing PhraseQuery and MoreLikeThis Query in one app

2009-07-04 Thread SergeyG

Otis,

Here are the logs: the method calls along with their outputs (sorry for the
bulk data :) ). I compared 3 runs.


1) GetMethod
 a) url=http://localhost:8080/solr/mlt
 b) query=q=id:10&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score

Output:
INFO MLT2SearchRequestProcessor:87 - In method sendGetCommand(): url=http://localhost:8080/solr/mlt; queryString=q=id:10&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score
 INFO MLT2SearchRequestProcessor:76 - 

002.098612S.G.SG_Book0.28923997O. HenryS.G.Four
Million, The0.08667877Katherine
MosbyThe Season of Lillian Dawes0.07947738Jerome K. JeromeThree Men in a
Boat0.047219563Charles
OliverS.G.ABC's of Science1.01.01.01.01.0



2) GetMethod
 a) url=http://localhost:8080/solr/select
 b) query=q=id:10&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score

Output:

INFO MLT2SearchRequestProcessor:87 - In method sendGetCommand(): url=http://localhost:8080/solr/select; queryString=q=id:10&mlt=true&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score
 INFO MLT2SearchRequestProcessor:76 - 

015title author scorecontent_mltid:10truedetails5<
/lst>2.098612S.G.SG_Book0.24578805O.
HenryS.G.Four Million, The0.22171465Jerome K. JeromeThree Men in a
Boat0.22018899Katherine
MosbyThe Season of Lillian Dawes0.098666154
Charles OliverS.G.ABC's of
Scienceid:10id:10id:10id:10
2.098612 = (MATCH) weight(id:10 in 3), product of:
  0.9994 = queryWeight(id:10), product of:
2.0986123 = idf(docFreq=1, numDocs=5)
0.47650534 = queryNorm
  2.0986123 = (MATCH) fieldWeight(id:10 in 3), product of:
1.0 = tf(termFreq(id:10)=1)
2.0986123 = idf(docFreq=1, numDocs=5)
1.0 = fieldNorm(field=id, doc=3)
OldLuceneQParser15.00.00.00.00.00.00.015.00.00.015.00.00.0



3) SolrJ call
 a) url=http://localhost:8080/solr
 b) query=q=id:10&mlt=true&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score

Output:

INFO MLTSearchRequestProcessor:45 - SolrServer url: http://localhost:8080/solr
 INFO MLTSearchRequestProcessor:51 - id = 10
 INFO MLTSearchRequestProcessor:53 - constructedQuery> id:10
 INFO MLTSearchRequestProcessor:63 - solrQuery> q=id%3A10&mlt=true&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score
 INFO MLTSearchRequestProcessor:69 - Number of docs found = 1
 INFO MLTSearchRequestProcessor:73 - title = SG_Book; score = 2.098612


One can see that the results of the 2 runs with GetMethod are almost
identical: the docs found and their weights are the same. The values
themselves are doubtful, though. For example, the response contains the
original doc, although it wasn't supposed to be in the returned list of
"more like this" docs. Also, its weight suggests that id=10 was found in
three other docs, which shouldn't be the case. (Or is it just a rare
coincidence that 10 is among the most important terms of this doc and other
docs happen to contain it? That looks very unlikely. Or am I simply
misinterpreting it?) Plus, the individual weights for "interestingTerms" are
all the same (1.0), which is also questionable.
And the 3rd run (SolrJ call) returned just the original doc (with the same
weight as in the first two calls).

Maybe the problem lurks somewhere in solrconfig.xml? Right now I don't have
the slightest idea where to look for a hint.

Anyway, it's a holiday today. (Hopefully my message doesn't interrupt it. :) )

Have a great 4th of July!

Sergey


Otis Gospodnetic wrote:
> 
> 
> Sergey,
> 
> I think I confused you.  The comment about the fields listed in the "fl"
> parameter has nothing to do with the SolrJ calls not working.
> 
> For SolrJ calls not working my suggestion is to look at the logs and
> compare the GetMethod call with the SolrJ call.  Paste them if you want
> more people to look at them.
> 
> 
> Otis 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> - Original Message 
>> From: SergeyG 
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 3, 2009 4:08:37 AM
>> Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app
>> 
>> 
>> Otis,
>> 
>> Thanks a lot. I'd certainly follow your advice and check the logs. Although,
>> I must say that I've already tried all possible variations of the string for
>> the "fl" parameter (spaces, commas, plus signs). More than that - the query
>> still doesn't want to fetch any docs (other than the one with the id
>> specified in the query) even when the line solrQuery.setParam("fl", "title
>> author score"); is commented out. So I suspect that the problem is that the
>> request with the url
>> "http://localhost:8080/solr/select?q=id:1&mlt=true&mlt.fl=content&..."; due
>> to some reason doesn't work properly. And when I use the GetMethod(url)
>> approach and send url directly in the form
>> "http://localhost:8080/solr/mlt?q=id:1&mlt.fl=content&...";, Solr picks up
>> the mlt component. (At lea

Re: Trie vs long string for sorting

2009-07-04 Thread Chris Hostetter

: My data are library call numbers, normalized to be comparable, resulting in
: (maximum) 21-character strings of the form "RK 052180H359~999~999"
: 
: Now, these are fine -- they work for sorting and ranges and the whole thing,
: but right now I can't use them because I've got two or three for each of my
: 6M documents and on a 32-bit machine I run out of heap.
: 
: Another option would be to turn them into longs (using roughly 56 bits of
: the 64 bit space) and use a trie type. Is there any sort of a win involved
: there?

I don't think Trie fields can be used for sorting (because they result in 
multiple terms per doc) but I could be wrong about that; smarter people 
than me may have done something cool with the TrieField that I'm not aware 
of.

As a general rule: if you have character data that fits a rigid enough set 
of constraints that you can encode any legal value into a single 
numeric value (either int or long) such that the values still sort properly, 
sorting on those encoded values is going to be more memory efficient than 
(and probably just as fast as) sorting on the string values.
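
A minimal sketch of the encoding idea above, under assumptions that are not
from the thread: a hypothetical 37-symbol alphabet (space, digits, uppercase
letters) and a 12-character limit so that the result fits in a signed long.
It does not reproduce the 21-character call-number format, which would need
field-specific bit widths. The class and method names are illustrative only.

```java
import java.util.Arrays;

/** Packs short strings over a small fixed alphabet into longs that sort the same way. */
final class SortableKeyCodec {
    // Hypothetical alphabet, listed in ascending char order so that a symbol's
    // rank agrees with String.compareTo ordering.
    private static final String ALPHABET = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int BASE = ALPHABET.length() + 1; // rank 0 is reserved for padding
    private static final int MAX_LEN = 12;                 // 38^12 is below 2^63

    static long encode(String s) {
        if (s.length() > MAX_LEN) {
            throw new IllegalArgumentException("too long: " + s);
        }
        long key = 0;
        for (int i = 0; i < MAX_LEN; i++) {
            int rank = 0; // padding sorts before every real symbol
            if (i < s.length()) {
                int idx = ALPHABET.indexOf(s.charAt(i));
                if (idx < 0) {
                    throw new IllegalArgumentException("illegal char in: " + s);
                }
                rank = idx + 1;
            }
            key = key * BASE + rank; // most significant position first
        }
        return key;
    }

    public static void main(String[] args) {
        String[] values = {"RK 0521", "QA 76", "RK 05", "RK 0522"};
        long[] keys = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encode(values[i]);
        }
        Arrays.sort(keys); // numeric order now matches string order
        System.out.println(Arrays.toString(keys));
    }
}
```

Reserving rank 0 for padding is the design choice that keeps a prefix such as
"RK 05" ahead of any longer value that starts with it.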


-Hoss



Re: Delete, Commit, Add Interaction

2009-07-04 Thread Chris Hostetter

: 
:collection:foo
: 
: 
: 
:
:   .
: 
: 

...

: Finally, here's the behavior we're seeing.  In some cases, usually when
: the index is starting to get larger (approaching 500,000 documents),
: the above procedure will fail to add anything to the index.  That is, none
: of the commands return an error code, there is no indication of a problem
: in the log files and the process DOES take some amount of time to

That really shouldn't happen.  if you were using embedded solr, or some 
crazy UpdateProcessor, i can imagine encountering a code path 
where your adds got processed before your delete -- but not if you are 
using HTTP to send XML like that each in a separate HTTP Connection as you 
describe.

: If this is happening, how can I know when the delete has been processed
: before initiating the add process?

When the <commit/> command after the delete returns a 200 status code, the 
delete is done.  *DONE* Done, completely done, over and done, nothing funky 
going on under the covers done.
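
A minimal sketch of that sequencing from a plain Java client: each step is
sent in its own HTTP connection, and the next step starts only after the
previous one has returned 200. The /solr/update path and the collection:foo
query come from this thread; the host and port, the "id" field, the sample
document, and all class and method names are assumptions for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class DeleteThenAdd {
    // Assumed location of the update handler; adjust host and port as needed.
    static final String UPDATE_URL = "http://localhost:8983/solr/update";

    /** POSTs one XML update message and returns the HTTP status code. */
    static int post(String updateUrl, String xml) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(xml.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode(); // blocks until the server has answered
        } finally {
            conn.disconnect();
        }
    }

    static String deleteByQueryXml(String query) {
        return "<delete><query>" + escape(query) + "</query></delete>";
    }

    /** Escapes the characters that are special inside XML element text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static void require200(int status, String step) {
        if (status != 200) {
            throw new IllegalStateException(step + " failed with HTTP " + status);
        }
    }

    public static void main(String[] args) throws IOException {
        require200(post(UPDATE_URL, deleteByQueryXml("collection:foo")), "delete");
        require200(post(UPDATE_URL, "<commit/>"), "commit after delete");
        // Hypothetical document; a real one would carry the schema's own fields.
        require200(post(UPDATE_URL, "<add><doc><field name=\"id\">1</field></doc></add>"), "add");
        require200(post(UPDATE_URL, "<commit/>"), "final commit");
    }
}
```

Failing fast on any non-200 status makes it obvious which step broke, instead
of leaving a silent gap between the delete and the adds.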

Can you post some of your log messages from one of these problematic 
instances?  I'm particularly interested in the INFO level messages from 
LogUpdateProcessor.finish and SolrCore.execute that say things like...

Jul 4, 2009 12:38:43 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 0
Jul 4, 2009 12:38:43 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=0 

...that was a delete (not sure why the msg from LogUpdateProcessor is 
empty) then something like this from the commit...

Jul 4, 2009 12:39:55 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Jul 4, 2009 12:39:55 PM org.apache.solr.search.SolrIndexSearcher 
INFO: Opening searc...@15ccfb1 main
   < ... snip a bunch of logging about autowarming various caches ... >
Jul 4, 2009 12:39:55 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {commit=} 0 50
Jul 4, 2009 12:39:55 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=50 

...and then a bunch of adds...

Jul 4, 2009 12:41:37 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[SP2514N, 6H500F0]} 0 24
Jul 4, 2009 12:41:37 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=24 
Jul 4, 2009 12:41:37 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[F8V7067-APL-KIT, IW-02]} 0 9

...which should be followed by another commit getting logged.

These log messages are all from the example running in jetty, your log 
format may vary.  What I'm particularly interested in is the timestamps on 
these log messages, so if you can turn on millisecond time resolution that 
would be best ... I want to see when exactly the delete/commit/add/commit 
commands are getting executed.
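
One possible way to get millisecond timestamps, assuming the server logs
through java.util.logging (which the format of the sample messages above
suggests), is a small custom formatter. The class name MillisFormatter and the
output layout are hypothetical; this is a sketch, not a Solr-provided class.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.logging.ConsoleHandler;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Formats each log record on one line with a millisecond-resolution timestamp. */
final class MillisFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is used per call.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS", Locale.ROOT);
        return fmt.format(new Date(record.getMillis()))
                + " " + record.getLevel()
                + " " + record.getSourceClassName()
                + " " + record.getSourceMethodName()
                + ": " + formatMessage(record)
                + System.lineSeparator();
    }

    public static void main(String[] args) {
        // Programmatic installation on a demo logger.
        Logger log = Logger.getLogger("demo");
        ConsoleHandler handler = new ConsoleHandler();
        handler.setFormatter(new MillisFormatter());
        log.setUseParentHandlers(false);
        log.addHandler(handler);
        log.info("commit finished");
    }
}
```

To reference such a formatter from a logging.properties file instead, the
class would likely need to be public and available on the server's classpath.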




-Hoss



Re: Delete, Commit, Add Interaction

2009-07-04 Thread Chris Hostetter

: Jul 4, 2009 12:38:43 PM org.apache.solr.update.processor.LogUpdateProcessor finish
: INFO: {} 0 0
: Jul 4, 2009 12:38:43 PM org.apache.solr.core.SolrCore execute
: INFO: [] webapp=/solr path=/update params={} status=0 QTime=0 
: 
: ...that was a delete (not sure why the msg from LogUpdateProcessor is 
: empty) then something like this from the commit...

...it was fat finger user error on my part, the log msg from delete should 
look like...

Jul 4, 2009 12:46:30 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=name:foo} 0 12
Jul 4, 2009 12:46:30 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=12 



-Hoss



Re: Trie vs long string for sorting

2009-07-04 Thread Mark Miller
Trie has a custom parser that can load the FieldCache for sorting. It's
basically a built-in type now that supports FieldCache, sorting, stored
fields, etc.

On Sat, Jul 4, 2009 at 3:27 PM, Chris Hostetter wrote:

>
> : My data are library call numbers, normalized to be comparable, resulting in
> : (maximum) 21-character strings of the form "RK 052180H359~999~999"
> :
> : Now, these are fine -- they work for sorting and ranges and the whole thing,
> : but right now I can't use them because I've got two or three for each of my
> : 6M documents and on a 32-bit machine I run out of heap.
> :
> : Another option would be to turn them into longs (using roughly 56 bits of
> : the 64 bit space) and use a trie type. Is there any sort of a win involved
> : there?
>
> I don't think Trie fields can be used for sorting (because they result in
> multiple terms per doc) but I could be wrong about that; smarter people
> than me may have done something cool with the TrieField that I'm not aware
> of.
>
> As a general rule: if you have character data that fits a rigid enough set
> of constraints that you can encode any legal value into a single
> numeric value (either int or long) such that the values still sort properly,
> sorting on those encoded values is going to be more memory efficient than
> (and probably just as fast as) sorting on the string values.
>
>
> -Hoss
>
>


-- 
-- 
- Mark

http://www.lucidimagination.com


Problem in parsing non-string dynamic field by using IndexReader

2009-07-04 Thread Yuchen Wang
I have a task to parse all documents in a Solr index. I use the Lucene
IndexReader to read the index and go through each field of every document.
However, for float or int dynamic fields, the stringValue() call always
returns some special characters. I tried tokenStreamValue, byteValue, and
readerValue, and they all return null.
The following is my method to parse the Solr index. My question is, how can I
get the values from non-string dynamic fields properly?

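import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;

// The class name is illustrative; the original post showed only the method.
public class DumpIndex {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/my/index/directory");
        try {
            System.out.println("Total documents: " + reader.numDocs());

            // Document numbers run up to maxDoc() and may include deleted docs.
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) {
                    continue;
                }
                Document d = reader.document(i);

                // getFields() returns a raw List in older Lucene releases.
                @SuppressWarnings("unchecked")
                List<Fieldable> fields = d.getFields();

                for (Fieldable f : fields) {
                    // stringValue() is the raw stored string, printed as-is.
                    System.out.println("get field / value: ["
                            + f.name() + "=" + f.stringValue() + "]");
                }
            }
        } finally {
            reader.close();
        }
    }
}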