RE: How to read values of a field efficiently

2007-08-20 Thread Martin Grotzke
On Sun, 2007-08-19 at 21:39 +0200, Ard Schrijvers wrote:
> > On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote:
> > > : Is it possible to get the values from the ValueSource (or from
> > > : getFieldCacheCounts) sorted by its natural order (from lowest to
> > > : highest values)?
> > > 
> > > well, an inverted term index is already a data structure 
> > > listing terms
> > > from lowest to highest and the associated documents -- so 
> > > if you want to
> > > iterate from low to high between a range and find matching 
> > > docs you should
> > > just use the TermEnum
> > Ok. Unfortunately I don't see how I can get a TermEnum for a specific
> > field (e.g. "price")... I tried
> > 
> > TermEnum te = searcher.getReader().terms(new Term(field, ""));
> > 
> > but this also returns terms for several other fields.
> 
> correct, see 
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms()
> 
> > Is it possible at all to get a TermEnum for a specific field?
> 
> AFAIK not directly. Normally, I use something like:
> 
> TermEnum terms = searcher.getReader().terms(new Term(field, ""));
>   while (terms.term() != null && terms.term().field() == field){
>   //do things   
>   terms.next();
>   }
Now I implemented it like this:

String startTerm = prefix==null ? "" : ft.toInternal(prefix);
TermEnum te = searcher.getReader().terms(new Term(field, startTerm));
final int[] prices = new int[docs.size()];
int i = 0;
int skipped = 0;
while( te.next() ) {
    final Term term = te.term();
    if ( term == null || !term.field().equals( field ) ) {
        skipped++;
        continue;
    }
    final String termText = term.text();
    int count = searcher.numDocs(new TermQuery(term), docs);
    int value = (int) NumberUtils.SortableStr2float(termText);
    for( int j = 0; j < count; j++ ) {
        prices[i++] = value;
    }
}

Unfortunately this takes ~1.5 sec in my case (~2M docs, the result docset contains
1900280 results, 1026683 terms skipped because they didn't match the field).

Is there anything which could be optimized / which is wrong?

If not I would have to check which other way I could go (based on ValueSource
or what else)...

Thanx && cheers,
Martin


> 
> > 
> > Then if I had this TermEnum, how can I check if a Term is in my
> > DocSet? In other words, I would like to read Terms for a specific
> > field from my DocSet - so that I could determine all price terms
> > for my DocSet.
> 
> Is your DocSet some sort of filter? if so, in your while loop you can fill a 
> new Filter, like
> 
> BitSet docFilter = new BitSet(reader.maxDoc());
> 
> and in the while loop:
> 
>   docs.seek(terms);
>   while (docs.next()) {
>  docFilter.set(docs.doc());
>   }
> 
> If your DocSet is not a BitSet you might be able to construct one for it,
> 
> Regards Ard
> 
> > 
> > Is there a way to achieve this?
> > 
> > Thanx in advance,
> > cheers,
> > Martin
> > 
> > 
> > >  -- the whole point of the FieldCache (and
> > > FieldCacheSource) is to have a "reverse inverted index" so 
> > you can quickly
> > > fetch the indexed value if you know the docId.
> > > 
> > > perhaps you should elaborate a little more on what it is 
> > you are trying to
> > > do so we can help you figure out how to do it more 
> > efficiently ... i know
> > > you mentioned computing price ranges in your first message 
> > ... but you
> > > also didn't post any clear code about that part of your 
> > problem, just that
> > > the *other* part of your code that iterated over every doc 
> > was too slow
> > > ... perhaps you shouldn't be iterating over every doc to 
> > figure out your
> > > ranges .. perhaps you can iterate over the terms themselves?
> > > 
> > > 
> > > hang on ... rereading your first message i just noticed something i
> > > definitely didn't spot before...
> > > 
> > > >> Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
> > > >> for the second request, while reading prices takes ~600 ms.
> > > 
> > > ...i clearly missed this, and fixated on your assertion 
> > that your reading
> > > of field values took longer than the stock methods -- but 
> > you're not just
> > > comparing the time needed by different methods, you're also timing
> > > different fields.
> > > 
> > > this actually makes a lot of sense since there are probably 
> > a lot fewer
> > > unique values for the cat field, so there are a lot fewer 
> > discrete values
> > > to deal with when computing counts.
> > > 
> > > 
> > > 
> > > 
> > > -Hoss
> > > 
> > -- 
> > Martin Grotzke
> > http://www.javakaffee.de/blog/
> > 
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/




RE: UTF-8 encoding problem on one of two Solr setups

2007-08-20 Thread Mario Knezovic
> You might want to check out this page
> http://wiki.apache.org/solr/SolrTomcat
> 
> Tomcat needs a small config change out 
> of the box to properly support UTF-8. 

This exactly solved the problem.
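
For anyone else hitting this: the change described there amounts to adding
URIEncoding="UTF-8" to the HTTP connector in Tomcat's server.xml (a sketch -
the port and other connector attributes depend on your setup):

   <Connector port="8080" URIEncoding="UTF-8" />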

Thanks a lot!

Mario



Indexing large documents

2007-08-20 Thread Fouad Mardini
Hello,

I am using solr to index text extracted from word documents, and it is
working really well.
Recently I started noticing that some documents are not indexed, that is I
know that the word foobar is in a document, but when I search for foobar the
id of that document is not returned.
I suspect that this has to do with the size of the document, and that
documents with a lot of text are not being indexed.
Please advise.

thanks,
fmardini


RE: Indexing large documents

2007-08-20 Thread praveen jain
Hi 
 I want to know how to update my .xml file which has fields other than
the default ones, so which file do I have to modify, and how?

pRAVEEN jAIN
+919890599250

-Original Message-
From: Fouad Mardini [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 20, 2007 4:00 PM
To: solr-user@lucene.apache.org
Subject: Indexing large documents

Hello,

I am using solr to index text extracted from word documents, and it is
working really well.
Recently i started noticing that some documents are not indexed, that is
i
know that the word foobar is in a document, but when i search for foobar
the
id of that document is not returned.
I suspect that this has to do with the size of the document, and that
documents with a lot of text are not being indexed.
Please advise.

thanks,
fmardini



Re: Indexing large documents

2007-08-20 Thread Peter Manis
Fouad,

I would check the error log or console for any possible errors first.
They may not show up, it really depends on how you are processing the
word document (custom solr, feeding the text to it, etc).  We are
using a custom version of solr with PDF, DOC, XLS, etc text extraction
and I have successfully indexed 40mb documents.  I did have indexing
problems with a large document or two and simply increasing the heap
size fixed the problem.

 - Pete

On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I am using solr to index text extracted from word documents, and it is
> working really well.
> Recently i started noticing that some documents are not indexed, that is i
> know that the word foobar is in a document, but when i search for foobar the
> id of that document is not returned.
> I suspect that this has to do with the size of the document, and that
> documents with a lot of text are not being indexed.
> Please advise.
>
> thanks,
> fmardini
>


Re: Indexing large documents

2007-08-20 Thread Fouad Mardini
Well, I am using the java textmining library to extract text from documents,
then I do a post to solr.
I do not have an error log, I only have *.request.log files in the logs
directory

Thanks

On 8/20/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> Fouad,
>
> I would check the error log or console for any possible errors first.
> They may not show up, it really depends on how you are processing the
> word document (custom solr, feeding the text to it, etc).  We are
> using a custom version of solr with PDF, DOC, XLS, etc text extraction
> and I have successfully indexed 40mb documents.  I did have indexing
> problems with a large document or two and simply increasing the heap
> size fixed the problem.
>
> - Pete
>
> On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I am using solr to index text extracted from word documents, and it is
> > working really well.
> > Recently i started noticing that some documents are not indexed, that is
> i
> > know that the word foobar is in a document, but when i search for foobar
> the
> > id of that document is not returned.
> > I suspect that this has to do with the size of the document, and that
> > documents with a lot of text are not being indexed.
> > Please advise.
> >
> > thanks,
> > fmardini
> >
>


Re: Indexing large documents

2007-08-20 Thread Peter Manis
That should show some errors if something goes wrong; if not, the
console usually will.  The errors will look like a java stacktrace
output.  Did increasing the heap do anything for you?  Changing mine
to 256mb max worked fine for all of our files.
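
For reference, with the example Jetty distribution that is just a larger -Xmx
on the startup command (a sketch, assuming the example setup; under Tomcat the
same flag goes into JAVA_OPTS / CATALINA_OPTS instead):

   java -Xmx256m -jar start.jar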

On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> Well, I am using the java textmining library to extract text from documents,
> then i do a post to solr
> I do not have an error log, i only have *.request.log files in the logs
> directory
>
> Thanks
>
> On 8/20/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > Fouad,
> >
> > I would check the error log or console for any possible errors first.
> > They may not show up, it really depends on how you are processing the
> > word document (custom solr, feeding the text to it, etc).  We are
> > using a custom version of solr with PDF, DOC, XLS, etc text extraction
> > and I have successfully indexed 40mb documents.  I did have indexing
> > problems with a large document or two and simply increasing the heap
> > size fixed the problem.
> >
> > - Pete
> >
> > On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > I am using solr to index text extracted from word documents, and it is
> > > working really well.
> > > Recently i started noticing that some documents are not indexed, that is
> > i
> > > know that the word foobar is in a document, but when i search for foobar
> > the
> > > id of that document is not returned.
> > > I suspect that this has to do with the size of the document, and that
> > > documents with a lot of text are not being indexed.
> > > Please advise.
> > >
> > > thanks,
> > > fmardini
> > >
> >
>


Re: Indexing large documents

2007-08-20 Thread Pieter Berkel
You will probably need to increase the value of maxFieldLength in your
solrconfig.xml.  The default value is 10000 which might explain why your
documents are not being completely indexed.
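
A sketch of the setting in question (it lives in the <indexDefaults> section of
the example solrconfig.xml; the value below effectively removes the cap,
assuming you really do want everything indexed):

   <maxFieldLength>2147483647</maxFieldLength>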

Piete


On 20/08/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> That should show some errors if something goes wrong; if not, the
> console usually will.  The errors will look like a java stacktrace
> output.  Did increasing the heap do anything for you?  Changing mine
> to 256mb max worked fine for all of our files.
>
> On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > Well, I am using the java textmining library to extract text from
> documents,
> > then i do a post to solr
> > I do not have an error log, i only have *.request.log files in the logs
> > directory
> >
> > Thanks
> >
> > On 8/20/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > Fouad,
> > >
> > > I would check the error log or console for any possible errors first.
> > > They may not show up, it really depends on how you are processing the
> > > word document (custom solr, feeding the text to it, etc).  We are
> > > using a custom version of solr with PDF, DOC, XLS, etc text extraction
> > > and I have successfully indexed 40mb documents.  I did have indexing
> > > problems with a large document or two and simply increasing the heap
> > > size fixed the problem.
> > >
> > > - Pete
> > >
> > > On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > > > Hello,
> > > >
> > > > I am using solr to index text extracted from word documents, and it
> is
> > > > working really well.
> > > > Recently i started noticing that some documents are not indexed,
> that is
> > > i
> > > > know that the word foobar is in a document, but when i search for
> foobar
> > > the
> > > > id of that document is not returned.
> > > > I suspect that this has to do with the size of the document, and
> that
> > > > documents with a lot of text are not being indexed.
> > > > Please advise.
> > > >
> > > > thanks,
> > > > fmardini
> > > >
> > >
> >
>


Re: how to retrieve all the documents in an index?

2007-08-20 Thread Erik Hatcher

Yes - they come back in the order indexed.

Erik

On Aug 19, 2007, at 7:20 PM, Yu-Hui Jin wrote:

BTW, Hoss, is there a default order for the documents returned by running
this query?


thanks,

-Hui

On 8/16/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: Any of you know whether the new "*:*" query performs better than the
: get-around solutions like using a range query?  I would guess so, but I
: haven't looked into the Lucene implementation.

it's faster -- it has almost no work to do relative to the range query
version.



-Hoss





--
Regards,

-Hui




Re: Indexing large documents

2007-08-20 Thread Fouad Mardini
thanks, I reindexed the documents and now it works; there was an issue with
text extraction it seems.
I also changed the maxFieldLength and it must have helped

thanks

On 8/20/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
>
> You will probably need to increase the value of maxFieldLength in your
> solrconfig.xml.  The default value is 10000 which might explain why your
> documents are not being completely indexed.
>
> Piete
>
>
> On 20/08/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > That should show some errors if something goes wrong; if not, the
> > console usually will.  The errors will look like a java stacktrace
> > output.  Did increasing the heap do anything for you?  Changing mine
> > to 256mb max worked fine for all of our files.
> >
> > On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > > Well, I am using the java textmining library to extract text from
> > documents,
> > > then i do a post to solr
> > > I do not have an error log, i only have *.request.log files in the
> logs
> > > directory
> > >
> > > Thanks
> > >
> > > On 8/20/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Fouad,
> > > >
> > > > I would check the error log or console for any possible errors
> first.
> > > > They may not show up, it really depends on how you are processing
> the
> > > > word document (custom solr, feeding the text to it, etc).  We are
> > > > using a custom version of solr with PDF, DOC, XLS, etc text
> extraction
> > > > and I have successfully indexed 40mb documents.  I did have indexing
> > > > problems with a large document or two and simply increasing the heap
> > > > size fixed the problem.
> > > >
> > > > - Pete
> > > >
> > > > On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > > > > Hello,
> > > > >
> > > > > I am using solr to index text extracted from word documents, and
> it
> > is
> > > > > working really well.
> > > > > Recently i started noticing that some documents are not indexed,
> > that is
> > > > i
> > > > > know that the word foobar is in a document, but when i search for
> > foobar
> > > > the
> > > > > id of that document is not returned.
> > > > > I suspect that this has to do with the size of the document, and
> > that
> > > > > documents with a lot of text are not being indexed.
> > > > > Please advise.
> > > > >
> > > > > thanks,
> > > > > fmardini
> > > > >
> > > >
> > >
> >
>


RE: Structured Lucene documents

2007-08-20 Thread Pierre-Yves LANDRON
Hello !



At last, I've had the opportunity to test your solution, Pieter, which was to
use dynamic fields:


> 
> 
> Store each page in a separate field (e.g. page1, page2, page3 .. pageN) then
> at query time, use the highlighting parameters to highlight matches in the
> page fields. You should be able to determine the page field that matched the
> query by observing the highlighted results (I'm not certain if the
> hl.flparameter accepts dynamic field names, you may need to specify
> them all
> manually):
> 
> hl=true&hl.fl=page1,page2,page3,pageN&hl.requireFieldMatch=true



As expected, when using the option requireFieldMatch=true, it does not work...

But when the option is set to false, it seems to work fine, and on first
thought I don't need it...
As you say, I need to specify each field when querying the index... It's
a big letdown in my case: it's a shame, because your solution nearly
answers my problem.

It seems the highlight fields must be specified, and that I can't use the *
wildcard to do so.

Am I right? Is there a way around this limitation?



Anyway, thank you very much!

Kind Regards,

Pierre-Yves Landron








> Date: Mon, 13 Aug 2007 21:57:42 +1000
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Subject: Re: Structured Lucene documents
> 
> On 13/08/07, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote:
> >
> > Hello! Thanks Pieter, that seems a good idea - if not an ideal one - even
> > if it is sort of a hack. I will try it as soon as possible and keep you
> > informed. The hl.fl parameter doesn't have to be initialized, I think, so
> > it won't be a problem. On the other hand, I will have the exact same
> > problem to specify the (dynamic) field on which the request is performed... I
> > need to be able to execute the request on the full text of the page only:
> > must I specify all of the - highly variable - names of each page field in my
> > query? I think that structured index documents could be of great value to
> > complex document indexation. Is there a way that someday Solr will include
> > such a possibility, or is it basically impossible (due to the way Lucene works
> > for example)? Kind Regards, Pierre-Yves Landron
> 
> 
> Hi Pierre-Yves,
> 
> Maybe you could use dynamic field copy in your schema.xml to index content
> from all pages stored in your document into a separate field, something like:
>
>    <copyField source="page*" dest="all_pages"/>
>
> and then you would only need to query on the "all_pages" field.  Not quite
> sure how this might be affected by the hl.requireFieldMatch=true parameter
> but it's worth a try.
> 
> cheers,
> Piete


problem with quering solr after indexing UTF-8 encoded CSV files

2007-08-20 Thread Ben Shlomo, Yatir
Hi!

 

I have utf-8 encoded data inside a csv file (actually it's a tab separated file
- attached)

I can index it with no apparent errors

I did not forget to set this in my tomcat configuration:

   [Tomcat connector configuration stripped by the mail archive]

When I query a document using the UTF-8 text I get zero matches:

   [Solr XML response stripped by the mail archive - responseHeader with status 0
   and QTime 0, params indent=on, start=0, q=יתיר (note that I can see the correct
   UTF-8 text in it - hebrew characters), rows=10, version=2.2, and an empty
   result list]

When I observe this text in the response by querying for *:*

I notice that the text does not appear as desired: יתיר instead of יתיר

Do you have any ideas?

Thanks...

 

Here is the response:

   [Solr XML response stripped by the mail archive - the q=*:* result contains one
   document whose stored values include 1, "desc is a very good camera",
   "display is יתיר ABC res123" (the garbled text mentioned above), ABC,
   res123, C123, 123456 and 72900010123; the field names themselves were stripped]

 

yatir



Enquiry on Search Results counting

2007-08-20 Thread Jeffrey Tiong
Hi,

I am trying to do some counting on certain fields of the search results.
Currently I am using PHP to do the counting, but it is impossible to do this
when the result sets reach a few hundred thousand. Does anyone here have
any idea on how to do this?

Example of scenario,

   1. The solr schema indexes fields such as [field names stripped by the archive]
   2. User inputs a certain query, there are 100,000 results returned
   3. Out of those 100,000 results, we want to show the users what are
   the top 5 most frequent values of those fields
   4. for example, when users search for hard drive, we show the users the
   top 5 manufacturer names: seagate, samsung, ibm etc

Thanks a lot!

Regards,
Jeffrey


RE: problem with quering solr after indexing UTF-8 encoded CSV files

2007-08-20 Thread Lance Norskog
If you are running on Windows, Java does not default to UTF-8. There is a java
property that changes the default to UTF-8. Unfortunately, not all libraries get this
information, and some of the String converters don't have a character-encoding
argument. I learned this the hard way.
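
As a minimal sketch of what that means in practice (not from the original mail;
the file name is made up), name the charset explicitly instead of relying on the
platform default:

   import java.io.*;

   public class Utf8Dump {
       public static void main(String[] args) throws IOException {
           // Read the CSV as UTF-8 explicitly; a plain FileReader would silently
           // use the platform default encoding (Cp1252 on most Windows setups).
           BufferedReader in = new BufferedReader(
               new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"));
           for (String line = in.readLine(); line != null; line = in.readLine()) {
               System.out.println(line);
           }
           in.close();
       }
   }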

From: Ben Shlomo, Yatir [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 20, 2007 8:40 AM
To: solr-user@lucene.apache.org
Subject: problem with quering solr after indexing UTF-8 encoded CSV files

[original message quoted in full - see the post above]



RE: solr + carrot2

2007-08-20 Thread Lance Norskog
Exactly! The Lucene version requires direct access to the file. Our indexes
are on servers which do not have graphics (VNC) configured.

A generic Solr access UI would be great.

Lance 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stanislaw
Osinski
Sent: Saturday, August 18, 2007 2:23 AM
To: solr-user@lucene.apache.org
Subject: Re: solr + carrot2

Hi Lance,

The Lucene interface is cool, but not many people put their indexes on
> machines with Swing access.
>
> I just did a Solr integration by copying the eTools.ch implementation.
> This
> took several edits. As long as we're making requests, please do a 
> > general-purpose implementation by cloning the Lucene implementation.


I'm not sure if I'm getting you right here... By "implementation" do you
mean adding to the Swing application an option for pulling data from Solr
(with a configuration dialog for Solr URL etc.)?

Thanks,

Stanislaw



RE: Solr 1.1. vs. 1.2.

2007-08-20 Thread Lance Norskog
While we're on the topic, there appear to be a ton of new features in 1.3,
and they are getting debugged. When do you plan to do an official 1.3
release? 

-Original Message-
From: Yu-Hui Jin [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 17, 2007 11:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 1.1. vs. 1.2.

Thanks, Hoss! I would recommend going with 1.2.


regards,

-Hui

On 8/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : I wonder what are the production experience of Solr 1.2 vs. 1.1.  We 
> are
> : thinking of using 1.2 as opposed to 1.1 to support a mission 
> critical
> : application, but am not sure whether 1.2. is stable enough ( afraid 
> of
>
> it's plenty stable.  if you (or someone else) were already using Solr 
> 1.1 and were happy with it, then i wouldn't argue loudly that you 
> *need* to upgrade, but i would not recommend anyone start with 1.1 at this
point.
>
>
>
> -Hoss
>
>


--
Regards,

-Hui



Re: Enquiry on Search Results counting

2007-08-20 Thread Erik Hatcher
Have a look at using Solr's faceting to give you counts back on
specific fields.  Details here:

   [link stripped by the mail archive]

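As a sketch (the field and host names here are made up - substitute your own),
a single request like the following returns the top 5 values for a field
without fetching any documents:

   http://localhost:8983/solr/select?q=hard+drive&rows=0&facet=true&facet.limit=5&facet.field=manufacturer

Add one facet.field parameter per field you want counts for.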

On Aug 20, 2007, at 12:35 PM, Jeffrey Tiong wrote:


Hi,

I am trying to do some counting on certain fields of the search  
results,
currently I am using PHP to do the counting, but it is impossible  
to do this
when the results sets reach a few hundred thousands. Does anyone  
here has

any idea on how to do this?

Example of scenario,

   1. The solr schema indexes fields such as [field names stripped by the archive]
   2. User inputs certain query, there are 100,000 results returned
   3. Out from that 100,000 results, we want to show the users what are
   the top 5 most frequent values of those fields
   4. for example, when users search for hard drive, we show the users
   top 5 manufacturer names are seagate, samsung, ibm etc

Thanks a lot!

Regards,
Jeffrey




RE: How to read values of a field efficiently

2007-08-20 Thread Chris Hostetter
: > TermEnum terms = searcher.getReader().terms(new Term(field, ""));
: > while (terms.term() != null && terms.term().field() == field){
: > //do things
: > terms.next();
: > }

: while( te.next() ) {
: final Term term = te.term();


you're missing the key piece that Ard alluded to ... there is one
ordered list of all terms stored in the index ... a TermEnum lets you
iterate over this ordered list, and the IndexReader.terms(Term) method
lets you efficiently start at an arbitrary term.  if you are only
interested in terms for a specific field, once your TermEnum returns a
different field, you can stop -- you will never get any more terms for
the field you care about (hence Ard's terms.term().field() == field in his
loop conditional)
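
A minimal sketch of that pattern (not from the original mail; it assumes the
same Lucene TermEnum API used above, and uses a do/while so the first term
returned by terms(Term) is not skipped):

   TermEnum te = searcher.getReader().terms(new Term(field, ""));
   try {
       do {
           Term t = te.term();
           // terms are ordered by field name first, then term text, so once
           // the field changes there is nothing left to read for this field
           if (t == null || !t.field().equals(field)) {
               break;
           }
           // do things with t.text() / searcher.numDocs(...) here
       } while (te.next());
   } finally {
       te.close();
   }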


-Hoss



Re: Custom Sorting

2007-08-20 Thread Chris Hostetter

: Sort sort = new Sort(new SortField[]
: { SortField.FIELD_SCORE, new SortField(customValue, SortField.FLOAT,
: true) });
: indexSearcher.search(q, sort)

that appears to just be a sort on score with a secondary reversed
float sort on whatever field name is in the variable "customValue" ...
assuming the field name is "FIELD" that's the same thing as...
   sort=score+asc,+FIELD+desc

: Sort sort = new Sort(new SortField(customValue, customComparator))
: indexSearcher.search(q, sort)

this is using a custom SortComparatorSource -- code you (or someone else)
has written which is not part of Lucene and which tells lucene how to
order the documents using whatever crazy logic it wants ... for obvious
reasons Solr can't do that same logic (since it doesn't know what it is)

although many things in Solr are easily customizable, just by writing a
little factory and configuring it by class name, i'm afraid
SortComparatorSources aren't one of them.  You could write a custom
RequestHandler which used your SortComparatorSource, or you could write a
custom FieldType that used it anytime someone sorted on that field ...
but those are the best options i can think of.



-Hoss



Re: solr + carrot2

2007-08-20 Thread Mike Klaas

On 20-Aug-07, at 11:24 AM, Lance Norskog wrote:

Exactly! The Lucene version requires direct access to the file. Our  
indexes

are on servers which do not have graphics (VNC) configured.

A generic Solr access UI would be great.


A generic Solr access UI?  Is this different from the existing admin  
ui that ships with Solr?


-Mike


Re: index size

2007-08-20 Thread Mike Klaas


On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote:


Are there any tips on reducing the index size or what factors most
impact index size?

My index has 2.7 million documents and is 200 gigabytes and growing.
Most documents are around 2-3kb and there are about 30 indexed fields.


An "ls -sh" will tell you roughly where the space is being  
occupied.  There is something strange going on: 2.5kB * 2.7m is only  
6GB, and I have trouble imagining where the 30-fold index size  
expansion is coming from.


-Mike


RE: solr + carrot2

2007-08-20 Thread Lance Norskog
No, this is about the Carrot2 clustering tool, specifically the Swing
application.
To make this app use a Solr service you have to code a custom searcher for
your Solr.
I'm requesting a generic UI for Carrot2 that works against any Solr. 

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 20, 2007 12:03 PM
To: solr-user@lucene.apache.org
Subject: Re: solr + carrot2

On 20-Aug-07, at 11:24 AM, Lance Norskog wrote:

> Exactly! The Lucene version requires direct access to the file. Our 
> indexes are on servers which do not have graphics (VNC) configured.
>
> A generic Solr access UI would be great.

A generic Solr access UI?  Is this different from the existing admin ui that
ships with Solr?

-Mike



clear index

2007-08-20 Thread Sundling, Paul
what is the best approach to clearing an index?
 
The use case is that I'm doing some performance testing with various
index sizes.  In between indexing (embedded and soon HTTP/XML) I need to
clear the index so I have a fresh start.
 
What's the best approach, close the index and delete the files?  Hack
together some query that will match all documents and delete by query?
Looking at the Lucene API it looks like they have the same functionality
that is exposed already (delete by id or query).
 
Paul Sundling
 


Re: clear index

2007-08-20 Thread Pieter Berkel
If you are using solr 1.2 the following command (followed by a commit /
optimize) should do the trick:

<delete><query>*:*</query></delete>
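
Posted over HTTP that looks something like this (the URL assumes the example
setup; the <commit/> afterwards makes the deletes visible):

   curl http://localhost:8983/solr/update --data-binary '<delete><query>*:*</query></delete>' -H 'Content-type: text/xml; charset=utf-8'
   curl http://localhost:8983/solr/update --data-binary '<commit/>' -H 'Content-type: text/xml; charset=utf-8'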

cheers,
Piete


On 21/08/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
>
> what is the best approach to clearing an index?
>
> The use case is that I'm doing some performance testing with various
> index sizes.  In between indexing (embedded and soon HTTP/XML) I need to
> clear the index so I have a fresh start.
>
> What's the best approach, close the index and delete the files?  Hack
> together some query that will match all documents and delete by query?
> Looking at the Lucene API it looks like they have the same functionality
> that is exposed already (delete by id or query).
>
> Paul Sundling
>
>


Commit performance

2007-08-20 Thread Lance Norskog
How long should a commit take? I've got about 9.8G of data for 9M of
records. (Yes, I'm indexing too much data.) My commits are taking 20-30
seconds. Since other people set the autocommit to 1 second, I'm guessing we
have a major mistake somewhere in our configurations.
 
We have a lot of deletes/re-adds to commit. Any hints?
 
Thanks,
 
Lance


Re: clear index

2007-08-20 Thread Charles Hornberger
IIRC you can also simply stop the servlet container, delete the
contents of the data directory by hand, then restart the container.

-Charlie

On 8/20/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> If you are using solr 1.2 the following command (followed by a commit /
> optimize) should do the trick:
>
> *:*
>
> cheers,
> Piete
>
>
> On 21/08/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
> >
> > what is the best approach to clearing an index?
> >
> > The use case is that I'm doing some performance testing with various
> > index sizes.  In between indexing (embedded and soon HTTP/XML) I need to
> > clear the index so I have a fresh start.
> >
> > What's the best approach, close the index and delete the files?  Hack
> > together some query that will match all documents and delete by query?
> > Looking at the Lucene API it looks like they have the same functionality
> > that is exposed already (delete by id or query).
> >
> > Paul Sundling
> >
> >
>