Re: JVM OOM when using field collapse component

2009-10-02 Thread Martijn v Groningen
No, I have not encountered an OOM exception yet with the current field collapse patch.
How large is your configured JVM heap space (-Xmx)? Field collapsing
requires more memory than regular searches do. Does Solr run out of
memory during the first search(es), or does it run out of memory after
a while, once it has performed quite a few field collapse searches?

I see that you are also using the collapse.includeCollapsedDocs.fl
parameter for your search. This feature will require more memory than
a normal field collapse search.

I normally give the Solr instance a heap space of 1024M when it has an
index of a few million documents.
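For example, with the example Jetty setup that ships with Solr that would be something like (the exact mechanism depends on how you start your servlet container):

  java -Xmx1024m -jar start.jar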

Martijn

2009/10/2 Joe Calderon :
> I've gotten two different out of memory errors while using the field
> collapsing component, using the latest patch (2009-09-26) and the
> latest nightly.
>
> Has anyone else encountered similar problems? My collection is 5
> million results, but I've gotten the error collapsing as few as a few
> thousand.
>
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>        at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:173)
>        at 
> org.apache.lucene.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:749)
>        at 
> org.apache.lucene.util.OpenBitSet.ensureCapacity(OpenBitSet.java:757)
>        at 
> org.apache.lucene.util.OpenBitSet.expandingWordNum(OpenBitSet.java:292)
>        at org.apache.lucene.util.OpenBitSet.set(OpenBitSet.java:233)
>        at 
> org.apache.solr.search.AbstractDocumentCollapser.addCollapsedDoc(AbstractDocumentCollapser.java:402)
>        at 
> org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:115)
>        at 
> org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
>        at 
> org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
>        at 
> org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
>        at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
>        at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
>        at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>        at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>        at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>        at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>        at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>        at org.mortbay.jetty.Server.handle(Server.java:326)
>        at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>        at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>        at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>        at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
>
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>        at 
> org.apache.solr.util.DocSetScoreCollector.(DocSetScoreCollector.java:44)
>        at 
> org.apache.solr.search.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:68)
>        at 
> org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:205)
>        at 
> org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
>        at 
> org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
>        at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at 
> org.mortbay.jetty

Re: populating synonyms.txt

2009-10-02 Thread Michael Engesgaard

I understand that synonyms are domain-specific, although I could still see a
benefit of having standardized synonyms.txt files (a thesaurus) for general
use, just like the ones you can download or that are already embedded in word
processors like OpenOffice Writer or MS Word.
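(For reference, in case it helps frame what such a file would have to look like: Solr's synonyms.txt is just plain text, with comma-separated lists treated as equivalent terms and "=>" for explicit one-way mappings, e.g.

  arugula, rocket
  colour => color

so any general-purpose thesaurus would need to be converted into that form.)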

I understand that you can get an English one from wordweb.info, but I need
other major languages and Scandinavian languages as well.

Does someone know where I can find multiple thesauruses? Maybe even some
targeted at Solr?

Thanks,

Michael


Walter Underwood, Netflix wrote:
> 
> Synonyms are domain-specific. A food site would list "arugula" and
> "rocket" as synonyms, but that would be a bad idea for NASA.
> 
> wunder
> 
> On 1/16/09 1:35 PM, "Daniel Lovins"  wrote:
> 
>> Hello list.
>> 
>> Are there standardized lists out there for populating synonyms.txt?
>> Entering the terms manually seems like a bad idea.
>> 
>> Thanks for your help.
>> 
>> Daniel
> 
> 
> 




Re: "Only one usage of each socket address" error

2009-10-02 Thread Steinar Asbjørnsen

Tried running Solr on Jetty now, and I still get the same error :(.

Steinar

Den 1. okt. 2009 kl. 16.23 skrev Steinar Asbjørnsen:


Hi.

This situation is still bugging me.
I thought I had it fixed yesterday, but no...

Seems like this goes both for deleting and adding, but I'll explain
the delete situation here:
When I'm deleting documents (~5k) from an index, I get an error message
saying
"Only one usage of each socket address (protocol/network address/
port) is normally permitted 127.0.0.1:8983".


I've tried both delete by id and delete by query, and both give me
the same error.
The command that is giving me the error message is solr.Delete(id)
and solr.Delete(new SolrQuery("id:"+id)).


The command is issued with SolrNet, and I'm not sure if this is
SolrNet- or Solr-related.


I cannot find anything that helps me out in the catalina-log.
Are there any other logs that should be checked?

I'm grateful for any pointers :)

Thanks,
Steinar

Den 29. sep. 2009 kl. 11.15 skrev Steinar Asbjørnsen:

Seems like the post in the SolrNet group: http://groups.google.com/group/solrnet/browse_thread/thread/7e3034b626d3e82d?pli=1
helped me get through.


Thank you, solr-users, for helping out too!

Steinar

Videresendt melding:


Fra: Steinar Asbjørnsen 
Dato: 28. september 2009 17.07.15 GMT+02.00
Til: solr-user@lucene.apache.org
Emne: Re: "Only one usage of each socket address" error

I'm using the add(MyObject) command in a foreach loop to
add my objects to the index.


In the catalina-log i cannot see anything that helps me out.
It stops at:
28.sep.2009 08:58:40  
org.apache.solr.update.processor.LogUpdateProcessor finish

INFO: {add=[12345]} 0 187
28.sep.2009 08:58:40 org.apache.solr.core.SolrCore execute
INFO: [core2] webapp=/solr path=/update params={} status=0 QTime=187
Which indicates nothing wrong.

Are there any other logs that should be checked?

What it seems like to me at the moment is that the foreach is
passing objects (documents) to Solr faster than Solr can add them
to the index. As in, I'm eventually running out of connections (to
Solr?) or something.


I'm running another incremental update with other objects
where the foreach isn't quite as fast. That job has added over
100k documents without failing, and is still going, whereas the
problematic job fails after ~3k.


What I've learned through the day, though, is that the index where my
feed is failing is actually redundant.

I.e. I'm off the hook for now.

Still, I'd like to figure out what's going wrong.

Steinar

There's nothing in that output that indicates something we can  
help with over in solr-user land.  What is the call you're making  
to Solr?  Did Solr log anything anomalous?


Erik


On Sep 28, 2009, at 4:41 AM, Steinar Asbjørnsen wrote:

I just posted to the SolrNet group since I have the exact same
(?) problem.
Hope I'm not being rude posting here as well (since the SolrNet
group doesn't seem as active as this mailing list).


The problem occurs when I'm running an incremental feed (self-made)
of an index.


My post:
[snip]
What's happening is that I get this error message (in VS):
"A first chance exception of type
'SolrNet.Exceptions.SolrConnectionException' occurred in  
SolrNet.DLL"

And the web browser (which I use to start the feed) says:
"System.Data.SqlClient.SqlException: Timeout expired.  The timeout
period elapsed prior to completion of the operation or the  
server is

not responding."
At the time of writing my index contains 15k docs, and "lacks"  
~700k

docs that the incremental feed should take care of adding to the
index.
The error message appears after 3k docs are added, and before 4k
docs are added.
I'm committing on each i % 1000 == 0.
In addition, autocommit is set to:

1

More info:
From schema.xml:


I'm fetching data from a (remote) Sql 2008 Server, using  
sqljdbc4.jar.

And Solr is running on a local Tomcat-installation.
SolrNet version: 0.2.3.0
Solr Specification Version: 1.3.0.2009.08.29.08.05.39

[/snip]
Any suggestions on how to fix this would be much appreciated.

Regards,
Steinar












Re: field collapsing sums

2009-10-02 Thread Martijn v Groningen
Well that is odd. How have you configured field collapsing with the
dismax request handler?
The collapse counts should be X - 1 (if collapse.threshold=1).

Martijn

2009/10/1 Joe Calderon :
> Thanks for the reply, I just want the number of dupes in the query
> result, but it seems I don't get the correct totals.
>
> For example, a non-collapsed dismax query for belgian beer returns X
> results,
> but when I collapse and sum the number of docs under collapse_counts,
> it's much less than X.
>
> it does seem to work when the collapsed results fit on one page (10
> rows in my case)
>
>
> --joe
>
>> 2) It seems that you are using the parameters as was intended. The
>> collapsed documents will contain all documents (from whole query
>> result) that have been collapsed on a certain field value that occurs
>> in the result set that is being displayed. That is how it should work.
>> But if I'm understanding you correctly, you want to display all dupes
>> from the whole query result set (also those whose collapse field value
>> does not occur in the displayed result set)?
>



-- 
Met vriendelijke groet,

Martijn van Groningen


Re: best way to get the size of an index

2009-10-02 Thread Grant Ingersoll


On Oct 1, 2009, at 12:18 PM, Phillip Farber wrote:



Resuming this discussion in a new thread to focus only on this  
question:


What is the best way to get the size of an index so it does not get  
too big to be optimized (or to allow a very large segment merge)  
given space limits?


I already have the largest 15,000rpm SCSI direct attached storage so  
buying storage is not an option.  I don't do deletes.


From what I've read, I expect no more than a 2x increase during  
optimization and have not seen more in practice.


I'm thinking: stop indexing, commit, do a du.


That sounds reasonable, but on the other thread, I'd still plan for a  
3x increase, even if you aren't doing deletes, just to be on the safe  
side.



I wonder if there is a way to report it back via Java/Lucene in a  
Request Handler or in the Luke Request Handler?  May be worth taking  
the time to add.


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



yellow pages navigation kind of menu. How to take every 100th row from a result set

2009-10-02 Thread Julian Davchev
Hi,

Long story short: how can I take every 100th row from a Solr result set?
What would the syntax for this be?

Long story:

Currently I have lots of, say, documents (articles) indexed. They all have
a title field with a corresponding value:

atitle
btitle
...
*title

How do I build a menu so I can search those?
I cannot just hardcode A B C D, meaning all starting
with A, all starting with B, etc., because there are Unicode characters
and the English alphabet will just not cut it...

So my idea is to make ranges like

[atitle - mtitle][mtitle - ltitle] ...etc etc   (based on
actual title names I got)


The question is how do I figure out what those atitle-mtitle boundaries are (like getting
every 100th record from a Solr query).
Two solutions I found:
1. Get all the stuff and do it server side (huge load, as it's thousands of
records we're talking about).
2. Use Solr sort and &start and make N calls until the number of resulting rows <
100 (see the example below). But this will mean quite a load as well, as there are lots of records.
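For option 2, the calls would look something like this (assuming title is a single-valued, untokenized field you can sort on):

  http://localhost:8983/solr/select?q=*:*&fl=title&sort=title+asc&rows=1&start=0
  http://localhost:8983/solr/select?q=*:*&fl=title&sort=title+asc&rows=1&start=100
  http://localhost:8983/solr/select?q=*:*&fl=title&sort=title+asc&rows=1&start=200
  ...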

Any pointers?
Thanks




debugQuery different score for same query. dismax

2009-10-02 Thread Julian Davchev
Hi,

I ran debug on a query to examine the score, as I was surprised by the results.
Here is the diff of the same explain section for two different rows that I
found troubling.

It looks for "pari" in the ancestorName field, but the first row looks in
241135 records
and the second row it's just 187821 records, which in the results gives a
lower score for the second row.

The question is what is affecting this, because I would expect the same
field name and same value to give the same score.

It's a dismax query... I skipped showing the scoring of other fields to simplify.

Cheers


-3.7137468 = (MATCH) weight(ancestorName:pari^35.0 in 241135), product of:
+3.1832116 = (MATCH) weight(ancestorName:pari^35.0 in 187821), product of:
 0.8593 = queryWeight(ancestorName:pari^35.0), product of:
 35.0 = boost
 8.488684 = idf(docFreq=148, numDocs=74979)
 0.0033657781 = queryNorm
-3.713799 = (MATCH) fieldWeight(ancestorName:pari in 241135),
product of:
+3.1832564 = (MATCH) fieldWeight(ancestorName:pari in 187821),
product of:
 1.0 = tf(termFreq(ancestorName:pari)=1)
 8.488684 = idf(docFreq=148, numDocs=74979)
-0.4375 = fieldNorm(field=ancestorName, doc=241135)
+0.375 = fieldNorm(field=ancestorName, doc=187821)



Re: Keepwords Schema

2009-10-02 Thread Shalin Shekhar Mangar
On Thu, Oct 1, 2009 at 7:37 PM, matrix_psj  wrote:

>
>
> An example:
> My schema is about web files. Part of the syntax is a text field of authors
> that have worked on each file, e.g.
> 
>login.php
>   2009-01-01
>   alex, brian, carl carlington, dave alpha, eddie, dave
> beta
> 
>
> When I perform a search and get 20 web files back, I would like a facet of
> the individual authors, but only if their name appears in a
> public_authors.txt file.
>
> So if the public_authors.txt file contained:
> Anna,
> Bob,
> Carl Carlington,
> Dave Alpha,
> Elvis,
> Eddie,
>
> The facet returned would be:
> Carl Carlington
> Dave Alpha
> Eddie
>
>
>
> Not sure if that makes sense? If it does, could someone explain to me the
> schema fieldtype declarations that would bring back this sort of result.
>
>
If I'm understanding you correctly - You want to facet on a field (with
facet=true&facet.field=authors) but you want to show only certain
whitelisted facet values in the response.

If that is correct, then you can remove the authors which are not in the
whitelist at indexing time. You can do this by adding a
KeepWordFilterFactory to your field type:
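A rough, untested sketch of such a field type (the keep file would list one name per line, without the trailing commas from your example):

  <fieldType name="public_authors" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.KeepWordFilterFactory" words="public_authors.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

The PatternTokenizer keeps multi-word names like "Carl Carlington" as single tokens, and the KeepWordFilter then drops any token that is not listed in public_authors.txt.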



-- 
Regards,
Shalin Shekhar Mangar.


Re: Query filters/analyzers

2009-10-02 Thread Shalin Shekhar Mangar
On Thu, Oct 1, 2009 at 7:59 PM, Claudio Martella  wrote:

>
> About the copyField issue in general: as it copies the content to the
> other field, what is the sense of defining analyzers for the destination
> field? The source is already analyzed, so I guess that the RESULT of the
> analysis is copied there.


The copy is done before analysis. The original text is sent to the copyField
which can choose to do analysis differently from the source field.
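For example (field and type names here are just illustrative):

  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="title_exact" type="string" indexed="true" stored="false"/>
  <copyField source="title" dest="title_exact"/>

It is the raw value of title, not its analyzed tokens, that gets indexed into title_exact.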

-- 
Regards,
Shalin Shekhar Mangar.


Re: best way to get the size of an index

2009-10-02 Thread Mark Miller
Phillip Farber wrote:
>
> Resuming this discussion in a new thread to focus only on this question:
>
> What is the best way to get the size of an index so it does not get
> too big to be optimized (or to allow a very large segment merge) given
> space limits?
>
> I already have the largest 15,000rpm SCSI direct attached storage so
> buying storage is not an option.  I don't do deletes.
Even if you did do deletes, it's not really a 3x problem - that's just
theory - you'd have to work to get there. Deletes are merged out as you
index additional docs and segments are merged over time. The 3x scenario
brought up is more of a fun mind exercise than anything that would
realistically happen.
>
> From what I've read, I expect no more than a 2x increase during
> optimization and have not seen more in practice.
>
> I'm thinking: stop indexing, commit, do a du.
>
> Will this give me the number I need for what I'm trying to do? Is
> there a better way?
Should work fine. When you do the commit, onCommit will be called on the
IndexDeletionPolicy, and all of the "snapshots" of the index other than
the latest one will be removed. You should have a clean index to gauge
the size with. Using something like Java Replication complicates this
though - in that case, older commit points can be reserved while they
are being copied.
>
> Phil


-- 
- Mark

http://www.lucidimagination.com





Re: "Only one usage of each socket address" error

2009-10-02 Thread Mauricio Scheffer
Did you try this?
http://blogs.msdn.com/dgorti/archive/2005/09/18/470766.aspx
Also, please
post the full exception stack trace.

2009/10/2 Steinar Asbjørnsen 

> Tried running solr on jetty now, and I still get the same error:(.
>
> Steinar
>
> Den 1. okt. 2009 kl. 16.23 skrev Steinar Asbjørnsen:
>
>
>  Hi.
>>
>> This situation is still bugging me.
>> I thought I had it fixed yesterday, but no...
>>
>> Seems like this goes both for deleting and adding, but I'll explain the
>> delete-situation here:
>> When I'm deleting documents (~5k) from an index, I get an error message
>> saying
>> "Only one usage of each socket address (protocol/network address/port) is
>> normally permitted 127.0.0.1:8983".
>>
>> I've tried both delete by id and delete by query, and both give me the
>> same error.
>> The command that is giving me the error message is solr.Delete(id) and
>> solr.Delete(new SolrQuery("id:"+id)).
>>
>> The command is issued with SolrNet, and I'm not sure if this is SolrNet or
>> solr related.
>>
>> I cannot find anything that helps me out in the catalina-log.
>> Are there any other logs that should be checked?
>>
>> I'm grateful for any pointers :)
>>
>> Thanks,
>> Steinar
>>
>> Den 29. sep. 2009 kl. 11.15 skrev Steinar Asbjørnsen:
>>
>>  Seems like the post in the SolrNet group:
>>> http://groups.google.com/group/solrnet/browse_thread/thread/7e3034b626d3e82d?pli=1
>>>  helped
>>> me get through.
>>>
>>> Thank you, solr-users, for helping out too!
>>>
>>> Steinar
>>>
>>> Videresendt melding:
>>>
>>>  Fra: Steinar Asbjørnsen 
 Dato: 28. september 2009 17.07.15 GMT+02.00
 Til: solr-user@lucene.apache.org
 Emne: Re: "Only one usage of each socket address" error

 I'm using the add(MyObject) command in a foreach loop to add my
 objects to the index.

 In the catalina-log i cannot see anything that helps me out.
 It stops at:
 28.sep.2009 08:58:40 org.apache.solr.update.processor.LogUpdateProcessor
 finish
 INFO: {add=[12345]} 0 187
 28.sep.2009 08:58:40 org.apache.solr.core.SolrCore execute
 INFO: [core2] webapp=/solr path=/update params={} status=0 QTime=187
 Which indicates nothing wrong.

 Are there any other logs that should be checked?

 What it seems like to me at the moment is that the foreach is passing
 objects (documents) to Solr faster than Solr can add them to the index. As
 in
 I'm eventually running out of connections (to solr?) or something.

 I'm running another incremental update with other objects where the
 foreach isn't quite as fast. That job has added over 100k documents
 without
 failing, and still going. Whereas the problematic job fails after ~3k.

 What I've learned through the day, though, is that the index where my feed is
 failing is actually redundant.
 I.e I'm off the hook for now.

 Still, I'd like to figure out what's going wrong.

 Steinar

  There's nothing in that output that indicates something we can help
> with over in solr-user land.  What is the call you're making to Solr?  Did
> Solr log anything anomalous?
>
>Erik
>
>
> On Sep 28, 2009, at 4:41 AM, Steinar Asbjørnsen wrote:
>
>  I just posted to the SolrNet group since I have the exact same (?)
>> problem.
>> Hope I'm not being rude posting here as well (since the SolrNet group
>> doesn't seem as active as this mailing list).
>>
>> The problem occurs when I'm running an incremental feed (self-made) of
>> an index.
>>
>> My post:
>> [snip]
>> What's happening is that I get this error message (in VS):
>> "A first chance exception of type
>> 'SolrNet.Exceptions.SolrConnectionException' occurred in SolrNet.DLL"
>> And the web browser (which I use to start the feed) says:
>> "System.Data.SqlClient.SqlException: Timeout expired.  The timeout
>> period elapsed prior to completion of the operation or the server is
>> not responding."
>> At the time of writing my index contains 15k docs, and "lacks" ~700k
>> docs that the incremental feed should take care of adding to the
>> index.
>> The error message appears after 3k docs are added, and before 4k
>> docs are added.
>> I'm committing on each i % 1000 == 0.
>> In addition, autocommit is set to:
>> 
>> 1
>> 
>> More info:
>> From schema.xml:
>> > required="true" />
>> > required="false" />
>> I'm fetching data from a (remote) Sql 2008 Server, using sqljdbc4.jar.
>> And Solr is running on a local Tomcat-installation.
>> SolrNet version: 0.2.3.0
>> Solr Specification Version: 1.3.0.2009.08.29.08.05.39
>>
>> [/snip]
>> Any suggestions on how to fix this would be much appreciated.
>>
>> Regards,
>> Steinar
>>
>
>

>>>
>>
>


Re: Question on modifying solr behavior on indexing xml files..

2009-10-02 Thread Shalin Shekhar Mangar
On Thu, Oct 1, 2009 at 3:10 PM, Thung, Peter C CIV SPAWARSYSCEN-PACIFIC,
56340  wrote:

> 1.  In my playing around with
> sending in an XML document within an XML CDATA tag,
> with termVectors="true"
>
> I noticed the following behavior:
> <person>peter</person>
> collapses to the term
> personpeterperson
> instead of
> person
> and
> peter separately.
>
> I realize I could try and do a search-and-replace of characters like
> <>"= to a space so that the default parser/indexer can preserve element
> names.
> However, I'm wondering if someone could point me to where one might do
> this within
> the Solr or Apache Lucene code as a proper plug-in, with maybe an example
> that I could use
> as a template.  Also, where in the solrconfig.xml file I would want to
> make changes to reference the new parser.
>
>
Solr is agnostic of the content in a schema field. It does not know that it
is XML and hence it will do blind tokenization/filtering as defined for the
field type in schema.xml

If all you want is to do a full-text search on words found somewhere in that
XML, then your approach of replacing <>"= to a space will work fine. You can
use the PatternReplaceFilter and specify a regex which matches these special
characters and replaces them with a space.
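Something along these lines might work (untested; the character class just lists the characters you mentioned, XML-escaped for the config file):

  <filter class="solr.PatternReplaceFilterFactory"
          pattern="[&lt;&gt;&quot;=]" replacement=" " replace="all"/>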



Or you can use the MappingCharFilter (solr 1.4 feature) and specify a
mapping file which has these special characters mapped to a space.



The file should be in the format:
characterToBeReplaced => replacementChar
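A sketch of how that could be wired up (the file name is just an example; the charFilter goes inside the analyzer, before the tokenizer):

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specialchars.txt"/>

with mapping-specialchars.txt containing entries such as:

  "<" => " "
  ">" => " "
  "=" => " "
  "\"" => " "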

However, if you want to preserve the structure of the XML document, it is
best to parse it out yourself and put contents into Solr fields before
sending it to Solr. You may also want to look at DataImportHandler and
XPathEntityProcessor which is commonly used for importing XML files.

http://wiki.apache.org/solr/DataImportHandler


> 2.  My other question would also be if this technique would work for XML
> type messages embedded
> in Microsoft Excel, or Powerpoint presentations where I would like to
> preserve knowledge of XML element term frequencies,
> where I would try and leverage the component that automatically indexes
> microsoft documents.
> Would I need to modify that component and customize it?
>
>
Perhaps somebody who knows about Solr Cell can answer this but I think it
should work.

-- 
Regards,
Shalin Shekhar Mangar.


conditional sorting

2009-10-02 Thread Bojan Šmid
Hi all,

I need to perform sorting of my query hits by a different criterion depending
on the number of hits. For instance, if there are < 10 hits, sort by
date_entered, otherwise, sort by popularity.

Does anyone know if there is a way to do that with a single query, or will I
have to send another query with the desired sort criterion after I inspect
the number of hits on my client?

Thx


Re: Query filters/analyzers

2009-10-02 Thread Fergus McMenemie
>On Thu, Oct 1, 2009 at 7:59 PM, Claudio Martella > wrote:
>
>>
>> About the copyField issue in general: as it copies the content to the
>> other field, what is the sense of defining analyzers for the destination
>> field? The source is already analyzed, so I guess that the RESULT of the
>> analysis is copied there.
>
>
>The copy is done before analysis. The original text is sent to the copyField
>which can choose to do analysis differently from the source field.
>
I have been wondering about this as well. The WIKI is not explicit about
what happens. Is this correct:-

"The original text is sent to the copyField, before any configured
analyzers for the originating or destination field are invoked."

If so, I will tweak the wiki!

Regds Fergus.
-- 


Re: trie fields and sortMissingLast

2009-10-02 Thread Yonik Seeley
On Thu, Oct 1, 2009 at 2:54 PM, Lance Norskog  wrote:
> Trie fields also do not support faceting.

Only those that index multiple tokens per value to speed up range queries.

> They also take more ram in
> some operations.

Should be less memory on average.

-Yonik
http://www.lucidimagination.com

> Given these defects, I'm not sure that promoting tries as the default
> is appropriate at this time. (I'm sure this is an old argument.:)
>
> On Thu, Oct 1, 2009 at 7:39 AM, Steve Conover  wrote:
>> I just noticed this comment in the default schema:
>>
>> 
>>
>> Does that mean TrieFields are never going to get sortMissingLast?
>>
>> Do you all think that a reasonable strategy is to use a copyField and
>> use "s" fields for sorting (only), and trie for everything else?
>>
>> On Wed, Sep 30, 2009 at 10:59 PM, Steve Conover  wrote:
>>> Am I correct in thinking that trie fields don't support
>>> sortMissingLast (my tests show that they don't).  If not, is there any
>>> plan for adding it in?
>>>
>>> Regards,
>>> Steve
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Query filters/analyzers

2009-10-02 Thread Shalin Shekhar Mangar
On Fri, Oct 2, 2009 at 6:44 PM, Fergus McMenemie  wrote:

> >The copy is done before analysis. The original text is sent to the
> copyField
> >which can choose to do analysis differently from the source field.
> >
> I have been wondering about this as well. The WIKI is not explicit about
> what happens. Is this correct:-
>
> "The original text is sent to the copyField, before any configured
> analyzers for the originating or destination field are invoked."
>
>
Yes, that is correct.


> If so, I will tweak the wiki!
>
>
Please do!

-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Trunk Heap Space Issues

2009-10-02 Thread Jeff Newburn
Ah yes, we do have some warming queries, which would look like a search.  Did
that side change enough to push up the memory limits where we would run out
like this?  Also, would FastLRU cache make a difference?
-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


> From: Yonik Seeley 
> Reply-To: 
> Date: Fri, 2 Oct 2009 00:53:46 -0400
> To: 
> Subject: Re: Solr Trunk Heap Space Issues
> 
> On Thu, Oct 1, 2009 at 8:45 PM, Jeffery Newburn  wrote:
>> I loaded the jvm and started indexing. It is a test server so unless some
>> errant query came in then no searching. Our instance has only 512mb but my
>> concern is the obvious memory requirement leap since it worked before. What
>> other data would be helpful with this?
> 
> Interesting... not too much should have changed for memory
> requirements on the indexing side.
> TokenStreams are now reused (and hence cached) per thread... but that
> normally wouldn't amount to much.
> 
> There was recently another bug where compound file format was being
> used regardless of the config settings... but I think that was fixed
> on the 29th.
> 
> Maybe you were already close to the limit required?
> Also, your heap dump did show LRUCache taking up 170MB, and only
> searches populate that (perhaps you have warming searches configured
> on this server?)
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
> 
> 
>> 
>> 
>> On Oct 1, 2009, at 5:14 PM, "Mark Miller"  wrote:
>> 
>>> Jeff Newburn wrote:
 
 Ok I was able to get a heap dump from the GC Limit error.
 
 1 instance of LRUCache is taking 170mb
 1 instance of SchemaIndex is taking 56Mb
 4 instances of SynonymMap is taking 112mb
 
 There is no searching going on during this index update process.
 
 Any ideas what on earth is going on?  Like I said my May version did this
 without any problems whatsoever.
 
 
>>> Had any searching gone on though? Even if its not occurring during the
>>> indexing, you will still have the data structure loaded if searches had
>>> occurred.
>>> 
>>> What heap size do you have - that doesn't look like much data to me ...
>>> 
>>> --
>>> - Mark
>>> 
>>> http://www.lucidimagination.com
>>> 
>>> 
>>> 
>> 



Re: Solr Trunk Heap Space Issues

2009-10-02 Thread Yonik Seeley
On Fri, Oct 2, 2009 at 9:54 AM, Jeff Newburn  wrote:
> Ah yes we do have some warming queries which would look like a search.  Did
> that side change enough to push up the memory limits where we would run out
> like this?

What do the warming request(s) look like, and what are the field
types for the fields referenced?

>  Also, would FastLRU cache make a difference?

It shouldn't.

-Yonik
http://www.lucidimagination.com


> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
>
>
>> From: Yonik Seeley 
>> Reply-To: 
>> Date: Fri, 2 Oct 2009 00:53:46 -0400
>> To: 
>> Subject: Re: Solr Trunk Heap Space Issues
>>
>> On Thu, Oct 1, 2009 at 8:45 PM, Jeffery Newburn  wrote:
>>> I loaded the jvm and started indexing. It is a test server so unless some
>>> errant query came in then no searching. Our instance has only 512mb but my
>>> concern is the obvious memory requirement leap since it worked before. What
>>> other data would be helpful with this?
>>
>> Interesting... not too much should have changed for memory
>> requirements on the indexing side.
>> TokenStreams are now reused (and hence cached) per thread... but that
>> normally wouldn't amount to much.
>>
>> There was recently another bug where compound file format was being
>> used regardless of the config settings... but I think that was fixed
>> on the 29th.
>>
>> Maybe you were already close to the limit required?
>> Also, your heap dump did show LRUCache taking up 170MB, and only
>> searches populate that (perhaps you have warming searches configured
>> on this server?)
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>
>>
>>
>>>
>>>
>>> On Oct 1, 2009, at 5:14 PM, "Mark Miller"  wrote:
>>>
 Jeff Newburn wrote:
>
> Ok I was able to get a heap dump from the GC Limit error.
>
> 1 instance of LRUCache is taking 170mb
> 1 instance of SchemaIndex is taking 56Mb
> 4 instances of SynonymMap is taking 112mb
>
> There is no searching going on during this index update process.
>
> Any ideas what on earth is going on?  Like I said my May version did this
> without any problems whatsoever.
>
>
 Had any searching gone on though? Even if its not occurring during the
 indexing, you will still have the data structure loaded if searches had
 occurred.

 What heap size do you have - that doesn't look like much data to me ...

 --
 - Mark

 http://www.lucidimagination.com



>>>
>
>


Re: Solr Trunk Heap Space Issues

2009-10-02 Thread Mark Miller
Jeff Newburn wrote:
> that side change enough to push up the memory limits where we would run out
> like this? 
>   
Yes - now give us the FieldCache section from the stats section please :)

It's not likely gonna do you any good, but it could be good information
for us.

-- 
- Mark

http://www.lucidimagination.com





Re: Solr Trunk Heap Space Issues

2009-10-02 Thread Yonik Seeley
On Fri, Oct 2, 2009 at 10:02 AM, Mark Miller  wrote:
> Jeff Newburn wrote:
>> that side change enough to push up the memory limits where we would run out
>> like this?
>>
> Yes - now give us the FieldCache section from the stats section please :)

And the fieldValueCache section too (used for multi-valued faceting).

-Yonik
http://www.lucidimagination.com


RE: Solr and Garbage Collection

2009-10-02 Thread siping liu

Hi,

I read pretty much all posts on this thread (before and after this one). Looks 
like the main suggestion from you and others is to keep max heap size (-Xmx) as 
small as possible (as long as you don't see OOM exception). This brings more 
questions than answers (for me at least. I'm new to Solr).

 

First, our environment and problem encountered: Solr1.4 (nightly build, 
downloaded about 2 months ago), Sun JDK1.6, Tomcat 5.5, running on 
Solaris(multi-cpu/cores). The cache setting is from the default solrconfig.xml 
(looks very small). At first we used minimum JAVA_OPTS and quickly run into the 
problem similar to the one orignal poster reported -- long pause (seconds to 
minutes) under load test. jconsole showed that it pauses on GC. So more 
JAVA_OPTS get added: "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:ParallelGCThreads=8 -XX:SurvivorRatio=2 -XX:NewSize=128m 
-XX:MaxNewSize=512m -XX:MaxGCPauseMillis=200", the thinking is with 
multiple CPUs/cores we can get over with GC as quickly as possible. With the new 
setup, it works fine until Tomcat reaches heap size, then it blocks and takes 
minutes on "full GC" to get more space from "tenure generation". We tried 
different Xmx (from very small to large), no difference in long GC time. We 
never run into OOM.

 

Questions:

* In general various cachings are good for performance, we have more RAM to use 
and want to use more caching to boost performance, isn't your suggestion (of 
lowering heap limit) going against that?

* Looks like Solr caching made its way into the tenured generation on the heap, that's 
good. But why do they get GC'ed eventually? I did a quick check of the Solr code 
(Solr 1.3, not 1.4), and see a single instance of using WeakReference. Is that 
what is causing all this? This seems to suggest a design flaw in Solr's memory 
management strategy (or just my ignorance about Solr?). I mean, wouldn't this 
be the "right" way of doing it -- you allow the user to specify the cache size in 
solrconfig.xml, then the user can set up the heap limit in JAVA_OPTS accordingly, and 
there's no need to use WeakReference (BTW, why not SoftReference)?

* Right now I have a single Tomcat hosting Solr and other applications. I guess 
now it's better to have Solr on its own Tomcat, given that it's tricky to 
adjust the java options.

 

thanks.


 
> From: wun...@wunderwood.org
> To: solr-user@lucene.apache.org
> Subject: RE: Solr and Garbage Collection
> Date: Fri, 25 Sep 2009 09:51:29 -0700
> 
> 30ms is not better or worse than 1s until you look at the service
> requirements. For many applications, it is worth dedicating 10% of your
> processing time to GC if that makes the worst-case pause short.
> 
> On the other hand, my experience with the IBM JVM was that the maximum query
> rate was 2-3X better with the concurrent generational GC compared to any of
> their other GC algorithms, so we got the best throughput along with the
> shortest pauses.
> 
> Solr garbage generation (for queries) seems to have two major components:
> per-request garbage and cache evictions. With a generational collector,
> these two are handled by separate parts of the collector. Per-request
> garbage should completely fit in the short-term heap (nursery), so that it
> can be collected rapidly and returned to use for further requests. If the
> nursery is too small, the per-request allocations will be made in tenured
> space and sit there until the next major GC. Cache evictions are almost
> always in long-term storage (tenured space) because an LRU algorithm
> guarantees that the garbage will be old.
> 
> Check the growth rate of tenured space (under constant load, of course)
> while increasing the size of the nursery. That rate should drop when the
> nursery gets big enough, then not drop much further as it is increased more.
> 
> After that, reduce the size of tenured space until major GCs start happening
> "too often" (a judgment call). A bigger tenured space means longer major GCs
> and thus longer pauses, so you don't want it oversized by too much.
> 
> Also check the hit rates of your caches. If the hit rate is low, say 20% or
> less, make that cache much bigger or set it to zero. Either one will reduce
> the number of cache evictions. If you have an HTTP cache in front of Solr,
> zero may be the right choice, since the HTTP cache is cherry-picking the
> easily cacheable requests.
> 
> Note that a commit nearly doubles the memory required, because you have two
> live Searcher objects with all their caches. Make sure you have headroom for
> a commit.
> 
> If you want to test the tenured space usage, you must test with real world
> queries. Those are the only way to get accurate cache eviction rates.
> 
> wunder
  
_
Bing™  brings you maps, menus, and reviews organized in one place.   Try it now.
http://www.bing.com/search?q=restaurants&form=MLOGEN&publ=WLHMTAG&crea=TEXT_MLOGEN_Core_tagline_local_1x1

Re: Solr and Garbage Collection

2009-10-02 Thread Mark Miller
siping liu wrote:
> Hi,
>
> I read pretty much all posts on this thread (before and after this one). 
> Looks like the main suggestion from you and others is to keep max heap size 
> (-Xmx) as small as possible (as long as you don't see OOM exception). This 
> brings more questions than answers (for me at least. I'm new to Solr).
>
>  
>
> First, our environment and problem encountered: Solr1.4 (nightly build, 
> downloaded about 2 months ago), Sun JDK1.6, Tomcat 5.5, running on 
> Solaris(multi-cpu/cores). The cache setting is from the default 
> solrconfig.xml (looks very small). At first we used minimum JAVA_OPTS and 
> quickly run into the problem similar to the one orignal poster reported -- 
> long pause (seconds to minutes) under load test. jconsole showed that it 
> pauses on GC. So more JAVA_OPTS get added: "-XX:+UseConcMarkSweepGC 
> -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:SurvivorRatio=2 -XX:NewSize=128m 
> -XX:MaxNewSize=512m -XX:MaxGCPauseMillis=200", the thinking is with 
> mutile-cpu/cores we can get over with GC as quickly as possibe. With the new 
> setup, it works fine until Tomcat reaches heap size, then it blocks and takes 
> minutes on "full GC" to get more space from "tenure generation". We tried 
> different Xmx (from very small to large), no difference in long GC time. We 
> never run into OOM.
>   
MaxGCPauseMillis doesn't work with UseConcMarkSweepGC - it's for use with
the parallel collector. That also doesn't look like a good survivor ratio.
>  
>
> Questions:
>
> * In general various cachings are good for performance, we have more RAM to 
> use and want to use more caching to boost performance, isn't your suggestion 
> (of lowering heap limit) going against that?
>   
Leaving RAM for the FileSystem cache is also very important. But you
should also have enough RAM for your Solr caches of course.
> * Looks like Solr caching made its way into tenure-generation on heap, that's 
> good. But why they get GC'ed eventually?? I did a quick check of Solr code 
> (Solr 1.3, not 1.4), and see a single instance of using WeakReference. Is 
> that what is causing all this? This seems to suggest a design flaw in Solr's 
> memory management strategy (or just my ignorance about Solr?). I mean, 
> wouldn't this be the "right" way of doing it -- you allow user to specify the 
> cache size in solrconfig.xml, then user can set up heap limit in JAVA_OPTS 
> accordingly, and no need to use WeakReference (BTW, why not SoftReference)??
>   
Do you see concurrent mode failure when looking at your gc logs? ie:

174.445: [GC 174.446: [ParNew: 66408K->66408K(66416K), 0.618
secs]174.446: [CMS (concurrent mode failure): 161928K->162118K(175104K),
4.0975124 secs] 228336K->162118K(241520K)

That means you are still getting major collections with CMS, and you
don't want that. You might try kicking GC off earlier with something
like: -XX:CMSInitiatingOccupancyFraction=50
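For reference, an illustrative (not prescriptive) combination along those lines, with GC logging enabled so you can actually see whether the concurrent mode failures go away (the sizes are placeholders; tune them against your own logs):

  JAVA_OPTS="$JAVA_OPTS -Xms512m -Xmx512m \
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log"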
> * Right now I have a single Tomcat hosting Solr and other applications. I 
> guess now it's better to have Solr on its own Tomcat, given that it's tricky 
> to adjust the java options.
>
>  
>
> thanks.
>
>
>  
>   
>> From: wun...@wunderwood.org
>> To: solr-user@lucene.apache.org
>> Subject: RE: Solr and Garbage Collection
>> Date: Fri, 25 Sep 2009 09:51:29 -0700
>>
>> 30ms is not better or worse than 1s until you look at the service
>> requirements. For many applications, it is worth dedicating 10% of your
>> processing time to GC if that makes the worst-case pause short.
>>
>> On the other hand, my experience with the IBM JVM was that the maximum query
>> rate was 2-3X better with the concurrent generational GC compared to any of
>> their other GC algorithms, so we got the best throughput along with the
>> shortest pauses.
>>
>> Solr garbage generation (for queries) seems to have two major components:
>> per-request garbage and cache evictions. With a generational collector,
>> these two are handled by separate parts of the collector. Per-request
>> garbage should completely fit in the short-term heap (nursery), so that it
>> can be collected rapidly and returned to use for further requests. If the
>> nursery is too small, the per-request allocations will be made in tenured
>> space and sit there until the next major GC. Cache evictions are almost
>> always in long-term storage (tenured space) because an LRU algorithm
>> guarantees that the garbage will be old.
>>
>> Check the growth rate of tenured space (under constant load, of course)
>> while increasing the size of the nursery. That rate should drop when the
>> nursery gets big enough, then not drop much further as it is increased more.
>>
>> After that, reduce the size of tenured space until major GCs start happening
>> "too often" (a judgment call). A bigger tenured space means longer major GCs
>> and thus longer pauses, so you don't want it oversized by too much.
>>
>> Also check the hit rates of your caches. If the hit rate is low, say 20% or
>> less, make that cache much bigge

Re: conditional sorting

2009-10-02 Thread Uri Boness
If the threshold is only 10, why can't you always sort by popularity and 
if the result set is <10 then resort on the client side based on 
date_entered?
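A rough SolrJ sketch of that approach (completely untested; the field names "popularity" and "date_entered" are taken from your example):

  import java.util.Collections;
  import java.util.Comparator;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  public class ConditionalSort {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer server =
          new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery q = new SolrQuery("your query");
      q.addSortField("popularity", SolrQuery.ORDER.desc);
      QueryResponse rsp = server.query(q);
      SolrDocumentList docs = rsp.getResults();
      if (docs.getNumFound() < 10) {
        // Small result set: re-sort the returned docs locally by date_entered.
        Collections.sort(docs, new Comparator<SolrDocument>() {
          public int compare(SolrDocument a, SolrDocument b) {
            return ((Comparable) b.getFieldValue("date_entered"))
                .compareTo(a.getFieldValue("date_entered"));
          }
        });
      }
      // docs is now in the desired order either way.
    }
  }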


Uri

Bojan Šmid wrote:

Hi all,

I need to perform sorting of my query hits by different criterion depending
on the number of hits. For instance, if there are < 10 hits, sort by
date_entered, otherwise, sort by popularity.

Does anyone know if there is a way to do that with a single query, or I'll
have to send another query with desired sort criterion after I inspect
number of hits on my client?

Thx

  


Re: How to access the information from SolrJ

2009-10-02 Thread Paul Tomblin
Nope, that just gets you the number of results returned, not how many
there could be.  Like I said, if you look at the XML returned, you'll
see something like
<result name="response" numFound="1251" start="0">
but only 10 <doc> elements returned.  getNumFound returns 10 in that case, not 1251.


2009/10/2 Noble Paul നോബിള്‍  नोब्ळ् :
> QueryResponse#getResults()#getNumFound()
>
> On Thu, Oct 1, 2009 at 11:49 PM, Paul Tomblin  wrote:
>> When I do a query directly form the web, the XML of the response
>> includes how many results would have been returned if it hadn't
>> restricted itself to the first 10 rows:
>>
>> For instance, the query:
>> http://localhost:8080/solrChunk/nutch/select/?q=*:*&fq=category:mysites
>> returns:
>> 
>> 
>> 0
>> 0
>> 
>> *:*
>> category:mysites
>> 
>> 
>> 
>> 
>> mysites
>> 0
>> http://localhost/Chunks/mysites/0-http___xcski.com_.xml
>> Anatomy
>> ...
>>
>> The value I'm talking about is in the "numFound" attribute of the "result" 
>> tag.
>>
>> I don't see any way to retrieve it through SolrJ - it's not in the
>> QueryResponse.getHeader(), for instance.  Can I retrieve it somewhere?
>>
>> --
>> http://www.linkedin.com/in/paultomblin
>>
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
http://www.linkedin.com/in/paultomblin


Re: "Only one usage of each socket address" error

2009-10-02 Thread Steinar Asbjørnsen

You're the man, Mauricio!

Adding and setting MaxUserPort and TCPTimedWaitDelay in the registry  
sure helps!

Over the weekend I'll look into doing this programmatically.
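For the record, the values live under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters; the numbers below are only examples (see the article Mauricio linked for guidance), and a reboot is needed for them to take effect:

  reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxUserPort /t REG_DWORD /d 65534 /f
  reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f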

Thanks!
Steinar

Den 2. okt. 2009 kl. 14.47 skrev Mauricio Scheffer:


Did you try this?
http://blogs.msdn.com/dgorti/archive/2005/09/18/470766.aspx
Also,  
please

post the full exception stack trace.

2009/10/2 Steinar Asbjørnsen 


Tried running solr on jetty now, and I still get the same error:(.

Steinar

Den 1. okt. 2009 kl. 16.23 skrev Steinar Asbjørnsen:


Hi.


This situation is still bugging me.
I thought I had it fixed yesterday, but no...

Seems like this goes both for deleting and adding, but I'll  
explain the

delete-situation here:
When I'm deleting documents (~5k) from an index, I get an error message
saying
"Only one usage of each socket address (protocol/network address/ 
port) is

normally permitted 127.0.0.1:8983".

I've tried both delete by id and delete by query, and both give
me the

same error.
The command that is giving me the error message is solr.Delete(id)
and

solr.Delete(new SolrQuery("id:"+id)).

The command is issued with SolrNet, and I'm not sure if this is  
SolrNet or

solr related.

I cannot find anything that helps me out in the catalina-log.
Are there any other logs that should be checked?

I'm grateful for any pointers :)

Thanks,
Steinar

Den 29. sep. 2009 kl. 11.15 skrev Steinar Asbjørnsen:

Seems like the post in the SolrNet group:
http://groups.google.com/group/solrnet/browse_thread/thread/7e3034b626d3e82d?pli=1 
 helped

me get through.

Thank you, solr-users, for helping out too!

Steinar

Videresendt melding:

Fra: Steinar Asbjørnsen 

Dato: 28. september 2009 17.07.15 GMT+02.00
Til: solr-user@lucene.apache.org
Emne: Re: "Only one usage of each socket address" error

I'm using the add(MyObject) command in a foreach loop to
add my

objects to the index.

In the catalina-log i cannot see anything that helps me out.
It stops at:
28.sep.2009 08:58:40  
org.apache.solr.update.processor.LogUpdateProcessor

finish
INFO: {add=[12345]} 0 187
28.sep.2009 08:58:40 org.apache.solr.core.SolrCore execute
INFO: [core2] webapp=/solr path=/update params={} status=0  
QTime=187

Which indicates nothing wrong.

Are there any other logs that should be checked?

What it seems like to me at the moment is that the foreach is  
passing
objects (documents) to Solr faster than Solr can add them to the
index. As in

I'm eventually running out of connections (to solr?) or something.

I'm running another incremental update with other objects
where the
foreach isn't quite as fast. That job has added over 100k
documents without
failing, and still going. Whereas the problematic job fails  
after ~3k.


What I've learned through the day, though, is that the index where my
feed is

failing is actually redundant.
I.e I'm off the hook for now.

Still, I'd like to figure out what's going wrong.

Steinar

There's nothing in that output that indicates something we can  
help
with over in solr-user land.  What is the call you're making to  
Solr?  Did

Solr log anything anomalous?

  Erik


On Sep 28, 2009, at 4:41 AM, Steinar Asbjørnsen wrote:

I just posted to the SolrNet-group since i have the exact same(?)

problem.
Hope I'm not beeing rude posting here as well (since the  
SolrNet-group

doesn't seem as active as this mailinglist).

The problem occurs when I'm running an incremental feed(self  
made) of

a index.

My post:
[snip]
What's happening is that I get this error message (in VS):
"A first chance exception of type
'SolrNet.Exceptions.SolrConnectionException' occurred in  
SolrNet.DLL"

And the web browser (which I use to start the feed) says:
"System.Data.SqlClient.SqlException: Timeout expired.  The  
timeout
period elapsed prior to completion of the operation or the  
server is

not responding."
At the time of writing my index contains 15k docs, and "lacks"  
~700k

docs that the incremental feed should take care of adding to the
index.
The error message appears after 3k docs are added, and before 4k
docs are added.
I'm committing on each i % 1000 == 0.
In addition, autocommit is set to:

1

More info:
From schema.xml:


I'm fetching data from a (remote) Sql 2008 Server, using  
sqljdbc4.jar.

And Solr is running on a local Tomcat-installation.
SolrNet version: 0.2.3.0
Solr Specification Version: 1.3.0.2009.08.29.08.05.39

[/snip]
Any suggestions on how to fix this would be much appreciated.

Regards,
Steinar
















Re: conditional sorting

2009-10-02 Thread Bojan Šmid
I tried to simplify the problem, but the point is that I could have really
complex requirements. For instance, "if in the first 5 results none are
older than one year, use sort by X, otherwise sort by Y".

So, the question is, is there a way to make Solr recognize complex
situations and apply a different sorting criterion?

Bojan


On Fri, Oct 2, 2009 at 4:22 PM, Uri Boness  wrote:

> If the threshold is only 10, why can't you always sort by popularity and if
> the result set is <10 then resort on the client side based on date_entered?
>
> Uri
>
>
> Bojan Šmid wrote:
>
>> Hi all,
>>
>> I need to perform sorting of my query hits by different criterion
>> depending
>> on the number of hits. For instance, if there are < 10 hits, sort by
>> date_entered, otherwise, sort by popularity.
>>
>> Does anyone know if there is a way to do that with a single query, or I'll
>> have to send another query with desired sort criterion after I inspect
>> number of hits on my client?
>>
>> Thx
>>
>>
>>
>


Re: Solr Trunk Heap Space Issues

2009-10-02 Thread Jeff Newburn
The warmers return 11 fields:
3 Strings
2 booleans
2 doubles
2 longs
1 sint (solr.SortableIntField)

Let me know if you need the fields actually being searched on.

name:  fieldCache  
class:  org.apache.solr.search.SolrFieldCacheMBean  
version:  1.0  
description:  Provides introspection of the Lucene FieldCache, this is
**NOT** a cache that is managed by Solr.  
stats: entries_count :  0
insanity_count :  0
 
name:  documentCache  
class:  org.apache.solr.search.LRUCache  
version:  1.0  
description:  LRU Cache(maxSize=10, initialSize=75000)  
stats: lookups :  22620
hits :  337 
hitratio :  0.01 
inserts :  22282 
evictions :  0 
size :  22282 
warmupTime :  0 
cumulative_lookups :  22620
cumulative_hits :  337
cumulative_hitratio :  0.01
cumulative_inserts :  22282
cumulative_evictions :  0


name:  fieldValueCache  
class:  org.apache.solr.search.FastLRUCache  
version:  1.0  
description:  Concurrent LRU Cache(maxSize=1, initialSize=10,
minSize=9000, acceptableSize=9500, cleanupThread=false)  
stats: lookups :  0
hits :  0 
hitratio :  0.00 
inserts :  0 
evictions :  0 
size :  0 
warmupTime :  0 
cumulative_lookups :  0
cumulative_hits :  0
cumulative_hitratio :  0.00
cumulative_inserts :  0
cumulative_evictions :  0

-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


> From: Yonik Seeley 
> Reply-To: 
> Date: Fri, 2 Oct 2009 10:04:27 -0400
> To: 
> Subject: Re: Solr Trunk Heap Space Issues
> 
> On Fri, Oct 2, 2009 at 10:02 AM, Mark Miller  wrote:
>> Jeff Newburn wrote:
>>> that side change enough to push up the memory limits where we would run out
>>> like this?
>>> 
>> Yes - now give us the FieldCache section from the stats section please :)
> 
> And the fieldValueCache section too (used for multi-valued faceting).
> 
> -Yonik
> http://www.lucidimagination.com



Re: Problem with Wildcard...

2009-10-02 Thread Christian Zambrano
Another thing to remember about wildcard and fuzzy searches is that none 
of the token filters will be applied.


If you are using the LowerCaseFilterFactory at index time, then 
"RI-MC50034-1" gets converted to "ri-mc50034-1" which is never going to 
match "RI-MC5000*"


Also, I would probably use the analyze page of your solr admin site to 
see what tokens are produced from "RI-MC500034-1" and "500034" based on 
your schema
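For what it's worth, a sketch of the kind of index-time chain where this bites (illustrative only, incorporating the preserveOriginal suggestion quoted below):

  <fieldType name="text" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateAll="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With LowerCaseFilterFactory in the chain, a wildcard query also has to be lowercased by the client (e.g. ri-mc5000*) to have any chance of matching, since wildcard terms skip analysis.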


On 10/01/2009 02:42 AM, Shalin Shekhar Mangar wrote:

On Tue, Sep 29, 2009 at 6:42 PM, Jörg Agatz wrote:

   

Hi Users...

i have a Problem

I have a lot of fields (type=text). To search in all fields I copy all
fields into the default text field and use this for the default search.

Now i will search...

This is into a Field

"RI-MC500034-1"
when i search "RI-MC500034-1" i found it...
if i seacht "RI-MC5000*" i dosen´t

when i search "500034" i found it...
if i seacht "5000*" i dosen´t

what can i do to use the Wildcards?

 

I guess one thing you need to do is to add preserveOriginal="true" in the
WordDelimiterFilterFactory section in your field type. That would help match
things like "RI-MC5000*". Make sure you re-index all documents after this
change.

As for the others, add debugQuery=on as a request parameter and see how the
query is being parsed. If you have a doubt, paste it on the list and we can
help you.

   


Re: JVM OOM when using field collapse component

2009-10-02 Thread Joe Calderon
Heap space is 4GB, set to grow up to 8GB; usage is normally ~1-2GB.
It seems to happen within a few searches.

If it's just me I'll try to isolate it; it could be some other part of
my implementation.

Thanks much

On Fri, Oct 2, 2009 at 1:18 AM, Martijn v Groningen
 wrote:
> No, I have not encountered an OOM exception yet with the current field collapse patch.
> How large is your configured JVM heap space (-Xmx)? Field collapsing
> requires more memory than regular searches do. Does Solr run out of
> memory during the first search(es), or does it run out of memory after
> a while, once it has performed quite a few field collapse searches?
>
> I see that you are also using the collapse.includeCollapsedDocs.fl
> parameter for your search. This feature will require more memory then
> a normal field collapse search.
>
> I normally give the Solr instance a heap space of 1024M when having an
> index of a few million.
>
> Martijn
>
> 2009/10/2 Joe Calderon :
>> i gotten two different out of memory errors while using the field
>> collapsing component, using the latest patch (2009-09-26) and the
>> latest nightly,
>>
>> has anyone else encountered similar problems? my collection is 5
>> million results but ive gotten the error collapsing as little as a few
>> thousand
>>
>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>        at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:173)
>>        at 
>> org.apache.lucene.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:749)
>>        at 
>> org.apache.lucene.util.OpenBitSet.ensureCapacity(OpenBitSet.java:757)
>>        at 
>> org.apache.lucene.util.OpenBitSet.expandingWordNum(OpenBitSet.java:292)
>>        at org.apache.lucene.util.OpenBitSet.set(OpenBitSet.java:233)
>>        at 
>> org.apache.solr.search.AbstractDocumentCollapser.addCollapsedDoc(AbstractDocumentCollapser.java:402)
>>        at 
>> org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:115)
>>        at 
>> org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
>>        at 
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>>        at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>        at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
>>        at 
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
>>        at 
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>        at 
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>        at 
>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>>        at 
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>>        at 
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>        at 
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>        at 
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>        at org.mortbay.jetty.Server.handle(Server.java:326)
>>        at 
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>>        at 
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
>>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
>>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>        at 
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>>        at 
>> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
>>
>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>        at 
>> org.apache.solr.util.DocSetScoreCollector.(DocSetScoreCollector.java:44)
>>        at 
>> org.apache.solr.search.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:68)
>>        at 
>> org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:205)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
>>        at 
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>>        at 
>> org.

TermVector term frequencies for tag cloud

2009-10-02 Thread aodhol
Hello,

I'm trying to create a tag cloud from a term vector, but the array
returned (using JSON wt) is quite complex and takes an inordinate
amount of time to process. Is there a better way to retrieve terms and
their document TF? The TermVectorComponent allows for retrieval of tf
and df though I'm only interested in TF. I know the TermsComponent
gives you DF, but I need TF!

Any suggestions,

Thanks,

Aodh.


snapshot creation and distribution

2009-10-02 Thread Robert . Kay

Hello,

A couple questions with regard to snapshots and distribution:

1. If two snapshots are created in between a snappull, are the changes from
the first snapshot "missed" by the slave, as it only pulls the most recent
snapshot?

2. When triggering snapshooter from the "postCommit" hook, does a commit
always result in a snapshot being created, or is there any kind of quiet
period?

Many thanks,
Rob.




Google Side-By-Side UI

2009-10-02 Thread Lance Norskog
http://googleenterprise.blogspot.com/2009/08/compare-enterprise-search-relevance.html

This is really cool, and a version for Solr would help in doing
relevance experiments. We don't need the "select A or B" feature, just
seeing search result sets side-by-side would be great.

-- 
Lance Norskog
goks...@gmail.com


Re: best way to get the size of an index

2009-10-02 Thread Mark Miller
Mark Miller wrote:
> Phillip Farber wrote:
>   
>> Resuming this discussion in a new thread to focus only on this question:
>>
>> What is the best way to get the size of an index so it does not get
>> too big to be optimized (or to allow a very large segment merge) given
>> space limits?
>>
>> I already have the largest 15,000rpm SCSI direct attached storage so
>> buying storage is not an option.  I don't do deletes.
>> 
> Even if you did do deletes, its not really a 3x problem - thats just
> theory - you'd have to work to get there. Deletes are merged out as you
> index additional docs as segments are merged over time. The 3x scenario
> brought up is more of a fun mind exercise than anything that would
> realistically happen.
>   
>
And for completeness, for those following along:

Let's say you did do some crazy deleting and deleted half the docs in
your index. Those docs stay around, and their ids are just added to a list
that keeps them from being "seen". Later, as natural merging
occurs, or if you force merges with an optimize, those deleted docs will
physically be removed. Let's then say you managed to re-add all of
those docs without any merging occurring while adding them (say
you wanted to see this effect so badly that you wrote and put in a custom
merge policy that doesn't find any segments to merge). Even if you do
all that, before you do the optimize, you're going to look at the size of
your index and see it's n GB. That's your current index size. Now say you
kick off the optimize. It's not even going to take 2x that n size to
optimize - this is because all those deletes will be removed as the
index is optimized down to one segment. It's going to take <2x.

This delete thing, as I said, is more of a fun mental exercise. It has
little relation to how much space you need to optimize in comparison to
how big your index is before optimizing. And it's really worse than a
worst-case scenario unless you write a custom merge policy, or crank
some settings insanely high and have enough RAM to do it (all the indexing
would have to take place in one huge segment in RAM that would then get
flushed).


-- 
- Mark

http://www.lucidimagination.com





Re: Solr Trunk Heap Space Issues

2009-10-02 Thread Jeff Newburn
I reran the test to try to ensure that other cores on the instance didn't
have searches against them.  This time I get NPE errors just trying to get
into the stats after the system hits its limit.
-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


> From: Jeff Newburn 
> Reply-To: 
> Date: Fri, 02 Oct 2009 08:28:44 -0700
> To: 
> Subject: Re: Solr Trunk Heap Space Issues
> 
> The warmers return 11 fields:
> 3 Strings
> 2 booleans
> 2 doubles
> 2 longs
> 1 sint (solr.SortableIntField)
> 
> Let me know if you need the fields actually be searched on.
> 
> name:  fieldCache  
> class:  org.apache.solr.search.SolrFieldCacheMBean  
> version:  1.0  
> description:  Provides introspection of the Lucene FieldCache, this is
> **NOT** a cache that is managed by Solr.  
> stats: entries_count :  0
> insanity_count :  0
>  
> name:  documentCache  
> class:  org.apache.solr.search.LRUCache  
> version:  1.0  
> description:  LRU Cache(maxSize=10, initialSize=75000)  
> stats: lookups :  22620
> hits :  337 
> hitratio :  0.01 
> inserts :  22282 
> evictions :  0 
> size :  22282 
> warmupTime :  0 
> cumulative_lookups :  22620
> cumulative_hits :  337
> cumulative_hitratio :  0.01
> cumulative_inserts :  22282
> cumulative_evictions :  0
> 
> 
> name:  fieldValueCache  
> class:  org.apache.solr.search.FastLRUCache  
> version:  1.0  
> description:  Concurrent LRU Cache(maxSize=1, initialSize=10,
> minSize=9000, acceptableSize=9500, cleanupThread=false)  
> stats: lookups :  0
> hits :  0 
> hitratio :  0.00 
> inserts :  0 
> evictions :  0 
> size :  0 
> warmupTime :  0 
> cumulative_lookups :  0
> cumulative_hits :  0
> cumulative_hitratio :  0.00
> cumulative_inserts :  0
> cumulative_evictions :  0
> 
> -- 
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
> 
> 
>> From: Yonik Seeley 
>> Reply-To: 
>> Date: Fri, 2 Oct 2009 10:04:27 -0400
>> To: 
>> Subject: Re: Solr Trunk Heap Space Issues
>> 
>> On Fri, Oct 2, 2009 at 10:02 AM, Mark Miller  wrote:
>>> Jeff Newburn wrote:
 that side change enough to push up the memory limits where we would run out
 like this?
 
>>> Yes - now give us the FieldCache section from the stats section please :)
>> 
>> And the fieldValueCache section too (used for multi-valued faceting).
>> 
>> -Yonik
>> http://www.lucidimagination.com
> 



Re: snapshot creation and distribution

2009-10-02 Thread Bill Au
A snapshot is a copy of the index at a particular moment in time.  So
changes in earlier snapshots are in the latest one as well.  Nothing is
missed by pulling the latest snapshot.

When triggering snapshooter with the postCommit hook, a commit always
results in a snapshot being created.

Bill

On Fri, Oct 2, 2009 at 11:52 AM,  wrote:

>
> Hello,
>
> A couple questions with regard to snapshots and distribution:
>
> 1. If two snapshots are created in between a snappull, are the changes from
> the first snapshot "missed" by the slave, as it only pulls the most recent
> snapshot?
>
> 2. When triggering snapshooter from the "postCommit" hook, does a commit
> always result in a snapshot being created, or is there any kind of quiet
> period?
>
> Many thanks,
> Rob.
>
>


Re: Google Side-By-Side UI

2009-10-02 Thread Yao Ge

Yes. I think it would be a very helpful tool for tuning search relevancy - you
can do a controlled experiment with your target audiences to understand
their responses to the parameter changes. We plan to use this feature to
benchmark Lucene/SOLR against our in-house commercial search engine - it
will be an interesting test.


Lance Norskog-2 wrote:
> 
> http://googleenterprise.blogspot.com/2009/08/compare-enterprise-search-relevance.html
> 
> This is really cool, and a version for Solr would help in doing
> relevance experiments. We don't need the "select A or B" feature, just
> seeing search result sets side-by-side would be great.
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Google-Side-By-Side-UI-tp25719087p25719806.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: TermVector term frequencies for tag cloud

2009-10-02 Thread Bill Au
Have you considered using facet counts for your tag cloud?
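
A rough SolrJ sketch of that approach; the multi-valued field name "tags" and
the SolrServer variable are assumptions, and note that facet counts are
per-document counts over the result set rather than true term frequencies:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TagCloud {
    public static void print(SolrServer server) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);             // only the counts are needed, not the docs
        q.setFacet(true);
        q.addFacetField("tags");
        q.setFacetMinCount(1);
        q.setFacetLimit(50);      // top 50 tags for the cloud
        QueryResponse rsp = server.query(q);
        FacetField ff = rsp.getFacetField("tags");
        if (ff != null && ff.getValues() != null) {
            for (FacetField.Count c : ff.getValues()) {
                System.out.println(c.getName() + " => " + c.getCount());
            }
        }
    }
}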

Bill

On Fri, Oct 2, 2009 at 11:34 AM,  wrote:

> Hello,
>
> I'm trying to create a tag cloud from a term vector, but the array
> returned (using JSON wt) is quite complex and takes an inordinate
> amount of time to process. Is there a better way to retrieve terms and
> their document TF? The TermVectorComponent allows for retrieval of tf
> and df though I'm only interested in TF. I know the TermsComponent
> gives you DF, but I need TF!
>
> Any suggestions,
>
> Thanks,
>
> Aodh.
>


Question about PatternReplace filter and automatic Synonym generation

2009-10-02 Thread Prasanna Ranganathan

 Does the PatternReplaceFilter have an option to keep the
original token in addition to the modified token? From what I have looked at,
it does not seem to, but I want to confirm.

Alternatively, is there a filter available which takes in a pattern and
produces additional forms of the token depending on the pattern? The use
case I am looking at here is using such a filter to automate synonym
generation. In our application, quite a few of the synonym file entries
match a specific pattern, and having such a filter would make it easier, I
believe. Please do correct me in case I am missing some unwanted side-effect
of this approach.

Continuing on that line, what is the performance hit of having additional
index-time filters as opposed to using a synonym file with more entries? How
does the overhead of a bigger synonym file compare with that of additional
filters?

Thanks in advance for the help.

Regards,

Prasanna.


Question regarding synonym

2009-10-02 Thread darniz

Hi,
I have a question regarding the SynonymFilter.
I have a one-way mapping defined:
austin martin, astonmartin => aston martin

What is baffling me is that if I give the words "austin martin" at query time,

they first go through the whitespace tokenizer, which generates two words in
the analysis page: "austin" and "martin",

then after the synonym filter they are replaced with the words
aston martin

That's good and that's what I want, but I am wondering: since the input went
through the whitespace tokenizer first and was split into the two different
words "austin" and "martin", how come it was still able to match the entire
synonym and replace it? If I give only "austin", then after passing through
the synonym filter it does not replace it with "aston".
That leads me to conclude that even though "austin martin" went through the
WhitespaceTokenizerFactory and got split in two, the word order is still
preserved when looking for a synonym match.

Can anybody please confirm whether my observation is correct? This is a very
critical aspect of my work.

Thanks
darniz
-- 
View this message in context: 
http://www.nabble.com/Question-regarding-synonym-tp25720572p25720572.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to access the information from SolrJ

2009-10-02 Thread Shalin Shekhar Mangar
On Fri, Oct 2, 2009 at 8:11 PM, Paul Tomblin  wrote:

> Nope, that just gets you the number of results returned, not how many
> there could be.  Like I said, if you look at the XML returned, you'll
> see something like
> 
> but only 10  returned.  getNumFound returns 10 in that case, not 1251.
>
>
>
Nope. Check again. getNumFound will definitely give you 1251.
SolrDocumentList#size() will give you 10.
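
A small SolrJ sketch of the difference; the query string and the SolrServer
variable are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class NumFoundExample {
    public static void show(SolrServer server) throws Exception {
        SolrQuery q = new SolrQuery("test");
        q.setRows(10);                       // only 10 docs come back...
        QueryResponse rsp = server.query(q);
        SolrDocumentList docs = rsp.getResults();
        long total = docs.getNumFound();     // ...but this is the total match count, e.g. 1251
        int returned = docs.size();          // this is 10 (or fewer)
        System.out.println(total + " matched, " + returned + " returned");
    }
}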

-- 
Regards,
Shalin Shekhar Mangar.


Re: How to access the information from SolrJ

2009-10-02 Thread Paul Tomblin
On Fri, Oct 2, 2009 at 3:13 PM, Shalin Shekhar Mangar
 wrote:
> On Fri, Oct 2, 2009 at 8:11 PM, Paul Tomblin  wrote:
>
>> Nope, that just gets you the number of results returned, not how many
>> there could be.  Like I said, if you look at the XML returned, you'll
>> see something like
>> 
>> but only 10  returned.  getNumFound returns 10 in that case, not 1251.
>>
>>
>>
> Nope. Check again. getNumFound will definitely give you 1251.
> SolrDocumentList#size() will give you 10.

I don't have to check again.  I put this log into my query code:
QueryResponse resp = solrChunkServer.query(query);
SolrDocumentList docs = resp.getResults();
LOG.debug("got " + docs.getNumFound() + " documents (or "
+ docs.size() + " if you prefer)");
and I got exactly the same number in both places every single time.  I
can verify from the URL line that the following query:

http://test.xcski.com:8080/solrChunk/nutch/select/?q=test&fq=category:pharma&fq=concept:Discovery&rows=5

has a  but when I do
the same in SolrJ, docs.getNumFound() returns 5.

144652 [http-8080-14] DEBUG com.lucidityworks.solr.Solr  - got 5
documents (or 5 if you prefer)


-- 
http://www.linkedin.com/in/paultomblin


Re: How to access the information from SolrJ

2009-10-02 Thread Adam Allgaier
We have the same issue as Paul.  We currently parse the XML manually to pull 
out the numFound from the response.

Cheers!
Adam



- Original Message 
From: Paul Tomblin 
To: solr-user@lucene.apache.org
Sent: Friday, October 2, 2009 2:39:01 PM
Subject: Re: How to access the information from SolrJ

On Fri, Oct 2, 2009 at 3:13 PM, Shalin Shekhar Mangar
 wrote:
> On Fri, Oct 2, 2009 at 8:11 PM, Paul Tomblin  wrote:
>
>> Nope, that just gets you the number of results returned, not how many
>> there could be.  Like I said, if you look at the XML returned, you'll
>> see something like
>> 
>> but only 10  returned.  getNumFound returns 10 in that case, not 1251.
>>
>>
>>
> Nope. Check again. getNumFound will definitely give you 1251.
> SolrDocumentList#size() will give you 10.

I don't have to check again.  I put this log into my query code:
QueryResponse resp = solrChunkServer.query(query);
SolrDocumentList docs = resp.getResults();
LOG.debug("got " + docs.getNumFound() + " documents (or "
+ docs.size() + " if you prefer)");
and I got exactly the same number in both places every single time.  I
can verify from the URL line that the following query:

http://test.xcski.com:8080/solrChunk/nutch/select/?q=test&fq=category:pharma&fq=concept:Discovery&rows=5

has a  but when I do
the same in SolrJ, docs.getNumFound() returns 5.

144652 [http-8080-14] DEBUG com.lucidityworks.solr.Solr  - got 5
documents (or 5 if you prefer)


-- 
http://www.linkedin.com/in/paultomblin



  


search by some functionality

2009-10-02 Thread Elaine Li
Hi,

My doc has three fields, say field1, field2, field3.

My search would be q=field1:string1 && field2:string2. I also need to
do some computation and comparison of the string1 and string2 with the
contents in field3 and then determine if it is a hit.

What can I do to implement this?

Thanks.

Elaine


Invoke "expungeDeletes" using SolrJ's SolrServer.commit()

2009-10-02 Thread Jibo John

Hello,

I know I can invoke expungeDeletes using the update handler (curl update -F
stream.body='<commit expungeDeletes="true"/>'), however, I was
wondering if it is possible to invoke it using SolrJ.


It looks like, currently, there are no SolrServer.commit(..) methods  
that I can use for this purpose.


Any input will be helpful.


Thanks,
-Jibo




Advantages of different Servlet Containers

2009-10-02 Thread Simon Wistow
I know that the Solr FAQ says 

"Users should decide for themselves which Servlet Container they 
consider the easiest/best for their use cases based on their 
needs/experience. For high traffic scenarios, investing time for tuning 
the servlet container can often make a big difference."

but is there anywhere that lists some of the various advantages and 
disadvantages of, say, Tomcat over Jetty for someone who isn't current 
with the Java ecosystem?

Also, I'm currently using Jetty but I've had to do a horrific hack to 
make it work under init.d in that I start it up in the background and 
then tail the output waiting for the line that says the SocketConnector 
has been started

   while [ '' = "$(tail -1 $LOG | grep 'Started SocketConnector')"  ] ; 
   do
   sleep 1
   done

There's *got* to be a better way of doing this, right? 

Thanks,

Simon




Specifying "all except field" in field list?

2009-10-02 Thread Paul Rosen

Hi,

Is there a way to request all fields in an object EXCEPT a particular 
one? In other words, the following pseudo code is what I'd like to express:


req = Solr::Request::Standard.new(:start => page*size, :rows => size,
:query => my_query, :field_list => [ ALL EXCEPT 'text' ])

Is there a way to say that?

I know I could figure out all possible fields and make an array of them, 
but that list is likely to change over time and I'm sure to forget to 
update it.


I need to do that because the text field is not needed and is likely to 
be really large so the queries will be much faster if it isn't returned.


Thanks,
Paul


Re: Advantages of different Servlet Containers

2009-10-02 Thread Lajos
Just go for Tomcat. For all its problems, and I should know having used 
it since it was originally JavaWebServer, it is perfectly capable of 
handling high-end production environments provided you tune it 
correctly. We use it with our customized Solr 1.3 version without any 
problems.


Lajos


Simon Wistow wrote:
I know that the Solr FAQ says 

"Users should decide for themselves which Servlet Container they 
consider the easiest/best for their use cases based on their 
needs/experience. For high traffic scenarios, investing time for tuning 
the servlet container can often make a big difference."


but is there anywhere that lists some of the variosu advantages and 
disadvantages of, say, Tomcat over Jetty for someone who isn't current 
with the Java ecosystem?


Also, I'm currently using Jetty but I've had to do a horrific hack to 
make it work under init.d in that I start it up in the background and 
then tail the output waiting for the line that says the SocketConnector 
has been started


   while [ '' = "$(tail -1 $LOG | grep 'Started SocketConnector')"  ] ; 
   do

   sleep 1
   done

There's *got* to be a better way of doing this, right? 


Thanks,

Simon










RE: Advantages of different Servlet Containers

2009-10-02 Thread Walter Underwood
Netflix uses Tomcat throughout and they tail the log to figure out whether 
it has started, except they look for a message from Solr to see whether 
Solr is ready to go to work.

wunder

-Original Message-
From: Lajos [mailto:la...@protulae.com] 
Sent: Friday, October 02, 2009 1:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Advantages of different Servlet Containers

Just go for Tomcat. For all its problems, and I should know having used 
it since it was originally JavaWebServer, it is perfectly capable of 
handling high-end production environments provided you tune it 
correctly. We use it with our customized Solr 1.3 version without any 
problems.

Lajos


Simon Wistow wrote:
> I know that the Solr FAQ says 
> 
> "Users should decide for themselves which Servlet Container they 
> consider the easiest/best for their use cases based on their 
> needs/experience. For high traffic scenarios, investing time for tuning 
> the servlet container can often make a big difference."
> 
> but is there anywhere that lists some of the variosu advantages and 
> disadvantages of, say, Tomcat over Jetty for someone who isn't current 
> with the Java ecosystem?
> 
> Also, I'm currently using Jetty but I've had to do a horrific hack to 
> make it work under init.d in that I start it up in the background and 
> then tail the output waiting for the line that says the SocketConnector 
> has been started
> 
>while [ '' = "$(tail -1 $LOG | grep 'Started SocketConnector')"  ] ; 
>do
>sleep 1
>done
> 
> There's *got* to be a better way of doing this, right? 
> 
> Thanks,
> 
> Simon
> 
> 
> 
> 
> 
> 




RE: Question regarding synonym

2009-10-02 Thread Ensdorf Ken
> Hi
> i have a question regarding synonymfilter
> i have a one way mapping defined
> austin martin, astonmartin => aston martin
> 
...
> 
> Can anybody please explain if my observation is correct. This is a very
> critical aspect for my work.

That is correct - the synonym filter can recognize multi-token synonyms from 
consecutive tokens in a stream.



Re: How to access the information from SolrJ

2009-10-02 Thread Shalin Shekhar Mangar
On Sat, Oct 3, 2009 at 1:09 AM, Paul Tomblin  wrote:

> >>
> > Nope. Check again. getNumFound will definitely give you 1251.
> > SolrDocumentList#size() will give you 10.
>
> I don't have to check again.  I put this log into my query code:
>QueryResponse resp = solrChunkServer.query(query);
>SolrDocumentList docs = resp.getResults();
>LOG.debug("got " + docs.getNumFound() + " documents (or "
> + docs.size() + " if you prefer)");
> and I got exactly the same number in both places every single time.  I
> can verify from the URL line that the following query:
>
>
> http://test.xcski.com:8080/solrChunk/nutch/select/?q=test&fq=category:pharma&fq=concept:Discovery&rows=5
>
> has a  but when I do
> the same in SolrJ, docs.getNumFound() returns 5.
>
> 144652 [http-8080-14] DEBUG com.lucidityworks.solr.Solr  - got 5
> documents (or 5 if you prefer)
>
>
I can tell you for sure that this is not a bug in Solr 1.3 or trunk. I
checked the code and it is being set correctly. Moreover, I'm using both in
production.

The class com.lucidityworks.solr.Solr suggests that you are using Lucid's
solr build. Perhaps that has a bug? Can you try this with the Solrj client
in the official 1.3 release or even trunk?

-- 
Regards,
Shalin Shekhar Mangar.


Re: Invoke "expungeDeletes" using SolrJ's SolrServer.commit()

2009-10-02 Thread Shalin Shekhar Mangar
On Sat, Oct 3, 2009 at 1:35 AM, Jibo John  wrote:

> Hello,
>
> I know I can invoke expungeDeletes using updatehandler  ( curl update -F
> stream.body=' ' ), however, I was wondering
> if it is possible to invoke it using SolrJ.
>
> It looks like, currently, there are no SolrServer.commit(..) methods that I
> can use for this purpose.
>
> Any input will be helpful.
>
>
You are right. Please create an issue. We need this in 1.4

-- 
Regards,
Shalin Shekhar Mangar.


Re: How to access the information from SolrJ

2009-10-02 Thread Paul Tomblin
LucidityWorks.com is my client.  The similarity to lucid is purely coincidental 
- the client didn't even know I was going to choose Solr.  I am using Solr 
trunk, last updated and compiled a few weeks ago.

-- Sent from my Palm Prē
Shalin Shekhar Mangar wrote:

On Sat, Oct 3, 2009 at 1:09 AM, Paul Tomblin  wrote:

> >>
> > Nope. Check again. getNumFound will definitely give you 1251.
> > SolrDocumentList#size() will give you 10.
>
> I don't have to check again.  I put this log into my query code:
>QueryResponse resp = solrChunkServer.query(query);
>SolrDocumentList docs = resp.getResults();
>LOG.debug("got " + docs.getNumFound() + " documents (or "
> + docs.size() + " if you prefer)");
> and I got exactly the same number in both places every single time.  I
> can verify from the URL line that the following query:
>
> http://test.xcski.com:8080/solrChunk/nutch/select/?q=test&fq=category:pharma&fq=concept:Discovery&rows=5
>
> has a  but when I do
> the same in SolrJ, docs.getNumFound() returns 5.
>
> 144652 [http-8080-14] DEBUG com.lucidityworks.solr.Solr  - got 5
> documents (or 5 if you prefer)
>

I can tell you for sure that this is not a bug in Solr 1.3 or trunk. I
checked the code and it is being set correctly. Moreover, I'm using both in
production.

The class com.lucidityworks.solr.Solr suggests that you are using Lucid's
solr build. Perhaps that has a bug? Can you try this with the Solrj client
in the official 1.3 release or even trunk?

-- 
Regards,
Shalin Shekhar Mangar.




Re: Advantages of different Servlet Containers

2009-10-02 Thread Shalin Shekhar Mangar
AOL uses Tomcat for all Solr deployments. Our load balancers use a ping
query to put a box back into rotation.

On Sat, Oct 3, 2009 at 2:15 AM, Walter Underwood wrote:

> Netflix uses Tomcat throuought and they tail the log to figure out whether
> it has started, except they look for a message from Solr to see whether
> Solr is ready to go to work.
>
> wunder
>
> -Original Message-
> From: Lajos [mailto:la...@protulae.com]
> Sent: Friday, October 02, 2009 1:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Advantages of different Servlet Containers
>
> Just go for Tomcat. For all its problems, and I should know having used
> it since it was originally JavaWebServer, it is perfectly capable of
> handling high-end production environments provided you tune it
> correctly. We use it with our customized Solr 1.3 version without any
> problems.
>
> Lajos
>
>
> Simon Wistow wrote:
> > I know that the Solr FAQ says
> >
> > "Users should decide for themselves which Servlet Container they
> > consider the easiest/best for their use cases based on their
> > needs/experience. For high traffic scenarios, investing time for tuning
> > the servlet container can often make a big difference."
> >
> > but is there anywhere that lists some of the variosu advantages and
> > disadvantages of, say, Tomcat over Jetty for someone who isn't current
> > with the Java ecosystem?
> >
> > Also, I'm currently using Jetty but I've had to do a horrific hack to
> > make it work under init.d in that I start it up in the background and
> > then tail the output waiting for the line that says the SocketConnector
> > has been started
> >
> >while [ '' = "$(tail -1 $LOG | grep 'Started SocketConnector')"  ] ;
> >do
> >sleep 1
> >done
> >
> > There's *got* to be a better way of doing this, right?
> >
> > Thanks,
> >
> > Simon
> >
> >
> >
> > 
> >
> >
> >
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: best way to get the size of an index

2009-10-02 Thread Phillip Farber

Thanks, Mark. I really appreciate your confirmation.

Phil

Mark Miller wrote:

Phillip Farber wrote:

Resuming this discussion in a new thread to focus only on this question:

What is the best way to get the size of an index so it does not get
too big to be optimized (or to allow a very large segment merge) given
space limits?

I already have the largest 15,000rpm SCSI direct attached storage so
buying storage is not an option.  I don't do deletes.

Even if you did do deletes, its not really a 3x problem - thats just
theory - you'd have to work to get there. Deletes are merged out as you
index additional docs as segments are merged over time. The 3x scenario
brought up is more of a fun mind exercise than anything that would
realistically happen.

From what I've read, I expect no more than a 2x increase during
optimization and have not seen more in practice.

I'm thinking: stop indexing, commit, do a du.

Will this give me the number I need for what I'm trying to do? Is
there a better way?

Should work fine. When you do the commit, onCommit will be called on the
IndexDeletionPolicy, and all of the "snapshots" of the index other than
the latest one will be removed. You should have a clean index to gauge
the size with. Using something like Java Replication complicates this
though - in that case, older commit points can be reserved while they
are being copied.

Phil





Re: Invoke "expungeDeletes" using SolrJ's SolrServer.commit()

2009-10-02 Thread Yonik Seeley
You can always add arbitrary parameters to an update request:

UpdateRequest ureq = new UpdateRequest();
ureq.add(doc);
ureq.setParam("expungeDeletes","true");
NamedList rsp = server.request(ureq);


-Yonik
http://www.lucidimagination.com


On Fri, Oct 2, 2009 at 4:05 PM, Jibo John  wrote:
> Hello,
>
> I know I can invoke expungeDeletes using updatehandler  ( curl update -F
> stream.body=' ' ), however, I was wondering
> if it is possible to invoke it using SolrJ.
>
> It looks like, currently, there are no SolrServer.commit(..) methods that I
> can use for this purpose.
>
> Any input will be helpful.
>
>
> Thanks,
> -Jibo
>
>
>


Re: Invoke "expungeDeletes" using SolrJ's SolrServer.commit()

2009-10-02 Thread Jibo John

Created jira issue https://issues.apache.org/jira/browse/SOLR-1487

Thanks,
-Jibo

On Oct 2, 2009, at 2:17 PM, Shalin Shekhar Mangar wrote:


On Sat, Oct 3, 2009 at 1:35 AM, Jibo John  wrote:


Hello,

I know I can invoke expungeDeletes using updatehandler  ( curl  
update -F
stream.body=' ' ), however, I was  
wondering

if it is possible to invoke it using SolrJ.

It looks like, currently, there are no SolrServer.commit(..)  
methods that I

can use for this purpose.

Any input will be helpful.



You are right. Please create an issue. We need this in 1.4

--
Regards,
Shalin Shekhar Mangar.




Re: conditional sorting

2009-10-02 Thread Lance Norskog
Doing a second search immediately after the first one is consistently
under 100 ms for me, usually under 25, on cheap hardware.  Even while
sorting the results, you should have no problems. If necessary, you
could run Solr with the embedded client and do one search right after
the other, avoiding the thread-switching that can happen between HTTP
requests.
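
A rough SolrJ sketch of the two-query approach; the field names, the 10-hit
threshold and the SolrServer variable are just placeholders taken from the
example in this thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ConditionalSort {
    public static QueryResponse search(SolrServer server, String userQuery)
            throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        q.addSortField("popularity", SolrQuery.ORDER.desc);
        QueryResponse rsp = server.query(q);
        if (rsp.getResults().getNumFound() < 10) {
            // too few hits: re-run the same query sorted by date_entered instead
            q = new SolrQuery(userQuery);
            q.addSortField("date_entered", SolrQuery.ORDER.desc);
            rsp = server.query(q);
        }
        return rsp;
    }
}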

On Fri, Oct 2, 2009 at 8:15 AM, Bojan Šmid  wrote:
> I tried to simplify the problem, but the point is that I could have really
> complex requirements. For instance, "if in the first 5 results none are
> older than one year, use sort by X, otherwise sort by Y".
>
> So, the question is, is there a way to make Solr recognize complex
> situations and apply different sorting criterion.
>
> Bojan
>
>
> On Fri, Oct 2, 2009 at 4:22 PM, Uri Boness  wrote:
>
>> If the threshold is only 10, why can't you always sort by popularity and if
>> the result set is <10 then resort on the client side based on date_entered?
>>
>> Uri
>>
>>
>> Bojan Šmid wrote:
>>
>>> Hi all,
>>>
>>> I need to perform sorting of my query hits by different criterion
>>> depending
>>> on the number of hits. For instance, if there are < 10 hits, sort by
>>> date_entered, otherwise, sort by popularity.
>>>
>>> Does anyone know if there is a way to do that with a single query, or I'll
>>> have to send another query with desired sort criterion after I inspect
>>> number of hits on my client?
>>>
>>> Thx
>>>
>>>
>>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: How to access the information from SolrJ

2009-10-02 Thread Paul Tomblin
On Fri, Oct 2, 2009 at 5:04 PM, Shalin Shekhar Mangar
 wrote:
> Can you try this with the Solrj client
> in the official 1.3 release or even trunk?

I did a svn update to 821188 and that seems to have fixed the problem.
 (The jar files changed from -1.3.0 to -1.4-dev)  I guess it's been
longer since I did an update than I thought.

logs/catalina.out:318916 [http-8080-1] DEBUG
com.lucidityworks.solr.Solr  - got 138 documents (or 15 if you prefer)

Thanks very much.


-- 
http://www.linkedin.com/in/paultomblin


RE: Question regarding synonym

2009-10-02 Thread darniz

This is not working when I search documents. I have a document which contains
the text aston martin.

When I search carDescription:"austin martin" I get a match, but when I don't
use double quotes,

like carDescription:austin martin,
there is no match.

In the analyzer, if I give austin martin without quotes, it matches aston
martin after passing through the synonym filter;
maybe by default the analyzer treats it as the phrase "austin martin", but
when I try to do a query by typing
carDescription:austin martin I get 0 documents. The following is the debug
node info with debugQuery=on:

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin text:martin

I don't know why it breaks up the words; maybe it is the desired behaviour.
When I give carDescription:"austin martin", of course, it is able to map
to the synonym and I get the desired result.

Any opinion?

darniz



Ensdorf Ken wrote:
> 
>> Hi
>> i have a question regarding synonymfilter
>> i have a one way mapping defined
>> austin martin, astonmartin => aston martin
>> 
> ...
>> 
>> Can anybody please explain if my observation is correct. This is a very
>> critical aspect for my work.
> 
> That is correct - the synonym filter can recognize multi-token synonyms
> from consecutive tokens in a stream.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Question-regarding-synonym-tp25720572p25723829.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Question regarding synonym

2009-10-02 Thread Christian Zambrano
When you use a field qualifier (fieldName:valueToLookFor) it only applies 
to the word right after the colon. If you look at the debug 
information you will notice that for the second word it is using the 
default field.


carDescription:austin *text*:martin

The following should work:

carDescription:(austin martin)


On 10/02/2009 05:46 PM, darniz wrote:

This is not working when i search documents i have a document which contains
text aston martin

when i search carDescription:"austin martin" i get a match but when i dont
give double quotes

like carDescription:austin martin
there is no match

in the analyser if i give austin martin with out quotes, when it passes
through synonym filter it matches aston martin ,
may be by default analyser treats it as a phrase "austin martin" but when i
try to do a query by typing
carDescription:austin martin i get 0 documents. the following is the debug
node info with debugQuery=on

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin text:martin

dont know why it breaks the word, may be its a desired behaviour
when i give carDescription:"austin martin" of course in this its able to map
to synonym and i get the desired result

Any opinion

darniz



Ensdorf Ken wrote:
   
 

Hi
i have a question regarding synonymfilter
i have a one way mapping defined
austin martin, astonmartin =>  aston martin

   

...
 

Can anybody please explain if my observation is correct. This is a very
critical aspect for my work.
   

That is correct - the synonym filter can recognize multi-token synonyms
from consecutive tokens in a stream.



 
   


Re: Question regarding synonym

2009-10-02 Thread darniz

Thanks.
As I said, it also works when I use double quotes,
like carDescription:"austin martin"

So is the conclusion that in order to match a two-word synonym I have to
always enclose it in double quotes, so that it does not split the words?











Christian Zambrano wrote:
> 
> When you use a field qualifier(fieldName:valueToLookFor) it only applies 
> to the word right after the semicolon. If you look at the debug 
> infomation you will notice that for the second word it is using the 
> default field.
> 
> carDescription:austin *text*:martin
> 
> the following should word:
> 
> carDescription:(austin martin)
> 
> 
> On 10/02/2009 05:46 PM, darniz wrote:
>> This is not working when i search documents i have a document which
>> contains
>> text aston martin
>>
>> when i search carDescription:"austin martin" i get a match but when i
>> dont
>> give double quotes
>>
>> like carDescription:austin martin
>> there is no match
>>
>> in the analyser if i give austin martin with out quotes, when it passes
>> through synonym filter it matches aston martin ,
>> may be by default analyser treats it as a phrase "austin martin" but when
>> i
>> try to do a query by typing
>> carDescription:austin martin i get 0 documents. the following is the
>> debug
>> node info with debugQuery=on
>>
>> carDescription:austin martin
>> carDescription:austin martin
>> carDescription:austin text:martin
>> carDescription:austin text:martin
>>
>> dont know why it breaks the word, may be its a desired behaviour
>> when i give carDescription:"austin martin" of course in this its able to
>> map
>> to synonym and i get the desired result
>>
>> Any opinion
>>
>> darniz
>>
>>
>>
>> Ensdorf Ken wrote:
>>
>>>  
 Hi
 i have a question regarding synonymfilter
 i have a one way mapping defined
 austin martin, astonmartin =>  aston martin


>>> ...
>>>  
 Can anybody please explain if my observation is correct. This is a very
 critical aspect for my work.

>>> That is correct - the synonym filter can recognize multi-token synonyms
>>> from consecutive tokens in a stream.
>>>
>>>
>>>
>>>  
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Question-regarding-synonym-tp25720572p25723980.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Specifying "all except field" in field list?

2009-10-02 Thread Lance Norskog
No, there is only "list of fields", star, and score.  You can choose
to index it and not store it, and then have your application fetch it
from the original data store. This is a common system design pattern
to avoid storing giant text blobs in the index.

http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams
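
A small SolrJ sketch of enumerating the wanted fields (the original question
uses the Ruby client, but the idea is the same; the field names and the
SolrServer variable are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FieldListExample {
    public static QueryResponse search(SolrServer server, String userQuery)
            throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        // fl has no "all except" form, so list everything except the big "text" field
        q.setFields("id", "title", "author", "score");
        return server.query(q);
    }
}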

On Fri, Oct 2, 2009 at 1:27 PM, Paul Rosen  wrote:
> Hi,
>
> Is there a way to request all fields in an object EXCEPT a particular one?
> In other words, the following pseudo code is what I'd like to express:
>
> req = Solr::Request::Standard.new(:start => page*size, :rows => size,
> :query => my_query, :field_list => [ ALL EXCEPT 'text' ])
>
> Is there a way to say that?
>
> I know I could figure out all possible fields and make an array of them, but
> that list is likely to change over time and I'm sure to forget to update it.
>
> I need to do that because the text field is not needed and is likely to be
> really large so the queries will be much faster if it isn't returned.
>
> Thanks,
> Paul
>



-- 
Lance Norskog
goks...@gmail.com


Re: Specifying "all except field" in field list?

2009-10-02 Thread Paul Rosen

Thanks, Lance, for the quick reply.

Well, unfortunately, we need the highlighting feature on that field, so 
I think we have to store it.


It's not a big deal, it just seemed like something that would be useful 
and probably be easy to implement, so I figured I just missed it.


Alternately, is there a way, through Solr, to get a list of all the 
fields that any object could possibly return? I suppose I could just 
read the schema.xml file and read all the <field> tags, but I'd miss the 
fields created by <dynamicField>. I'm hoping I can query the index to 
get that info.


Lance Norskog wrote:

No, there is only "list of fields", star, and score.  You can choose
to index it and not store it, and then have your application fetch it
from the original data store. This is a common system design pattern
to avoid storing giant text blobs in the index.

http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams

On Fri, Oct 2, 2009 at 1:27 PM, Paul Rosen  wrote:

Hi,

Is there a way to request all fields in an object EXCEPT a particular one?
In other words, the following pseudo code is what I'd like to express:

req = Solr::Request::Standard.new(:start => page*size, :rows => size,
:query => my_query, :field_list => [ ALL EXCEPT 'text' ])

Is there a way to say that?

I know I could figure out all possible fields and make an array of them, but
that list is likely to change over time and I'm sure to forget to update it.

I need to do that because the text field is not needed and is likely to be
really large so the queries will be much faster if it isn't returned.

Thanks,
Paul









Re: Advantages of different Servlet Containers

2009-10-02 Thread Joshua Tuberville
Simon,

Have you tried the bin/jetty.sh script that comes with Jetty  
distributions?  It contains the standard start|stop|restart functions.

Joshua

On Oct 2, 2009, at 1:11 PM, Simon Wistow wrote:

> I know that the Solr FAQ says
>
> "Users should decide for themselves which Servlet Container they
> consider the easiest/best for their use cases based on their
> needs/experience. For high traffic scenarios, investing time for  
> tuning
> the servlet container can often make a big difference."
>
> but is there anywhere that lists some of the variosu advantages and
> disadvantages of, say, Tomcat over Jetty for someone who isn't current
> with the Java ecosystem?
>
> Also, I'm currently using Jetty but I've had to do a horrific hack to
> make it work under init.d in that I start it up in the background and
> then tail the output waiting for the line that says the  
> SocketConnector
> has been started
>
>   while [ '' = "$(tail -1 $LOG | grep 'Started SocketConnector')"  ] ;
>   do
>   sleep 1
>   done
>
> There's *got* to be a better way of doing this, right?
>
> Thanks,
>
> Simon
>
>



Re: Specifying "all except field" in field list?

2009-10-02 Thread Lance Norskog
Maybe the TermsComponent?

You can't ask for facets with a wildcard in the field name. This would
do the trick. It's an issue in JIRA, if you want to vote for it.

http://issues.apache.org/jira/browse/SOLR-247
http://issues.apache.org/jira/browse/SOLR-1387

On Fri, Oct 2, 2009 at 6:36 PM, Paul Rosen  wrote:
> Thanks, Lance, for the quick reply.
>
> Well, unfortunately, we need the highlighting feature on that field, so I
> think we have to store it.
>
> It's not a big deal, it just seemed like something that would be useful and
> probably be easy to implement, so I figured I just missed it.
>
> Alternately, is there a way, through solr, to get a list of all the fields
> that any object could possibly return? I suppose I could just read the
> schema.xml file and read all the  tags, but I'd miss the fields
> created by . I'm hoping I can query the index to get that
> info.
>
> Lance Norskog wrote:
>>
>> No, there is only "list of fields", star, and score.  You can choose
>> to index it and not store it, and then have your application fetch it
>> from the original data store. This is a common system design pattern
>> to avoid storing giant text blobs in the index.
>>
>> http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams
>>
>> On Fri, Oct 2, 2009 at 1:27 PM, Paul Rosen 
>> wrote:
>>>
>>> Hi,
>>>
>>> Is there a way to request all fields in an object EXCEPT a particular
>>> one?
>>> In other words, the following pseudo code is what I'd like to express:
>>>
>>> req = Solr::Request::Standard.new(:start => page*size, :rows => size,
>>> :query => my_query, :field_list => [ ALL EXCEPT 'text' ])
>>>
>>> Is there a way to say that?
>>>
>>> I know I could figure out all possible fields and make an array of them,
>>> but
>>> that list is likely to change over time and I'm sure to forget to update
>>> it.
>>>
>>> I need to do that because the text field is not needed and is likely to
>>> be
>>> really large so the queries will be much faster if it isn't returned.
>>>
>>> Thanks,
>>> Paul
>>>
>>
>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com