Filtering results based on score

2010-11-01 Thread sivaprasad

Hi,
As part of the Solr results I am able to get the max score. I want to filter
the results based on the max score: let's say the max score is 10, and I need
only the results scoring between the max score and 50% of the max score. This max score is
going to change dynamically. How can we implement this? Do we need to
customize Solr? Any suggestions, please.


Regards,
JS
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-results-based-on-score-tp1819769p1819769.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Relevency Calculation

2010-11-01 Thread sivaprasad

Hi,
I have 25 indexed fields in my document. But by default, if I give
"q=laptops", this is going to search on five fields, and I am getting the score
as part of the search results. How will Solr calculate the score? Is it going to
calculate it only on the five fields searched, or on all 25 fields which are indexed? In what
order will the fields be taken when calculating the score? Any documents related to
this topic would be helpful for me.

Regards,
JS
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Relevency-Calculation-tp1819798p1819798.html
Sent from the Solr - User mailing list archive at Nabble.com.


Boosting the score based on certain field

2010-11-01 Thread sivaprasad

Hi,

In my document I have a field called category. It contains values such as
"electronics", "games", etc. For some of the category values I need to boost
the document score. Let us say that for the "electronics" category I will set the
boosting parameter greater than for the "games" category. Does anybody have an
idea how to achieve this functionality?

Regards,
Siva


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Boosting-the-score-based-on-certain-field-tp1819820p1819820.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering results based on score

2010-11-01 Thread Ahmet Arslan
> As part of solr results i am able to get the max score.If i
> want to filter
> the results based on the max score, let say the max
> score  is 10 And i need
> only the results between max score  to 50 % of max
> score.This max score is
> going to change dynamically.How can we implement this?Do we
> need to
> customize the solr?Please any suggestions.

frange is advised in a similar discussion:
http://search-lucene.com/m/4AHNF17wIJW1/
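
A hedged sketch of how that could look as a two-request flow, assuming a Solr version where frange and the query() function are available (URLs and the example query are illustrative; the 50%-of-max threshold has to be computed by the client from the first response):

  # 1) run the query once and read maxScore from the result element
  http://localhost:8983/solr/select?q=ipod&fl=*,score

  # 2) if maxScore came back as 10, keep only documents scoring at least 50% of it
  http://localhost:8983/solr/select?q=ipod&fl=*,score&fq={!frange l=5.0}query($q)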





Multiple Keyword Search

2010-11-01 Thread Pawan Darira
Hi

There is a situation where I search for more than one keyword, and my two main
fields are ad_title and ad_description.
I want the results which match all of the keywords in both fields to
come on top. Then, one keyword at a time can be dropped in the results that follow.

E.g. in a search of 3 keywords, say there are 100 results. If 35 contain all
the keywords combined across ad_title and ad_description, they should come
first. Then the results that contain a combination of any 2 keywords should
come next. Finally, results with a single keyword should come last.

Please suggest

-- 
Thanks,
Pawan Darira


Re:Re: problem of solr replcation's speed

2010-11-01 Thread kafka0102
I hacked SnapPuller to log the cost, and the log looks like this:
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 980
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 5
[2010-11-01 
17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979


This says that transferring 1MB of data costs about 1000ms every couple of reads. I use 
Jetty as the server with Solr embedded in my app. I'm confused; what have I done 
wrong?


At 2010-11-01 10:12:38,"Lance Norskog"  wrote:

>If you are copying from an indexer while you are indexing new content,
>this would cause contention for the disk head. Does indexing slow down
>during this period?
>
>Lance
>
>2010/10/31 Peter Karich :
>>  we have an identical-sized index and it takes ~5minutes
>>
>>
>>> It takes about one hour to replacate 6G index for solr in my env. But my
>>> network can transfer file about 10-20M/s using scp. So solr's http
>>> replcation is too slow, it's normal or I do something wrong?
>>>
>>
>>
>
>
>
>-- 
>Lance Norskog
>goks...@gmail.com


Re:Re:Re: problem of solr replcation's speed

2010-11-01 Thread kafka0102
I suspected my app had some sleeping operation every 1s, so
I changed ReplicationHandler.PACKET_SZ to 1024 * 1024 * 10; // 10MB

and the resulting log looks like this:
[2010-11-01 
17:49:29][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3184
[2010-11-01 
17:49:32][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3426
[2010-11-01 
17:49:36][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3359
[2010-11-01 
17:49:39][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3166
[2010-11-01 
17:49:42][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3513
[2010-11-01 
17:49:46][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3140
[2010-11-01 
17:49:50][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3471

That means it's still as slow as before. What's wrong with my environment?

At 2010-11-01 17:30:32,kafka0102  wrote:
I hacked SnapPuller to log the cost, and the log is like thus:
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 980
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 5
[2010-11-01 
17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979


It's saying it cost about 1000ms for transfering 1M data every 2 times. I used 
jetty as server and embeded solr in my app.I'm so confused.What I have done 
wrong?


At 2010-11-01 10:12:38,"Lance Norskog"  wrote:

>If you are copying from an indexer while you are indexing new content,
>this would cause contention for the disk head. Does indexing slow down
>during this period?
>
>Lance
>
>2010/10/31 Peter Karich :
>>  we have an identical-sized index and it takes ~5minutes
>>
>>
>>> It takes about one hour to replacate 6G index for solr in my env. But my
>>> network can transfer file about 10-20M/s using scp. So solr's http
>>> replcation is too slow, it's normal or I do something wrong?
>>>
>>
>>
>
>
>
>-- 
>Lance Norskog
>goks...@gmail.com




Re: Design and Usage Questions

2010-11-01 Thread torin farmer
Hm, I do not have a webserver set up, for security reasons. I use SVNKit to
connect to SVN via the "file://" protocol, and what I get back is a
ByteArrayOutputStream. What would the buffer solution or the dual-thread
writer/reader pair look like?

-----Original Message-----

Von: "Lance Norskog" 

Gesendet: Nov 1, 2010 3:23:55 AM

An: solr-user@lucene.apache.org

Betreff: Re: Design and Usage Questions



>2.
>The SolrJ library handling of content streams is "pull", not "push".
>That is, you give it a reader and it pulls content when it feels like
>it. If your software to feed the connection wants to write the data,
>you have to either buffer the whole thing or do a dual-thread
>writer/reader pair.
>
>The easiest way to pull stuff from SVN is to use one of the web server
>apps. Solr takes a "stream.url" parameter. (Also stream.file.) Note
>that there is no outbound authentication supported; your web server
>has to be open (at least to the Solr instance).
>
>On Sun, Oct 31, 2010 at 4:06 PM, getagrip  wrote:
>> Hi,
>>
>> I've got some basic usage / design questions.
>>
>> 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
>>   instance for all requests to avoid connection leaks.
>>   So if I create a Singleton instance upon application-startup I can
>>   securely use this instance for ALL queries/updates throughout my
>>   application without running into performance issues?
>>
>> 2. My System's documents are stored in a Subversion repository.
>>   For fast searchresults I want to periodically index new documents
>>   from the repository.
>>
>>   What I get from the repository is a ByteArrayOutputStream. How can I
>>   pass this Stream to Solr?
>>
>>   I only see possibilities to pass Files but in my case it does not
>>   make sense to write the ByteArrayOutputStream to disk again as this
>>   would cause performance issues apart from making no sense anyway.
>>
>> 3. Are there any disadvantages using Solrj over some other HTTP based
>>   solution e.g. creating & sending my own HTTP requests? Do I even
>>   have to use HTTP?
>>   I see the EmbeddedSolrServer exists. Any drawbacks using that?
>>
>> Any hints are welcome, Thanks!
>>
>
>-- 
>Lance Norskog
>goks...@gmail.com
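
Since the ByteArrayOutputStream already holds the whole document in memory, the "buffer the whole thing" option essentially reduces to wrapping those bytes as a SolrJ ContentStream. A minimal, untested sketch along those lines (it assumes an extracting request handler mapped to /update/extract and a literal.id parameter; the class name and anything not part of SolrJ itself is illustrative):

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.InputStream;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
  import org.apache.solr.common.util.ContentStreamBase;

  public class SvnDocIndexer {
    public static void index(ByteArrayOutputStream baos, String id) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // "Buffer the whole thing": the data is already fully in memory, so expose
      // it back out as an InputStream via a small ContentStream wrapper.
      final byte[] data = baos.toByteArray();
      ContentStreamBase stream = new ContentStreamBase() {
        @Override
        public InputStream getStream() {
          return new ByteArrayInputStream(data);
        }
      };
      stream.setContentType("application/octet-stream");

      ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
      req.addContentStream(stream);
      req.setParam("literal.id", id);  // hypothetical unique key for the document
      server.request(req);
      server.commit();
    }
  }

The dual-thread alternative (a PipedOutputStream feeding a PipedInputStream read by the request thread) is only worth the extra complexity when the content cannot be held in memory in the first place.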


Re: Design and Usage Questions

2010-11-01 Thread getagrip

Ok, so if I did NOT use SolrJ, could I PUSH a stream to Solr somehow?
I do not depend on SolrJ; any connection method would suffice.

On 11/01/2010 03:23 AM, Lance Norskog wrote:

2.
The SolrJ library handling of content streams is "pull", not "push".
That is, you give it a reader and it pulls content when it feels like
it. If your software to feed the connection wants to write the data,
you have to either buffer the whole thing or do a dual-thread
writer/reader pair.

The easiest way to pull stuff from SVN is to use one of the web server
apps. Solr takes a "stream.url" parameter. (Also stream.file.) Note
that there is no outbound authentication supported; your web server
has to be open (at least to the Solr instance).


On Sun, Oct 31, 2010 at 4:06 PM, getagrip  wrote:

Hi,

I've got some basic usage / design questions.

1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
   instance for all requests to avoid connection leaks.
   So if I create a Singleton instance upon application-startup I can
   securely use this instance for ALL queries/updates throughout my
   application without running into performance issues?

2. My System's documents are stored in a Subversion repository.
   For fast searchresults I want to periodically index new documents
   from the repository.

   What I get from the repository is a ByteArrayOutputStream. How can I
   pass this Stream to Solr?

   I only see possibilities to pass Files but in my case it does not
   make sense to write the ByteArrayOutputStream to disk again as this
   would cause performance issues apart from making no sense anyway.

3. Are there any disadvantages using Solrj over some other HTTP based
   solution e.g. creating&  sending my own HTTP requests? Do I even
   have to use HTTP?
   I see the EmbeddedSolrServer exists. Any drawbacks using that?

Any hints are welcome, Thanks!







Re: Custom Sorting in Solr

2010-11-01 Thread Ezequiel Calderara
Ok, I imagined that the doubly linked list would be far too complicated for
Solr.

Now, how can I get Solr to connect to a webservice and do the import?

I'm sorry if I'm not clear, sometimes my English gets fuzzy :P

On Fri, Oct 29, 2010 at 4:51 PM, Yonik Seeley wrote:

> On Fri, Oct 29, 2010 at 3:39 PM, Ezequiel Calderara 
> wrote:
> > Hi all guys!
> > I'm in a weird situation here.
> > We have index a set of documents which are ordered using a linked list
> (each
> > documents has the reference of the previous and the next).
> >
> > Is there a way when sorting in the solr search, Use the linked list to
> sort?
>
> It seems like you should be able to encode this linked list as an
> integer instead, and sort by that?
> If there are multiple linked lists in the index, it seems like you
> could even use the high bits of the int to designate which list the
> doc belongs to, and the low order bits as the order in that list.
>
> -Yonik
> http://www.lucidimagination.com
>
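
A tiny Java sketch of the encoding Yonik describes, assuming each list has at most about a million entries (the bit split is illustrative):

  // Pack (listId, position) into one sortable int: the high bits select which
  // linked list the document belongs to, the low bits preserve its order within
  // that list. Index the packed value into a sortable int field and sort on it.
  public final class ListOrder {
    private static final int POSITION_BITS = 20;  // up to ~1M positions per list

    public static int encode(int listId, int position) {
      return (listId << POSITION_BITS) | position;
    }

    public static int listOf(int key)     { return key >>> POSITION_BITS; }
    public static int positionOf(int key) { return key & ((1 << POSITION_BITS) - 1); }
  }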



-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: solr stuck in writing to inexisting sockets

2010-11-01 Thread Roxana Angheluta
Hi,

Yes, sometimes it takes >5 minutes for a query. I agree this is not desirable. 
However, if the application has no control over the input queries other than 
closing the socket after a while, Solr should not continue writing the 
response, but terminate the thread.

In general, is there a way to quantify the complexity of a given query on a 
certain index? Some general guidelines which can be used by non-technical 
people?

Thanks a lot,
roxana 

--- On Sun, 10/31/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: solr stuck in writing to inexisting sockets
> To: solr-user@lucene.apache.org
> Date: Sunday, October 31, 2010, 2:29 AM
> Are you saying that your Solr server
> is at times taking 5 minutes to
> complete? If so,
> I'd get to the bottom of that first off. My first guess
> would be you're
> either hitting
> memory issues and swapping horribly or..well, that would be
> my first guess.
> 
> Best
> Erick
> 
> On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta wrote:
> 
> > Hi all,
> >
> > We are using Solr over Jetty with a large index,
> sharded and distributed
> > over multiple machines. Our queries are quite long,
> involving boolean and
> > proximity operators. We cut the connection at the
> client side after 5
> > minutes. Also, we are using parameter timeAllowed to
> stop executing it on
> > the server after a while.
> > We quite often run into situations when solr "blocks".
> The load on the
> > server increases and a thread dump on the solr process
> shows many threads
> > like below:
> >
> >
> > "btpool0-49" prio=10 tid=0x7f73afe1d000 nid=0x3581
> runnable
> > [0x451a]
> >   java.lang.Thread.State: RUNNABLE
> >        at
> java.io.PrintWriter.write(PrintWriter.java:362)
> >        at
> org.apache.solr.common.util.XML.escape(XML.java:206)
> >        at
> org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
> >        at
> org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
> >        at
> org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
> >        at
> org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
> >        at
> org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
> >        at
> org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
> >        at
> >
> org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
> >        at
> >
> org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
> >        at
> org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
> >        at
> >
> org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
> >        at
> >
> org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
> >        at
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >        at
> >
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >        at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >        at
> >
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >        at
> >
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> > ..
> >
> >
> > A netstat on the machine shows sockets in state
> CLOSE_WAIT. However, they
> > are fewer than the number of RUNNABLE threads as the
> above.
> >
> > Why is this happening? Is there anything we can do to
> avoid getting in
> > these situations?
> >
> > Thanks,
> > roxana
> >
> >
> >
> >
> 





big terms in UnInvertedField

2010-11-01 Thread Koji Sekiguchi
Hello,

With the Solr example, using facet.field=text creates an UnInvertedField
for the text field in the fieldValueCache. After that, I looked at the stats page
and was surprised to see that the counters in the *filterCache* were up:

lookups : 213
hits : 106
hitratio : 0.49
inserts : 107
evictions : 0
size : 107
warmupTime : 0
cumulative_lookups : 213
cumulative_hits : 106
cumulative_hitratio : 0.49
cumulative_inserts : 107
cumulative_evictions : 0

Is this caused by the big terms in the UnInvertedField?
If so, when using faceting on a multiValued field together with faceting on a
single-valued field (or a facet query), it is difficult to estimate
the required size of the filterCache.

Koji
-- 
http://www.rondhuit.com/en/


Re: big terms in UnInvertedField

2010-11-01 Thread Yonik Seeley
2010/11/1 Koji Sekiguchi :
> With solr example, using facet.field=text creates UnInvertedField
> for the text field in fieldValueCache. After that, I saw stats page
> and I was surprised at counters in *filterCache* were up:

> Do they cause of big words in UnInvertedField?

Yes.  "big" terms (defined as matching more than 5% of the index) are
not uninverted since it's more efficient (both CPU and memory) to use
the filterCache and calculate intersections.

> If so, when using both facet for multiValued field and facet for
> single valued field/facet query, it is difficult to estimate
> the size of filterCache.

Yep.  At least fieldValueCache (for UnInvertedField) tells you the
number of big terms in each field you are faceting on though.
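
For reference, both caches are sized in solrconfig.xml; a hedged sketch of the relevant entries (the numbers are illustrative, and fieldValueCache is created implicitly with defaults if it is not declared):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <fieldValueCache class="solr.FastLRUCache" size="10" showItems="32"/>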

-Yonik
http://www.lucidimagination.com


Re: big terms in UnInvertedField

2010-11-01 Thread Koji Sekiguchi

Yonik,

Thank you for your reply. I just wanted to share my surprise. :)

Koji
--
http://www.rondhuit.com/en/

(10/11/01 23:17), Yonik Seeley wrote:

2010/11/1 Koji Sekiguchi:

With solr example, using facet.field=text creates UnInvertedField
for the text field in fieldValueCache. After that, I saw stats page
and I was surprised at counters in *filterCache* were up:



Do they cause of big words in UnInvertedField?


Yes.  "big" terms (defined as matching more than 5% of the index) are
not uninverted since it's more efficient (both CPU and memory) to use
the filterCache and calculate intersections.


If so, when using both facet for multiValued field and facet for
single valued field/facet query, it is difficult to estimate
the size of filterCache.


Yep.  At least fieldValueCache (for UnInvertedField) tells you the
number of big terms in each field you are faceting on though.

-Yonik
http://www.lucidimagination.com





Re: Solr Relevency Calculation

2010-11-01 Thread Erick Erickson
Here's a good place to start:
http://search.lucidimagination.com/search/out?u=http://lucene.apache.org/java/2_4_0/scoring.html

But what do you mean by "this is going to search on five fields"? This
sounds like you're using DisMax, in which case it throws out all but the
top-scoring clause when it calculates the score for the document.
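
As an aside, the per-document scoring breakdown can be inspected directly by adding debugQuery to the request, e.g.:

  http://localhost:8983/solr/select?q=laptops&debugQuery=on

The "explain" section of the debug output shows exactly which fields and clauses contributed to each document's score.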

HTH
Erick

On Sun, Oct 31, 2010 at 10:48 PM, sivaprasad wrote:

>
> Hi,
> I have 25 indexed fields in my document.But by default, if i give
> "q=laptops" this is going to search on five fields and iam getting the
> score
> as part of search results.How solr will calculate the score?Is it going to
> calculate only on the five fields or on 25 fields which are indexed?What is
> the order it is going to take to calculate score?Any documents related to
> this topic is helpful for me.
>
> Regards,
> JS
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Relevency-Calculation-tp1819798p1819798.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Boosting the score based on certain field

2010-11-01 Thread Erick Erickson
Would simple boosting work? As in category:electronics^2?
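
A hedged sketch of two common ways to express that (the boost values are illustrative):

  # standard query parser, default OR operator: the category clauses are optional
  # and only influence the score
  q=+laptop category:electronics^2 category:games^1.2

  # dismax: keep the user query clean and move the boosts into bq
  q=laptop&defType=dismax&qf=text&bq=category:electronics^2 category:games^1.2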

If not, perhaps you can explain a bit more about what you're trying to
accomplish...

Best
Erick

On Sun, Oct 31, 2010 at 10:55 PM, sivaprasad wrote:

>
> Hi,
>
> In my document i have a filed called category.This contains
> "electronics,games ,..etc".For some of the category values i need to boost
> the document score.Let us say, for "electronics" category, i will decide
> the
> boosting parameter grater than the "games" category.Is there any body has
> the idea to achieve this functionality?
>
> Regards,
> Siva
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Boosting-the-score-based-on-certain-field-tp1819820p1819820.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Multiple Keyword Search

2010-11-01 Thread Erick Erickson
I'm not sure this exactly fits your use-case, but it may come
"close enough". Have you looked at disMax and the mm parameter
(minimum should match)?
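
A hedged sketch of such a request (the field names come from the question; the mm value and everything else is illustrative):

  q=red wireless mouse&defType=dismax&qf=ad_title ad_description&mm=1

With mm=1 a single matching keyword is enough to qualify, and documents matching more of the keywords generally score higher, which approximates the drop-one-keyword-at-a-time ordering described above.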

Best
Erick

On Mon, Nov 1, 2010 at 5:00 AM, Pawan Darira  wrote:

> Hi
>
> There is a situation where i search for more than 1 keyword & my main 2
> fields are ad_title & ad_description.
> I want those results which match all of the keywords in both fields, should
> come on top. Then sequentially one by one keyword can be dropped in further
> results.
>
> E.g. In a search of 3 keywords, let there are 100 results. If 35 contain
> all
> the keywords combined in ad_title & ad_description, then they should come
> first. Then if 50 results contain combination of any 2 keywords, they
> should
> come next. Finally results with single keyword should come at last
>
> Please suggest
>
> --
> Thanks,
> Pawan Darira
>


Re: solr stuck in writing to inexisting sockets

2010-11-01 Thread Erick Erickson
I'm going to nudge you in the direction of understanding why the queries
take so long in the first place rather than going toward the blunt approach
of cutting them off after some time. The fact that you don't control the
queries submitted doesn't prevent you from trying to understand what
is taking so long.

The first thing I'd look for is whether the system is memory starved. What
JVM are you using and what memory parameters are you giving it? What
version of Solr are you using? Have you tried any performance monitoring
to determine what is happening?

The reason I'm pushing in this direction is that 5 minute searches are
pathological. Once you're up in that range, virtually any fix you come up
with will simply mask the underlying problems, and you'll be forever
chasing the next manifestation of the underlying problem.

Besides, I don't know how you'd stop Solr from processing a query mid-way
through; I don't know of any way to make that happen.

Best
Erick

On Mon, Nov 1, 2010 at 9:30 AM, Roxana Angheluta wrote:

> Hi,
>
> Yes, sometimes it takes >5 minutes for a query. I agree this is not
> desirable. However, if the application has no control over the input queries
> other that closing the socket after a while, solr should not continue
> writing the response, but terminate the thread.
>
> In general, is there a way to quantify the complexity of a given query on a
> certain index? Some general guidelines which can be used by non-technical
> people?
>
> Thanks a lot,
> roxana
>
> --- On Sun, 10/31/10, Erick Erickson  wrote:
>
> > From: Erick Erickson 
> > Subject: Re: solr stuck in writing to inexisting sockets
> > To: solr-user@lucene.apache.org
> > Date: Sunday, October 31, 2010, 2:29 AM
> > Are you saying that your Solr server
> > is at times taking 5 minutes to
> > complete? If so,
> > I'd get to the bottom of that first off. My first guess
> > would be you're
> > either hitting
> > memory issues and swapping horribly or..well, that would be
> > my first guess.
> >
> > Best
> > Erick
> >
> > On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta  >wrote:
> >
> > > Hi all,
> > >
> > > We are using Solr over Jetty with a large index,
> > sharded and distributed
> > > over multiple machines. Our queries are quite long,
> > involving boolean and
> > > proximity operators. We cut the connection at the
> > client side after 5
> > > minutes. Also, we are using parameter timeAllowed to
> > stop executing it on
> > > the server after a while.
> > > We quite often run into situations when solr "blocks".
> > The load on the
> > > server increases and a thread dump on the solr process
> > shows many threads
> > > like below:
> > >
> > >
> > > "btpool0-49" prio=10 tid=0x7f73afe1d000 nid=0x3581
> > runnable
> > > [0x451a]
> > >   java.lang.Thread.State: RUNNABLE
> > >at
> > java.io.PrintWriter.write(PrintWriter.java:362)
> > >at
> > org.apache.solr.common.util.XML.escape(XML.java:206)
> > >at
> > org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
> > >at
> > org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
> > >at
> > org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
> > >at
> > org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
> > >at
> > org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
> > >at
> > org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
> > >at
> > >
> > org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
> > >at
> > >
> >
> org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
> > >at
> > org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
> > >at
> > >
> > org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
> > >at
> > >
> >
> org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
> > >at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
> > >at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
> > >at
> > >
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> > >at
> > >
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> > >at
> > >
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > >at
> > >
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> > >at
> > >
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> > > ..
> > >
> > >
> > > A netstat on the machine shows sockets in state
> > CLOSE_WAIT. However, they
> > > are fewer than the number of RUNNABLE threads as the
> > above.
> > >
> > > Why is this happening? Is there anything we can do to
> > avoid getting in
> > > these situations?
> > >

Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
We are trying to solve some multilingual issues with our Solr analysis filter 
chain and would like to use the new Lucene 3.x filters that are Unicode 
compliant.

Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
UAX#29 support from Solr?

Is it just a matter of writing the appropriate Solr filter factories?  Are 
there any tricky gotchas in writing such a filter?

If so, should I open a JIRA issue or two JIRA issues so the filter factories 
can be contributed to the Solr code base?

Tom



Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Robert Muir
On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom  wrote:
> We are trying to solve some multilingual issues with our Solr analysis filter 
> chain and would like to use the new Lucene 3.x filters that are Unicode 
> compliant.
>
> Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
> UAX#29 support from Solr?

right now, you can use the StandardTokenizerFactory (which is UAX#29 +
URL and IP address recognition) from Solr.
just make sure you set the Version to 3.1 in your solrconfig.xml with
branch_3x, otherwise it will use the "old" standardtokenizer for
backwards compatibility.

  
  LUCENE_31
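
For completeness, a hedged schema.xml sketch of a field type that would pick this up (the type name and the extra lowercase filter are illustrative):

  <fieldType name="text_uax29" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>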

But if you want the pure UAX#29 Tokenizer without this, there isn't a
factory. Also if you want customization/supplementary character
support, there is no factory for ICUTokenizer at the moment.

> If so, should I open a JIRA issue or two JIRA issues so the filter factories 
> can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 Tokenizer, and
for the ICU factories (maybe we can just put this into a contrib for
now?) !


Re: Solr in virtual host as opposed to /lib

2010-11-01 Thread Jonathan Rochkind
I think you guys are talking about two different kinds of 'virtual 
hosts'.  Lance is talking about CPU virtualization. Eric appears to be 
talking about apache virtual web hosts, although Eric hasn't told us how 
apache is involved in his setup in the first place, so it's unclear.


Assuming you are using apache to reverse proxy to Solr, there is no 
reason I can think of that your front-end apache setup would affect CPU 
utilization by Solr, let alone by nutch.


Eric Martin wrote:

Oh. So I should take out the installations and move them to / as opposed to 
inside my virtual host of /home//www
'

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sunday, October 31, 2010 7:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

With virtual hosting you can give CPU & memory quotas to your
different VMs. This allows you to control the Nutch v.s. The World
problem. Unforch, you cannot allocate disk channel. With two i/o bound
apps, this is a problem.

On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin  wrote:
  

Excellent information. Thank you. Solr is acting just fine then. I can
connect to it no issues, it indexes fine and there didn't seem to be any
complication with it. Now I can rule it out and go about solving, what you
pointed out, and I agree, to be a java/nutch issue.

Nutch is a crawler I use to feed URL's into Solr for indexing. Nutch is open
source and found on apache.org

Thanks for your time.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Sunday, October 31, 2010 4:33 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

What servlet container are you putting your Solr in? Jetty? Tomcat?
Something else?  Are you fronting it with apache on top of that? (I think
maybe you are, otherwise I'm not sure how the phrase 'virtual host'
applies).

In general, Solr of course doesn't care what directory it's in on disk, so
long as the process running solr has the neccesary read/write permissions to
the neccesary directories (and if it doesn't, you'd usually find out right
away with an error message).  And clients to Solr don't care what directory
it's in on disk either, they only care that they can get it to it connecting
to a certain port at a certain hostname. In general, if they can't get to it
on a certain port at a certain hostname, that's something you'd discover
right away, not something that would be intermittent.  But I'm not familiar
with nutch, you may want to try connecting to the port you have Solr running
on (the hostname/port you have told nutch to find solr on?) yourself
manually, and just make sure it is connectable.

I can't think of any reason that what directory you have Solr in could cause
CPU utilization issues. I think it's got nothing to do with that.

I am not familar with nutch, if it's nutch that's taking 100% of your CPU,
you might want to find some nutch experts to ask. Perhaps there's a nutch
listserv?  I am also not familiar with hadoop; you mention just in passing
that you're using hadoop too, maybe that's an added complication, I don't
know.

One obvious reason nutch could be taking 100% cpu would be simply because
you've asked it to do a lot of work quickly, and it's trying to.

One reason I have seen Solr take 100% of CPU and become unresponsive is when
the Solr process gets caught up in terrible Java garbage collection. If
that's what's happening, then giving the Solr JVM a higher maximum heap size
can sometimes help (although confusingly, I've seen people suggest that if
you give the Solr JVM too MUCH heap it can also result in long GC pauses),
and if you have a multi-core/multi-CPU machine, I've found the JVM argument
-XX:+UseConcMarkSweepGC to be very helpful.

Other than that, it sounds to me like you've got a nutch/hadoop issue, not a
Solr issue.

From: Eric Martin [e...@makethembite.com]
Sent: Sunday, October 31, 2010 7:16 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

Hi,

Thank you. This is more than idle curiosity. I am trying to debug an issue I
am having with my installation and this is one step in verifying that I have
a setup that does not consume resources. I am trying to debunk my internal
myth that having Solr nad Nutch in a virtual host would be causing these
issues. Here is the main issue that involves Nutch/Solr and Drupal:

/home/mootlaw/lib/solr
/home/mootlaw/lib/nutch
/home/mootlaw/www/

I'm running a 1333 FSB Dual Socket Xeon 5500 Series @ 2.4ghz, Enterprise
Linux - x86_64 - OS, 12 Gig RAM. My Solr and Nutch are running. I am using
jetty for my Solr. My server is not rooted.

Nutch is using 100% of my cpus. I see this in my CPU utilization in my whm:

/usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs
-Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/mootlaw/lib/nutch/lib/native/Linux-amd64-64
-classpath
/home/mo

Facet count of zero

2010-11-01 Thread Tod
I'm trying to exclude certain facet results from a facet query.  It 
seems to work, but rather than being excluded from the facet list it's 
returned with a count of zero.


Ex: 
q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true


This returns bar with a count of zero.  All the other foo's show up with 
valid counts.


Can I do this?  Is my syntax incorrect?



Thanks - Tod


Problem with phrase matches in Solr

2010-11-01 Thread Moazzam Khan
Hey guys,

I have a Solr index where I store information about experts from
various fields. The thing is, when I search for "channel marketing" I
get people that have the word channel or marketing in their data. I
only want people who have that entire phrase in their bio. I copy the
contents of bio to the default search field (which is text).

How can I make sure that exact phrase matching works while the search
stays loose enough that partial terms match too (like uni matches
university, etc.; this works, but not phrase matching)?

I hope I was able to properly explain my problem. If not, please let me know.

Thanks in advance,
Moazzam


Re: Facet count of zero

2010-11-01 Thread Yonik Seeley
On Mon, Nov 1, 2010 at 12:55 PM, Tod  wrote:
> I'm trying to exclude certain facet results from a facet query.  It seems to
> work but rather than being excluded from the facet list its returned with a
> count of zero.

If you don't want to see 0 counts, use facet.mincount=1

http://wiki.apache.org/solr/SimpleFacetParameters
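
For instance, appended to the query exactly as given in the question:

  q=(-foo:bar)&facet=true&facet.field=foo&facet.mincount=1&facet.sort=idx&wt=json&indent=true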

-Yonik
http://www.lucidimagination.com


> Ex:
> q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true
>
> This returns bar with a count of zero.  All the other foo's show up with
> valid counts.
>
> Can I do this?  Is my syntax incorrect?
>
>
>
> Thanks - Tod
>


RE: Solr in virtual host as opposed to /lib

2010-11-01 Thread Eric Martin
I was speaking about apache virtual hosts. I was concerned that there was an 
increase in processing time due to the Solr and Nutch instances being housed inside 
a virtual host as opposed to being dropped in the root of my distro.

Thank you for the astute clarification.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Monday, November 01, 2010 9:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

I think you guys are talking about two different kinds of 'virtual 
hosts'.  Lance is talking about CPU virtualization. Eric appears to be 
talking about apache virtual web hosts, although Eric hasn't told us how 
apache is involved in his setup in the first place, so it's unclear.

Assuming you are using apache to reverse proxy to Solr, there is no 
reason I can think of that your front-end apache setup would effect CPU 
utilizaton by Solr, let alone by nutch.

Eric Martin wrote:
> Oh. So I should take out the installations and move them to / as 
> opposed to inside my virtual host of /home//www
> '
>
> -Original Message-
> From: Lance Norskog [mailto:goks...@gmail.com]
> Sent: Sunday, October 31, 2010 7:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr in virtual host as opposed to /lib
>
> With virtual hosting you can give CPU & memory quotas to your
> different VMs. This allows you to control the Nutch v.s. The World
> problem. Unforch, you cannot allocate disk channel. With two i/o bound
> apps, this is a problem.
>
> On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin  wrote:
>   
>> Excellent information. Thank you. Solr is acting just fine then. I can
>> connect to it no issues, it indexes fine and there didn't seem to be any
>> complication with it. Now I can rule it out and go about solving, what you
>> pointed out, and I agree, to be a java/nutch issue.
>>
>> Nutch is a crawler I use to feed URL's into Solr for indexing. Nutch is open
>> source and found on apache.org
>>
>> Thanks for your time.
>>
>> -Original Message-
>> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
>> Sent: Sunday, October 31, 2010 4:33 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Solr in virtual host as opposed to /lib
>>
>> What servlet container are you putting your Solr in? Jetty? Tomcat?
>> Something else?  Are you fronting it with apache on top of that? (I think
>> maybe you are, otherwise I'm not sure how the phrase 'virtual host'
>> applies).
>>
>> In general, Solr of course doesn't care what directory it's in on disk, so
>> long as the process running solr has the neccesary read/write permissions to
>> the neccesary directories (and if it doesn't, you'd usually find out right
>> away with an error message).  And clients to Solr don't care what directory
>> it's in on disk either, they only care that they can get it to it connecting
>> to a certain port at a certain hostname. In general, if they can't get to it
>> on a certain port at a certain hostname, that's something you'd discover
>> right away, not something that would be intermittent.  But I'm not familiar
>> with nutch, you may want to try connecting to the port you have Solr running
>> on (the hostname/port you have told nutch to find solr on?) yourself
>> manually, and just make sure it is connectable.
>>
>> I can't think of any reason that what directory you have Solr in could cause
>> CPU utilization issues. I think it's got nothing to do with that.
>>
>> I am not familar with nutch, if it's nutch that's taking 100% of your CPU,
>> you might want to find some nutch experts to ask. Perhaps there's a nutch
>> listserv?  I am also not familiar with hadoop; you mention just in passing
>> that you're using hadoop too, maybe that's an added complication, I don't
>> know.
>>
>> One obvious reason nutch could be taking 100% cpu would be simply because
>> you've asked it to do a lot of work quickly, and it's trying to.
>>
>> One reason I have seen Solr take 100% of CPU and become unresponsive is when
>> the Solr process gets caught up in terrible Java garbage collection. If
>> that's what's happening, then giving the Solr JVM a higher maximum heap size
>> can sometimes help (although confusingly, I've seen people suggest that if
>> you give the Solr JVM too MUCH heap it can also result in long GC pauses),
>> and if you have a multi-core/multi-CPU machine, I've found the JVM argument
>> -XX:+UseConcMarkSweepGC to be very helpful.
>>
>> Other than that, it sounds to me like you've got a nutch/hadoop issue, not a
>> Solr issue.
>> 
>> From: Eric Martin [e...@makethembite.com]
>> Sent: Sunday, October 31, 2010 7:16 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Solr in virtual host as opposed to /lib
>>
>> Hi,
>>
>> Thank you. This is more than idle curiosity. I am trying to debug an issue I
>> am having with my installation and this is one step in verifying that I have
>> a setup that does not consume resources. I am trying to deb

Re: Problem with phrase matches in Solr

2010-11-01 Thread darren
Take a look at term proximity and phrase query.

http://wiki.apache.org/solr/SolrRelevancyCookbook
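
As a hedged illustration (the qf/pf values assume the copied-to text field described in the question): the phrase can be quoted directly in a standard query, or with dismax the individual terms can stay loose in qf while full-phrase matches get boosted via pf:

  # standard query parser: quoted phrase, with an optional slop of 2 positions
  q="channel marketing"~2

  # dismax: terms match individually, documents containing the whole phrase rank higher
  q=channel marketing&defType=dismax&qf=text&pf=text^10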

> Hey guys,
>
> I have a solr index where i store information about experts from
> various fields. The thing is when I search for "channel marketing" i
> get people that have the word channel or marketing in their data. I
> only want people who have that entire phrase in their bio. I copy the
> contents of bio to the default search field (which is text)
>
> How can I make sure that exact phrase matching works while the search
> is agile enough that half searches match too (like uni matches
> university, etc - this works but not phrase matching)?
>
> I hope I was able to properly explain my problem. If not, please let me
> know.
>
> Thanks in advance,
> Moazzam
>



RE: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
Thanks Robert,

I'll use the workaround for now (using StandardTokenizerFactory and specifying 
version 3.1), but I suspect that I don't want the added URL/IP address 
recognition due to my use case.  I've also talked to a couple people who 
recommended using the ICUTokenFilter with some rule modifications, but haven't 
had a chance to investigate that yet.

  I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210) 
and https://issues.apache.org/jira/browse/SOLR-2211.  Sometime later this week 
I'll try writing the FilterFactories and upload patches. (Unless someone beats 
me to it :)

Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, November 01, 2010 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support 
from Solr

On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom  wrote:
> We are trying to solve some multilingual issues with our Solr analysis filter 
> chain and would like to use the new Lucene 3.x filters that are Unicode 
> compliant.
>
> Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
> UAX#29 support from Solr?

right now, you can use the StandardTokenizerFactory (which is UAX#29 +
URL and IP address recognition) from Solr.
just make sure you set the Version to 3.1 in your solrconfig.xml with
branch_3x, otherwise it will use the "old" standardtokenizer for
backwards compatibility.

  
  LUCENE_31

But if you want the pure UAX#29 Tokenizer without this, there isn't a
factory. Also if you want customization/supplementary character
support, there is no factory for ICUTokenizer at the moment.

> If so, should I open a JIRA issue or two JIRA issues so the filter factories 
> can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 Tokenizer, and
for the ICU factories (maybe we can just put this into a contrib for
now?) !


RE: How does DIH multithreading work?

2010-11-01 Thread Dyer, James
Mark,

I have the same question so I did a little research on this.  Not a complete 
answer but here is what I've found:

- "threads" was aded with SOLR-1352 
(https://issues.apache.org/jira/browse/SOLR-1352).

- Also see 
http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler
 for background info.

- Only available in 3.x and trunk.  Committed on 1/12/2010 by Noble Paul (who 
surely can tell you more accurate info than I can).

- It seems that when this is used, each thread calls "nextRow" on your root entity's 
datasource in parallel.

- Not sure this will help with child entities (ie. I had hoped I could get it 
to build child caches in parallel but I don't think this is the case).

- A doc comment on ThreadedEntityProcessorWrapper indicates this will help 
speed up running transformers because they'd run in parallel.  This would make 
sense if your database can only pull rows back so fast but you then have an 
intensive transformer.  Adding threads might make your processing no 
slower than the db...
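
A hedged data-config.xml sketch of the 4-thread case asked about in the original question (the entity, table, and column names are made up):

  <document>
    <entity name="item" threads="4" dataSource="db"
            query="select id, name from item">
      <field column="id"   name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>

As noted above, the threads then call nextRow on the same root-entity result in parallel; it is not a case of each thread issuing its own one-row query.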

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: markwaddle [mailto:m...@markwaddle.com] 
Sent: Tuesday, October 26, 2010 2:25 PM
To: solr-user@lucene.apache.org
Subject: How does DIH multithreading work?


I understand that the thread count is specified on root entities only. Does
it spawn multiple threads per root entity? Or multiple threads per
descendant entity? Can someone give an example of how you would make a
database query in an entity with 4 threads that would select 1 row per
thread?

Thanks,
Mark
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-does-DIH-multithreading-work-tp1776111p1776111.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: indexing '-

2010-11-01 Thread PeterKerk

Guys, the "string" type did the trick :)

Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-tp1816969p1823199.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Robert Muir
On Mon, Nov 1, 2010 at 1:34 PM, Burton-West, Tom  wrote:
> Thanks Robert,
>
> I'll use the workaround for now (using StandardTokenizerFactory and 
> specifying version 3.1), but I suspect that I don't want the added URL/IP 
> address recognition due to my use case.  I've also talked to a couple people 
> who recommended using the ICUTokenFilter with some rule modifications, but 
> haven't had a chance to investigate that yet.
>

yes, as far as doing rule modifications, we can think about how to
hook this in. At the end of the day, if we allow someone to specify
the classname of their ICUTokenizerConfig (default:
DefaultICUTokenizerConfig), that would at least allow this
customization.

separately i'd be interested in hearing about whatever rule
modifications might be useful for different purposes.

>  I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210) 
> and https://issues.apache.org/jira/browse/SOLR-2211.  Sometime later this 
> week I'll try writing the FilterFactories and upload patches. (Unless someone 
> beats me to it :)
>

Thanks Tom, there are actually a lot of analysis factories (even in
just ICU itself) not exposed to Solr, so it's a good deal of work. I
know I have a few of them, but they aren't the best. I suggested on
SOLR-2210 we could make a contrib like 'extraAnalyzers' and put all
the analyzers-that-have-large-dependencies/dictionaries (e.g.
SmartChinese too) in there.

So there's a lot to be done... including tests; any help is appreciated!


Testing/packaging question

2010-11-01 Thread Bernhard Reiter
Hi, 

I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
see
http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/

In order to run solrpy's supplied tests at build time, I'd need Solr to
know about the schema.xml that comes with the tests.
Can anyone tell me how to do that properly? I'd basically need Solr to
temporarily recognize that schema.xml without permanently installing it.
Is there any way to do this, e.g. via environment variables?
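
One possibility, hedged because I don't know the solrpy test layout: Solr resolves its config directory from the solr.solr.home JVM system property (or a solr/home JNDI entry), so a throwaway instance can be pointed at the tests' schema for the duration of the build without installing anything, e.g.:

  # assumes the tests ship a solr home directory containing conf/schema.xml
  cd apache-solr-1.4.1/example
  java -Dsolr.solr.home=/path/to/solrpy/tests/solr -jar start.jar

Note that this is a Java system property rather than an environment variable.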

TIA
Bernhard Reiter



Re: Facet count of zero

2010-11-01 Thread Tod

On 11/1/2010 1:03 PM, Yonik Seeley wrote:

On Mon, Nov 1, 2010 at 12:55 PM, Tod  wrote:

I'm trying to exclude certain facet results from a facet query.  It seems to
work but rather than being excluded from the facet list its returned with a
count of zero.


If you don't want to see 0 counts, use facet.mincount=1

http://wiki.apache.org/solr/SimpleFacetParameters

-Yonik
http://www.lucidimagination.com



Ex:
q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true

This returns bar with a count of zero.  All the other foo's show up with
valid counts.

Can I do this?  Is my syntax incorrect?



Thanks - Tod





Excellent, I completely missed it - thanks!


Re: Solr in virtual host as opposed to /lib

2010-11-01 Thread Chris Hostetter

: References: 
: 
: 
: 
: 
: In-Reply-To: 
: Subject: Solr in virtual host as opposed to /lib

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss


Re: Reverse range search

2010-11-01 Thread Jan Høydahl / Cominvent
Hi,

I think I have seen a comment on the list from someone with the same need a few 
months ago.
He planned to make a new fieldType to support this, e.g. MinMaxRangeFieldType 
which would
be a polyField type holding both a min and max value, and then you could query 
it
q=myminmaxfield:123

I did not find it as a Jira issue however, but I can see how it would be useful 
for a lot of usecases. Perhaps you can create a Jira issue for it and supply a 
patch? :)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 28. okt. 2010, at 23.24, kenf_nc wrote:

> 
> Doing a range search is straightforward. I have a fixed value in a document
> field, I search on [x TO y] and if the fixed value is in the range requested
> it gets a hit. But, what if I have data in a document where there is a min
> value and a max value and my query is a fixed value and I want to get a hit
> if the query value is in that range. For example:
> 
> Solr Doc1:
> field  min_price:100
> field  max_price:500
> 
> Solr Doc2:
> field  min_price:300
> field  max_price:500
> 
> and my query is price:250. I could create a query of (min_price:[* TO 250]
> AND max_price:[250 TO *]) and that should work. It should find only doc 1.
> However, if I have several fields like this and complex queries that include
> most of those fields, it becomes a very ugly query. Ideally I'd like to do
> something similar to what the spatial contrib guys do where they make
> lat/long a single point. If I had a min/max field, I could call it Price
> (100, 500) or Price (300,500) and just do a query of  Price:250 and Solr
> would see if 250 was in the appropriate range.
> 
> Looong question short...Is there something out there already that does this?
> Does anyone else do something like this and have some suggestions?
> Thanks,
> Ken
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Reverse-range-search-tp1789135p1789135.html
> Sent from the Solr - User mailing list archive at Nabble.com.



RE: Solr in virtual host as opposed to /lib

2010-11-01 Thread Eric Martin
I don't think you read the entire thread. I'm assuming you made a mistake.

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Monday, November 01, 2010 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib


: References: 
: 
: 
: 
: 
: In-Reply-To:

: Subject: Solr in virtual host as opposed to /lib

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



Re: Solr in virtual host as opposed to /lib

2010-11-01 Thread Markus Jelsma
No, he didn't make a mistake, but you did. Next time, please start a new thread, 
not by conveniently replying to an existing thread and just changing the 
subject. Now we have two threads in one thread. :)

> I don't think you read the entire thread. I'm assuming you made a mistake.
> 
> -Original Message-
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Monday, November 01, 2010 11:49 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr in virtual host as opposed to /lib
> 
> : References:
> : 
> : 
> : 
> : 
> : 
> : 
> : 
> : In-Reply-To:
> 
> 
> : Subject: Solr in virtual host as opposed to /lib
> 
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
> 
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email.  Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention.   It makes following discussions in the mailing list archives
> particularly difficult.
> See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
> 
> 
> 
> -Hoss


RE: Solr in virtual host as opposed to /lib

2010-11-01 Thread Chris Hostetter

: I don't think you read the entire thread. I'm assuming you made a mistake.

No mistake.  When you sent your first message with the subject "Solr in 
virtual host as opposed to /lib" you did so in response to a completely 
unrelated thread ("Searching with wrong keyboard layout or using 
translit")

Please note the headers i quoted below documenting this, or consult any 
mailing list archive that displays full threads...

http://markmail.org/thread/bjl23qcigp6w3kyl


: 
: -Original Message-
: From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
: Sent: Monday, November 01, 2010 11:49 AM
: To: solr-user@lucene.apache.org
: Subject: Re: Solr in virtual host as opposed to /lib
: 
: 
: : References: 
: : 
: : 
: : 
: : 
: : In-Reply-To:
: 
: : Subject: Solr in virtual host as opposed to /lib
: 
: http://people.apache.org/~hossman/#threadhijack
: Thread Hijacking on Mailing Lists
: 
: When starting a new discussion on a mailing list, please do not reply to 
: an existing message, instead start a fresh email.  Even if you change the 
: subject line of your email, other mail headers still track which thread 
: you replied to and your question is "hidden" in that thread and gets less 
: attention.   It makes following discussions in the mailing list archives 
: particularly difficult.
: See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
: 
: 
: 
: -Hoss
: 

-Hoss


is my search fast ?! date search i need some feedback :D

2010-11-01 Thread stockiii

My index is 13M documents, and I have not indexed all of my documents yet; the index in
the production system should be about 30M documents.

So with my 13M test index I try a search over all documents, with the
first query: q:[2008-10-27 12:23:00:00 TO 2009-04-29 23:59:00:00].
Then I run the next query for the statistics, grouped by currency_id, to get
the amounts for these currencies.

That's my result:
-> EUR Sum: 437.259.518,28 € found: 3712331
-> CHF Sum: 2.048.147,62 SFr. found: 10473
-> GBP Sum: 1.221,41 £ found: 181

Solr needs 9 seconds to produce this result... I don't think that's really
fast =(
What do you think?


For a faster search I want to try changing precisionStep="6" so that the
milliseconds are dropped. What is the value that also drops the seconds? We only
need HH:MM and not HH:MM:SS:MS.
I will also change the date search from q to fq.
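
For reference, a hedged sketch of the two changes being considered (the field name paydate and the values are made up; note that Solr range queries expect the ISO-8601 form with 'T' and 'Z'):

  <!-- schema.xml: precisionStep controls how many precision levels get indexed
       for fast range queries; it does not round or truncate the stored value -->
  <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
             precisionStep="6" positionIncrementGap="0"/>

  # query: move the constant date range out of q into fq so the filter is cached
  q=*:*&fq=paydate:[2008-10-27T12:23:00Z TO 2009-04-29T23:59:00Z]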

thx


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/is-my-search-fast-date-search-i-need-some-feedback-D-tp1820821p1820821.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Use SolrCloud (SOLR-1873) on trunk, or with 1.4.1?

2010-11-01 Thread Jeremy Hinegardner
I took a swag at applying SOLR-1873 to branch_3x.  It applied mostly; most
of the remaining issues were ZooKeeper integrations, and those
applied cleanly by hand.

There were also a few constants and such that needed to be pulled in from trunk.

At the moment, it passes all the tests.  I have not actually used it yet,
and probably won't for a few weeks, but if someone else wants to try it out:

http://github.com/collectiveintellect/lucene-solr/tree/branch_3x-cloud

Have at it.

enjoy,

-jeremy

On Thu, Oct 28, 2010 at 11:21:12PM +0200, Jan Høydahl / Cominvent wrote:
> Hi,
> 
> I would aim for reindexing on branch3_x, which will be the 3.1 release soon. 
> I don't know if SOLR-1873 applies cleanly to 3_x now, but it would surely be 
> less effort to have it apply to 3_x than to 1.4. Perhaps you can help 
> backport the patch to 3_x?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> On 28. okt. 2010, at 03.04, Jeremy Hinegardner wrote:
> 
> > Hi all,
> > 
> > I see that as of r1022188 Solr Cloud has been committed to trunk.
> > 
> > I was wondering about the stability of Solr Cloud on trunk.  We are
> > planning to do a major reindexing soon (within 30 days), several billion 
> > docs,
> > and would like to switch to a Solr Cloud based infrastructure. 
> > 
> > We are wondering should use trunk as it is now that SOLR-1873 is applied, or
> > should we take SOLR-1873 and apply it to Solr 1.4.1.
> > 
> > Has anyone used 1.4.1 + SOLR-1873?  In production?
> > 
> > Thanks,
> > 
> > -jeremy
> > 
> > -- 
> > 
> > Jeremy Hinegardner  jer...@hinegardner.org 
> > 
> 

-- 

 Jeremy Hinegardner  jer...@hinegardner.org 



Re: How does DIH multithreading work?

2010-11-01 Thread Lance Norskog
It is useful for parsing PDFs on a multi-processor machine. Also, if a
sub-entity does an outbound I/O call to a database, a file, or another
SOLR (SOLR-1499).

Anything where the pipeline time outweighs disk i/o time.

Threading happens on a per-document level- there is no concurrent
access inside a document pipeline.

There is a bug which causes EntityProcessors that look up attributes to
throw an exception. This makes Tika unusable inside a thread. Two other
EPs also won't work, but I did not test them.

https://issues.apache.org/jira/browse/SOLR-2186

On Mon, Nov 1, 2010 at 10:43 AM, Dyer, James  wrote:
> Mark,
>
> I have the same question so I did a little research on this.  Not a complete 
> answer but here is what I've found:
>
> - "threads" was added with SOLR-1352 
> (https://issues.apache.org/jira/browse/SOLR-1352).
>
> - Also see 
> http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler
>  for background info.
>
> - Only available in 3.x and trunk.  Committed on 1/12/2010 by Noble Paul (who 
> surely can tell you more accurate info than I can).
>
> - Seems like when using, each thread will call "nextRow" on your root entity 
> datasource in parallel.
>
> - Not sure this will help with child entities (ie. I had hoped I could get it 
> to build child caches in parallel but I don't think this is the case).
>
> - A doc comment on ThreadedEntityProcessorWrapper indicates this will help 
> speed up running transformers because they'd be in parallel.  This would 
> make sense if maybe your database can only pull back so fast, but then you 
> have an intensive transformer.  Maybe adding a thread would make your 
> processing no slower than the db...
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -Original Message-
> From: markwaddle [mailto:m...@markwaddle.com]
> Sent: Tuesday, October 26, 2010 2:25 PM
> To: solr-user@lucene.apache.org
> Subject: How does DIH multithreading work?
>
>
> I understand that the thread count is specified on root entities only. Does
> it spawn multiple threads per root entity? Or multiple threads per
> descendant entity? Can someone give an example of how you would make a
> database query in an entity with 4 threads that would select 1 row per
> thread?
>
> Thanks,
> Mark
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-does-DIH-multithreading-work-tp1776111p1776111.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: solr stuck in writing to inexisting sockets

2010-11-01 Thread Lance Norskog
> Besides, I don't know how you'd stop Solr processing a query mid-way
> through,
> I don't know of any way to make that happen.
The timeAllowed parameter causes a timeout in the Solr server that kills
the searching thread. They use that now.

But, yes, Erick is right- there is a fundamental problem you should
solve. Since they are all stuck in returning XML results, there is
something wrong in reading back results.

It is possible that there is a bug in timeAllowed, where the
kill-this-thread signal hits while the results are being written and the
handler for it does not behave correctly at that point. It would be great
if someone wrote a unit test for this (not me) and posted it.
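
For reference, timeAllowed is just a per-request parameter; a minimal SolrJ
sketch of setting it (the URL and query text below are illustrative
assumptions, not from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("some long boolean query here");
        q.setTimeAllowed(5000);  // stop searching after 5s; partial results may be returned
        QueryResponse rsp = server.query(q);
        System.out.println("found: " + rsp.getResults().getNumFound());
    }
}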

On Mon, Nov 1, 2010 at 8:44 AM, Erick Erickson  wrote:
> I'm going to nudge you in the direction of understanding why the queries
> take so long in the first place rather than going toward the blunt approach
> of cutting them off after some time. The fact that you don't control the
> queries submitted doesn't prevent you from trying to understand what
> is taking so long.
>
> The first thing I'd look for is whether the system is memory starved. What
> JVM are you using and what memory parameters are you giving it? What
> version of Solr are you using? Have you tried any performance monitoring
> to determine what is happening?
>
> The reason I'm pushing in this direction is that 5 minute searches are
> pathological. Once you're up in that range, virtually any fix you come up
> with will simply mask the underlying problems, and you'll be forever
> chasing the next manifestation of the underlying problem.
>
> Besides, I don't know how you'd stop Solr processing a query mid-way
> through,
> I don't know of any way to make that happen.
>
> Best
> Erick
>
> On Mon, Nov 1, 2010 at 9:30 AM, Roxana Angheluta wrote:
>
>> Hi,
>>
>> Yes, sometimes it takes >5 minutes for a query. I agree this is not
>> desirable. However, if the application has no control over the input queries
>> other that closing the socket after a while, solr should not continue
>> writing the response, but terminate the thread.
>>
>> In general, is there a way to quantify the complexity of a given query on a
>> certain index? Some general guidelines which can be used by non-technical
>> people?
>>
>> Thanks a lot,
>> roxana
>>
>> --- On Sun, 10/31/10, Erick Erickson  wrote:
>>
>> > From: Erick Erickson 
>> > Subject: Re: solr stuck in writing to inexisting sockets
>> > To: solr-user@lucene.apache.org
>> > Date: Sunday, October 31, 2010, 2:29 AM
>> > Are you saying that your Solr server
>> > is at times taking 5 minutes to
>> > complete? If so,
>> > I'd get to the bottom of that first off. My first guess
>> > would be you're
>> > either hitting
>> > memory issues and swapping horribly or..well, that would be
>> > my first guess.
>> >
>> > Best
>> > Erick
>> >
>> > On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta > >wrote:
>> >
>> > > Hi all,
>> > >
>> > > We are using Solr over Jetty with a large index,
>> > sharded and distributed
>> > > over multiple machines. Our queries are quite long,
>> > involving boolean and
>> > > proximity operators. We cut the connection at the
>> > client side after 5
>> > > minutes. Also, we are using parameter timeAllowed to
>> > stop executing it on
>> > > the server after a while.
>> > > We quite often run into situations when solr "blocks".
>> > The load on the
>> > > server increases and a thread dump on the solr process
>> > shows many threads
>> > > like below:
>> > >
>> > >
>> > > "btpool0-49" prio=10 tid=0x7f73afe1d000 nid=0x3581
>> > runnable
>> > > [0x451a]
>> > >   java.lang.Thread.State: RUNNABLE
>> > >        at
>> > java.io.PrintWriter.write(PrintWriter.java:362)
>> > >        at
>> > org.apache.solr.common.util.XML.escape(XML.java:206)
>> > >        at
>> > org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
>> > >        at
>> > org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
>> > >        at
>> > org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
>> > >        at
>> > org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
>> > >        at
>> > org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
>> > >        at
>> > org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
>> > >        at
>> > >
>> > org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
>> > >        at
>> > >
>> >
>> org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
>> > >        at
>> > org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
>> > >        at
>> > >
>> > org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
>> > >        at
>> > >
>> >
>> org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
>> > >        at
>> > >
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
>> > >        at
>> > >
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doF

Re: Design and Usage Questions

2010-11-01 Thread Lance Norskog
Yes, you can write your own app to read the file with SVNkit and post
it to the ExtractingRequestHandler. This would be easiest.
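
A rough SolrJ sketch of pushing an in-memory byte stream to the extracting
handler (this assumes the handler is mapped at /update/extract in
solrconfig.xml; the URL, the bytes and the literal.id value are made up for
illustration):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class PushStreamToExtract {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // e.g. the bytes you got back from SVNKit, instead of a file on disk
        final byte[] bytes = "document contents pulled from SVN".getBytes("UTF-8");

        ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/extract");
        req.addContentStream(new ContentStreamBase() {
            public InputStream getStream() {
                return new ByteArrayInputStream(bytes);  // wrap the in-memory bytes
            }
        });
        req.setParam("literal.id", "svn-doc-1");  // hypothetical unique key value
        server.request(req);
        server.commit();
    }
}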

On Mon, Nov 1, 2010 at 5:49 AM, getagrip  wrote:
> Ok, so if I did NOT use Solr_J I could PUSH a Stream to Solr somehow?
> I do not depend on Solr_J, any connection-method would suffice.
>
> On 11/01/2010 03:23 AM, Lance Norskog wrote:
>>
>> 2.
>> The SolrJ library handling of content streams is "pull", not "push".
>> That is, you give it a reader and it pulls content when it feels like
>> it. If your software to feed the connection wants to write the data,
>> you have to either buffer the whole thing or do a dual-thread
>> writer/reader pair.
>>
>> The easiest way to pull stuff from SVN is to use one of the web server
>> apps. Solr takes a "stream.url" parameter. (Also stream.file.) Note
>> that there is no outbound authentication supported; your web server
>> has to be open (at least to the Solr instance).
>>
>>
>> On Sun, Oct 31, 2010 at 4:06 PM, getagrip  wrote:
>>>
>>> Hi,
>>>
>>> I've got some basic usage / design questions.
>>>
>>> 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
>>>   instance for all requests to avoid connection leaks.
>>>   So if I create a Singleton instance upon application-startup I can
>>>   securely use this instance for ALL queries/updates throughout my
>>>   application without running into performance issues?
>>>
>>> 2. My System's documents are stored in a Subversion repository.
>>>   For fast searchresults I want to periodically index new documents
>>>   from the repository.
>>>
>>>   What I get from the repository is a ByteArrayOutputStream. How can I
>>>   pass this Stream to Solr?
>>>
>>>   I only see possibilities to pass Files but in my case it does not
>>>   make sense to write the ByteArrayOutputStream to disk again as this
>>>   would cause performance issues apart from making no sense anyway.
>>>
>>> 3. Are there any disadvantages using Solrj over some other HTTP based
>>>   solution e.g. creating & sending my own HTTP requests? Do I even
>>>   have to use HTTP?
>>>   I see the EmbeddedSolrServer exists. Any drawbacks using that?
>>>
>>> Any hints are welcome, Thanks!
>>>
>>
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Design and Usage Questions

2010-11-01 Thread Xin Li
If you just want a quick way to query Solr server, Perl module
Webservice::Solr is pretty good.


On Mon, Nov 1, 2010 at 4:56 PM, Lance Norskog  wrote:

> Yes, you can write your own app to read the file with SVNkit and post
> it to the ExtractingRequestHandler. This would be easiest.
>
> On Mon, Nov 1, 2010 at 5:49 AM, getagrip  wrote:
> > Ok, so if I did NOT use Solr_J I could PUSH a Stream to Solr somehow?
> > I do not depend on Solr_J, any connection-method would suffice.
> >
> > On 11/01/2010 03:23 AM, Lance Norskog wrote:
> >>
> >> 2.
> >> The SolrJ library handling of content streams is "pull", not "push".
> >> That is, you give it a reader and it pulls content when it feels like
> >> it. If your software to feed the connection wants to write the data,
> >> you have to either buffer the whole thing or do a dual-thread
> >> writer/reader pair.
> >>
> >> The easiest way to pull stuff from SVN is to use one of the web server
> >> apps. Solr takes a "stream.url" parameter. (Also stream.file.) Note
> >> that there is no outbound authentication supported; your web server
> >> has to be open (at least to the Solr instance).
> >>
> >>
> >> On Sun, Oct 31, 2010 at 4:06 PM, getagrip  wrote:
> >>>
> >>> Hi,
> >>>
> >>> I've got some basic usage / design questions.
> >>>
> >>> 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
> >>>   instance for all requests to avoid connection leaks.
> >>>   So if I create a Singleton instance upon application-startup I can
> >>>   securely use this instance for ALL queries/updates throughout my
> >>>   application without running into performance issues?
> >>>
> >>> 2. My System's documents are stored in a Subversion repository.
> >>>   For fast searchresults I want to periodically index new documents
> >>>   from the repository.
> >>>
> >>>   What I get from the repository is a ByteArrayOutputStream. How can I
> >>>   pass this Stream to Solr?
> >>>
> >>>   I only see possibilities to pass Files but in my case it does not
> >>>   make sense to write the ByteArrayOutputStream to disk again as this
> >>>   would cause performance issues apart from making no sense anyway.
> >>>
> >>> 3. Are there any disadvantages using Solrj over some other HTTP based
> >>>   solution e.g. creating & sending my own HTTP requests? Do I even
> >>>   have to use HTTP?
> >>>   I see the EmbeddedSolrServer exists. Any drawbacks using that?
> >>>
> >>> Any hints are welcome, Thanks!
> >>>
> >>
> >>
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: is my search fast ?! date search i need some feedback :D

2010-11-01 Thread Erick Erickson
Careful here. First searches are known to be slow, various caches
are filled up the first time they are used etc. So even though you're
measuring the second query, it's still perhaps filling caches.

And what are you measuring? The raw search time or the entire response
time? These can be quite different. Try running with &debugQuery=on and one
of the things you'll get back is the search time (not including assembling
the response).

You're right, though, 9 seconds is far too long. If you have a relatively
small
number of currency_ids, think about the "enum" method (see:
http://wiki.apache.org/solr/SimpleFacetParameters#facet.method)
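
A rough SolrJ sketch combining those suggestions, with the date range as an fq
and the currency facet using the enum method (the host and the date field name
are assumptions; only currency_id comes from the original post):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CurrencyFacetSketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        // put the date range in a filter query so it can be cached and reused
        q.addFilterQuery("created:[2008-10-27T12:23:00Z TO 2009-04-29T23:59:00Z]");
        q.setFacet(true);
        q.addFacetField("currency_id");
        q.set("facet.method", "enum");  // suits a field with few distinct values
        q.setRows(0);                   // only the counts are needed for the statistics
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetField("currency_id").getValues());
    }
}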

Also, think about autowarming and firstSearcher queries to prepare your
Solr instance for faster responses.

If none of that helps, please post the relevant parts of your schema.xml and
the results of running your query with &debugQuery=on, that'll give us a lot
more info to go on.

Best
Erick

On Mon, Nov 1, 2010 at 5:37 AM, stockiii  wrote:

>
> my index is 13M big and i have not index all of my documents. the index in
> production system should be about 30M Documents big.
>
> so with my test 13M Index i try a search over all documents, with
> first query: q:[2008-10-27 12:23:00:00 TO 2009-04-29 23:59:00:00]
> than i run the next query, for statistics. grouped by currency_id and get
> the amounts, of these Currencys.
>
> thats my result:
> -> EUR Sum: 437.259.518,28 € Founded: 3712331
> -> CHF Sum: 2.048.147,62 SFr. Founded: 10473
> -> GBP Sum: 1.221,41 £ Founded: 181
>
> for getting the result solr needs 9 seconds. ... i dont think thats really
> fast =(
> what do you think ?
>
>
> for faster search i want to try change precisionStep="6" to --> for
> deleting
> the milliseconds. whats the value for deleting also the seconds ? we only
> need HH:MM and not HH:MM:SS:MSMS
> and i change the datesearch from q to fq ...
>
> thx
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/is-my-search-fast-date-search-i-need-some-feedback-D-tp1820821p1820821.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Which is faster -- delete or update?

2010-11-01 Thread Andy
My documents have a "down_vote" field. Every time a user votes down a document,
I increment the "down_vote" field in my database and also re-index the document
in Solr to reflect the new down_vote value.

During searches, I want to restrict the results to only documents with, say,
fewer than 3 down_votes. 2 ways to implement that:

1) When a user down-votes a document, check to see if the total down votes have
reached 3. If they have, delete the document from the Solr index.

2) When a user down-votes a document, update the document in the Solr index to
reflect the new down_vote value even if the total down votes might have been
more than 3. During the query, add an "fq" to restrict results to documents
with fewer than 3 down votes.
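
For option 2), the restriction is just a filter query; a minimal SolrJ sketch
(only the down_vote field name is taken from this message; the host, the query
and the assumption that down_vote is an integer field are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DownVoteFilterSketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("down_vote:[* TO 2]");  // only docs with fewer than 3 down votes
        System.out.println(server.query(q).getResults().getNumFound());
    }
}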

Which approach is better? Is it faster to delete a document from the index or
to update the document to reflect the new down_vote value?

Thanks.
Andy


  

Re: Which is faster -- delete or update?

2010-11-01 Thread Peter Karich
From the user perspective I wouldn't delete it, because the down-vote could
be a mistake or spam or something, and up-voting could then resurrect the
document.
It could also be wise to keep the docs so you can see which content (from
which users?) gets down-voted, to catch spam accounts.


From the dev perspective you should benchmark it, if really necessary.
(I guess updating is more expensive, because I think it is a
delete + completely-new-add.)


Regards,
Peter.


My documents have a "down_vote" field. Every time a user votes down a document, I 
increment the "down_vote" field in my database and also re-index the document to Solr to 
reflect the new down_vote value.
During searches, I want to restrict the results to only documents with, say 
fewer than 3 down_vote. 2 ways to implement that:
1) When a user down vote a document, check to see if total down votes have 
reached 3. If it has, delete document from Solr index.
2) When a user down vote a document, update the document in Solr index to reflect the new 
down_vote value even if total down votes might have been more than 3. During query, add a 
"fq" to restrict results to documents with fewer than 3 down votes.
Which approach is better? Is it faster to delete a document from index or to 
update the document to reflect the new down_vote value?
Thanks.Andy





Re: Which is faster -- delete or update?

2010-11-01 Thread Erick Erickson
Just deleting a document is faster because all that really happens
is the document is marked as deleted. An update is really
a delete followed by an add of the same document, so by definition
an update will be slower...

But... does it really make a difference? How often do you expect this to
happen? Peter Karich added a note while I was typing this, and he
makes some cogent points.

I'm starting to think that I don't care about better unless and until my
users notice (or I have a reasonable expectation that they #will# notice).
I'm far more interested in simpler code that I can maintain than I am
shaving off another 4 milliseconds from the response time. That gives
me more chance to put in cool new features that the user will notice...

Best
Erick

On Mon, Nov 1, 2010 at 5:04 PM, Andy  wrote:

> My documents have a "down_vote" field. Every time a user votes down a
> document, I increment the "down_vote" field in my database and also re-index
> the document to Solr to reflect the new down_vote value.
> During searches, I want to restrict the results to only documents with, say
> fewer than 3 down_vote. 2 ways to implement that:
> 1) When a user down vote a document, check to see if total down votes have
> reached 3. If it has, delete document from Solr index.
> 2) When a user down vote a document, update the document in Solr index to
> reflect the new down_vote value even if total down votes might have been
> more than 3. During query, add a "fq" to restrict results to documents with
> fewer than 3 down votes.
> Which approach is better? Is it faster to delete a document from index or
> to update the document to reflect the new down_vote value?
> Thanks.Andy
>
>
>


Re: Which is faster -- delete or update?

2010-11-01 Thread Jonathan Rochkind
The actual time it takes to delete or update the document is unlikely to 
make a difference to you.


What might make a difference to you is the time it takes to actually 
finalize the commit, and the time it takes to re-warm your indexes after 
a commit, and especially the time it takes to run any warming queries 
you have set in newSearcher. Most of these probably won't differ between 
delete or update, but could be a problem either way; one way to find 
out, try it and measure it.


Whether you do a delete or an update, if you're planning on making 
changes to your index more often than, oh, every 10 or 20 minutes, 
you may run into trouble. Solr isn't so good at frequent changes to the 
index like that.  I haven't looked at it myself, but the Solr patches 
that get called "near real-time" seem like they're intended to deal with 
this, among other things, and allow frequent commits without killing 
performance or RAM usage.


I am not sure how/if other people are effectively dealing with 
user-generated content that needs to be included in the index for 
filtering and searching against. Would be very curious if anyone has any 
successful strategies to share. Another example would be user-generated 
tagging.


Erick Erickson wrote:

Just deleting a document is faster because all that really happens
is the document is marked as deleted. An update is really
a delete followed by an add of the same document, so by definition
an update will be slower...

But... does it really make a difference? How often to you expect this to
happen? Perter Karich added a note while I was typing this, and he
makes some cogent points.

I'm starting to think that I don't care about better unless and until my
users notice (or I have a reasonable expectation that they #will# notice).
I'm far more interested in simpler code that I can maintain than I am
shaving off another 4 milliseconds from the response time. That gives
me more chance to put in cool new features that the user will notice...

Best
Erick

On Mon, Nov 1, 2010 at 5:04 PM, Andy  wrote:

  

My documents have a "down_vote" field. Every time a user votes down a
document, I increment the "down_vote" field in my database and also re-index
the document to Solr to reflect the new down_vote value.
During searches, I want to restrict the results to only documents with, say
fewer than 3 down_vote. 2 ways to implement that:
1) When a user down vote a document, check to see if total down votes have
reached 3. If it has, delete document from Solr index.
2) When a user down vote a document, update the document in Solr index to
reflect the new down_vote value even if total down votes might have been
more than 3. During query, add a "fq" to restrict results to documents with
fewer than 3 down votes.
Which approach is better? Is it faster to delete a document from index or
to update the document to reflect the new down_vote value?
Thanks.Andy






  


Field boosting in DataImportHandler transformer

2010-11-01 Thread Brad Kellett
It's not looking very promising, but is there something I'm missing to be able 
to apply a field boost from within a transformer in the DataImportHandler? Not 
a boost defined within the schema, but a boost applied to the field from the 
transformer itself.

I know you can do a document boost, but I can't see anything for a field boost.

~bck



Possible memory leaks with frequent replication

2010-11-01 Thread Simon Wistow
We've been trying to get a setup in which a slave replicates from a 
master every few seconds (ideally every second but currently we have it 
set at every 5s).

Everything seems to work fine until, periodically, the slave just stops 
responding from what looks like it running out of memory:

org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.OutOfMemoryError: Java heap space


(our monitoring seems to confirm this).

Looking around, my suspicion is that new Readers take longer to warm
than the gap between replications, and thus they just build up until all
memory is consumed (which, I suppose, isn't really memory 'leaking' per
se, more just resource consumption).

That said, we've tried turning off caching on the slave and that didn't 
help either so it's possible I'm wrong.

Is there anything we can do about this? I'm reluctant to increase the 
heap space since I suspect that will mean that there's just a longer 
period between failures. Might Zoie help here? Or should we just query 
against the Master?


Thanks,

Simon


Re: Re:Re: problem of solr replcation's speed

2010-11-01 Thread Lance Norskog
This is the time to replicate and open the new index, right? Opening a
new index can take a lot of time. How many autowarmers and queries are
there in the caches? Opening a new index re-runs all of the queries in
all of the caches.

2010/11/1 kafka0102 :
> I suspected my app has some sleeping op every 1s, so
> I changed ReplicationHandler.PACKET_SZ to 1024 * 1024*10; // 10MB
>
> and log result is like thus :
> [2010-11-01 
> 17:49:29][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
> 3184
> [2010-11-01 
> 17:49:32][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
> 3426
> [2010-11-01 
> 17:49:36][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
> 3359
> [2010-11-01 
> 17:49:39][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
> 3166
> [2010-11-01 
> 17:49:42][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
> 3513
> [2010-11-01 
> 17:49:46][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
> 3140
> [2010-11-01 
> 17:49:50][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
> 3471
>
> That means It's still slow like before. what's wrong with my env
>
> At 2010-11-01 17:30:32,kafka0102  wrote:
> I hacked SnapPuller to log the cost, and the log is like thus:
> [2010-11-01 
> 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
> 979
> [2010-11-01 
> 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
> [2010-11-01 
> 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
> [2010-11-01 
> 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
> 980
> [2010-11-01 
> 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
> [2010-11-01 
> 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 5
> [2010-11-01 
> 17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
> 979
>
>
> It's saying it cost about 1000ms for transfering 1M data every 2 times. I 
> used jetty as server and embeded solr in my app.I'm so confused.What I have 
> done wrong?
>
>
> At 2010-11-01 10:12:38,"Lance Norskog"  wrote:
>
>>If you are copying from an indexer while you are indexing new content,
>>this would cause contention for the disk head. Does indexing slow down
>>during this period?
>>
>>Lance
>>
>>2010/10/31 Peter Karich :
>>>  we have an identical-sized index and it takes ~5minutes
>>>
>>>
 It takes about one hour to replacate 6G index for solr in my env. But my
 network can transfer file about 10-20M/s using scp. So solr's http
 replcation is too slow, it's normal or I do something wrong?

>>>
>>>
>>
>>
>>
>>--
>>Lance Norskog
>>goks...@gmail.com
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Possible memory leaks with frequent replication

2010-11-01 Thread Lance Norskog
You should query against the indexer. I'm impressed that you got 5s
replication to work reliably.

On Mon, Nov 1, 2010 at 4:27 PM, Simon Wistow  wrote:
> We've been trying to get a setup in which a slave replicates from a
> master every few seconds (ideally every second but currently we have it
> set at every 5s).
>
> Everything seems to work fine until, periodically, the slave just stops
> responding from what looks like it running out of memory:
>
> org.apache.catalina.core.StandardWrapperValve invoke
> SEVERE: Servlet.service() for servlet jsp threw exception
> java.lang.OutOfMemoryError: Java heap space
>
>
> (our monitoring seems to confirm this).
>
> Looking around my suspicion is that it takes new Readers longer to warm
> than the gap between replication and thus they just build up until all
> memory is consumed (which, I suppose isn't really memory 'leaking' per
> se, more just resource consumption)
>
> That said, we've tried turning off caching on the slave and that didn't
> help either so it's possible I'm wrong.
>
> Is there anything we can do about this? I'm reluctant to increase the
> heap space since I suspect that will mean that there's just a longer
> period between failures. Might Zoie help here? Or should we just query
> against the Master?
>
>
> Thanks,
>
> Simon
>



-- 
Lance Norskog
goks...@gmail.com


Phrase Query Problem?

2010-11-01 Thread Tod
I have a number of fields I need to do an exact match on.  I've defined 
them as 'string' in my schema.xml.  I've noticed that I get back query 
results that don't have all of the words I'm using to search with.


For example:

q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json

This query should, with an exact match, return only one entry, but it returns
five, some of which don't have any of the fields I've specified.  I've tried
this both with and without quotes.


What could I be doing wrong?


Thanks - Tod



Re: Phrase Query Problem?

2010-11-01 Thread Ken Stanley
On Mon, Nov 1, 2010 at 10:26 PM, Tod  wrote:

> I have a number of fields I need to do an exact match on.  I've defined
> them as 'string' in my schema.xml.  I've noticed that I get back query
> results that don't have all of the words I'm using to search with.
>
> For example:
>
>
> q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json
>
> Should, with an exact match, return only one entry but it returns five some
> of which don't have any of the fields I've specified.  I've tried this both
> with and without quotes.
>
> What could I be doing wrong?
>
>
> Thanks - Tod
>
>

Tod,

Without knowing your exact field definition, my first guess would be your
first boolean query; because it is not quoted, what SOLR typically does is
to transform that type of query into something like (assuming your default
search field is "id"): (mykeywords:Compliance id:With id:Conduct id:Standards).
If you do (mykeywords:"Compliance+With+Conduct+Standards") you might see
different (better?) results. Otherwise, append &debugQuery=on to your URL and
you can see exactly how SOLR is parsing your query. If none of that helps, what
is your field definition in your schema.xml?
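
A minimal SolrJ sketch of the quoted form of that query (the host is an
assumption; the field name and values are taken from the question above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ExactMatchSketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // quote the whole value so the string field is matched as a single term
        SolrQuery q = new SolrQuery(
            "mykeywords:\"Compliance With Conduct Standards\" OR mykeywords:All OR mykeywords:ALL");
        System.out.println(server.query(q).getResults().getNumFound());
    }
}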

- Ken


RE: Ensuring stable timestamp ordering

2010-11-01 Thread Dennis Gearon
how about a timestamp with a GUID appended on the end of it?


Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sun, 10/31/10, Toke Eskildsen  wrote:

> From: Toke Eskildsen 
> Subject: RE: Ensuring stable timestamp ordering
> To: "solr-user@lucene.apache.org" 
> Date: Sunday, October 31, 2010, 12:18 PM
> Dennis Gearon [gear...@sbcglobal.net]
> wrote:
> > Even microseconds may not be enough on some really
> good, fast machine.
> 
> True, especially since the timer might not provide
> microsecond granularity although the returned value is in
> microseconds. However, an unique timestamp generator should
> keep track of the previous timestamp to guard against
> duplicates. Uniqueness can thus be guaranteed by waiting a
> bit or cheating on the decimals. With microseconds can
> produce 1 million timestamps / second. While I agree that
> duplicates within microseconds can occur on a fast machine,
> guaranteeing uniqueness by waiting should only be a
> performance problem when the number of duplicates is high.
> That's still a few years off, I think.
> 
> As Michael pointed out, using normal timestamps as unique
> IDs might not be such a great idea as it effectively locks
> index-building to a single JVM. By going the ugly route and
> expressing the time in nanos with only microsecond
> granularity and use the last 3 decimals for a builder ID
> this could be fixed. Not very clean though, as the contract
> is not expressed in the data themselves but must
> nevertheless be obeyed by all builders to avoid collisions.
> It also raises the question of who should assign the builder
> IDs. Not trivial in an anarchistic setup where new builders
> can be added by different controllers.
> 
> Pragmatists might use the PID % 1000 or similar for the
> builder ID as it does not require coordination, but this is
> where the Birthday Paradox hits us again: The chance of two
> processes on different machines having the same PID is 10%
> if just 15 machines are used (1% for 5 machines, 50% for 37
> machines). I don't like those odds and that's assuming that
> the PIDs will be randomly distributed, which they won't. It
> could be lowered by reserving more decimals for the salt,
> but then we would decrease the maximum amount of timestamps
> / second, still without guaranteed uniqueness. Guys a lot
> smarter than me has spend time on the unique ID problem and
> it's clearly not easy: Java's UUID takes up 128 bits.
> 
> - Toke
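
A minimal sketch of the kind of single-JVM generator described above,
guaranteeing uniqueness by nudging the value forward when the clock has not
moved (all names here are assumptions, not from the thread):

// Unique, monotonically increasing "microsecond" timestamps within one JVM.
// If the clock has not advanced since the last call, the previous value is
// bumped by one microsecond (the "cheating on the decimals" mentioned above).
public final class UniqueTimestamps {
    private static long last = 0L;

    public static synchronized long nextMicros() {
        long now = System.currentTimeMillis() * 1000L;  // ms clock expressed in microseconds
        last = Math.max(now, last + 1);
        return last;
    }
}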


Default file locking on trunk

2010-11-01 Thread Lance Norskog
Scenario:

Git update to current trunk (Nov 1, 2010).
Build all
Run solr in trunk/solr/example with 'java -jar start.jar'
Hit ^C
Jetty reports doing shutdown hook

There is now a data/index with a write lock file in it. I have not
attempted to read the index, let alone add something to it.
I start solr again, and it cannot open the index because of the write lock.

Why is there a write lock file when I have not tried to index anything?

-- 
Lance Norskog
goks...@gmail.com


RE: Ensuring stable timestamp ordering

2010-11-01 Thread Toke Eskildsen
Dennis Gearon [gear...@sbcglobal.net] wrote:
> how about a timrstamp with either a GUID appended on  the end of it?

Since long (8 bytes) is the largest atomic type supported by Java, this would 
have to be represented as a String (or rather BytesRef) and would take up 4 + 
32 bytes + 2 * 4 bytes from the internal BytesRef-attributes + some extra 
overhead. That is quite a large memory penalty to ensure unique timestamps.