Re: Attention Solr 4.0 SolrCloud users

2012-12-08 Thread Jamie Johnson
thanks for the info.  we were looking to move to a stable release soon (we
are on an old nightly build from April!).  Has this issue existed since
then?  Do we have an idea when solr 4.1 will be made available?  I am just
trying to get an idea if we should wait or not.


On Thu, Dec 6, 2012 at 9:11 PM, Mark Miller  wrote:

> I should have sent this some time ago:
>
> https://issues.apache.org/jira/browse/SOLR-3940 "Rejoining the leader
> election incorrectly triggers the code path for a fresh cluster start
> rather than fail over."
>
> The above is a somewhat ugly bug.
>
> It means that if you are playing around with recovery and you kill a
> replica in a shard, it will take 3 minutes before a new leader takes over.
>
> This will be fixed in the upcoming 4.1 release (And has been fixed on 4x
> since early October).
>
> This wait is only meant for cluster startup. The idea is that you might
> introduce some random, old, out of date shard and then start up your
> cluster - you don't want that shard to be a leader - so we wait around for
> all known shards to startup so they can all participate in the initial
> leader election and the best one can be chosen. It's meant as a protective
> measure against a fairly unlikely event. But it's kicking in when it
> shouldn't.
>
> You can just accept the 3 minute wait, or you can lower the wait from 3
> minutes (to like 10 seconds or to 0 seconds - just avoid the scenario I
> mention above if you do).
>
> You can set the wait time in solr.xml by adding the attribute
> leaderVoteWait={whatever milliseconds} to the cores node.
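For reference, a minimal sketch of how that solr.xml fragment might look (the
10-second value and core details below are illustrative, not canonical):

<cores adminPath="/admin/cores" leaderVoteWait="10000">
  <core name="collection1" instanceDir="collection1" />
</cores>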
>
> Sorry about this - completely my fault.
>
> - Mark


Re: Attention Solr 4.0 SolrCloud users

2012-12-08 Thread Mark Miller
Hey Jamie - long time, no see.

On Dec 8, 2012, at 5:19 AM, Jamie Johnson  wrote:

> thanks for the info.  we were looking to move to a stable release soon (we
> are on an old nightly build from April!).  Has this issue existed since
> then?  

It was introduced shortly before 4.0 was released, so no, I don't think so.

> Do we have an idea when solr 4.1 will be made available?  I am just
> trying to get an idea if we should wait or not.

I hope very, very soon…just have to herd a few cats…

- Mark

> 
> 
> On Thu, Dec 6, 2012 at 9:11 PM, Mark Miller  wrote:
> 
>> I should have sent this some time ago:
>> 
>> https://issues.apache.org/jira/browse/SOLR-3940 "Rejoining the leader
>> election incorrectly triggers the code path for a fresh cluster start
>> rather than fail over."
>> 
>> The above is a somewhat ugly bug.
>> 
>> It means that if you are playing around with recovery and you kill a
>> replica in a shard, it will take 3 minutes before a new leader takes over.
>> 
>> This will be fixed in the upcoming 4.1 release (And has been fixed on 4x
>> since early October).
>> 
>> This wait is only meant for cluster startup. The idea is that you might
>> introduce some random, old, out of date shard and then start up your
>> cluster - you don't want that shard to be a leader - so we wait around for
>> all known shards to startup so they can all participate in the initial
>> leader election and the best one can be chosen. It's meant as a protective
>> measure against a fairly unlikely event. But it's kicking in when it
>> shouldn't.
>> 
>> You can just accept the 3 minute wait, or you can lower the wait from 3
>> minutes (to like 10 seconds or to 0 seconds - just avoid the scenario I
>> mention above if you do).
>> 
>> You can set the wait time in solr.xml by adding the attribute
>> leaderVoteWait={whatever milliseconds} to the cores node.
>> 
>> Sorry about this - completely my fault.
>> 
>> - Mark



Re: star searches with high page number requests taking long times

2012-12-08 Thread Jack Krupansky
What exactly is the common practice - is there a free, downloadable search 
component that does that or at least a "blueprint" for "recommended best 
practice"? What limit is common? (I know Google limits you to the top 1,000 
results.)


-- Jack Krupansky

-Original Message- 
From: Otis Gospodnetic

Sent: Saturday, December 08, 2012 7:25 AM
To: solr-user@lucene.apache.org
Subject: Re: star searches with high page number requests taking long times

Hi Robert,

You should just prevent deep paging. Humans with wallets don't do that, so
you will not lose anything by doing that. It's common practice.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Dec 7, 2012 8:10 PM, "Petersen, Robert"  wrote:


Hi guys,


Sometimes we get a bot crawling our search function on our retail web
site.  The ebay crawler loves to do this (Request.UserAgent: Terapeakbot).
 They just do a star search and then iterate through page after page. I've
noticed that when they get to higher page numbers like page 9000, the
searches are taking more than 20 seconds.  Is this expected behavior?
 We're requesting standard facets with the search as well as incorporating
boosting by function query.  Our index is almost 15 million docs now and
we're on Solr 3.6.1. This isn't causing any errors to occur at the Solr
layer, but our web layer times out the search after 20 seconds and logs the
exception.



Thanks

Robi






Modeling openinghours using multipoints

2012-12-08 Thread britske
Hi all, 

Over a year ago I posted a use case to the (in this context familiar) issue
SOLR-2155 about modelling opening hours using multivalued points. 

https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13114839&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13114839

David (Smiley) gave two possible solutions that would work, but I'm
wondering if the latest advancements in spatial search have made a more
straightforward implementation possible. 

The crux: 
 - A venue can have multiple opening hours (depending on day of week, special
festivity days, and sometimes even multiple timeslots per day) 
 - queries like the following should be possible: "which venues are open at
least for the following timespan: [NOW, NOW+3h]" or [this Monday 6h, this
Monday 11pm] 
 - no need to search in the past. 

To me, such an [open,close] timespan could be nicely modelled as a point;
thus all opening hours of a venue could be defined as multiple points
(multivalued points, multipoint, shape - not sure of the recent nomenclature) 

In the open/close domain the general query would be: 
Given a user defined query Q(open,close): return all venues that have a
timespan T(open,close) (out of many timespans) for which the following
holds: 
q.open <= T.open AND T.close <=q.close 

Mapping 'open' to latitude and 'close' to longitude results in: 

Given a user defined point X, return all docs that have a point P defined
(out of many points) for which the following holds: 
X.latitude <= P.latitude AND P.longitude <=X.longitude 

The question: is such a query on multipoints now doable out-of-the-box with
spatial4j (or any other supported plugin, for that matter)? 

Any help highly appreciated! 

Kind regards, 
Geert-Jan. 

Oh, btw: the translation function becomes easy since I don't need to search
dates in the past. Moreover, a reindex takes place every night, meaning
today 0AM can be defined as 0. With a granularity of 15 minutes and wanting
to search 100 days ahead, the transform simply maps 9600 intervals
(100*24*4) for both open and close onto [-90,90] and [0,180] respectively. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Modeling-openinghours-using-multipoints-tp4025336.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Attention Solr 4.0 SolrCloud users

2012-12-08 Thread Jamie Johnson
Yes, been off the radar for some time - a testament to just how well
SolrCloud works even in its alpha state!

Glad to hear that it should be soon; we are hoping to move to a stable
version of Solr for our next release at the end of February, so anything in
January should give us enough time to react.  Appreciate the information -
hope all is well!


On Sat, Dec 8, 2012 at 9:25 AM, Mark Miller  wrote:

> Hey Jamie - long time, no see.
>
> On Dec 8, 2012, at 5:19 AM, Jamie Johnson  wrote:
>
> > thanks for the info.  we were looking to move to a stable release soon
> (we
> > are on an old nightly build from April!).  Has this issue existed since
> > then?
>
> It was introduced shortly before 4.0 was released, so no, I don't think so.
>
> > Do we have an idea when solr 4.1 will be made available?  I am just
> > trying to get an idea if we should wait or not.
>
> I hope very, very soon…just have to herd a few cats…
>
> - Mark
>
> >
> >
> > On Thu, Dec 6, 2012 at 9:11 PM, Mark Miller 
> wrote:
> >
> >> I should have sent this some time ago:
> >>
> >> https://issues.apache.org/jira/browse/SOLR-3940 "Rejoining the leader
> >> election incorrectly triggers the code path for a fresh cluster start
> >> rather than fail over."
> >>
> >> The above is a somewhat ugly bug.
> >>
> >> It means that if you are playing around with recovery and you kill a
> >> replica in a shard, it will take 3 minutes before a new leader takes
> over.
> >>
> >> This will be fixed in the upcoming 4.1 release (And has been fixed on 4x
> >> since early October).
> >>
> >> This wait is only meant for cluster startup. The idea is that you might
> >> introduce some random, old, out of date shard and then start up your
> >> cluster - you don't want that shard to be a leader - so we wait around
> for
> >> all known shards to startup so they can all participate in the initial
> >> leader election and the best one can be chosen. It's meant as a
> protective
> >> measure against a fairly unlikely event. But it's kicking in when it
> >> shouldn't.
> >>
> >> You can just accept the 3 minute wait, or you can lower the wait from 3
> >> minutes (to like 10 seconds or to 0 seconds - just avoid the scenario I
> >> mention above if you do).
> >>
> >> You can set the wait time in solr.xml by adding the attribute
> >> leaderVoteWait={whatever milliseconds} to the cores node.
> >>
> >> Sorry about this - completely my fault.
> >>
> >> - Mark
>
>


Re: stress testing Solr 4.x

2012-12-08 Thread Mark Miller
Hmm…I've tried to replicate what looked like a bug from your report (3 Solr 
servers stop/start ), but on 5x it works no problem for me. It shouldn't be any 
different on 4x, but I'll try that next.

In terms of starting up Solr without a working ZooKeeper ensemble - it won't 
work currently. Cores won't be able to register with ZooKeeper and will fail 
loading. It would probably be nicer to come up in search only mode and keep 
trying to reconnect to zookeeper - file a JIRA issue if you are interested.

On the zk data dir, see 
http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup
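For example, ZooKeeper 3.4+ can purge old snapshots and transaction logs
automatically via two zoo.cfg settings (the values here are just illustrative):

autopurge.snapRetainCount=3
autopurge.purgeInterval=24

Older releases have to run the bundled zkCleanup.sh (or the PurgeTxnLog class)
from cron instead.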

- Mark

On Dec 7, 2012, at 10:22 PM, Mark Miller  wrote:

> Hey, I'll try and answer this tomorrow.
> 
> There is definitely an unreported bug in there that needs to be fixed for 
> the restart-all-nodes case.
> 
> Also, a 404 generally happens when Jetty is starting or stopping - there are 
> points where 404s can be returned. I'm not sure why else you'd see one. 
> Generally we retry when that happens.
> 
> - Mark
> 
> On Dec 7, 2012, at 1:07 PM, Alain Rogister  wrote:
> 
>> I am reporting the results of my stress tests against Solr 4.x. As I was
>> getting many error conditions with 4.0, I switched to the 4.1 trunk in the
>> hope that some of the issues would be fixed already. Here is my setup:
>> 
>> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I realize
>> this is not representative of a production environment but it's a fine way
>> to find out what happens under resource-constrained conditions.
>> - 3 Solr servers, 3 cores (2 of which are very small, the third one has 410
>> MB of data)
>> - single shard
>> - 3 Zookeeper instances
>> - HAProxy load balancing requests across Solr servers
>> - JMeter or ApacheBench running the tests : 5 thread pools of 20 threads
>> each, sending search requests continuously (no updates)
>> 
>> In nominal conditions, it all works fine, i.e. it can process a million
>> requests, maxing out the CPUs at all times, without experiencing nasty
>> failures. There are errors in the logs about replication failures though;
>> they should be benign in this case as no updates are taking place but it's
>> hard to tell what is going on exactly. Example:
>> 
>> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr
>> exception talking to
>> http://192.168.0.101:8985/solr/adressage/, failed
>> org.apache.solr.common.SolrException: Server at
>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
>> message:Not Found
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>> at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:722)
>> 
>> Then I simulated various failure scenarios:
>> 
>> - 1 Solr server stop/start
>> - 2 Solr servers stop/start
>> - 3 Solr servers stop/start: it seems that in this case, the Solr servers
>> *cannot* be restarted: more exactly, the restarted server will consider
>> that it is number 1 out of 4 and wait for the other 3 to come up. The only
>> way out is to stop it again, then stop all Zookeeper instances *and* clean
>> up their zkdata directories, start them, then start the Solr servers.
>> 
>> I noticed that these zkdata directories had grown to 200 MB after a while.
>> What exactly is in there besides the configuration data? Does it stop
>> growing?
>> 
>> Then I tried this:
>> 
>> - kill 1 Zookeeper process
>> - kill 2 Zookeeper processes
>> - stop/start 1 Solr server
>> 
>> When doing this, I experienced (many times) situations where the Solr
>> servers could not reconnect and threw scary exceptions. The only way out
>> was to restart the whole cluster.
>> 
>> Q: when, if ever, is one supposed to clean up the zkdata directories?
>> 
>> Here are the errors I found in the logs. It seems that some of them have
>> been reported in JIRA but 4.1-trunk seems to experience basically the same
>> issues as 4.0 in my test scenarios.
>> 
>> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
>> WARNING: PeerSync: core=cachede url=http://192.168.0.101:8983/solr
>> couldn't connect to
>>

Re: Which fields matched?

2012-12-08 Thread Mikhail Khludnev
Jeff,
explain() algorithm is definitely too slow to be used at search time. There
is an approach which I'm aware of - watch the scorers during the search:
if a scorer matches some doc, then _at some moment_ scorer.docID()==docNum.
My team successfully implemented such a Match Spotting algorithm; it performs
quite well, and provides info like http://goo.gl/7vgrB
The problem with this algorithm is that it's tightly coupled with low-level
scorer behavior; scorers tend to behave counter-intuitively sometimes, and
that behavior changes due to performance optimizations in the Lucene core.
https://issues.apache.org/jira/browse/LUCENE-1999 sounds almost the same,
but I never looked into the source.
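For reference, here's a rough sketch of that idea against the Lucene 4.x APIs
(the class name is mine; note that not every scorer exposes its children, and
the low-level-coupling caveat above applies):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Per collected hit, walk the scorer tree and record which leaf scorers are
// positioned on the current doc, i.e. which subqueries actually matched it.
public class MatchSpottingCollector extends Collector {
  private Scorer scorer;
  private int docBase;
  public final Map<Integer, List<String>> matchedQueries =
      new HashMap<Integer, List<String>>();

  @Override
  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  @Override
  public void setNextReader(AtomicReaderContext context) {
    this.docBase = context.docBase;
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    // In-order scoring keeps sub-scorers positioned on the doc being collected.
    return false;
  }

  @Override
  public void collect(int doc) throws IOException {
    List<String> matches = new ArrayList<String>();
    walk(scorer, doc, matches);
    matchedQueries.put(docBase + doc, matches);
  }

  // A leaf scorer whose docID() equals the current doc matched this doc.
  private void walk(Scorer s, int doc, List<String> out) {
    Collection<Scorer.ChildScorer> children = s.getChildren();
    if (children.isEmpty()) {
      if (s.docID() == doc) {
        out.add(s.getWeight().getQuery().toString());
      }
      return;
    }
    for (Scorer.ChildScorer child : children) {
      walk(child.child, doc, out);
    }
  }
}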


On Fri, Dec 7, 2012 at 11:00 PM, Jeff Wartes  wrote:

> Thanks, I did start to dig into how DebugComponent does its thing a
> little, and I'm not all the way down the rabbit hole yet, but the lucene
> indexSearcher's explain() method has this comment:
>
> "This is intended to be used in developing Similarity implementations,
> and, for good performance, should not be displayed with every hit.
> Computing an explanation is as expensive as executing the query over the
> entire index."
>
> Which makes me wonder if I'd get almost all of the debugQuery=true
> performance penalty anyway if I try to do as you suggest.
>
>
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Friday, December 07, 2012 10:47 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Which fields matched?
>
> The debugQuery "explain" is simply a text display of what Lucene has
> already calculated. As such, you could do a custom search component that
> gets the non-text Lucene "Explanation" object for the query and then
> traverse it to get your matched field list without all the text. No parsing
> would be required, but the Explanation structure could get messy.
>
> -- Jack Krupansky
>
> -Original Message-
> From: Jeff Wartes
> Sent: Friday, December 07, 2012 11:59 AM
> To: solr-user@lucene.apache.org
> Subject: Which fields matched?
>
>
> If I have an arbitrarily complex query that uses ORs, something like:
> q=(simple_fieldtype:foo OR complex_fieldtype:foo) AND
> (another_simple_fieldtype:bar OR another_complex_fieldtype:bar)
>
> I want to know which fields actually contributed to the match for each
> document returned. Something like:
> docID=1,
> fields_matched=simple_fieldtype,complex_fieldtype,another_complex_fieldtype
> docID=2, fields_matched=simple_fieldtype,another_complex_fieldtype
>
>
> My basic use case is that I have several copyField'ed variations on the
> same
> data (using different complex FieldTypes), and I want to know which
> variations contributed to the document so I can conclude things like "Well,
> this document matched the field with the SynonymFilterFactory, but not the
> one without, so this particular document must've been a synonym match."
>
> I know you could probably lift this from debugQuery output, but that's a
> non-starter due to parsing complexity and query performance impact.
> I think you could edge into some of this using the HighlightComponent
> output, but that's a non-starter because it requires fields be stored=true.
> Most of my fieldTypes are intended solely for indexing/search, and make no
> sense from a stored/retrieval standpoint. And to be clear, I really don't
> care about which terms matched anyway, only which fields.
>
> If there's an easy way to get this, I'd love to hear it. Otherwise, I'm
> mostly looking for a head start on where to go looking for this data so I
> can add my own Component or something - assuming the data is even available
> in the solr layer?
>
> Thanks.
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: star searches with high page number requests taking long times

2012-12-08 Thread Otis Gospodnetic
It is common practice not to allow drilling deep into search results.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Dec 8, 2012 10:27 AM, "Jack Krupansky"  wrote:

> What exactly is the common practice - is there a free, downloadable search
> component that does that or at least a "blueprint" for "recommended best
> practice"? What limit is common? (I know Google limits you to the top 1,000
> results.)
>
> -- Jack Krupansky
>
> -Original Message- From: Otis Gospodnetic
> Sent: Saturday, December 08, 2012 7:25 AM
> To: solr-user@lucene.apache.org
> Subject: Re: star searches with high page number requests taking long times
>
> Hi Robert,
>
> You should just prevent deep paging. Humans with wallets don't do that, so
> you will not lose anything by doing that. It's common practice.
>
> Otis
> --
> SOLR Performance Monitoring - http://sematext.com/spm
> On Dec 7, 2012 8:10 PM, "Petersen, Robert"  wrote:
>
>  Hi guys,
>>
>>
>> Sometimes we get a bot crawling our search function on our retail web
>> site.  The ebay crawler loves to do this (Request.UserAgent: Terapeakbot).
>>  They just do a star search and then iterate through page after page. I've
>> noticed that when they get to higher page numbers like page 9000, the
>> searches are taking more than 20 seconds.  Is this expected behavior?
>>  We're requesting standard facets with the search as well as incorporating
>> boosting by function query.  Our index is almost 15 million docs now and
>> we're on Solr 3.6.1, this isn't causing any errors to occur at the solr
>> layer but our web layer times out the search after 20 seconds and logs the
>> exception.
>>
>>
>>
>> Thanks
>>
>> Robi
>>
>>
>>
>


Re: Modeling openinghours using multipoints

2012-12-08 Thread David Smiley (@MITRE.org)
Hello again Geert-Jan!

What you're trying to do is indeed possible with Solr 4 out of the box. 
Other terminology people use for this is multi-value time duration.  This
creative solution is a pure application of spatial without the geospatial
notion -- we're not using an earth or other sphere model -- it's a flat
plane.  So no need to make reference to longitude & latitude, it's x & y.

I would put opening time into x, and closing time into y.  To express a
point, use "x y" (x space y), and supply this as a string to your
SpatialRecursivePrefixTreeFieldType based field for indexing.  You can give
it multiple values and it will work correctly; this is one of RPT's main
features that set it apart from Solr 3 spatial.  To query for a business
that is open during at least some part of a given time duration, say 6-8
o'clock, the query would look like openDuration:"Intersects(minX minY maxX
maxY)"  and put 0 or minX (always), 6 for minY (start time), 8 for maxX (end
time), and the largest possible value for maxY.  You wouldn't actually use 6
& 8, you'd use the number of 15 minute intervals since your epoch for this
equivalent time span.

You'll need to configure the field correctly: geo="false" worldBounds="0 0
maxTime maxTime" substituting an appropriate value for maxTime based on your
unit of time (number of 15 minute intervals you need) and distErrPct="0"
(full precision).
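A sketch of how that might look (field and type names here are illustrative,
and 9600 stands in for your maxTime of 15-minute intervals):

<fieldType name="timeRange" class="solr.SpatialRecursivePrefixTreeFieldType"
           geo="false" worldBounds="0 0 9600 9600"
           distErrPct="0" maxDistErr="1" />
<field name="openDuration" type="timeRange" indexed="true" stored="false"
       multiValued="true" />

Each open/close span is then indexed as a value like "24 32" (open at interval
24, close at interval 32), and the overlap query for the span [24, 32] would be
openDuration:"Intersects(0 24 32 9600)".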

Let me know how this works for you.

~ David



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Modeling-openinghours-using-multipoints-tp4025336p4025359.html
Sent from the Solr - User mailing list archive at Nabble.com.


Posting html with wget

2012-12-08 Thread CalinH
How can I post html web pages to Solr when downloading with wget? How might I
modify the following example so that saving and indexing happen
simultaneously?

wget -P /var/myserver/archive http://www.somesite/products.html

I can't spot an obvious example in the documentation and would be grateful
for any pointers.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Posting-html-with-wget-tp4025376.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Posting html with wget

2012-12-08 Thread Jack Krupansky

The Simple Post Tool that comes with Solr can do that as of Solr 4.0.

See:
https://issues.apache.org/jira/browse/SOLR-3691
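For example - flag names can vary a bit across 4.x versions, so check
java -jar post.jar -help - once wget has saved the pages you could post the
whole directory in one go (the localhost URL is a placeholder for your Solr):

java -Dauto=yes -Drecursive=yes -Durl=http://localhost:8983/solr/update \
     -jar post.jar /var/myserver/archive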

-- Jack Krupansky

-Original Message- 
From: CalinH

Sent: Saturday, December 08, 2012 2:40 PM
To: solr-user@lucene.apache.org
Subject: Posting html with wget

How can I post html web pages to Solr when downloading with wget? How might I
modify the following example so that saving and indexing happen
simultaneously?

wget -P /var/myserver/archive http://www.somesite/products.html

I can't spot an obvious example in the documentation and would be grateful
for any pointers.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Posting-html-with-wget-tp4025376.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: star searches with high page number requests taking long times

2012-12-08 Thread Petersen, Robert
We have a limit in place to restrict searches to the first ten thousand pages. 
I am going to try to get that number reduced!  I'm thinking even as low as page 
fifty should be the limit. What human (with a wallet) would even go as deep as 
fifty pages?  :)
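For what it's worth, a sketch of the kind of guard the web layer can apply
before a query ever reaches Solr (class and constant names are illustrative,
not from our actual code):

import org.apache.solr.client.solrj.SolrQuery;

public class SearchQueryBuilder {
  private static final int MAX_PAGE = 50;       // deepest page a user may request
  private static final int ROWS_PER_PAGE = 25;  // results per page

  // Clamp the requested page so bots cannot force expensive deep-paging
  // requests (e.g. page 9000) through to Solr.
  public SolrQuery build(String q, int requestedPage) {
    int page = Math.max(1, Math.min(requestedPage, MAX_PAGE));
    SolrQuery query = new SolrQuery(q);
    query.setStart((page - 1) * ROWS_PER_PAGE);
    query.setRows(ROWS_PER_PAGE);
    return query;
  }
}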

Sent from my iGizmo


On Dec 8, 2012, at 10:21 AM, "Otis Gospodnetic"  
wrote:

> It is common practice not to allow drilling deep into search results.
> 
> Otis
> --
> SOLR Performance Monitoring - http://sematext.com/spm
> On Dec 8, 2012 10:27 AM, "Jack Krupansky"  wrote:
> 
>> What exactly is the common practice - is there a free, downloadable search
>> component that does that or at least a "blueprint" for "recommended best
>> practice"? What limit is common? (I know Google limits you to the top 1,000
>> results.)
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Otis Gospodnetic
>> Sent: Saturday, December 08, 2012 7:25 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: star searches with high page number requests taking long times
>> 
>> Hi Robert,
>> 
>> You should just prevent deep paging. Humans with wallets don't do that, so
>> you will not lose anything by doing that. It's common practice.
>> 
>> Otis
>> --
>> SOLR Performance Monitoring - http://sematext.com/spm
>> On Dec 7, 2012 8:10 PM, "Petersen, Robert"  wrote:
>> 
>> Hi guys,
>>> 
>>> 
>>> Sometimes we get a bot crawling our search function on our retail web
>>> site.  The ebay crawler loves to do this (Request.UserAgent: Terapeakbot).
>>> They just do a star search and then iterate through page after page. I've
>>> noticed that when they get to higher page numbers like page 9000, the
>>> searches are taking more than 20 seconds.  Is this expected behavior?
>>> We're requesting standard facets with the search as well as incorporating
>>> boosting by function query.  Our index is almost 15 million docs now and
>>> we're on Solr 3.6.1, this isn't causing any errors to occur at the solr
>>> layer but our web layer times out the search after 20 seconds and logs the
>>> exception.
>>> 
>>> 
>>> 
>>> Thanks
>>> 
>>> Robi
>> 



Re: stress testing Solr 4.x

2012-12-08 Thread Mark Miller
After some more playing around on 5x I have duplicated the issue. I'll file a 
JIRA issue for you and fix it shortly.

- Mark

On Dec 8, 2012, at 8:43 AM, Mark Miller  wrote:

> Hmm…I've tried to replicate what looked like a bug from your report (3 Solr 
> servers stop/start ), but on 5x it works no problem for me. It shouldn't be 
> any different on 4x, but I'll try that next.
> 
> In terms of starting up Solr without a working ZooKeeper ensemble - it won't 
> work currently. Cores won't be able to register with ZooKeeper and will fail 
> loading. It would probably be nicer to come up in search only mode and keep 
> trying to reconnect to zookeeper - file a JIRA issue if you are interested.
> 
> On the zk data dir, see 
> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup
> 
> - Mark
> 
> On Dec 7, 2012, at 10:22 PM, Mark Miller  wrote:
> 
>> Hey, I'll try and answer this tomorrow.
>> 
>> There is definitely an unreported bug in there that needs to be fixed for 
>> the restart-all-nodes case.
>> 
>> Also, a 404 generally happens when Jetty is starting or stopping - there are 
>> points where 404s can be returned. I'm not sure why else you'd see one. 
>> Generally we retry when that happens.
>> 
>> - Mark
>> 
>> On Dec 7, 2012, at 1:07 PM, Alain Rogister  wrote:
>> 
>>> I am reporting the results of my stress tests against Solr 4.x. As I was
>>> getting many error conditions with 4.0, I switched to the 4.1 trunk in the
>>> hope that some of the issues would be fixed already. Here is my setup :
>>> 
>>> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I realize
>>> this is not representative of a production environment but it's a fine way
>>> to find out what happens under resource-constrained conditions.
>>> - 3 Solr servers, 3 cores (2 of which are very small, the third one has 410
>>> MB of data)
>>> - single shard
>>> - 3 Zookeeper instances
>>> - HAProxy load balancing requests across Solr servers
>>> - JMeter or ApacheBench running the tests : 5 thread pools of 20 threads
>>> each, sending search requests continuously (no updates)
>>> 
>>> In nominal conditions, it all works fine, i.e. it can process a million
>>> requests, maxing out the CPUs at all times, without experiencing nasty
>>> failures. There are errors in the logs about replication failures though;
>>> they should be benign in this case as no updates are taking place but it's
>>> hard to tell what is going on exactly. Example:
>>> 
>>> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr
>>> exception talking to
>>> http://192.168.0.101:8985/solr/adressage/, failed
>>> org.apache.solr.common.SolrException: Server at
>>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
>>> message:Not Found
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>> at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>> at java.lang.Thread.run(Thread.java:722)
>>> 
>>> Then I simulated various failure scenarios :
>>> 
>>> - 1 Solr server stop/start
>>> - 2 Solr servers stop/start
>>> - 3 Solr servers stop/start : it seems that in this case, the Solr servers
>>> *cannot* be restarted : more exactly, the restarted server will consider
>>> that it is number 1 out of 4 and wait for the other 3 to come up. The only
>>> way out is to stop it again, then stop all Zookeeper instances *and* clean
>>> up their zkdata directory, start them, then start the Solr servers.
>>> 
>>> I noticed that these zkdata directory had grown to 200 MB after a while.
>>> What exactly is in there besides the configuration data ? Does it stop
>>> growing ?
>>> 
>>> Then I tried this :
>>> 
>>> - kill 1 Zookeeper process
>>> - kill 2 Zookeeper processes
>>> - stop/start 1 Solr server
>>> 
>>> When doing this, I experienced (many times) situations where the Solr
>>> servers could not reconnect and threw scary exceptions. The only way out
>>> was to restart the whole cluster.
>>> 
>>> Q : when, if ever, is one supposed to clean up the zkdata directories ?
>>> 
>>> Here are the errors I found in the logs. It seems that some of them 

Re: stress testing Solr 4.x

2012-12-08 Thread Alain Rogister
Great, thanks Mark ! I'll test the fix and post my results.

Alain

On Saturday, December 8, 2012, Mark Miller wrote:

> After some more playing around on 5x I have duplicated the issue. I'll
> file a JIRA issue for you and fix it shortly.
>
> - Mark
>
> On Dec 8, 2012, at 8:43 AM, Mark Miller  wrote:
>
> > Hmm…I've tried to replicate what looked like a bug from your report (3
> Solr servers stop/start ), but on 5x it works no problem for me. It
> shouldn't be any different on 4x, but I'll try that next.
> >
> > In terms of starting up Solr without a working ZooKeeper ensemble - it
> won't work currently. Cores won't be able to register with ZooKeeper and
> will fail loading. It would probably be nicer to come up in search only
> mode and keep trying to reconnect to zookeeper - file a JIRA issue if you
> are interested.
> >
> > On the zk data dir, see
> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup
> >
> > - Mark
> >
> > On Dec 7, 2012, at 10:22 PM, Mark Miller  wrote:
> >
> >> Hey, I'll try and answer this tomorrow.
> >>
> >> There is definitely an unreported bug in there that needs to be fixed for
> the restart-all-nodes case.
> >>
> >> Also, a 404 generally happens when Jetty is starting or stopping - there
> are points where 404s can be returned. I'm not sure why else you'd see
> one. Generally we retry when that happens.
> >>
> >> - Mark
> >>
> >> On Dec 7, 2012, at 1:07 PM, Alain Rogister 
> wrote:
> >>
> >>> I am reporting the results of my stress tests against Solr 4.x. As I
> was
> >>> getting many error conditions with 4.0, I switched to the 4.1 trunk in
> the
> >>> hope that some of the issues would be fixed already. Here is my setup :
> >>>
> >>> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I
> realize
> >>> this is not representative of a production environment but it's a fine
> way
> >>> to find out what happens under resource-constrained conditions.
> >>> - 3 Solr servers, 3 cores (2 of which are very small, the third one
> has 410
> >>> MB of data)
> >>> - single shard
> >>> - 3 Zookeeper instances
> >>> - HAProxy load balancing requests across Solr servers
> >>> - JMeter or ApacheBench running the tests : 5 thread pools of 20
> threads
> >>> each, sending search requests continuously (no updates)
> >>>
> >>> In nominal conditions, it all works fine, i.e. it can process a million
> >>> requests, maxing out the CPUs at all times, without experiencing nasty
> >>> failures. There are errors in the logs about replication failures
> though;
> >>> they should be benign in this case as no updates are taking place but
> it's
> >>> hard to tell what is going on exactly. Example:
> >>>
> >>> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
> >>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr
> >>> exception talking to
> >>> http://192.168.0.101:8985/solr/adressage/, failed
> >>> org.apache.solr.common.SolrException: Server at
> >>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
> >>> message:Not Found
> >>> at
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> >>> at
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> at
> >>>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
> >>> at
> >>>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
> >>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>> at java.util.concurrent.FutureTask.run(FutureTask.


Re: Is there a way to round data when index, but still able to return original content?

2012-12-08 Thread Erick Erickson
Depends on whether the transformation is before or after the doc gets sent
to Solr. If you're changing the data before you give it to Solr, then you'd
have to have two fields, probably indexed=true and stored=false for the one
you search on, and indexed=false stored=true for the one you return to the
user.

This really doesn't take any more resources than using one field.
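For instance, the two-field setup might look like this in schema.xml (field
names and types are illustrative):

<field name="created_raw" type="date"  indexed="false" stored="true"  />
<field name="created_day" type="tdate" indexed="true"  stored="false" />

The client sends the original value to created_raw and the rounded value to
created_day, searches on created_day, and displays created_raw. Note that
copyField won't do the rounding for you - it copies the unmodified value.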

If you put some sort of (perhaps custom) filter in place, then the original
value would go in as stored and the altered value would get into the index,
and you could do both in the same field.

Best
Erick


On Sat, Dec 8, 2012 at 2:34 PM, jefferyyuan  wrote:

> Hi:
>
> I am wondering whether there is a way to round data when indexing, but still
> be able to return the original content?
>
> For example, for a date field: 2012-12-21T12:12:12Z - because when searching,
> users only care about the date part, I can round it to 2012-12-21T00:00:00Z
> when indexing it. This can reduce index size, as there will be fewer terms.
>
> But users still want to get the original content, so the result for a matched
> doc should return 2012-12-21T12:12:12Z, not 2012-12-21T00:00:00Z.
>
> This also applies to number and text field.
>
> Is there a way to do this in Solr?
>
> Thanks for you reply :)
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-there-a-way-to-round-data-when-index-but-still-able-to-return-original-content-tp4025405.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: star searches with high page number requests taking long times

2012-12-08 Thread Walter Underwood
I put in a 50 page limit when I was at Netflix.  --wunder

On Dec 8, 2012, at 2:26 PM, Petersen, Robert wrote:

> We have a limit in place to restrict searches to the first ten thousand 
> pages. I am going to try to get that number reduced!  I'm thinking even as 
> low as page fifty should be the limit. What human (with a wallet) would even 
> go as deep as fifty pages?  :)
> 
> Sent from my iGizmo
> 
> 
> On Dec 8, 2012, at 10:21 AM, "Otis Gospodnetic"  
> wrote:
> 
>> It is common practice not to allow drilling deep into search results.
>> 
>> Otis
>> --
>> SOLR Performance Monitoring - http://sematext.com/spm
>> On Dec 8, 2012 10:27 AM, "Jack Krupansky"  wrote:
>> 
>>> What exactly is the common practice - is there a free, downloadable search
>>> component that does that or at least a "blueprint" for "recommended best
>>> practice"? What limit is common? (I know Google limits you to the top 1,000
>>> results.)
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Otis Gospodnetic
>>> Sent: Saturday, December 08, 2012 7:25 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: star searches with high page number requests taking long times
>>> 
>>> Hi Robert,
>>> 
>>> You should just prevent deep paging. Humans with wallets don't do that, so
>>> you will not lose anything by doing that. It's common practice.
>>> 
>>> Otis
>>> --
>>> SOLR Performance Monitoring - http://sematext.com/spm
>>> On Dec 7, 2012 8:10 PM, "Petersen, Robert"  wrote:
>>> 
>>> Hi guys,
 
 
 Sometimes we get a bot crawling our search function on our retail web
 site.  The ebay crawler loves to do this (Request.UserAgent: Terapeakbot).
 They just do a star search and then iterate through page after page. I've
 noticed that when they get to higher page numbers like page 9000, the
 searches are taking more than 20 seconds.  Is this expected behavior?
 We're requesting standard facets with the search as well as incorporating
 boosting by function query.  Our index is almost 15 million docs now and
 we're on Solr 3.6.1, this isn't causing any errors to occur at the solr
 layer but our web layer times out the search after 20 seconds and logs the
 exception.
 
 
 
 Thanks
 
 Robi
>>> 
> 

--
Walter Underwood
wun...@wunderwood.org





Re: stress testing Solr 4.x

2012-12-08 Thread Mark Miller
No problem!

Here is the JIRA issue: https://issues.apache.org/jira/browse/SOLR-4158

- Mark

On Sat, Dec 8, 2012 at 6:03 PM, Alain Rogister  wrote:
> Great, thanks Mark ! I'll test the fix and post my results.
>
> Alain
>
> On Saturday, December 8, 2012, Mark Miller wrote:
>
>> After some more playing around on 5x I have duplicated the issue. I'll
>> file a JIRA issue for you and fix it shortly.
>>
>> - Mark
>>
>> On Dec 8, 2012, at 8:43 AM, Mark Miller  wrote:
>>
>> > Hmm…I've tried to replicate what looked like a bug from your report (3
>> Solr servers stop/start ), but on 5x it works no problem for me. It
>> shouldn't be any different on 4x, but I'll try that next.
>> >
>> > In terms of starting up Solr without a working ZooKeeper ensemble - it
>> won't work currently. Cores won't be able to register with ZooKeeper and
>> will fail loading. It would probably be nicer to come up in search only
>> mode and keep trying to reconnect to zookeeper - file a JIRA issue if you
>> are interested.
>> >
>> > On the zk data dir, see
>> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup
>> >
>> > - Mark
>> >
>> > On Dec 7, 2012, at 10:22 PM, Mark Miller  wrote:
>> >
>> >> Hey, I'll try and answer this tomorrow.
>> >>
>> >> There is definitely an unreported bug in there that needs to be fixed for
>> the restart-all-nodes case.
>> >>
>> >> Also, a 404 generally happens when Jetty is starting or stopping - there
>> are points where 404s can be returned. I'm not sure why else you'd see
>> one. Generally we retry when that happens.
>> >>
>> >> - Mark
>> >>
>> >> On Dec 7, 2012, at 1:07 PM, Alain Rogister 
>> wrote:
>> >>
>> >>> I am reporting the results of my stress tests against Solr 4.x. As I
>> was
>> >>> getting many error conditions with 4.0, I switched to the 4.1 trunk in
>> the
>> >>> hope that some of the issues would be fixed already. Here is my setup :
>> >>>
>> >>> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I
>> realize
>> >>> this is not representative of a production environment but it's a fine
>> way
>> >>> to find out what happens under resource-constrained conditions.
>> >>> - 3 Solr servers, 3 cores (2 of which are very small, the third one
>> has 410
>> >>> MB of data)
>> >>> - single shard
>> >>> - 3 Zookeeper instances
>> >>> - HAProxy load balancing requests across Solr servers
>> >>> - JMeter or ApacheBench running the tests : 5 thread pools of 20
>> threads
>> >>> each, sending search requests continuously (no updates)
>> >>>
>> >>> In nominal conditions, it all works fine, i.e. it can process a million
>> >>> requests, maxing out the CPUs at all times, without experiencing nasty
>> >>> failures. There are errors in the logs about replication failures
>> though;
>> >>> they should be benign in this case as no updates are taking place but
>> it's
>> >>> hard to tell what is going on exactly. Example:
>> >>>
>> >>> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
>> >>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr
>> >>> exception talking to
>> >>> http://192.168.0.101:8985/solr/adressage/, failed
>> >>> org.apache.solr.common.SolrException: Server at
>> >>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
>> >>> message:Not Found
>> >>> at
>> >>>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>> >>> at
>> >>>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>> at
>> >>>
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>> >>> at
>> >>>
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>> >>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>> at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>> at java.util.concurrent.FutureTask.run(FutureTask.



-- 
- Mark


Re: Which fields matched?

2012-12-08 Thread Paul Libbrecht
We've used LUCENE-1999 with some success in ActiveMath to find the language 
that was matched.

paul


Le 8 déc. 2012 à 10:09, Mikhail Khludnev a écrit :

> Jeff,
> explain() algorithm is definitely too slow to be used at search time. There
> is an approach which I'm aware of - watch the scorers during the search:
> if a scorer matches some doc, then _at some moment_ scorer.docID()==docNum.
> My team successfully implemented such a Match Spotting algorithm; it performs
> quite well, and provides info like http://goo.gl/7vgrB
> The problem with this algorithm is that it's tightly coupled with low-level
> scorer behavior; scorers tend to behave counter-intuitively sometimes, and
> that behavior changes due to performance optimizations in the Lucene core.
> https://issues.apache.org/jira/browse/LUCENE-1999 sounds almost the same,
> but I never looked into the source.
> 
> 
> On Fri, Dec 7, 2012 at 11:00 PM, Jeff Wartes  wrote:
> 
>> Thanks, I did start to dig into how DebugComponent does its thing a
>> little, and I'm not all the way down the rabbit hole yet, but the lucene
>> indexSearcher's explain() method has this comment:
>> 
>> "This is intended to be used in developing Similarity implementations,
>> and, for good performance, should not be displayed with every hit.
>> Computing an explanation is as expensive as executing the query over the
>> entire index."
>> 
>> Which makes me wonder if I'd get almost all of the debugQuery=true
>> performance penalty anyway if I try to do as you suggest.
>> 
>> 
>> -Original Message-
>> From: Jack Krupansky [mailto:j...@basetechnology.com]
>> Sent: Friday, December 07, 2012 10:47 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Which fields matched?
>> 
>> The debugQuery "explain" is simply a text display of what Lucene has
>> already calculated. As such, you could do a custom search component that
>> gets the non-text Lucene "Explanation" object for the query and then
>> traverse it to get your matched field list without all the text. No parsing
>> would be required, but the Explanation structure could get messy.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message-
>> From: Jeff Wartes
>> Sent: Friday, December 07, 2012 11:59 AM
>> To: solr-user@lucene.apache.org
>> Subject: Which fields matched?
>> 
>> 
>> If I have an arbitrarily complex query that uses ORs, something like:
>> q=(simple_fieldtype:foo OR complex_fieldtype:foo) AND
>> (another_simple_fieldtype:bar OR another_complex_fieldtype:bar)
>> 
>> I want to know which fields actually contributed to the match for each
>> document returned. Something like:
>> docID=1,
>> fields_matched=simple_fieldtype,complex_fieldtype,another_complex_fieldtype
>> docID=2, fields_matched=simple_fieldtype,another_complex_fieldtype
>> 
>> 
>> My basic use case is that I have several copyField'ed variations on the
>> same
>> data (using different complex FieldTypes), and I want to know which
>> variations contributed to the document so I can conclude things like "Well,
>> this document matched the field with the SynonymFilterFactory, but not the
>> one without, so this particular document must've been a synonym match."
>> 
>> I know you could probably lift this from debugQuery output, but that's a
>> non-starter due to parsing complexity and query performance impact.
>> I think you could edge into some of this using the HighlightComponent
>> output, but that's a non-starter because it requires fields be stored=true.
>> Most of my fieldTypes are intended solely for indexing/search, and make no
>> sense from a stored/retrieval standpoint. And to be clear, I really don't
>> care about which terms matched anyway, only which fields.
>> 
>> If there's an easy way to get this, I'd love to hear it. Otherwise, I'm
>> mostly looking for a head start on where to go looking for this data so I
>> can add my own Component or something - assuming the data is even available
>> in the solr layer?
>> 
>> Thanks.
>> 
>> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
> 
> 
> 



Re: Modeling openinghours using multipoints

2012-12-08 Thread David Smiley (@MITRE.org)
britske wrote
> That's seriously awesome!
> 
> Some change in the query though:
> You described: "To query for a business that is open during at least some
> part of a given time duration"
> I want "To query for a business that is open during at least the entire
> given time duration".
> 
> Feels like a small difference but probably isn't (I'm still wrapping my
> head on the intersect query I must admit)

So this would be a slightly different rectangle query.  Interestingly, you
simply swap the location in the rectangle where you put the start and end
time.  In summary:

Indexed span CONTAINS query span:
minX minY maxX maxY -> 0 end start *

Indexed span INTERSECTS (i.e. OVERLAPS) query span:
minX minY maxX maxY -> 0 start end *

Indexed span WITHIN query span:
minX minY maxX maxY -> start 0 * end

I'm using '*' here to denote the max possible value.  At some point I may
add that as a feature.
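Concretely, with 15-minute units, "open for the entire span 6:00-8:00 on day
one" (intervals 24 to 32) becomes openDuration:"Intersects(0 32 24 9600)" -
i.e. opening time <= 24 and closing time >= 32, with 9600 standing in for the
max value.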

That was a fun exercise!  I give you credit for prodding me in this direction
as I'm not sure if this use of spatial would have occurred to me otherwise.


britske wrote
> Moreover, any indication on performance? Should, say, 50,000 docs with
> about 100-200 points each (1 to 2 open-close spans per day) be ok? (I know
> 'your mileage may vary' etc. but just a guesstimate :)

You should have absolutely no problem.  The real clincher in your favor is
the fact that you only need 9600 discrete time values (so you said), not
Long.MAX_VALUE.  Using Long.MAX_VALUE would simply not be possible with the
current implementation because it's using Doubles, which have 52 bits of 
precision, not the 64 that would be required to be a complete substitute for 
any time/date.  Even given the 52 bits, a quad SpatialPrefixTree with
maxLevels="52" would probably not perform well or might fail; not sure. 
Eventually when I have time to work on an implementation that can be based
on a configurable number of grid cells (not unlike how you can configure
precisionStep on the Trie numeric fields), 52 should be no problem.

I'll have to remember to refer back to this email on the approach if I
create a field type that wraps this functionality.

~ David


britske wrote
> Again, this looks good!
> Geert-Jan
> 
> 2012/12/8 David Smiley (@MITRE.org) [via Lucene] <

> ml-node+s472066n4025359h19@.nabble

>>
> 
>> Hello again Geert-Jan!
>>
>> What you're trying to do is indeed possible with Solr 4 out of the box.
>>  Other terminology people use for this is multi-value time duration. 
>> This
>> creative solution is a pure application of spatial without the geospatial
>> notion -- we're not using an earth or other sphere model -- it's a flat
>> plane.  So no need to make reference to longitude & latitude, it's x & y.
>>
>> I would put opening time into x, and closing time into y.  To express a
>> point, use "x y" (x space y), and supply this as a string to your
>> SpatialRecursivePrefixTreeFieldType based field for indexing.  You can
>> give
>> it multiple values and it will work correctly; this is one of RPT's main
>> features that set it apart from Solr 3 spatial.  To query for a business
>> that is open during at least some part of a given time duration, say 6-8
>> o'clock, the query would look like openDuration:"Intersects(minX minY
>> maxX
>> maxY)"  and put 0 or minX (always), 6 for minY (start time), 8 for maxX
>> (end time), and the largest possible value for maxY.  You wouldn't
>> actually
>> use 6 & 8, you'd use the number of 15 minute intervals since your epoch
>> for
>> this equivalent time span.
>>
>> You'll need to configure the field correctly: geo="false" worldBounds="0
>> 0
>> maxTime maxTime" substituting an appropriate value for maxTime based on
>> your unit of time (number of 15 minute intervals you need) and
>> distErrPct="0" (full precision).
>>
>> Let me know how this works for you.
>>
>> ~ David
>>  Author:
>> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book





-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Modeling-openinghours-using-multipoints-tp4025336p4025434.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Modeling openinghours using multipoints

2012-12-08 Thread britske
Brilliant! Got some great ideas for this. Indeed, all sorts of use cases that 
use multiple temporal ranges could benefit. 

E.g., another guy on Stack Overflow asked me about this some days ago. He wants 
to model multiple temporary offers per product (free shipping for Christmas, a 
20% discount for Black Friday, etc.). All possible with this out of the box. 
Factor in 'offer category' in x and y as well for some extra powerful 
querying. 

Yup, I'm enthusiastic about it, which I'm sure you can tell :)

Thanks a lot David,

Cheers,
Geert-Jan 



Sent from my iPhone

On 9 dec. 2012, at 05:35, "David Smiley (@MITRE.org) [via Lucene]" 
 wrote:

> britske wrote
> That's seriously awesome! 
> 
> Some change in the query though: 
> You described: "To query for a business that is open during at least some 
> part of a given time duration" 
> I want "To query for a business that is open during at least the entire 
> given time duration". 
> 
> Feels like a small difference but probably isn't (I'm still wrapping my 
> head on the intersect query I must admit)
> So this would be a slightly different rectangle query.  Interestingly, you 
> simply swap the location in the rectangle where you put the start and end 
> time.  In summary: 
> 
> Indexed span CONTAINS query span: 
> minX minY maxX maxY -> 0 end start * 
> 
> Indexed span INTERSECTS (i.e. OVERLAPS) query span: 
> minX minY maxX maxY -> 0 start end * 
> 
> Indexed span WITHIN query span: 
> minX minY maxX maxY -> start 0 * end 
> 
> I'm using '*' here to denote the max possible value.  At some point I may add 
> that as a feature. 
> 
> That was a fun exercise!  I give you credit in prodding me in this direction 
> as I'm not sure if this use of spatial would have occurred to me otherwise. 
> 
> britske wrote
> Moreover, any indication on performance? Should, say, 50,000 docs with 
> about 100-200 points each (1 to 2 open-close spans per day) be ok? (I know 
> 'your mileage may vary' etc. but just a guesstimate :)
> You should have absolutely no problem.  The real clincher in your favor is 
> the fact that you only need 9600 discrete time values (so you said), not 
> Long.MAX_VALUE.  Using Long.MAX_VALUE would simply not be possible with the 
> current implementation because it's using Doubles which has 52 bits of 
> precision not the 64 that would be required to be a complete substitute for 
> any time/date.  Even given the 52 bits, a quad SpatialPrefixTree with 
> maxLevels="52" would probably not perform well or might fail; not sure.  
> Eventually when I have time to work on an implementation that can be based on 
> a configurable number of grid cells (not unlike how you can configure 
> precisionStep on the Trie numeric fields), 52 should be no problem. 
> 
> I'll have to remember to refer back to this email on the approach if I create 
> a field type that wraps this functionality. 
> 
> ~ David 
> 
> britske wrote
> Again, this looks good! 
> Geert-Jan 
> 
> 2012/12/8 David Smiley (@MITRE.org) [via Lucene] < 
> [hidden email]> 
> 
> > Hello again Geert-Jan! 
> > 
> > What you're trying to do is indeed possible with Solr 4 out of the box. 
> >  Other terminology people use for this is multi-value time duration.  This 
> > creative solution is a pure application of spatial without the geospatial 
> > notion -- we're not using an earth or other sphere model -- it's a flat 
> > plane.  So no need to make reference to longitude & latitude, it's x & y. 
> > 
> > I would put opening time into x, and closing time into y.  To express a 
> > point, use "x y" (x space y), and supply this as a string to your 
> > SpatialRecursivePrefixTreeFieldType based field for indexing.  You can give 
> > it multiple values and it will work correctly; this is one of RPT's main 
> > features that set it apart from Solr 3 spatial.  To query for a business 
> > that is open during at least some part of a given time duration, say 6-8 
> > o'clock, the query would look like openDuration:"Intersects(minX minY maxX 
> > maxY)"  and put 0 or minX (always), 6 for minY (start time), 8 for maxX 
> > (end time), and the largest possible value for maxY.  You wouldn't actually 
> > use 6 & 8, you'd use the number of 15 minute intervals since your epoch for 
> > this equivalent time span. 
> > 
> > You'll need to configure the field correctly: geo="false" worldBounds="0 0 
> > maxTime maxTime" substituting an appropriate value for maxTime based on 
> > your unit of time (number of 15 minute intervals you need) and 
> > distErrPct="0" (full precision). 
> > 
> > Let me know how this works for you. 
> > 
> > ~ David 
> >  Author: 
> > http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>  Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
> 
> 

Re: SolrCloud - Query performance degrades with multiple servers

2012-12-08 Thread sausarkar
Spoke too early - it seems that SolrCloud is still distributing queries to all
the servers even if numShards=1. We are seeing POST requests to all servers in
the cluster; please let me know what the solution is. Here is an example
(the variable isShard should be false in our case, as we have a single shard -
please help):

POST /solr/core0/select HTTP/1.1
Content-Charset: UTF-8
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrServer] 1.0
Content-Length: 991
Host: server1
Connection: Keep-Alive

lowercaseOperators=true&mm=70%&fl=EntityId&df=EntityId&q.op=AND&q.alt=*:*&qs=10&stopwords=true&defType=edismax&rows=3000&q=*:*&start=0&fsv=true&distrib=false&*isShard=true&*shard.url=*server1*:9090/solr/core0/|*server2*:9090/solr/core0/|*server3*:9090/solr/core0/&NOW=1354918880447&wt=javabin&version=2


Re: SolrCloud - Query performance degrades with multiple servers
Dec 06, 2012; 6:29pm — by   Mark Miller-3

On Dec 6, 2012, at 5:08 PM, sausarkar <[hidden email]> wrote: 

> We solved the issue by explicitly adding numShards=1 argument to the solr 
> start up script. Is this a bug? 

Sounds like it…perhaps related to SOLR-3971…not sure though. 

- Mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4025455.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - Query performance degrades with multiple servers

2012-12-08 Thread Mark Miller
If that's true, we will fix it for 4.1. I can look closer tomorrow. 

Mark

Sent from my iPhone

On Dec 9, 2012, at 2:04 AM, sausarkar  wrote:

> Spoke too early - it seems that SolrCloud is still distributing queries to all
> the servers even if numShards=1. We are seeing POST requests to all servers in
> the cluster; please let me know what the solution is. Here is an example
> (the variable isShard should be false in our case, as we have a single shard -
> please help):
> 
> POST /solr/core0/select HTTP/1.1
> Content-Charset: UTF-8
> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrServer] 1.0
> Content-Length: 991
> Host: server1
> Connection: Keep-Alive
> 
> lowercaseOperators=true&mm=70%&fl=EntityId&df=EntityId&q.op=AND&q.alt=*:*&qs=10&stopwords=true&defType=edismax&rows=3000&q=*:*&start=0&fsv=true&distrib=false&*isShard=true&*shard.url=*server1*:9090/solr/core0/|*server2*:9090/solr/core0/|*server3*:9090/solr/core0/&NOW=1354918880447&wt=javabin&version=2
> 
> 
> Re: SolrCloud - Query performance degrades with multiple servers
> Dec 06, 2012; 6:29pm — by   Mark Miller-3
> 
> On Dec 6, 2012, at 5:08 PM, sausarkar <[hidden email]> wrote: 
> 
>> We solved the issue by explicitly adding numShards=1 argument to the solr 
>> start up script. Is this a bug?
> 
> Sounds like it…perhaps related to SOLR-3971…not sure though. 
> 
> - Mark
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4025455.html
> Sent from the Solr - User mailing list archive at Nabble.com.