I am curious why the field:* query walks the entire terms list. Could this be discovered from a field cache / docvalues?

steve
On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower <sbo...@alcyon.net> wrote:

Until I get the data re-fed: there was another field (a date field) that was present/absent exactly when the geo field was/was not, so I tried that field:* instead, and query times come down to 2.5s. Also, just removing that filter brings the query down to 30ms, so I'm very hopeful that with just a boolean I'll be down in that sub-100ms range.

steve

On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower <sbo...@alcyon.net> wrote:

Will give the boolean thing a shot... makes sense...

On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. <dsmi...@mitre.org> wrote:

I see the problem: it's +pp:*. It may look innocent, but it's a performance killer. What you're telling Lucene to do is iterate over *every* term in the index to find all documents that have this data. Most fields are pretty slow to do that, and Lucene/Solr does not have any kind of cache for this. Instead, you should index a new boolean field indicating whether or not 'pp' is populated and then do a simple true check against that field. Another approach you could take right now, without reindexing, is to simplify the last two clauses of your three-clause boolean query by using the "IsDisjointTo" predicate. But unfortunately Lucene doesn't have a generic filter cache capability, so this predicate has no place to cache the whole-world query it runs internally (each and every time it's used); it will be slower than the boolean field I suggested you add.

Nevermind on LatLonType; it doesn't support JTS/polygons. There is something close called SpatialPointVectorFieldType that could be modified trivially, but it doesn't support it now.

~ David
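For illustration, a minimal sketch of the presence-flag approach described above. The has_pp field name and the "boolean" fieldType are placeholders (not from this thread), and the flag would have to be set at index time:

<!-- schema.xml: hypothetical flag field, set to true whenever pp is populated -->
<field name="has_pp" type="boolean" indexed="true" stored="false"/>

With that in place, the expensive +pp:* clause becomes a cheap term lookup:

fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND has_pp:true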
On 7/30/13 11:32 AM, "Steven Bower" <sbo...@alcyon.net> wrote:

#1 Here is my query:

sort=vid asc
start=0
rows=1000
defType=edismax
q=*:*
fq=recordType:"xxx"
fq=vt:"X12B" AND
fq=(cls:"3" OR cls:"8")
fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48 OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33
OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44 OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87
OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94 OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67
OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13 OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99
OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41 OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98
OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28 OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10
OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58 OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56
OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10 OR vid:96XXX54)
fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND +pp:*

Basically: looking for a set of records by "vid", then checking whether a record's gp is in one polygon and its pp is not in another (and it has a pp). Essentially looking to see if a record moved between two polygons (gp=current, pp=previous) during a time period.

#2 Yes on JTS (unless, from my query above, I don't actually need it); however, this is only an initial use case and I suspect we'll need more complex stuff in the future.

#3 The data is distributed globally, but along generally fixed paths and then clustering around certain areas. For example, the polygon above has about 11k points (with no date filtering). So basically some areas will be very dense and most areas not; the majority of searches will be around the dense areas.

#4 It's very likely to be less than 1M results (with filters). Is there any functionality loss with LatLonType fields?

Thanks,

steve

On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) <dsmi...@mitre.org> wrote:

Steve,
(1) Can you give a specific example of how you are specifying the spatial query? I'm looking to ensure you are not using "IsWithin", which is not meant for point data. If your query shape is a circle or the bounding box of a circle, you should use the geofilt query parser; otherwise use the quirky syntax that allows you to specify the spatial predicate with "Intersects".
(2) Do you actually need JTS? i.e. are you using polygons, etc.?
(3) How "dense" would you estimate the data is at the 50m resolution you've configured? If it's very dense, then I'll tell you how to raise the "prefix grid scan level" to a number closer to max-levels.
(4) Do all of your searches find less than a million points, considering all filters? If so, then it's worth comparing the results with LatLonType.

~ David Smiley
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
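For reference, the two query styles contrasted in (1) look roughly like this; the field name, point, and distance are placeholder values (geofilt takes pt as lat,lon and d in km):

fq={!geofilt sfield=geopoint pt=28.5,49.5 d=100}
fq=geopoint:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))"

And the "prefix grid scan level" in (3) is, if I recall correctly, exposed on the RPT field type as a prefixGridScanLevel attribute. A sketch only; the value shown is a placeholder to be tuned toward the tree's max levels for dense data:

<!-- sketch: prefixGridScanLevel value is a guess; tune it for your data -->
<fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
           distErrPct="0.025" maxDistErr="0.00045"
           prefixGridScanLevel="7"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           units="degrees"/>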
Steven Bower wrote:

@Erick it is a lot of hardware, but basically I'm trying to create a "best case scenario" to take hardware out of the question. Will try increasing heap size tomorrow; I haven't seen it get close to the max heap size yet, but it's worth trying.

Note that these queries look something like:

q=*:*
fq=[date range]
fq=geo query

On the fq for the geo query I've added {!cache=false} to prevent it from ending up in the filter cache; once it's in the filter cache, queries come back in 10-20ms. For my use case I need the first unique geo search query to come back in a more reasonable time, so I am currently ignoring the cache.

@Bill will look into that; I'm not certain it will support the particular queries that are being executed, but I'll investigate.

steve
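Applied to the geo clause, the {!cache=false} local param mentioned above would look something like this (polygon taken from the query earlier in the thread):

fq={!cache=false}gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))"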
On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson <erickerickson@> wrote:

This is very strange. I'd expect slow queries on the first few queries while these caches were warmed, but after that I'd expect things to be quite fast.

For a 12G index and 256G RAM, you have on the surface a LOT of hardware to throw at this problem. You can _try_ giving the JVM, say, 18G, but that really shouldn't be a big issue; your index files should be MMapped.

Let's try the crude thing first and give the JVM more memory.

FWIW,
Erick

On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower <smb-apache@> wrote:

I've been doing some performance analysis of a spatial search use case I'm implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher than I'd like them to be, and I'm hoping people may have some suggestions for how to optimize further.

Here are the specs of what I'm doing now:

Machine:
- 16 cores @ 2.8GHz
- 256GB RAM
- 1TB (RAID 1+0 on 10 SSD)

Content:
- 45M docs (not very big, only a few fields with no large textual content)
- 1 geo field (using config below)
- index is 12GB
- 1 shard
- Using MMapDirectory

Field config:

<fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
           distErrPct="0.025" maxDistErr="0.00045"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           units="degrees"/>

<field name="geopoint" indexed="true" multiValued="false"
       required="false" stored="true" type="geo"/>

What I've figured out so far:

- Most of my time (98%) is being spent in java.nio.Bits.copyToByteArray(long,Object,long,long), which is being driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(), which from what I gather is basically reading terms from the .tim file in blocks.

- I moved from Java 1.6 to 1.7 based upon what I read here: http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/ and it definitely had some positive impact (I haven't been able to measure this independently yet).

- I changed maxDistErr from 0.000009 (which is 1m precision, per the docs) to 0.00045 (50m precision).

- It looks to me that the .tim files are being memory mapped fully (i.e. they show up in pmap output); the virtual size of the JVM is ~18GB (heap is 6GB).

- I've optimized the index, but this doesn't have a dramatic impact on performance.

Changing the precision and the JVM upgrade yielded a drop from ~18s avg query time to ~9s avg query time. This is fantastic, but I want to get this down into the 1-2 second range.

At this point it seems that I am basically bottlenecked on copying memory out of the mapped .tim file, which leads me to think that the only solution to my problem would be to read less data or somehow read it more efficiently.

If anyone has any suggestions of where to go with this, I'd love to know.

thanks,

steve