Steve,
The FieldCache and DocValues are irrelevant to this problem.  Solr's
FilterCache is relevant, and Lucene has no counterpart.  Perhaps it would
be cool if Solr could look for expensive field:* usages when parsing its
queries and re-write them to use the FilterCache.  That's quite doable, I
think.
I just created an issue for it:
https://issues.apache.org/jira/browse/SOLR-5093    but don't expect me to
work on it anytime soon ;-)


~ David

On 7/30/13 2:02 PM, "Steven Bower" <sbo...@alcyon.net> wrote:

>I am curious why the field:* walks the entire terms list.. could this be
>discovered from a field cache / docvalues?
>
>steve
>
>
>On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower <sbo...@alcyon.net> wrote:
>
>> Until I get the data re-fed: there was another field (a date field) that
>> was present when the geo field was and absent when it was not... I tried
>> that field:* instead and query times came down to 2.5s.. also, just
>> removing that filter brings the query down to 30ms.. so I'm very hopeful
>> that with just a boolean field I'll be down in that sub-100ms range..
>>
>> steve
>>
>>
>> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower <sbo...@alcyon.net> wrote:
>>
>>> Will give the boolean thing a shot... makes sense...
>>>
>>>
>>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W.
>>> <dsmi...@mitre.org> wrote:
>>>
>>>> I see the problem: it's +pp:*. It may look innocent but it's a
>>>> performance killer.  What you're telling Lucene to do is iterate over
>>>> *every* term in this index to find all documents that have this data.
>>>> Most fields are pretty slow to do that.  Lucene/Solr does not have some
>>>> kind of cache for this. Instead, you should index a new boolean field
>>>> indicating whether or not 'pp' is populated and then do a simple true
>>>> check against that field.  Another approach you could do right now
>>>> without reindexing is to simplify the last 2 clauses of your 3-clause
>>>> boolean query by using the "IsDisjointTo" predicate.  But unfortunately
>>>> Lucene doesn't have a generic filter cache capability and so this
>>>> predicate has no place to cache the whole-world query it does
>>>> internally (each and every time it's used), so it will be slower than
>>>> the boolean field I suggested you add.
>>>>
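>>>> Concretely, a minimal sketch of the two options (the "has_pp" field
>>>> name and the abbreviated polygons are illustrative, not from your
>>>> actual schema):
>>>>
>>>>   <field name="has_pp" type="boolean" indexed="true" stored="false"/>
>>>>
>>>> Index has_pp=true whenever pp is populated, then replace "AND +pp:*"
>>>> with a filter on has_pp:true.  Or, without reindexing, collapse the
>>>> NOT-Intersects and +pp:* clauses into the single predicate:
>>>>
>>>>   fq=gp:"Intersects(POLYGON((...)))" AND pp:"IsDisjointTo(POLYGON((...)))"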
>>>>
>>>> Never mind on LatLonType; it doesn't support JTS/Polygons.  There is
>>>> something close called SpatialPointVectorFieldType that could be
>>>> modified trivially, but it doesn't support it now.
>>>>
>>>> ~ David
>>>>
>>>> On 7/30/13 11:32 AM, "Steven Bower" <sbo...@alcyon.net> wrote:
>>>>
>>>> >#1 Here is my query:
>>>> >
>>>> >sort=vid asc
>>>> >start=0
>>>> >rows=1000
>>>> >defType=edismax
>>>> >q=*:*
>>>> >fq=recordType:"xxx"
>>>> >fq=vt:"X12B" AND
>>>> >fq=(cls:"3" OR cls:"8")
>>>> >fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
>>>> >fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48
>>>> >OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33
>>>> >OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44
>>>> >OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87
>>>> >OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94
>>>> >OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67
>>>> >OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13
>>>> >OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99
>>>> >OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41
>>>> >OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98
>>>> >OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28
>>>> >OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10
>>>> >OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58
>>>> >OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56
>>>> >OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10
>>>> >OR vid:96XXX54 )
>>>> >fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0,
>>>> >47.0 30.0)))" AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0,
>>>> >52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND +pp:*
>>>> >
>>>> >Basically looking for a set of records by "vid", then checking if a
>>>> >record's gp is in one polygon and its pp is not in another (and it has
>>>> >a pp)... essentially looking to see if a record moved between two
>>>> >polygons (gp=current, pp=prev) during a time period.
>>>> >
>>>> >#2 Yes on JTS (unless from my query above I don't)... however this is
>>>> >only an initial use case and I suspect we'll need more complex stuff
>>>> >in the future
>>>> >
>>>> >#3 The data is distributed globally but along generally fixed paths
>>>> >and then clustering around certain areas... for example the polygon
>>>> >above has about 11k points (with no date filtering). So basically some
>>>> >areas will be very dense and most areas not; the majority of searches
>>>> >will be around the dense areas
>>>> >
>>>> >#4 It's very likely to be less than 1M results (with filters).. is
>>>> >there any functionality loss with LatLonType fields?
>>>> >
>>>> >Thanks,
>>>> >
>>>> >steve
>>>> >
>>>> >
>>>> >On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) <
>>>> >dsmi...@mitre.org> wrote:
>>>> >
>>>> >> Steve,
>>>> >> (1) Can you give a specific example of how you are specifying the
>>>> >> spatial query?  I'm looking to ensure you are not using "IsWithin",
>>>> >> which is not meant for point data.  If your query shape is a circle
>>>> >> or the bounding box of a circle, you should use the geofilt query
>>>> >> parser; otherwise use the quirky syntax that allows you to specify
>>>> >> the spatial predicate with "Intersects".
>>>> >> (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
>>>> >> (3) How "dense" would you estimate the data is at the 50m resolution
>>>> >> you've configured?  If it's very dense then I'll tell you how to
>>>> >> raise the "prefix grid scan level" to a # closer to max-levels.
>>>> >> (4) Do all of your searches find less than a million points,
>>>> >> considering all filters?  If so then it's worth comparing the
>>>> >> results with LatLonType.
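>>>> >>
>>>> >> (For #3, if it comes to that: a sketch of raising the scan level on
>>>> >> the field type, with a purely illustrative value:
>>>> >>
>>>> >>   <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
>>>> >>     prefixGridScanLevel="7" distErrPct="0.025" maxDistErr="0.00045"
>>>> >>     units="degrees"/>  )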
>>>> >>
>>>> >> ~ David Smiley
>>>> >>
>>>> >>
>>>> >> Steven Bower wrote
>>>> >> > @Erick it is a lot of hw, but basically trying to create a "best
>>>> >> > case scenario" to take HW out of the question. Will try increasing
>>>> >> > heap size tomorrow.. I haven't seen it get close to the max heap
>>>> >> > size yet.. but it's worth trying...
>>>> >> >
>>>> >> > Note that these queries look something like:
>>>> >> >
>>>> >> > q=*:*
>>>> >> > fq=[date range]
>>>> >> > fq=geo query
>>>> >> >
>>>> >> > on the fq for the geo query I've added {!cache=false} to prevent
>>>> >> > it from ending up in the filter cache.. once it's in the filter
>>>> >> > cache, queries come back in 10-20ms. For my use case I need the
>>>> >> > first unique geo search query to come back in a more reasonable
>>>> >> > time, so I am currently bypassing the cache.
>>>> >> >
>>>> >> > @Bill will look into that, I'm not certain it will support the
>>>> >> > particular queries that are being executed but I'll investigate..
>>>> >> >
>>>> >> > steve
>>>> >> >
>>>> >> >
>>>> >> > On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson <erickerickson@> wrote:
>>>> >> >
>>>> >> >> This is very strange. I'd expect slow queries on
>>>> >> >> the first few queries while these caches were
>>>> >> >> warmed, but after that I'd expect things to
>>>> >> >> be quite fast.
>>>> >> >>
>>>> >> >> For a 12G index and 256G RAM, you have on the
>>>> >> >> surface a LOT of hardware to throw at this problem.
>>>> >> >> You can _try_ giving the JVM, say, 18G but that
>>>> >> >> really shouldn't be a big issue, your index files
>>>> >> >> should be MMaped.
>>>> >> >>
>>>> >> >> Let's try the crude thing first and give the JVM
>>>> >> >> more memory.
>>>> >> >>
>>>> >> >> FWIW
>>>> >> >> Erick
>>>> >> >>
>>>> >> >> On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower <smb-apache@> wrote:
>>>> >> >> > I've been doing some performance analysis of a spatial search
>>>> >> >> > use case I'm implementing in Solr 4.3.0. Basically I'm seeing
>>>> >> >> > search times a lot higher than I'd like them to be, and I'm
>>>> >> >> > hoping people may have some suggestions for how to optimize
>>>> >> >> > further.
>>>> >> >> >
>>>> >> >> > Here are the specs of what I'm doing now:
>>>> >> >> >
>>>> >> >> > Machine:
>>>> >> >> > - 16 cores @ 2.8ghz
>>>> >> >> > - 256gb RAM
>>>> >> >> > - 1TB (RAID 1+0 on 10 SSD)
>>>> >> >> >
>>>> >> >> > Content:
>>>> >> >> > - 45M docs (not very big, only a few fields with no large
>>>> >> >> > textual content)
>>>> >> >> > - 1 geo field (using config below)
>>>> >> >> > - index is 12gb
>>>> >> >> > - 1 shard
>>>> >> >> > - Using MMapDirectory
>>>> >> >> >
>>>> >> >> > Field config:
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > <fieldType name="geo"
>>>> >> >> >   class="solr.SpatialRecursivePrefixTreeFieldType"
>>>> >> >> >   distErrPct="0.025" maxDistErr="0.00045"
>>>> >> >> >   spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>> >> >> >   units="degrees"/>
>>>> >> >> >
>>>> >> >> > <field name="geopoint" indexed="true" multiValued="false"
>>>> >> >> >   required="false" stored="true" type="geo"/>
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > What I've figured out so far:
>>>> >> >> >
>>>> >> >> > - Most of my time (98%) is being spent in
>>>> >> >> > java.nio.Bits.copyToByteArray(long,Object,long,long), which is
>>>> >> >> > being driven by
>>>> >> >> > BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(),
>>>> >> >> > which from what I gather is basically reading terms from the
>>>> >> >> > .tim file in blocks
>>>> >> >> >
>>>> >> >> > - I moved from Java 1.6 to 1.7 based upon what I read here:
>>>> >> >> > http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
>>>> >> >> > and it definitely had some positive impact (I haven't been able
>>>> >> >> > to measure this independently yet)
>>>> >> >> >
>>>> >> >> > - I changed maxDistErr from 0.000009 (which is 1m precision per
>>>> >> >> > the docs) to 0.00045 (50m precision) ..
>>>> >> >> >
>>>> >> >> > - It looks to me that the .tim files are being memory mapped
>>>> >> >> > fully (ie they show up in pmap output); the virtual size of the
>>>> >> >> > jvm is ~18gb (heap is 6gb)
>>>> >> >> >
>>>> >> >> > - I've optimized the index but this doesn't have a dramatic
>>>> >> >> > impact on performance
>>>> >> >> >
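>>>> >> >> > (Sanity check on those precision figures, assuming roughly
>>>> >> >> > 111,320 m per degree of latitude: 0.00045 deg * 111,320 m/deg
>>>> >> >> > ~= 50 m, and 0.000009 deg * 111,320 m/deg ~= 1 m.)
>>>> >> >> >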
>>>> >> >> > Changing the precision and the JVM upgrade yielded a drop from
>>>> >> >> > ~18s avg query time to ~9s avg query time.. This is fantastic
>>>> >> >> > but I want to get this down into the 1-2 second range.
>>>> >> >> >
>>>> >> >> > At this point it seems that I am basically bottle-necked on
>>>> >> >> > copying memory out of the mapped .tim file, which leads me to
>>>> >> >> > think that the only solution to my problem would be to read
>>>> >> >> > less data or somehow read it more efficiently..
>>>> >> >> >
>>>> >> >> > If anyone has any suggestions of where to go with this I'd
>>>> >> >> > love to know
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > thanks,
>>>> >> >> >
>>>> >> >> > steve
>>>> >> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> -----
>>>> >>  Author:
>>>> >> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>>>> >> --
>>>> >> View this message in context:
>>>> >> http://lucene.472066.n3.nabble.com/Performance-question-on-Spatial-Search-tp4081150p4081309.html
>>>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>>>> >>
>>>>
>>>>
>>>
>>
