Thank you very much, David. That was a great explanation!

Regards,

- Luis Cappa

2013/7/30 Smiley, David W. <dsmi...@mitre.org>

> Luis,
>
> field:* and field:[* TO *] are semantically equivalent -- they have the
> same effect. But internally they work differently depending on the field
> type. The field type has the chance to intercept the range query to do
> something smart (FieldType.getRangeQuery(...)). Numeric/Date (trie)
> fields have a reasonably quick implementation for such queries. Spatial
> fields could be enhanced similarly but aren't (yet). So in general you
> should avoid field:* in favor of field:[* TO *]. Perhaps Solr should
> redirect a field:* to the FieldType's getRangeQuery method so that there
> is no difference. Anyway, the official/best way to ask for all data in a
> field (without cheating and indexing a boolean in a different field) is
> field:[* TO *].
>
> ~ David
>
> On 7/30/13 4:44 PM, "Luis Cappa Banda" <luisca...@gmail.com> wrote:
>
> >Hey, David,
> >
> >I've been reading the thread and I think it is one of the most educative
> >mail threads I've read on the Solr mailing list. Just out of curiosity:
> >internally, for Solr, is a query like "field:*" the same as
> >"field:[* TO *]"? I think it's expected to return the same numFound,
> >but I would like to know the internal behavior of Solr.
> >
> >Best regards,
> >
> >- Luis Cappa
> >
> >2013/7/30 Smiley, David W. <dsmi...@mitre.org>
> >
> >> Steve,
> >> The FieldCache and DocValues are irrelevant to this problem. Solr's
> >> FilterCache is relevant, and Lucene has no counterpart. Perhaps it would
> >> be cool if Solr could look for expensive field:* usages when parsing its
> >> queries and re-write them to use the FilterCache. That's quite doable, I
> >> think.
> >> I just created an issue for it:
> >> https://issues.apache.org/jira/browse/SOLR-5093 but don't expect me to
> >> work on it anytime soon ;-)
> >>
> >> ~ David
> >>
> >> On 7/30/13 2:02 PM, "Steven Bower" <sbo...@alcyon.net> wrote:
> >>
> >> >I am curious why the field:* walks the entire terms list.. could this be
> >> >discovered from a field cache / docvalues?
> >> >
> >> >steve
> >> >
> >> >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower <sbo...@alcyon.net> wrote:
> >> >
> >> >> Until I get the data re-fed: there was another field (a date field)
> >> >> that was present when the geo field was and absent when it was not...
> >> >> i tried that field:* and query times come down to 2.5s .. also just
> >> >> removing that filter brings the query down to 30ms.. so I'm very
> >> >> hopeful that with just a boolean i'll be down in that sub 100ms range..
> >> >>
> >> >> steve
> >> >>
> >> >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower <sbo...@alcyon.net> wrote:
> >> >>
> >> >>> Will give the boolean thing a shot... makes sense...
> >> >>>
> >> >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. <dsmi...@mitre.org> wrote:
> >> >>>
> >> >>>> I see the problem -- it's +pp:*. It may look innocent but it's a
> >> >>>> performance killer. What you're telling Lucene to do is iterate over
> >> >>>> *every* term in this index to find all documents that have this data.
> >> >>>> Most fields are pretty slow to do that. Lucene/Solr does not have some
> >> >>>> kind of cache for this. Instead, you should index a new boolean field
> >> >>>> indicating whether or not 'pp' is populated and then do a simple true
> >> >>>> check against that field. Another approach you could do right now
> >> >>>> without reindexing is to simplify the last 2 clauses of your 3-clause
> >> >>>> boolean query by using the "IsDisjointTo" predicate.
> >> >>>> But unfortunately Lucene doesn't have a generic filter cache
> >> >>>> capability and so this predicate has no place to cache the
> >> >>>> whole-world query it does internally (each and every time it's
> >> >>>> used), so it will be slower than the boolean field I suggested
> >> >>>> you add.
> >> >>>>
> >> >>>> Nevermind on LatLonType; it doesn't support JTS/Polygons. There is
> >> >>>> something close called SpatialPointVectorFieldType that could be
> >> >>>> modified trivially but it doesn't support it now.
> >> >>>>
> >> >>>> ~ David
> >> >>>>
> >> >>>> On 7/30/13 11:32 AM, "Steven Bower" <sbo...@alcyon.net> wrote:
> >> >>>>
> >> >>>> >#1 Here is my query:
> >> >>>> >
> >> >>>> >sort=vid asc
> >> >>>> >start=0
> >> >>>> >rows=1000
> >> >>>> >defType=edismax
> >> >>>> >q=*:*
> >> >>>> >fq=recordType:"xxx"
> >> >>>> >fq=vt:"X12B" AND
> >> >>>> >fq=(cls:"3" OR cls:"8")
> >> >>>> >fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
> >> >>>> >fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48
> >> >>>> >OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33
> >> >>>> >OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44
> >> >>>> >OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87
> >> >>>> >OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94
> >> >>>> >OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67
> >> >>>> >OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13
> >> >>>> >OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99
> >> >>>> >OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41
> >> >>>> >OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98
> >> >>>> >OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28
> >> >>>> >OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10
> >> >>>> >OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58
> >> >>>> >OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56
> >> >>>> >OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10
> >> >>>> >OR vid:96XXX54 )
> >> >>>> >fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND +pp:*
> >> >>>> >
> >> >>>> >Basically looking for a set of records by "vid" then if its gp is in
> >> >>>> >one polygon and its pp is not in another (and it has a pp)...
> >> >>>> >essentially looking to see if a record moved between two polygons
> >> >>>> >(gp=current, pp=prev) during a time period.
> >> >>>> >
> >> >>>> >#2 Yes on JTS (unless from my query above I don't) however this is
> >> >>>> >only an initial use case and I suspect we'll need more complex stuff
> >> >>>> >in the future
> >> >>>> >
> >> >>>> >#3 The data is distributed globally but along generally fixed paths
> >> >>>> >and then clustering around certain areas... for example the polygon
> >> >>>> >above has about 11k points (with no date filtering). So basically some
> >> >>>> >areas will be very dense and most areas not, the majority of searches
> >> >>>> >will be around the dense areas
> >> >>>> >
> >> >>>> >#4 It's very likely to be less than 1M results (with filters) .. is
> >> >>>> >there any functionality loss with LatLonType fields?
> >> >>>> >
> >> >>>> >Thanks,
> >> >>>> >
> >> >>>> >steve
> >> >>>> >
> >> >>>> >On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) <dsmi...@mitre.org> wrote:
> >> >>>> >
> >> >>>> >> Steve,
> >> >>>> >> (1) Can you give a specific example of how you are specifying the
> >> >>>> >> spatial query? I'm looking to ensure you are not using "IsWithin",
> >> >>>> >> which is not meant for point data. If your query shape is a circle
> >> >>>> >> or the bounding box of a circle, you should use the geofilt query
> >> >>>> >> parser, otherwise use the quirky syntax that allows you to specify
> >> >>>> >> the spatial predicate with "Intersects".
> >> >>>> >> (2) Do you actually need JTS? i.e. are you using Polygons, etc.
> >> >>>> >> (3) How "dense" would you estimate the data is at the 50m resolution
> >> >>>> >> you've configured the data? If it's very dense then I'll tell you
> >> >>>> >> how to raise the "prefix grid scan level" to a # closer to
> >> >>>> >> max-levels.
> >> >>>> >> (4) Do all of your searches find less than a million points,
> >> >>>> >> considering all filters? If so then it's worth comparing the
> >> >>>> >> results with LatLonType.
> >> >>>> >>
> >> >>>> >> ~ David Smiley
> >> >>>> >>
> >> >>>> >> Steven Bower wrote
> >> >>>> >> > @Erick it is a lot of hw, but basically trying to create a "best
> >> >>>> >> > case scenario" to take HW out of the question. Will try increasing
> >> >>>> >> > heap size tomorrow.. I haven't seen it get close to the max heap
> >> >>>> >> > size yet.. but it's worth trying...
> >> >>>> >> >
> >> >>>> >> > Note that these queries look something like:
> >> >>>> >> >
> >> >>>> >> > q=*:*
> >> >>>> >> > fq=[date range]
> >> >>>> >> > fq=geo query
> >> >>>> >> >
> >> >>>> >> > on the fq for the geo query i've added {!cache=false} to prevent it
> >> >>>> >> > from ending up in the filter cache.. once it's in filter cache
> >> >>>> >> > queries come back in 10-20ms. For my use case i need the first
> >> >>>> >> > unique geo search query to come back in a more reasonable time so
> >> >>>> >> > I am currently ignoring the cache.
> >> >>>> >> >
> >> >>>> >> > @Bill will look into that, I'm not certain it will support the
> >> >>>> >> > particular queries that are being executed but I'll investigate..
> >> >>>> >> >
> >> >>>> >> > steve
> >> >>>> >> >
> >> >>>> >> > On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson <erickerickson@> wrote:
> >> >>>> >> >
> >> >>>> >> >> This is very strange. I'd expect slow queries on
> >> >>>> >> >> the first few queries while these caches were
> >> >>>> >> >> warmed, but after that I'd expect things to
> >> >>>> >> >> be quite fast.
> >> >>>> >> >>
> >> >>>> >> >> For a 12G index and 256G RAM, you have on the
> >> >>>> >> >> surface a LOT of hardware to throw at this problem.
> >> >>>> >> >> You can _try_ giving the JVM, say, 18G but that
> >> >>>> >> >> really shouldn't be a big issue, your index files
> >> >>>> >> >> should be MMaped.
> >> >>>> >> >>
> >> >>>> >> >> Let's try the crude thing first and give the JVM
> >> >>>> >> >> more memory.
> >> >>>> >> >>
> >> >>>> >> >> FWIW
> >> >>>> >> >> Erick
> >> >>>> >> >>
> >> >>>> >> >> On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower <smb-apache@> wrote:
> >> >>>> >> >> > I've been doing some performance analysis of a spatial search
> >> >>>> >> >> > use case I'm implementing in Solr 4.3.0. Basically I'm seeing
> >> >>>> >> >> > search times a lot higher than I'd like them to be and I'm
> >> >>>> >> >> > hoping people may have some suggestions for how to optimize
> >> >>>> >> >> > further.
> >> >>>> >> >> >
> >> >>>> >> >> > Here are the specs of what I'm doing now:
> >> >>>> >> >> >
> >> >>>> >> >> > Machine:
> >> >>>> >> >> > - 16 cores @ 2.8 GHz
> >> >>>> >> >> > - 256 GB RAM
> >> >>>> >> >> > - 1 TB (RAID 1+0 on 10 SSD)
> >> >>>> >> >> >
> >> >>>> >> >> > Content:
> >> >>>> >> >> > - 45M docs (not very big, only a few fields with no large
> >> >>>> >> >> >   textual content)
> >> >>>> >> >> > - 1 geo field (using config below)
> >> >>>> >> >> > - index is 12 GB
> >> >>>> >> >> > - 1 shard
> >> >>>> >> >> > - Using MMapDirectory
> >> >>>> >> >> >
> >> >>>> >> >> > Field config:
> >> >>>> >> >> >
> >> >>>> >> >> > <fieldType name="geo"
> >> >>>> >> >> >     class="solr.SpatialRecursivePrefixTreeFieldType"
> >> >>>> >> >> >     distErrPct="0.025" maxDistErr="0.00045"
> >> >>>> >> >> >     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >> >>>> >> >> >     units="degrees"/>
> >> >>>> >> >> >
> >> >>>> >> >> > <field name="geopoint" indexed="true" multiValued="false"
> >> >>>> >> >> >     required="false" stored="true" type="geo"/>
> >> >>>> >> >> >
> >> >>>> >> >> > What I've figured out so far:
> >> >>>> >> >> >
> >> >>>> >> >> > - Most of my time (98%) is being spent in
> >> >>>> >> >> >   java.nio.Bits.copyToByteArray(long,Object,long,long) which is
> >> >>>> >> >> >   being driven by
> >> >>>> >> >> >   BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
> >> >>>> >> >> >   which from what I gather is basically reading terms from the
> >> >>>> >> >> >   .tim file in blocks
> >> >>>> >> >> >
> >> >>>> >> >> > - I moved from Java 1.6 to 1.7 based upon what I read here:
> >> >>>> >> >> >   http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
> >> >>>> >> >> >   and it definitely had some positive impact (i haven't been
> >> >>>> >> >> >   able to measure this independently yet)
> >> >>>> >> >> >
> >> >>>> >> >> > - I changed maxDistErr from 0.000009 (which is 1m precision per
> >> >>>> >> >> >   docs) to 0.00045 (50m precision) ..
> >> >>>> >> >> >
> >> >>>> >> >> > - It looks to me that the .tim files are being memory mapped
> >> >>>> >> >> >   fully (ie they show up in pmap output) the virtual size of
> >> >>>> >> >> >   the jvm is ~18gb (heap is 6gb)
> >> >>>> >> >> >
> >> >>>> >> >> > - I've optimized the index but this doesn't have a dramatic
> >> >>>> >> >> >   impact on performance
> >> >>>> >> >> >
> >> >>>> >> >> > Changing the precision and the JVM upgrade yielded a drop from
> >> >>>> >> >> > ~18s avg query time to ~9s avg query time.. This is fantastic
> >> >>>> >> >> > but I want to get this down into the 1-2 second range.
> >> >>>> >> >> >
> >> >>>> >> >> > At this point it seems that basically i am bottle-necked on
> >> >>>> >> >> > basically copying memory out of the mapped .tim file which
> >> >>>> >> >> > leads me to think that the only solution to my problem would
> >> >>>> >> >> > be to read less data or somehow read it more efficiently..
> >> >>>> >> >> >
> >> >>>> >> >> > If anyone has any suggestions of where to go with this I'd
> >> >>>> >> >> > love to know
> >> >>>> >> >> >
> >> >>>> >> >> > thanks,
> >> >>>> >> >> >
> >> >>>> >> >> > steve
> >> >>>> >>
> >> >>>> >> -----
> >> >>>> >> Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
> >> >>>> >> --
> >> >>>> >> View this message in context:
> >> >>>> >> http://lucene.472066.n3.nabble.com/Performance-question-on-Spatial-Search-tp4081150p4081309.html
> >> >>>> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >--
> >- Luis Cappa

--
- Luis Cappa
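
For anyone landing on this thread later, here is a minimal sketch of the filter rewrite David describes, assuming a client that assembles the Solr request parameters itself. It swaps the `+pp:*` clause for either `pp:[* TO *]` or a true check on a boolean flag field. The flag field name `has_pp` and all helper names are hypothetical (nothing in the thread names them), the flag field would have to be populated at index time, and the sketch only builds a query string; it does not talk to a Solr instance.

```python
from urllib.parse import urlencode

# Polygon taken from the query earlier in the thread.
POLYGON = "POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0))"


def existence_clause(field, flag_field=None):
    """Return a clause matching documents that have a value in `field`.

    If a boolean flag field is given (hypothetical; it must be indexed
    alongside `field`), check it for true. Otherwise fall back to the
    open-ended range form recommended over `field:*` in the thread.
    """
    if flag_field is not None:
        return f"{flag_field}:true"
    return f"{field}:[* TO *]"


def moved_between_polygons_fq(polygon, flag_field=None):
    """Build the geo filter: gp inside the polygon, pp present but outside it."""
    return (
        f'gp:"Intersects({polygon})"'
        f' AND NOT pp:"Intersects({polygon})"'
        f" AND {existence_clause('pp', flag_field)}"
    )


def build_params(polygon, flag_field=None):
    """Assemble a reduced version of the request parameters as (key, value) pairs."""
    return [
        ("q", "*:*"),
        ("fq", "dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]"),
        ("fq", moved_between_polygons_fq(polygon, flag_field)),
    ]


# URL-encoded query string using the hypothetical has_pp flag field.
query_string = urlencode(build_params(POLYGON, flag_field="has_pp"))
```

Calling `build_params(POLYGON)` without a flag field gives the variant that needs no reindexing; whether it is fast enough depends on the field type, as discussed above.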