limit solr results before join
Is there any way to limit the results of a query on the "from" index before it gets joined?

The SQL analogy might be...

    SELECT *
    FROM toIndex JOIN
        (SELECT * FROM fromIndex
         WHERE "some query"
         LIMIT 1000
        ) fromIndex ON fromIndex.from = toIndex.to

Example:

    _query_:"{!join fromIndex=expressionData from=anatomyID to=anatomyID v='(anatomy:\"brain\")'}"

Say I have an index representing gene expression data (we work with genetics), and you query it by anatomy term. So the above would query for all data that shows gene expression in "brain".

Now I want to get a set of related data for each anatomy term via the join. Is there any way to get the related data for only the anatomy terms in the first 1000 expression data documents (fromIndex)? The reason is that there could be millions of data documents (fromIndex), and we process them in batches to load a visualization of the query results. Doing the join on all the results for each batch we process is becoming a bottleneck for large sets of data.

Thanks,
-Kevin
RE: limit solr results before join
I don't know what that means. Is that a no?

From: Mikhail Khludnev [mkhlud...@griddynamics.com]
Sent: Tuesday, June 24, 2014 2:18 PM
To: solr-user
Subject: Re: limit solr results before join

Hello Kevin,

You can only apply some restriction clauses (with +) to the from side query.

On Tue, Jun 24, 2014 at 8:09 PM, Kevin Stone wrote:
> Is there any way to limit the results of a query on the "from" index
> before it gets joined?
> [...]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics <http://www.griddynamics.com>
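For illustration, "restriction clauses (with +)" means ANDing extra required clauses into the from-side query itself. A sketch of what that could look like (assayType is a hypothetical field, not from this thread); it narrows which from-side documents participate, but cannot cap them at a fixed count the way LIMIT 1000 would:

    _query_:"{!join fromIndex=expressionData from=anatomyID to=anatomyID v='+anatomy:\"brain\" +assayType:\"RNA in situ\"'}"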
custom field type plugin
I have a particular use case that I think might require a custom field type; however, I am having trouble getting the plugin to work.

My use case has to do with genetics data, and we are running into several situations where we need to be able to query multiple regions of a chromosome (or gene, or other object types). All that really boils down to is being able to give a number, e.g. 10234, and return documents that have regions containing that number. So you'd have a document with a list like ["1:16090","400:8000","40123:43564"], and it should come back because 10234 falls within the region "1:16090". If there is a better or easier way to do this, please speak up. I'd rather not have to use a "join" on another index, because 1) it's more complex to set up, and 2) we might need to join against something else, and you can only do one join at a time.

Anyway... I tried creating a field type similar to a PointType just to see if I could get one working. I added the following jars to get it to compile: apache-solr-core-4.0.0, lucene-core-4.0.0, lucene-queries-4.0.0, apache-solr-solrj-4.0.0. I am running Solr 4.0.0 on Jetty, put my jar file in a sharedLib folder, and specified it in my solr.xml (I have multiple cores).

After starting up Solr, I got the line saying it picked up the jar:

    INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader

But I get this error about it not being able to find the AbstractSubTypeFieldType class. Here is the first bit of the trace:

    SEVERE: null:java.lang.NoClassDefFoundError: org/apache/solr/schema/AbstractSubTypeFieldType
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        ...etc...

Any hints as to what I did wrong? I can provide source code, a fuller stack trace, config settings, etc.

Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib, then repack. However, when I did that, I get a NoClassDefFoundError for my plugin itself.

Thanks,
Kevin
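For context, here is a minimal sketch (not code from this thread) of what the containment query inside such a custom field type could look like, built from two Lucene numeric range queries over hypothetical "start"/"end" subfields. Note the catch: with multi-valued regions, the start of one region can pair with the end of a different region in the same document, producing false matches — which is exactly why the spatial approach discussed below ends up winning.

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;

    /** Hypothetical sketch: match docs whose [start, end] region contains `point`. */
    public final class RegionContainsQuery {
        public static Query contains(String startField, String endField, long point) {
            BooleanQuery bq = new BooleanQuery(true); // true = disable coord scoring
            // start <= point
            bq.add(NumericRangeQuery.newLongRange(startField, null, point, true, true),
                   BooleanClause.Occur.MUST);
            // end >= point
            bq.add(NumericRangeQuery.newLongRange(endField, point, null, true, true),
                   BooleanClause.Occur.MUST);
            return bq;
        }
    }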
RE: custom field type plugin
I can try again this weekend to get a clean environment. However, the order I did things in was the reverse of what you suggest. I got the AbstractSubTypeFieldType error first. Then I removed my jar from the sharedLib folder and tried the war repacking solution. That is when I got the NoClassDefFoundError on my custom class.

The spatial feature looks intriguing, although I have no idea if it could fit my use case. It looks like a fairly complex concept, but maybe it is all the different shapes and geometry that is confusing me. If I thought of my problem in terms of geometry, I would say a chromosome region is like a segment of a line. I would need to define multiple line segments and be able to query by a single point, returning only documents that have a line segment the single point falls on. Does that make sense? Is that at all doable with a spatial query?

-Kevin

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Friday, July 19, 2013 3:15 PM
To: solr-user@lucene.apache.org
Subject: Re: custom field type plugin

: a chromosome (or gene, or other object types). All that really boils
: down to is being able to give a number, e.g. 10234, and return documents
: that have regions containing the number. So you'd have a document with a
: list like ["1:16090","400:8000","40123:43564"], and it should come

You should take a look at some of the built-in features using the spatial types...

http://wiki.apache.org/solr/SpatialForTimeDurations

I believe David also covered this use case in his talk in San Diego...

http://www.lucenerevolution.org/2013/Lucene-Solr4-Spatial-Deep-Dive

: But I get this error about it not being able to find the AbstractSubTypeFieldType class.
: Here is the first bit of the trace:
	...
: Any hints as to what I did wrong? I can provide source code, or a fuller stack trace, config settings, etc.
:
: Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib,
: then repack. However, when I did that, I get a NoClassDefFoundError for
: my plugin itself.

A fuller stack trace might help -- but the key question is: what order did you try these two approaches in? And what exactly did your fieldType declaration look like?

My guess is that you tried repacking the war first, and maybe your exploded war classpath is still polluted with your old jar from when you repacked it, so now you have multiple copies in the plugin classloader's classpath. (The initial NoClassDefFoundError could have been from a mistake in your fieldType declaration.)

Try starting completely clean, using the stock war and sample configs, and make sure you get no errors. Then try declaring your custom fieldType, using the fully qualified classname, without even telling Solr about your jar, and ensure that you get a NoClassDefFoundError for your custom class -- if you get an error about AbstractSubTypeFieldType again, then you still have a copy of your custom class somewhere in the classpath. *THEN* try adding a directive to load your jar.

If that still doesn't work, provide us with the details of your servlet container, Solr version, the full stack trace, the details of how you are configuring Solr to load your jar, how you declared the fieldType, what your filesystem looks like for your solrhome, war, etc...

-Hoss
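The trick on that wiki page, roughly: index each region [start, end] as a single 2-D point with x = start and y = end. The containment question "which regions contain point p?" then becomes a rectangle intersection, since start <= p means x in [worldMin, p] and end >= p means y in [p, worldMax]. A hedged sketch of the query shape (field name and bounds are placeholders, not from this thread):

    q=regionField:"Intersects(worldMin p p worldMax)"

e.g. for p = 10234 with world bounds of ±10,000,000:

    q=regionField:"Intersects(-10000000 10234 10234 10000000)"

The real queries Kevin posts later in this thread have exactly this shape.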
RE: custom field type plugin
    @Override
    public SortField getSortField(SchemaField field, boolean top) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "Sorting not supported on PointType " + field.getName());
    }

    @Override
    public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
        // externalVal is a region like "400-8000"
        String[] coords = externalVal.split("-");
        if (coords.length != 2) {
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                "Invalid coordinate format for " + externalVal);
        }
        BooleanQuery bq = new BooleanQuery(true);
        // require a match on each coordinate subfield
        SchemaField coord1 = subField(field, 0);
        Query tq1 = coord1.getType().getFieldQuery(parser, coord1, coords[0]);
        bq.add(tq1, BooleanClause.Occur.MUST);
        SchemaField coord2 = subField(field, 1);
        Query tq2 = coord2.getType().getFieldQuery(parser, coord2, coords[1]);
        bq.add(tq2, BooleanClause.Occur.MUST);
        return bq;
    }
}

class CoordTypeValueSource extends VectorValueSource {
    private final SchemaField sf;

    public CoordTypeValueSource(SchemaField sf, List<ValueSource> sources) {
        super(sources);
        this.sf = sf;
    }

    @Override
    public String name() {
        return "point";
    }

    @Override
    public String description() {
        return name() + "(" + sf.getName() + ")";
    }
}

From: Kevin Stone
Sent: Saturday, July 20, 2013 8:24 AM
To: solr-user@lucene.apache.org
Subject: RE: custom field type plugin

> Thank you for the links, they really helped me understand. I see how the
> spatial solution works now.
> [...]
RE: custom field type plugin
Thank you for the links, they really helped me understand. I see how the spatial solution works now. I think this could work as a good backup if I cannot get the custom field type working. The custom field would ideally be a bit more robust than what I mentioned before, because a region really means four pieces: a chromosome (e.g. 1-22), a start base pair, an end base pair, and the direction (forward or reverse). But if need be, the chromosome and direction can be multiplied into the base pairs to get it down to two translated numbers. As for the upper bounds, I do have an idea, but it would be a large number, say between 1 and 10 billion, depending on how I translate the values. I'll just have to try it out, I guess.

Ok, now back to the custom field problem. From here on I'll spam source code and stack traces. I started fresh, removing all places where I may have had my jar file, and popped in a fresh solr.war. I define the plugin class in my schema with a fieldType declaration (class="org.jax.mgi.fe.solrplugin.GeneticLocation") and use that type in a field definition.

Ok, when I start Solr, I get this error saying it can't find the plugin class that is defined in my schema:

    org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "geneticLocation": Error loading class 'org.jax.mgi.fe.solrplugin.GeneticLocation'
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:369)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:113)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
        ...etc...
    Caused by: org.apache.solr.common.SolrException: Error loading class 'org.jax.mgi.fe.solrplugin.GeneticLocation'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:436)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:457)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:453)
        ...etc...

So, that's all fine. In my solr.xml, I point sharedLib at a lib folder. I shut the server down, drop in my CustomPlugins.jar file, and start the server back up. And... I got a different error! It said I was missing the subFieldType or subFieldSuffix in my fieldType definition. So I added subFieldSuffix="_gl". Then I restart the server, thinking that I'm making progress, and I get the old error again. I pulled out the jar and did the above test again to verify that it couldn't find my plugin. Then I re-add it and restart. Nope, still this error about AbstractSubTypeFieldType.
Here is the full stack trace:

    SEVERE: null:java.lang.NoClassDefFoundError: org/apache/solr/schema/AbstractSubTypeFieldType
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        at org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:401)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:410)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:410)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:266)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:420)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:457)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:453)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:81)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:369)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:113)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
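The XML declarations themselves did not survive the archive. A hypothetical reconstruction of the shape being described (the subfield type, field name, and solr.xml attributes beyond sharedLib are placeholders, not recovered from the original message):

    <!-- schema.xml -->
    <fieldType name="geneticLocation"
               class="org.jax.mgi.fe.solrplugin.GeneticLocation"
               subFieldSuffix="_gl"/>
    <dynamicField name="*_gl" type="tdouble" indexed="true" stored="false"/>
    <field name="coordinate" type="geneticLocation" indexed="true" stored="true" multiValued="true"/>

    <!-- solr.xml -->
    <solr persistent="true" sharedLib="lib">
      ...
    </solr>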
Re: custom field type plugin
What are the dangers of trying to use a range of 10 billion? Simply slower index time? Or will I get inaccurate results? I have tried it on a very small sample of documents, and it seemed to work. I could spend some time this week trying to get a more robust (and accurate) dataset loaded to play around with.

The reason for the 10 billion is to support being able to query for a region on a chromosome. A user might want to know what genes overlap a point on a specific chromosome. Unless I can use 3-dimensional coordinates (which gave an error when I tried it), I'll need to multiply the coordinates by some offset for each chromosome to be able to normalise the data (at both index and query time). The largest chromosome (chr 1) has almost 250,000,000 base pairs. I could probably squeeze the rest a bit smaller, but I'd rather use one size for all chromosomes, since we have more than just human data to deal with. It would get quite messy otherwise.

On 7/22/13 11:50 AM, "David Smiley (@MITRE.org)" wrote:

>Like Hoss said, you're going to have to solve this using
>http://wiki.apache.org/solr/SpatialForTimeDurations
>Using PointType is *not* going to work because your durations are
>multi-valued per document.
>
>It would be useful to create a custom field type that wraps the capability
>outlined on the wiki, to make it easier to use without requiring the user
>to think spatially.
>
>You mentioned that these numeric ranges extend upwards of 10 billion or so.
>Unfortunately, the current "prefix tree" implementation under the hood for
>non-geodetic spatial, the QuadTree, is unlikely to scale to numbers that
>big. I don't know where the boundary is, but I doubt 10B. You could try
>and see what happens. I'm working (very slowly, in very little spare time)
>on improving the PrefixTree implementations to scale to such large numbers;
>I hope something will be available this fall.
>
>~ David Smiley
>
>
>Kevin Stone wrote
>> I have a particular use case that I think might require a custom field
>> type, however I am having trouble getting the plugin to work.
>> [...]
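To make the offset idea concrete, a minimal sketch of the normalisation being described; the stride of 250,000,000 comes from the chr 1 size mentioned above, and the constant name and sign handling for reverse-strand regions are assumptions, not Kevin's actual code:

    public final class ChromosomeOffsets {
        // One fixed-size slot per chromosome; chr 1 is ~250,000,000 bp,
        // so use that as the stride for every chromosome.
        private static final long CHR_STRIDE = 250000000L;

        /** Map (chromosome, base pair) to a single global coordinate. */
        public static long toGlobal(int chromosome, long basePair) {
            return (chromosome - 1) * CHR_STRIDE + basePair;
        }

        public static void main(String[] args) {
            // chr 3, position 10234 -> (3-1)*250,000,000 + 10234 = 500010234
            System.out.println(toGlobal(3, 10234));
        }
    }

The same mapping has to be applied to every query point, which is what "at both index and query time" means above.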
Re: custom field type plugin
Sorry for the late response. I needed to find the time to load a lot of extra data (closer to what we're anticipating). I have an index with close to 220,000 documents, each with at least two coordinate regions anywhere between -10 billion and +10 billion; a document could potentially have up to maybe half a dozen regions. The reason for the negatives is that you can read a chromosome either backwards or forwards, so many coordinates can be negative.

Here is the schema field definition:

    <fieldType name="geneticLocation"
               class="solr.SpatialRecursivePrefixTreeFieldType"
               multiValued="true"
               geo="false"
               worldBounds="-100000000000 -100000000000 100000000000 100000000000"
               distErrPct="0"
               maxDistErr="0.000009"
               units="degrees" />

Here is the first query in the log:

    INFO: geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,
      analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,
      args={distErrPct=0, geo=false, multiValued=true,
      worldBounds=-100000000000 -100000000000 100000000000 100000000000,
      maxDistErr=0.000009, units=degrees}}
      strat: RecursivePrefixTreeStrategy(prefixGridScanLevel:46,
      SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false,
      calculator=CartesianDistCalc,
      worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)})))
      maxLevels: 50
    Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute
    INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+60330+6033041244+100)"&rows=100} hits=81112 status=0 QTime=122

Here are some other queries to give different timings (the one above brings back quite a lot):

    INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+60+69+100)"&rows=100} hits=6031 status=0 QTime=10
    Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute
    INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+0+1000+100)"&rows=100} hits=500 status=0 QTime=15
    Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute
    INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+7831329+7831329+100)"&rows=100} hits=4 status=0 QTime=17
    INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(-100+-1051057963+-1001057963+0)"&rows=100} hits=661 status=0 QTime=8

The query times look pretty fast to me. Certainly I'm pretty impressed. Our other backup solutions (involving SQL) likely wouldn't even touch this in terms of speed. We will be testing this more in depth in the coming month. I am sort of jumping ahead of our team to research possible solutions, since this is something that worried us. Looks like it might work!

Thanks,
-Kevin

On 7/23/13 1:47 PM, "David Smiley (@MITRE.org)" wrote:

>Oh cool! I'm glad it at least seemed to work. Can you post your
>configuration of the field type, and report from Solr's logs what
>"maxLevels" is used for this field? It is logged the first time you use
>the field type.
>
>Maybe there isn't a limit under 10B after all. Some quick'n'dirty
>calculations I just did indicate there shouldn't be a problem, but
>real-world usage will be a better proof. Indexing probably won't be
>terribly slow; queries could get pretty slow if the amount of indexed
>data is really high. I'd love to hear how it works out for you. Your
>use-case would benefit a lot from an improved prefix tree implementation.
>
>I don't gather how a 3rd dimension would play into this. Support for
>multi-dimensional spatial is on the drawing board.
>
>~ David
>
>
>Kevin Stone wrote
>> What are the dangers of trying to use a range of 10 billion? Simply a
>> slower index time? Or will I get inaccurate results?
>> [...]
Re: custom field type plugin
I tried setting maxDistErr to "0.01", just to test making the tree smaller. I got maxLevels down to 45 and slightly better query times (indexing time was about the same). However, my queries are not accurate anymore: I need to pad by 2 or 3 whole numbers to get a hit now, which won't work in real use. I can play with the number a bit more, but I didn't see anything wrong when I had it at "0.000009". I do know about using a small decimal value to pad around my coordinates, and I'll probably do that for the real implementation, but for testing, whole numbers were working for all my edge cases.

-Kevin

On 7/23/13 10:45 PM, "Smiley, David W." wrote:

>Kevin,
>
>Those are some good query response times, but they could be better. You've
>configured the field type sub-optimally. Look again at
>http://wiki.apache.org/solr/SpatialForTimeDurations and note in particular
>maxDistErr. You've left it at the value that comes pre-configured with
>Solr, 0.000009, which is ~1 meter measured in degrees, and this value
>makes no sense when your numeric range is in whole numbers. I suspect you
>inherited this value from Hoss's slides. **Instead use 1** (as shown on
>the wiki). This affects performance in a big way, since you've configured
>the prefixTree to hold 2.22e16 values (calculated via (max - min) /
>maxDistErr) as opposed to "just" 2e11. Your log shows maxLevels is 50 for
>the quad tree. The comments in QuadPrefixTree (and I put them there once)
>indicate maxLevels of 50 is about as much as is supported. But again, I'm
>not certain what the limit really is without validating. Hopefully you
>can stay clear of 50. To do some tests, try querying just on the edge on
>either side of an indexed value to make sure you match the point, and then
>that you don't match just past the indexed point, as you would expect
>based on the instructions. Also, be sure to read more of the details on
>"Search" on this wiki page, in which you are advised to buffer the query
>shape slightly; you didn't do this in your examples below. This is all a
>bit of a hack when using a field that internally uses floating point
>instead of fixed precision.
>
>~ David Smiley
>
>On 7/23/13 9:32 PM, "Kevin Stone" wrote:
>
>>Sorry for the late response. I needed to find the time to load a lot of
>>extra data (closer to what we're anticipating).
>>[...]
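A sketch of the buffering David recommends, assuming integer coordinates indexed with maxDistErr=1 (the field name matches the queries above; the 0.5 epsilon is an illustrative choice): instead of querying exactly at point p, query the tiny rectangle [p-0.5, p+0.5] so floating-point rounding at cell edges cannot drop a legitimate match:

    q=humanCoordinate:"Intersects(-100000000000 10233.5 10234.5 100000000000)"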
Solr 4.0 indexing performance question
I am having some difficulty migrating our Solr indexing scripts from 3.5 to Solr 4.0. Notably, I am trying to track down why our indexing performance in Solr 4.0 is about 5-10 times slower. Querying is still quite fast.

The code adds documents in groups of 1000, and adds each group to Solr in a thread. The documents are somewhat large, including maybe 30-40 different field types, mostly multivalued. Here are some snippets of the code we used in 3.5:

    MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
    HttpClient client = new HttpClient(mgr);
    CommonsHttpSolrServer server = new CommonsHttpSolrServer( "some url for our index", client );
    server.setRequestWriter(new BinaryRequestWriter());

Then we delete the index, and proceed to generate documents and load the groups in a thread that looks kind of like this. I've omitted some overhead for handling exceptions and retry attempts.

    class DocWriterThread implements Runnable
    {
        CommonsHttpSolrServer server;
        Collection<SolrInputDocument> docs;
        private int commitWithin = 50000; // 50 seconds

        public DocWriterThread(CommonsHttpSolrServer server, Collection<SolrInputDocument> docs)
        {
            this.server = server;
            this.docs = docs;
        }

        public void run()
        {
            // set the commitWithin feature
            server.add(docs, commitWithin);
        }
    }

Now, I've had to change some things to get this to compile with the Solr 4.0 libraries. Here is what I tried to convert the above code to. I don't know if these are the correct equivalents, as I am not familiar with Apache HttpComponents.

    ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();
    DefaultHttpClient client = new DefaultHttpClient(mgr);
    HttpSolrServer server = new HttpSolrServer( "some url for our solr index", client );
    server.setRequestWriter(new BinaryRequestWriter());

The thread method is the same, but uses HttpSolrServer instead of CommonsHttpSolrServer.

We also had an old solrconfig (not sure what version, but it is pre-3.x and had mostly default values) that I had to replace with a 4.0-style solrconfig.xml. I don't want to post the entire file (as it is large), but I copied one from the Solr 4.0 examples and made a couple of changes. First, I wanted to turn off transaction logging, so essentially I have a line like this (everything inside is commented out):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- <updateLog> ... </updateLog> -->
    </updateHandler>

And I added a handler for javabin:

    <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.contentType">application/javabin</str>
      </lst>
    </requestHandler>

I'm not sure what other configurations I should look at. I would think that there should be a big obvious reason why the indexing performance would drop nearly 10-fold. Against our 3.5 instance I timed our index load, and it adds roughly 40,000 documents every 3-8 seconds. Against our 4.0 instance it adds 40,000 documents every 70-75 seconds. This isn't the end of the world, and I would love to use the new join feature in Solr 4.0. However, we have many different indexes with millions of documents, and this kind of increase in load time is troubling.

Thanks for your help.

-Kevin
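Not raised in this thread, but worth a sketch: SolrJ 4.0 also ships ConcurrentUpdateSolrServer, which queues documents and streams them over several background threads, so the hand-rolled thread-per-group scheme above could be replaced with something like this (the URL, queue size, and thread count are placeholders):

    import java.util.ArrayList;
    import java.util.Collection;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkLoader {
        public static void main(String[] args) throws Exception {
            // Queue up to 10,000 docs; 4 background threads drain the queue.
            ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/gxdResult", 10000, 4);

            Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1"); // placeholder document
            docs.add(doc);

            server.add(docs, 50000);     // commitWithin 50 seconds, as in the 3.5 code
            server.blockUntilFinished(); // wait for the internal queue to drain
            server.shutdown();
        }
    }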
Re: Solr 4.0 indexing performance question
Do you mean commenting out the <updateLog>...</updateLog> tag? Because that I already commented out. Or do I also need to remove the entire <updateHandler> tag? Sorry, I am not too familiar with everything in the solrconfig file. I have a tag that essentially looks like this:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- <updateLog> ... </updateLog> -->
    </updateHandler>

Everything inside is commented out.

-Kevin

On 1/23/13 11:21 AM, "Mark Miller" wrote:

>It's hard to guess, but I might start by looking at what the new
>UpdateLog is costing you. Take its definition out of solrconfig.xml and
>try your test again. Then let's take it from there.
>
>- Mark
>
>On Jan 23, 2013, at 11:00 AM, Kevin Stone wrote:
>
>> I am having some difficulty migrating our solr indexing scripts from
>> using 3.5 to solr 4.0. Notably, I am trying to track down why our
>> performance in solr 4.0 is about 5-10 times slower when indexing
>> documents. Querying is still quite fast.
>> [...]
Re: Solr 4.0 indexing performance question
I'm still poking around trying to find the differences. I found a couple of things that may or may not be relevant.

First, when I start up my 3.5 Solr, I get all sorts of warnings that my solrconfig is old and will run using 2.4 emulation. Of course I had to upgrade the solrconfig for the 4.0 instance (which I already described). I am curious if there could be some feature I was taking advantage of in 2.4 that doesn't exist now in 4.0. I don't know.

Second, when I look at the console logs for my servers (3.5 and 4.0) and run the indexer against each, I see a subtle difference in this printout when it connects to the Solr core.

The 3.5 version prints this out:

    webapp=/solr path=/update params={waitSearcher=true&wt=javabin&commit=true&softCommit=false&version=2} {commit=} 0 2722

The 4.0 version prints this out:

    webapp=/solr path=/update/javabin params={wt=javabin&commit=true&waitFlush=true&waitSearcher=true&version=2} status=0 QTime=1404

The params for the update handler seem ever so slightly different. The 3.5 version (the one that runs fast) has the setting softCommit=false. The 4.0 version does not print that setting, but instead prints waitFlush=true. These could be irrelevant, but I thought I should add the information.

-Kevin

On 1/23/13 11:42 AM, "Kevin Stone" wrote:

>Do you mean commenting out the <updateLog>...</updateLog> tag? Because
>that I already commented out. Or do I also need to remove the entire
><updateHandler> tag?
>[...]
Re: Solr 4.0 indexing performance question
Another revelation... I can see that there is a time difference in the Solr output for adding these documents when I watch it in real time.

Here are some rows from the 3.5 Solr server:

    Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
    INFO: [gxdResult] webapp=/solr path=/update/javabin params={wt=javabin&version=2} status=0 QTime=6196
    Jan 23, 2013 11:57:23 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {add=[RNA in situ-1386104, RNA in situ-1351487, RNA in situ-1363917, RNA in situ-1377125, RNA in situ-1371738, RNA in situ-1378746, RNA in situ-1383410, RNA in situ-1362712, ... (1001 adds)]} 0 6266
    Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
    INFO: [gxdResult] webapp=/solr path=/update/javabin params={wt=javabin&version=2} status=0 QTime=6266
    Jan 23, 2013 11:57:24 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {add=[RNA in situ-1371578, RNA in situ-1377716, RNA in situ-1378151, RNA in situ-1360580, RNA in situ-1391657, RNA in situ-1370288, RNA in situ-1388236, RNA in situ-1361465, ... (1001 adds)]} 0 6371
    Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute
    INFO: [gxdResult] webapp=/solr path=/update/javabin params={wt=javabin&version=2} status=0 QTime=6371
    Jan 23, 2013 11:57:24 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {add=[RNA in situ-1350555, RNA in situ-1350887, RNA in situ-1379699, RNA in situ-1373773, RNA in situ-1374004, RNA in situ-1372265, RNA in situ-1373027, RNA in situ-1380691, ... (1001 adds)]} 0 6440
    Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute

And here from the 4.0 Solr:

    Jan 23, 2013 3:40:22 PM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-115650, RNA in situ-4109, RNA in situ-107614, RNA in situ-86038, RNA in situ-19647, RNA in situ-1422, RNA in situ-119536, RNA in situ-5, RNA in situ-86825, RNA in situ-91009, ... (1001 adds)]} 0 3105
    Jan 23, 2013 3:40:23 PM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-38103, RNA in situ-15797, RNA in situ-79946, RNA in situ-124877, RNA in situ-62025, RNA in situ-67908, RNA in situ-70527, RNA in situ-20581, RNA in situ-107574, RNA in situ-96497, ... (1001 adds)]} 0 2689
    Jan 23, 2013 3:40:24 PM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-35518, RNA in situ-50512, RNA in situ-109961, RNA in situ-113025, RNA in situ-33729, RNA in situ-116967, RNA in situ-133871, RNA in situ-55287, RNA in situ-67367, RNA in situ-8617, ... (1001 adds)]} 0 2367
    Jan 23, 2013 3:40:28 PM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-105749, RNA in situ-125415, RNA in situ-14667, RNA in situ-41067, RNA in situ-1099, RNA in situ-86169, RNA in situ-90834, RNA in situ-114639, RT-PCR-26160, RNA in situ-79745, ... (1001 adds)]} 0 3401
    Jan 23, 2013 3:40:28 PM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-82061, RNA in situ-96965, RNA in situ-22677, RNA in situ-52637, RNA in situ-131842, RNA in situ-31863, RNA in situ-111656, RNA in situ-120509, RNA in situ-29659, RNA in situ-63579, ... (1001 adds)]} 0 3580
    Jan 23, 2013 3:40:31 PM org.apache.solr.update.processor.LogUpdateProcessor finish

I know that they aren't the same exact documents (like I said, there are millions to load), but the times look pretty much like this for all of them. Can someone help me parse out the times of this? It *appears* to me that the inserts are happening just as fast, if not faster, in 4.0 than 3.5, BUT the timestamps between the LogUpdateProcessor calls are much longer in 4.0. I do not have the <updateLog> tag anywhere in my solrconfig.xml. So why does it look to me like it is spending a lot of time logging? It shouldn't really be logging anything, right? Bear in mind that these inserts happen in threads that are pushing to Solr concurrently. So if 4.0 is logging somewhere that 3.5 didn't, then the file-locking on that log file could be slowing me down.

-Kevin

On 1/23/13 12:03 PM, "Kevin Stone" wrote:

>I'm still poking around trying to find the differences. I found a couple
>of things that may or may not be relevant.
>[...]
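One way to read the timestamps above (assuming each "finish" line marks a completed 1001-document batch):

    3.5: finishes at 11:57:23, 11:57:24, 11:57:24 -- three batches land within
         ~1s of wall clock, even though each reports ~6.2-6.4s of QTime,
         so many batches are clearly being processed in parallel.
    4.0: finishes at 3:40:22, 3:40:23, 3:40:24, 3:40:28, 3:40:28, 3:40:31 --
         six batches spread over ~9s, with per-batch QTimes of only ~2.4-3.6s.

So individual requests are faster on 4.0, but throughput is worse: the wall-clock gaps (e.g. 3:40:24 to 3:40:28) exceed the reported QTimes, which points at time spent outside the measured request execution rather than at slower adds themselves.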