limit solr results before join

2014-06-24 Thread Kevin Stone
Is there any way to limit the results of a query on the "from" index before it 
gets joined?

The SQL analogy might be...
SELECT *
from toIndex join
(select * from fromIndex
where "some query"
limit 1000
) fromIndex on fromIndex.from=toIndex.to


Example:
_query_:"{!join fromIndex=expressionData from=anatomyID to=anatomyID 
v='(anatomy:\"brain\")'}"

Say I have an index representing data for gene expression (we work with 
genetics), and you query it by anatomy term. So the above would query for all 
data that shows gene expression in "brain".

Now I want to get a set of related data for each anatomy term via the join. Is 
there any way to get the related data for only anatomy terms in the first 1000 
expression data documents (fromIndex)? The reason is because there could be 
millions of data documents (fromIndex), and we process them in batches to load 
a visualization of the query results.

Doing the join on all the results for each batch I process is becoming a 
bottleneck for large sets of data.

Thanks,
-Kevin



RE: limit solr results before join

2014-06-24 Thread Kevin Stone
I don't know what that means. Is that a no?


From: Mikhail Khludnev [mkhlud...@griddynamics.com]
Sent: Tuesday, June 24, 2014 2:18 PM
To: solr-user
Subject: Re: limit solr results before join

Hello Kevin,
You can only apply some restriction clauses (with +) to the from side
query.
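
For example (a sketch only -- "assayType:insitu" is a hypothetical clause
standing in for whatever restriction applies to your data), a mandatory
clause can be added inside the v parameter so the from-side query itself
matches fewer documents:

_query_:"{!join fromIndex=expressionData from=anatomyID to=anatomyID
v='+anatomy:\"brain\" +assayType:insitu'}"

There is no parameter, though, that caps the from side at an absolute count
such as the first 1000 documents.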





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 



custom field type plugin

2013-07-19 Thread Kevin Stone
I have a particular use case that I think might require a custom field type; 
however, I am having trouble getting the plugin to work.
My use case has to do with genetics data, and we are running into several 
situations where we need to be able to query multiple regions of a chromosome 
(or gene, or other object types). All that really boils down to is being able 
to give a number, e.g. 10234, and return documents that have regions containing 
the number. So you'd have a document with a list like 
["1:16090","400:8000","40123:43564"], and it should come back because 10234 
falls within "1:16090". If there is a better or easier way to do this 
please speak up. I'd rather not have to use a "join" on another index, because 
1) it's more complex to set up, and 2) we might need to join against something 
else and you can only do one join at a time.

Anyway… I tried creating a field type similar to a PointType just to see if I 
could get one working. I added the following jars to get it to compile: 
apache-solr-core-4.0.0,lucene-core-4.0.0,lucene-queries-4.0.0,apache-solr-solrj-4.0.0.
 I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib folder, 
and specified it in my solr.xml (I have multiple cores).

After starting up solr, I got the line that it picked up the jar:
INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader

But I get this error about it not being able to find the 
AbstractSubTypeFieldType class.
Here is the first bit of the trace:

SEVERE: null:java.lang.NoClassDefFoundError: 
org/apache/solr/schema/AbstractSubTypeFieldType
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
...etc…


Any hints as to what I did wrong? I can provide source code, or a fuller stack 
trace, config settings, etc.

Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib, then 
repack. However, when I did that, I get a NoClassDefFoundError for my plugin 
itself.


Thanks,
Kevin



RE: custom field type plugin

2013-07-19 Thread Kevin Stone
I can try again this weekend to get a clean environment. However, the order I 
did things in was the reverse of what you suggest. I got the 
AbstractSubTypeFieldType error first. Then I removed my jar from the sharedLib 
folder, and tried the war repacking solution. That is when I got 
NoClassDefFoundError on my custom class.

The spatial feature looks intriguing, although I have no idea if it could fit 
my use case. It looks like a fairly complex concept, but maybe it is all the 
different shapes and geometry that are confusing me. If I thought of my problem 
in terms of geometry, I would say a chromosome region is like a segment of a 
line. I would need to define multiple line segments and be able to query by a 
single point and only return documents that have a line segment that the single 
point falls on. Does that make sense? Is that at all doable with a spatial 
query?

-Kevin

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Friday, July 19, 2013 3:15 PM
To: solr-user@lucene.apache.org
Subject: Re: custom field type plugin

: a chromosome (or gene, or other object types). All that really boils
: down to is being able to give a number, e.g. 10234, and return documents
: that have regions containing the number. So you'd have a document with a
: list like ["1:16090","400:8000","40123:43564"], and it should come

You should take a look at some of the built-in features using the spatial
types...

http://wiki.apache.org/solr/SpatialForTimeDurations

I believe David also covered this use case in his talk in San Diego...

http://www.lucenerevolution.org/2013/Lucene-Solr4-Spatial-Deep-Dive
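
(The gist of the duration trick, as a sketch: each range [start, end] is
indexed as a single point in a flat two-dimensional space, x = start and
y = end. A range contains a query value q exactly when start <= q and
end >= q, so "find documents with a range containing q" becomes a rectangle
intersection query over x in [worldMin, q] and y in [q, worldMax]. With the
example ranges above indexed as the points (1,16090), (400,8000), and
(40123,43564), the query for q=10234 would look something like

regions:"Intersects(0 10234 10234 99999)"

and only the first point falls inside that rectangle, which is the desired
answer. The field name "regions" and the 0/99999 world bounds here are
illustrative, not from the thread.)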

: But I get this error about it not being able to find the
: AbstractSubTypeFieldType class.
: Here is the first bit of the trace:
...
: Any hints as to what I did wrong? I can provide source code, or a fuller
: stack trace, config settings, etc.
:
: Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib,
: then repack. However, when I did that, I get a NoClassDefFoundError for
: my plugin itself.

a fuller stack trace might help -- but the key question is what order did
you try these two approaches in? and what exactly did your <fieldType>
declaration look like?

my guess is that you tried repacking the war first, and maybe your
exploded war classpath is still polluted with your old jar from when you
repacked it and now you have multiple copies in the plugin classloaders
classpath.  (the initial NoClassDefFoundError could have been from a
mistake in your <fieldType> declaration)

try starting completely clean, using the stock war and sample configs and
make sure you get no errors.  then try declaring your custom fieldType,
using the fully qualified classname w/o even telling solr about your jar,
and ensure that you get a NoClassDefFoundError for your custom class -- if
you get an error about AbstractSubTypeFieldType again then you still have
a copy of your custom class somewhere in the classpath.  *THEN* try adding
a <lib/> directive to load your jar.

if that still doesn't work provide us with the details of your servlet
container, solr version, the full stack trace, the details of how you are
configuring your sharedLib, how you declared the <fieldType>, and what your
filesystem looks like for your solrhome, war, etc...




-Hoss



RE: custom field type plugin

2013-07-20 Thread Kevin Stone
  @Override
  public SortField getSortField(SchemaField field, boolean top) {
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "Sorting not supported on PointType " + field.getName());
  }

  @Override
  public Query getFieldQuery(QParser parser, SchemaField field, String externalVal)
  {
    // externalVal is expected as "start-end" (note: split("-") would also
    // split on a leading minus sign, so negative coordinates need care)
    String[] coords = externalVal.split("-");

    if (coords.length != 2)
    {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "Invalid coordinate format for " + externalVal);
    }

    // AND together a query on each coordinate subfield
    BooleanQuery bq = new BooleanQuery(true);
    SchemaField coord1 = subField(field, 0);
    Query tq1 = coord1.getType().getFieldQuery(parser, coord1, coords[0]);
    bq.add(tq1, BooleanClause.Occur.MUST);
    SchemaField coord2 = subField(field, 1);
    Query tq2 = coord2.getType().getFieldQuery(parser, coord2, coords[1]);
    bq.add(tq2, BooleanClause.Occur.MUST);

    return bq;
  }
}

class CoordTypeValueSource extends VectorValueSource
{
  private final SchemaField sf;

  public CoordTypeValueSource(SchemaField sf, List<ValueSource> sources) {
    super(sources);
    this.sf = sf;
  }

  @Override
  public String name() {
    return "point";
  }

  @Override
  public String description() {
    return name() + "(" + sf.getName() + ")";
  }
}


RE: custom field type plugin

2013-07-20 Thread Kevin Stone
Thank you for the links, they really helped me understand. I see how the 
spatial solution works now. I think this could work as a good backup if I 
cannot get the custom field type working. The custom field would ideally be a 
bit more robust than what I mentioned before, because a region really means 
four pieces, a chromosome (e.g. 1-22), a start base pair, an end base pair, and 
the direction (forward or reverse). But if need be, the chromosome and direction 
can be multiplied into the base pairs to get it down to two translated numbers.

As for the upper bounds, I do have an idea, but it would be a large number, 
say between 1 and 10 billion depending on how I translate the values. I'll just 
have to try it out I guess.


Ok, now back to the custom field problem. From here on I'll spam source code 
and stack traces.

I started fresh, removing all places where I may have had my jar file, and 
popped in a fresh solr.war.
I define the plugin class in my schema like this:

<fieldType name="geneticLocation" class="org.jax.mgi.fe.solrplugin.GeneticLocation" ... />

and use it here:

<field name="..." type="geneticLocation" ... />


Ok, when I start solr, I get this error saying it can't find the plugin class 
that is defined in my schema.
org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] 
fieldType "geneticLocation": Error loading class 
'org.jax.mgi.fe.solrplugin.GeneticLocation'
at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:369)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:113)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
...etc...
Caused by: org.apache.solr.common.SolrException: Error loading class 
'org.jax.mgi.fe.solrplugin.GeneticLocation'
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:436)
at 
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:457)
at 
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:453)
...etc...

So, that's all fine. 

In my solr.xml, I define this sharedLib folder:

<solr sharedLib="lib" ...>
I shut the server down, drop in my CustomPlugins.jar file and start the server 
back up. And... I got a different error! It said I was missing the subFieldType 
or subFieldSuffix in my fieldType definition. So I added one: 
subFieldSuffix="_gl". Then I restart the server thinking that I'm making 
progress, and I get the old error again. I pulled out the jar, did the above 
test again to verify that it couldn't find my plugin. Then I re-add it and 
restart. Nope, still this error about AbstractSubTypeFieldType. Here is the 
full stack trace:

SEVERE: null:java.lang.NoClassDefFoundError: 
org/apache/solr/schema/AbstractSubTypeFieldType
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at 
org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:401)
at java.lang.ClassLoader.loadClass(ClassLoader.java:410)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
at java.lang.ClassLoader.loadClass(ClassLoader.java:410)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:266)
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:420)
at 
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:457)
at 
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:453)
at 
org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:81)
at 
org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:369)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:113)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
  

Re: custom field type plugin

2013-07-23 Thread Kevin Stone
What are the dangers of trying to use a range of 10 billion? Simply a
slower index time? Or will I get inaccurate results?
I have tried it on a very small sample of documents, and it seemed to
work. I could spend some time this week trying to get a more robust (and
accurate) dataset loaded to play around with. The reason for the 10
billion is to support being able to query for a region on a chromosome.

A user might want to know what genes overlap a point on a specific
chromosome. Unless I can use 3 dimensional coordinates (which gave an
error when I tried it), I'll need to multiply the coordinates by some
offset for each chromosome to be able to normalise the data (at both index
and query time). The largest chromosome (chr 1) has almost 250,000,000
base pairs. I could probably squeeze the rest a bit smaller, but I'd
rather use one size for all chromosomes, since we have more than just
human data to deal with. It would get quite messy otherwise.
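
One possible encoding, sketched in Java -- the one-billion stride and the
sign-for-direction convention are illustrative assumptions, not our actual
scheme:

class ChromosomeCoord {
    // Pack (chromosome, base pair, direction) into a single coordinate.
    // chrIndex is 1-based; the stride only has to exceed the ~250,000,000
    // base pairs of the largest chromosome, with some room to spare.
    static long encode(int chrIndex, long basePair, boolean forward) {
        final long STRIDE = 1000000000L; // one billion per chromosome
        long v = chrIndex * STRIDE + basePair;
        return forward ? v : -v;         // reverse-strand coordinates go negative
    }
}

A couple dozen chromosomes at that stride lands in the tens of billions,
which is the order of magnitude in question here.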


On 7/22/13 11:50 AM, "David Smiley (@MITRE.org)"  wrote:

>Like Hoss said, you're going to have to solve this using
>http://wiki.apache.org/solr/SpatialForTimeDurations
>Using PointType is *not* going to work because your durations are
>multi-valued per document.
>
>It would be useful to create a custom field type that wraps the capability
>outlined on the wiki to make it easier to use without requiring the user
>to
>think spatially.
>
>You mentioned that these numeric ranges extend upwards of 10 billion or
>so.
>Unfortunately, the current "prefix tree" implementation under the hood for
>non-geodetic spatial, the QuadTree, is unlikely to scale to numbers that
>big.  I don't know where the boundary is, but I doubt 10B.  You could try
>and see what happens.  I'm working (very slowly on very little spare time)
>on improving the PrefixTree implementations to scale to such large
>numbers;
>I hope something will be available this fall.
>
>~ David Smiley

Re: custom field type plugin

2013-07-23 Thread Kevin Stone
Sorry for the late response. I needed to find the time to load a lot of
extra data (closer to what we're anticipating). I have an index with close
to 220,000 documents, each with at least two coordinate regions anywhere
between -10 billion and +10 billion, but a document could potentially have up
to maybe a half dozen regions. The reason for the negatives is that you can
read a chromosome either backwards or forwards, so many coordinates are
negative.

Here is the schema field definition:

<fieldType name="geneticLocation"
    class="solr.SpatialRecursivePrefixTreeFieldType"
    multiValued="true"
    geo="false"
    worldBounds="-100000000000 -100000000000 100000000000 100000000000"
    distErrPct="0"
    maxDistErr="0.000009"
    units="degrees" />

Here is the first query in the log:

INFO:
geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0, geo=false, multiValued=true, worldBounds=-100000000000 -100000000000 100000000000 100000000000, maxDistErr=0.000009, units=degrees}} strat:
RecursivePrefixTreeStrategy(prefixGridScanLevel:46,SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false, calculator=CartesianDistCalc, worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)})))
maxLevels: 50
Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+60330+6033041244+100)"&rows=100} hits=81112 status=0 QTime=122





Here are some other queries to give different timings (the one above
brings back quite a lot):

INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+60+69+100)"&rows=100} hits=6031 status=0 QTime=10
Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+0+1000+100)"&rows=100} hits=500 status=0 QTime=15
Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+7831329+7831329+100)"&rows=100} hits=4 status=0 QTime=17
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(-100+-1051057963+-1001057963+0)"&rows=100} hits=661 status=0 QTime=8



The query times look pretty fast to me. Certainly I'm pretty impressed.
Our other backup solutions (involving SQL) likely wouldn't even touch this
in terms of speed.



We will be testing this more in depth in the coming month. I am sort of
jumping ahead of our team to research possible solutions, since this is
something that worried us. Looks like it might work!

Thanks,
-Kevin

On 7/23/13 1:47 PM, "David Smiley (@MITRE.org)"  wrote:

>Oh cool!  I'm glad it at least seemed to work.  Can you post your
>configuration of the field type and report from Solr's logs what the
>"maxLevels" is used for this field, which is logged the first time you use
>the field type?
>
>Maybe there isn't a limit under 10B after all.  Some quick'n'dirty
>calculations I just did indicate there shouldn't be a problem but
>real-world
>usage will be a better proof.  Indexing probably won't be terribly slow,
>queries could get pretty slow if the amount of indexed data is really
>high.
>I'd love to hear how it works out for you.  Your use-case would benefit a
>lot from an improved prefix tree implementation.
>
>I don't gather how a 3rd dimension would play into this.  Support for
>multi-dimensional spatial is on the drawing board.
>
>~ David

Re: custom field type plugin

2013-07-24 Thread Kevin Stone
I tried reducing the maxDistErr to "0.01", just to test making it smaller.
I got maxLevels down to 45, and slightly better query times (Indexing time
was about the same). However, my queries are not accurate anymore. I need
to pad by 2 or 3 whole numbers to get a hit now, which won't work in real
use. I can play with the number a bit more, but I didn't see anything
wrong when I had it at "0.000009". I do know about using a small
decimal value to pad around my coordinates, and I'll probably do that for
the real implementation, but for testing, whole numbers were working for
all my edge cases.

-Kevin

On 7/23/13 10:45 PM, "Smiley, David W."  wrote:

>Kevin,
>
>Those are some good query response times but they could be better.  You've
>configured the field type sub-optimally.  Look again at
>http://wiki.apache.org/solr/SpatialForTimeDurations and note in particular
>maxDistErr.  You've left it at the value that comes pre-configured with
>Solr, 0.000009, which is ~1 meter measured in degrees, and this value
>makes no sense when your numeric range is in whole numbers.  I suspect you
>inherited this value from Hoss's slides.  **Instead use 1.** (as shown on
>the wiki). This affects performance in a big way since you've configured
>the prefixTree to hold 2.22e18 values (calculated via (max-min) /
>maxDistErr) as opposed to "just" 2e10.  Your log shows maxLevels is 50 for
>quad tree.  The comments in QuadPrefixTree (and I put them there once)
>indicate maxLevels of 50 is about as much as is supported.  But again, I'm
>not certain what the limit really is without validating.  Hopefully you
>can stay clear of 50.  To do some tests, try querying just on the edge on
>either side of an indexed value to make sure you match the point and then
>don't match the indexed point as you would expect based on the
>instructions.  Also, be sure to read more of the details on "Search" on
>this wiki page in which you are advised to buffer the query shape
>slightly; you didn't do this in your examples below.  This is all a bit of
>a hack when using a field that internally is using floating point instead
>of fixed precision.
>
>~ David Smiley

Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
I am having some difficulty migrating our solr indexing scripts from using 3.5 
to solr 4.0. Notably, I am trying to track down why our performance in solr 4.0 
is about 5-10 times slower when indexing documents. Querying is still quite 
fast.

The code adds documents in groups of 1000, and adds each group to Solr in 
a thread. The documents are somewhat large, including maybe 30-40 different 
field types, mostly multivalued. Here are some snippets of the code we used in 
3.5.


 MultiThreadedHttpConnectionManager mgr = new 
MultiThreadedHttpConnectionManager();

 HttpClient client = new HttpClient(mgr);

 CommonsHttpSolrServer server = new CommonsHttpSolrServer( "some url for our 
index",client );

 server.setRequestWriter(new BinaryRequestWriter());


 Then, we delete the index, and proceed to generate documents and load the 
groups in a thread that looks kind of like this. I've omitted some overhead for 
handling exceptions, and retry attempts.


class DocWriterThread implements Runnable
{
    CommonsHttpSolrServer server;
    Collection<SolrInputDocument> docs;
    private int commitWithin = 50000; // 50 seconds (commitWithin is in milliseconds)

    public DocWriterThread(CommonsHttpSolrServer server, Collection<SolrInputDocument> docs)
    {
        this.server = server;
        this.docs = docs;
    }

    public void run()
    {
        // add the documents using the commitWithin feature
        // (exception handling omitted, as noted above)
        server.add(docs, commitWithin);
    }
}


Now, I've had to change some things to get this to compile with the Solr 4.0 
libraries. Here is what I tried to convert the above code to. I don't know if 
these are the correct equivalents, as I am not familiar with apache 
httpcomponents.



 ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();

 DefaultHttpClient client = new DefaultHttpClient(mgr);

 HttpSolrServer server = new HttpSolrServer( "some url for our solr 
index",client );

 server.setRequestWriter(new BinaryRequestWriter());
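
For what it's worth, here is the same setup with the connection-manager limits
made explicit -- a sketch, since these defaults were never checked in this
thread: ThreadSafeClientConnManager only allows 2 connections per route out of
the box, which could throttle several concurrent indexing threads.

 import org.apache.http.impl.client.DefaultHttpClient;
 import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
 import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
 import org.apache.solr.client.solrj.impl.HttpSolrServer;

 ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();
 mgr.setMaxTotal(100);          // default is 20 total connections
 mgr.setDefaultMaxPerRoute(20); // default is 2 per route (i.e. per Solr host)
 DefaultHttpClient client = new DefaultHttpClient(mgr);
 HttpSolrServer server = new HttpSolrServer("some url for our solr index", client);
 server.setRequestWriter(new BinaryRequestWriter());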




The thread method is the same, but uses HttpSolrServer instead of 
CommonsHttpSolrServer.

We also had an old solrconfig (not sure what version, but it is pre-3.x and 
had mostly default values) that I had to replace with a 4.0 style 
solrconfig.xml. I don't want to post the entire file (as it is large), but I 
copied one from the solr 4.0 examples, and made a couple changes. First, I 
wanted to turn off transaction logging. So essentially I have a line like this 
(everything inside is commented out):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- <updateLog> ... </updateLog> -->
</updateHandler>

And I added a handler for javabin:

<requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.contentType">application/javabin</str>
  </lst>
</requestHandler>


I'm not sure what other configurations I should look at. I would think that 
there should be a big obvious reason why the indexing performance would drop 
nearly 10 fold.

Against our 3.5 instance I timed our index load, and it adds roughly 40,000 
documents every 3-8 seconds.

Against our 4.0 instance it adds 40,000 documents every 70-75 seconds.

This isn't the end of the world, and I would love to use the new join feature 
in solr 4.0. However, we have many different indexes with millions of 
documents, and this kind of increase in load time is troubling.


Thanks for your help.


-Kevin




Re: Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
Do you mean commenting out the <updateLog>...</updateLog> tag? Because
that I already commented out. Or do I also need to remove the entire
<updateHandler> tag? Sorry, I am not too familiar with everything in the
solrconfig file. I have a tag that essentially looks like this:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- ... -->
</updateHandler>

Everything inside is commented out.

-Kevin

On 1/23/13 11:21 AM, "Mark Miller"  wrote:

>It's hard to guess, but I might start by looking at what the new
>UpdateLog is costing you. Take its definition out of solrconfig.xml and
>try your test again. Then let's take it from there.
>
>- Mark
>


Re: Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
I'm still poking around trying to find the differences. I found a couple
things that may or may not be relevant.
First, when I start up my 3.5 solr, I get all sorts of warnings that my
solrconfig is old and will run using 2.4 emulation.
Of course I had to upgrade the solrconfig for the 4.0 instance (which I
already described). I am curious if there could be some feature I was
taking advantage of in 2.4 that doesn't exist now in 4.0. I don't know.

Second when I look at the console logs for my server (3.5 and 4.0) and I
run the indexer against each, I see a subtle difference in this print out
when it connects to the solr core.
The 3.5 version prints this out:
webapp=/solr path=/update
params={waitSearcher=true&wt=javabin&commit=true&softCommit=false&version=2
} {commit=} 0 2722


The 4.0 version prints this out
 webapp=/solr path=/update/javabin
params={wt=javabin&commit=true&waitFlush=true&waitSearcher=true&version=2}
status=0 QTime=1404



The params for the update handler seem ever so slightly different. The 3.5
version (the one that runs fast) has a setting softCommit=false.
The 4.0 version does not print that setting, but instead prints this
setting waitFlush=true.

These could be irrelevant, but thought I should add the information.

-Kevin


Re: Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
Another revelation...
I can see that there is a time difference in the Solr output for adding
these documents when I watch it realtime.
Here are some rows from the 3.5 solr server:

Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6196
Jan 23, 2013 11:57:23 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1386104, RNA in situ-1351487, RNA in situ-1363917,
RNA in situ-1377125, RNA in situ-1371738, RNA in situ-1378746, RNA in
situ-1383410, RNA in situ-1362712, ... (1001 adds)]} 0 6266
Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6266
Jan 23, 2013 11:57:24 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1371578, RNA in situ-1377716, RNA in situ-1378151,
RNA in situ-1360580, RNA in situ-1391657, RNA in situ-1370288, RNA in
situ-1388236, RNA in situ-1361465, ... (1001 adds)]} 0 6371
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6371
Jan 23, 2013 11:57:24 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1350555, RNA in situ-1350887, RNA in situ-1379699,
RNA in situ-1373773, RNA in situ-1374004, RNA in situ-1372265, RNA in
situ-1373027, RNA in situ-1380691, ... (1001 adds)]} 0 6440
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute



And here from the 4.0 solr:

Jan 23, 2013 3:40:22 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-115650, RNA in situ-4109, RNA in situ-107614, RNA in
situ-86038, RNA in situ-19647, RNA in situ-1422, RNA in situ-119536, RNA
in situ-5, RNA in situ-86825, RNA in situ-91009, ... (1001 adds)]} 0
3105
Jan 23, 2013 3:40:23 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-38103, RNA in situ-15797, RNA in situ-79946, RNA in
situ-124877, RNA in situ-62025, RNA in situ-67908, RNA in situ-70527, RNA
in situ-20581, RNA in situ-107574, RNA in situ-96497, ... (1001 adds)]} 0
2689
Jan 23, 2013 3:40:24 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-35518, RNA in situ-50512, RNA in situ-109961, RNA in
situ-113025, RNA in situ-33729, RNA in situ-116967, RNA in situ-133871,
RNA in situ-55287, RNA in situ-67367, RNA in situ-8617, ... (1001 adds)]}
0 2367
Jan 23, 2013 3:40:28 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-105749, RNA in situ-125415, RNA in situ-14667, RNA in
situ-41067, RNA in situ-1099, RNA in situ-86169, RNA in situ-90834, RNA in
situ-114639, RT-PCR-26160, RNA in situ-79745, ... (1001 adds)]} 0 3401
Jan 23, 2013 3:40:28 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-82061, RNA in situ-96965, RNA in situ-22677, RNA in
situ-52637, RNA in situ-131842, RNA in situ-31863, RNA in situ-111656, RNA
in situ-120509, RNA in situ-29659, RNA in situ-63579, ... (1001 adds)]} 0
3580
Jan 23, 2013 3:40:31 PM
org.apache.solr.update.processor.LogUpdateProcessor finish



I know that they aren't the same exact documents (like I said, there are
millions to load), but the times look pretty much like this for all of
them.

Can someone help me parse out the times of this? It *appears* to me that
the inserts are happening just as fast, if not faster in 4.0 than 3.5, BUT
the timestamps between the LogUpdateProcessor calls are much longer in
4.0.
I do not have the <updateLog> tag anywhere in my solrconfig.xml. So why
does it look to me like it is spending a lot of time logging? It shouldn't
really be logging anything, right? Bear in mind that these inserts happen
in threads that are pushing to Solr concurrently. So if 4.0 is logging
somewhere that 3.5 didn't, then the file-locking on that log file could be
slowing me down.
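
One cheap way to test that hypothesis -- a sketch assuming the stock
JDK-logging setup that the 4.0 Jetty example ships with -- is to silence that
one logger in logging.properties and re-run the load:

# skip the per-batch add logging entirely
org.apache.solr.update.processor.LogUpdateProcessor.level = WARNING

If the 4.0 times then drop back toward the 3.5 numbers, the logging (or
contention around it) is the bottleneck.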

-Kevin
