Re: statistics in hitlist

Erick Erickson Thu, 15 Mar 2018 12:15:05 -0700

What does the fq clause look like?


On Thu, Mar 15, 2018 at 11:51 AM, John Smith <localde...@gmail.com> wrote:
> Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> have nulls in our data; the document contains many fields, we don't always
> have values for each field, but we can't set the nulls to 0 either (or any
> other value, really) as that will mess up other calculations (such as when
> calculating average etc); we would normally just ignore fields with null
> values when calculating stats manually ourselves.
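
The "ignore fields with null values" approach described above can be sketched outside Solr. This is a minimal Python illustration, not Solr behavior: the helper name `drop_incomplete` and the sample records are assumptions made up for the example.

```python
def drop_incomplete(records, fields):
    """Keep only records where every field in `fields` is present and not None."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


# Hypothetical documents; ids 2 and 3 each lack one of the two values.
records = [
    {"id": "1", "oil_first_90_days_production": 100, "oil_last_30_days_production": 40},
    {"id": "2", "oil_first_90_days_production": None, "oil_last_30_days_production": 35},
    {"id": "3", "oil_first_90_days_production": 80},
    {"id": "4", "oil_first_90_days_production": 60, "oil_last_30_days_production": 20},
]
fields = ["oil_first_90_days_production", "oil_last_30_days_production"]

complete = drop_incomplete(records, fields)
# Average over complete records only, so missing values never count as 0.
average = sum(r[fields[0]] for r in complete) / len(complete)
```

Dropping the whole record, rather than substituting 0, keeps the two columns the same length, which a pairwise calculation such as a regression needs.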
>
> Adding a check in the "q" parameter to ensure that the fields used in the
> calculations are > 0 does work now. Thanks for the tip (and sorry, should
> have caught that myself). But I am unable to use "fq" for these checks,
> they have to be added to the q instead. Adding fq's doesn't have any effect.
>
>
> Anyway, I'm trying to change this up a little. This is what I'm currently
> using (switched from "random" to "search" since I actually need the full
> hitlist not just a random subset):
>
> let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
>              fq="isParent:true", rows="1500000",
>              fl="id,oil_first_90_days_production,oil_last_30_days_production",
>              sort="id asc"),
>      b=col(a, oil_first_90_days_production),
>      c=col(a, oil_last_30_days_production),
>      d=regress(b, c))
>
> So I have 2 fields there defined, that works great (in terms of a test and
> running the query); but I need to replace the second field,
> "oil_last_30_days_production" with the avg value in
> oil_first_90_days_production.
>
> I can get the avg with this expression:
> stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> fq="isParent:true", rows="1500000", avg(oil_first_90_days_production))
>
> But I don't know how to push that avg value into the first streaming
> expression; guessing I have to set "c=...." but that is where I'm getting
> lost, since avg only returns 1 value and the first parameter, "b", returns
> a list of sorts. Somehow I have to get the avg value stuffed inside a
> "col", where it is the same value for every row in the hitlist...?
>
> Thanks for your help!
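
One way to read the question above: the goal is a column of the same length as "b" in which every entry is the average of "b". A minimal Python sketch of that idea, outside Solr, follows. The function name and the sample values are assumptions for illustration, not a Solr API.

```python
from statistics import fmean


def constant_mean_column(values):
    """Return a list the same length as `values`, filled with their mean."""
    avg = fmean(values)
    return [avg] * len(values)


# Hypothetical values standing in for col(a, oil_first_90_days_production).
b = [120.0, 80.0, 100.0, 60.0]
c = constant_mean_column(b)

# Total sum of squares: squared distance of each value from the mean.
sst = sum((y - m) ** 2 for y, m in zip(b, c))
```

A constant column has zero variance, so a simple linear regression of "b" against it has no defined slope. The constant column is mainly useful as the baseline for the total sum of squares, as computed in the last line.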
>
>
> On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
>> I suspect you've got nulls in your data. I just tested with null values and
>> got the same error. For testing purposes try loading the data with default
>> values of zero.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <joels...@gmail.com>
>> wrote:
>>
>> > Let's break the expression down and build it up slowly. Let's start with:
>> >
>> > let(echo="true",
>> >      a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
>> > fl="oil_first_90_days_production,oil_last_30_days_production"),
>> >      b=col(a, oil_first_90_days_production))
>> >
>> >
>> > This should return variables a and b. Let's see what the data looks like.
>> > I changed the rows from 15000 to 15. If it all looks good we can expand the
>> > rows and continue adding functions.
>> >
>> >
>> >
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <localde...@gmail.com> wrote:
>> >
>> >> Thanks Joel for your help on this.
>> >>
>> >> What I've done so far:
>> >> - unzip downloaded solr-7.2
>> >> - modify the _default "managed-schema" to add the random field type and
>> >> the dynamic random field
>> >> - start solr7 using "solr start -c"
>> >> - indexed my data using pint/pdouble/boolean field types etc
>> >>
>> >> I can now run the random function all by itself, it returns random
>> >> results as expected. So far so good!
>> >>
>> >> However... now trying to get the regression stuff working:
>> >>
>> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
>> >> rows="15000",
>> >> fl="oil_first_90_days_production,oil_last_30_days_production"),
>> >>     b=col(a, oil_first_90_days_production),
>> >>     c=col(a, oil_last_30_days_production),
>> >>     d=regress(b, c))
>> >>
>> >> Posted directly into solr admin UI. Run the streaming expression and I
>> >> get this error message:
>> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
>> >> expected but found type java.lang.String for value
>> >> oil_first_90_days_production"
>> >>
>> >> It thinks my numeric field is defined as a string? But when I view the
>> >> schema, those 2 fields are defined as ints:
>> >>
>> >>
>> >> When I run a normal query and choose xml as output format, then it also
>> >> puts "int" elements into the hitlist, so the schema appears to be correct;
>> >> it's just when using this regress function that something goes wrong and
>> >> solr thinks the field is string.
>> >>
>> >> Any suggestions?
>> >> Thanks!
>> >>
>> >>
>> >>
>> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <joels...@gmail.com>
>> >> wrote:
>> >>
>> >>> The field type will also need to be in the schema:
>> >>>
>> >>>  <!-- The "RandomSortField" is not used to store or search any
>> >>>       data.  You can declare fields of this type in your schema
>> >>>       to generate pseudo-random orderings of your docs for sorting
>> >>>       or function purposes.  The ordering is generated based on the
>> >>>       field name and the version of the index. As long as the index
>> >>>       version remains unchanged, and the same field name is reused,
>> >>>       the ordering of the docs will be consistent.
>> >>>       If you want different pseudo-random orderings of documents,
>> >>>       for the same version of the index, use a dynamicField and
>> >>>       change the field name in the request.
>> >>>  -->
>> >>>
>> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
>> >>>
>> >>>
>> >>> Joel Bernstein
>> >>> http://joelsolr.blogspot.com/
>> >>>
>> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <joels...@gmail.com>
>> >>> wrote:
>> >>>
>> >>> > You'll need to have this field in your schema:
>> >>> >
>> >>> > <dynamicField name="random_*" type="random" />
>> >>> >
>> >>> > I'll check to see if the default schema used with solr start -c has this
>> >>> > field, if not I'll add it. Thanks for pointing this out.
>> >>> >
>> >>> > I checked and right now the random expression is only accepting one fq,
>> >>> > but I consider this a bug. It should accept multiple. I'll create a ticket
>> >>> > for getting this fixed.
>> >>> >
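
While only one fq is accepted, one possible workaround is to fold several filter clauses into a single fq string. The Python sketch below assumes every clause is a self-contained query in the standard Lucene query syntax; the helper name `combine_filters` is hypothetical.

```python
def combine_filters(filters):
    """Join several filter queries into one by AND-ing parenthesized clauses."""
    clauses = [f"({fq.strip()})" for fq in filters if fq.strip()]
    return " AND ".join(clauses)


# Clauses taken from the expressions discussed in this thread.
fq = combine_filters(["isParent:true", "oil_first_90_days_production:[1 TO *]"])
```

The combined string selects the same documents as the separate filters, but it is likely cached as one filter rather than several, and purely negative clauses (for example `-field:value`) may need `*:*` prepended before they are wrapped in parentheses.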
>> >>> >
>> >>> >
>> >>> > Joel Bernstein
>> >>> > http://joelsolr.blogspot.com/
>> >>> >
>> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <localde...@gmail.com>
>> >>> > wrote:
>> >>> >
>> >>> >> Joel, thanks for the pointers to the streaming feature. I had no idea solr
>> >>> >> had that (and also just discovered the very interesting sql feature! I will
>> >>> >> be sure to investigate that in more detail in the future).
>> >>> >>
>> >>> >> However I'm having some trouble getting basic streaming functions working.
>> >>> >> I've already figured out that I had to move to "solr cloud" instead of
>> >>> >> "solr standalone" because I was getting errors about "cannot find zk
>> >>> >> instance" or whatever which went away when using "solr start -c" instead.
>> >>> >>
>> >>> >> But now I'm trying to use the random function since that was one of the
>> >>> >> functions used in your example.
>> >>> >>
>> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
>> >>> >>
>> >>> >> I posted that directly in the "stream" section of the solr admin UI. This
>> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
>> >>> >> it was a bug in one)
>> >>> >>
>> >>> >> I get back an error message:
>> >>> >> *sort param could not be parsed as a query, and is not a field that exists
>> >>> >> in the index: random_-255009774*
>> >>> >>
>> >>> >> I'm not passing in any sort field anywhere. But the solr logs show these
>> >>> >> three log entries:
>> >>> >>
>> >>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
>> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
>> >>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
>> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
>> >>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
>> >>> >> status=400 QTime=19
>> >>> >>
>> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
>> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
>> >>> >> Request to collection [tx_header] failed due to (400)
>> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort param
>> >>> >> could not be parsed as a query, and is not a field that exists in the
>> >>> >> index: random_-255009774, retry? 0
>> >>> >>
>> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
>> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
>> >>> >> java.io.IOException:
>> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort param
>> >>> >> could not be parsed as a query, and is not a field that exists in the
>> >>> >> index: random_-255009774
>> >>> >>
>> >>> >>
>> >>> >> So basically it looks like solr is injecting the "sort=random_" stuff into
>> >>> >> my query and of course that is failing on the search since that
>> >>> >> field/column doesn't exist in my schema. Every time I run the random
>> >>> >> function, I get a slightly different field name that it injects, but they
>> >>> >> all start with "random_" etc.
>> >>> >>
>> >>> >> I have tried adding my own sort field instead, hoping solr wouldn't inject
>> >>> >> one for me, but it still injected a random sort fieldname:
>> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname",
>> >>> >> sort="countyname asc")
>> >>> >>
>> >>> >>
>> >>> >> Assuming I can fix that whole problem, my second question is: can I add
>> >>> >> multiple "fq=" parameters to the random function? I build a pretty
>> >>> >> complicated query using many fq= fields, and then want to run some stats on
>> >>> >> that hitlist; so somehow I have to pass in the query that made up the exact
>> >>> >> hitlist to these various functions, but when I used multiple "fq=" values
>> >>> >> it only seemed to use the last one I specified and just ignored all the
>> >>> >> previous fq's?
>> >>> >>
>> >>> >> Thanks in advance for any comments/suggestions...!
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joels...@gmail.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >> > This is going to be a complex answer because Solr actually now has multiple
>> >>> >> > ways of doing regression analysis as part of the Streaming Expression
>> >>> >> > statistical programming library. The basic documentation is here:
>> >>> >> >
>> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
>> >>> >> >
>> >>> >> > Here is a sample expression that performs a simple linear regression in
>> >>> >> > Solr 7.2:
>> >>> >> >
>> >>> >> > let(a=random(collection1, q="any query", rows="15000",
>> >>> >> >              fl="fieldA, fieldB"),
>> >>> >> >     b=col(a, fieldA),
>> >>> >> >     c=col(a, fieldB),
>> >>> >> >     d=regress(b, c))
>> >>> >> >
>> >>> >> >
>> >>> >> > The expression above takes a random sample of 15000 results from
>> >>> >> > collection1. The result set will include fieldA and fieldB in each record.
>> >>> >> > The result set is stored in variable "a".
>> >>> >> >
>> >>> >> > Then the "col" function creates arrays of numbers from the results stored
>> >>> >> > in variable a. The values in fieldA are stored in the variable "b". The
>> >>> >> > values in fieldB are stored in variable "c".
>> >>> >> >
>> >>> >> > Then the regress function performs a simple linear regression on arrays
>> >>> >> > stored in variables "b" and "c".
>> >>> >> >
>> >>> >> > The output of the regress function is a map containing the regression
>> >>> >> > result. This result includes RSquared and other attributes of the
>> >>> >> > regression model such as R (correlation), slope, y intercept etc...
>> >>> >> >
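
The arithmetic behind a simple linear regression on two arrays, as described above, can be sketched in plain Python. This is an illustrative ordinary-least-squares calculation and not Solr's implementation. Apart from "RSquared" and "R", which the message mentions, the key names in the returned map are assumptions.

```python
from math import sqrt


def simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation
    return {"slope": slope, "intercept": intercept, "R": r, "RSquared": r * r}


# Hypothetical arrays standing in for variables "b" and "c"; here y = 2x + 1.
result = simple_regression([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

For a simple linear regression with an intercept, RSquared equals the square of the correlation R, which is why the sketch derives one from the other.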
>> >>> >> > Joel Bernstein
>> >>> >> > http://joelsolr.blogspot.com/
>> >>> >> >
>> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localde...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >
>> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
>> >>> >> > > result of all this is supposed to be obtaining R^2. Is there no way of
>> >>> >> > > obtaining this value, then (short of iterating over all the results in the
>> >>> >> > > hitlist and calculating it myself)?
>> >>> >> > >
>> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <joels...@gmail.com>
>> >>> >> > > wrote:
>> >>> >> > >
>> >>> >> > > > Typically SSE is the sum of the squared errors of the prediction in a
>> >>> >> > > > regression analysis. The stats component doesn't perform regression,
>> >>> >> > > > although it might be a nice feature.
>> >>> >> > > >
>> >>> >> > > >
>> >>> >> > > >
>> >>> >> > > > Joel Bernstein
>> >>> >> > > > http://joelsolr.blogspot.com/
>> >>> >> > > >
>> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <localde...@gmail.com>
>> >>> >> > > > wrote:
>> >>> >> > > >
>> >>> >> > > > > I'm using solr, and enabling stats as per this page:
>> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
>> >>> >> > > > >
>> >>> >> > > > > I want to get more stat values though. Specifically I'm looking for
>> >>> >> > > > > r-squared (coefficient of determination). This value is not present in
>> >>> >> > > > > solr, however some of the pieces used to calculate r^2 are in the stats
>> >>> >> > > > > element, for example:
>> >>> >> > > > >
>> >>> >> > > > > <double name="min">0.0</double>
>> >>> >> > > > > <double name="max">10.0</double>
>> >>> >> > > > > <long name="count">15</long>
>> >>> >> > > > > <long name="missing">17</long>
>> >>> >> > > > > <double name="sum">85.0</double>
>> >>> >> > > > > <double name="sumOfSquares">603.0</double>
>> >>> >> > > > > <double name="mean">5.666666666666667</double>
>> >>> >> > > > > <double name="stddev">2.943920288775949</double>
>> >>> >> > > > >
>> >>> >> > > > >
>> >>> >> > > > > So I have the sumOfSquares available (SST), and using this calculation, I
>> >>> >> > > > > can get R^2:
>> >>> >> > > > >
>> >>> >> > > > > R^2 = 1 - SSE/SST
>> >>> >> > > > >
>> >>> >> > > > > All I need then is SSE. Is there any way I can get SSE from those other
>> >>> >> > > > > stats in solr?
>> >>> >> > > > >
>> >>> >> > > > > Thanks in advance!
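
A note on the formula above: in R^2 = 1 - SSE/SST, SST is usually the sum of squared deviations from the mean, not the raw sum of squares. The Python sketch below derives it from the stats values quoted in the message; the function names are hypothetical. SSE depends on the residuals of a fitted model, so it cannot be recovered from single-field statistics alone.

```python
from math import sqrt


def total_sum_of_squares(sum_of_squares, count, mean):
    """SST about the mean: sum(y^2) - n * mean^2."""
    return sum_of_squares - count * mean ** 2


def r_squared(sse, sst):
    """Coefficient of determination from residual and total sums of squares."""
    return 1.0 - sse / sst


# Values quoted from the stats output above: sumOfSquares, count, sum / count.
count = 15
sst = total_sum_of_squares(603.0, count, 85.0 / count)

# Cross-check: the sample standard deviation implied by SST should
# match the reported stddev of about 2.9439.
stddev_check = sqrt(sst / (count - 1))
```

With these numbers SST comes out near 121.33 rather than 603, and the cross-check reproduces the reported stddev.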
>> >>> >> > > > >
>> >>> >> > > >
>> >>> >> > >
>> >>> >> >
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
