Using hundreds of dynamic fields

2014-06-27 Thread Andy Crossen
Hi folks,

My application requires tracking a daily performance metric for all
documents. I start tracking for an 18 month window from the time a doc is
indexed, so each doc will have ~548 of these fields.  I have in my schema a
dynamic field to capture this requirement:



Example:
metric_2014_06_24 : 15
metric_2014_06_25 : 21
…

My application then issues a query that:
a) sorts documents by the sum of the metrics within a date range that is
variable for each query;
b) gathers stats on the metrics using the Statistics component.

With this design, the app must unfortunately:
a) build the sort from a long list of every field in the requested date
range in order to compute the sum; e.g.
sort=sum(metric_2014_06_24,metric_2014_06_25…) desc
b) pass each field in the range to the Stats component separately; e.g.
stats.field=metric_2014_06_24&stats.field=metric_2014_06_25…
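
To make (a) and (b) concrete, here is a minimal Python sketch of how the app could generate those parameters from a date range instead of writing them by hand. It assumes the metric_YYYY_MM_DD naming from the example above. The helper names metric_fields and build_params and the q=*:* default are invented for illustration; only the sort, stats, and stats.field parameters come from the description above.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def metric_fields(start, end):
    """Return one metric_YYYY_MM_DD field name per day, both ends inclusive."""
    days = (end - start).days
    return [f"metric_{(start + timedelta(d)):%Y_%m_%d}" for d in range(days + 1)]


def build_params(start, end, q="*:*"):
    """Build the request parameters as a list of (name, value) pairs."""
    fields = metric_fields(start, end)
    # A single day needs no sum(); otherwise sum over every field in the range.
    sort_expr = fields[0] if len(fields) == 1 else f"sum({','.join(fields)})"
    params = [("q", q), ("sort", f"{sort_expr} desc"), ("stats", "true")]
    # The Stats component takes one stats.field parameter per field.
    params += [("stats.field", name) for name in fields]
    return params


params = build_params(date(2014, 6, 24), date(2014, 6, 26))
query_string = urlencode(params)  # URL-encoded form, ready to append to /select?
```

This only hides the verbosity on the client side. The request itself still grows by one field name per day in the range.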

Am I missing a cleaner way to accomplish this given the requirements above?

Thanks for any suggestions you may have.


Re: Using hundreds of dynamic fields

2014-07-16 Thread Andy Crossen
Thanks, Jack and Jared, for your input on this.  I'm looking into whether
parent-child relationships, via either a block join or a query-time join,
will meet my requirements.
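
As a rough illustration of the parent-child idea, the Python sketch below reshapes one wide document into a parent plus one child per day and builds a block-join query over a date range. The field names doc_type, metric_date, and metric_value, the child id scheme, and both helper names are assumptions made up for this example. The _childDocuments_ key and the {!parent which=...} syntax are how I understand Solr's JSON update format and block join parser to work, so check both against your Solr version.

```python
from datetime import datetime


def to_block(doc, prefix="metric_"):
    """Split metric_YYYY_MM_DD fields off into one child document per day."""
    parent = {k: v for k, v in doc.items() if not k.startswith(prefix)}
    parent["doc_type"] = "parent"  # hypothetical marker field
    children = []
    for name, value in sorted(doc.items()):
        if not name.startswith(prefix):
            continue
        day = datetime.strptime(name[len(prefix):], "%Y_%m_%d")
        children.append({
            "id": f"{doc['id']}_{day:%Y%m%d}",  # hypothetical child id scheme
            "doc_type": "metric",
            "metric_date": f"{day:%Y-%m-%d}T00:00:00Z",
            "metric_value": value,
        })
    parent["_childDocuments_"] = children
    return parent


def parents_with_metrics_between(start_day, end_day):
    """Block-join query: parents having a child metric in the date range."""
    return ('{!parent which="doc_type:parent"}'
            f"metric_date:[{start_day}T00:00:00Z TO {end_day}T00:00:00Z]")


block = to_block({"id": "doc1", "metric_2014_06_24": 15, "metric_2014_06_25": 21})
query = parents_with_metrics_between("2014-06-24", "2014-06-25")
```

Note that this query only selects parents. Whether the per-parent sum for the sort can be computed this way is exactly the open question.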

Jack, I noticed in a number of other posts around the web that you've
suggested using dynamic fields in moderation.  Is that advice based on the
performance cost of having to read and rewrite all of a document's existing
fields when doing atomic updates?  Or are there other inherent drawbacks to
using lots of dynamic fields?
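
For context, this is the kind of atomic update I have in mind when adding one day's metric. It is a small Python sketch: the id "doc1", the value 17, and the helper name atomic_set are placeholders. The {"set": ...} modifier is Solr's atomic update syntax, and my understanding is that Solr applies it by rebuilding the whole document from its stored fields and reindexing it, which is where the read-and-rewrite cost would come from.

```python
import json


def atomic_set(doc_id, field, value):
    """Build an atomic-update document that sets a single field."""
    return {"id": doc_id, field: {"set": value}}


# One new daily metric; the other ~547 fields are not sent by the client.
payload = json.dumps([atomic_set("doc1", "metric_2014_06_26", 17)])
```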

Andy


On Fri, Jun 27, 2014 at 11:46 AM, Jared Whiklo wrote:

> This is probably not the best answer, but my gut says that if you changed
> your document to just two fields, one holding the metric and the other a
> TrieDateField, you would speed up and simplify your date range queries.
>
>
> --
> Jared Whiklo
>
>
>


Sorting by a dynamically-generated field in a distributed context

2014-01-21 Thread Andy Crossen
Hi folks,

Using Solr 4.6.0 in a cloud configuration, I'm developing a SearchComponent
that generates a custom score for each document.  Its operational flow
looks like this:

1. The score is derived from an analysis of search results coming out of
the QueryComponent.  Therefore, the component is installed after
QueryComponent in the processing chain.
2. The scores are generated in the component's process method (i.e. at the
shard level), and a map of uniqueKey:score is attached to each shard's
response at this point.
3. The shard-wise maps are combined in handleResponses and the aggregate
map is attached to the top-level distributed query's response.
4. In the finishStage method at the coordinator node level (i.e. response
stage = Get Fields), I'm presented with the final list of search results
sorted by Lucene score.  My custom scores are now added as fields to their
corresponding documents based on a uniqueKey lookup in the aggregate score
map.
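
Steps 3 and 4 boil down to the following, shown here as a language-neutral Python sketch rather than actual SearchComponent code. It assumes each uniqueKey lives on exactly one shard, so merging the maps never has to resolve a conflict. The function names, the "id" key, and the "custom_score" field name are placeholders.

```python
def merge_shard_scores(shard_maps):
    """Step 3: combine the per-shard uniqueKey -> score maps into one."""
    merged = {}
    for shard_map in shard_maps:
        merged.update(shard_map)  # assumes no key appears on two shards
    return merged


def attach_scores(docs, scores, key="id", field="custom_score"):
    """Step 4: add the custom score to each document by uniqueKey lookup."""
    for doc in docs:
        if doc[key] in scores:
            doc[field] = scores[doc[key]]
    return docs


aggregate = merge_shard_scores([{"a": 0.5, "b": 0.9}, {"c": 0.7}])
results = attach_scores([{"id": "c"}, {"id": "a"}, {"id": "z"}], aggregate)
```

Documents without an entry in the aggregate map are left unchanged.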

Now I need to sort the final document list (or do it at the shard level) by
the custom score, but I'm having trouble understanding how to accomplish
this.  Yes, I could just sort my list (which will never exceed 1K results)
in finishStage and be done with it, but I'm trying to learn Solr best
practices to see if there's a better way.  At the end of the day, I'd like
to be able to take advantage of the "sort" request parameter to effect my
sort.
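
The fallback of sorting the final list by hand could still honor a sort-style specification. The Python sketch below is only an illustration of that fallback, not of how Solr parses "sort": it handles plain "field direction" clauses, the function name apply_sort is made up, and it assumes every document carries every field named in the spec.

```python
def apply_sort(docs, sort_spec):
    """Sort docs by a spec such as "custom_score desc, score desc"."""
    clauses = [clause.split() for clause in sort_spec.split(",")]
    # Python's sort is stable, so sorting by the last clause first and the
    # first clause last gives a correct multi-key ordering.
    for field, direction in reversed(clauses):
        docs = sorted(docs, key=lambda d: d[field], reverse=(direction == "desc"))
    return docs


docs = [
    {"id": "a", "score": 1.0, "custom_score": 0.5},
    {"id": "b", "score": 2.0, "custom_score": 0.9},
    {"id": "c", "score": 3.0, "custom_score": 0.5},
]
ordered = apply_sort(docs, "custom_score desc, score desc")
```

With at most 1K results, the cost of this extra pass should be negligible.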

Given the current operational flow, it seems like I'd need to add a new
SortField for my score in step 4 and reinvoke QueryComponent's mergeIds
sort routine now that my custom field is present in the document list.  Of
course, I can't do that since it's all private code; nor does it seem wise
from an extensibility perspective to copy that code into my component for
use in this manner.

Reading Sujit Pal's blog post on "Custom Sorting in Solr using External
Database Data", I started down the path of defining a custom
FieldType/FieldComparatorSource for my score, but I didn't see how that
would help since the sort is still applied in QueryComponent - before my
custom score is available.  Regardless, Sujit's example seems pretty close
to what I want.

I must be misusing/misunderstanding the distributed design here in some
way.  Can an expert on distributed search components weigh in here?

Thanks!


Duplicate scoring situation in DelegatingCollector

2014-11-14 Thread Andy Crossen
Hi folks,

I have a DelegatingCollector installed via a PostFilter (kind of like an
AnalyticsQuery) that needs the document score to a) add to a collection of
score-based stats, and b) decide whether to keep the document based on the
score.

If I keep the document, I call super.collect(), which passes it to the
delegate (a TopScoreDocCollector); that collector then re-scores the
document in its own collect method.  The scoring is custom and reasonably
expensive.

Is there an easy way to avoid this?  Or do I have to stop calling
super.collect(), manage my own bitset/PQ, and pass the filtered results in
the DelegatingCollector's finish() method?
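
One direction I'm considering is to cache the score per document, so that the second score() call made by the delegate is free. The Python sketch below models the idea only. None of these classes are Solr or Lucene APIs; all names are invented. As far as I know, Lucene ships a ScoreCachingWrappingScorer that plays the role of CachingScorer here, wrapped around the scorer in setScorer before it is handed to the delegate, but I have not verified that it fits this case.

```python
class CountingScorer:
    """Stand-in for the expensive scorer; counts how often score() runs."""
    def __init__(self, scores):
        self.scores, self.doc, self.calls = scores, None, 0

    def score(self):
        self.calls += 1
        return self.scores[self.doc]


class CachingScorer:
    """Computes the score at most once per current document."""
    def __init__(self, inner):
        self.inner, self._doc, self._score = inner, None, None

    def score(self):
        if self._doc != self.inner.doc:
            self._score = self.inner.score()
            self._doc = self.inner.doc
        return self._score


class TopCollector:
    """Stand-in for the delegate, which asks for the score again."""
    def __init__(self):
        self.hits = []

    def set_scorer(self, scorer):
        self.scorer = scorer

    def collect(self, doc):
        self.hits.append((doc, self.scorer.score()))


class ThresholdFilter:
    """Records score stats and forwards only documents above a threshold."""
    def __init__(self, delegate, min_score):
        self.delegate, self.min_score, self.seen = delegate, min_score, []

    def set_scorer(self, scorer):
        self.scorer = CachingScorer(scorer)    # wrap once
        self.delegate.set_scorer(self.scorer)  # delegate shares the cache

    def collect(self, doc):
        score = self.scorer.score()
        self.seen.append(score)
        if score >= self.min_score:
            self.delegate.collect(doc)


raw = CountingScorer({0: 0.2, 1: 0.9, 2: 0.7})
top = TopCollector()
collector = ThresholdFilter(top, min_score=0.5)
collector.set_scorer(raw)
for doc_id in (0, 1, 2):
    raw.doc = doc_id  # the search loop positions the scorer on each doc
    collector.collect(doc_id)
```

In this model the expensive scorer runs once per document even though both collectors ask for the score.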

There's a thread out there ("Configurable collectors for custom ranking")
that kind of talks about the above.  Seems cumbersome.

Thanks for any direction!