Thank you, Joel.  I'm really having a good time with the machine learning component in Solr.  In this case, the weather model was built by classifying tweets as positive or negative.  I started by searching for tweets with terms like tornado, storm, forecast, typhoon, hurricane, blizzard, snow, lightning, flood warning, etc. and labeling those as positive.  Then I grabbed some random tweets about Trump, ISIS, Kardashian, etc. to use as negative examples.  From there I kept classifying data and refining the model (adding more positive and negative examples) as more data came into the system.

I hope that helps.  The model works very well at this point with just 650 manually classified tweets (split about evenly between positive and negative) and 150 terms.

I like your idea about using the model to re-rank the top n search results.  That said, the results can be significantly 'better' if I classify more data and reorder based on high probability scores, but, as you pointed out, at the cost of much slower searches.  In some cases, I suspect a user may want to search with just a model and no search terms; in those cases it may be best to classify data as it comes in.  I guess it's a toss-up between which is more important: high probability from the classifier or high rank from the search engine.
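
One way to picture that trade-off is a weighted blend of the two signals. The following is only a rough client-side sketch in Python, not something Solr does for you: it assumes each returned document carries the search engine's relevance in a field called score and the classifier output in probability_d, and the function name blended_order and the weight alpha are made up for illustration.

```python
from typing import Dict, List


def blended_order(docs: List[Dict], alpha: float = 0.5) -> List[Dict]:
    """Order docs by a weighted mix of search score and classifier probability.

    alpha = 1.0 trusts only the search engine rank, alpha = 0.0 trusts only
    the classifier. The field names "score" and "probability_d" are
    assumptions about what each document dict contains.
    """
    # Scale search scores into [0, 1] so they are comparable to a probability.
    max_score = max((d["score"] for d in docs), default=0.0)

    def blended(doc: Dict) -> float:
        norm = doc["score"] / max_score if max_score > 0 else 0.0
        return alpha * norm + (1.0 - alpha) * doc["probability_d"]

    # Highest blended value first; sorted() is stable, so ties keep input order.
    return sorted(docs, key=blended, reverse=True)
```

Moving alpha toward 1.0 favors the search engine's ranking, and moving it toward 0.0 favors the classifier.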
Thanks Joel.

-Joe


On 8/23/2017 3:08 PM, Joel Bernstein wrote:
Can you describe the weather model?

In general the idea is to rerank the top N docs, because it will be too
slow to classify the whole result set.

In this scenario the search engine ranking will already be returning
relevant candidate documents and the model is only used to get a better
ordering of the top docs.
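
The top-N idea above can be sketched outside of Solr in a few lines of Python. This is an illustration under assumptions, not Solr's implementation: rerank_top_n is a hypothetical helper, docs is assumed to already be in search-engine order, and classify stands in for whatever returns the model's probability for a document.

```python
from typing import Callable, Dict, List


def rerank_top_n(
    docs: List[Dict],
    classify: Callable[[Dict], float],
    n: int = 100,
) -> List[Dict]:
    """Reorder only the first n docs by classifier probability.

    docs must already be sorted by search relevance. Only n documents are
    classified, which keeps the cost bounded no matter how large the full
    result set is. The remaining docs keep their original order.
    """
    head, tail = docs[:n], docs[n:]
    # Keep the original position so equal probabilities fall back to search rank.
    scored = [(classify(doc), pos, doc) for pos, doc in enumerate(head)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, _, doc in scored] + tail
```

The classifier is called n times at most, so query latency depends on n rather than on the number of matching documents.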



Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 22, 2017 at 12:32 PM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:

Hi All - One of the really neat features of Solr 6 is the ability to
create machine learning models (information gain) and then use those models
as a query.  If I want a user to be able to execute a query for the text
Hawaii and use a machine learning model related to weather data, how can I
correctly rank the results?  It looks like I would need to classify all the
documents in some date range (assuming the query is date restricted), look
at the probability_d and pick the top n documents.  Is there a better way
to do this?

I'm using a stream like this:
classify(model(models,id="WeatherModel",cacheMillis=5000),
         search(COL1,
                df="FULL_DOCUMENT",
                q="Hawaii AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
                fl="ClusterText,id",
                sort="id asc",
                rows="10000"),
         field="ClusterText")

This sends the search to every shard, each of which can return at most 10,000 docs.

Thanks!

-Joe


