Thank you Joel. I'm really having a good time with the machine learning
component in Solr. In this case, the weather model was built by
classifying tweets as positive or negative. I started by searching for
tweets with terms like tornado, storm, forecast, typhoon, hurricane,
blizzard, snow, lightning, flood warning, etc., and marking those as
positive. Then I grabbed some random tweets about Trump, ISIS,
Kardashian, etc. as negative examples. At that point I started to
classify data and refine the model (adding more positive/negative) as
more data came into the system.
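For anyone following along, a model like this can be built with Solr's features/train streaming expressions. Roughly like the sketch below - the collection name (tweets), field name (tweet_text), and feature-set name are illustrative placeholders, not my actual setup. The features() expression extracts the top information-gain terms from the labeled tweets, train() fits a logistic regression model on those terms (out_i is 1 for positive, 0 for negative), and update() stores the resulting model in the models collection:

```
update(models,
    train(tweets,
        features(tweets,
            q="*:*",
            featureSet="weatherFeatures",
            field="tweet_text",
            outcome="out_i",
            numTerms=150),
        q="*:*",
        name="WeatherModel",
        field="tweet_text",
        outcome="out_i",
        maxIterations="100"))
```

After that, the stored model can be pulled back with model(models, id="WeatherModel") as in my query below.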
I hope that helps. The model works very well at this point with just
650 manually classified tweets (split about evenly between positive and
negative) and 150 terms.
I like your idea of using the model to re-rank the top n search
results. That said, the results can be significantly 'better' if I
classify more data and reorder by high probability scores, but, as you
pointed out, at the cost of much slower searches. In some cases I
suspect a user may want to search with just a model and no search
terms; there it may be best to classify data as it arrives. I guess
it's a toss-up between what matters more: a high probability from the
classifier or a high rank from the search engine.
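If I do go the re-rank route, I'd expect it to look roughly like this - classify only the top relevance-ranked results, then re-sort that small set by the classifier's probability. This is a sketch using the names from my query below, not something I've tested; the n and rows values are arbitrary:

```
top(n=100,
    classify(
        model(models, id="WeatherModel", cacheMillis=5000),
        search(COL1,
            df="FULL_DOCUMENT",
            q="Hawaii",
            fl="ClusterText,id,score",
            sort="score desc",
            rows="500"),
        field="ClusterText"),
    sort="probability_d desc")
```

That way only 500 docs per query go through the model instead of the whole result set, which should keep searches fast while still letting the classifier reorder the top of the list.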
Thanks Joel.
-Joe
On 8/23/2017 3:08 PM, Joel Bernstein wrote:
Can you describe the weather model?
In general the idea is to rerank the top N docs, because it will be too
slow to classify the whole result set.
In this scenario the search engine ranking will already be returning
relevant candidate documents and the model is only used to get a better
ordering of the top docs.
Joel Bernstein
http://joelsolr.blogspot.com/
On Tue, Aug 22, 2017 at 12:32 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:
Hi All - One of the really neat features of Solr 6 is the ability to
create machine learning models (information gain) and then use those models
as a query. If I want a user to be able to execute a query for the text
Hawaii and use a machine learning model related to weather data, how can I
correctly rank the results? It looks like I would need to classify all the
documents in some date range (assuming the query is date restricted), look
at the probability_d and pick the top n documents. Is there a better way
to do this?
I'm using a stream like this:
classify(
    model(models, id="WeatherModel", cacheMillis=5000),
    search(COL1,
        df="FULL_DOCUMENT",
        q="Hawaii AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
        fl="ClusterText,id",
        sort="id asc",
        rows="10000"),
    field="ClusterText")
This is sent to all the shards, each of which can return at most 10,000 docs.
Thanks!
-Joe