Joe,

To do this soundly, you will need to sample the data and label each sample as threatening or neutral. You can probably expand on this quite a bit, but that would be a good start. You can then draw a second set of samples and see how the model did on it: use one set to train and the other to validate.
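The train/validate idea above can be sketched in a few lines of Python. This is only an illustration, not anything Solr does for you: the function names `stratified_split` and `precision_recall`, the `holdout_fraction` parameter, and the `predict` callback are all hypothetical, and the labels are assumed to be +1 (threatening) and -1 (neutral) as in the out_i field quoted below.

```python
import random
from collections import defaultdict


def stratified_split(examples, holdout_fraction=0.3, seed=0):
    """Split (text, label) pairs into train and validation lists.

    Each label is split separately so both sets keep roughly the same
    threatening/neutral ratio as the full labeled sample.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, validate = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Hold out at least one example per label.
        n_holdout = max(1, round(len(items) * holdout_fraction))
        validate.extend(items[:n_holdout])
        train.extend(items[n_holdout:])
    return train, validate


def precision_recall(predict, validate, positive=1):
    """Score a predict(text) -> label function on the held-out set."""
    tp = fp = fn = 0
    for text, label in validate:
        guess = predict(text)
        if guess == positive and label == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif label == positive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

You would train only on the first list, then pass whatever classifier you end up with as `predict` and score it on the second list. If precision and recall on the held-out set are no better than a trivial rule such as "always threatening", the model has not learned anything useful.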
What you are doing now is probably just noise from a modeling point of view, and it will probably not make much difference how you index, query, or model through that noise. I don't mean this critically, just plainly. In effect, the less mathematically rigorous the process, the more anecdotal the result.

tim

On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein <joels...@gmail.com> wrote:

> I've only tested with the training data in its own collection, but it was
> designed for multiple training sets in the same collection.
>
> I suspect your training set is too small to get a reliable model from.
> The training sets we tested with were considerably larger.
>
> All the idfs_ds values being the same seems odd though. The idfs_ds in
> particular were designed to be accurate when there are multiple training
> sets in the same collection.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
> > If I put the training data into its own collection and use q="*:*", then
> > it works correctly. Is that a requirement?
> > Thank you.
> >
> > -Joe
> >
> > On 3/20/2017 3:47 PM, Joe Obernberger wrote:
> >
> >> I'm trying to build a model using tweets. I've manually tagged 30 tweets
> >> as threatening, and 50 random tweets as non-threatening. When I build the
> >> model with:
> >>
> >> update(models2, batchSize="50",
> >>        train(UNCLASS,
> >>              features(UNCLASS,
> >>                       q="ProfileID:PROFCLUST1",
> >>                       featureSet="threatFeatures3",
> >>                       field="ClusterText",
> >>                       outcome="out_i",
> >>                       positiveLabel=1,
> >>                       numTerms=250),
> >>              q="ProfileID:PROFCLUST1",
> >>              name="threatModel3",
> >>              field="ClusterText",
> >>              outcome="out_i",
> >>              maxIterations="100"))
> >>
> >> It appears to work, but all the idfs_ds values are identical. The
> >> terms_ss values look reasonable, but nearly all the weights_ds are 1.0.
> >> out_i is either -1 for non-threatening tweets or +1 for threatening
> >> tweets. I'm trying to follow along with Joel Bernstein's excellent
> >> post here:
> >> http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html
> >>
> >> Tips?
> >>
> >> Thank you!
> >>
> >> -Joe