Thanks Doug. It's funny that you should mention that. It is very hard to convince people that just because words are somehow related, we really don't know *how* they are related. This is especially true when they are handed the results of a shallow neural net that took a research team a few weeks to put together.
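For example, a quick sanity check along these lines usually makes the point. This is just a sketch, assuming a gensim word2vec model trained on our corpus (the model path is hypothetical and the scores vary by corpus): antonyms routinely land in the top neighbours because they share contexts.

from gensim.models import Word2Vec

# Hypothetical path to a model trained on our corpus.
model = Word2Vec.load("our_corpus.w2v")

# "Similar" here means "appears in similar contexts", so antonyms and
# functionally related words often rank near the top of this list.
print(model.wv.most_similar("good", topn=5))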
I am always happy to have the reminder about common and rare words. Honestly, I am not that happy with the size of our corpus, but it might be just enough. Alternatively, we could weight the embedding results really low when the search engine ranks documents from most to least relevant (something like the re-rank sketch at the bottom of this message). Oh, given that a lack of text is a problem, is there an issue with doing this on Twitter data? I assume that running vector relationships over Twitter data is probably not going to do much.

Thank you so much for the feedback.

~Ben

On Tue, Oct 30, 2018 at 5:59 PM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> You may already know this, but just be very careful. Embeddings are
> useful, but people often think of them as detecting synonyms when really
> they just encode contexts. For example, antonyms and words with similar
> functions often come out as similar.
>
> There are also issues with terms that occur sparsely (you don't get
> enough contexts to build a good embedding) and with terms that occur very
> commonly (they tend to clump together despite different meanings).
>
> This covers an older form of embedding, but the lessons still apply:
>
> https://opensourceconnections.com/blog/2016/03/29/semantic-search-with-latent-semantic-analysis/
>
> I'd also recommend my talk at Activate, which spends a ton of time on
> building/customizing embeddings for your use case:
>
> https://docs.google.com/presentation/d/1-nPKX5VYUR7uue5IL0tm7M2YH0agb0aRO1y9sMKl1Hs/edit#slide=id.g3abdd68a3e_0_192
>
> -Doug
>
> On Tue, Oct 30, 2018 at 5:37 PM Benedict Holland <
> benedict.m.holl...@gmail.com> wrote:
>
> > Oh very cool. I will have to look into this more. This is something up
> > and coming, I take it?
> >
> > Thanks,
> > ~Ben
> >
> > On Tue, Oct 30, 2018 at 4:36 PM Alexandre Rafalovitch <
> > arafa...@gmail.com> wrote:
> >
> > > Simon Hughes's presentation at the just-finished Activate may be
> > > relevant:
> > >
> > > https://www.slideshare.net/SimonHughes13/vectors-in-search-towards-more-semantic-matching
> > >
> > > The video will be available in a couple of weeks, I am guessing on
> > > the LucidWorks channel.
> > >
> > > Related repos:
> > > *) https://github.com/DiceTechJobs/VectorsInSearch
> > > *) https://github.com/DiceTechJobs/ConceptualSearch (older)
> > > *) https://github.com/kojisekig/word2vec-lucene (something else quite old)
> > >
> > > These are just keyword matches on your question. I am sure others may
> > > have some more real details.
> > >
> > > Regards,
> > >    Alex.
> > >
> > > On Tue, 30 Oct 2018 at 16:09, Benedict Holland
> > > <benedict.m.holl...@gmail.com> wrote:
> > > >
> > > > Hello all,
> > > >
> > > > We came up with a fascinating question. We actually have word2vec,
> > > > doc2vec, and GloVe results for our corpora. Is it possible to use
> > > > these datasets within the search engine? If so, could you please
> > > > point me to documentation on how to get Solr to use them?
> > > >
> > > > Thank you so much,
> > > > ~Ben
>
> --
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug
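(The sketch referenced above.) A minimal example of mixing embedding evidence in at a low weight, assuming Solr's stock ReRankQParserPlugin and our gensim model; the core name, field name, query term, and expand_terms helper are all hypothetical:

import requests
from gensim.models import Word2Vec

model = Word2Vec.load("our_corpus.w2v")  # hypothetical model path

def expand_terms(term, topn=3):
    # Nearest embedding neighbours, used only as expansion terms.
    return [w for w, _ in model.wv.most_similar(term, topn=topn)]

user_term = "budget"  # hypothetical query term
params = {
    "q": "text:" + user_term,
    # Re-score only the top 100 lexical hits, mixing the embedding-derived
    # query in at a deliberately low weight so lexical relevance dominates.
    "rq": "{!rerank reRankQuery=$rrq reRankDocs=100 reRankWeight=0.2}",
    "rrq": "text:(" + " OR ".join(expand_terms(user_term)) + ")",
}
print(requests.get("http://localhost:8983/solr/corpus/select", params=params).json())

The reRankWeight knob is the "weight it really low" idea: the embedding evidence can nudge the lexical ordering but not overturn it.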