That's interesting. I've been really reluctant to "hard code" n-grams, but it's probably the best way to go.
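
In case it helps anyone reading along, here is a rough sketch of the kind of n-gram index John describes in the quoted message below. It is in Python rather than Clojure only to keep it short, and a plain dict stands in for leveldb. The function names, song ids, and lyrics are all made up for illustration, so treat it as one possible shape rather than how the Metafilter viewer actually works.

```python
from collections import defaultdict


def word_ngrams(text, max_n=3):
    """Yield every 1..max_n word n-gram in text as a space-joined string."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(songs, max_n=3):
    """Map each n-gram to the sorted list of song ids that contain it."""
    index = defaultdict(set)
    for song_id, lyrics in songs.items():
        for gram in word_ngrams(lyrics, max_n):
            index[gram].add(song_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


# Hypothetical sample data; the ids and lyrics are purely illustrative.
songs = {
    "song-9": "go tell aunt rhodie",
    "song-44": "aunt rhodie the old gray goose is dead",
    "song-12": "pretty woman walking down the street",
}

index = build_index(songs)
print(index["aunt rhodie"])             # ['song-44', 'song-9']
print(index.get("old gray goose", []))  # ['song-44']
```

With a real key-value store, each n-gram would be the key and the serialized id list the value, so looking up a phrase of up to max_n words is a single get.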

On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:
>
> One thing you can do is index 1, 2, 3...n-grams and use a simple & fast
> key-value store (like leveldb etc.) e.g., you could have entries like
>
> "aunt rhodie" -> song-9, song-44
> "woman" -> song-12, song-65, song-96
>
> That's basically how I made the Metafilter N-gram Viewer
> <http://mefingram.appspot.com/>, a clone of Google Books Ngram Viewer
> <https://books.google.com/ngrams>.
>
> Another possibility is using Lucene. Just be aware that Lucene calls
> n-grams of characters ("au", "un", "nt") n-grams but it calls n-grams of
> words ("that the", "the old", "old gray") shingles. So you would end up
> using (I think, I haven't done this) the ShingleFilter
> <https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>.
>
> You might also find this article by Russ Cox interesting, where he
> describes building and using an inverted trigram index:
> http://swtch.com/~rsc/regexp/regexp4.html
>
> John
>
> On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <[email protected]> wrote:
>
>> A lot of guys would use Lucene. Lucene calls n-grams of words
>> "shingles". [1]
>>
>> As for "architecture", here is a suggestion to use Lucene to find keys to
>> records in your "real" database. [2]
>>
>> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
