One thing you can do is index the 1-, 2-, 3-, ..., n-grams and keep them in a simple, fast key-value store (LevelDB, for example). You could have entries like:

    "aunt rhodie" -> song-9, song-44
    "woman" -> song-12, song-65, song-96

That's basically how I made the Metafilter N-gram Viewer <http://mefingram.appspot.com/>, a clone of the Google Books Ngram Viewer <https://books.google.com/ngrams>.

Another possibility is Lucene. Just be aware that Lucene uses "n-grams" for n-grams of characters ("au", "un", "nt") and "shingles" for n-grams of words ("that the", "the old", "old gray"). So you would end up using (I think; I haven't done this) the ShingleFilter <https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>.

You might also find this article by Russ Cox interesting, where he describes building and using an inverted trigram index for indexing and retrieval: http://swtch.com/~rsc/regexp/regexp4.html

John

On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <[email protected]> wrote:
> A lot of guys would use Lucene. Lucene calls n-grams of words "shingles".
> [1]
>
> As for "architecture", here is a suggestion to use Lucene to find keys to
> records in your "real" database. [2]
>
> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>
> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
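
For anyone who wants the n-gram index idea from the top of this message in concrete form, here is a minimal sketch, in Python for brevity rather than Clojure. It rests on several assumptions: a plain dict stands in for the key-value store (LevelDB or similar), the song ids and lyrics are made-up sample data, and the function names (word_ngrams, build_index, lookup) are hypothetical. It is not how the Metafilter N-gram Viewer is actually implemented.

```python
from collections import defaultdict


def word_ngrams(text, max_n=3):
    """Yield every word n-gram (a "shingle" in Lucene terms) for n = 1..max_n."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(songs, max_n=3):
    """Map each n-gram to the set of song ids whose text contains it."""
    index = defaultdict(set)  # assumption: stand-in for a real key-value store
    for song_id, lyrics in songs.items():
        for gram in word_ngrams(lyrics, max_n):
            index[gram].add(song_id)
    return index


def lookup(index, phrase):
    """Return the sorted song ids stored under a phrase of at most max_n words."""
    return sorted(index.get(phrase.lower(), ()))


# Made-up sample data, for illustration only.
songs = {
    "song-9": "go tell aunt rhodie",
    "song-44": "aunt rhodie the old gray goose is dead",
    "song-12": "pretty woman",
}
index = build_index(songs)
print(lookup(index, "aunt rhodie"))
print(lookup(index, "woman"))
```

With a real store, each value would presumably be a serialized list of ids, and a phrase longer than max_n words would need a different strategy, such as intersecting the entries for its shorter n-grams and then checking the candidate texts.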
