One thing you can do is index the 1-, 2-, 3-, ..., n-grams and keep them in
a simple, fast key-value store (such as LevelDB). For example, you could
have entries like:

"aunt rhodie" -> song-9, song-44
"woman" -> song-12, song-65, song-96
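
As a rough illustration of the idea, here is a minimal Python sketch that
builds such a mapping in memory. A plain dict stands in for the key-value
store; the function names, the sample lyrics, and the choice of n up to 3
are all assumptions made for the example, not part of any real system.

```python
from collections import defaultdict


def word_ngrams(text, max_n=3):
    """Yield every word n-gram of length 1..max_n as a space-joined string."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(songs, max_n=3):
    """Map each n-gram to the sorted list of song ids that contain it."""
    index = defaultdict(set)
    for song_id, lyrics in songs.items():
        for gram in word_ngrams(lyrics, max_n):
            index[gram].add(song_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


# Hypothetical sample data, invented for illustration only.
songs = {
    "song-9": "go tell aunt rhodie",
    "song-44": "aunt rhodie the old gray goose is dead",
    "song-12": "pretty woman",
}
index = build_index(songs)
```

A lookup is then a single key fetch, e.g. index.get("aunt rhodie", []).
With an on-disk store you would write each n-gram as the key and a
serialized list of ids as the value instead of keeping a dict.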


That's basically how I made the Metafilter N-gram Viewer
<http://mefingram.appspot.com/>, a clone of Google Books Ngram Viewer
<https://books.google.com/ngrams>.

Another possibility is using Lucene.  Just be aware that Lucene uses the
term "n-grams" for n-grams of characters ("au", "un", "nt"), while it calls
n-grams of words ("that the", "the old", "old gray") "shingles".  So you
would end up using (I think; I haven't done this) the ShingleFilter
<https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>.
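
To make the terminology concrete, here is a small Python sketch of the two
kinds of n-gram. It does not use Lucene or mimic its API; the two function
names are made up for the illustration.

```python
def char_ngrams(text, n):
    """Character n-grams: what Lucene's terminology calls n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shingles(text, n):
    """Word n-grams: what Lucene's terminology calls shingles."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("aunt", 2))            # ['au', 'un', 'nt']
print(shingles("that the old gray", 2))  # ['that the', 'the old', 'old gray']
```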

You might also find this article by Russ Cox interesting, where he
describes building and using an inverted trigram index:
http://swtch.com/~rsc/regexp/regexp4.html
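
The general shape of an inverted trigram index can be sketched in a few
lines of Python. This is a simplified version that handles literal
substring queries only, not the regular-expression search the article
covers, and every name in it is an assumption made for the example.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_trigram_index(docs):
    """Map each trigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tri in trigrams(text):
            index[tri].add(doc_id)
    return index


def search(index, docs, query):
    """Return the sorted ids of documents containing query as a substring."""
    tris = trigrams(query)
    if not tris:
        # Query shorter than three characters: nothing to filter on,
        # so every document is a candidate.
        candidates = set(docs)
    else:
        # A match must contain every trigram of the query.
        candidates = set.intersection(*(index.get(t, set()) for t in tris))
    # The index only narrows the candidates; confirm each one against the text.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical sample data, invented for illustration only.
docs = {
    "song-9": "go tell aunt rhodie",
    "song-44": "aunt rhodie the old gray goose is dead",
}
trigram_index = build_trigram_index(docs)
```

Calling search(trigram_index, docs, "old gray") intersects the posting sets
for the query's trigrams and then verifies the survivors, so only a small
number of documents ever need to be scanned in full.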


John





On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <[email protected]> wrote:

> A lot of guys would use Lucene.  Lucene calls n-grams of words "shingles".
> [1]
>
> As for "architecture", here is a suggestion to use Lucene to find keys to
> records in your "real" database. [2]
>
> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>
> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
>
>
>  --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

