On 3-Jul-08, at 5:13 PM, Chris Harris wrote:
That's pretty much impossible (way too small). Double check those
numbers.
I don't know where I got the above numbers. Sorry. Here are the real
numbers:
.tis file: 730 MB
.frq files: 10.1 GB
.prx file: 43.2 GB
Now keeping all *that* in RAM, that sounds like a challenge.
It doesn't have to be *all* in RAM... the OS will figure out what
parts are needed.
One alternative you might consider is using a flash-based drive (SSD).
Another is to index bigrams as terms, and do phrase queries using the
conjunction of the bigrams of a phrase. This should make phrase
queries only a few times slower than term queries, and probably
inflate your .frq to "only" 25 GB (.prx could be ignored, since the
bigram conjunction needs no position data).
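
To make the bigram idea concrete, here is a rough sketch in plain Python. It is not Lucene code: the function names (bigrams, build_bigram_index, phrase_candidates) and the toy documents are made up for illustration, and the "index" is just an in-memory dict of doc-id sets. It only shows the shape of the trick: each adjacent word pair becomes one term, and a phrase query becomes an AND over those terms with no position lookups.

```python
from collections import defaultdict

def bigrams(tokens):
    """Join each pair of adjacent tokens into a single index term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_bigram_index(docs):
    """Map each bigram term to the set of doc ids that contain it.

    Hypothetical stand-in for the .frq data: doc ids only, no positions.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bigrams(text.lower().split()):
            index[term].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Return doc ids containing every bigram of the phrase.

    A one-word phrase has no bigrams, so this sketch returns an empty set
    for it; a real setup would fall back to a plain term query.
    """
    terms = bigrams(phrase.lower().split())
    if not terms:
        return set()
    # Intersect the shortest posting list first to keep the work small.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

# Toy data, invented for this example.
docs = {
    1: "the quick brown fox",
    2: "quick brown bread and a brown fox",
    3: "brown fox quick",
}
index = build_bigram_index(docs)
print(phrase_candidates(index, "quick brown fox"))
```

Note that doc 2 comes back for "quick brown fox" even though the phrase never occurs there: it contains "quick brown" and "brown fox" in different places. For phrases of three or more words the conjunction is therefore a (usually good) approximation, which you can accept as-is or verify against the stored text if exactness matters.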
Some other tricks, like stop word removal, also speed up phrase queries.
-Mike