You may well have already seen this, but in case not: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
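The short version of that post: since Lucene/Solr 3.3, FSDirectory.open()
picks MMapDirectory by default on 64-bit Linux, so there is usually
nothing to configure. If you want to verify what your setup resolves to,
here is a quick, untested sketch (Lucene 4.x API; the class name is mine):

    import java.io.File;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class DirCheck {
        public static void main(String[] args) throws Exception {
            // On a 64-bit JRE this should resolve to MMapDirectory.
            Directory dir = FSDirectory.open(new File(args[0]));
            System.out.println(dir.getClass().getSimpleName());
            dir.close();
        }
    }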
FWIW,
Erick

On Wed, Oct 24, 2012 at 9:51 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 10/24/2012 6:29 PM, Aaron Daubman wrote:
>>
>> Let me be clear that I am not interested in RAMDirectory. However, I
>> would like to better understand the oft-recommended and
>> currently-default MMapDirectory, and what the tradeoffs would be, when
>> using a 64-bit Linux server dedicated to this single Solr instance,
>> with plenty (more than 2x index size) of RAM, of storing the index
>> files on SSDs versus on a ramfs mount.
>>
>> I understand that using the default MMapDirectory will allow caching
>> of the index in memory; however, my understanding is that mmapped
>> files are demand-paged (lazily loaded), meaning that only after a
>> block is read from disk will it be paged into memory. Is this
>> correct? Is it actually block-by-block (page size by page size)? Any
>> pointers to decent documentation on this, regardless of the
>> effectiveness of the approach, would be appreciated...
>
> You are correct that the data must have been accessed recently to be in
> the disk cache. This does, however, include writes -- so any data that
> gets indexed will be in the cache because it has just been written. I
> do believe that it is read in one page at a time, and that pages are 4k
> in size.
>
>> My concern with using MMapDirectory for an index stored on disk (even
>> SSDs), if my understanding is correct, is that there is still a large
>> startup cost to MMapDirectory, as it may take many queries before even
>> most of a 20G index has been loaded into memory, and there may yet
>> still be "dark corners" that only come up in edge-case queries and
>> cause QTime spikes should those queries ever occur.
>>
>> I would like to ensure that, at startup, no query will incur
>> disk-seek/read penalties.
>>
>> Is the "right" way to achieve this to copy the index to a ramfs (NOT
>> ramdisk) mount and then continue to use MMapDirectory in Solr to read
>> the index? I am under the impression that when using ramfs (rather
>> than ramdisk, for which this would not work), a file mmapped on a
>> ramfs mount will actually share the same address space, and so would
>> not incur the typical double-RAM overhead of mmapping a file in memory
>> just to have yet another copy of the file created in a second memory
>> location. Is this correct? If not, would you please point me to
>> documentation stating otherwise? (I haven't found much documentation
>> either way.)
>
> I am not familiar with any "double-RAM overhead" from using mmap. It
> should be extraordinarily efficient, so much so that even when your
> index won't fit in RAM, performance is typically still excellent.
> Using an SSD instead of a spinning disk will increase performance
> across the board, until enough of the index is cached in RAM, after
> which it won't make a lot of difference.
>
> My parting thoughts, with a general note to the masses: do not try this
> if you are not absolutely sure your index will fit in memory! It will
> tend to cause WAY more problems than it will solve for most people with
> large indexes.
>
> If you actually do have considerably more RAM than your index size, and
> you know that the index will never grow to where it might not fit, you
> can use a simple trick to get it all cached, even before running
> queries. Just read the entire contents of the index, discarding
> everything you read. There are two main OS variants to consider here,
> and both can be scripted, as noted below.
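>
> If you would rather warm the cache from inside the JVM than from a
> script, the same trick works in plain Java: open every file in the
> index directory, read it, and throw the bytes away. A rough, untested
> sketch; the class name and the 1 MB buffer size are arbitrary choices
> of mine:
>
>     import java.io.File;
>     import java.io.FileInputStream;
>     import java.io.IOException;
>     import java.io.InputStream;
>
>     public class WarmIndex {
>         public static void main(String[] args) throws IOException {
>             byte[] buf = new byte[1 << 20]; // scratch buffer, size is arbitrary
>             // args[0] is assumed to point at the index directory
>             for (File f : new File(args[0]).listFiles()) {
>                 if (!f.isFile()) continue; // skip subdirectories
>                 InputStream in = new FileInputStream(f);
>                 try {
>                     // Read and discard; the only goal is to pull the
>                     // file's pages into the OS disk cache.
>                     while (in.read(buf) != -1) { }
>                 } finally {
>                     in.close();
>                 }
>             }
>         }
>     }
>
> The OS-level one-liners below do the same thing with less ceremony.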
> Run the command twice to see the difference that caching makes for the
> second run. Note that an SSD would speed up the first run of these
> commands considerably:
>
> *NIX (may work on a Mac too):
> cat /path/to/index/files/* > /dev/null
>
> Windows:
> type C:\Path\To\Index\Files\* > NUL
>
> Thanks,
> Shawn
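On the demand-paging question: yes, the mapping itself is lazy. The
map() call reads no file data at all; the first access to each page
triggers a page fault, and the OS reads that page (4k on Linux/x86,
plus whatever readahead it decides to do). Here is a tiny illustration
with plain NIO, nothing Lucene-specific -- an untested sketch, and note
that a single MappedByteBuffer tops out at 2 GB, which is why
MMapDirectory splits large files into smaller chunks internally:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapPeek {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile(args[0], "r");
            FileChannel ch = raf.getChannel();
            // Cheap: this sets up the mapping but reads no file data yet.
            // (Only valid for files under 2 GB; bigger files need
            // multiple mappings.)
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // The first touch of a page is a page fault and one small
            // disk read; touching it again is pure memory access.
            System.out.println("first byte: " + buf.get(0));
            raf.close();
        }
    }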