On Sun, Aug 26, 2012 at 11:14 AM, Alex Schuster <wo...@wonkology.org> wrote:
> Whatever. Then align to 8K instead. But what does this have to do with the
> erasable page size?

Short answer: Flash cannot overwrite a page in place. If a page that
already holds data has to be rewritten, the whole block containing it
must be erased first. This is the "erase block size" people talk
about. The block size is always a multiple of the page size, so if you
align to the erase block size, you are aligned to the page size too
and will always be okay.
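
To make the short answer concrete, here is a minimal sketch of the
alignment check. The sizes (4k pages, 512k erase blocks, 512-byte
sectors) and the two start sectors are only assumptions for
illustration, and the helper names are made up; substitute the values
for your own drive.

```python
# Assumed sizes (check the specs for your own drive).
PAGE_SIZE = 4 * 1024           # 4k page
ERASE_BLOCK_SIZE = 512 * 1024  # 512k erase block
SECTOR_SIZE = 512              # bytes per logical sector (assumption)

def is_aligned(offset_bytes, boundary=ERASE_BLOCK_SIZE):
    """True if a byte offset falls exactly on a multiple of boundary."""
    return offset_bytes % boundary == 0

def align_up(offset_bytes, boundary=ERASE_BLOCK_SIZE):
    """Round a byte offset up to the next multiple of boundary."""
    return -(-offset_bytes // boundary) * boundary

# The block size is a multiple of the page size, so block alignment
# implies page alignment (but not the other way around).
assert ERASE_BLOCK_SIZE % PAGE_SIZE == 0

# Two illustrative partition start sectors.
for start_sector in (63, 2048):
    offset = start_sector * SECTOR_SIZE
    print("start sector", start_sector,
          "page-aligned:", is_aligned(offset, PAGE_SIZE),
          "block-aligned:", is_aligned(offset),
          "next block-aligned sector:", align_up(offset) // SECTOR_SIZE)
```

Note that an 8K offset passes the page check but not the erase block
check under these assumptions, which is why aligning to the erase
block size is the safer choice.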

Long answer: NAND flash cells do not operate like normal HDD storage.
They can only be written to when they are empty. Empty meaning null,
devoid of data, unallocated, not just "filled with zeroes" or "ignored
by the filesystem". So any time you want to write to a block that
already contains data, it must be erased and rewritten by the drive
controller.

On most current-generation SSDs the block size is 512k and a block
contains 128 pages (4k each). On older/slower SSDs, or other kinds of
flash devices like CompactFlash or SD cards, the erase block is often
larger, usually 4M or sometimes even up to 16M. (Definitely check the
specs for your specific model of SSD to find the correct values.)

An SSD can write data in page-size chunks, which is very fast, but
only into empty space in a block. If the block already has data
there, that data must be relocated, or the block erased and rewritten.
The TRIM feature tells the SSD which pages are not used anymore, and
allows it to do better garbage collection, combining pages and
deallocating the unused blocks. The next time you write to one of
those blocks, it will be very fast because the erase already happened
at TRIM time and these unused blocks are available for writing.

This is why SSDs without the TRIM feature become slower once they have
filled up. The drive controller has no knowledge of your filesystem,
so once the internal NAND free space is used up, erase overhead is
added to every write. Instead of writing a 4k page, it is now
potentially erasing 512k of data and then writing 512k of data. That
is 256 times more data touched for the same 4k write! (If you have no
TRIM support, the only possible way to improve performance once a full
drive's worth of data has been written is to back up, perform an ATA
Secure Erase, which will clear the SSD allocation metadata, and then
restore your backup.)
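
The arithmetic behind that 256x figure can be sketched as follows.
This is only a worst-case back-of-the-envelope model using the sizes
assumed above (4k page, 512k erase block); the function name is made
up, and real controllers are smarter than this.

```python
PAGE_SIZE = 4 * 1024           # assumed 4k page
ERASE_BLOCK_SIZE = 512 * 1024  # assumed 512k erase block

def worst_case_touched(block_size=ERASE_BLOCK_SIZE):
    """Bytes erased plus bytes rewritten when a full block is recycled."""
    return block_size + block_size

# How many pages fit in one erase block.
pages_per_block = ERASE_BLOCK_SIZE // PAGE_SIZE

# Data touched for a single 4k write, relative to the 4k actually wanted.
amplification = worst_case_touched() // PAGE_SIZE

print("pages per block:", pages_per_block)
print("worst-case data touched per 4k write:", amplification, "x")
```

With a 4M erase block instead (pass 4 * 1024 * 1024 to the function),
the same model gives a much larger factor, which fits the point about
older or slower devices.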

Now imagine the alignment is not correct for both page size and erase
block size. When you write data, it can straddle a boundary, causing
two blocks to be erased and written instead of only one. In the 4k
write example above, you can see how the performance degrades even
further, and the extra erases and writes will potentially reduce the
lifetime of your drive.
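
A small sketch of the overlap, again assuming 4k pages and 512k erase
blocks: it counts how many pages or blocks a write covers given its
byte offset and length. The function name and the example offsets are
hypothetical.

```python
PAGE_SIZE = 4 * 1024           # assumed 4k page
ERASE_BLOCK_SIZE = 512 * 1024  # assumed 512k erase block

def units_touched(offset, length, unit):
    """Number of unit-sized regions covered by [offset, offset + length)."""
    first = offset // unit
    last = (offset + length - 1) // unit
    return last - first + 1

# An aligned 4k write stays inside one page and one erase block.
print(units_touched(0, 4096, PAGE_SIZE),
      units_touched(0, 4096, ERASE_BLOCK_SIZE))

# The same 4k write shifted by 63 sectors of 512 bytes spans two pages.
print(units_touched(63 * 512, 4096, PAGE_SIZE))

# A 4k write starting 2k before a block boundary spans two erase blocks.
print(units_touched(ERASE_BLOCK_SIZE - 2048, 4096, ERASE_BLOCK_SIZE))
```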

Additional complexity is added by any further layers: filesystem block
size, filesystem alignment (I'm looking at you, FAT32), LVM, RAID
stripe size, etc.

A good article giving more information about the subject is in the
English version of Wikipedia:
https://en.wikipedia.org/wiki/Write_amplification

(disclaimer: all above info is AFAIK, please correct me if I got any
facts or advice wrong)
