What Does a Database for SSDs Look Like?

(brooker.co.za)

60 points | by charleshn 3 hours ago

12 comments

  • gethly 1 minute ago
    > I’d move durability, read and write scale, and high availability into being distributed

    So, essentially just CQRS, which is usually handled at the application level with event sourcing and similar techniques.

  • pmontra 6 minutes ago
    A tangent:

    > Companies are global, businesses are 24/7

    Only a few companies are global, so only a few of them should optimize for that kind of workload. However, maybe every startup in SV must aim to become global, so that's probably what most of them must optimize for, even the ones that eventually fail to get traction.

    24/7 is different, because even the customers of local companies, even B2B ones, might feel like doing some work at midnight once in a while. They'll be disappointed to find the server down.

  • mrkeen 2 hours ago
    > Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random.

    Overall speed is irrelevant; what matters is the relative speed difference between sequential and random access.

    And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.

    • crazygringo 1 hour ago
      Can you clarify? I thought a major benefit of SSDs is that there isn't any difference between sequential and random access. There's no physical head that needs to move.

      Edit: thank you for all the answers -- very educational, TIL!

      • threeducks 1 hour ago
        Let's take the Samsung 9100 Pro M.2 as an example. It has a sequential read rate of ~6700 MB/s and a 4K random read rate of ~80 MB/s:

        https://i.imgur.com/t5scCa3.png

        https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)

        That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.
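
        Back-of-the-envelope, assuming fully serialized 4 KiB reads (queue depth 1) and the numbers above, just to show where the ~50 µs comes from:

          # rough arithmetic behind the ~50 µs figure (illustrative only)
          random_read_mb_s = 80
          block_kib = 4

          iops = random_read_mb_s * 1024 / block_kib   # ~20,480 reads/s
          latency_us = 1_000_000 / iops                # ~49 µs per serialized read

          print(round(iops), round(latency_us), round(6700 / random_read_mb_s))
          # -> ~20480 IOPS, ~49 µs per read, ~84x sequential advantage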

      • yyyk 1 hour ago
        SSD controllers and VFSs are often optimized for sequential access (e.g. readahead cache) which leads to software being written to do sequential access for speed which leads to optimization for that access pattern, and so on.
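
        For illustration, that assumption is even exposed to userspace: on Linux an application can steer the kernel's readahead per file (the path below is hypothetical):

          import os

          # Hint the kernel's readahead behaviour (Linux-only; path is made up).
          fd = os.open("/var/lib/db/table.dat", os.O_RDONLY)
          os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)  # aggressive prefetch for scans
          os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)      # or: disable readahead for point lookups
          os.close(fd)
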
      • formerly_proven 37 minutes ago
        SSDs have three block/page sizes:

        - The access block size (LBA size). Either 512 bytes or 4096 bytes modulo DIF. Purely a logical abstraction.

        - The programming page size. Something in the 4K-64K range. This is the granularity at which an erased block may be programmed with new data.

        - The erase block size. Something in the 1-128 MiB range. This is the granularity at which data is erased from the flash chips.

        SSDs always use some kind of journaled mapping to cope with the actual block size being roughly five orders of magnitude larger than the write API suggests. The FTL probably looks something like an LSM with some constant background compaction going on. If your writes are in larger chunks, and your reads match those chunks, you would expect the FTL to perform better, because it can allocate writes contiguously and reads within the data structure have good locality as well. You can also expect drives to further optimize sequential operations, just like the OS does.

        (N.b. things are likely more complex, because controllers will likely stripe data with the FEC across NAND planes and chips for reliability, so the actual logical write size from the controller is probably not a single NAND page)
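
        A toy sketch of that mapping idea (nothing like a real FTL, just to show why sequential writes keep the map and the flash layout tidy):

          # Toy FTL: every logical write goes to the next free flash page,
          # append-style, and an in-memory map tracks the indirection.
          # Sequential writes land on contiguous pages; random overwrites
          # scatter and leave stale pages for background GC/compaction.
          class ToyFTL:
              def __init__(self):
                  self.mapping = {}      # logical block -> physical page
                  self.next_page = 0
                  self.stale = set()     # overwritten pages awaiting erase

              def write(self, lba):
                  if lba in self.mapping:
                      self.stale.add(self.mapping[lba])   # old copy becomes garbage
                  self.mapping[lba] = self.next_page
                  self.next_page += 1

              def read(self, lba):
                  return self.mapping[lba]                # extra indirection on every read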

      • PunchyHamster 1 hour ago
        The SSD block size is far bigger than 4 kB. They still benefit from sequential writes.
      • lazide 20 minutes ago
        It depends on the size of the read - most SSDs have internal block sizes much larger than a typical (actual) random read, so they internally have to do a lot more work for a given byte of output in a random-read situation than they would in a sequential one.

        Most filesystems read in 4K chunks (or sometimes even worse, 512 bytes), and internally the actual block is often multiple MB in size, so this internal read multiplication is a big factor in performance in those cases.

        Note that the only real difference between a random read and a sequential one is the size of the read in one sequence before it switches location - is it 4K? 16 MB? 2 GB?
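
        Back-of-the-envelope, with made-up but plausible numbers:

          # Illustrative read amplification (numbers are hypothetical, not from a real drive).
          internal_page = 16 * 1024    # assumed internal NAND page
          host_read = 4 * 1024         # typical filesystem request

          print(internal_page / host_read)          # ~4x internal work per cold random 4K read
          print((1024 * 1024) / internal_page)      # a 1 MiB sequential read uses each page fully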

      • b112 1 hour ago
        Read up on IOPS, conjoined with requests for sequential reads.
  • PunchyHamster 1 hour ago
    > WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

    And then a bug crashes your database cluster all at once, and now instead of missing seconds you miss minutes, because some smartass thought "surely if I send the request to 5 nodes, some of that will land on disk in the reasonably near future?"

    I love how this industry invents best practices that are actually good, and then people invent badly researched reasons to just... not do them.

    • sreekanth850 31 minutes ago
      The biggest lie we’ve been told is that databases require global consistency and a global clock. Traditional databases are still operating with Newtonian assumptions about absolute time, while the real world moves according to Einstein’s relativistic theory, where time is local and relative. You don't need global order, and you don't need a global clock.
    • dist1ll 1 hour ago
      > "surely if I send request to 5 nodes some of that will land on disk in reasonably near future?"

      That would be asynchronous replication. But IIUC the author is instead advocating for a distributed log with synchronous quorum writes.
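
      Roughly the difference, as a sketch (the replica API is made up): the commit is only acknowledged once a majority of replicas has durably acked, rather than fired off and hoped for.

        from concurrent.futures import ThreadPoolExecutor, as_completed

        # Sketch of a synchronous quorum write; replica.append() is hypothetical
        # and assumed to return True once the record is durable on that node.
        def quorum_append(replicas, record, timeout=1.0):
            needed = len(replicas) // 2 + 1
            pool = ThreadPoolExecutor(max_workers=len(replicas))
            futures = [pool.submit(r.append, record) for r in replicas]
            acks = 0
            try:
                for fut in as_completed(futures, timeout=timeout):
                    if fut.result():
                        acks += 1
                    if acks >= needed:
                        return True          # durable on a majority; safe to ack the commit
            finally:
                pool.shutdown(wait=False)
            raise RuntimeError("quorum not reached; do not ack the commit")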

    • lazide 32 minutes ago
      Happens all the time (ignoring best practices because it’s convenient, or doing something different ‘just because’), literally everywhere, including normal society.

      Frankly, it’s shocking anything works at all.

  • zokier 2 hours ago
    The author could have started by surveying the current state of the art instead of falsely assuming that DB devs have just been resting on their laurels for the past decades. If you want to see a (relational) DB for SSDs, check out stuff like myrocks on zenfs+; it's pretty impressive stuff.
    • lazide 6 minutes ago
      But then how would they have anything to do?
  • londons_explore 2 hours ago
    Median database workloads are probably doing writes of just a few bytes per transaction, i.e. 'set last_login_time = now() where userid=12345'.

    Because the interface between the SSD and the host OS is block based, you are forced to write a full 4K page, which means you really still benefit from a write-ahead log to batch all those changes together, at least up to page size, if not larger.
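
    As a rough sketch of why batching helps (append-only log, made-up record format): hundreds of those tiny updates fit in the one 4K page the device is going to write anyway.

      import os, struct

      # Batch many tiny logical updates into one log append + one fsync,
      # instead of dirtying a 4K page per update. Record format is made up.
      def flush_batch(log_fd, updates):            # updates: [(user_id, login_ts), ...]
          buf = bytearray()
          for user_id, login_ts in updates:
              buf += struct.pack("<QQ", user_id, login_ts)   # 16 bytes per update
          buf += b"\x00" * (-len(buf) % 4096)                # pad to the 4K the SSD writes anyway
          os.write(log_fd, bytes(buf))
          os.fsync(log_fd)                                   # one flush covers up to 256 updates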

    • Sesse__ 1 hour ago
      A write-ahead log isn't a performance tool to batch changes, it's a tool to get durability of random writes. You write your intended changes to the log, fsync it (which means you get a 4k write), then make the actual changes on disk just as if you didn't have a WAL.

      If you want to get some sort of sub-block batching, you need a structure that isn't random in the first place, for instance an LSM (where you write all of your changes sequentially to a log and then do compaction later)—and then solve your durability in some other way.
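
      A minimal sketch of that ordering (record format and paths invented for illustration): the log record must be durable before the in-place change, so a crash in between can be replayed.

        import os, json

        # WAL ordering sketch: log first, fsync, then the random in-place write,
        # which can be deferred and redone from the WAL after a crash.
        def commit(wal_fd, data_path, offset, new_bytes):
            record = json.dumps({"off": offset, "data": new_bytes.hex()}).encode() + b"\n"
            os.write(wal_fd, record)
            os.fsync(wal_fd)                     # durability point: the txn is now committed

            with open(data_path, "r+b") as f:    # the "actual change", same as without a WAL
                f.seek(offset)
                f.write(new_bytes)               # no fsync on the hot path; checkpointing handles it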

      • throw0101a 1 hour ago
        > A write-ahead log isn't a performance tool to batch changes, it's a tool to get durability of random writes.

        Why not both?

        • Sesse__ 42 minutes ago
          Because it is in addition to your writes, not instead of them. That's what the “ahead” refers to.
          • _bohm 27 minutes ago
            The actual writes don’t need to be persisted on transaction commit, only the WAL. In most DBs the actual writes won’t be persisted until the written page is evicted from the page cache. In this sense, writing the WAL generally does provide better perf than synchronously doing a random page write.
    • formerly_proven 31 minutes ago
      WALs are typically DB-page-level physical logs, and database page sizes are often larger than the I/O page size or the host page size.
    • esperent 2 hours ago
      Don't some SSDs have a 512-byte page size?
      • digikata 1 hour ago
        I would guess that by now none have that internally. As a rule of thumb, every major flash density increase (SLC, TLC, QLC) has also tended to double the internal page size. There were also internal transfer performance reasons for larger sizes. Low-level 16K-64K flash "pages" are common, sometimes with even larger stripes of pages due to the internal firmware sw/hw design.
        • Sesse__ 1 hour ago
          Also due to error correction issues. Flash is notoriously unreliable, so you get bit errors _all the time_ (correcting errors is absolutely routine). And you can make more efficient error-correcting codes if you are using larger blocks. This is why HDDs went from 512 to 4096 byte blocks as well.
      • zokier 2 hours ago
        They might present 512-byte blocks to the host, but internally the SSD almost certainly manages data in larger pages.
        • cm2187 2 hours ago
          And the filesystem will also likely use a 4K block size.
  • ljosifov 1 hour ago
    Not about SSDs specifically, but I assume the compact design doesn't hurt: duckdb saved my sanity recently. Single file, columnar, with built-in compression I presume (given that in a columnar layout even the simplest compression may be very effective), and $ duckdb -ui /path/to/data/base.duckdb opens a notebook in the browser. Didn't find a single thing to dislike about duckdb - as a single user. To top it off - AFAIK it can be zero-copy 'overlaid' on top of a bunch of Parquet binary files to provide SQL over them?? (Didn't try it; would be amazing if it works well.)
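
    The Parquet overlay looks roughly like this (untested sketch; the path is made up, read_parquet and glob patterns are standard DuckDB):

      import duckdb

      # Query a directory of Parquet files in place, no import step.
      con = duckdb.connect()          # in-memory; or duckdb.connect("base.duckdb")
      con.sql("""
          SELECT user_id, count(*) AS events
          FROM read_parquet('data/events/*.parquet')   -- hypothetical path
          GROUP BY user_id
          ORDER BY events DESC
          LIMIT 10
      """).show()
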
  • sscdotopen 32 minutes ago
    Umbra: A Disk-Based System with In-Memory Performance, CIDR'20

    https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

  • dist1ll 1 hour ago
    Is there more detail on the design of the distributed multi-AZ journal? That feels like the meat of the architecture.
  • danielfalbo 2 hours ago
    Reminds me of: Databases on SSDs, Initial Ideas on Tuning (2010) [1]

    [1] https://www.dr-josiah.com/2010/08/databases-on-ssds-initial-...

  • raggi 1 hour ago
    It may not matter for clouds with massive margins but there are substantial opportunities for optimizing wear.
  • sreekanth850 43 minutes ago
    Unpopular opinion: databases were designed for 1980s-90s mechanics, and the one thing that never innovates is the DB. They still use B-tree/LSM-tree structures that were optimized for spinning disks. The inefficiency is masked by hardware innovation and speed (Moore's Law).
    • nly 10 minutes ago
      Optimising hardware to run existing software is how you sell your hardware.

      The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding.

      High-performance networking is another area like this. High-performance NICs still go to great lengths to provide a BSD socket experience to devs. You can still get 80-90% of the performance advantages of kernel bypass without abandoning that model.