Hacker News
Fixing bug 109595 makes MySQL almost 4X faster on the Insert Benchmark (smalldatum.blogspot.com)
75 points by romac on Nov 15, 2023 | hide | past | favorite | 27 comments


[Knee jerk reaction] Be careful now. Be very careful.

> Godbolt’s Law — if any single optimisation makes a routine run two or more times faster, then you’ve broken the code.

<https://xania.org/200504/godbolt's-law>


It’s a good law, probably deserving of the term in almost any circumstance, although I’ve definitely been “lucky” enough to find exceptions (i.e. optimizing horribly inefficient code where the optimizations really couldn’t have been premature at any point in prior development). Edit: even so, I’ve found the law holds often enough in exactly that kind of code. Then it’s a very good opportunity to expand test coverage.


In a more idiotic vein, I accidentally added "RUN apt install rust-all" to the wrong Dockerfile. Removing that made the image 14x smaller.


What nonsense is this? Someone did a big, unnecessary memcpy in a loop; you removed it and saw a 3x speedup. How is this breaking the code?


The point of Matt Godbolt's "law" is that eventually you'll discover that the undocumented, mysterious memcpy wasn't unnecessary. I.e. it was a hacky resolution for a weird race-condition edge case that couldn't be solved some other way without introducing some other, major regression.

Of course, he doesn't mean to say you can never optimize to gain speed increases of > 2x. It's hyperbole, built on his experience working with well-written, optimized, time-critical software (think games, or high-volume/low-latency trading), where on a good day you're happy to improve performance by 3%.


Mark is one of the world's top experts on practical MySQL performance at scale, having spent a huge amount of time optimizing MySQL at Google and Facebook. There's a question in this thread about whether this has real world impact... yes, if Mark noticed it, yes, yes it does. This will materially improve many common workloads for InnoDB.


4x perf on inserts... These kinds of posts make me both scared and depressed about the state of our industry. I feel like we're all hitting rocks to make fire in a cave.


We're picking up pennies on the sidewalk. Each one individually is valuable, but there are just so many of them that every day we go home while the overwhelming majority of them lie uncollected. And occasionally one is made of gold.

(or the old story about throwing back starfish)


FWIW... if it takes you all of (say) two seconds to pick up a penny and put it in your pocket, that works out to $18 an hour, tax-free. Not bad. Nor world-changing.



I've seen that before! Excellent art. And as an employer, not obnoxious.


The fact that we're still relying on a human brain to keep a perfect mental model of the tens of thousands of lines of code they wrote, interacting with the millions other people wrote, is baffling.

At this point I consider it a miracle that things still mostly work.


"4x perf on inserts" on a microbenchmark - there's nothing to be depressed (or, conversely, enthusiastic) about.

The microbenchmark has value (¹), but in a very specific context.

If one wanted to observe the performance increase for a non-trivial usage pattern, they should use at least TPC-C, where I'd guess changes like this make a negligible difference, or none at all.

¹: https://smalldatum.blogspot.com/2023/05/updates-to-insert-be...


> I assume this problem was limited to InnoDB because I did not see problems with MyRocks.

This would certainly explain why InnoDB is insanely slow at loading mysqldumps compared to MyISAM - we hit a wall in some systems and were unable to switch because of it. There are tons of questions online about how to speed this up; people were generally aware of the problem but assumed it was because InnoDB is more careful with the data (like with foreign keys), or something about how it structures the data on disk that couldn't be changed.


Is someone more experienced able to shed light on how these benchmarks compare to real-world use? Writes tend to be pretty resource-intensive; is 4X faster going to show up as 2-4x faster writes in production environments?


Are those bug IDs sequential? 100 thousand bugs fixed is quite a number.


Inserts ought to be able to run at the storage medium's write speed.

I.e. if I insert 1 million records of 1 kilobyte each, and my SSD can do 1 GB/s of writes, then I should be able to do it in 1 second.

How close are we to that?

Transactions shouldn't slow this down (it's possible to write all the data to new areas of disk, and then the final 'commit' is a metadata update to say those new areas are active).

Indexes might slow it down depending on the index design. But it's possible to design an index where updates are coalesced, so the changed parts of the index are only written once as a single bulk write. Assuming the index size is only 10% of the data size, that's a 10% slowdown.
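A quick back-of-envelope check of the numbers above (the bandwidth and index-size figures are the commenter's assumptions, not measurements):

```python
# Theoretical lower bound for the bulk-insert scenario described above.
records = 1_000_000
record_size = 1_000              # ~1 KB per record
ssd_write_bw = 1_000_000_000     # assumed 1 GB/s sequential write bandwidth

data_seconds = records * record_size / ssd_write_bw
print(data_seconds)              # 1.0

# With a coalesced index assumed to be 10% of the data size,
# written once as a single bulk write:
total_seconds = data_seconds * 1.10
print(total_seconds)             # 1.1, i.e. a 10% slowdown
```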


It's fine for a simple write scenario, but we ask a lot more from our databases.

What if there are triggers? Those could do slower updates. What about indexes? What about foreign keys?

And what about hardware failure? A log that lets you recover to where you should be is fine, but then you spend some of the write capacity on it.

Etc etc...

One of the most important things we ask of our databases is to be reliable, and half their performance is spent doing that.


True - but this benchmark is for simple inserts, so it should be possible to reach the hardware write speed.

If those other things are used, then sure, it decreases the theoretical throughput.

A 'log' in modern databases is the database. I.e. the data gets written only once. A partial write can be safely ignored (rolling back the entire transaction and all its writes in one go). Data is never overwritten in place, so there is never a time when the database can be inconsistent. The downside with many storage engines is that an entire row must be read and rewritten in order to set one field. Imagine an "UPDATE people SET name='FOO';" query: every single record needs to be read and rewritten, when theoretically you could just write 'FOO' to disk once and remember that all people in the database now have that name, without even needing to read the people records.
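A toy illustration of the "the log is the database" idea, assuming a simple length-prefixed, checksummed append-only format (nothing like InnoDB's real on-disk layout): a torn write at the tail is detected and discarded on recovery, so the store is never inconsistent.

```python
import struct
import zlib

def append_record(f, payload: bytes) -> None:
    # Each record: 4-byte length, 4-byte CRC32, then the payload itself.
    f.write(struct.pack("<II", len(payload), zlib.crc32(payload)) + payload)

def recover(path: str) -> list[bytes]:
    # Replay the log from the start; stop at the first incomplete
    # or corrupt record, which can only be a torn write at the tail.
    with open(path, "rb") as f:
        data = f.read()
    records, off = [], 0
    while off + 8 <= len(data):
        length, crc = struct.unpack_from("<II", data, off)
        body = data[off + 8 : off + 8 + length]
        if len(body) < length or zlib.crc32(body) != crc:
            break  # torn/partial write: safely ignored
        records.append(body)
        off += 8 + length
    return records
```

A crash mid-append leaves a half-written final record; recovery simply stops at the first record that fails its length or checksum check, which is the "partial write can be safely ignored" property.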


> the data gets written only once

This is essentially never true. It _could_ be true if you disabled many safety features in an RDBMS, but you then lose ACID guarantees.

MySQL (with InnoDB), for example, undergoes the following for an INSERT/UPDATE:

1. Write current state of pages to the undo log

2. Write changes to the buffer pool

3. Write changes to the redo log

4. Write changes to the binary log

5. Write changes to the double write buffer

6. Write changes to disk

There is a dizzying number of variables affecting each of those, including delays, maximum allowable sizes, sync/async flushing, outright disabling of some steps (most commonly the double write buffer – which is only safe on CoW filesystems like ZFS – but you can also disable the redo log if you really want to), etc.

To your point about writes to disk, most (all?) of the disk writes are batched. Even if InnoDB weren't doing so, the kernel can and will batch writes on its own.
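The numbered steps above can be caricatured as a toy write-amplification counter. This is purely illustrative: the targets and ordering follow the list above, not actual InnoDB code, and the buffer-pool step is an in-memory page dirtying rather than a disk write.

```python
class ToyInnoDBWritePath:
    """Counts the write steps behind one logical row change (toy model)."""

    def __init__(self) -> None:
        self.steps: list[str] = []

    def _write(self, target: str) -> None:
        self.steps.append(target)

    def insert(self, row: dict) -> None:
        self._write("undo log")            # 1. prior state, for rollback/MVCC
        self._write("buffer pool")         # 2. in-memory change (dirties a page)
        self._write("redo log")            # 3. WAL entry for crash recovery
        self._write("binary log")          # 4. for replication / point-in-time recovery
        self._write("doublewrite buffer")  # 5. torn-page protection
        self._write("tablespace page")     # 6. the eventual write to disk

db = ToyInnoDBWritePath()
db.insert({"id": 1, "name": "foo"})
print(len(db.steps))  # 6 write steps for one logical insert
```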

> so there is never a time when the database can be inconsistent

If the hardware suffers a failure during a write, inconsistency can occur. This is precisely why the double write buffer (for InnoDB) and full-page writes (for Postgres) exist. Also, the hardware usually reports that it has flushed before it has actually durably written the data. If you have SSDs with Power Loss Protection capacitors, this isn't a problem, but no consumer drives that I'm aware of have those.

> downside with many storage engines is that an entire row must be read and rewritten

Actually, it can be even worse than this: if you're using UUIDv4 (or anything else random) as your PK, any UPDATE on more than one row is going to touch _a lot_ of pages. All of those pages have to be written out. This affects not only RDBMS with clustered indices (InnoDB, SQL Server) but also Postgres due to its WAL. Changing one byte can result in a 16 KiB (InnoDB) or 8 KiB (Postgres) write, which is quite a bit of amplification.
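A rough sketch of that page-dirtying effect, under toy assumptions (100 rows per page, a clustered index modeled as a sorted list of keys; real B-trees differ, but the locality argument is the same):

```python
import random

ROWS_PER_PAGE = 100  # toy assumption

def pages_dirtied(existing_keys, new_keys):
    # Clustered-index toy model: rows are stored in key order and packed
    # into fixed-size pages; a page must be rewritten if any new key
    # lands inside it.
    ordered = sorted(list(existing_keys) + list(new_keys))
    page_of = {key: i // ROWS_PER_PAGE for i, key in enumerate(ordered)}
    return len({page_of[k] for k in new_keys})

existing = list(range(100_000))                     # 1,000 full pages
sequential = list(range(100_000, 101_000))          # auto-increment / UUIDv7-like
scattered = [random.uniform(0, 100_000) for _ in range(1_000)]  # UUIDv4-like

print(pages_dirtied(existing, sequential))  # 10 pages (all appends at the end)
print(pages_dirtied(existing, scattered))   # hundreds of pages
```

Same 1,000 new rows either way; the random keys force hundreds of pages (each a full 16 KiB or 8 KiB write) out to disk instead of ten.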


Yup, switching to sequential UUIDs reduced the WAL traffic in my SaaS database significantly when we did it ~6 years ago.


I'm a big fan of UUIDv7 if you need them. I maintain most places could just use an auto-incrementing int and rewrite the API to not expose it, but barring that, UUIDv7s are great.
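For illustration, a UUIDv7 can be built by hand following the RFC 9562 layout (48-bit Unix-millisecond timestamp, then version and variant bits, then random bits); older Python stdlibs have no `uuid.uuid7`, so this is a manual sketch:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    # RFC 9562 layout: unix_ts_ms(48) | ver(4)=7 | rand_a(12)
    #                  | var(2)=0b10 | rand_b(62)
    unix_ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
    value = ((unix_ts_ms & ((1 << 48) - 1)) << 80) \
        | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)
```

Because the high bits are a timestamp, IDs generated later sort later, so inserts land near the end of the index much like an auto-increment key, while keeping the uniqueness properties of a UUID.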


Think about what the code looks like.

What you say about the log isn't true. The log is also the restore mechanism. The database must make sure that if something is written to the log, it's written to the database, and vice versa: both for restore in case of a crash, and for syncing across databases, and so on.

This alone ensures performance is not that of the raw disk: you need to write twice, and use synced writes instead of async ones.


There are a lot of complexities, and solutions aren't as simple as one would imagine.

I'm going to assume MySQL, but other relational dbs work similarly.

> Transactions shouldn't slow this down (it's possible to write all the data to new areas of diak, and then the final 'commit' is a metadata update to say those new areas are active).

This is done already. DBs don't write directly to the buffer pages; they write to a sequential log (the WAL), because sequential writes are faster. A separate thread reads the WAL and applies the changes.

This is actually not even the whole story. Pages are written twice, to the so-called doublewrite buffer, because "filesystems are complicated", and writing a page is not in itself a guarantee of atomicity (this is a way more complex discussion than it seems).

> But it's possible to design an index where updates are coalesced, and therefore the changed parts of the index are only written once as a single bulk write

This is done already as well, by the so-called "change buffer" (https://dev.mysql.com/doc/refman/8.0/en/innodb-change-buffer...).

---

The most important concept is: sequential writes (as in: big batches of writes) are not possible in ACID dbs, because changes must be written immediately. Due to the durability guarantee, one can't buffer them in memory and pretend they have been written.

Battery-backed controllers can be used to improve write throughput, but that's another story.
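The durability point can be sketched with plain file IO (a hypothetical toy, not MySQL code): committing each "transaction" individually costs one fsync per record, while group commit amortizes many transactions behind a single fsync, which is where battery-backed controllers and similar tricks buy their throughput.

```python
import os

def write_records(path: str, records: list[bytes], group_commit: bool) -> int:
    """Write records durably; return the number of fsync calls made."""
    syncs = 0
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            if not group_commit:
                f.flush()
                os.fsync(f.fileno())  # durable before acking this commit
                syncs += 1
        if group_commit:
            f.flush()
            os.fsync(f.fileno())      # one sync covers the whole batch
            syncs += 1
    return syncs
```

Both modes produce identical bytes on disk; the difference is only how many round trips to stable storage the caller must wait on.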


You've forgotten transaction latency. You might be able to get that for doing all the inserts in one go, but in normal operation what you care about is INSERTs of one item at a time from different writers, and each of those has to do a whole round trip for durability.

Anyway, if you know how to do it I'm sure mysql will welcome the pull request.


> I.e. if I insert 1 million records of 1 kilobyte each, and my SSD can do 1 GB/s of writes, then I should be able to do it in 1 second.

You are measuring throughput, not IOPS. 1 million IOPS, especially random IOPS, is very difficult to achieve.


> Inserts ought to be able to run at the storage medium's write speed.

Sometimes "faster" than IO because of compression, at the cost of CPU.
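A quick sketch of that trade-off with made-up, highly repetitive rows (zlib stands in for whatever page compression the engine uses): the bytes that actually reach the disk can be a small fraction of the logical insert volume.

```python
import zlib

# 10,000 fake CSV-ish rows; repetitive data compresses very well.
rows = b"".join(b"user-%06d,active,2023-11-15\n" % i for i in range(10_000))
compressed = zlib.compress(rows, level=6)

ratio = len(rows) / len(compressed)
print(f"{len(rows)} logical bytes -> {len(compressed)} bytes on disk "
      f"({ratio:.1f}x less IO, paid for with CPU time)")
```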



