Hacker News
Fixing bug 109595 makes MySQL almost 4X faster on the Insert Benchmark (smalldatum.blogspot.com)
75 points by romac on Nov 15, 2023 | hide | past | favorite | 27 comments


[Knee jerk reaction] Be careful now. Be very careful.

> Godbolt’s Law — if any single optimisation makes a routine run two or more times faster, then you’ve broken the code.

<https://xania.org/200504/godbolt's-law>


It’s a good law, probably deserving of the term in almost any circumstance, although I’ve definitely been “lucky” enough to find exceptions (i.e. optimizing horribly inefficient code where the optimizations really couldn’t have been premature at any point in prior development). Edit: even so, I’ve found the law holds often enough in exactly that kind of code. Then it’s a very good opportunity to expand test coverage.


In a more idiotic vein, I accidentally added "RUN apt install rust-all" to the wrong Dockerfile. Removing that made the image 14x smaller.


What nonsense is this? Someone did a big, unnecessary memcpy in a loop; you removed it and saw a 3x speedup. How is this breaking the code?


The point of Matt Godbolt's "law" is that eventually you'll discover that the undocumented, mysterious memcpy wasn't unnecessary. I.e. it was a hacky resolution for a weird race-condition edge case that couldn't be solved some other way without introducing some other, major regression.

Of course, he doesn't mean to say you can never optimize to gain speed increases of > 2x. It's hyperbole, built on his experience working with well-written, optimized, time-critical software (think games, or high-volume/low-latency trading), where on a good day you're happy to improve performance by 3%.


Mark is one of the world's top experts on practical MySQL performance at scale, having spent a huge amount of time optimizing MySQL at Google and Facebook. There's a question in this thread about whether this has real world impact... yes, if Mark noticed it, yes, yes it does. This will materially improve many common workloads for InnoDB.


4x perf on inserts... These kinds of posts make me both scared and depressed about the state of our industry. I feel like we're all hitting rocks to make fire in a cave.


We're picking up pennies on the sidewalk. Each one individually is valuable, but there are just so many of them that every day we go home while the overwhelming majority of them lie uncollected. And occasionally one is made of gold.

(or the old story about throwing back starfish)


FWIW... if it takes you all of (say) two seconds to pick up a penny and put it in your pocket, that works out to $18 an hour, tax-free. Not bad. Nor world-changing.



I've seen that before! Excellent art. And as an employer, not obnoxious.


The fact that we're still relying on a human brain to keep a perfect mental model of the tens of thousands of lines of code they wrote, interacting with the millions other people wrote, is baffling.

At this point I consider it a miracle that things still mostly work.


"4x perf on inserts" on a microbenchmark - there's nothing to be depressed (or, conversely, enthusiastic) about.

The microbenchmark has value (¹), but in a very specific context.

If one wanted to observe the performance increase for a non-trivial usage pattern, they should use at least TPC-C, where I'd guess changes like this make a negligible difference, or none at all.

¹: https://smalldatum.blogspot.com/2023/05/updates-to-insert-be...


> I assume this problem was limited to InnoDB because I did not see problems with MyRocks.

This would certainly explain why InnoDB is insanely slow at loading mysqldumps compared to MyISAM - we hit a wall in some systems and were unable to switch because of it. There are tons of questions online about how to speed this up; people were generally aware of the problem but assumed it was because InnoDB is more careful with the data (like with foreign keys), or something about how it structures the data on disk that couldn't be changed.


Is someone more experienced able to shed light on how these benchmarks compare to real-world use? Writes tend to be pretty resource-intensive; is 4X faster going to show up as 2-4x faster writes in production environments?


Are those bug IDs sequential? 100 thousand bugs fixed is quite a number.


Inserts ought to be able to run at the storage medium's write speed.

I.e. if I insert 1 million records of 1 kilobyte each, and my SSD can do 1 GB/s of writes, then I should be able to do it in 1 second.

How close are we to that?

Transactions shouldn't slow this down (it's possible to write all the data to new areas of disk, and then the final 'commit' is a metadata update to say those new areas are active).

Indexes might slow it down depending on the index design. But it's possible to design an index where updates are coalesced, so the changed parts of the index are only written once as a single bulk write. Assuming the index size is only 10% of the data size, that's a 10% slowdown.
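A quick back-of-envelope check of the numbers above (the bandwidth and index-size figures are the commenter's assumptions, not measurements):

```python
# Theoretical lower bound for the bulk-insert scenario described above.
records = 1_000_000
record_size = 1_000              # ~1 KB per record
ssd_write_bw = 1_000_000_000     # assumed 1 GB/s sequential write bandwidth

data_seconds = records * record_size / ssd_write_bw
print(data_seconds)              # 1.0

# With a coalesced index assumed to be 10% of the data size,
# written once as a single bulk write:
total_seconds = data_seconds * 1.10
print(total_seconds)             # 1.1, i.e. a 10% slowdown
```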


It's fine for a simple write scenario, but we ask a lot more from our databases.

What if there are triggers? Those could do slower updates. What about indexes? What about foreign keys?

And what about hardware failure? A log that lets you recover to where you should be is fine, but then you spend some of the write capacity on it.

Etc etc...

One of the most important things we ask of our databases is to be reliable, and half their performance is spent doing that.


True - but this benchmark is for simple inserts, so it should be possible to reach the hardware write speed.

If those other things are used, then sure, it decreases the theoretical throughput.

A 'log' in modern databases is the database. I.e. the data gets written only once. A partial write can be safely ignored (rolling back the entire transaction and all its writes in one go). Data is never overwritten in place, so there is never a time when the database can be inconsistent. The downside with many storage engines is that an entire row must be read and rewritten in order to set one field. Imagine an "UPDATE people SET name='FOO';" query: every single record needs to be read and rewritten, when theoretically you could just write 'FOO' to disk once and remember that all people in the database now have that name, without even needing to read the people records.
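A toy illustration of the "the log is the database" idea, assuming a simple length-prefixed, checksummed append-only format (nothing like InnoDB's real on-disk layout): a torn write at the tail is detected and discarded on recovery, so the store is never inconsistent.

```python
import struct
import zlib

def append_record(f, payload: bytes) -> None:
    # Each record: 4-byte length, 4-byte CRC32, then the payload itself.
    f.write(struct.pack("<II", len(payload), zlib.crc32(payload)) + payload)

def recover(path: str) -> list[bytes]:
    # Replay the log from the start; stop at the first incomplete
    # or corrupt record, which can only be a torn write at the tail.
    with open(path, "rb") as f:
        data = f.read()
    records, off = [], 0
    while off + 8 <= len(data):
        length, crc = struct.unpack_from("<II", data, off)
        body = data[off + 8 : off + 8 + length]
        if len(body) < length or zlib.crc32(body) != crc:
            break  # torn/partial write: safely ignored
        records.append(body)
        off += 8 + length
    return records
```

A crash mid-append leaves a half-written final record; recovery simply stops at the first record that fails its length or checksum check, which is the "partial write can be safely ignored" property.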


> the data gets written only once

This is essentially never true. It _could_ be true if you disabled many safety features in an RDBMS, but you then lose ACID guarantees.

MySQL (with InnoDB), for example, undergoes the following for an INSERT/UPDATE:

1. Write current state of pages to the undo log

2. Write changes to the buffer pool

3. Write changes to the redo log

4. Write changes to the binary log

5. Write changes to the double write buffer

6. Write changes to disk

There is a dizzying number of variables affecting each of those, including delays, maximum allowable sizes, sync/async flushing, outright disabling of some steps (most commonly the double write buffer – which is only safe on CoW filesystems like ZFS – but you can also disable the redo log if you really want to), etc.

To your point about writes to disk, most (all?) of the disk writes are batched. Even if InnoDB weren't doing so, the kernel can and will batch writes on its own.
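The numbered steps above can be caricatured as a toy write-amplification counter. This is purely illustrative: the targets and ordering follow the list above, not actual InnoDB code, and the buffer-pool step is an in-memory page dirtying rather than a disk write.

```python
class ToyInnoDBWritePath:
    """Counts the write steps behind one logical row change (toy model)."""

    def __init__(self) -> None:
        self.steps: list[str] = []

    def _write(self, target: str) -> None:
        self.steps.append(target)

    def insert(self, row: dict) -> None:
        self._write("undo log")            # 1. prior state, for rollback/MVCC
        self._write("buffer pool")         # 2. in-memory change (dirties a page)
        self._write("redo log")            # 3. WAL entry for crash recovery
        self._write("binary log")          # 4. for replication / point-in-time recovery
        self._write("doublewrite buffer")  # 5. torn-page protection
        self._write("tablespace page")     # 6. the eventual write to disk

db = ToyInnoDBWritePath()
db.insert({"id": 1, "name": "foo"})
print(len(db.steps))  # 6 write steps for one logical insert
```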

> so there is never a time when the database can be inconsistent

If the hardware suffers a failure during a write, inconsistency can occur. This is precisely why the double write buffer (for InnoDB) and full-page writes (for Postgres) exist. Also, the hardware usually reports that it has flushed before it has actually durably written the data. If you have SSDs with Power Loss Protection capacitors, this isn't a problem, but no consumer drives that I'm aware of have those.

> downside with many storage engines is that an entire row must be read and rewritten

Actually, it can be even worse than this: if you're using UUIDv4 (or anything else random) as your PK, any UPDATE on more than one row is going to touch _a lot_ of pages. All of those pages have to be written out. This affects not only RDBMS with clustered indices (InnoDB, SQL Server) but also Postgres due to its WAL. Changing one byte can result in a 16 KiB (InnoDB) or 8 KiB (Postgres) write, which is quite a bit of amplification.
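A rough sketch of that page-dirtying effect, under toy assumptions (100 rows per page, a clustered index modeled as a sorted list of keys; real B-trees differ, but the locality argument is the same):

```python
import random

ROWS_PER_PAGE = 100  # toy assumption

def pages_dirtied(existing_keys, new_keys):
    # Clustered-index toy model: rows are stored in key order and packed
    # into fixed-size pages; a page must be rewritten if any new key
    # lands inside it.
    ordered = sorted(list(existing_keys) + list(new_keys))
    page_of = {key: i // ROWS_PER_PAGE for i, key in enumerate(ordered)}
    return len({page_of[k] for k in new_keys})

existing = list(range(100_000))                     # 1,000 full pages
sequential = list(range(100_000, 101_000))          # auto-increment / UUIDv7-like
scattered = [random.uniform(0, 100_000) for _ in range(1_000)]  # UUIDv4-like

print(pages_dirtied(existing, sequential))  # 10 pages (all appends at the end)
print(pages_dirtied(existing, scattered))   # hundreds of pages
```

Same 1,000 new rows either way; the random keys force hundreds of pages (each a full 16 KiB or 8 KiB write) out to disk instead of ten.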


Yup, switching to sequential UUIDs reduced the WAL traffic in my SaaS database significantly when we did it ~6 years ago.


I'm a big fan of UUIDv7 if you need them. I maintain most places could just use an auto-incrementing int and rewrite the API to not expose it, but barring that, UUIDv7s are great.
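For illustration, a UUIDv7 can be built by hand following the RFC 9562 layout (48-bit Unix-millisecond timestamp, then version and variant bits, then random bits); older Python stdlibs have no `uuid.uuid7`, so this is a manual sketch:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    # RFC 9562 layout: unix_ts_ms(48) | ver(4)=7 | rand_a(12)
    #                  | var(2)=0b10 | rand_b(62)
    unix_ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
    value = ((unix_ts_ms & ((1 << 48) - 1)) << 80) \
        | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)
```

Because the high bits are a timestamp, IDs generated later sort later, so inserts land near the end of the index much like an auto-increment key, while keeping the uniqueness properties of a UUID.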


Think about what the code looks like.

What you say about the log isn't true. The log is also the restore mechanism. The database must make sure that if something is written to the log, it's written to the database, and vice versa: both for restore in case of a crash, and for syncing across databases, and so on.

This alone ensures performance is not that of the raw disk: you need to write twice, and use synced writes instead of async ones.


There are a lot of complexities, and solutions aren't as simple as one would imagine.

I'm going to assume MySQL, but other relational dbs work similarly.

> Transactions shouldn't slow this down (it's possible to write all the data to new areas of diak, and then the final 'commit' is a metadata update to say those new areas are active).

This is done already. DBs don't write directly to the buffer pages; they write to a sequential log (the WAL), because sequential writes are faster. A separate thread reads the WAL and applies the changes.

This is actually not even the whole story. Pages are written twice, to the so-called doublewrite buffer, because "filesystems are complicated", and writing a page is not in itself a guarantee of atomicity (this is a way more complex discussion than it seems).

> But it's possible to design an index where updates are coalesced, and therefore the changed parts of the index are only written once as a single bulk write

This is done already as well, by the so-called "change buffer" (https://dev.mysql.com/doc/refman/8.0/en/innodb-change-buffer...).

---

The most important concept is: sequential writes (as in: big batches of writes) are not possible in ACID dbs, because changes must be written immediately. Due to the durability guarantee, one can't buffer them in memory and pretend they have been written.

Battery-backed controllers can be used to improve write throughput, but that's another story.
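The durability point can be sketched with plain file IO (a hypothetical toy, not MySQL code): committing each "transaction" individually costs one fsync per record, while group commit amortizes many transactions behind a single fsync, which is where battery-backed controllers and similar tricks buy their throughput.

```python
import os

def write_records(path: str, records: list[bytes], group_commit: bool) -> int:
    """Write records durably; return the number of fsync calls made."""
    syncs = 0
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            if not group_commit:
                f.flush()
                os.fsync(f.fileno())  # durable before acking this commit
                syncs += 1
        if group_commit:
            f.flush()
            os.fsync(f.fileno())      # one sync covers the whole batch
            syncs += 1
    return syncs
```

Both modes produce identical bytes on disk; the difference is only how many round trips to stable storage the caller must wait on.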


You've forgotten transaction latency. You might be able to get that for doing all the inserts in one go, but in normal operation what you care about is INSERTs of one item at a time from different writers, and each of those has to do a whole round trip for durability.

Anyway, if you know how to do it I'm sure mysql will welcome the pull request.


> I.e. if I insert 1 million records of 1 kilobyte each, and my SSD can do 1 GB/s of writes, then I should be able to do it in 1 second.

You are measuring throughput, not IOPS. 1 million IOPS, especially random IOPS, is very difficult to achieve.


> Inserts ought to be able to run at the storage medium's write speed.

Sometimes "faster" than IO because of compression, at the cost of CPU.
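A quick sketch of that trade-off with made-up, highly repetitive rows (zlib stands in for whatever page compression the engine uses): the bytes that actually reach the disk can be a small fraction of the logical insert volume.

```python
import zlib

# 10,000 fake CSV-ish rows; repetitive data compresses very well.
rows = b"".join(b"user-%06d,active,2023-11-15\n" % i for i in range(10_000))
compressed = zlib.compress(rows, level=6)

ratio = len(rows) / len(compressed)
print(f"{len(rows)} logical bytes -> {len(compressed)} bytes on disk "
      f"({ratio:.1f}x less IO, paid for with CPU time)")
```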



