maypok86's comments | Hacker News

Putting aside performance metrics (latency, throughput, hit rate, memory usage), here's what I don't like:

1. I don't really see how the API is simpler. ccache has tons of methods like `GetsPerPromote`, `PercentToPrune`, `Buckets`, `PromoteBuffer`, `DeleteBuffer`. How is a user supposed to know what values to set here? Honestly, even with all the time I've spent digging through cache implementations, I don't fully understand what should be configured there. Otter simply doesn't need any of these - you just specify the maximum size and the cache works.

2. Numerous methods like `tracking` and `promote` are again unnecessary for otter. Just `getIfPresent` and `set`/`setIfAbsent` and you're good to go.

3. The lack of loading and refreshing features seems like a significant drawback, as they typically provide major benefits for slow data sources.


I don't disagree. It's like 13 years old. `GetWithoutPromote` was added in 2022, I assume someone asked for it, so I added it. That kind of stuff happens, especially when you stop building it for your own needs.

For the most part, you use a default config and use Get/Fetch/Set. Besides the excuse of its age, and not being seriously worked on for a long time (a decade?), I do think we both have a bias towards what's more familiar. What are the `ExpiryCalculator`, `Weigher`, etc... configuration options of Otter? (or `GetEntryQuietly`, `SetRefreshableAfter` ...)


I believe `ExpiryCalculator` is fairly self-explanatory. For example, `ExpiryWriting` returns an `ExpiryCalculator` that specifies the entry should be automatically evicted from the cache after the given duration from either its creation or value update. The expiration time isn't refreshed on reads.

`Weigher` is also likely clear from its doc. Many developers are at least familiar with this concept from other languages or libraries like ristretto and ttlcache.

`GetEntryQuietly` retrieves the cache entry for a key without any side effects - it doesn't update statistics or influence eviction policies (unlike `GetEntry`). I genuinely think this is reasonably clear.

I'm completely baffled why `SetRefreshableAfter` made this list. If you understand refreshing, it's obviously just `SetTTL` but for the refresh policy.

Honestly, I mostly disagree about the options being unclear. I suspect `Executor` is the only one that might confuse users after reading the docs, and it's mainly for testing anyway. My core complaint is the first point in my comment - tuning the cache requires deep understanding of its internals. Take ristretto's `NumCounters` parameter: users don't understand it and often just set it to `maxCost * 10` like the README example. But this completely breaks when using custom per-entry costs (like byte sizes).

But as I mentioned when reviewing sturdyc, it probably comes down to personal preference.


No, what do you want to verify? Any network calls make Redis significantly slower than an on-heap cache. I'd even argue these are tools for different purposes and don't compare well directly.

A common pattern, for example, is using on-heap caches together with off-heap caches/dedicated cache servers (like Redis) in an L1/L2/Lx model. Here, the on-heap cache serves as the service's first line of defense, protecting slower, larger caches/databases from excessive load.


Yes, I assume that otter etc. are vastly faster, but I suspect there are people who aren't aware of that and, consequently, don't have such a layered approach. So the idea was to show how much faster they are, to further promote adoption of such tools.

And, to clarify, I was only thinking about localhost/Unix sockets to mostly eliminate network latency. Anything external to the server would obviously be incomparably slower.

I also suppose that it would be perhaps even more interesting/useful to compare the speed of these in-memory caches to Golang caches/kv stores that have disk persistence, and perhaps even also things like sqlite. Obviously the type of "disk" would matter significantly, with nvme being closer in perf (and probably sufficient for most applications).

Anyway, it was just a thought.


I benchmarked ccache for throughput [1], memory consumption [2], and hit rate [3]. For hit rate simulations, I used golang-lru's LRU implementation, though I doubt a correct LRU implementation would show meaningful hit rate differences.

Note that otter's simulator results were repeatedly compared against both W-TinyLFU's (Caffeine) and S3-FIFO's (Libcachesim) simulators, showing nearly identical results with differences within hundredths of a percent.

[1]: https://maypok86.github.io/otter/performance/throughput/

[2]: https://maypok86.github.io/otter/performance/memory-consumpt...

[3]: https://maypok86.github.io/otter/performance/hit-ratio/


What exactly do you mean by a "more specialized simple cache"? Just a map, mutex and LRU/LFU/ARC as eviction policies?

1. Even using sync.RWMutex and specialized policies won't really help you outperform a well-implemented BP-Wrapper in terms of latency/throughput.

2. I've never seen cases where W-TinyLFU loses more than 2-3% hit rate compared to simpler eviction policies. But most simple policies are vulnerable to attacks and can drop your hit rate by dozens of percentage points under workload variations. Even ignoring adversarial workloads, you'd still need to guess which specific policy gives you those extra few percentage points. I question the very premise of this approach.

3. When it comes to loading and refreshing, writing a correct implementation is non-trivial. After implementing it, I'm not sure the cache could still be called "simple". And at the very least, refreshing can reduce end-to-end latency by orders of magnitude.


You're correct on all points. I should not have used the word "outperform" and should have said a simple cache could be sufficient. If for example you know you have more than enough memory to cache all items you receive in 60 seconds and items strictly expire after 60 seconds, then a sync.RWMutex with optional lock striping is going to work just fine. You don't need to reach for one of these libraries in that case (and I have seen developers do that, and at that point the risk becomes misconfiguration/misuse of a complex library).


Yeah, I basically agree with that.


No, I haven't looked into it, but the combination of "lock-free" and "S3-FIFO" raises some red flags for me :)

I don't quite understand the specific rationale for replacing segmented LRU with S3-FIFO. If I remember correctly, even the original authors stated it doesn't provide significant benefits [1].

Regarding TinyUFO - are you using lock-free queues? Has the algorithmic complexity of TinyLFU changed? (In the base version, S3-FIFO is O(n)). How easy is it to add new features? With lock-free queues, even implementing a decent expiration policy becomes a major challenge.

[1]: https://github.com/Yiling-J/theine/issues/21


To be honest, I'm not sure I can recommend anything specific here.

1. How much data do you have and how many entries? If you have lots of data with very small records, you might need an off-heap based cache solution. The only ready-made implementation I know is Olric [1].

2. If you can use an on-heap cache, you might want to look at groupcache [2]. It's not "blazingly-fast", but it's battle-tested. Potential drawbacks include LRU eviction and lack of generics (meaning extra GC pressure from using `interface{}` for keys/values). It's also barely maintained, though you can find active forks on GitHub.

3. You could implement your own solution, though I doubt you'd want to go that route. Architecturally, segcache [3] looks interesting.

[1]: https://github.com/olric-data/olric

[2]: https://github.com/golang/groupcache

[3]: https://www.usenix.org/conference/nsdi21/presentation/yang-j...


Olric is awesome, we've been using it for 2 years in prod. No complaints.


Otter can be used as the backing store with groupcache-go, which is a fork of the original groupcache: https://github.com/groupcache/groupcache-go#pluggable-intern...


Today I'm excited to announce the release of Otter v2 - a high-performance caching library for Go.

Otter v2 has been almost completely reworked since its first release to provide a richer API and new features while maintaining high performance.

Key Improvements:

- Completely rethought API for greater flexibility.

- Added the ability to create a cache with any combination of features.

- Added loading and refreshing features.

- Added entry pinning.

- Replaced eviction policy with adaptive W-TinyLFU, enabling Otter to achieve one of the highest hit rates across all workloads.

- Added HashDoS protection against potential attacks.

- The task scheduling mechanism has been completely reworked, allowing users to manage it themselves when needed.

- Added more efficient write buffer.

- Added auto-configurable lossy read buffer.

- Optimized hash table.

- Test coverage increased to 97%.

I hope this library will prove useful to some of you! :)


Hmm, I'd like to know more about the bugs and the outdated results. As far as I can tell, all the bugs from long ago have been fixed (and even with some improvements), and the results reflect the latest version's performance. Especially since S3-FIFO shows very good results, inferior, in fact, only to the adaptive version of W-TinyLFU on the S3 and DS1 traces.

And I disagree about lock-free queues: what they do to an implementation can be seen in the theine Reads 100% graph in the otter repository; if they weren't there, theine would be equal in speed to ristretto. The BP-Wrapper technique also has a huge engineering advantage: it lets you focus on optimizing different components of the system individually and easily make changes to the eviction policy, which lock-free queues don't give the developer.


Hi Alexey, I was referring to the bug you fixed two weeks ago. This comment was about the weird results that Ben cited. I apologize here because I did not check the commit carefully, and I thought you had not updated the results (because the results on Zipf are worse than I expected). Thank you for clarifying this!

Can you clarify what you disagree about on lock-free queues? Do you mean that S3-FIFO cannot be implemented without locks?


In fact, lock-free queues have several problems at once, which prompted me to give up on them almost immediately.

1. Yes, S3-FIFO can be implemented using lock-free queues, but the problem is that each write to a full cache with this design causes a large number of extra atomic operations that are unfriendly to the processor's cache, whereas BP-Wrapper amortizes exactly this load. Reads that update the frequency of hot entries can also hurt performance. This is largely what the last posts in my discussion with Ben are about (not exactly this, but the current problem with otter's read speed has a similar cause). https://github.com/Yiling-J/theine-go/issues/29#issuecomment...

2. But the main problem for me isn't even that. Lock-free queues work fine as long as you only need to support Get and Set, but as soon as you want to add features to your cache, the implementation's complexity starts to grow, and some features are very hard to add to such a structure. Improving the eviction policy also becomes a big question mark, because you have to think not only about the improvement itself, but also about how to avoid locks and how to keep the improvement from slowing the implementation down. BP-Wrapper has none of these problems: it allows any eviction policy and lets you focus on improving different parts of your cache independently of one another.


I like the discussions with you. I have a few comments and questions.

1. Why does it require multiple atomic operations on each insert? Our implementation in cachelib only used one CAS operation. I will get back to you with an illustration; maybe you can show me something I missed.

2. I like the bp-wrapper idea. It is intuitive and can be used with most/all algorithms. Batching does reduce/amortize the overhead of locking for LRU-based algorithms. But IIRC, all the promotions are still serialized. Because the promotions are still protected by the lock, and only one core can promote objects at the same time. Caffeine achieves amazing scalability because it drops all the elements that need to be updated/promoted under high contention, which means it effectively becomes a FIFO. I don't think this is the correct way to claim good scalability.

3. "reading with frequency update on hot entries can have a bad effect on performance" I don't understand this. The frequency of hot entries should quickly reach 1 (if using a one-bit counter) or 3 (if using a two-bit counter) and will not be updated anymore. Why would read and update conflict?

4. Yes, implementing get and set with locking is trivial; adding support for delete requires a bit of work (10s lines) but should be manageable. But why would lock-free queues make other features hard to implement? Can you give an example?


1. Here I don't understand how you can get by with one CAS on a full cache. You need to evict items on every insert, reinsert items into the main queue, etc., and if you add support for item weights, the number of XADD and CAS operations grows even further.

2. This is a bit non-obvious. Caffeine does not lose a single insert, update, or delete operation, and moreover tracks their correct execution with fantastic scrupulousness. Losing some reads is inherent to many eviction policies. And caffeine's lossy buffers have another great property: the number of buffers is automatically adjusted at runtime based on contention, which minimizes losses.

3. I specifically mentioned there that it's not the same thing, but it is possible to design a workload that forces as many XADD operations as possible and as few plain loads as possible. In principle you can ignore this point, though, as it is rather artificial.

4. Here I rather mean that all features have to live in the lock-free model or be integrated into the architecture separately (which only complicates things). Plus, suppose someone wants to modify the eviction policy a bit, for example by adding an LFU component to it; it's then unclear how to do that.


1. I wrote a bit on how to implement lock-free FIFO queues here: https://blog.jasony.me/system/cache/2023/12/28/fifo. Let me know if any part is unclear or incorrect. Moving an object from the small queue to the main queue can be viewed as two operations: evicting the item from the small queue and inserting it into the main queue.

2. I am not criticizing the design decision in Caffeine. It is good because most people just don't need crazy scalability (>16 threads) or do not run it at such high throughput. I understand that critical operations are not lost. But the promotions are lost under contention, which makes the comparison unfair. S3-FIFO is still the same algorithm under contention, but the W-TinyLFU in Caffeine becomes FIFO under high contention. I do not think it can claim a low miss ratio and good scalability at the same time; the relationship is rather a low miss ratio OR good scalability. But as I mentioned, it does not matter for most use cases, and the design is elegant.

3. skipped

4. It is true that if one wants to port a lock-based new algorithm, it would require extra effort. But I guess this is a decision you have to make: only support lock-free eviction algorithms (FIFO-based) or support more algorithms with locks. But I do want to mention that S3-FIFO is not the only lock-free algorithm and there are and will be more.


In fact, saying that SIEVE is better is fundamentally wrong.

1. You only test the hit ratio on a synthetic zipf distribution and claim an advantage without saying a word about the problems (which there are).

2. Checking implementation speed looks very questionable on such benchmarks, because you spend a huge amount of CPU time on synchronization and item generation.

3. Your implementation has at least three problems that otter doesn't: worst-case O(n) time complexity of the eviction policy, terrible scalability, and far fewer features than otter; adding each missing feature will only slow golang-fifo down further.

4. Yes, with increasing contention otter sacrifices 1-2 percent of hit ratio (I'll have to see if I can reduce this loss) to maintain response speed, but that's what caffeine, ristretto, and other famous cache libraries do, because in this case maintaining scalability matters more than keeping hundreds of goroutines waiting on a mutex.

5. This structure looks highly questionable for supporting the ghost queue: https://github.com/scalalang2/golang-fifo/blob/main/s3fifo/b.... You are spending a very large number of bytes just to maintain an item's fingerprint.


Hello, thank you for replying here :)

Many of your answers are reasonable and good.

---------

And I just want to add more comments for others.

1. SIEVE is not scan-resistant, so I think it should only be applied to web-cache workloads (which typically follow a power-law distribution).

2. SIEVE is somewhat scalable for read-intensive applications (e.g. blogs, shops, etc.), because it doesn't require holding a lock on a cache hit.

3. The purpose of golang-fifo is to provide a simple and efficient cache implementation (like hashicorp-lru or groupcache).

-> I don't aim to beat the high-performance, specialized in-memory caches (bigcache, fastcache, etc.).

-> Both S3-FIFO and SIEVE have O(n) worst-case cache eviction, and it only happens when every object in the cache has been hit, i.e. a perfect (100%) hit rate in sequence.

4. "when increasing contention otter sacrifices 1-2 percent"

-> I think that statement isn't enough on its own. The hit rate varies depending on the total number of objects and the size of the cache, so it should be compared relatively. For example, otter's efficiency decreased by 5% compared to single-threaded operation when lock contention increased (decreased efficiency raises mean latency, because misses may trigger heavy operations, e.g. re-computation, database access, and so on).

-> Therefore, in certain situations the efficiency of a cache library may be more important than its QPS. This is not a matter of which one is better, but rather that people should be able to choose a cache library that suits the specific needs of their application.

5. Ghost queue: honestly, at the time of writing the code I didn't deep-dive into the bucket-table implementation; it may not work the same as an actual bucket hash table (see here: https://github.com/scalalang2/golang-fifo/issues/16).

---

If anyone here wants to raise issues or contribute to golang-fifo, please visit https://github.com/scalalang2/golang-fifo


1. That's not the problem. Yes, most web traces tend toward a Zipf distribution, but you can't conclude it will work well in real life from this check alone (this was also mentioned in the S3-FIFO discussion here), because there are many ways to break this best case.

2. There is a catch here: yes, on a 100% read workload this shows good results, but add even one percent of writes and the cache degrades very badly.

3. This makes perfect sense! But the O(n) worst case is a bit scarier than you describe :( It's quite enough to have a long enough run of elements in the main queue with frequency greater than zero (100,000, for example, will suffice); in that case you'll reinsert all of those elements under a global lock, whereas hashicorp-lru or groupcache would just move a single element.

4. That's not quite true: the losses depend on contention and nothing else, and in your benchmarks you make otter withstand 4 million QPS, at which load it lost only 2% (this value can probably be reduced a lot; I'll have to look at it over the holidays), which hardly matters much at that load.

5. My point is that you store at least 2 * key size + 8*2 + 8 bytes just to store the fingerprint. And if you also count the pointers held for its sake, it becomes quite sad.


No, and it's not planned (otter tries to avoid additional pressure on gc if possible, but that's not always possible). And I'm extremely frustrated that I had to include bigcache and fastcache in these benchmarks, since many people don't know the differences at all and just focus on performance and other metrics. Especially since bigcache and fastcache will only have an advantage when storing a huge number of items (well over 10 million). So I'll probably just add a P.S. in the README about it.


Yeah, they're fundamentally solving different problems. I was confused to see it benchmarking against them in the first place.

The number of items isn't the best metric alone, making it harder to demonstrate in a benchmark. I've reaped benefits from bigcache with ~600k-900k items — but there's a lot of eviction, each item is a pointer and can have a bunch of nested references itself. (protobufs)

