#9 (downgrading mut refs to shared refs) is a big one. It makes things quite a bit more complicated in the context of our work on the OCaml-Rust interface (more precisely, the safe interface for the GC). As I understand it, this is not a sacrifice we make at the "Altar of Memory Safety", but one we make at the Altar of Mutex::get_mut and Cell::get_mut, which is a much smaller altar (how often do you find yourself in possession of precisely a mutable borrow of a Mutex or of a Cell?).
This is about the classic trick of speculating on values using branch prediction (if value == expected then f(expected) else f(value)), which is always fun to see. But be careful in OCaml: the memory model relies on the memory ordering of data dependencies (e.g. on Arm) to ensure memory safety, so I suspect that this trick might in general be memory-unsafe in the presence of data races (more precisely, values from other threads might be seen before their initialization). (OCaml's memory-safety claims do not apply here anyway, because the code uses Obj.magic.)
Forgive the silly question, but isn't the speculative execution system responsible for making sure that speculation doesn't break your program semantics? In general, would it not be a bug in the speculative execution system to (e.g.) dereference a null pointer if your program isn't actually going to do that?
More generally, where can I read about this kind of thing to find out exactly what the speculative execution system is or isn't allowed to do? Arm documents the speculation barrier instruction, and I've found a paper which models speculative execution formally, but no actual specification of the intended semantics.
Note that the approach in the original post is also wrong if the GC sees the expected value, because the GC would try to access that value regardless of whether the prediction is valid.
To clear up any misconception, out of the box OCaml will behave like OCaml 4 with a single domain and a "domain lock". Programs currently using multiple threads for concurrency will remain single-core for the time being, as they will need to opt in to parallelism features. In this sense, adding parallelism to OCaml does not break existing programs, but they still might have to be audited for thread-safety depending on how they want to use parallelism. There is no magic.
The development of the OCaml-Rust interface is a bit organic. `ocaml-interop` aims to be a low-level but safe interface. At Inria we worked with the author to integrate ideas about using Rust's type system to work safely with a GC. We have more ideas to make it better; anyone who wants to commit engineering time to this area should feel encouraged to contact me.
Baker's paper is visionary in that it announces a link between linear logic and move semantics for resources more than a decade in advance. As we know, this was put into practice by C++ and Rust. Compared to linear types, it amounts to a change of perspective from the quantitative interpretation (is it copied? is it dropped?) to a qualitative one (how is it copied? how is it dropped?). This is very much in the spirit of linear logic (though it takes a bit of work to relate these ideas to Girard's mathematics). It answers questions that linear types were never able to answer, such as how to mix linearity and control. I take this paper as an example that there is more to linear logic than linear types!
As the paper clarifies, this is about the performance of non-atomic reads/writes in a sequential setting. The paper leaves the performance evaluation of atomic reads/writes to future work. Is there any indication yet of the performance in a parallel setting compared to weaker guarantees?
That's right -- we first wanted to establish that existing OCaml code wouldn't be adversely impacted. There are various ongoing efforts to build interesting (lock-based and lock-free) concurrent data structures, which will inform the parallel performance results. Nothing published yet.
We taught Rust at Masters level as part of a course meant to show engineering students a variety of concepts from innovative programming languages (there were also Haskell and Scala). Learning Rust is a must (at least for our kind of students), but not because "everything is being rewritten in Rust".
First, I think knowing concepts from various languages makes better programmers. But Rust in particular belongs, along with C++, to a family of languages that are very powerful and versatile, but that require strong discipline in order to write correct programs. And while C++ is currently more widespread, learning Rust teaches you this discipline, simply because the compiler will keep rejecting your programs until you understand it. C++ compilers, on the other hand, happily accept nonsensical programs, and it is hard to find modern C++ experts to teach this skill (and sometimes it is even hard to get the experts to agree on what the discipline is!). Yet in some areas where C++ has few alternatives, unreliable software can have huge consequences. So learning Rust is important because it will also make you a better C++ programmer.
Regarding latency, beyond the kind of benchmarks proposed in the paper (from which this performance claim comes), keep in mind that the behaviour of the two alternatives is qualitatively different.
Systems programmers in OCaml can apply techniques to achieve low latency, but these are unlikely to scale as well under ParMinor. Indeed, with the latter, it is not going to be possible for a low-latency domain and a non-low-latency domain to coexist: the whole program has to be written in the low-latency style.
This is clear from the qualification “stop-the-world” for ParMinor, but worth noting for people who read the claim out of context. There can be further qualitative differences in the GC designs that are not measured by the benchmarks.
> but these are unlikely to scale as well under ParMinor
To be clear, the systems programs in question are not going to see a slowdown when run in a sequential setting (which is what I expect the majority of use cases will be). It is unlikely that these systems programs will immediately want to take advantage of parallelism. Moreover, Multicore OCaml aims to add support for shared-memory parallel programming to increase the throughput of programs. Getting high throughput while maintaining very low latency is a big challenge in general, and can't be solved by the runtime system alone; it needs a different way of writing programs altogether.
> Indeed, with the latter, it is not going to be possible to have a low-latency domain and a non-low-latency domain coexist: the whole program has to be written in the low-latency style.
This is wrong. If the low-latency code is written as it is today (no allocations), then under ParMinor the low-latency domain will remain responsive, at the cost of increased latency on the non-low-latency domains. Of course, no GC safe points should be inserted into that low-latency code. This design is not ossified.
Another detail to remember is that ParMinor is a stop-the-world parallel minor collector. The major collector is a concurrent mark-and-sweep, which keeps latency low. I'd recommend checking out the new concurrent GC for GHC, which uses concurrent collection for the major heap and parallel collection for the minor heap [1], similar to ParMinor.
Ultimately, Multicore OCaml aims to offer an easy way for 95% of programs to take advantage of parallel execution to increase throughput without compromising latency. It is quite hard to fully support the various expert cases, especially since they rely on details of the existing runtime system which may no longer hold when the runtime system changes.
Indeed, OCaml multicore preserves the sequential low-latency setting, and this is important. Generalising this style is a challenge, though, and I believe that there is something to learn from those expert cases.
I do not find the suggestion that low-latency code could simply avoid safe points realistic. To get low latency, one generally avoids promoting values, but one still allocates on the minor heap. Much less can be achieved without allocating at all; moreover, OCaml multicore adds extra safe points for a reason, even if they could be turned off.
In the end, what you write about the details of the runtime system illustrates my point: the two minor collectors have qualitative differences regarding latency that affect the available programming styles, and this is not measured by the benchmarks.