The Isolation Trap: Erlang (causality.blog)
JackC 4 hours ago [-]
The article argues that shared memory and message passing are the same thing because they share the same classes of potential failure modes.

Isn't it more like, message passing is a way of constraining shared memory to the point where it's possible for humans to reason about most of the time?

Sort of like rust and c. Yes, you can write code with 'unsafe' in rust that makes any mistake c can make. But the rules outside unsafe blocks, combined with the rules at module boundaries, greatly reduce the m * n polynomial complexity of a given size of codebase, letting us reason better about larger codebases.

alberth 23 minutes ago [-]
Tangentially related: I haven’t seen DragonflyBSD talked about on HN in a long while, but wasn’t it a fork of FreeBSD built entirely around message passing as the core construct?

And with the tiny team working on it, it has remarkable performance.

https://www.dragonflybsd.org/performance/

toast0 2 hours ago [-]
> Isn't it more like, message passing is a way of constraining shared memory to the point where it's possible for humans to reason about most of the time?

That's a good way to look at it. A process's mailbox is shared mutable state, but restrictions and conventions make a lot of things simpler when a given process owns its state and responds to requests than when the requesters can access the state in shared memory. But when the requests aren't well thought out, you can build all the same kinds of issues.

Let's say you have a process that holds an account balance. If the requests are deposit X or withdraw Y, no problem (other than two generals). If instead requesters get the balance, adjust it, and then send a set-balance request, you have a classic race condition.
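A tiny Python sketch of that lost update (illustrative, not Erlang: the hypothetical `Account` class stands in for the process, and each method call for one serialized mailbox message):

```python
class Account:
    def __init__(self, balance):
        self.balance = balance

    # Atomic protocol: the owning "process" applies the change itself.
    def deposit(self, x):
        self.balance += x

    # Non-atomic protocol: callers read, compute, then write back.
    def get(self):
        return self.balance

    def set(self, x):
        self.balance = x

# Deposit requests: both land, regardless of interleaving.
a = Account(100)
a.deposit(10)
a.deposit(20)
assert a.balance == 130

# get/adjust/set-balance: a lost update under this interleaving.
b = Account(100)
seen_by_p1 = b.get()     # p1 reads 100
seen_by_p2 = b.get()     # p2 reads 100, before p1 writes back
b.set(seen_by_p1 + 10)   # p1 writes 110
b.set(seen_by_p2 + 20)   # p2 writes 120 -- p1's deposit is lost
assert b.balance == 120  # not the 130 both clients intended
```

Every individual request is still atomic; the race lives in the protocol, not the mechanism.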

ETS can be mentally modeled as a process that owns the table (even though the implementation is not), and the same thing applies... if the mutations you want to do aren't available as atomic requests or you don't use those facilities, the mutation isn't atomic and you get all the consequences that come with that.

Circular message passing can be an easy mistake to make in some applications, too.

dnautics 1 hour ago [-]
> ETS can be mentally modeled as a process that owns the table (even though the implementation is not)

The API models it that way, so I'd say it's a bit more than just a mental model.

tonnydourado 37 minutes ago [-]
Thank god I found this page: https://causality.blog/series/, now I can relax knowing that at least there's a plan for a conclusion. Looking forward to the next posts
NeutralForest 24 minutes ago [-]
Yeah I was looking for the next one!
rdtsc 1 hour ago [-]
> But an escape hatch is still an escape hatch. These mechanisms bypass the process isolation model entirely. They are shared state outside the process model, accessible concurrently by any process, with no mailbox serialization, no message copying, no ownership semantics. And when you introduce shared state into a system built on the premise of having none, you reintroduce the bugs that premise was supposed to eliminate.

No, they are not. I don't know what "Technical Program Managers at Google" do, but they don't seem to be using a lot of Erlang ;-). ETS tables can be modeled as a process which stores data and then replies to queries. Every update and read is equivalent to sending a message. The terms are still copied (see note * below). You're not going to read half a tuple and then have it mutate underneath you as another process updates it. Traversing an ETS table is logically not that different from asking a process for individual KVs using regular message passing.

What is different is what these are optimized for. ETS tables are great for querying and looking up data. They even have a mini query language for it (https://www.erlang.org/doc/apps/stdlib/qlc.html). Persistent terms are great for configuration values. None of them break the isolated heap and immutable data paradigm, they just optimize for certain access patterns.

Even for the process dictionary they mention: when a process reads another process's dictionary, it's still a signal being sent to that process and a reply needing to be received.

* Immutable binary blocks >64B can be referenced, but they are referenced when sending data using explicit messages between processes anyway.

IsTom 3 hours ago [-]
> Forget to set a timeout on a gen_server:call?

Default timeout is 5 seconds. You need to set explicit infinity timeout to not have one.

__turbobrew__ 2 minutes ago [-]
I work on infrastructure at bigco and we landed on a 5 second default timeout for our RPC framework which is interesting.

Sometimes I think there should be a list of sane and tested production configs: default rpc timeout, default backoff exponent, default initial backoff, default max backoff, health check frequency, health check timeout, process restart delay, process restart backoff, etc…
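For illustration, one common shape such defaults take: capped exponential backoff. All the numbers below are made up except the 5 second RPC timeout mentioned above (100 ms initial backoff, exponent 2, 30 s cap are assumptions, and real systems usually add jitter):

```python
RPC_TIMEOUT_S = 5.0       # the default discussed above
INITIAL_BACKOFF_S = 0.1   # illustrative values from here down
BACKOFF_EXPONENT = 2.0
MAX_BACKOFF_S = 30.0

def backoff(attempt: int) -> float:
    """Delay before retry number `attempt` (0-based), capped at the max."""
    return min(MAX_BACKOFF_S, INITIAL_BACKOFF_S * BACKOFF_EXPONENT ** attempt)

# 0.1, 0.2, 0.4, ... doubling until the 30 s ceiling kicks in.
delays = [backoff(n) for n in range(12)]
assert delays[0] == 0.1
assert delays[3] == 0.8
assert delays[11] == 30.0  # 0.1 * 2**11 = 204.8 s, capped
```

Production configs typically multiply each delay by a random factor (jitter) so that a herd of clients doesn't retry in lockstep.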

johnisgood 5 hours ago [-]
> This isn’t obviously wrong

I thought it was obviously wrong. Server A calls Server B, and Server B calls Server A. When I read the code, my first thought was that it is circular. Is it really not obvious? Am I losing my mind?

The mention of `persistent_term` is cool.

loloquwowndueo 4 hours ago [-]
It wasn’t obvious to the AI that wrote the article. There’s still hope for humans :)
bluGill 5 hours ago [-]
It is too common / useful. Not everything is a tree.
allreduce 4 hours ago [-]
Most things are a dag tho. :)
bluGill 4 hours ago [-]
Most is not all. And those exceptions are annoying.
lukeasrodgers 5 hours ago [-]
I don’t have much experience with pony but it seems like it addresses the core concerns in this article by design https://www.ponylang.io/discover/why-pony/. I wish it were more popular.
jen20 4 hours ago [-]
I don’t know enough about pony to know for sure, but nothing on that page seems to suggest that deadlocks of the form the article discusses are resolved?
gf000 1 hour ago [-]
I don't think there is a generic computational model that would prevent deadlocks, so no, pony also doesn't solve it.
Twey 53 minutes ago [-]
Message passing is a type of mutable shared state — but one that's restricted in some important way to eliminate a certain class of errors (in Erlang's case, to a thread-safe queue with pairwise ordering guarantees so that all processing on a particular actor's state is effectively atomic). You can also pick other structures that give different guarantees, e.g. LVars or CRDTs make operations commutative so that the ordering problems go away (but by removing your ability to write non-commutative operations). The big win for the actor model is (just) that it linearizes all operations on a particular substate of the program while allowing other actors' states to be operated on concurrently.

Nobody argues that any of these approaches is a silver bullet for all concurrency problems. Indeed most of the problems of concurrency have direct equivalents in the world of single-threaded programming that are typically hard and only partially solved: deadlocks and livelocks are just infinite loops that occur across a thread boundary, protocol violations are just type errors that occur across a thread boundary, et cetera. But being able to rule out some of these problems in the happy case, even if you have to deal with them occasionally when writing more fiddly code, is still a big win.

If you have an actor Mem that is shared between two other actors A and B then Mem functions exactly as shared memory does between colocated threads in a multithreaded system: after all, RAM on a computer is implemented by sending messages down a bus! The difference is just that in the hardware case the messages you can pass to/from the actor (i.e. the atomicity boundaries) are fixed by the hardware, e.g. to reads/writes on particular fixed-sized ranges of memory, while a shared actor like Mem is free to present its own set of software-defined operations, with awareness of the program's semantics. Memory fences are a limited way to bring that programmability to hardware memory, but the programmer still has the onerous and error-prone task of mapping domain operations to fences.
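A hypothetical Python sketch of that point (the `Mem` class and `fetch_add` name are illustrative, not from the article): the hardware-style interface fixes the message set to loads and stores, while an actor can expose a domain operation that is atomic by construction, because the actor handles one message at a time:

```python
class Mem:
    def __init__(self, value=0):
        self.value = value

    # Hardware-style interface: the message set is fixed to loads and stores.
    def load(self):
        return self.value

    def store(self, v):
        self.value = v

    # Software-defined operation: the whole read-modify-write is one message,
    # so it is atomic with respect to every other message Mem handles.
    def fetch_add(self, delta):
        old = self.value
        self.value += delta
        return old

m = Mem(0)
# Two clients incrementing via loads and stores can lose an update...
a = m.load()
b = m.load()
m.store(a + 1)
m.store(b + 1)
assert m.load() == 1   # one increment lost

# ...while the domain operation cannot.
m2 = Mem(0)
m2.fetch_add(1)
m2.fetch_add(1)
assert m2.load() == 2
```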

_mrinalwadhwa_ 5 minutes ago [-]
> a thread-safe queue with pairwise ordering guarantees so that all processing on a particular actor's state is effectively atomic

> The big win for the actor model is (just) that it linearizes all operations on a particular substate of the program while allowing other actors' states to be operated on concurrently.

Came here to say exactly those two things. Your comment is as clear as it could be.

aeonfox 6 hours ago [-]
A really interesting read as someone who spends a bit of time with Elixir. I wasn't aware of the atomics and counters Erlang features that break isolation.

They do say that race conditions are mitigated purely by discipline at design time, but then mention race conditions found via static analysis:

> Maria Christakis and Konstantinos Sagonas built a static race detector for Erlang and integrated it into Dialyzer, Erlang’s standard static analysis tool. They ran it against OTP’s own libraries, which are heavily tested and widely deployed.

> They found previously unknown race conditions. Not in obscure corners of the codebase. Not in exotic edge cases. In the kind of code that every Erlang application depends on, code that had been running in production for years.

I imagine that the 4th issue of protocol violation could possibly be mitigated by a type-safe BEAM language like Gleam (or Elixir when types are fully implemented).

WJW 6 hours ago [-]
> They found previously unknown race conditions. Not in obscure corners of the codebase. Not in exotic edge cases. In the kind of code that every Erlang application depends on, code that had been running in production for years.

If these race conditions are in code that has been in production for years and yet the race conditions are "previously unknown", that does suggest to me that it is in practice quite hard to trigger these race conditions. Bugs that happen regularly in prod (and maybe I'm biased, but especially bugs that happen to erlang systems in prod) tend to get fixed.

aeonfox 6 hours ago [-]
True. And that the subtle bugs were then picked up by static analysis makes the safety proposition of Erlang even better.

> Bugs that happen regularly in prod

It depends on how regular and reproducible they are. Timing bugs are notoriously difficult to pin down. Pair that with let-it-crash philosophy, and it's maybe not worth tracking down. OTOH, Erlang has been used for critical systems for a very long time – plenty long enough for such bugs to be tracked down if they posed real problems in practice.

thesz 5 hours ago [-]
Erlang has "die and be restarted" philosophy towards process failures, so these "bugs that happen to erlang systems in prod" may not be fixed at all, if they are rare enough.
toast0 3 hours ago [-]
As of now, the post you're replying to says "bugs that regularly happen ... in prod"

Now, if it crashes every 10 years, that is regular, but I think the meaning is that it happens often. Back when I operated a large dist cluster, yes, some rare crashes happened that never got noticed or the triage was 'wait and see if it happens again' and it didn't happen. But let it crash and restart from a known good state is a philosophy about structuring error checking more than an operational philosophy: always check for success and if you don't know how to handle an error fail loudly and return to a good state to continue.

Operationally, you are expected to monitor for crashes and figure out how to prevent them in the future. And, IMHO, be prepared to hot load fixes in response... although a lot of organizations don't hot load.

dnautics 1 hour ago [-]
Not all races are bugs. Here's an example that probably happens in many systems without anyone noticing: sometimes you just don't care. Say database setup races against the setup of another service that needs the database. In 99% of cases you get a faster bootup, and in 1% of cases the database setup is slow, the dependent service gets restarted by your application supervisor, and it connects on the second try.
kamma4434 5 hours ago [-]
The 4th issue is a feature: it’s what allows zero-downtime hot updates.
pshirshov 3 hours ago [-]
I believe it's more correct to reference circular calls as "livelocks", not "deadlocks" - something is happening but the whole computation cannot progress.

For the rest - pure untyped actors come with a lot of downsides and provoke engineers to make systems unnecessarily distributed (with all the consistency and timeout issues). There aren't that many problems which can be mapped well directly to actors. I personally find async runtimes with typed front-ends (e.g. Cats/ZIO in Scala, async in Rust, etc) much more robust and much less error-prone.

toast0 3 hours ago [-]
If process A is waiting for a reply from process B and process B is waiting for a reply from process A; that is deadlock. There is no way those processes can continue (unless there's a timeout or one process gets killed). Other processes may progress as long as they don't need a reply from process A or B ... which sometimes is fine. (Edit: nevermind, I forgot the 5 second timeout if you use gen_server:call/2; you will end up in livelock if it happens continuously, but a mostly ok system if it works out)
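One way to see that cycle without actually hanging anything (illustrative Python, not an Erlang API): model "X is blocked waiting for a reply from Y" as edges in a wait-for graph and look for a cycle.

```python
def has_deadlock(waiting_for):
    """True if the wait-for graph (process -> process it awaits) has a cycle."""
    def reachable(start, target, seen=None):
        seen = seen if seen is not None else set()
        nxt = waiting_for.get(start)
        if nxt is None or nxt in seen:
            return False        # chain ends, or loops without hitting target
        if nxt == target:
            return True
        seen.add(nxt)
        return reachable(nxt, target, seen)
    # A process is deadlocked if it can wait its way back to itself.
    return any(reachable(p, p) for p in waiting_for)

# A calls B and blocks; B calls A and blocks: a two-process cycle.
assert has_deadlock({"A": "B", "B": "A"})
# A calls B, B calls C, C is idle: no cycle, the calls eventually return.
assert not has_deadlock({"A": "B", "B": "C"})
```

This is essentially what wait-for-graph deadlock detectors in databases do; a timeout (like gen_server's default) is the cruder but simpler alternative.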

Livelock is something like you've got 1000 nodes that all want to do X, which requires an exclusive lock and the method to get an exclusive lock is:

Broadcast request to cluster

If you got the lock on all nodes, proceed

If you did not get the lock on all nodes, release and try again after a timeout

This procedure works in practice, when there is low contention. If the cluster is large and many processes contend for the lock, progress is rare. It's not impossible to progress, so the system is not deadlocked; but it takes an inordinate amount of time, mostly waiting for locks: the system is livelocked. In this case, whenever progress happens, future progress is easier.
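A deterministic toy simulation of that contention (illustrative Python; it assumes perfectly symmetric lockstep, the worst case): every round completes, so nothing is deadlocked, but under symmetric contention nobody ever holds every lock.

```python
def run_rounds(rounds, contenders=("n1", "n2")):
    winners = []
    for _ in range(rounds):
        # In lockstep, each node grants its per-node lock to its local
        # requester first, so every remote request is refused this round.
        locks = {node: node for node in contenders}
        # A contender proceeds only if it holds the lock on every node.
        for c in contenders:
            if all(holder == c for holder in locks.values()):
                winners.append(c)
        # Otherwise everyone releases and the next round looks identical.
    return winners

assert run_rounds(1000) == []                        # symmetric contention: zero progress
assert run_rounds(1, contenders=("n1",)) == ["n1"]   # no contention: progress
```

Real schedulers aren't perfectly symmetric, which is why the protocol works in practice at low contention, and why livelocks show up only at scale.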

This is a rough description of an actual incident with nodes joining pg2, I think around 2018... the new pg module avoids that lock (and IMHO, the lock was not needed anyway; it was there to provide consistent order in member lists across nodes, but member lists would no longer be consistent when dist disconnects happened and resolved, so why add locks to be consistent only sometimes). As an Erlang user with, I think, the largest clusters anywhere, we ran into a good number of these kinds of things in OTP. Ericsson built dist for telecom switches with two nodes in a single enclosure in a rack. It works over TCP and they didn't put explicit limits on it, so you can run a dist cluster with thousands of nodes in locations across the globe and it mostly works, but there will be some things to debug from time to time. Erlang is fairly easy to debug... All the new nodes have a process waiting to join pg2; what's the pg2 process doing, why does that lock not have the consensus-building algorithm, can we add it? In the meantime, let's kill some nodes so others can progress and then we'll run a sequenced start of the rest.

gzread 3 hours ago [-]
It's a deadlock because two threads are each waiting for the other.
anonymous_user9 6 hours ago [-]
This seems interesting, but the sheer density of LLM-isms makes it hard to get through.
rando1234 3 hours ago [-]
I actually disagree, thought it read reasonably well and didn't feel LLMy at all.
loloquwowndueo 3 hours ago [-]
It stinks of LLM - sections with headers beginning with “The”, a lot of “it’s not just X, it’s Y” etc etc.

The content is good and interesting though. Just hard to wade through with all the thorny LLM bushes getting in the way.

Looks like the author had a draft with the core content and ideas and asked an LLM to embellish it. Maybe because author wasn’t confident in their writing skills? Whatever the reason, I’d honestly prefer something human-written.

boxed 5 hours ago [-]
I think at this point comments like this are equivalent to saying "I didn't like this article, because it's written in too good English".
andrelaszlo 4 hours ago [-]
I would edit sentences like this:

"Erlang is the strongest form of the isolation argument, and it deserves to be taken seriously, which is why what happens next matters."

It doesn't add much, and it has this condescending and pretentious LLM tone. For me as a reader, it distracts from an otherwise interesting article.

layer8 1 hours ago [-]
That was the only place that made me stumble, because “what happens next” doesn’t really make sense in that context.
Linux-Fan 2 hours ago [-]
I liked the content of the article enough to read it to the end, but I did have a hard time due to the density of LLM-isms. Then again, I am not a native speaker, so how would I know if this is good English? I can only tell that to me it is hard to read, despite the interesting content.
loloquwowndueo 4 hours ago [-]
Sorry, good English is good grammatically and structurally while being unique and feeling creative, and AI-written English is not that. It’s correct but totally repetitive, formulaic and circular. It’s like expecting a pizza and finding it’s made of cardboard.
MarkusQ 1 hours ago [-]
Or maybe more like expecting Italian food and getting pizza?
trashburger 4 hours ago [-]
It shows a lack of care for the reader. Use your own words.
cyberpunk 6 hours ago [-]
Eh, maybe. I work on a big, mature, production Erlang system which has millions of processes per cluster, and while the author is right in theory, these are quite extreme edge cases and I’ve never tripped over them.

Sure, if you design a shit system that depends on ETS for shared state there are dangers, so maybe don’t do that?

I’d still rather be writing this system in erlang than in another language, where the footguns are bigger.

dnautics 1 hour ago [-]
In ten years of BEAM, I've written a deadlock once, and zero times in prod.

I'd say it's better to default to call instead of pushing people to use cast just because it won't lock.

worthless-trash 3 hours ago [-]
Could be wrong, but that won't deadlock, because 5 seconds later call/2 is going to fail.
felixgallo 4 hours ago [-]
This is agitslop.