The Steam client update on November 5th mentions "Fixed some miscellaneous common crashes." in the Linux notes, which I wanted to give a bit of background on. There's more than one fix that made it in under the somewhat generic header, but the one change that made the most significant impact to Steam client stability on Linux has been a revamping of how we are approaching the setenv
and getenv
functions.
One of my colleagues rightly dubbed setenv
"the worst Linux API". It's such a simple, common API, available on all platforms that it was a little difficult to convince ourselves just how bad it is. I highly encourage anyone who writes software that will run on Linux at some point to read through "RachelByTheBay"'s very engaging post on the subject.
Being the consummate Linux developers that we are, of course we already knew that setenv
and getenv
aren't safe to use in multithreaded environments. Our policy up to now had been to minimize usage of these functions, and hope for the best.
The Steam client collects basic crash information. The reports have a backtrace of where the crash happened, and what the other threads were doing. On Linux this data is very noisy, there is so much variation across distributions, driver versions, window manager choices, extensive user customization etc., that the reports do not bucket as nicely as they do on Windows.
After a concerted effort to improve our grouping, a pattern emerged. It turns out that if you call setenv
in a multithreaded program, sometimes you will crash outright, but that's pretty rare. That happens, but the volumes we saw were always low.
We found that other threads would blow up though, usually with a SIGABRT, shortly after calling getenv
themselves. The backtraces for such crashes were all over the place and could not be easily tied to a single cause, but there were several orders of magniture more of those that we had direct setenv
crashes.
There is no silver bullet to address this. These APIs are thread safe on Windows and Mac, so developers use them. Mac opted to leak the strings rather than crash (see BUGS in the Mac OS X manual page for getenv).
In the latest release of the Steam client we changed several things:
- We removed the majority of
setenv
calls. It was mostly used when spawning processes, and refactoring to useexevpe
to pass down a prepared environment is an all around improvement. - We reduced how much we rely on
getenv
, mostly by caching the calls. There is still an uncomfortable amount of it, but it's in OS libraries at this point (x11, xcb, dbus etc.) and we continue reducing it's usage. - For the few remaining
setenv
use cases that could not be easily refactored, we introduced an 'environment manager' that pre-allocates large enough value buffers at startup for fixed environment variable names, before any threading has started.
This last change is what really made a difference for us. Large enough buffers are preallocated at startup, the targeted environment variables exist for the whole process lifetime but start as an empty string. Wherever a setenv
would previously happen, we call getenv
first to make sure the buffer hasn't moved, and use a direct string copy to update the value. That's not a completely reliable fix, a third party library could still call setenv
and trigger crashes, there's still a risk of data races, but we've observed a significant reduction in SIGABRT volumes.
If this can be addressed in glibc, it may involve a tradeoff on features, maybe an opt-in mechanism with a slight departure from the "impossible" POSIX spec. That's something we may pursue in the long term if we can propose something sensible.
It is being addressed on glibc https://patchwork.sourceware.org/project/glibc/list/?series=36745
The main issue is how to keep the API async-signal-safe, and to synchronize with uses of envp like execve and posix_spawn. You can see that Florian’s work is far from simple: it uses a SeqLock and a temporary buffer for envp usage, and the rationale of of internal safeness is still not clear to me. But I hope to get this done for 2.41 and hopefully since it is not an ABI breaker we can backport it.
Posted by: Adhemerval | 11/11/2024 at 05:23 PM
Very happy to learn that this is being worked on in glibc. The patch series roughly coincides with when we started wrangling with this, but of course I never found it then.
Posted by: TTimo | 11/11/2024 at 06:01 PM
> We reduced how much we rely on getenv, mostly by caching the calls. There is still an uncomfortable amount of it, but it's in OS libraries at this point (x11, xcb, dbus etc.) and we continue reducing it's usage.
Would a native Wayland version of the steam client alleviate any of these concerns?
Posted by: A_nitsuj | 11/12/2024 at 08:01 AM
I've fixed this problem in Firefox some time back by using function interposition:
https://bugzilla.mozilla.org/show_bug.cgi?id=1752703
https://hg.mozilla.org/mozilla-central/rev/1760c9f902bf
https://hg.mozilla.org/mozilla-central/rev/79dc5e93cef4
Since we can't change all our dependencies (they are too many) I've slotted some thread-safe environment manipulation functions between the code calling them and the libc implementations. This is achieved by linking our own library (mozglue) before libc on Linux, and by hooking dlsym() on Android where libc might have been loaded before we get a chance to load our code. I've eliminated all crashes related to environment manipulation with this trick, with the only exception of cases where glibc calls those functions internally, but that's a genuine glibc bug that needs to be fixed upstream.
Posted by: Gabriele Svelto | 11/13/2024 at 02:18 AM
We considered intercepting the calls, but we were concerned about the performance implications of marshalling so much getenv traffic. I understand that in some cases it may be the only way though.
Posted by: TTimo | 11/13/2024 at 08:40 AM