Skip to content

libstore: split DerivationBuilder::unprepareBuild to fix flock self-deadlock (#15846)#15859

Open
RazYang wants to merge 1 commit into
NixOS:masterfrom
RazYang:fix-pathlocks-self-deadlock
Open

libstore: split DerivationBuilder::unprepareBuild to fix flock self-deadlock (#15846)#15859
RazYang wants to merge 1 commit into
NixOS:masterfrom
RazYang:fix-pathlocks-self-deadlock

Conversation

@RazYang
Copy link
Copy Markdown

@RazYang RazYang commented May 15, 2026

Motivation

Fixes #15846.

PathLocks::lockPaths uses flock(2), whose locks are associated with the
open file description rather than the process. When a goal running
registerOutputs tries to flock the actual output path while a sibling
goal in the same daemon already holds that path's lock via a different
open file description, the flock blocks even though both file descriptors
live in the same process. Because registerOutputs runs synchronously
from the build goal coroutine, that flock freezes the daemon's worker
thread; the sibling goal can no longer make progress (its builder's
stdout pipe fills and the build blocks on write), and the daemon
deadlocks. The reporter's lsof shows the same -source.lock file open
on two file descriptors of one process — the smoking gun for this
pattern.

Approach

Split DerivationBuilder output registration into two phases so the
dynamic output locks can be acquired from the goal coroutine, where a
non-blocking try-lock can yield instead of blocking the worker:

  • unprepareBuild() tears down the build environment, validates the
    builder exit status, canonicalises the scratch outputs and computes
    their content-addressed paths. It now returns the set of output store
    paths that the per-derivation pre-build locks do not already cover
    (floating CA outputs, and fixed-output derivations whose actual hash
    differs from the declared one). It no longer moves outputs.

  • registerOutputs() moves the prepared outputs into the store,
    registers them as valid, and applies output checks. The caller is
    responsible for holding PathLocks on the paths returned by
    unprepareBuild().

DerivationBuildingGoal::buildLocally acquires the dynamic output locks
between the two calls using the same try-lock + co_await waitForAWhile() retry pattern already used for the pre-build locks in
acquireResources. The worker thread yields the coroutine instead of
blocking on flock, so any sibling goal holding the lock can progress and
release it.

The set of paths locked, and the order in which they are locked, is
unchanged versus the pre-existing code; only the call site (and thus
the blocking discipline) moves.

Notes on behavioural preservation

  • The finish() call that updates outputRewrites for subsequent
    iterations now runs at the end of phase 1 (before the move) instead of
    at the end of phase 2 (after the move). The update is guarded with
    if (buildMode != bmCheck) so that bmCheck does not call finish()
    for newly-built outputs, matching the prior behaviour.

  • Per-output state that previously lived in the third loop's stack frame
    (newInfo, actualPath, references, tempDirFd, delTempDir) is
    stashed on the builder as pendingOutputs. The RAII guards are
    cleared after the move loop in phase 2 (success) or by the
    DerivationBuilder destructor (failure paths).

  • ChrootLinuxDerivationBuilder::unprepareBuild is updated to the new
    signature; its namespace-reset behaviour is preserved.

  • I deliberately avoided fcntl OFD locks: they have the same per-OFD
    semantics as flock and would not fix the in-process collision.
    Plain POSIX F_SETLK is process-wide but has well-known "close any fd
    drops all locks" semantics that interact badly with how PathLocks
    manages file descriptors. Moving the lock acquisition into the
    coroutine seemed like the smallest principled fix.

Test plan

  • nix build on the patched tree builds clean. The full test suite
    derivations (nix-store-tests-run, nix-expr-tests-run,
    nix-flake-tests-run, nix-fetchers-tests-run,
    nix-functional-tests-...) all build green.
  • The reproducer from deadblock in libstore #15846 (nix build -j3 'github:RazYang/nix_deadlock') no longer hangs. Three fetchers whose
    declared and actual hashes disagree now serialise at the dynamic-lock
    stage instead of deadlocking the worker thread.
  • Reviewers, I'd especially appreciate eyes on:
    1. The finish() re-timing in phase 1 vs phase 2 (derivation-builder.cc
      in the third loop). I believe the bmCheck guard preserves the prior
      semantics exactly, but a second opinion would be welcome.
    2. Whether the public DerivationBuilder interface change
      (unprepareBuild() return type, new registerOutputs()) is
      acceptable, or whether you'd prefer a non-breaking internal
      refactor with a wrapper. Only ChrootLinuxDerivationBuilder
      overrides this in-tree.

Known follow-up

A pre-existing benign error (ignored): cannot rename: Permission denied
from ChrootDerivationBuilder::cleanupBuild's rescue-rename loop becomes
visible after this fix (it was previously masked by the deadlock — the
build never reached the post-build cleanup path). The rescue path uses an
initialOutputs snapshot of validity rather than querying the store, so
on a successful build it still tries to rename the newly-built outputs
over their (now-registered) destinations. The build result is unaffected
and the error is logged with the (ignored) prefix. Happy to address it
as a follow-up commit on this PR or as a separate PR — let me know which
you'd prefer.


Disclosure

Parts of this change were drafted with the assistance of an LLM (Anthropic
Claude, via Claude Code). The LLM helped with: navigating the existing
registerOutputs body to find a clean split point, drafting the
two-phase API, and explaining the flock(2) per-OFD semantics that
caused the original deadlock.

All architectural decisions, the choice between candidate fixes,
behavioural-preservation reasoning, build/test verification, and the
final code were reviewed and verified by me. The reproducer is mine; the
backtrace and lsof evidence in the linked issue are real and were
captured before this work began. I am responsible for the correctness of
the patch.

…eadlock

Fixes NixOS#15846.

`PathLocks::lockPaths` uses `flock(2)`, whose locks are associated with
the open file description rather than the process. When a goal running
`registerOutputs` tries to flock the actual output path while a sibling
goal in the same daemon already holds that path's lock via a different
open file description, the flock blocks even though both file
descriptors live in the same process. Because `registerOutputs` runs
synchronously from the build goal coroutine, that flock freezes the
daemon's worker thread; the sibling goal can no longer make progress
(its builder's stdout pipe fills and the build blocks on write), and
the daemon deadlocks. The reporter's `lsof` shows the same
`-source.lock` file open on two file descriptors of one process.

Split `DerivationBuilder` output registration into two phases:

  * `unprepareBuild()` tears down the build environment, validates the
    builder exit status, canonicalises the scratch outputs and computes
    their content-addressed paths. It returns the set of output store
    paths that the per-derivation pre-build locks do not already cover
    (floating CA outputs, and fixed-output derivations whose actual
    hash differs from the declared one). It no longer moves outputs.

  * `registerOutputs()` moves the prepared outputs into the store,
    registers them as valid, and applies output checks. The caller is
    responsible for holding `PathLocks` on the paths returned by
    `unprepareBuild()`.

`DerivationBuildingGoal::buildLocally` acquires the dynamic output
locks between the two calls using the same try-lock + `co_await
waitForAWhile()` retry pattern already used for the pre-build locks in
`acquireResources`. The worker thread yields the coroutine instead of
blocking on flock, so any sibling goal holding the lock can progress
and release it.
@RazYang RazYang requested a review from Ericson2314 as a code owner May 15, 2026 03:26
@RazYang
Copy link
Copy Markdown
Author

RazYang commented May 15, 2026

@xokdvium — would appreciate your eyes on this if you have a moment, given the libstore/coroutine angle. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

deadblock in libstore

1 participant