Thoughts on "Dependency hell is NP-complete"

People have been talking about a post about package management dependency resolution by Russ Cox that opens "Dependency hell is NP-complete. But maybe we can climb out."

It does a pretty good formal analysis of the problem of traditional dependency resolution (and its equivalence to the well-known SAT problem). It also has some nice proofs of the logical equivalence.

The interesting part of the post comes at the end, when Russ talks about ways to make the problem more tractable.

First, he describes the proof of NP-completeness as coming from 4 assumptions:

  1. "A package can list zero or more packages or specific package versions as dependencies."
  2. "To install a package, all its dependencies must be installed."
  3. "Each version of a package can have different dependencies."
  4. "Two different versions of a package cannot be installed simultaneously."

I think these assumptions more-or-less match the assumptions of system package managers like apt, and application package managers that predate npm.

He says:

The conventional wisdom, as I suggested above, is that these are roughly the “the bare minimum for a package manager to be useful,” but maybe we can find a way to reduce them after all.

Russ did a great job of making some of the key constraints of package management more precise in throughout the article, but at this point I start to quibble with his description of the state of the art.

One way to avoid NP-completeness is to attack assumption 1: what if, instead of allowing a dependency to list specific package versions, a dependency can only specify a minimum version? ... As the examples already hint at, if packages follow semantic versioning, a package manager might automatically use the newest version of a dependency within a major version but then treat different major versions as different packages.

This is approximately the solution that we use in Cargo. I talked about it in the post announcing the initial alpha release of Cargo in June 2014:

By default, we will encourage package authors to comply with semantic versioning and not introduce breaking changes in minor versions by using the single highest available minor version for each depended-on major version of a package from the Cargo registry.

We also talked about it on the Rust blog post in Cargo: Predictable Depenency Management.

Russ also suggests attacking assumption 4:

Another way to avoid NP-completeness is to attack assumption 4: what if two different versions of a package could be installed simultaneously? Then almost any search algorithm will find a combination of packages to build the program; it just might not be the smallest possible combination (that’s still NP-complete).

That, of course, is the strategy that npm used. In version 2, npm simply duplicated any package it found. In version 3, npm attempts some amount of de-duplication, but the ecosystem relies heavily on the duplication ability of npm (Yarn, an alternative Node package manager I collaborated on, supports the same kind of duplication).

Cargo chose to attempt the hybrid model that Russ suggests (allowing duplication, but using semver to restrict its effect), and I agree with the rationale he gives for this hybrid approach:

One rationale for such restrictions is that developers are likely not thinking about the entire space of all possible package combinations when building or testing software. It would help for the developers and their tools to agree about how software is built.

We chose not to go with the simple duplication model in Cargo because Rust's compiler will not allow multiple versions of the same package to have a single unified type (for obvious reasons: struct { x: u32 } is not the same type as struct { x: u32, y: 32 }.

The same problem exists in Node, but the dynamic and structural nature of JavaScript allows programs to cope better, while making failure cases even more head-scratching.

If any of these approaches can be made to work in practice, it could go a long way toward simplifying the operation and understandability of language package managers.

I think it's safe to say that these approaches work in practice: npm has a massive package ecosystem and Cargo has 7,000 crates (same ballpark as Haskell, R, only about 15x smaller than Python) with almost 100 million downloads, which certainly counts as practice.

(as an aside, I sketched out an algorithm that should minimize duplicates in Yarn, while maintaining support for duplicates when needed)

While we have good evidence that the approaches do work in practice, there are still some open issues about how to make them even more frictionless for package ecosystems.

I am personally very interested in the idea of allowing packages to mark their dependencies as either "shared" or "internal". We could then disallow shared dependencies from duplicating, but allow internal dependencies to duplicate freely.

In practice, the strongest argument for allowing duplication in the first place is the fact that so many packages are used as "internal helpers," and have no need to insist on only a single global copy. However, the smaller group of shared dependencies really do need a single shared copy.

I gave a concrete example of the problem of "shared dependencies" in the Cargo post, and have some slides on the problem in Zen and the Art of Package Management.

The package.json format in npm already has a way of marking shared dependencies by declaring them as peerDependencies, but it's not very well understood by the Node ecosystem. That said, it could likely be retrofitted.

We've also been talking about making it possible to mark dependencies as "internal" in Cargo, and enforce that internal-ness using type information known by the compiler.

I want to thank Sam Boyer, a member of the Go community who Russ cites in his article, for helping to cross-pollinate ideas about package management across ecosystems.