Free Sample Episode

Asymmetric Distributed Trust

Today's article comes from the journal Distributed Computing. The authors are Alpos et al., from the University of Bern in Switzerland. In this paper they're showcasing a way for different processes in a distributed system to have their own individual trust assumptions.

DOI: 10.1007/s00446-024-00469-1


If you've worked on a distributed system before, you might have encountered the concept of Byzantine fault tolerance (BFT). It's the ability of a decentralized network to keep functioning and reaching consensus even when some nodes fail, act maliciously, or for some other reason can't be trusted. It involves something called Byzantine quorum systems, where you need agreement from a certain number of nodes (but not all of them) before you trust a decision.

Traditionally, these kinds of systems assume that everyone trusts the same set of nodes. If I trust nodes A, B, and C, then you also trust nodes A, B, and C. In other words: we all have the same global view of who's trustworthy and who isn't. In the real world, though, trust isn't always symmetric like that. You might trust your bank's servers but not mine. I might trust mine but not yours. I might trust entities that you don't, and vice versa. Picture any federated system where different organizations need to collaborate but don't necessarily trust each other equally. I'd wager that's probably more common than the alternative.

So how do we reconcile these two ideas? A Byzantine quorum system, but with asymmetric trust? That's where today's paper comes in. In it, the authors introduce something called...you guessed it...asymmetric Byzantine quorum systems. And it's exactly what it sounds like: a way for different processes in a distributed system to have their own individual trust assumptions. Each process gets to decide for itself which other processes it trusts, rather than everyone following the same global trust model. On today's episode, we'll walk through how their system works. We'll start with the fundamentals of trust and consistency, then examine the specific protocols the authors developed.

Let's start with the basics. A quorum system defines which subsets of nodes you need to hear from before you can trust a decision. The foundation is something called a fail-prone system. This captures all the maximal subsets of processes that might jointly fail during an execution. If you're using the threshold model, you might say "any subset of processes can fail, as long as fewer than one-third of the total are faulty." This fail-prone system is then used to construct a Byzantine quorum system that satisfies two key properties: consistency and availability.

  • Consistency requires that the intersection of any two quorums contains at least one process that is guaranteed to be correct. This prevents conflicting decisions because there's always at least one honest process that participated in both quorum decisions. Think of it like having overlapping committees where at least one honest person sits on multiple committees and can prevent contradictory decisions.
  • Availability ensures that for any expected failure pattern, there exists at least one quorum that is completely unaffected by those failures. This means that even when failures occur according to your assumptions, you can still find a quorum of entirely correct processes to make progress. It's like having backup committees that can still function even when your primary committee members are compromised.
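
To make those two properties concrete, here's a minimal Python sketch of the classic threshold setting: four processes, at most one of which may be faulty, so every three-process subset is a quorum. None of this code comes from the paper; the structure and names are purely illustrative.

```python
from itertools import combinations

processes = {"A", "B", "C", "D"}   # n = 4
f = 1                              # at most one Byzantine failure (n > 3f)

# Fail-prone system: all maximal sets of processes that might fail together.
fail_prone_sets = [set(c) for c in combinations(processes, f)]

# Threshold quorums: every subset of size ceil((n + f + 1) / 2) = 3.
quorums = [set(c) for c in combinations(processes, 3)]

def consistent(quorums, fail_prone_sets):
    """Any two quorums overlap in at least one process outside every fail-prone set."""
    return all(not (q1 & q2) <= bad
               for q1 in quorums for q2 in quorums
               for bad in fail_prone_sets)

def available(quorums, fail_prone_sets):
    """For every anticipated failure pattern, some quorum avoids it entirely."""
    return all(any(q.isdisjoint(bad) for q in quorums)
               for bad in fail_prone_sets)

print(consistent(quorums, fail_prone_sets))  # True
print(available(quorums, fail_prone_sets))   # True
```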

Here's the thing: this trust assumption is traditionally global. Every process in the system operates under the same assumption about which combinations of failures are possible. The system relies on a mathematical condition that ensures no three potential failure sets can together cover the entire set of processes. This condition is both necessary and sufficient for the existence of a Byzantine quorum system. So that brings us back to this paper. Asymmetric Byzantine quorum systems solve this by letting each process define its own trust assumptions. Instead of one global fail-prone system, you have an array of fail-prone systems, where each process gets its own list of which combinations of other processes it thinks might fail together. Each process gets to define its own quorums based on its own trust assumptions. Process A might form quorums that include banks and government agencies but exclude startups. Process B might form quorums that include other startups and some banks but exclude government agencies.
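
That "no three failure sets can cover everything" requirement is often called the Q³ condition in the quorum-systems literature. Here's a quick sketch of what checking it might look like; again, this is illustrative code, not the authors'.

```python
from itertools import combinations_with_replacement

processes = {"A", "B", "C", "D"}
fail_prone_sets = [{"A"}, {"B"}, {"C"}, {"D"}]   # threshold model with f = 1

def q3_condition(processes, fail_prone_sets):
    """No three fail-prone sets (not necessarily distinct) may jointly
    cover every process; only then can a Byzantine quorum system exist."""
    return all(f1 | f2 | f3 != processes
               for f1, f2, f3 in combinations_with_replacement(fail_prone_sets, 3))

print(q3_condition(processes, fail_prone_sets))  # True, since 4 > 3 * 1
```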

The tricky part is maintaining safety properties across these different trust models. The consistency condition becomes a bit more nuanced. For any two processes, if you take a quorum from each of them, their intersection must contain at least one process that at least one of them trusts not to be faulty. This is more complex than the symmetric case because you're reasoning about the intersection of different trust models, but it ensures that conflicting decisions can't be made even when participants have different trust assumptions. The availability condition, however, remains local to each process. That is: every process must be able to form quorums that avoid the failure patterns it considers possible. This ensures that each process can make progress according to its own risk assessment.
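
Here's a rough sketch of how those two conditions could be checked, based on the informal descriptions above rather than the paper's formal definitions, with made-up data: three processes keep plain threshold assumptions, while process A pessimistically assumes C and D could fail together and keeps a small quorum that avoids both.

```python
from itertools import combinations

processes = {"A", "B", "C", "D"}

# Hypothetical per-process trust assumptions (an "array" of fail-prone systems).
threshold_faults = [set(c) for c in combinations(processes, 1)]
threshold_quorums = [set(c) for c in combinations(processes, 3)]

fail_prone = {
    "A": [{"C", "D"}],                 # A thinks C and D might fail together
    "B": threshold_faults, "C": threshold_faults, "D": threshold_faults,
}
quorums = {
    "A": [{"A", "B"}],                 # so A's quorum avoids both of them
    "B": threshold_quorums, "C": threshold_quorums, "D": threshold_quorums,
}

def asym_consistent(quorums, fail_prone):
    """Quorums of any two processes overlap in some process that at least
    one of the two does not consider possibly faulty."""
    return all(not (qi & qj) <= (fi & fj)
               for i in quorums for j in quorums
               for qi in quorums[i] for qj in quorums[j]
               for fi in fail_prone[i] for fj in fail_prone[j])

def asym_available(quorums, fail_prone):
    """Each process has one of *its own* quorums untouched by every
    failure pattern *it* considers possible."""
    return all(any(q.isdisjoint(bad) for q in quorums[p])
               for p in quorums for bad in fail_prone[p])

print(asym_consistent(quorums, fail_prone))  # True
print(asym_available(quorums, fail_prone))   # True
```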

A key concept in this paper is the distinction between "naïve" and "wise" processes. In any execution where some processes actually fail, the remaining correct processes fall into categories based on whether their trust assumptions anticipated the failure.

  • Wise processes correctly anticipated the actual failure pattern. Their trust assumptions were broad enough to include the processes that actually failed. These are the cautious participants who assumed a worst-case scenario that turned out to be accurate or at least encompassing enough.
  • Naïve processes didn't anticipate it. They trusted some processes that turned out to be faulty. These are the optimistic participants who assumed fewer failures than actually occurred, or who trusted the wrong specific processes.
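
In code, that classification is just a subset check against each correct process's own fail-prone sets. This is my own toy encoding of the idea, not something taken from the paper.

```python
def classify(process, own_fail_prone_sets, actually_faulty):
    """A correct process is wise if the set of processes that really failed
    fits inside at least one failure pattern it considered possible."""
    if process in actually_faulty:
        return "faulty"
    if any(actually_faulty <= bad for bad in own_fail_prone_sets):
        return "wise"
    return "naive"

# Hypothetical execution in which process C turns out to be faulty.
faulty = {"C"}
print(classify("A", [{"C", "D"}], faulty))    # wise: A anticipated C (and D) failing
print(classify("B", [{"A"}, {"D"}], faulty))  # naive: B never considered C failing
```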

This creates an asymmetry in guarantees. The protocols described in the paper typically provide strong guarantees for wise processes only, and weaker guarantees for naïve ones. It's a bit like real life: if you trusted the wrong people, you don't get the same level of protection as those who were more cautious. This isn't a bug in the system, it's a feature that reflects the reality that better risk assessment should lead to better outcomes. For protocols to work effectively, you often need something called a "guild," which is a set of wise processes that satisfies two properties.

  • First, every process in the guild must be wise, meaning they all correctly anticipated the actual failure pattern.
  • Second, the guild must be self-sufficient in the sense that it contains a quorum for each of its members.
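
Sticking with the same toy encoding (and again, this is a sketch of the definition above, not the authors' code), a guild check boils down to those two tests:

```python
def is_guild(candidate, quorums, fail_prone, actually_faulty):
    """candidate is a guild if every member is wise for this execution and
    every member has one of its own quorums entirely inside the guild."""
    all_wise = all(
        p not in actually_faulty
        and any(actually_faulty <= bad for bad in fail_prone[p])
        for p in candidate
    )
    self_sufficient = all(
        any(q <= candidate for q in quorums[p])
        for p in candidate
    )
    return all_wise and self_sufficient

# Hypothetical execution over processes A, B, C, D in which D is faulty.
fail_prone = {"A": [{"C", "D"}], "B": [{"D"}], "C": [{"D"}]}
quorums    = {"A": [{"A", "B"}], "B": [{"A", "B", "C"}], "C": [{"B", "C"}]}
print(is_guild({"A", "B", "C"}, quorums, fail_prone, {"D"}))  # True
```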

Think of it as a coalition of well-connected, cautious processes that can make progress even when other parts of the system are compromised or naïve. The guild concept is crucial because it ensures liveness. Without a guild, you might have wise processes scattered throughout the system who can't coordinate effectively because they don't share enough trusted nodes. And importantly: all guilds overlap. If you have two different guilds, there must be at least one process that belongs to both. This follows from the consistency property of asymmetric quorum systems and ensures that guilds can't make conflicting decisions. It also means there's a unique maximal guild in any execution, representing the largest coalition of wise, well-connected processes.

This paper also introduces the concept of a "tolerated" system. A failure pattern is tolerated if there still exists a non-empty guild that can function despite those failures. The collection of all tolerated failure patterns forms the tolerated system. Remarkably, for the canonical construction of asymmetric quorum systems, this tolerated system satisfies the same mathematical properties as a symmetric Byzantine quorum system, providing a bridge between asymmetric and symmetric trust models when needed.
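
Continuing the same toy encoding, you could enumerate a tolerated system by brute force: a failure pattern is tolerated exactly when some non-empty guild survives it. This sketch is purely illustrative (and exponential in the number of processes), not the paper's construction.

```python
from itertools import combinations

def powerset(s):
    s = list(s)
    return (set(c) for r in range(len(s) + 1) for c in combinations(s, r))

def has_guild(processes, quorums, fail_prone, faulty):
    """True if some non-empty set of wise processes contains one of each
    member's quorums, i.e. a guild exists for this failure pattern."""
    wise = {p for p in processes
            if p not in faulty
            and any(faulty <= bad for bad in fail_prone[p])}
    return any(cand and all(any(q <= cand for q in quorums[p]) for p in cand)
               for cand in powerset(wise))

def tolerated_system(processes, quorums, fail_prone):
    """All failure patterns under which a non-empty guild still exists."""
    return [f for f in powerset(processes)
            if has_guild(processes, quorums, fail_prone, f)]

# Tiny hypothetical example: only "nobody fails" and "C alone fails"
# leave a guild standing, so those are the only tolerated patterns.
fail_prone = {"A": [{"C"}], "B": [{"C"}], "C": [{"A"}]}
quorums    = {"A": [{"A", "B"}], "B": [{"A", "B"}], "C": [{"B", "C"}]}
print(tolerated_system({"A", "B", "C"}, quorums, fail_prone))
```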

The authors do adapt these ideas to real life, and walk through a proof-of-concept, but the core of this study is theoretical. It's about proposing an idea that releases all of us from the constraints of symmetric trust. That being said, the practical implications for engineers are still significant. When designing systems that span organizational boundaries, we should consider whether a single global trust model makes sense or whether participants should be allowed to express their own trust preferences. If so, asymmetric trust systems might be a viable way forward. Be warned, though: they are quite a bit more complex to reason about and implement. They require careful consideration of which processes get which guarantees under which conditions. You need to identify which participants are likely to form effective guilds and ensure your system can make progress through them. This is a shift from "one size fits all" trust models to more nuanced approaches that better match real-world trust relationships, though the added complexity may only be worthwhile for systems where participants have genuinely different risk tolerances and institutional constraints.

If you want to see the authors' formal proofs of their main theorems, or dive into the detailed protocol specifications for the consensus and broadcast algorithms they demo'd, I highly recommend that you download the paper. The authors provide not only the theoretical foundations for their system but comprehensive protocol descriptions that go far beyond what we could cover here.