Consistent Hashing

💡 Basic Hashing for Load Balancing

Basic hashing (like using the modulus operator) is a simple method for load balancing, offering an advantage over simple Round Robin in specific scenarios.

The Mechanism

  1. Take a user identifier (e.g., IP address).
  2. Hash it, typically with the modulus operator (e.g., $\text{IP address} \pmod{N}$, where $N$ is the number of servers).
  3. The remainder determines which server a request is routed to (e.g., if $N=3$, the remainder is $0, 1,$ or $2$, mapping to Server 0, 1, or 2).
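A minimal sketch of this mechanism in Python (the `route_modulo` name is illustrative, and IPv4 addresses are assumed):

```python
import ipaddress

def route_modulo(ip: str, num_servers: int) -> int:
    """Pick a server index for an IPv4 address using simple modulo hashing."""
    # Treat the dotted-quad address as its 32-bit integer value,
    # then take the remainder by the number of servers.
    return int(ipaddress.ip_address(ip)) % num_servers

# With N = 3 the remainder is 0, 1, or 2, i.e., Server 0, 1, or 2.
print(route_modulo("203.0.113.42", 3))  # the same IP always yields the same server
```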

The Key Benefit: Consistency

The primary benefit of this approach is request consistency: a user with the same IP address will always be routed to the same server.

The Caching Scenario

This consistency is critical when servers are not stateless and maintain a local, individual cache (like a small Redis cache).

  • If a user is consistently routed to Server A, Server A can effectively cache the user’s data.
  • In Round Robin, the user might hit Server A, B, and C sequentially. Each server would have to re-fetch or re-process the data because the cached information is on a different server, leading to poor cache utilization.
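A toy simulation of the difference, assuming each server keeps a small LRU cache that cannot hold the whole user base (all counts and sizes are made up for illustration):

```python
import zlib
from collections import OrderedDict

NUM_SERVERS, CACHE_SIZE = 3, 40           # per-server cache is smaller than the user base
users = [f"user-{i}" for i in range(100)]
requests = users * 50                     # each user sends 50 requests, interleaved

def hit_ratio(assign):
    caches = [OrderedDict() for _ in range(NUM_SERVERS)]  # LRU cache per server
    hits = 0
    for i, user in enumerate(requests):
        cache = caches[assign(i, user)]
        if user in cache:
            hits += 1
            cache.move_to_end(user)        # refresh LRU position
        else:
            cache[user] = True
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(requests)

print(hit_ratio(lambda i, u: i % NUM_SERVERS))                    # Round Robin
print(hit_ratio(lambda i, u: zlib.crc32(u.encode()) % NUM_SERVERS))  # hash routing
```

Under Round Robin every server eventually sees every user, so the small caches thrash and the hit ratio collapses; under hash routing each server only ever caches its own third of the users, so nearly every repeat request is a hit.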

📉 The Flaw in Basic Hashing (The Reshuffling Problem)

The basic modulo hashing approach fails disastrously when the number of servers ($N$) changes (i.e., a server is added or removed).

The Problem

When a server is removed (e.g., $N$ goes from 3 to 2), the divisor in the modulus function changes.

  • Original: $IP \pmod{3}$
  • New: $IP \pmod{2}$

Because the divisor changes, the resulting remainder changes for nearly all existing user IP addresses. This causes a massive remapping or reshuffling of users to different servers.
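A quick sketch makes the scale of the reshuffle concrete (the address range here is arbitrary):

```python
import ipaddress

def remap_fraction(ips, old_n, new_n):
    """Fraction of addresses whose server changes when N goes from old_n to new_n."""
    moved = sum(
        int(ipaddress.ip_address(ip)) % old_n != int(ipaddress.ip_address(ip)) % new_n
        for ip in ips
    )
    return moved / len(ips)

ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
print(remap_fraction(ips, old_n=3, new_n=2))  # ~2/3 of users change servers
```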

The Resulting Inefficiency

  • Cache Invalidation: Users are now routed to a server they have never used before. All the data cached for them on their previous server is useless. The new server must re-process the work, leading to a massive, temporary drop in performance and cache hit ratio.
  • Goal Failure: The goal of maintaining consistency to leverage caching is completely undermined.

⭕ Consistent Hashing: The Solution

Consistent Hashing is designed to minimize the number of keys (users) that need to be remapped when a server (node) is added or removed.

The Mechanism (High-Level)

  1. The Ring: Instead of direct modulus calculation, both the servers (nodes) and the user requests (keys) are mapped onto a conceptual circular space or hash ring (e.g., from $0^\circ$ to $360^\circ$).
  2. Server Placement: Servers are placed onto this ring by hashing their identifiers with an industry-grade hash function (not simple modulo $N$).
  3. Key Mapping: A user’s key (e.g., IP address) is also mapped onto a position on this ring using the same hash function.
  4. Assignment Rule: A key is assigned to the first server/node found by moving clockwise from the key’s position on the ring.
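A minimal sketch of such a ring (the `HashRing` class and `ring_hash` helper are illustrative names; SHA-1 stands in for any industry-grade hash):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a point on the ring via SHA-1."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted positions on the ring
        self._owners = {}   # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_hash(node)
        bisect.insort(self._points, pos)
        self._owners[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_hash(node)
        self._points.remove(pos)
        del self._owners[pos]

    def get_node(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping past the end.
        i = bisect.bisect(self._points, ring_hash(key)) % len(self._points)
        return self._owners[self._points[i]]
```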

The Benefit of Node Changes

When a server is removed (e.g., Server 2):

  • Stable Keys: Users originally mapped to Servers 0 and 1 stay mapped to Servers 0 and 1, because the first server clockwise from their positions has not changed. The local caches on Servers 0 and 1 are preserved.
  • Minimal Reshuffling: Only the keys (users) that were previously mapped to the removed server are affected. They are simply re-assigned to the next server clockwise (in this case, Server 0).
  • Minimal Remapping: The total number of affected keys is limited to the portion of the ring owned by the removed server (ideally $\approx 1/N$ of all keys).
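Continuing the sketch above, removing a node moves only the keys on that node's arc:

```python
ring = HashRing(["server-0", "server-1", "server-2"])
keys = [f"user-{i}" for i in range(10_000)]
before = {k: ring.get_node(k) for k in keys}

ring.remove_node("server-2")
moved = sum(before[k] != ring.get_node(k) for k in keys)
print(moved / len(keys))  # only server-2's arc moves; ~1/N in expectation
```

With a single point per server the arcs can be quite uneven, which is exactly the imbalance the next subsection addresses.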

Load Distribution and Optimization

  • Initial Imbalance: A simple removal might cause the adjacent clockwise server to handle a disproportionate amount of traffic (e.g., 2/3 of the load).
  • Virtual Nodes (V-Nodes): To solve this, each physical server is often represented by multiple points on the ring, called "Virtual Nodes". This distributes keys more uniformly and ensures that when a server is removed, its load is split relatively evenly among all the remaining nodes.
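A sketch of virtual nodes layered on the `HashRing` above (the replica count and the `node#vnode-r` naming scheme are assumptions for illustration):

```python
class VNodeHashRing(HashRing):
    def __init__(self, nodes=(), replicas=100):
        self._replicas = replicas      # points on the ring per physical server
        super().__init__(nodes)

    def add_node(self, node: str) -> None:
        for r in range(self._replicas):
            pos = ring_hash(f"{node}#vnode-{r}")
            bisect.insort(self._points, pos)
            self._owners[pos] = node   # every virtual point maps back to the physical node

    def remove_node(self, node: str) -> None:
        for r in range(self._replicas):
            pos = ring_hash(f"{node}#vnode-{r}")
            self._points.remove(pos)
            del self._owners[pos]
```

Because each server's points are scattered around the ring, a removed server's arcs are interleaved with everyone else's, so its load is redistributed across all remaining servers rather than dumped on a single clockwise neighbor.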

🌐 Applications of Consistent Hashing

While complex, Consistent Hashing is a powerful tool used when maintaining consistency is crucial, even amidst changing infrastructure.

| Component | Why Consistent Hashing is Used | Key Hashed/Partitioned By |
| --- | --- | --- |
| Load Balancing | To preserve per-server caching and session data for individual users. | User IP Address or Session ID |
| CDNs (Content Delivery Networks) | To ensure a user's request for a piece of content (a file) consistently hits the same nearby edge server, maximizing local caching. | Content ID or User Geolocation |
| Distributed Databases (Sharding) | To determine which physical database node (shard) holds a specific user's data; when the number of shards changes, we want to minimize data movement across the cluster. | User ID or Primary Key |

🔑 Summary of Components

The key elements defining a Consistent Hashing system are:

  1. Hash Key: The unique identifier being mapped (e.g., IP address, User ID, Content ID).
  2. Hash Function: An industry-grade algorithm (like SHA-1) used to map both the keys and the nodes onto the ring.
  3. Nodes: The servers, database shards, or CDN caches that are receiving the requests. Their number is expected to change over time.

This is a good question. Although similar in outcome, "Minimal Reshuffling" and "Minimal Remapping" represent two different ways of describing the advantage of Consistent Hashing, each with its own emphasis.


🔍 Distinguishing the Concepts

| Phrasing | Emphasis and Explanation |
| --- | --- |
| Minimal Reshuffling | Focuses on the behavior of the users (keys): when a node (server) changes, only the keys originally assigned to the removed node are reassigned; every other key keeps hitting its original server. |
| Minimal Remapping | Focuses on quantifying the change: it estimates the affected share of keys, roughly $1/N$ of the total ($N$ = number of nodes). |

🎯 Why Two Phrasings?

  1. Minimal Reshuffling:

    • This phrasing emphasizes how the mechanism of consistent hashing works. It tells you which group is affected: only the keys of the removed node move, and the first node clockwise from it takes over that traffic.
    • It is a qualitative description, stressing that "only" this group of keys is affected; this is the foundation for understanding how consistent hashing works.
  2. Minimal Remapping:

    • This phrasing emphasizes the performance advantage of consistent hashing. It tells you the scale of the impact.
    • It is a quantitative description: in the ideal case of a uniform distribution, removing one node affects roughly $1/N$ of all keys.

Summary

They describe different facets of the same thing:

  • Minimal Reshuffling tells you who needs to move.
  • Minimal Remapping tells you how much needs to move ($\approx 1/N$).

To illustrate with an analogy:

  • Minimal Reshuffling is like saying: "When classrooms are reassigned, only the students of Class 2, Grade 3 need to move."
  • Minimal Remapping is like saying: "In this reassignment, only **5%** of all students need to move."

In system design interviews, the core advantage of consistent hashing is usually summarized as "only $K/N$ keys are remapped" (where $K$ is the total number of keys and $N$ the number of nodes, i.e., on the order of $1/N$).