Design Requirements

🎯 The Core Functions of a Large System

No matter how complex a large-scale system is, its functionality can be broken down into three fundamental operations:

  • Moving Data: This involves the transfer of data between different machines—either within the same data center or across the globe. This is more complex than moving data within a single machine (e.g., RAM to CPU) due to network latencies and reliability concerns.
  • Storing Data: This is about persisting data efficiently. Just as data structures (like arrays vs. binary search trees) have different trade-offs, so do large-scale storage solutions like databases, blob stores, and distributed file systems. The choice depends on the specific needs of the application.
  • Transforming Data: This is the manipulation or processing of data to produce useful results. Examples include aggregating server logs to calculate success rates or encoding videos on a platform like YouTube. Application code typically handles this part (see the sketch after this list).
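As a minimal illustration of the "transforming data" step, here is a sketch that aggregates status codes from raw log lines and computes a success rate. The log format and function names are hypothetical, chosen only for the example:

```python
from collections import Counter

def success_rate(log_lines: list[str]) -> float:
    """Fraction of requests that succeeded (2xx status codes).

    Assumes each log line ends with an HTTP status code, e.g.
    'GET /index.html 200' -- a simplified, hypothetical format.
    """
    statuses = Counter(line.rsplit(maxsplit=1)[-1] for line in log_lines)
    total = sum(statuses.values())
    ok = sum(n for code, n in statuses.items() if code.startswith("2"))
    return ok / total if total else 0.0

logs = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /api/orders 201",
    "GET /api/orders 500",
]
print(f"Success rate: {success_rate(logs):.0%}")  # Success rate: 50%
```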

✅ Measuring System Design Quality

Since system design is about trade-offs, we use key metrics to evaluate and compare design choices.

Availability (Measured in Nines)

Availability is the percentage of time a service is operational and responsive to users. It is mathematically defined as: $$\text{Availability} = \frac{\text{Total Uptime}}{\text{Total Uptime} + \text{Total Downtime}}$$

Nines     Availability    Downtime per Year (Approx.)
2 Nines   99%             3.65 days
3 Nines   99.9%           8.76 hours
4 Nines   99.99%          52.6 minutes
5 Nines   99.999%         5.26 minutes
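The downtime figures above follow directly from the availability formula; a few lines of Python reproduce them as a quick sanity check:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_per_year(nines: int) -> float:
    """Allowed downtime in minutes per year for a given number of nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability

for n in range(2, 6):
    print(f"{n} nines: {downtime_per_year(n):8.2f} minutes/year")
# 2 nines:  5256.00 minutes/year (~3.65 days)
# 3 nines:   525.60 minutes/year (~8.76 hours)
# 4 nines:    52.56 minutes/year
# 5 nines:     5.26 minutes/year
```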
  • SLO (Service Level Objective): A goal set by developers for a system’s required availability (e.g., five nines of availability).
  • SLA (Service Level Agreement): A contractual agreement with a customer that typically includes an SLO plus a defined consequence (like a partial refund) if the objective is not met.

🛡️ Reliability, Fault Tolerance, and Redundancy

These concepts are closely related and are essential for maintaining system stability.

  • Reliability: The probability that a system will not fail over a given period. It’s increased by having more robust and fault-tolerant components.
  • Fault Tolerance: The ability of a system to continue operating successfully even when one or more of its components fail (e.g., one server crashes, but another takes over).
  • Redundancy: Having duplicate components (e.g., two servers running the same code) so that if one fails, the system has a backup. Redundancy enables fault tolerance, which in turn increases reliability.
    • This is crucial for preventing a Single Point of Failure (SPOF), where the failure of one component brings down the entire system.
    • Distributing redundant components across different geographical locations (data centers) is key to protecting against region-wide issues like natural disasters.
Reliability
    ├── Objective: the system operates without failure over a specified period of time.
    ├── Key implementation strategies
    │   ├── Fault tolerance
    │   │   └── Depends on: redundancy + detection / switching / isolation
    │   └── Other means (monitoring, testing, operations and maintenance, disaster recovery, etc.)
    └── Measurement metrics: availability, error rate, MTBF/MTTR, SLO attainment rate
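To make the redundancy → fault tolerance → reliability chain concrete, here is a minimal failover sketch. The replica hostnames and the randomly failing call are hypothetical stand-ins for real network requests:

```python
import random

# Hypothetical redundant replicas -- in a real deployment these would
# live in different data centers to avoid a single point of failure.
REPLICAS = ["server-a.example.com", "server-b.example.com", "server-c.example.com"]

def call(server: str) -> str:
    """Stand-in for a network call; fails randomly to simulate a crash."""
    if random.random() < 0.3:
        raise ConnectionError(f"{server} is down")
    return f"response from {server}"

def fault_tolerant_request() -> str:
    """Try each redundant replica in turn (detection + switching).

    The request succeeds as long as at least one replica is healthy.
    """
    errors = []
    for server in REPLICAS:
        try:
            return call(server)
        except ConnectionError as exc:
            errors.append(str(exc))  # detect the fault, switch to the next replica
    raise RuntimeError(f"all replicas failed: {errors}")

print(fault_tolerant_request())
```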

🚀 Performance Metrics: Throughput and Latency

These two metrics define the speed and capacity of a system.

Throughput

Throughput is the amount of data or number of operations a system can handle over a specific period of time.

  • Requests Per Second (RPS): Common for web servers, measuring how many user requests can be processed each second.
  • Queries Per Second (QPS): Commonly used for databases, measuring how many read/write operations (queries) can be handled per second.
  • Data Per Second (Bytes/Second): Used for data pipelines or systems where the amount of data being processed is the most relevant metric (e.g., gigabytes per second).
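As a rough illustration of measuring throughput, the toy benchmark below counts how many operations a handler completes per second. It is a single-threaded sketch; real throughput is measured under concurrent load with dedicated tools such as wrk or JMeter:

```python
import time

def measure_throughput(handler, requests: int = 10_000) -> float:
    """Run `handler` repeatedly and report operations per second."""
    start = time.perf_counter()
    for _ in range(requests):
        handler()
    elapsed = time.perf_counter() - start
    return requests / elapsed

def fake_request():
    sum(range(100))  # stand-in for real request-handling work

print(f"Throughput: {measure_throughput(fake_request):,.0f} requests/second")
```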

Latency

Latency is the time delay for an operation to complete.

  • It’s typically measured from the user’s perspective (end-to-end time from request to response).
  • High latency can be caused by network distance (user is far from the server) or slow processing by the server.
  • Techniques like deploying servers in multiple regions and using Content Delivery Networks (CDNs) help reduce latency for globally distributed users.
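A minimal sketch of measuring end-to-end latency, with a sleeping stub standing in for the network round trip and server processing:

```python
import statistics
import time

def timed(handler) -> float:
    """End-to-end latency of a single operation, in milliseconds."""
    start = time.perf_counter()
    handler()
    return (time.perf_counter() - start) * 1000

def fake_request():
    time.sleep(0.01)  # stand-in for network round trip + server processing

latencies = [timed(fake_request) for _ in range(100)]
print(f"median latency: {statistics.median(latencies):.1f} ms")
print(f"worst latency:  {max(latencies):.1f} ms")
```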

⚖️ Scaling Strategies: Vertical vs. Horizontal

When a system needs to handle more load (i.e., increase throughput), there are two main scaling strategies, each with its own trade-offs:

  • Vertical Scaling: increasing the resources (CPU, RAM, disk) of a single machine ("getting a bigger computer").
    • Pros: simpler to implement and manage.
    • Cons: finite limits (a single server can only grow so much), and it creates a Single Point of Failure (SPOF).
  • Horizontal Scaling: adding more machines (servers or databases) to distribute the load ("getting more computers").
    • Pros: near-unlimited scaling capacity; provides redundancy and fault tolerance.
    • Cons: significantly increases complexity (e.g., load balancing, keeping data synchronized across multiple databases).
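A tiny round-robin sketch (with hypothetical server addresses) shows the core idea behind the load balancing that horizontal scaling requires: spreading incoming requests evenly across a pool of machines.

```python
import itertools

# Hypothetical pool of horizontally scaled servers.
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
_rotation = itertools.cycle(SERVERS)

def pick_server() -> str:
    """Round-robin selection: each new request goes to the next server."""
    return next(_rotation)

# Adding capacity means adding machines to the pool -- no single machine
# has to grow, and no single machine is a SPOF.
for _ in range(6):
    print(pick_server())  # 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1, ...
```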