Applications | Github Designing Data-intensive

The Apache Spark and Apache Flink repositories provide the blueprint for processing petabytes of "data at rest."

The "GitHub way" of learning data models involves looking at how major open-source projects implement them. github designing data-intensive applications

Second, and more radically, GitHub implemented (horizontal partitioning) using a custom middleware layer called gh-ost (GitHub Online Schema Transfers) and later, their Vitess-inspired system. They split the massive issues and pull_requests tables by repository ID. This meant that data for a single repository always lived on one shard. This is a thoughtful choice: most queries (e.g., “list all issues in this repo”) are naturally local to a shard, avoiding costly distributed joins. The downside, as Kleppmann warns, is the loss of cross-shard transactional guarantees. For example, moving an issue from one repository to another becomes a complex distributed transaction, something GitHub handles with asynchronous workflows and idempotent retries. The Apache Spark and Apache Flink repositories provide