Sharding Pinterest: How we scaled our MySQL fleet - pinterest enginnering

8台の物理サーバに1つずつMySQL インスタンス。それぞれのインスタンスは master - master replicated onto a backup host in case the primary fails

use master, not slave...

Our production servers only read/write to the master. I recommend you do the same. It simplifies everything and avoids lagged replication bugs.

各MySQL インスタンスは512個位ずつdatabaseを持つ。

shard

We made a design decision that once a piece of data lands in a shard, it never moves outside that shard. However, you can get more capacity by moving shards to other machines (we’ll discuss this later).

[{“range”:     (0,511), “master”: “MySQL001A”, “slave”: “MySQL001B”},
 {“range”: (512, 1023), “master”: “MySQL002A”, “slave”: “MySQL002B”},
    ...
 {“range”: (3584, 4095), “master”: “MySQL008A”, “slave”: “MySQL008B”}]

Each shard contains the same set of tables: pins, boards, users_has_pins, users_likes_pins, pin_liked_by_user, etc. I’ll expand on that in a moment.

how do we distribute our data to these shards

We created a 64 bit ID that contains the shard ID, the type of the containing data, and where this data is in the table (local ID). The shard ID is 16 bits, type ID is 10 bits and local ID is 36 bits.

question

例えばuserID % shardの値によってshardを決定している場合、shard数を4096 --> 8192に増加させた場合計算結果が変わるよね？つまり、データが格納されるべきshardが変わってしまう。つまり、どうやってpinterestではどのようにshardを決定しているのか。

→　最初に決定したら固定。IDの中に「どのシャードに保存されているか」という情報も含んでいるためImmutableにできる。これ真似しようと思ったら、UserID（というかshard key）に必ずシャード番号を含ませるとか？

固定の場合、何も考えずにmod計算などでINSERTしていくと古いシャードほどデータ量が多くなるよね？現在のシャード負荷状況を考慮して格納シャードを決定しないかぎり

→　新しくシャードを追加した後、新たにINSERTするレコードは新しいシャードの中のみで分配するようなロジックをアプリケーション側で頑張って実装しているのかな。。。？

→　完全ランダムっぽい？ ( > New users are randomly distributed across shards. )

The Mod Shard

Scaling Pinterest - From 0 To 10s Of Billions Of Page Views A Month In Two Years

ID Structure

64 bits: shard ID: 16 bits type : 10 bits - Pin, Board, User, or any other object type local ID - rest of the bits for the ID within the table. Uses MySQL auto increment.

Enough shards IDs for 65536 shards, but they only opened 4096 at first, they’ll expand horizontally. When the user database gets full they’ll open up more shards and allow new users to go to the new shards.

New users are randomly distributed across shards.

ということは、別に古いシャードに割り振られても気にしないということか。単純にシャードが一杯一杯になってきたら、シャードごと別のDBサーバに移してしまう。みたいな。

All data (pins, boards, etc) for a user is collocated on the same shard. Huge advantage. Rendering a user profile, for example, does not take multiple cross shard queries. It’s fast

Objects And Mappings

Queries are primary key or index lookups (no joins).

Data doesn’t move across database as it does with clustering. Once a user lands on shard 20, for example, and all the user data is collocated, it will never move. The 64 bit ID has contains the shard ID so it can’t be moved. You can move the physical data to another database, but it’s still associated with the same shard.

つまり、一旦データが配置されたらシャード番号はそれ以降決して変更されない。物理的に別のDBサーバに移動されることはあっても。