In the previous article, Data Distribution Design for Multiple IDCs (Part 1), I introduced several approaches to achieving data consistency across multiple IDCs. Unfortunately, although there are many distributed products today, almost no open-source product is optimized specifically for multi-IDC deployment. This article analyzes the pros and cons of each approach from a practical perspective.
Background: latency differences
Jeff Dean has pointed out the latency differences between different ways of accessing data:
Numbers Everyone Should Know
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
These numbers are a key guide when designing multi-IDC data access strategies: we can use them to judge whether a data architecture will meet the goals of high concurrency and low latency.
In fact, the list is valuable to every developer of networked and distributed applications: data should be placed, according to its access frequency, as close to the low-latency end of this spectrum as possible.
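As a rough illustration of why these numbers matter, the sketch below uses the constants from the table to compare a purely local request with one that makes a single synchronous cross-continent hop. The request mix (1,000 memory references per request) is a made-up assumption for illustration.

```python
# Latency-budget sketch using Jeff Dean's numbers above (all in nanoseconds).
# The "1000 memory references per request" workload is hypothetical.
MAIN_MEMORY_REF_NS = 100
SAME_DC_ROUND_TRIP_NS = 500_000
CROSS_CONTINENT_ROUND_TRIP_NS = 150_000_000

def request_latency_ms(cross_dc_hops: int, memory_refs: int = 1000) -> float:
    """Estimate request latency: local memory work plus cross-continent round trips."""
    total_ns = (memory_refs * MAIN_MEMORY_REF_NS
                + cross_dc_hops * CROSS_CONTINENT_ROUND_TRIP_NS)
    return total_ns / 1_000_000

# One synchronous cross-continent hop already costs 150 ms and dominates
# everything the request does locally.
print(request_latency_ms(0))  # 0.1   -- purely local work
print(request_latency_ms(1))  # 150.1 -- one remote hop added
```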
1. The 2PC/3PC/Paxos approach
As noted in the previous article, 2PC and 3PC have clear drawbacks compared with Paxos and are best kept out of production environments, so I will not discuss them further here.
Paxos chooses consistency and partition tolerance from the CAP theorem and sacrifices availability. It can provide strongly consistent replication across multiple IDCs.
Drawbacks of Paxos
- It requires a fast, stable network between the IDCs.
- In a cluster of 2f+1 nodes, f+1 nodes must complete a round before a transaction succeeds.
- Throughput is low, making it unsuitable for high request volumes; this is why most distributed storage products do not use Paxos directly to replicate data.
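The 2f+1 majority rule above can be sketched in a few lines. The function names are illustrative, not from any library: with 2f+1 acceptors, a value is chosen once a majority of f+1 accept it, which is what lets the cluster tolerate f failures.

```python
def tolerated_failures(total_nodes: int) -> int:
    """f for a Paxos cluster of 2f+1 nodes."""
    return (total_nodes - 1) // 2

def quorum_size(total_nodes: int) -> int:
    """f+1: the majority of acceptors needed to choose a value."""
    return total_nodes // 2 + 1

# A 5-node cluster commits with 3 acceptances and survives 2 failures;
# a 3-node cluster commits with 2 and survives 1.
print(quorum_size(5), tolerated_failures(5))  # 3 2
print(quorum_size(3), tolerated_failures(3))  # 2 1
```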
2. The Dynamo approach
The Dynamo paper does not specifically discuss whether the algorithm suits multi-IDC scenarios; it only mentions the topic briefly:
In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers. These datacenters are connected through high speed network links. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without a data outage.
The quoted prerequisite, "high speed network links", may not hold under typical network conditions in China. So what happens if the network between IDCs is unstable?
In the quorum algorithm, high availability requires spreading replicas across multiple datacenters. With two datacenters and NRW = 3/2/2, a single-datacenter failure can take out two of the three replicas at once, leaving the data unavailable; a safer deployment is NRW = 5/3/3 across three datacenters. Most requests then need responses from nodes in two datacenters to succeed, and given inter-IDC bandwidth and latency, performance inevitably suffers.
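The availability argument above can be checked mechanically. In this sketch the replica placements are assumptions: with N=3, R=W=2 over two datacenters, losing the datacenter holding two replicas leaves only one live replica, below quorum; N=5, R=W=3 over three datacenters survives any single-datacenter failure.

```python
def quorum_alive(replica_dcs, failed_dc, quorum):
    """True if enough replicas survive a whole-DC failure to reach quorum."""
    alive = sum(1 for dc in replica_dcs if dc != failed_dc)
    return alive >= quorum

# NRW = 3/2/2 over two DCs: two replicas land in dc1, one in dc2.
nrw322 = ["dc1", "dc1", "dc2"]
print(quorum_alive(nrw322, "dc1", 2))  # False: only 1 replica < quorum of 2

# NRW = 5/3/3 over three DCs: no single DC holds 3 of the 5 replicas.
nrw533 = ["dc1", "dc1", "dc2", "dc2", "dc3"]
print(all(quorum_alive(nrw533, dc, 3) for dc in ("dc1", "dc2", "dc3")))  # True
```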
For both reads and writes, the quorum algorithm selects a coordinator, as the paper describes:
A node handling a read or write operation is known as the
coordinator. Typically, this is the first among the top N nodes in
the preference list. If the requests are received through a load
balancer, requests to access a key may be routed to any random
node in the ring. In this scenario, the node that receives the
request will not coordinate it if the node is not in the top N of the
requested key’s preference list. Instead, that node will forward the
request to the first among the top N nodes in the preference list.
If we follow the Dynamo protocol strictly, the coordinator must be the first of the top N nodes, so in a three-datacenter deployment roughly 2/3 of requests must be forwarded to a coordinator in a remote IDC, increasing latency. If we instead optimize coordinator selection by letting the client pick, among the top N nodes of the preference list, one that lives in the local IDC, latency drops somewhat; but the probability that the same key is coordinated by different nodes rises, and with it the probability of data conflicts.
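The two coordinator-selection strategies just discussed can be contrasted in a short sketch. The node names and datacenter layout here are made up for illustration; neither function comes from an actual Dynamo implementation.

```python
def strict_coordinator(preference_list):
    """Dynamo's strict rule: the first of the top-N nodes coordinates."""
    return preference_list[0]

def local_first_coordinator(preference_list, local_dc, node_dc):
    """Latency optimization: prefer a top-N node in the caller's own DC,
    falling back to the strict rule. Different clients may now pick
    different coordinators for the same key, raising conflict probability."""
    for node in preference_list:
        if node_dc[node] == local_dc:
            return node
    return preference_list[0]

node_dc = {"n1": "dc-east", "n2": "dc-west", "n3": "dc-east"}
prefs = ["n1", "n2", "n3"]  # hypothetical preference list for one key
print(strict_coordinator(prefs))                           # n1 (may be remote)
print(local_first_coordinator(prefs, "dc-west", node_dc))  # n2 (local to caller)
```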
Failure detection also becomes messy across multiple datacenters. Dynamo does not use a consistent failure view to decide that a node is down; each node judges on its own:
Failure detection in Dynamo is used to avoid attempts to
communicate with unreachable peers during get() and put()
operations and when transferring partitions and hinted replicas.
For the purpose of avoiding failed attempts at communication, a
purely local notion of failure detection is entirely sufficient: node
A may consider node B failed if node B does not respond to node
A’s messages (even if B is responsive to node C’s messages).
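The purely local failure detection the quote describes can be sketched as follows. Each node tracks only its own recent contact with peers; there is no shared failure view, so node A may consider B failed while C still talks to B happily. The timeout value and class shape are assumptions.

```python
class LocalFailureDetector:
    """Each node's private view of peer liveness; nothing is shared."""

    def __init__(self, timeout_s: float = 5.0):
        self.timeout_s = timeout_s
        self.last_heard = {}  # peer -> timestamp of last successful response

    def record_response(self, peer: str, now: float) -> None:
        self.last_heard[peer] = now

    def considers_failed(self, peer: str, now: float) -> bool:
        """B is 'failed' from this node's view iff B hasn't answered recently."""
        last = self.last_heard.get(peer)
        return last is None or now - last > self.timeout_s

a = LocalFailureDetector(timeout_s=5.0)
a.record_response("node_b", now=100.0)
print(a.considers_failed("node_b", now=102.0))  # False: answered 2s ago
print(a.considers_failed("node_b", now=110.0))  # True -- but only from A's view
```

Across flaky inter-IDC links this independence means the cluster's nodes can hold wildly different opinions about which remote peers are alive, which is exactly the confusion described above.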
Cassandra, which has recently become very popular, is essentially an open-source Dynamo clone. In the Facebook Inbox Search project it was deployed on 150 nodes spread across datacenters on the US east and west coasts:
The system(Facebook Inbox Search) currently stores about 50+TB of data on a 150 node cluster, which is spread out between east and west coast data centers.
Although its JIRA contains a proposal, CASSANDRA-492, about a "Data Center Quorum", Cassandra as a whole has no particular optimizations for multi-IDC deployment. Its paper [5] says:
Data center failures happen due to power outages, cooling failures, network failures, and natural disasters. Cassandra is configured such that each row is replicated across multiple data centers. In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple datacenters. These datacenters are connected through high speed network links. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without any outage.
This is nearly identical to the description in the Dynamo paper.
3. The PNUTS approach
PNUTS is currently the most promising approach to multi-IDC data synchronization; most of its algorithms were designed specifically for multiple IDCs.
PNUTS targets web applications rather than offline data analysis (in contrast to Hadoop/HBase).
- Yahoo!'s data is mostly user-related: typically key-value records keyed by the user's username.
- Statistics on access patterns showed that 85% of writes to a user's data usually originate from the same IDC.
Based on these data characteristics, Yahoo!'s PNUTS works as follows:
- Record-level mastering: each record is assigned one IDC as its master, and all writes go through that master, even though different records in the same tablet may have different masters.
- The master replicates data to the other IDCs asynchronously via Yahoo! Message Broker (YMB) messages.
- Master selection is flexible and can migrate based on where recent writes originate: if an IDC receives a user's write but the master is remote, the write is forwarded to the remote master; once more than three writes in a row are forwarded, the local IDC becomes the record's master.
- Every write to a record carries a version number (per-record timeline consistency), and the master plus YMB guarantee replication order.
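The record-level mastering and master-migration policy above can be sketched as a small state machine. The threshold of 3 consecutive forwarded writes comes from the text; the class shape and names are assumptions, not PNUTS code.

```python
class RecordMastering:
    """One record's master IDC, with traffic-driven master handoff."""
    HANDOFF_THRESHOLD = 3

    def __init__(self, master_idc: str):
        self.master_idc = master_idc
        self.remote_writes = 0     # consecutive writes from one non-master IDC
        self.remote_source = None

    def write(self, from_idc: str) -> str:
        """Return the IDC that executes this write; maybe migrate mastership."""
        if from_idc == self.master_idc:
            self.remote_writes, self.remote_source = 0, None
            return self.master_idc
        if from_idc == self.remote_source:
            self.remote_writes += 1
        else:
            self.remote_source, self.remote_writes = from_idc, 1
        executed_at = self.master_idc  # forwarded to the current master
        if self.remote_writes >= self.HANDOFF_THRESHOLD:
            self.master_idc = from_idc  # mastership follows the traffic
        return executed_at

rec = RecordMastering("idc-a")
for _ in range(3):
    rec.write("idc-b")     # each write is forwarded to idc-a
print(rec.master_idc)      # idc-b: master migrated after the 3rd remote write
print(rec.write("idc-b"))  # idc-b: subsequent writes are served locally
```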
In practice, PNUTS can be understood as a master-master scheme.
Consistency: every record is modified through its master, which then replicates the change to the other IDCs, so all IDCs eventually converge on the same data.
Availability:
- Every IDC holds a local copy of each record, so an application can, per its own policy, return the local cache or the latest version.
- A local write counts as successful as soon as it is committed to YMB.
- The failure of any single IDC does not interrupt access.
Other advantages mentioned in the paper:
hosted, notifications, flexible schemas, ordered records, secondary indexes, lowish latency, strong consistency on a single record, scalability, high write rates, reliability, and range queries over a small set of records.
In short, PNUTS fits the geographic replication model well.
- A record counts as written once it is published to the local YMB, avoiding the latency Dynamo incurs waiting for multiple datacenters to respond.
- If the master is in a remote IDC the request must be forwarded there, but thanks to the master-migration policy, forwarding is relatively rare.
In the extreme case where a record's master is unavailable, the design looks somewhat questionable; readers can judge for themselves:
Under normal operation, if the master copy of a record fails, our system has protocols to fail over to another replica. However, if there are major outages, e.g. the entire region that had the master copy for a record becomes unreachable, updates cannot continue at another replica without potentially violating record-timeline consistency. We will allow applications to indicate, per-table, whether they want updates to continue in the presence of major outages, potentially branching the record timeline. If so, we will provide automatic conflict resolution and notifications thereof. The application will also be able to choose from several conflict resolution policies: e.g., discarding one branch, or merging updates from branches, etc.
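The per-table outage policy in the quote can be sketched as a small decision function plus a toy conflict resolver. The function names, version fields, and the two resolution policies below are illustrative assumptions, not PNUTS's actual mechanism.

```python
def apply_write(allow_branching: bool, master_reachable: bool):
    """Return (accepted, branched) for a write during a master-region outage."""
    if master_reachable:
        return True, False   # normal path: the master orders the write
    if allow_branching:
        return True, True    # accept at another replica; timeline may branch
    return False, False      # refuse rather than violate timeline consistency

def resolve_branches(branches, policy="latest"):
    """Toy conflict resolution: keep one branch, or merge updates from all."""
    if policy == "latest":
        return max(branches, key=lambda b: b["version"])
    if policy == "merge":
        merged = {}
        for b in sorted(branches, key=lambda b: b["version"]):
            merged.update(b["data"])
        return {"version": max(b["version"] for b in branches), "data": merged}
    raise ValueError(policy)

print(apply_write(allow_branching=False, master_reachable=False))  # (False, False)
b1 = {"version": 7, "data": {"status": "paid"}}
b2 = {"version": 8, "data": {"addr": "new"}}
print(resolve_branches([b1, b2], policy="latest")["version"])  # 8
```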
Preliminary conclusions
Low-bandwidth networks
PNUTS-style record-level mastering works best.
High-bandwidth, low-latency networks
(1 Gbps, latency < 50 ms)
1. Use Dynamo-style quorums with vector clocks for eventual consistency.
2. Use Paxos for strong consistency.
Afterword
This article took a long time from preparation to publication. There is no mature, industry-wide consensus on multi-IDC data access; relevant material and literature are scarce, few people have both the interest and the environment to study it, and forming mature, independent views in a short time is difficult. This article is therefore only a collection of preliminary thoughts, and since I was not satisfied with their depth, I kept postponing publication. As I have been busy recently and lack the time to dig deeper, I am publishing it now in the hope of prompting discussion, and I welcome exchanges with anyone interested in this topic.
Resources
- Ryan Barrett, Transactions Across Datacenters
- Jeff Dean, Designs, Lessons and Advice from Building Large Distributed Systems (PDF)
- PNUTS: Yahoo!’s Hosted Data Serving Platform (PDF)
- Thoughts on Yahoo’s PNUTS distributed database
- Cassandra – A Decentralized Structured Storage System (PDF)
- An introduction to and reflections on Yahoo!'s distributed data platform PNUTS