http://www.wired.com/wiredenterprise/2012/11/google-spanner-time/all/


(it's about time)

Exclusive: Inside Google Spanner, the Largest Single Database on Earth

By Cade Metz
11.26.12
6:30 AM

Each morning, when Andrew Fikes sat down at his desk inside Google headquarters 
in Mountain View, California, he turned on the “VC” link to New York.

VC is Google shorthand for video conference. Looking up at the screen on his 
desk, Fikes could see Wilson Hsieh sitting inside a Google office in Manhattan, 
and Hsieh could see him. They also ran VC links to a Google office in Kirkland, 
Washington, near Seattle. Their engineering team spanned three offices in three 
different parts of the country, but everyone could still chat and brainstorm 
and troubleshoot without a moment’s delay, and this is how Google built Spanner.

“You walk into our cubes, and we’ve got VC on — all the time,” says Fikes, who 
joined Google in 2001 and now ranks among the company’s distinguished software 
engineers. “We’ve been doing this for years. It lowers all the barriers to 
communication that you typically have.”

‘As a distributed-systems developer, you’re taught from — I want to say 
childhood — not to trust time. What we did is find a way that we could trust 
time — and understand what it meant to trust time.’

— Andrew Fikes

The arrangement is only appropriate. Much like the engineering team that 
created it, Spanner is something that stretches across the globe while behaving 
as if it’s all in one place. Unveiled this fall after years of hints and 
rumors, it’s the first worldwide database worthy of the name — a database 
designed to seamlessly operate across hundreds of data centers and millions of 
machines and trillions of rows of information.

Spanner is a creation so large, some have trouble wrapping their heads around 
it. But the end result is easily explained: With Spanner, Google can offer a 
web service to a worldwide audience, but still ensure that something happening 
on the service in one part of the world doesn’t contradict what’s happening in 
another.

Google’s new-age database is already part of the company’s online ad system — 
the system that makes its millions — and it could signal where the rest of the 
web is going. Google caused a stir when it published a research paper detailing 
Spanner in mid-September, and the buzz was palpable among the hard-core 
computer systems engineers when Wilson Hsieh presented the paper at a 
conference in Hollywood, California, a few weeks later.

“It’s definitely interesting,” says Raghu Murty, one of the chief engineers 
working on the massive software platform that underpins Facebook — though he 
adds that Facebook has yet to explore the possibility of actually building 
something similar.

Google’s web operation is significantly more complex than most, and it’s forced 
to build custom software that’s well beyond the scope of most online outfits. 
But as the web grows, Google’s creations so often trickle down to the rest of the
world.

Before Spanner was revealed, many didn’t even think it was possible. Yes, we 
had “NoSQL” databases capable of storing information across multiple data 
centers, but they couldn’t do so while keeping that information “consistent” — 
meaning that someone looking at the data on one side of the world sees the same 
thing as someone on the other side. The assumption was that consistency was 
barred by the inherent delays that come when sending information between data 
centers.

But in building a database that was both global and consistent, Google’s 
Spanner engineers did something completely unexpected. They have a history of 
doing the unexpected. The team includes not only Fikes and Hsieh, who oversaw 
the development of BigTable, Google’s seminal NoSQL database, but also 
legendary Googlers Jeff Dean and Sanjay Ghemawat and a long list of other 
engineers who worked on such groundbreaking data-center platforms as Megastore 
and Dremel.

This time around, they found a new way of keeping time.

“As a distributed-systems developer, you’re taught from — I want to say 
childhood — not to trust time,” says Fikes. “What we did is find a way that we 
could trust time — and understand what it meant to trust time.”

Time Is of the Essence

On the net, time is of the essence. Yes, in running a massive web service, you 
need things to happen quickly. But you also need a means of accurately keeping 
track of time across the many machines that underpin your service. You have to 
synchronize the many processes running on each server, and you have to 
synchronize the servers themselves, so that they too can work in tandem. And 
that’s easier said than done.

Typically, data-center operators keep their servers in sync using what’s called 
the Network Time Protocol, or NTP. This is essentially an online service that 
connects machines to the official atomic clocks that keep time for 
organizations across the world. But because it takes time to move information 
across a network, this method is never completely accurate, and sometimes, it 
breaks altogether. In July, several big-name web operations experienced 
problems — including Reddit, Gawker, and Mozilla — because their software 
wasn’t prepared to handle a “leap second” that was added to the world’s atomic 
clocks.

‘We wanted something that we were confident in. It’s a time reference that’s 
owned by Google.’

— Andrew Fikes

But with Spanner, Google discarded NTP in favor of its own time-keeping 
mechanism. It’s called the TrueTime API. “We wanted something that we were 
confident in,” Fikes says. “It’s a time reference that’s owned by Google.”

Rather than rely on outside clocks, Google equips its Spannerized data centers 
with its own atomic clocks and GPS (global positioning system) receivers, not 
unlike the ones in your iPhone. Tapping into a network of satellites orbiting 
the Earth, a GPS receiver can pinpoint your location, but it can also tell time.

These time-keeping devices connect to a certain number of master servers, and 
the master servers shuttle time readings to other machines running across the 
Google network. Basically, each machine on the network runs a daemon — a 
background software process — that is constantly checking with masters in the 
same data center and in other Google data centers, trying to reach a consensus 
on what time it is. In this way, machines across the Google network can come 
pretty close to running a common clock.
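The Spanner paper describes the result of this clock consensus not as a single timestamp but as an interval: TrueTime's "now" call returns bounds [earliest, latest] guaranteed to contain the true time, and the uncertainty grows the longer a machine goes between check-ins with a master. What follows is a minimal Python sketch of that interval idea; the class names, drift rate, and one-millisecond post-sync uncertainty are illustrative assumptions, not Google's actual figures.

import time
from dataclasses import dataclass

# Toy sketch of a TrueTime-style interval clock. Instead of one timestamp, the
# clock reports a range [earliest, latest] that should contain the true time.
# DRIFT_RATE and the post-sync uncertainty are assumed values for illustration.

DRIFT_RATE = 200e-6  # assumed worst-case local clock drift: 200 microseconds/second

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTimeSketch:
    def __init__(self):
        self.last_sync = time.time()   # pretend we just checked in with a time master
        self.sync_uncertainty = 0.001  # assumed 1 ms of uncertainty right after a sync

    def now(self) -> TTInterval:
        local = time.time()
        # Uncertainty grows with the time elapsed since the last master check-in.
        epsilon = self.sync_uncertainty + DRIFT_RATE * (local - self.last_sync)
        return TTInterval(local - epsilon, local + epsilon)

    def after(self, t: float) -> bool:
        # True only once t has definitely passed on every clock in the system.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only if t has definitely not yet arrived anywhere.
        return self.now().latest < t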

‘The System Responds — And Not a Human’

How does this bootstrap a worldwide database? Thanks to the TrueTime service, 
Google can keep its many machines in sync — even when they span multiple data 
centers — and this means they can quickly store and retrieve data without 
stepping on each other’s toes.

“We can commit data at two different locations — say the West Coast [of the 
United States] and Europe — and still have some agreed-upon ordering between 
them,” Fikes says. “So, if the West Coast write happens first and then the one 
in Europe happens, the whole system knows that — and there’s no possibility of 
them being viewed in a different order.”
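The Spanner paper calls the mechanism behind this guarantee "commit wait": a write is stamped with a timestamp no earlier than the upper bound of the local clock interval, and the server deliberately waits out the remaining uncertainty before making the write visible, so any write that starts later, anywhere in the world, is assigned a larger timestamp. Here is a toy sketch of that rule, reusing the TrueTimeSketch above; the write_fn callback is a hypothetical stand-in for applying the write.

import time

tt = TrueTimeSketch()

def commit(write_fn):
    # Pick a timestamp at or beyond the current time on every clock.
    commit_ts = tt.now().latest
    write_fn(commit_ts)              # apply the write (hypothetical callback)
    # Commit wait: sleep until commit_ts is definitely in the past everywhere,
    # and only then report the write as committed to readers.
    while not tt.after(commit_ts):
        time.sleep(0.0001)
    return commit_ts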

‘By using highly accurate clocks and a very clever time API, Spanner allows 
server nodes to coordinate without a whole lot of communication.’

— Andy Gross

According to Andy Gross — the principal architect at Basho, an outfit that 
builds an open source database called Riak that’s designed to run across 
thousands of servers — database designers typically seek to synchronize 
information across machines by having them talk to each other. “You have to 
do a whole lot of communication to decide the correct order for all the 
transactions,” he told us this fall, when Spanner was first revealed.

The problem is that this communication can bog down the network — and the 
database. As Max Schireson — the president of 10gen, maker of the NoSQL 
database MongoDB — told us: “If you have large numbers of people accessing 
large numbers of systems that are globally distributed so that the delay in 
communications between them is relatively long, it becomes very hard to keep 
everything synchronized. If you increase those factors, it gets even harder.”

So Google took a completely different tack. Rather than struggle to improve 
communication between servers, it gave them a new way to tell time. “That was 
probably the coolest thing about the paper: using atomic clocks and GPS to 
provide a time API,” says Facebook’s Raghu Murty.

In harnessing time, Google can build a database that’s both global and 
consistent, but it can also make its services more resilient in the face of 
network delays, data-center outages, and other software and hardware snafus. 
Basically, Google uses Spanner to accurately replicate its data across multiple 
data centers — and quickly move between replicas as need be. In other words, 
the replicas are consistent too.

When one replica is unavailable, Spanner can rapidly shift to another. But it 
will also move between replicas simply to improve performance. “If you have one 
replica and it gets busy, your latency is going to be high. But if you have 
four other replicas, you can choose to go to a different one, and trim that 
latency,” Fikes says.
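A rough sketch of the client-side behavior Fikes describes might look like the following: the client keeps a latency estimate for each replica and reads from the fastest one that is still reachable. The replica names and latency numbers are invented for illustration.

# Hypothetical replica picker: skip replicas that are down or overloaded and
# choose the lowest-latency one that remains. All values here are made up.

recent_latency_ms = {
    "us-west": 12.0,
    "us-east": 40.0,
    "europe": 95.0,
}
unavailable = {"us-west"}  # e.g. an outage at the normally fastest replica

def pick_replica():
    candidates = {name: ms for name, ms in recent_latency_ms.items()
                  if name not in unavailable}
    if not candidates:
        raise RuntimeError("no replicas reachable")
    return min(candidates, key=candidates.get)

print(pick_replica())  # prints "us-east": traffic shifts without a human stepping in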

One effect, Fikes explains, is that Google spends less money managing its 
system. “When there are outages, things just sort of flip — client machines 
access other servers in the system,” he says. “It’s a much easier service 
story…. The system responds — and not a human.”

Spanning Google’s Footsteps

Some have questioned whether others can follow in Google’s footsteps — and 
whether they would even want to. When we spoke to Andy Gross, he guessed that 
even Google’s atomic clocks and GPS receivers would be prohibitively expensive 
for most operations.

Yes, rebuilding the platform would be a massive undertaking. Google has already 
spent four and a half years on the project, and Fikes — who helped build Google’s 
web history tool, its first product search service, and Google Answers, as well 
as BigTable — calls Spanner the most difficult thing he has ever worked on. 
What’s more, there are countless logistical issues that need dealing with.

‘The important thing to think about is that this is a service that is provided 
to the data center. The costs of that are amortized across all the servers in 
your fleet. The cost per server is some incremental amount — and you weigh that 
against the types of things we can do for that.’

— Andrew Fikes

As Fikes points out, Google had to install GPS antennas on the roofs of its 
data centers and connect them to the hardware below. And, yes, you do need two 
separate types of time keepers. Hardware always fails, and your time keepers 
must fail at, well, different times. “The atomic clocks provide stability if 
there is a GPS issue,” he says.

But according to Fikes, these are relatively inexpensive devices. The GPS units 
aren’t as cheap as those in your iPhone, but like Google’s atomic clocks, they 
cost no more than a few thousand dollars apiece. “They’re sort of in the order 
of the cost of an enterprise server,” he says, “and there are a lot of 
different vendors of these devices.” When we discussed the matter with Jeff 
Dean — one of Google’s primary infrastructure engineers and another name on the 
Spanner paper — he indicated much the same.

Fikes also makes a point of saying that the TrueTime service does not require 
specialized servers. The time keepers are kept in racks alongside the servers, and 
again, they need only connect to some machines in the data center.

“You can think of it as only a handful of these devices being in each data 
center. They’re boxes. You buy them. You plug them into your rack. And you’re 
going to connect to them over Ethernet,” Fikes says. “The important thing to 
think about is that this is a service that is provided to the data center. The 
costs of that are amortized across all the servers in your fleet. The cost per 
server is some incremental amount — and you weigh that against the types of 
things we can do for that.”

No, Spanner isn’t something every website needs today. But the world is moving 
in its general direction. Though Facebook has yet to explore something like 
Spanner, it is building a software platform called Prism that will run the 
company’s massive number-crunching tasks across multiple data centers.

Yes, Google’s ad system is enormous, but it benefits from Spanner in ways that 
could benefit so many other web services. The Google ad system is an online 
auction — where advertisers bid to have their ads displayed as someone searches 
for a particular item or visits particular websites — and the appearance of 
each ad depends on data describing the behavior of countless advertisers and 
web surfers across the net. With Spanner, Google can juggle this data on a 
global scale, and it can still keep the whole system in sync.

As Fikes put it, Spanner is just the first example of Google taking advantage 
of its new hold on time. “I expect there will be many others,” he says. He 
means other Google services, but there’s a reason the company has now shared 
its Spanner paper with the rest of the world.

Illustration by Ross Patton