Hadoop is not well-suited to the online, interactive data processing required for truly real-time data insights. Or is it?
Apache Hadoop,
the open source software framework at the heart of big data, is a batch
computing engine. It is not well-suited to the online, interactive data
processing required for truly real-time data insights.
Or is it? Doug Cutting, creator of Hadoop and founder of the Apache Hadoop Project (and chief architect at Cloudera), says he believes Hadoop has a future beyond batch.
"I think batch has its place," Cutting says. "If you're moving bulk
amounts of data and you need to really analyze everything, that's not
about interactive. But the combination of batch and online computation
is what I think people will really appreciate."
"I really see Hadoop becoming the kernel of the mainstream data processing system that businesses will be using," he adds.
Where Hadoop stands now
Speaking at the O'Reilly Strata Conference + Hadoop World in New York
City, Cutting explains his thoughts on the core themes of the Hadoop
stack and where it's heading.
"Hadoop is known as a batch computing engine and indeed that's where
we started, with MapReduce," Cutting says. "MapReduce is a wonderful
tool. It's a simple programming metaphor that has found many
applications. There are books on how to implement a variety of
algorithms on MapReduce."
MapReduce is a programming model, designed by
Google
for batch processing massive datasets in parallel using distributed
computing. MapReduce takes an input and breaks it down into many smaller
sub-problems, which are distributed to nodes to process in parallel. It
then reassembles the answers to those sub-problems to form the output.
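The canonical illustration is a word count. The sketch below is a minimal version of that job against Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums those pairs for each word. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: break each input line into words, emitting (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```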
"It's also very efficient," Cutting says. "It permits you to move
your computation to your data, so you're not copying data around as
you're processing it. It also forms a shared platform. Building a
distributed system is a complicated process, not something you can do
overnight. So we don't want to have to re-implement it again and again.
MapReduce has proved itself a solid foundation. We've seen the
development of many tools on top of it such as Pig and Hive."
"But, of course, this platform is not just for batch computing," he adds. "It's a much more general platform, I believe."
Defining characteristics of the Hadoop platform
To illustrate this, Cutting lays out what he considers the two core
themes of Hadoop as it exists today, together with a few other things
that he considers matters of "style."
First and foremost, he says, the Hadoop platform is defined by its
scalability. It works just fine on small datasets stored in memory, but
it is capable of scaling massively to handle huge datasets.
"A big component of scalability that we don't hear a lot talked about
is affordability," he says. "We run on commodity hardware because it
allows you to scale further. If you can buy 10 times the amount of
storage per dollar, then you can store 10 times the amount of data per
dollar. So affordability is key, and that's why we use commodity
hardware, because it is the most affordable platform."
Just as important, he notes, Hadoop is open source.
"Similarly, open source software is very affordable," he adds. "The
core platform that folks develop their applications against is free. You
may pay vendors, but you pay vendors for the value they deliver, you
don't keep paying them year after year even though you're not getting
anything fundamentally new from them. Vendors need to earn your trust
and earn your confidence by providing you with value over time."
Beyond that, he says, there are what he considers elements of Hadoop's style.
"There's this notion that you don't need to constrain your data with a
strict schema at the time you load it," he says. "Rather, you can
afford to save your data in a raw form and then, as you use it, project
it to various schemas. We call this schema on read.
"Another popular theme in the big data space is that oftentimes simply
having more data is a better way to understand your problem than having
a more clever algorithm. It's often better to spend more time gathering
data than to fine-tune your algorithm on a smaller dataset.
Intuitively, this is much like having a higher-resolution image. If
you're going to try to analyze it, you'd rather zoom in on the
high-resolution image than the low-resolution image."
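To make the first of those ideas concrete, here is a rough sketch of schema on read in plain Java; the field layout and record types are invented for illustration and don't correspond to any particular Hadoop API. The bytes on disk stay raw and untyped, and each consumer projects only the columns it needs when it reads.

```java
// A sketch of "schema on read": the file on disk is just raw delimited text,
// and each consumer projects its own view of the fields at read time.
public class SchemaOnRead {

  // One reader treats each line as a (userId, url) clickstream record...
  record Click(String userId, String url) {}

  // ...while another projects the same raw line as a (timestamp, bytes) record.
  record Transfer(long timestampMillis, long bytesSent) {}

  static Click asClick(String rawLine) {
    String[] f = rawLine.split("\t");
    return new Click(f[1], f[2]); // project only the columns this job needs
  }

  static Transfer asTransfer(String rawLine) {
    String[] f = rawLine.split("\t");
    return new Transfer(Long.parseLong(f[0]), Long.parseLong(f[3]));
  }

  public static void main(String[] args) {
    // A raw line: timestamp, user, url, bytes -- no schema was enforced on write.
    String raw = "1351000000000\tu42\t/index.html\t2048";
    System.out.println(asClick(raw));
    System.out.println(asTransfer(raw));
  }
}
```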
HBase is an example of online computing in Hadoop
Batch processing, he notes, is not a defining characteristic of Hadoop. As proof, he points to
Apache HBase,
the highly successful open source, nonrelational distributed
database, modeled on Google's BigTable, that is part of the Hadoop stack.
HBase is an online computing system, not a batch computing system.
"It performs interactive puts and gets of individual values," Cutting
explains. "But it also supports batch. It shares storage with HDFS and
with every other component of the stack. And I think that's really
what's led to its popularity. It's integrated into the rest of the
system. It's not a separate system on the side that you need to move
data in and out of. It can share other aspects of the stack: It can
share availability, security, disaster recovery. There's a lot of room
to permit folks to only have one copy of their data and only one
installation of this technology stack."
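For a sense of what those interactive puts and gets look like, here is a minimal sketch using the HBase Java client (the Connection/Table API from later HBase releases); the table name, column family, and values are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOnlineAccess {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Interactive put: write a single cell for one row key.
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("user42@example.com"));
      table.put(put);

      // Interactive get: fetch that value right back -- no batch job involved.
      Get get = new Get(Bytes.toBytes("user42"));
      Result result = table.get(get);
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}
```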
Looking ahead to the Hadoop holy grail
But if Hadoop is not defined by batch, if it is going to be a more
general data processing platform, what will it look like and how will it
get there?
"I think there are a number of things we'd like to see in the sort of
"Holy Grail" big data system," Cutting says. "Of course we want it to
be open source, running on commodity hardware. We also want to see
linear scaling: If you need to store ten times the data, you'd like to
just buy ten times the hardware and have that work automatically, no
matter how big your dataset gets."
"Similarly with performance," Cutting says, "if you need greater batch
throughput or shorter batch latency, you'd like to just increase the
amount of hardware. The same thing holds for interactive queries.
Increased hardware should give you linear scalability in both
performance and in the magnitude of the data processed."
"There are other things we'd like to see," he adds. "We'd like to see
complex transactions, joins, a lot of technologies which this platform
has lacked. I think, classically, folks have believed that they weren't
ever going to be present in this platform, that when you adopted a big
data platform, you were giving up certain things. I don't think that's
the case. I think there's very little that we're going to have to give
up in the long term."
Google provided a map
The reason, Cutting says, is that Google has shown the way to establish these elements in the Hadoop stack.
"Google has given us a map," he says. "We know where we're going.
They started out publishing their GFS and MapReduce papers, which we
quickly cloned in the Hadoop Project. Through the years, Google has
produced a succession of publications that have in many ways inspired
the open source stack. The Sawzall system was a precursor to Pig and
Hive; BigTable directly inspired HBase, and so on. And I was very
excited to see Google publish a paper this year called Spanner, about a
system that implements transactions in a distributed system: multitable
transactions running on a database at a global scale. This is something
that I think a lot of us didn't think we'd see anytime soon, and it
really helps us to see that the sky's the limit for this platform."
Spanner, Cutting notes, is complicated technology and no one should
expect to see it as part of Hadoop next spring. But it provides a route
to the Holy Grail, he says. In the meantime, he points to Impala, a new
database engine released by Cloudera at the conference this week, which
can query datasets stored in HDFS or HBase using SQL.
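Because Impala speaks the same wire protocol as HiveServer2, a client can reach it through standard JDBC. The sketch below is illustrative only: the host, port, authentication settings, and the page_views table are all assumptions about a particular cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
  public static void main(String[] args) throws Exception {
    // Explicitly load the Hive JDBC driver; newer driver jars self-register.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Impala listens on port 21050 by default; auth settings vary by cluster.
    String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         // "page_views" is a hypothetical table over files already in HDFS.
         ResultSet rs = stmt.executeQuery(
             "SELECT url, COUNT(*) AS hits FROM page_views "
             + "GROUP BY url ORDER BY hits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```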
"Impala is a huge step down this path toward the Holy Grail," he
says. "Now, no longer can you [only] do online puts and gets of values,
you can do online queries interactively with Impala. And Impala follows
some work from Google, again, that was published a few years ago, and
it's very exciting. It's a fundamental new capability in this platform
that I think is a tremendously valuable step on its own and will help
you build more and better applications on this platform. But also I
think it helps to make this point, that this platform isn't a niche. It
isn't a one-point technology. It's a general purpose platform."
"We know where we're going with it," Cutting says, "and moreover we
know how to get there in many cases. So I encourage you to be
comfortable adopting it now and know that you can expect more in it
tomorrow. We're going to keep this thing advancing."