-Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
Jonathan Gray 2009-04-15, 15:22
I agree with you, Andy.
This seems to be a great look into what Hadoop MapReduce is not good at.
Over in the HBase world, we constantly deal with comparisons like this to
RDBMSs, trying to determine if one is better than the other. It's a false
choice and completely depends on the use case.
Hadoop is not suited for random access, joins, dealing with subsets of
your data; ie. it is not a relational database! It's designed to
distribute a full scan of a large dataset, placing tasks on the same nodes
as the data its processing. The emphasis is on task scheduling, fault
tolerance, and very large datasets, low-latency has not been a priority.
There are no "indexes" to speak of, it's completely orthogonal to what it
does, so of course there is an enormous disparity in cases where that
makes sense. Yes, B-Tree indexes are a wonderful breakthrough in data
In short, I'm using Hadoop (HDFS and MapReduce) for a broad spectrum of
applications including batch log processing, web crawling, and number of
machine learning and natural language processing jobs... These may not be
tasks that DBMS-X or Vertica would be good at, if even capable of them,
but all things that I would include under "Large-Scale Data Analysis".
Would have been really interesting to see how things like Pig, Hive, and
Cascading would stack up against DBMS-X/Vertica for very complex,
multi-join/sort/etc queries, across a broad spectrum of use cases and
There are a wide variety of solutions to the problems out there. It's
important to know the strengths and weaknesses of each, so a bit
unfortunate that this paper set the stage as it did.
On Wed, April 15, 2009 6:44 am, Andy Liu wrote:
> Not sure if comparing Hadoop to databases is an apples to apples
> comparison. Hadoop is a complete job execution framework, which
> collocates the data with the computation. I suppose DBMS-X and Vertica do
> that to some certain extent, by way of SQL, but you're restricted to that.
> If you want
> to say, build a distributed web crawler, or a complex data processing
> pipeline, Hadoop will schedule those processes across a cluster for you,
> while Vertica and DBMS-X only deal with the storage of the data.
> The choice of experiments seemed skewed towards DBMS-X and Vertica. I
> think everybody is aware that Map-Reduce is inefficient for handling
> queries and joins.
> It's also worth noting that I think 4 out of the 7 authors either
> currently or at one time work with Vertica (or c-store, the precursor to
> On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio
> <[EMAIL PROTECTED]>wrote:
>> (Hadoop is used in the benchmarks)
>> There is currently considerable enthusiasm around the MapReduce
>> (MR) paradigm for large-scale data analysis . Although the
>> basic control ï¬ow of this framework has existed in parallel SQL
>> database management systems (DBMS) for over 20 years, some have called
>> MR a dramatically new computing model [8, 17]. In
>> this paper, we describe and compare both paradigms. Furthermore, we
>> evaluate both kinds of systems in terms of performance and de- velopment
>> complexity. To this end, we deï¬ne a benchmark con- sisting of a
>> collection of tasks that we have run on an open source version of MR as
>> well as on two parallel DBMSs. For each task, we measure each systemâs
>> performance for various degrees of par- allelism on a cluster of 100
>> nodes. Our results reveal some inter- esting trade-offs. Although the
>> process to load data into and tune the execution of parallel DBMSs took
>> much longer than the MR system, the observed performance of these DBMSs
>> was strikingly better. We speculate about the causes of the dramatic
>> performance difference and consider implementation concepts that future
>> sys- tems should take from both kinds of architectures.
>> msn: [EMAIL PROTECTED]