Apart from what Devin has suggested, there are other factors worth noting
when you are running your Hadoop cluster on virtual machines:
(1) How many map and reduce slots are there in the cluster?
Since you have not mentioned it, I am assuming the Hadoop 1.x default of
2 map slots and 2 reduce slots per TaskTracker. With a 4-node Hadoop
cluster, that gives a total of 8 map slots and 8 reduce slots.
What does that mean?
It means that at any one time only 8 map tasks and 8 reduce tasks will run
in parallel on your cluster, and the other tasks have to wait.
(2) You have not mentioned whether the 30GB of data is made up of a lot of
smaller files (less than the block size) or of bigger files, so let us do
a simple calculation assuming a single 30GB file and a block size of
64MB:
30GB = 30 * 1024 * 1024 * 1024 = 32212254720 bytes
64MB = 64 * 1024 * 1024 = 67108864 bytes
Total number of blocks the data will be broken into =
32212254720 / 67108864 = 480 blocks
This means you will be running 480 map tasks (keeping in mind that the
input split size = block size). But since you have only 8 map slots, only
8 map tasks will run at a time and the others will be pending.
Assuming all 8 map tasks in a batch finish at the same time, you will have
480 / 8 = 60 map waves.
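The arithmetic in points (1) and (2) can be checked with a few lines of
Python. This is only a sketch under the same assumptions as above (one
30GB file, split size equal to block size, 8 map slots, all tasks in a
wave finishing together); the function name map_waves is made up for
illustration.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def map_waves(input_bytes, block_bytes, map_slots):
    """Return (map_tasks, waves) for a single input file.

    Assumes one map task per block (input split size == block size)
    and that every wave fills all available map slots.
    """
    # Round up: a partial last block still needs its own map task.
    map_tasks = (input_bytes + block_bytes - 1) // block_bytes
    # Round up again: a partial last wave still takes one full wave.
    waves = (map_tasks + map_slots - 1) // map_slots
    return map_tasks, waves


tasks, waves = map_waves(30 * GB, 64 * MB, map_slots=8)
print(tasks, waves)  # 480 60
```

Under the same assumptions, calling map_waves(30 * GB, 128 * MB, map_slots=8)
returns (240, 30), so a larger block size halves the number of waves.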
(3) Each task runs in a separate JVM. That is to say, for every task a JVM
is created, and after the task has finished the JVM is torn down. Creating
and destroying JVMs is also a bottleneck, so try reusing the same JVM.
There is an option that lets you reuse the JVM (in Hadoop 1.x, the
mapred.job.reuse.jvm.num.tasks property).
(4) Since you are working with such big data, try using a combiner.
(5) Also try compressing the data and the intermediate output of the
mappers, as well as the reducer output:
---First try with a sequence file
---Then try with the Snappy compression codec
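As a rough sketch of points (3) and (5), the Python below assembles a
"hadoop jar" command line that passes the relevant Hadoop 1.x properties
as -D generic options. The property names are the Hadoop 1.x ones; the
values, the jar name, the driver class and the paths are assumptions for
illustration only. The -D options only take effect if the job driver goes
through ToolRunner/GenericOptionsParser, and whether SnappyCodec is
usable depends on your Hadoop build and its native libraries. The combiner
from point (4) is set in the job driver code and is not shown here.

```python
# Hadoop 1.x property names; the values are illustrative assumptions.
TUNING = {
    # (3) -1 lets one JVM be reused for any number of tasks of the job
    "mapred.job.reuse.jvm.num.tasks": "-1",
    # (5) compress the intermediate map output
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
    # (5) compress the final job output
    "mapred.output.compress": "true",
    "mapred.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}


def generic_options(props):
    """Turn a property dict into a flat list of -D key=value arguments."""
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


def hadoop_command(jar, main_class, job_args, props=TUNING):
    """Build the argument list: generic options go before the job args."""
    return (["hadoop", "jar", jar, main_class]
            + generic_options(props)
            + list(job_args))


# Hypothetical jar, class and paths, only to show the resulting shape.
print(" ".join(hadoop_command("myjob.jar", "com.example.MyJob",
                              ["/input", "/output"])))
```

It is worth changing one property at a time and timing the job after
each change, so that you can see which setting actually helps.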
If the pointers above bring the run time down to around 1 hour, then with
the same 4-node cluster and Hadoop running on separate physical machines
you should see the job complete in 15-30 minutes [please refer to
Devin's comments].
My suggestion is to get the best performance you can on your virtual
machines and then move to a real Hadoop cluster. You will for sure see the
performance improvement.
Som Shekhar Sharma
On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX <[EMAIL PROTECTED]> wrote:
> One of the problems you run into with Hadoop in Virtual Machine
> environments is performance issues when they are all running on the same
> physical host. With a VM, even though you are giving them 4 GB of RAM, and
> a virtual CPU and disk, if the virtual machines are sharing physical
> components like processor and physical storage medium, they compete for
> resources at the physical level. Even if you have the VM on a single host,
> or on a multi-core host with multiple disks and they are sharing as few
> resources as possible, there will still be a performance hit when the VM
> information has to pass through the hypervisor layer - co-scheduling
> resources with the host and things like that.
> Does that make sense?
> It's generally accepted that due to these issues, Hadoop in virtual
> environments does not offer the same performance benefits as a physical
> Hadoop cluster. It can be used pretty well with even low-quality hardware
> though, so maybe you can acquire some used desktops and install your
> favorite Linux flavor on them and make a cluster - some people have even
> run Hadoop on Raspberry Pi clusters.
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
> On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <[EMAIL PROTECTED]> wrote:
>> I know this is foolish of me to ask, because there are a lot of factors
>> that affect this,
>> but why is it taking so much time? Can anyone suggest possible reasons
>> for it, or has anyone faced such an issue before?
>> Nikhil Kandoi
>> P.S. – I am using Hadoop-1.0.3 for this application, so I wonder if this