Re: running a job on single-node setup takes less time than running on a cluster
nagarjuna kanamarlapudi 2012-08-22, 03:46
Yes, what you have observed is expected.
On a single-node cluster everything is local: network transfer and the
other distributed overheads vanish. Try increasing the data size and you
will see the effect of parallel JVMs on the job time.
In your single-node cluster, you have one JVM and everything is local.
In a multi-node cluster there are multiple JVMs, and the mapper output has
to be copied to the reducers.
Comparing the two situations, your small data set probably did not reach
the threshold where you would observe the benefit of a multi-node cluster.
Try increasing the data size and you will see wonders. I have worked
on terabytes of data for table joins, and it worked amazingly well.
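As a rough illustration of the trade-off described above, here is a toy cost model in Python. It is only a sketch: the function name and every constant in it (throughput, startup cost, per-node coordination cost, shuffle fraction, network speed) are made-up placeholders, not measurements from any real Hadoop cluster. It only shows why fixed overheads dominate for a few MB of input while parallelism wins for very large inputs.

```python
def estimated_job_seconds(data_mb, nodes,
                          per_node_mb_per_s=50.0,         # hypothetical processing rate per node
                          startup_s=5.0,                  # hypothetical fixed job startup cost
                          per_extra_node_overhead_s=3.0,  # hypothetical coordination cost per extra node
                          shuffle_fraction=0.2,           # hypothetical share of data copied to reducers
                          network_mb_per_s=100.0):        # hypothetical network throughput
    """Toy estimate of job time; all parameters are illustrative assumptions."""
    # Processing is assumed to parallelize perfectly across nodes.
    compute = data_mb / (per_node_mb_per_s * nodes)
    if nodes == 1:
        # Single node: everything is local, so no network copy or coordination.
        network = 0.0
        coordination = 0.0
    else:
        # Multi-node: mapper output is copied to reducers over the network.
        network = shuffle_fraction * data_mb / network_mb_per_s
        coordination = per_extra_node_overhead_s * (nodes - 1)
    return startup_s + coordination + compute + network


small_mb = 1.7 + 4.2      # the two tables from the original question
large_mb = 1_000_000.0    # roughly a terabyte

small_single = estimated_job_seconds(small_mb, nodes=1)
small_cluster = estimated_job_seconds(small_mb, nodes=4)
large_single = estimated_job_seconds(large_mb, nodes=1)
large_cluster = estimated_job_seconds(large_mb, nodes=4)

print("small input:", small_single, "vs", small_cluster)
print("large input:", large_single, "vs", large_cluster)
```

With these placeholder numbers the single node wins on the small input and the four-node cluster wins on the large one; the crossover point on a real cluster depends entirely on its actual overheads.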
On Tue, Aug 21, 2012 at 12:01 AM, Mahsa Mofidpoor <[EMAIL PROTECTED]> wrote:
> Thanks, Saurabh.
> On Mon, Aug 20, 2012 at 12:15 PM, Saurabh bhutyani <[EMAIL PROTECTED]> wrote:
>> Dear Mahsa,
>> You need to increase the data size to benefit from Hadoop. Basically,
>> Hadoop creates input splits based on the configured split size, the
>> default being 64 MB. So if an input file is smaller than 64 MB, it fits
>> in a single split and is processed by only one map task.
>> Thanks & Regards,
>> Saurabh Bhutyani
>> Call : 9820083104
>> Gtalk: [EMAIL PROTECTED]
>> On Mon, Aug 20, 2012 at 6:33 PM, Mahsa Mofidpoor <[EMAIL PROTECTED]> wrote:
>>> I ran a simple join (select col_list from table1 join table2 on
>>> (join_condition)) on both single-node and multi-node setups. The table
>>> sizes are 1.7 MB and 4.2 MB respectively. It takes more time to execute
>>> the query on the cluster than to run it on a single-node Hadoop setup.
>>> I checked the map logs and saw that both map tasks ran on the master.
>>> Do I need to increase the data in order to benefit from the multi-node
>>> setup?
>>> How can I make sure that my data is distributed across all the nodes?
>>> Thank you in advance for your assistance.
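
To make the split-size point from this thread concrete, here is a minimal Python sketch that estimates how many input splits (and therefore map tasks) each table would produce. It assumes the simple rule "one split per 64 MB of a file, at least one split per non-empty file"; the helper name `estimate_splits` is hypothetical, and the real count depends on the InputFormat and configuration in use.

```python
import math

MB = 1024 * 1024


def estimate_splits(file_size_bytes, split_size_bytes=64 * MB):
    """Rough number of input splits for one file: ceil(size / split size).

    Simplified assumption: ignores InputFormat-specific behaviour such as
    combining small files or slack around the last split.
    """
    if file_size_bytes <= 0:
        return 0
    return math.ceil(file_size_bytes / split_size_bytes)


# Table sizes taken from the original question.
table_sizes_mb = {"table1": 1.7, "table2": 4.2}

splits = {name: estimate_splits(int(size * MB))
          for name, size in table_sizes_mb.items()}
total_map_tasks = sum(splits.values())

print(splits)
print("estimated map tasks:", total_map_tasks)
```

Under this assumption each table yields a single split, which is consistent with only two map tasks being observed; with so few tasks there is little work to spread across additional nodes.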