-
running a job on single-node setup takes less time than running on a cluster
Mahsa Mofidpoor 2012-08-20, 13:03
Hello,
I ran a simple join (select col_list from table1 join table2 on (join_condition)) on both a single-node and a multi-node setup. The table sizes are 1.7 MB and 4.2 MB respectively. The query takes longer to execute on the cluster than on the single-node Hadoop setup. I checked the map logs and saw that both map tasks ran on the master node. Do I need to increase the data size in order to benefit from the multi-node capacity? How can I make sure that my data is distributed across all the nodes?
Thank you in advance for your assistance.
Regards, Mahsa
-
Re: running a job on single-node setup takes less time than running on a cluster
Rahul Bhattacharjee 2012-08-20, 16:08
I have no answer to your questions, but I do have some questions of my own.
What tables are you talking about? Assuming you mean datasets/files when you say tables, why use Hadoop for such small tables?
-
Re: running a job on single-node setup takes less time than running on a cluster
Saurabh bhutyani 2012-08-20, 16:15
Dear Mahsa,
You need to increase the data size to benefit from Hadoop. Hadoop creates input splits based on a configured size, which defaults to 64 MB. So if an input file is smaller than 64 MB, it basically gets only one split and therefore only one map task.
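As a rough illustration of the split arithmetic described above, here is a minimal Python sketch. It assumes each non-empty input file is cut into chunks of at most the split size; the function name `estimate_map_tasks` is hypothetical, and the model ignores real Hadoop details such as minimum/maximum split settings or input formats that combine small files.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(file_sizes_bytes, split_size_bytes=64 * MB):
    """Rough estimate of map tasks: one split per split-size chunk of each file.

    Simplifying assumption (not Hadoop's exact logic): every non-empty file
    yields ceil(size / split_size) splits, and each split gets one map task.
    """
    return sum(
        math.ceil(size / split_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


# The two tables from this thread: 1.7 MB and 4.2 MB.
small_tables = [int(1.7 * MB), int(4.2 * MB)]
print(estimate_map_tasks(small_tables))

# A hypothetical 10 GB input file, for comparison.
large_input = [10 * 1024 * MB]
print(estimate_map_tasks(large_input))
```

Under these assumptions the two small tables give only two map tasks in total, which matches the two mappings seen in the logs, while a much larger input would be cut into many splits that can be spread over the cluster.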
Thanks & Regards, Saurabh Bhutyani
-
Re: running a job on single-node setup takes less time than running on a cluster
Mahsa Mofidpoor 2012-08-20, 18:31
Thanks, Saurabh.
-
Re: running a job on single-node setup takes less time than running on a cluster
nagarjuna kanamarlapudi 2012-08-22, 03:46
Dear Mahsa,
Yes, what you have observed is expected. On a single-node cluster, everything is local: there is no network transfer, and the overhead that comes with it vanishes. Try increasing the data size and you will see the effect of parallel JVMs on the job time.
In your single-node cluster, you have one JVM and everything is local. In a multi-node cluster, there are multiple JVMs, and the mapper output has to be copied to the reducer (network transfer).
Comparing the two situations, maybe your small data set did not reach the threshold where you would observe the benefit of a multi-node cluster.
Try increasing the data size and you will see wonders. I have worked on terabytes of data for table joins, and it worked amazingly well.
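To make the threshold argument above concrete, here is a toy cost model in Python. It is only a sketch: the function name and every number in it (per-node throughput, fixed startup overhead, per-node network overhead) are hypothetical values chosen for illustration, not measurements from any real cluster.

```python
def estimated_job_seconds(
    data_mb,
    nodes,
    per_node_mb_per_s=50.0,           # hypothetical processing rate per node
    fixed_overhead_s=20.0,            # hypothetical job startup cost
    network_overhead_s_per_node=5.0,  # hypothetical cost per extra node
):
    """Toy model: fixed overhead + coordination cost + parallel processing time."""
    # Extra nodes add network/coordination cost; a single node adds none.
    coordination = fixed_overhead_s + network_overhead_s_per_node * (nodes - 1)
    # Processing time shrinks as the work is divided across nodes.
    processing = data_mb / (per_node_mb_per_s * nodes)
    return coordination + processing


small_mb = 1.7 + 4.2        # the two tables from this thread
large_mb = 1024.0 * 1024.0  # about 1 TB

for label, size in (("small", small_mb), ("large", large_mb)):
    single = estimated_job_seconds(size, nodes=1)
    cluster = estimated_job_seconds(size, nodes=4)
    print(label, "single-node:", round(single, 1), "s, 4 nodes:", round(cluster, 1), "s")
```

With these made-up parameters the single node wins for a few megabytes, because the fixed and network overhead dominates, while the four-node cluster wins clearly once the input reaches the terabyte range.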
-
Re: running a job on single-node setup takes less time than running on a cluster
Mahsa Mofidpoor 2012-08-23, 16:19
Thank you very much.