-
Re: Which hardware to choose
Oleg Ruchovets 2012-10-02, 14:59
Great, thank you for such detailed information.

By the way, what type of disk controller do you use?

Thanks,
Oleg.

On Tue, Oct 2, 2012 at 6:34 AM, Alexander Pivovarov <[EMAIL PROTECTED]> wrote:
> Hello Oleg,
>
> Cloudera and Dell set up the following cluster for my company.
> The company receives 1.5 TB of raw data per day.
>
> 38 data nodes + 2 Name Nodes
>
> Data Node:
> Dell PowerEdge C2100 series
> 2 x Xeon X5670
> 48 GB RAM ECC (12 x 4 GB, 1333 MHz)
> 12 x 2 TB 7200 RPM SATA HDD (with hot swap), JBOD
> Intel Gigabit ET Dual Port PCIe x4
> Redundant power supply
> Hadoop CDH3
> max map tasks 24
> max reduce tasks 8
>
> The Name Node and Secondary Name Node are similar, but with:
> 96 GB RAM (not sure why)
> 6 x 600 GB 15K RPM Serial Attached SCSI
> RAID 10
>
> Another config is here, on page 298:
> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA298&lpg=PA298&dq=hadoop+jbod&source=bl&ots=i7xVQBPb_w&sig=8mhq-MtpkRcTiRB1ioKciMxIasg&hl=en&sa=X&ei=AGtqUMK6D8T10gHD4ICQAQ&ved=0CEMQ6AEwAg#v=onepage&q=hadoop%20jbod&f=false
>
> You probably need just 1 computer with 10 x 2 TB SATA HDD.
>
> On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:
> > Hi,
> > We are at a very early stage of our Hadoop project and want to do a POC.
> >
> > We have ~5-6 terabytes of raw data and we are going to execute some aggregations.
> >
> > We plan to use 8-10 machines.
> >
> > Questions:
> >
> > 1) Which hardware should we use?
> >    a) How many disks, and which disks are better to use?
> >    b) How much RAM?
> >    c) How many CPUs?
> >
> > 2) Please share best practices and tips/tricks for making good use of hardware in Hadoop projects.
> >
> > Thanks in advance,
> > Oleg.
-
Re: Which hardware to choose
Alexander Pivovarov 2012-10-02, 18:55
Not sure. The following options are available:

Integrated ICH10R on motherboard
LSI® 6Gb SAS2008 daughtercard
Dell PERC H200
Dell PERC H700
LSI MegaRAID® SAS 9260-8i

http://www.dell.com/us/enterprise/p/poweredge-c2100/pd

On Tue, Oct 2, 2012 at 10:59 AM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:
> Great, thank you for such detailed information.
>
> By the way, what type of disk controller do you use?
>
> Thanks,
> Oleg.
-
Re: Which hardware to choose
hadoopman 2012-10-03, 01:14
Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's right? That sounds VERY low for a cluster that size.
We have only 10 C2100s and are running, I believe, 140 map and 70 reduce slots so far, with pretty decent performance.
On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
> 38 data nodes + 2 Name Nodes
>
> Data Node:
> Dell PowerEdge C2100 series
> 2 x XEON x5670
> 48 GB RAM ECC (12x4GB 1333MHz)
> 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD
> Intel Gigabit ET Dual port PCIe x4
> Redundant Power Supply
> Hadoop CDH3
> max map tasks 24
> max reduce tasks 8
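The question above comes down to per-node versus cluster-wide slot counts. A rough sketch of the arithmetic follows. It assumes the 24/8 figures are per node (which the original poster states elsewhere in the thread) and that the 140/70 figures are cluster-wide totals spread evenly over 10 identical nodes, which is only an assumption. The function names are made up for illustration.

```python
def cluster_slots(nodes, map_per_node, reduce_per_node):
    """Total (map, reduce) slots for a cluster of identical nodes."""
    return nodes * map_per_node, nodes * reduce_per_node


def per_node_slots(nodes, map_total, reduce_total):
    """Average (map, reduce) slots per node, given cluster-wide totals."""
    return map_total / nodes, reduce_total / nodes


# 38 data nodes configured with 24 map and 8 reduce slots each.
large_cluster = cluster_slots(38, 24, 8)
# 10 nodes sharing 140 map and 70 reduce slots (assumed cluster-wide totals).
small_cluster = per_node_slots(10, 140, 70)

print(large_cluster)  # (912, 304)
print(small_cluster)  # (14.0, 7.0)
```

Read this way, the 38-node cluster is not configured low at all: it has more slots per node than the 10-node one.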
-
Re: Which hardware to choose
Michael Segel 2012-10-03, 01:40
I think he's saying that it's 24 map and 8 reduce slots per node, and at 48GB that could be too many mappers. Especially if they want to run HBase.
On Oct 2, 2012, at 8:14 PM, hadoopman <[EMAIL PROTECTED]> wrote:
> Only 24 map and 8 reduce tasks for 38 data nodes? are you sure that's right? Sounds VERY low for a cluster that size.
>
> We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance.
-
Re: Which hardware to choose
Marcos Ortiz 2012-10-03, 01:57
What would be a reasonable number for this hardware?

On 10/02/2012 09:40 PM, Michael Segel wrote:
> I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers.
> Especially if they want to run HBase.

--
Marcos Luis Ortíz Valmaseda
*Data Engineer && Sr. System Administrator at UCI*
about.me/marcosortiz <http://about.me/marcosortiz>
My Blog <http://marcosluis2186.posterous.com>
Tumblr's blog <http://marcosortiz.tumblr.com/>
@marcosluis2186 <http://twitter.com/marcosluis2186>

10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci
-
Re: Which hardware to choose
Michael Segel 2012-10-03, 02:35
Ah, that's the $64,000.00 (USD) question...

I tend to be conservative, so this should be a good starting point.

You start with two things: the amount of memory available and the number of physical cores.

Subtract a core for each main process, e.g. DN (DataNode), TT (TaskTracker), and RS (RegionServer) if you're running HBase.
Take the remaining cores and, if you're running on Intel with Hyper-Threading, multiply them by 2. That's the max number of slots you should use when configuring Hadoop.

Note: For each slot, you should have at least 1GB of memory. You may want to plan on 2GB per slot, so that your child JVM opts (mapred.child.java.opts) can go up to 2GB before you have to reduce the number of slots.

So if you have dual hexa-core CPUs and run HBase, it looks like the following:
12 cores less DN, TT, and RS = 9 cores; 9 * 2 = 18 slots, which can be a mix of mappers and reducers.

That's a good starting position, and you can ramp it up based on what you observe. YMMV, of course.

Note: When I run HBase, I don't want any swapping, so you have to pay attention to the amount of memory on the system and how it's being allocated.

:-)

On Oct 2, 2012, at 8:57 PM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
> Which is a reasonable number in this hardware?
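The rule of thumb in this message can be written down as a small calculation. This is only a sketch: the function names are hypothetical, the 2 GB per slot default follows the planning figure suggested above, and applying the same rule to a node without HBase (two daemons instead of three) is an extrapolation rather than something stated in the message.

```python
def max_slots(physical_cores, daemon_count, hyperthreading=True):
    """Upper bound on task slots per node under the rule of thumb above."""
    # One core is set aside for each main daemon (DN, TT, RS, ...).
    free_cores = max(physical_cores - daemon_count, 0)
    # With Hyper-Threading, each remaining core counts as two slots.
    return free_cores * 2 if hyperthreading else free_cores


def slot_memory_gb(slots, gb_per_slot=2):
    """Memory the slots would claim if every child JVM used its full budget."""
    return slots * gb_per_slot


# Dual hexa-core node running DataNode, TaskTracker and RegionServer.
with_hbase = max_slots(12, 3)
# Same node without HBase: only DataNode and TaskTracker take a core.
without_hbase = max_slots(12, 2)

print(with_hbase, slot_memory_gb(with_hbase))        # 18 slots, 36 GB
print(without_hbase, slot_memory_gb(without_hbase))  # 20 slots, 40 GB
```

By this estimate, the 24 map + 8 reduce = 32 slots configured per node earlier in the thread sit well above the conservative starting point, which fits the remark that it could be too many mappers.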
-
Re: Which hardware to choose
Alexander Pivovarov 2012-10-03, 03:42
All configs are per node. No HBase; only Hive and Pig are installed.
On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers.
> Especially if they want to run HBase.
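For readers wondering where such per-node limits live: in MRv1 releases such as CDH3 they are normally set per TaskTracker in mapred-site.xml. The thread never names the properties, so mapping "max map tasks 24" and "max reduce tasks 8" onto the two properties below is an assumption. The sketch just renders the fragment; the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def render_mapred_site(properties):
    """Render a dict of property names and values as a Hadoop <configuration> fragment."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed mapping of the per-node figures from the thread onto MRv1 properties.
xml_text = render_mapred_site({
    "mapred.tasktracker.map.tasks.maximum": 24,
    "mapred.tasktracker.reduce.tasks.maximum": 8,
})
print(xml_text)
```

Because these are TaskTracker-level settings, each node can in principle carry its own values, which is why "per node" is the natural reading of the figures.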
-
Re: Which hardware to choose
J. Rottinghuis 2012-10-03, 17:09
Of course it all depends... But something like this could work:
Leave 1-2 GB for the kernel, page cache, tools, overhead, etc.
Plan 3-4 GB each for the DataNode and TaskTracker.
Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot.
Have 2-3 times as many mappers as reducers (depending on the kinds of jobs you run).
As Michael pointed out, the ratio of cores (hyperthreads) per disk matters.
With those initial rules of thumb you'd arrive somewhere between 10 mappers + 5 reducers and 9 mappers + 4 reducers per node.
Try, test, measure, adjust, rinse, repeat.
Cheers,
Joep
On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov <[EMAIL PROTECTED]> wrote:
> All configs are per node.
> No HBase, only Hive and Pig installed
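Joep's memory budget can be checked with a few lines of arithmetic. This is a sketch under assumptions: the function and parameter names are invented, the 48 GB figure is the data node RAM quoted earlier in the thread, and the particular pairing of low and high values is just one combination that reproduces the two end points he gives.

```python
def plan_slots(total_gb, os_gb, daemon_gb, per_slot_gb, maps_per_reduce=2, daemons=2):
    """Split the memory left after OS and daemons into (mappers, reducers)."""
    usable_gb = total_gb - os_gb - daemon_gb * daemons
    slots = int(usable_gb // per_slot_gb)
    # Roughly maps_per_reduce mappers for every reducer.
    reducers = slots // (maps_per_reduce + 1)
    mappers = slots - reducers
    return mappers, reducers


# 2 GB for the OS, 4 GB each for DataNode and TaskTracker, 2.5 GB per slot.
generous = plan_slots(48, os_gb=2, daemon_gb=4, per_slot_gb=2.5)
# 1 GB for the OS, 3 GB each for the daemons, 3 GB per slot.
roomy_slots = plan_slots(48, os_gb=1, daemon_gb=3, per_slot_gb=3)

print(generous)     # (10, 5)
print(roomy_slots)  # (9, 4)
```

Other combinations within the stated ranges give slightly different counts, which is why the result is best treated as a starting point to measure against.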
-
Re: Which hardware to choose
Michael Segel 2012-10-03, 17:21
Well...
If you're not running HBase, you're less harmed by minimal swapping, so you could push the number of slots and oversubscribe. The only thing I would suggest is that you monitor your system closely as you adjust the number of slots.
You have to admit, though, it's fun to tune the cluster. :-)
On Oct 3, 2012, at 12:09 PM, J. Rottinghuis <[EMAIL PROTECTED]> wrote:
> Of course it all depends...
> But something like this could work:
>
> Leave 1-2 GB for the kernel, pagecache, tools, overhead etc.
> Plan 3-4 GB for Datanode and Tasktracker each
>
> Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot.
> Have 2-3 times as many mappers as reducers (depending on the kinds of jobs you run).
>
> As Michael pointed out the ratio of cores (hyperthreads) per disk matters.
>
> With those initial rules of thumb you'd arrive somewhere between
> 10 mappers + 5 reducers
> and
> 9 mappers + 4 reducers
>
> Try, test, measure, adjust, rinse, repeat.
>
> Cheers,
>
> Joep
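One way to watch for swapping while adjusting slot counts, assuming Linux nodes, is to compare the cumulative `pswpin` and `pswpout` counters from /proc/vmstat before and after a test run. The sketch below uses made-up sample text instead of reading the file, and the function names are hypothetical; on a real node the text would come from reading /proc/vmstat.

```python
def parse_swap_counters(vmstat_text):
    """Extract cumulative swap-in/swap-out page counts from /proc/vmstat content."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("pswpin", "pswpout"):
            counters[fields[0]] = int(fields[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two samples."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


# Hypothetical samples taken before and after a job run.
sample_before = "nr_free_pages 1000\npswpin 120\npswpout 300\n"
sample_after = "nr_free_pages 900\npswpin 120\npswpout 450\n"

delta = swap_delta(parse_swap_counters(sample_before),
                   parse_swap_counters(sample_after))
print(delta)  # {'pswpin': 0, 'pswpout': 150}
```

A delta that keeps growing as slots are added is a sign the node is oversubscribed beyond what the jobs tolerate.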
-
Re: Which hardware to choose
Russell Jurney 2012-10-03, 01:19
I believe he means per node.

Russell Jurney http://datasyndrome.com

On Oct 2, 2012, at 6:15 PM, hadoopman <[EMAIL PROTECTED]> wrote:
> Only 24 map and 8 reduce tasks for 38 data nodes? are you sure that's right? Sounds VERY low for a cluster that size.
>
> We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance.
-
Re: Which hardware to choose
hadoopman 2012-10-03, 01:33
Had to ask :D

On 10/02/2012 07:19 PM, Russell Jurney wrote:
> I believe he means per node.
>
> Russell Jurney http://datasyndrome.com