Nick Jones
2012-09-05, 14:59
Mathias Herberts
2012-09-05, 15:12
zGreenfelder
2012-09-05, 14:56
D'Souza, Clive V
2012-09-05, 14:58
Michael Segel
2012-09-07, 14:00
prabhu K
2012-09-10, 07:40
Steve Loughran
2012-09-10, 09:40
Michael Segel
2012-09-10, 11:50
Gauthier, Alexander
2012-09-10, 16:17
Fabio Pitzolu
2012-09-05, 14:47
prabhu K
2012-09-05, 12:21
Chen He
2012-09-05, 14:03
Shailesh Dargude
2012-09-05, 14:14
Mohammad Tariq
2012-09-05, 14:22
Steve Loughran
2012-09-07, 09:12
Gulfie
2012-09-06, 20:52
Michael Segel
2012-09-10, 19:54
-
Re: One petabyte of data loading into HDFS with in 10 min.
Nick Jones 2012-09-05, 14:59
Since cost wasn't mentioned as a requirement...

An army of people mounting physical drives with the original dataset to the cluster of machines and M/R copying from local disk would likely be faster. There are also 40Gbps Infiniband solutions available. Also, the replication could be pushed to a separate network and would eventually achieve consistency (presumably not required in 10 mins), thus lowering the primary connection bandwidth requirement to 1PB.

On 09/05/2012 09:43 AM, Cosmin Lehene wrote:
> Here's an extremely naïve ballpark estimation: at theoretical hardware
> speed, for 3PB representing 1PB with 3x replication
>
> Over a single 1Gbps connection (and I'm not sure, you can actually
> reach 1Gbps)
> (3 petabytes) / (1 Gbps) = 291.271111 days
>
> So you'd need at least 40,000 1Gbps network cards to get that in 10
> minutes :) - (3PB/1Gbps)/40000
> <http://www.google.ro/search?client=safari&rls=en&q=%283PB/1Gbps%29/40000&ie=UTF-8&oe=UTF-8&redir_esc=&ei=2WRHUNWtGIWo0QW52oDYDw>
>
> The actual number of nodes would depend a lot on the actual network
> architecture, the type of storage you use (SSD, HDD), etc.
>
> Cosmin
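Cosmin's ballpark figures above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and it assumes that the 291.27-day figure comes from binary units (1 PB = 2^50 bytes, 1 Gbps = 2^30 bits/s), while the 40,000-card figure matches decimal units (1 PB = 10^15 bytes, 1 Gbps = 10^9 bits/s).

```python
import math

SECONDS_PER_DAY = 86_400


def transfer_seconds(num_bytes, link_bits_per_second):
    """Ideal time to push num_bytes over one link, ignoring all overhead."""
    return num_bytes * 8 / link_bits_per_second


def links_needed(num_bytes, link_bits_per_second, deadline_seconds):
    """Number of parallel links needed to finish within deadline_seconds."""
    single_link = transfer_seconds(num_bytes, link_bits_per_second)
    return math.ceil(single_link / deadline_seconds)


# Binary units (an assumption about what the online calculator used).
days_binary = transfer_seconds(3 * 2**50, 2**30) / SECONDS_PER_DAY

# Decimal units.
days_decimal = transfer_seconds(3 * 10**15, 10**9) / SECONDS_PER_DAY
cards_decimal = links_needed(3 * 10**15, 10**9, deadline_seconds=600)

print(f"{days_binary:.6f} days over one link (binary units)")
print(f"{days_decimal:.2f} days over one link (decimal units)")
print(f"{cards_decimal} x 1 Gbps cards for a 10 minute load")
```

Either convention lands in the same range, so the choice of units does not change the conclusion: a single gigabit link needs most of a year, and a 10 minute window needs tens of thousands of them running at full line rate.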
-
Re: One petabyte of data loading into HDFS with in 10 min.
Mathias Herberts 2012-09-05, 15:12
It greatly depends on the form this PB is stored in. If we're talking N files with N >> 1, then you might get better performance by sharding the import job across multiple boxes. If it's a single 1PB file then Infiniband might be your best bet, but it won't get you close to 10 minutes.
-
Re: One petabyte of data loading into HDFS with in 10 min.
zGreenfelder 2012-09-05, 14:56
ah, I went the other direction with the math, and assumed no replication (completely unsafe and never reasonable for a real, production environment, but since we're all theory and just looking for starting point numbers)

1PB in 10 min
= 1,000,000 GB in 10 min
= 8,000,000 Gb in 600 seconds
= 80,000/6 ~= 14k machines running at gigabit, or about 1.5k machines if you get 10Gb connected machines.

all assuming there's no network or cluster sync overhead (of course there would be)

that seems like some pretty deep pockets to get to < 10 minute load time for that much data.

I could also be off, I just threw some stuff together somewhat quickly between conf calls.

--
Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
-
RE: One petabyte of data loading into HDFS with in 10 min.
D'Souza, Clive V 2012-09-05, 14:58
Have you looked at using Infiniband fabric? You can get 4X higher throughput than 10GbE.

Regards,
-C
-
Re: One petabyte of data loading into HDFS with in 10 min.
Michael Segel 2012-09-07, 14:00
Sorry, but you didn't account for the network saturation.
And why 1GbE and not 10GbE? Also which version of Hadoop?

Here MapR works well with bonding two 10GbE ports and with the right switch, you could do ok. Also 2 ToR switches per rack, etc.

How many machines? 150? 300? more?

Then you don't talk about how much memory, CPUs, what type of storage... Lots of factors.

I'm sorry to interrupt this mental masturbation about how to load 1PB in 10min. There are a lot more questions that should be asked that weren't.

Hey but look. It's a Friday, so I suggest some pizza, beer and then take it to a white board.

But what do I know? In a different thread, I'm talking about how to tame HR and Accounting so they let me play with my team Ninja! :-P
-
Re: One petabyte of data loading into HDFS with in 10 min.
prabhu K 2012-09-10, 07:40
Hi Users,
Thanks for the response.

We have loaded 100GB of data into HDFS; it took 1 hour with the configuration below.

Each node (1 master machine, 2 slave machines):

1. 500 GB hard disk.
2. 4 GB RAM
3. 3 quad-core CPUs.
4. Speed 1333 MHz

Now we are planning to load 1 petabyte of data (a single file) into Hadoop HDFS and a Hive table within 10-20 minutes. For this we need clarification on the points below.

1. What system configuration is required for all 3 machines?
2. Hard disk size.
3. RAM size.
4. Motherboard
5. Network cable
6. How many Gbps of Infiniband are required.

Do we need a cloud computing environment for the same setup too?

Please suggest and help me on this.

Thanks,
Prabhu.
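One way to see the gap between the reported test and the stated goal is to extrapolate from the measured rate. The sketch below assumes decimal units and a constant rate of 100 GB per hour, as reported in the message above; the variable names are illustrative only.

```python
GB = 10**9
PB = 10**15

measured_bytes = 100 * GB      # reported test load
measured_seconds = 3600        # reported duration: 1 hour
target_bytes = 1 * PB
target_seconds = 10 * 60       # the 10 minute goal

measured_rate = measured_bytes / measured_seconds   # bytes per second
required_rate = target_bytes / target_seconds       # bytes per second

# Time to load 1 PB at the measured rate, and the speed-up the goal implies.
days_at_measured_rate = target_bytes / measured_rate / 86_400
speedup_needed = required_rate / measured_rate

print(f"measured: {measured_rate / 1e6:.1f} MB/s")
print(f"required: {required_rate / 1e9:.1f} GB/s")
print(f"1 PB at the measured rate: {days_at_measured_rate:.0f} days")
print(f"speed-up needed: {speedup_needed:,.0f}x")
```

Under these assumptions the test cluster ran at roughly 28 MB/s, and the 10 minute goal needs a sustained rate about 60,000 times higher.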
-
Re: One petabyte of data loading into HDFS with in 10 min.
Steve Loughran 2012-09-10, 09:40
On 10 September 2012 08:40, prabhu K <[EMAIL PROTECTED]> wrote:

> Hi Users,
>
> Thanks for the response.
>
> We have loaded 100GB of data into HDFS; it took 1 hour with the
> configuration below.
>
> Each node (1 master machine, 2 slave machines):
>
> 1. 500 GB hard disk.
> 2. 4 GB RAM
> 3. 3 quad-core CPUs.
> 4. Speed 1333 MHz
>
> Now we are planning to load 1 petabyte of data (a single file) into
> Hadoop HDFS and a Hive table within 10-20 minutes. For this we need
> clarification on the points below.
>
> 1. What system configuration is required for all 3 machines?
> 2. Hard disk size.

At least a petabyte, maybe three, unless you were planning to do some pre-storage processing, such as filtering or compressing the data, before the upload.

> 3. RAM size.
> 4. Motherboard
> 5. Network cable
> 6. How many Gbps of Infiniband are required.

yes.

> Do we need a cloud computing environment for the same setup too?
>
> Please suggest and help me on this.
>
> Thanks,

Prabhu, I don't think you've been reading the replies fully.

The data rate coming off the filtered CERN LHC experiments is 1.6 PB/month. Your "10 minute" upload is trying to handle two weeks' worth of CERN data in a fraction of the time.

Nobody can seriously point to your questions and say "this is the motherboard you need", as your project seems to have some unrealistic goals.

If you do want to do a 1PB upload in 10 minutes -or even, say, 30-60 minutes- the first actions in your project should be:

1. Come up with some realistic deliverables rather than a vague "1 PB/10 minute" requirement.
2. Include a realistic timetable as part of those deliverables.
3. Look at the data source(s) and work out how fast they can actually generate data off their hard disks, out of their database, or whatever. That's your maximum bandwidth irrespective of what you do with the data afterwards.
4. Hire someone who knows about these problems and how to solve them -or who at least is respected enough that when they say "you need realistic goals" they'd be believed.

Someone could set up a network to transfer 1 PB of data into a Hadoop cluster in 10 minutes, but it would be a bleeding edge exercise you'd end up writing papers about in VLDB or similar conferences. The cost of doing so would be utterly excessive unless you were planning to load (and then hopefully, discard) another PB in the next 10 minutes -and again, repeatedly. Otherwise you would be paying massive amounts for network bandwidth that you would only ever be using for ten minutes.

Asking for help on the -user list isn't going to solve your problems, as the "1 PB in 10 minutes" goal is the problem. Do you really need all that data? In 10 minutes? If so, then you're going to have to find someone who really, really knows about networking, disk IO bandwidth, cluster commissioning, etc. I'm not volunteering. I may have some colleagues you could talk to, but that -as with other people on this list- would be in the category of action 5, "pay for consultancy".

Sorry.
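To put the CERN comparison in numbers, the sketch below converts both rates to bytes per second. It assumes a 30-day month and decimal petabytes; the 1.6 PB/month figure is the one quoted in the message above, and the variable names are illustrative.

```python
PB = 10**15
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption made for this sketch

cern_rate = 1.6 * PB / (DAYS_PER_MONTH * SECONDS_PER_DAY)  # bytes per second
goal_rate = 1 * PB / (10 * 60)                              # bytes per second

# How many days of LHC output one petabyte represents, and how much
# faster than the LHC feed the 10 minute load would have to run.
days_of_cern_data = 1 * PB / cern_rate / SECONDS_PER_DAY
rate_ratio = goal_rate / cern_rate

print(f"CERN feed: {cern_rate / 1e6:.0f} MB/s")
print(f"10 minute goal: {goal_rate / 1e12:.2f} TB/s")
print(f"1 PB is about {days_of_cern_data:.1f} days of CERN data")
print(f"goal is about {rate_ratio:.0f}x the CERN rate")
```

With these assumptions one petabyte corresponds to a little under three weeks of filtered LHC output, and the 10 minute goal would have to run a few thousand times faster than that feed.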
-
Re: One petabyte of data loading into HDFS with in 10 min.
Michael Segel 2012-09-10, 11:50
On Sep 10, 2012, at 2:40 AM, prabhu K <[EMAIL PROTECTED]> wrote:

> Hi Users,
>
> Thanks for the response.
>
> We have loaded 100GB of data into HDFS; it took 1 hour with the configuration below.
>
> Each node (1 master machine, 2 slave machines):
>
> 1. 500 GB hard disk.
> 2. 4 GB RAM
> 3. 3 quad-core CPUs.
> 4. Speed 1333 MHz
>
> Now we are planning to load 1 petabyte of data (a single file) into Hadoop HDFS and a Hive table within 10-20 minutes. For this we need clarification on the points below.

Ok...

Some say that I am sometimes too harsh in my criticisms, so take what I say with a grain of salt...

You loaded 100GB in an hour using woefully underperforming hardware and are now saying you want to load 1PB in 10 mins.

I would strongly suggest that you first learn more about Hadoop. No really. Looking at your first machine, it's obvious that you don't really grok Hadoop and what it requires to achieve optimum performance. You couldn't even extrapolate any meaningful data from your current environment.

Secondly, I think you need to actually think about the problem. Did you mean PB or TB? Because your math seems to be off by a couple orders of magnitude.

A single file measured in PBs? That is currently impossible using today's (2012) technology. In fact a single file that is measured in PBs wouldn't exist within the next 5 years and most likely the next decade. [Moore's law is all about CPU power, not disk density.]

Also take a look at networking. ToR switch designs differ; however, with current technology, the fabric tends to max out at 40GBs. What's the widest fabric on a backplane? That's your first bottleneck because even if you had 1PB of data, you couldn't feed it to the cluster fast enough.

Forget disk. Look at PCIe based memory. (Money no object, right?) You still couldn't populate it fast enough.

I guess Steve hit this nail on the head when he talked about this being a homework assignment.

High school maybe?
-
RE: One petabyte of data loading into HDFS with in 10 min.
Gauthier, Alexander 2012-09-10, 16:17
Well said Mike. Lots of "funny questions" around here lately...
-
Re: One petabyte of data loading into HDFS with in 10 min.
Fabio Pitzolu 2012-09-05, 14:47
290 days per petabyte, I'll analyze your data manually!! Also print out some report! :-D

Fabio
-
One petabyte of data loading into HDFS with in 10 min.
prabhu K 2012-09-05, 12:21
Hi Users,
Please clarify the questions below.

1. To load one petabyte of data into HDFS/Hive within 10 minutes, how many slave (DataNode) machines are required?
2. To load one petabyte of data into HDFS/Hive within 10 minutes, what configuration setup is needed for cloud computing?

Please suggest and help me on this.

Thanks & Regards,
Prabhu.
-
Re: One petabyte of data loading into HDFS with in 10 min.
Chen He 2012-09-05, 14:03
If it is not a single file, you can upload the files to HDFS using multiple threads.
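A minimal sketch of the multi-threaded upload idea, assuming the `hadoop` command-line client is on the PATH and using `hadoop fs -put` for each file. The helper names (`put_file`, `upload_all`), the worker count and the example paths are made up for illustration; the upload function is passed in as a parameter so the sketch can be exercised without a cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def put_file(local_path, hdfs_dir):
    """Copy one local file into HDFS with the stock command-line client."""
    subprocess.run(["hadoop", "fs", "-put", local_path, hdfs_dir], check=True)
    return local_path


def upload_all(local_paths, hdfs_dir, workers=8, uploader=put_file):
    """Upload many files concurrently; returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() re-raises the first failure when its results are consumed.
        return list(pool.map(lambda path: uploader(path, hdfs_dir), local_paths))


if __name__ == "__main__":
    # Dry run with a stand-in uploader, so no cluster is needed.
    done = upload_all(["a.log", "b.log"], "/data/in",
                      uploader=lambda path, dest: f"{dest}/{path}")
    print(done)
```

On a single client the threads still share one network interface and one set of source disks, so this mostly helps when each of many machines uploads its own share of the files.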
-
RE: One petabyte of data loading into HDFS with in 10 min.
Shailesh Dargude 2012-09-05, 14:14
Sorry Prabhu for hijacking this discussion a bit. I wonder what the best practice is for loading data into HDFS in general. Considering the size of the data (often in GBs or TBs), how are storage and time constraints handled?

If anybody can share their experiences or best practices, it would be great!

-Shailesh.
-
Re: One petabyte of data loading into HDFS with in 10 min.
Mohammad Tariq 2012-09-05, 14:22
Hello Shailesh,
Give distcp a shot. It runs a MapReduce job to copy data from source to destination, so the data can be copied in parallel.

Regards,
Mohammad Tariq
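As a rough sketch, a DistCp run can be launched from a script by assembling the `hadoop distcp` command line. The NameNode addresses and paths below are placeholders, and `build_distcp_command` and `run_distcp` are hypothetical helpers written for this example; the `-m` option caps the number of simultaneous map tasks used for the copy.

```python
import subprocess


def build_distcp_command(source_uri, dest_uri, max_maps=None):
    """Assemble the argument list for a `hadoop distcp` invocation."""
    command = ["hadoop", "distcp"]
    if max_maps is not None:
        command += ["-m", str(max_maps)]  # upper bound on parallel copiers
    command += [source_uri, dest_uri]
    return command


def run_distcp(source_uri, dest_uri, max_maps=None):
    """Run the copy; raises CalledProcessError if the job fails."""
    subprocess.run(build_distcp_command(source_uri, dest_uri, max_maps),
                   check=True)


if __name__ == "__main__":
    # Placeholder URIs; replace with real NameNode hosts and paths.
    cmd = build_distcp_command("hdfs://nn1:8020/source",
                               "hdfs://nn2:8020/destination", max_maps=20)
    print(" ".join(cmd))
```

Keeping the command construction separate from the call that runs it makes it possible to inspect or log the exact invocation before starting a long copy.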
-
Re: One petabyte of data loading into HDFS with in 10 min.
Steve Loughran 2012-09-07, 09:12
Which course is this the homework for?
-
Re: One petabyte of data loading into HDFS with in 10 min.
Gulfie 2012-09-06, 20:52
Back up for a second. Why would you want to do this and where does the data come from?

Is this a new PB of data every time? Or is it a PB total with some new and some old? Only migrating the deltas could help.

Can the data migration/load have its latency hidden? Is the PB of data ready all at once? Is the first 100TB ready to be loaded long before the last 100TB is written/gathered/generated? Is it possible to generate/gather the data into HDFS originally so there is no initial load time penalty?

1PB / 10 minutes = roughly 13 Terabits / second throughput (3x that for naive data redundancy). That is a lot. Not crazy a lot, but a lot. Today's large core switches/routers can do a few Tb/sec each; you'd need a fleet of them or use OpenFlow.

Redundancy will require going across a node-to-node network of some sort, be it SAN, Ethernet or whatever. By building a special purpose back end replication network/nodes you may be able to decrease the network costs.

If you really want to push this much data around that quickly, the only type of network that makes sense is one that avoids oversubscription. Look into fat-tree networks as a start.

Tens of thousands of nodes running at gigabit, or thousands of nodes running at 10gig, or hundreds of nodes running Infiniband (40gbit).

The biggest question is: can you avoid having to do this much data migration? Networks aren't getting faster as fast as CPUs are. A long term architecture based on growing datasets and data migration is looking for trouble.

-gulfie
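The node counts above follow from dividing the required aggregate throughput by the per-node link speed. The sketch below assumes decimal units, ideal line-rate links and no protocol or replication-pipeline overhead; the function name and the table of link speeds are illustrative.

```python
import math

PB_BITS = 8 * 10**15          # one decimal petabyte, in bits
DEADLINE_SECONDS = 10 * 60

LINK_SPEEDS_GBPS = {"1 GbE": 1, "10 GbE": 10, "40 Gb InfiniBand": 40}


def nodes_required(total_bits, link_gbps, deadline_seconds):
    """Nodes needed if every node sustains its full link rate."""
    aggregate_bps = total_bits / deadline_seconds
    return math.ceil(aggregate_bps / (link_gbps * 10**9))


for replication in (1, 3):
    total_bits = PB_BITS * replication
    tbps = total_bits / DEADLINE_SECONDS / 10**12
    print(f"replication x{replication}: {tbps:.1f} Tb/s aggregate")
    for name, gbps in LINK_SPEEDS_GBPS.items():
        count = nodes_required(total_bits, gbps, DEADLINE_SECONDS)
        print(f"  {name}: {count} nodes")
```

These are lower bounds under the stated assumptions; real clusters rarely sustain full line rate on every node, so practical counts would be higher.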
-
Re: One petabyte of data loading into HDFS with in 10 min.
Michael Segel 2012-09-10, 19:54
Well I think the question would make more sense if he meant to say how one could load a GB file within 10 mins.
Note that there are 1x10^6 GB in a PB. (Hence the comment about being off by several orders of magnitude.)

Now were the OP asking about how to load a 1GB file in 10min, then you're within the realm of 10GbE, SATA drives and a couple of nodes. And then the question would make sense.

But to your point: what's the incremental load if the data is a single 1PB file? Either you have the file, or you don't. ;-)

As to hitting your limits, we all have limits. Mine is c. ;-)

On Sep 10, 2012, at 2:22 PM, Siddharth Tiwari <[EMAIL PROTECTED]> wrote:

> Well, can't you load the incremental data only? As the goal seems quite unrealistic. The big guns have already spoken :P
>
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of God."
> "Maybe other people will try to limit me but I don't limit myself"