MapReduce user mailing list: Re: One petabyte of data loading into HDFS within 10 min.


Thread:
Nick Jones 2012-09-05, 14:59
Mathias Herberts 2012-09-05, 15:12
zGreenfelder 2012-09-05, 14:56
DSouza, Clive V 2012-09-05, 14:58
Michael Segel 2012-09-07, 14:00
prabhu K 2012-09-10, 07:40
Steve Loughran 2012-09-10, 09:40
Michael Segel 2012-09-10, 11:50
Gauthier, Alexander 2012-09-10, 16:17
Fabio Pitzolu 2012-09-05, 14:47
prabhu K 2012-09-05, 12:21
Chen He 2012-09-05, 14:03
Shailesh Dargude 2012-09-05, 14:14
Mohammad Tariq 2012-09-05, 14:22
Steve Loughran 2012-09-07, 09:12
Gulfie 2012-09-06, 20:52
Re: One petabyte of data loading into HDFS within 10 min.
Well, I think the question would make more sense if he had meant to ask how one could load a 1 GB file within 10 minutes.

Note that there are 1x10^6 GB in a PB (hence the comment about being off by several orders of magnitude).
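For scale, a quick back-of-envelope check in decimal units:

\[
1\,\mathrm{PB} = 10^{15}\,\mathrm{B} = 10^{6}\,\mathrm{GB},
\qquad
\frac{10^{15}\,\mathrm{B}}{600\,\mathrm{s}} \approx 1.7\,\mathrm{TB/s} \approx 13\,\mathrm{Tb/s}.
\]

So loading 1 PB in 10 minutes means sustaining roughly 13 Tb/s of ingest, on the order of 1,300 saturated 10 GbE links just for the raw bytes.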

Now, were the OP asking how to load a 1 GB file in 10 minutes, then you're within the realm of 10GbE, SATA drives, and a couple of nodes, and the question would make sense.
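For that easy case, here is a minimal sketch of such a load using Hadoop's Java FileSystem API; the NameNode URI and both paths are hypothetical placeholders, not details from this thread:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: copy a local ~1 GB file into HDFS.
    public class HdfsLoad {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "hdfs://namenode:8020" is an assumed NameNode address.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            Path src = new Path("/data/local/input-1g.dat");  // local source (assumed)
            Path dst = new Path("/user/prabhu/input-1g.dat"); // HDFS destination (assumed)

            // Streams the file into the cluster; 1 GB in 10 minutes only
            // requires ~1.7 MB/s of sustained throughput.
            fs.copyFromLocalFile(src, dst);
            fs.close();
        }
    }

At that rate a single SATA disk (~100 MB/s) or one 10GbE link leaves enormous headroom, which is why the 1 GB version of the question is unremarkable.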

But to your point: what's the incremental load if the data is a single 1 PB file?
Either you have the file or you don't. ;-)

As to hitting your limits, we all have limits.  Mine is c. ;-)

On Sep 10, 2012, at 2:22 PM, Siddharth Tiwari <[EMAIL PROTECTED]> wrote:

> Well, can't you load just the incremental data? The goal seems quite unrealistic, and the big guns have already spoken. :P
>
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of God.”
> "Maybe other people will try to limit me but I don't limit myself"
>
>
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: One petabyte of data loading into HDFS within 10 min.
> Date: Mon, 10 Sep 2012 16:17:20 +0000
>
> Well said Mike. Lots of “funny questions” around here lately…
>  
> From: Michael Segel [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 10, 2012 4:50 AM
> To: [EMAIL PROTECTED]
> Cc: Michael Segel
> Subject: Re: One petabyte of data loading into HDFS within 10 min.
>  
>  
> On Sep 10, 2012, at 2:40 AM, prabhu K <[EMAIL PROTECTED]> wrote:
>
>
> Hi Users,
>  
> Thanks for the response.
>  
> We have loaded 100 GB of data into HDFS; it took 1 hour with the configuration below.
>
> Each node (1 master machine, 2 slave machines):
>
> 1. 500 GB hard disk
> 2. 4 GB RAM
> 3. 3 quad-core CPUs
> 4. Speed: 1333 MHz
>  
> Now we are planning to load 1 petabyte of data (a single file) into Hadoop HDFS and a Hive table within 10-20 minutes. For this we need clarification on the points below.
>
> Ok...
>  
> Some say that I am sometimes too harsh in my criticisms, so take what I say with a grain of salt...
>  
> You loaded 100GB in an hour using woefully underperforming hardware and are now saying you want to load 1PB in 10 mins.
>  
> I would strongly suggest that you first learn more about Hadoop. No, really. Looking at your first machine, it's obvious that you don't really grok Hadoop and what it requires to achieve optimum performance. You couldn't even extrapolate any meaningful data from your current environment.
>  
> Secondly, I think you need to actually think about the problem. Did you mean PB or TB? Because your math seems to be off by a couple orders of magnitude.
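Extrapolating from the numbers upthread makes the gap concrete:

\[
\frac{1\,\mathrm{PB}}{100\,\mathrm{GB/h}} = \frac{10^{6}\,\mathrm{GB}}{100\,\mathrm{GB/h}} = 10^{4}\,\mathrm{h} \approx 417\ \mathrm{days},
\]

so hitting a 10-minute target from a measured 100 GB/h would take a speedup of about 60,000x.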
>  
> A single file measured in PBs? That is currently impossible using today's (2012) technology. In fact, a single file measured in PBs probably won't exist within the next 5 years, and most likely not within the next decade. [Moore's law is about CPU power, not disk density.]
>  
> Also take a look at networking.
> ToR switch designs differ, but with current technology the fabric tends to max out at around 40 Gb/s. What's the widest fabric on a backplane?
> That's your first bottleneck, because even if you had 1 PB of data, you couldn't feed it to the cluster fast enough.
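Concretely, even a fully saturated 40 Gb/s fabric moves 1 PB in

\[
\frac{8 \times 10^{15}\,\mathrm{b}}{40 \times 10^{9}\,\mathrm{b/s}} = 2 \times 10^{5}\,\mathrm{s} \approx 56\ \mathrm{hours},
\]

more than 300 times the 10-minute budget.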
>  
> Forget disk; look at PCIe-based memory. (Money no object, right?)
> You still couldn't populate it fast enough.
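Even granting an optimistic ~10 GB/s per PCIe device (an assumed round number), filling 1 PB takes

\[
\frac{10^{15}\,\mathrm{B}}{10^{10}\,\mathrm{B/s}} = 10^{5}\,\mathrm{s} \approx 28\ \mathrm{hours}
\]

per device.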
>  
> I guess Steve hit this nail on the head when he talked about this being a homework assignment.
>  
> High school maybe?
>  
>
>
> 1. What system configuration is required for each of the 3 machines?
> 2. Hard disk size?
> 3. RAM size?
> 4. Motherboard?
> 5. Network cabling?
> 6. How many Gbps of InfiniBand are required?
>
> Do we need a cloud computing environment for the same setup as well?
>
> Please advise and help me with this.
>
>  Thanks,
>
> Prabhu.
>
> On Fri, Sep 7, 2012 at 7:30 PM, Michael Segel <[EMAIL PROTECTED]> wrote: