MapReduce >> mail # user >> One petabyte of data loading into HDFS with in 10 min.


Re: One petabyte of data loading into HDFS with in 10 min.
Sorry, but you didn't account for the network saturation.

And why 1GbE and not 10GbE? Also, which version of Hadoop?

Here MapR works well with bonding two 10GbE ports, and with the right switch you could do OK.
Also 2 ToR switches per rack, etc.

How many machines? 150? 300? more?

Then you don't talk about how much memory, CPUs, what type of storage...

Lots of factors.

I'm sorry to interrupt this mental masturbation about how to load 1PB in 10min.
There are a lot more questions that should have been asked and weren't.

Hey, but look. It's a Friday, so I suggest some pizza, beer, and then take it to a whiteboard.

But what do I know? In a different thread, I'm talking about how to tame HR and Accounting so they let me play with my team Ninja!
:-P

On Sep 5, 2012, at 9:56 AM, zGreenfelder <[EMAIL PROTECTED]> wrote:

> On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <[EMAIL PROTECTED]> wrote:
>> Here's an extremely naïve ballpark estimation: at theoretical hardware
>> speed, for 3PB representing 1PB with 3x replication
>>
>> Over a single 1Gbps connection (and I'm not sure you can actually reach
>> 1Gbps)
>> (3 petabytes) / (1 Gbps) = 291.271111 days
>>
>> So you'd need at least 40,000 1Gbps network cards to get that in 10 minutes
>> :) - (3PB/1Gbps)/40000
>>
>> The actual number of nodes would depend a lot on the actual network
>> architecture, the type of storage you use (SSD,  HDD), etc.
>>
>> Cosmin
>
> ah, I went the other direction with the math, and assumed no
> replication (completely unsafe and never reasonable for a real,
> production environment, but since we're all theory and just looking
> for starting point numbers)
>
>
> 1PB in 10 min => 1,000,000 GB in 10 min => 8,000,000 Gb in 600 seconds =>
> 80,000/6  ~= 14k machines running at gigabit or about 1.5k machines if you
> get 10Gb connected machines.
>
> all assuming there's no network or cluster sync overhead
> (of course there would be)
>
>
> that seems like some pretty deep pockets to get to < 10 minute load
> time for that much data.
>
> I could also be off, I just threw some stuff together somewhat
> quickly between conf calls.
>
> --
> Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
>
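
For anyone who wants to re-run the two back-of-the-envelope estimates quoted above, here is a minimal Python sketch. It is not from either poster. The helper names (`transfer_seconds`, `links_needed`) are made up for illustration, and it assumes what both estimates assume: every link runs at full line rate with no protocol, replication-pipeline, or disk overhead. The 291.27-day figure only comes out if "PB" and "Gbps" are read as binary units (2^50 bytes, 2^30 bits/s), which is an inference from the number itself; the second estimate uses decimal units.

```python
import math

SECONDS_PER_DAY = 86_400
TEN_MINUTES = 600  # target load window in seconds

# Unit conventions (assumed; the thread does not state them explicitly).
PIB = 2**50     # binary petabyte in bytes
GIBIT = 2**30   # binary gigabit per second
PB = 10**15     # decimal petabyte in bytes
GBIT = 10**9    # decimal gigabit per second


def transfer_seconds(data_bytes, link_bits_per_s):
    """Seconds to push data_bytes over one link at full line rate."""
    return data_bytes * 8 / link_bits_per_s


def links_needed(data_bytes, link_bits_per_s, window_s):
    """Smallest whole number of parallel links that fits the window."""
    return math.ceil(transfer_seconds(data_bytes, link_bits_per_s) / window_s)


# First estimate: 1 PB stored with 3x replication, binary units.
days_single_link = transfer_seconds(3 * PIB, GIBIT) / SECONDS_PER_DAY
replicated_1g = links_needed(3 * PIB, GIBIT, TEN_MINUTES)

# Second estimate: 1 PB, no replication, decimal units.
unreplicated_1g = links_needed(PB, GBIT, TEN_MINUTES)
unreplicated_10g = links_needed(PB, 10 * GBIT, TEN_MINUTES)

print(f"3 PB over one 1 Gbps link: {days_single_link:.6f} days")
print(f"1 Gbps links for 3 PB in 10 min: {replicated_1g}")
print(f"1 Gbps links for 1 PB in 10 min: {unreplicated_1g}")
print(f"10 Gbps links for 1 PB in 10 min: {unreplicated_10g}")
```

Under these assumptions the replicated case needs about 42,000 gigabit links rather than exactly 40,000, and the unreplicated case needs about 13,300 at 1 Gbps or about 1,300 at 10 Gbps, so the rounded figures in the thread are in the right range. Real clusters would need more once network saturation and storage throughput are accounted for.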