HBase, mail # user - Tune MapReduce over HBase to insert data


Re: Tune MapReduce over HBase to insert data
Bing Jiang 2013-01-13, 09:31
Hi Anoop,
Why doesn't the HBase mapreduce package include a tool like this?

Anoop John <[EMAIL PROTECTED]> wrote:

>Hi,
>Have you considered using HFileOutputFormat? You are using
>TableOutputFormat now, which issues put calls to HTable. With
>HFileOutputFormat the MR job writes the HFiles directly (no flushes, no
>compactions). Afterwards you need to load the HFiles into the regions
>using LoadIncrementalHFiles. That may help you.
>
>-Anoop-
>
>On Sun, Jan 13, 2013 at 10:59 AM, Farrokh Shahriari <
>[EMAIL PROTECTED]> wrote:
>
>> Thank you guys, let me change these configurations & test MapReduce again.
>>
>> On Tue, Jan 8, 2013 at 10:31 PM, Asaf Mesika <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Start by testing HDFS throughput with a simple copyFromLocal using the
>> > Hadoop command-line shell (bin/hadoop fs -copyFromLocal pathTo8GBFile
>> > /tmp/dummyFile1). If you have a 1000 Mbit/sec network between the computers,
>> > you should get around 75 MB/sec.
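
As a rough sanity check on the 75 MB/sec figure above: a 1000 Mbit/sec link tops out at 125 MB/sec before any protocol overhead, so 75 MB/sec is about 60% of line rate. The sketch below only spells out that arithmetic. The function names are illustrative, not part of Hadoop, and decimal units (1 MB = 10^6 bytes) are assumed.

```python
def link_ceiling_mb_per_s(link_mbit_per_s):
    """Theoretical ceiling of a network link in MB/s (8 bits per byte)."""
    return link_mbit_per_s / 8.0


def observed_mb_per_s(bytes_copied, seconds):
    """Throughput of a timed copy in MB/s, assuming 1 MB = 10**6 bytes."""
    return bytes_copied / 1e6 / seconds


ceiling = link_ceiling_mb_per_s(1000)   # 1000 Mbit/s link
fraction_of_line_rate = 75 / ceiling    # the 75 MB/s estimate vs. the ceiling
```

Timing the copyFromLocal of a file of known size and passing both numbers to `observed_mb_per_s` gives a value to compare against the ceiling.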
>> >
>> > On Tuesday, January 8, 2013, Bing Jiang wrote:
>> >
>> > > In our experience, MapReduce insert throughput can be improved by:
>> > > 1. increasing the number of regionserver flush threads
>> > > 2. increasing the memstore share of the JVM heap
>> > > 3. pre-splitting the table into regions before running the MapReduce job
>> > > 4. increasing the number of large and small compaction threads
>> > >
>> > > Please correct me if I'm wrong, or if there are any better ideas.
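
To illustrate the pre-splitting suggestion in the list above, here is a minimal Python sketch that computes evenly spaced split keys. It assumes, hypothetically, that row keys begin with a fixed-width lowercase hex prefix and are spread uniformly over it; nothing in the thread says what the actual row keys look like, and `hex_split_keys` is an illustrative helper, not an HBase API.

```python
def hex_split_keys(num_regions, width=2):
    """Return num_regions - 1 split keys dividing a hex key space evenly.

    Assumes row keys start with `width` lowercase hex characters and are
    uniformly distributed over that prefix (an assumption, not from the thread).
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    space = 16 ** width
    if num_regions > space:
        raise ValueError("more regions than distinct key prefixes")
    return [format(i * space // num_regions, "0{}x".format(width))
            for i in range(1, num_regions)]


# One region per node on the 12-node cluster described in this thread.
splits = hex_split_keys(12)
```

The resulting keys would be supplied when the table is created, so that the map tasks write to all regionservers from the start instead of piling onto a single initial region.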
>> > > On Jan 8, 2013 4:02 PM, "lars hofhansl" <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > What type of disks do you have, and how many?
>> > > > With the default replication factor your 2 (or 6) GB are actually
>> > > > replicated 3 times.
>> > > > That is 6GB/80s = 75MB/s, twice that if you do not disable the WAL, which
>> > > > a reasonable machine should be able to absorb.
>> > > > The fact that deferred log flush does not help you seems to indicate
>> > > > that you're IO bound.
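
The write-volume arithmetic above can be sketched in Python as follows. The function name and parameters are illustrative. It uses decimal units (1 GB = 1000 MB), as the 6GB/80s = 75MB/s figure does, and it deliberately ignores the extra rewrites caused by compactions, which are raised separately in this message.

```python
def required_write_mb_per_s(payload_gb, seconds, replication=3, wal_enabled=False):
    """Aggregate write rate (MB/s) needed to store payload_gb within seconds.

    Every byte is written once per HDFS replica; with the WAL enabled the
    data is written a second time, doubling the volume. Compaction rewrites
    are not modelled.
    """
    total_mb = payload_gb * 1000 * replication
    if wal_enabled:
        total_mb *= 2
    return total_mb / seconds


no_wal = required_write_mb_per_s(2, 80)                       # 75.0
with_wal = required_write_mb_per_s(2, 80, wal_enabled=True)   # 150.0
single_replica = required_write_mb_per_s(2, 80, replication=1)  # 25.0
```

The same function shows why lowering the replication factor helps: with one replica the 2 GB load needs only a third of the write bandwidth.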
>> > > >
>> > > >
>> > > > What's your memstore flush size? Potentially the data is written many
>> > > > times during compactions.
>> > > >
>> > > >
>> > > > In your case you could dial down the HDFS replication, since you only have
>> > > > two physical machines anyway.
>> > > > (Set it to 2. If you do not specify any failure zones, you might as well
>> > > > set it to 1... You will lose data if one of your server machines dies
>> > > > anyway).
>> > > >
>> > > > It does not really make that much sense to deploy HBase and HDFS on
>> > > > virtual nodes like this.
>> > > > -- Lars
>> > > >
>> > > >
>> > > >
>> > > > ________________________________
>> > > >  From: Farrokh Shahriari <[EMAIL PROTECTED]>
>> > > > To: [EMAIL PROTECTED]
>> > > > Sent: Monday, January 7, 2013 9:38 PM
>> > > > Subject: Re: Tune MapReduce over HBase to insert data
>> > > >
>> > > > Hi again,
>> > > > I'm using HBase 0.92.1-cdh4.0.0.
>> > > > I have two server machines, each with 48 GB RAM, 12 physical cores & 24
>> > > > logical cores, that host 12 nodes (6 nodes on each server). Each node has
>> > > > 8 GB RAM & 2 vCPUs.
>> > > > Some parameters gave me better results, like setting WAL=off on put, but
>> > > > others, like heap size and deferred log flush, don't help me.
>> > > > Besides that, I have another question: why do I get a different run time
>> > > > each time I run the MapReduce job, while the config & hardware stay the
>> > > > same?
>> > > >
>> > > > Thank you guys
>> > > >
>> > > > On Tue, Jan 8, 2013 at 8:42 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > Have you read through http://hbase.apache.org/book.html#performance?
>> > > > >
>> > > > > What version of HBase are you using ?
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > On Mon, Jan 7, 2013 at 9:05 PM, Farrokh Shahriari <
>> > > > > [EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > Hi there,
>> > > > > > I have a cluster of 12 nodes, each with 2 CPU cores. Now I want to
>> > > > > > insert a large amount of data: about 2 GB in 80 sec (or 6 GB in 240 sec).