Re: A question about HBase MapReduce

re:  "data from raw data file into hbase table"

One approach is bulk loading:

http://hbase.apache.org/book.html#arch.bulk.load
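Roughly, the HFile-preparation job looks like the sketch below (a sketch only, not the book's exact code; the paths, the table name "mytable", and MyPutMapper are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadPrepare {

      // Hypothetical mapper: parse each raw input line into a Put.
      public static class MyPutMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          // split the line, build a row key and a Put, then
          // ctx.write(new ImmutableBytesWritable(rowKey), put);  (omitted)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulk-load-prepare");
        job.setJarByClass(BulkLoadPrepare.class);
        job.setMapperClass(MyPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path("/raw/input"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

        // Wires in the partitioner and reducer so the HFiles line up
        // with the table's current region boundaries.
        HTable table = new HTable(conf, "mytable");
        HFileOutputFormat.configureIncrementalLoad(job, table);

        job.waitForCompletion(true);
        // Then load /tmp/hfiles with the completebulkload tool
        // described at the book link above.
      }
    }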

If he's talking about using an HBase table as the source of an MR job, then
see this:
http://hbase.apache.org/book.html#splitter
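The wiring for that case usually looks something like this (again only a
sketch; "mytable" and the mapper body are placeholders, and TableInputFormat
by default creates one map task per region):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class TableSourceJob {

      // Receives one (row key, Result) pair per row of the table.
      public static class MyTableMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context ctx)
            throws IOException, InterruptedException {
          // process one row here (omitted)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "scan-mytable");
        job.setJarByClass(TableSourceJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner batches for MR
        scan.setCacheBlocks(false);  // don't churn the block cache

        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, MyTableMapper.class,
            Text.class, IntWritable.class, job);
        job.setNumReduceTasks(0);    // map-only for this sketch
        job.waitForCompletion(true);
      }
    }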
On 5/25/12 2:35 AM, "Florin P" <[EMAIL PROTECTED]> wrote:

>Hello!
>
>I've read Lars George's blog post
>http://www.larsgeorge.com/2009/05/hbase-mapreduce-101-part-i.html where,
>at the end of the article, he mentioned: "In the next post I will show
>you how to import data from a raw data file into a HBase table and how
>you eventually process the data in the HBase table. We will address
>questions like how many mappers and/or reducers are needed and how can I
>improve import and processing performance." I looked through the blog
>for these questions, but it seems there is no related article. Do you
>know if he touched on these subjects in a different post or book? In
>particular I am interested in:
>
>1. How can you set the number of mappers?
>2. Can the number of mappers be set per region server? If yes, how?
>3. How does a large number of mappers affect data locality?
>4. Is the algorithm for computing the number of mappers
>(https://issues.apache.org/jira/browse/HBASE-1172) still in use?
>"Currently, the number of mappers specified when using TableInputFormat
>is strictly followed if less than total regions on the input table. If
>greater, the number of regions is used.
>This will modify the splitting algorithm to do the following:
> * Specify 0 mappers when you want # mappers = # regions
> * If you specify fewer mappers than regions, will use exactly the
>number you specify based on the current algorithm
> * If you specify more mappers than regions, will divide regions up by
>determining [start,X) [X,end). The number of mappers will always be a
>multiple of number of regions. This is so we do not have scanners
>spanning multiple regions.
>There is an additional issue in that the default number of mappers in
>JobConf is set to 1. That means if a user does not explicitly set
>number of map tasks, a single mapper will be used."
>
>I look forward to your answers. Thank you.
>
>Kind regards, Florin
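
To make the HBASE-1172 rules quoted above concrete, here is a rough sketch
of the split-count arithmetic they describe (illustrative only, not the
actual HBase source; the ticket does not spell out the rounding, so that
part is an assumption):

    // Illustrative only: the mapper-count rules quoted from HBASE-1172.
    static int computeNumSplits(int requestedMappers, int numRegions) {
      if (requestedMappers == 0) {
        return numRegions;            // 0 means one mapper per region
      }
      if (requestedMappers < numRegions) {
        return requestedMappers;      // fewer than regions: used exactly
      }
      // More mappers than regions: each region is cut into sub-ranges
      // [start, X) [X, end) so no scanner spans two regions, and the
      // total stays a multiple of the region count. Rounding down to
      // the nearest multiple is an assumption.
      int splitsPerRegion = requestedMappers / numRegions;
      return splitsPerRegion * numRegions;
    }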