HBase >> mail # user >> Regarding HBase Hadoop multiple scan objects issue


Re: Regarding HBase Hadoop multiple scan objects issue

Hi there-

You probably want to review this section of the RefGuide:
http://hbase.apache.org/book.html#mapreduce

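re:  "it's inefficient to have one scan object to scan everything."
It is.  But in the MapReduce case, there is one Map-task per input
split (with TableInputFormat that is one split per region by default;
see the RefGuide for details), and therefore one Scanner instance per
Map-task.  The table is scanned in parallel, not by a single scanner.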
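
To make the split idea concrete, here is a minimal sketch in plain Java
(no HBase classes) of what a custom getSplits could compute: it cuts each
changed row range at the region boundaries, so that every resulting piece
could be handed to its own Map-task and Scanner. The class RangeSplits, the
Range type and the splitsFor method are hypothetical names made up for this
illustration, not HBase API, and row keys are modelled as ASCII strings
instead of byte arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: models row keys as ASCII Strings, not HBase byte[] keys.
final class RangeSplits {

    // Half-open row range [start, stop); an empty stop means "to the end of the table".
    public static final class Range {
        public final String start;
        public final String stop;
        public Range(String start, String stop) { this.start = start; this.stop = stop; }
        @Override public String toString() { return "[" + start + "," + stop + ")"; }
    }

    // Smaller of two stop keys, where "" is treated as "unbounded".
    static String minStop(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a.compareTo(b) <= 0 ? a : b;
    }

    // regionStarts: sorted region start keys, the first being "" (start of table).
    // changed: the row ranges known to contain modified rows.
    // Returns one sub-range per (changed range, overlapping region) pair.
    public static List<Range> splitsFor(List<String> regionStarts, List<Range> changed) {
        List<Range> splits = new ArrayList<>();
        for (Range c : changed) {
            for (int i = 0; i < regionStarts.size(); i++) {
                String regionStart = regionStarts.get(i);
                String regionStop = i + 1 < regionStarts.size() ? regionStarts.get(i + 1) : "";
                // Intersect the changed range with this region.
                String lo = c.start.compareTo(regionStart) >= 0 ? c.start : regionStart;
                String hi = minStop(c.stop, regionStop);
                if (hi.isEmpty() || lo.compareTo(hi) < 0) {
                    splits.add(new Range(lo, hi));
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> regionStarts = Arrays.asList("", "g", "p");
        List<Range> changed = Arrays.asList(new Range("c", "h"), new Range("r", "t"));
        System.out.println(splitsFor(regionStarts, changed));
        // prints [[c,g), [g,h), [r,t)]
    }
}
```

In a real job each of these pieces would become the start/stop row of a
Scan, so rows outside the changed ranges are never read. Newer HBase
releases also ship a MultiTableInputFormat that accepts a list of Scan
objects; whether it is available depends on your version, so check the
RefGuide before writing your own InputFormat.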

On 1/18/13 5:43 PM, "Xu, Leon" <[EMAIL PROTECTED]> wrote:

>Hi HBase users,
>
>I am currently trying to set up a denormalization map-reduce job for my
>HBase Table.
>Since our table contains a large volume of data, it's inefficient to
>have one scan object scan everything. We only need to process those
>records that have changed. I am planning to have multiple scan objects,
>each of which specifies a row range, given that we keep track of
>which rows have been changed.
>Therefore I am trying to set up the map-reduce job with multiple scan
>objects. Is this possible?
>I have seen some posts online suggesting extending the InputFormat class
>and changing getSplits. Is this the most efficient way?
>
>Using a filter does not seem very efficient in my case because it
>basically still scans the whole table, right? It just filters out
>certain records.
>
>Can you point me in the right direction?
>
>
>Thanks
>Leon