MapReduce job with mixed data sources: HBase table and HDFS files (HBase user mailing list)


S. Zhou 2013-07-03, 04:34
Azuryy Yu 2013-07-03, 05:06
S. Zhou 2013-07-03, 15:34
Michael Segel 2013-07-03, 21:19
Azuryy Yu 2013-07-04, 01:02
Ted Yu 2013-07-04, 04:29
S. Zhou 2013-07-04, 03:41

Re: MapReduce job with mixed data sources: HBase table and HDFS files
Actually you can, though it will be slower than you might think.

You'd have to do a single-threaded scan to pull the data from the remote cluster to the local cluster; once it's local, you can parallelize the HDFS m/r portion of the job.

Note: being able to do something doesn't mean it's a good idea.

An alternative would be for the client to run an m/r job on the remote cluster which then writes to the second cluster. This parallelizes the initial scan.
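
A minimal sketch of the single-threaded pull described above, assuming HBase 0.94-era client APIs; the ZooKeeper quorum hosts, table name, and staging path are hypothetical placeholders:

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoteTablePull {
  public static void main(String[] args) throws Exception {
    // Point the HBase client at the REMOTE cluster's ZooKeeper quorum
    // (hypothetical hosts).
    Configuration remote = HBaseConfiguration.create();
    remote.set("hbase.zookeeper.quorum", "remote-zk1,remote-zk2,remote-zk3");

    HTable table = new HTable(remote, "mytable");  // hypothetical table
    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches to cut RPC round trips
    scan.setCacheBlocks(false);  // don't churn the remote region servers' cache

    // Stage the rows on the LOCAL cluster's HDFS (default fs from classpath).
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        fs.create(new Path("/staging/mytable.txt"))));  // hypothetical path

    ResultScanner scanner = table.getScanner(scan);
    try {
      // Single-threaded: one scanner walks the whole table.
      for (Result r : scanner) {
        out.write(Bytes.toString(r.getRow()));  // row key only, for brevity
        out.newLine();
      }
    } finally {
      scanner.close();
      out.close();
      table.close();
    }
  }
}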

On Jul 3, 2013, at 8:02 PM, Azuryy Yu <[EMAIL PROTECTED]> wrote:

> Hi,
> 1) A single MR job cannot take input from two different clusters.
> 2) If your data is located in the same cluster, then:
>
>    conf.set(TableInputFormat.SCAN,
>        TableMapReduceUtil.convertScanToString(new Scan()));
>    conf.set(TableInputFormat.INPUT_TABLE, tableName);
>
>    MultipleInputs.addInputPath(conf, new Path(input_on_hdfs),
>        TextInputFormat.class, MapperForHdfs.class);
>    MultipleInputs.addInputPath(conf, new Path(input_on_hbase),
>        TableInputFormat.class, MapperForHBase.class);
>
> Note that new Path(input_on_hbase) can be any path; it is not actually
> used by TableInputFormat, so its value makes no difference.
>
> For an example of reading a table in an MR job, please refer to
> org.apache.hadoop.hbase.mapreduce.IndexBuilder under
> $HBASE_HOME/src/example.
>
> On Thu, Jul 4, 2013 at 5:19 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> You may want to pull your data from HBase first in a separate map-only
>> job and then use its output along with the other HDFS input.
>> There is a significant disparity in read performance between HDFS and
>> HBase.
>>
>>
>> On Jul 3, 2013, at 10:34 AM, S. Zhou <[EMAIL PROTECTED]> wrote:
>>
>>> Azuryy, I am looking at the MultipleInputs doc, but I could not figure
>>> out how to add an HBase table as a Path to the input. Do you have some
>>> sample code? Thanks!
>>>
>>> ________________________________
>>> From: Azuryy Yu <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]; S. Zhou <[EMAIL PROTECTED]>
>>> Sent: Tuesday, July 2, 2013 10:06 PM
>>> Subject: Re: MapReduce job with mixed data sources: HBase table and
>>> HDFS files
>>>
>>>
>>> Hi,
>>>
>>> Use MultipleInputs, which can solve your problem.
>>>
>>>
>>> On Wed, Jul 3, 2013 at 12:34 PM, S. Zhou <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I know how to create a MapReduce job with an HBase table or an HDFS
>>>> file as the data source. Now I need to create a MapReduce job with
>>>> mixed data sources, that is, this MR job needs to read data from both
>>>> HBase and HDFS files. Is it possible? If yes, could you share some
>>>> sample code?
>>>>
>>>> Thanks!
>>>> Senqiang
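
For reference, a fuller sketch of the MultipleInputs approach from Azuryy's snippet above, assuming the new (org.apache.hadoop.mapreduce) API, under which MultipleInputs and HBase's TableInputFormat can be combined in one job; the table name, paths, and mapper classes are hypothetical placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MixedSourceJob {

  // Mapper for the HDFS side: TextInputFormat hands us (offset, line).
  public static class HdfsMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(value, NullWritable.get());
    }
  }

  // Mapper for the HBase side: TableInputFormat hands us (row key, Result).
  public static class HBaseMapper
      extends Mapper<ImmutableBytesWritable, Result, Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(Bytes.toString(key.get())), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "mytable");  // hypothetical table
    // Same call Azuryy uses; in some HBase versions convertScanToString is
    // not public, in which case serialize the Scan yourself.
    conf.set(TableInputFormat.SCAN,
        TableMapReduceUtil.convertScanToString(new Scan()));

    Job job = new Job(conf, "mixed-source");
    job.setJarByClass(MixedSourceJob.class);

    // Real HDFS input: TextInputFormat splits and reads this path.
    MultipleInputs.addInputPath(job, new Path("/input/on/hdfs"),
        TextInputFormat.class, HdfsMapper.class);
    // The path here is a placeholder: TableInputFormat takes its splits
    // from the table's regions, not from this path.
    MultipleInputs.addInputPath(job, new Path("dummy"),
        TableInputFormat.class, HBaseMapper.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/output/mixed"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that only the HDFS path is real input; as Azuryy points out, the path passed for the HBase side is ignored, since TableInputFormat computes its splits from the table's regions.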

S. Zhou 2013-07-10, 17:15
Ted Yu 2013-07-10, 17:21
S. Zhou 2013-07-10, 17:55
Ted Yu 2013-07-10, 18:21
S. Zhou 2013-07-11, 22:44
Ted Yu 2013-07-11, 22:51
S. Zhou 2013-07-12, 04:49
Ted Yu 2013-07-12, 04:54
S. Zhou 2013-07-12, 05:19
S. Zhou 2013-07-12, 15:49
S. Zhou 2013-07-03, 15:18
Ted Yu 2013-07-03, 04:57
S. Zhou 2013-07-03, 15:17