HBase >> mail # user >> MapReduce job with mixed data sources: HBase table and HDFS files


S. Zhou 2013-07-03, 04:34
Azuryy Yu 2013-07-03, 05:06
S. Zhou 2013-07-03, 15:34
Michael Segel 2013-07-03, 21:19
Azuryy Yu 2013-07-04, 01:02
Ted Yu 2013-07-04, 04:29
S. Zhou 2013-07-04, 03:41
Re: MapReduce job with mixed data sources: HBase table and HDFS files
Actually you can, albeit it will be slower than you might think.

You'd have to do a single-threaded scan to pull the data from the remote cluster to the local cluster; once it's local, you can parallelize the HDFS M/R portion of the job.

Note: being able to do something doesn't mean it's going to be a good idea.

An alternative would be for the client to run an M/R job on the remote cluster which then writes to the second cluster. This will parallelize the initial scan.
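[Editor's note: a rough, untested sketch of that alternative, submitted on the remote (HBase-hosting) cluster so the scan is parallelized across regions there. All cluster addresses, table names, and class names below are hypothetical, not from this thread.]

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RemoteScanExport {

  // Map-only: emit each row key as text; replace with real extraction logic.
  static class ExportMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result row, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(key.get()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "remote-scan-export");
    job.setJarByClass(RemoteScanExport.class);
    TableMapReduceUtil.initTableMapperJob(
        "my_table", new Scan(),                      // hypothetical table, full scan
        ExportMapper.class, Text.class, NullWritable.class, job);
    job.setNumReduceTasks(0);                        // map-only copy
    job.setOutputFormatClass(TextOutputFormat.class);
    // Fully qualified path pointing at the *second* cluster's namenode,
    // so the output lands where the follow-up HDFS job will run.
    FileOutputFormat.setOutputPath(job,
        new Path("hdfs://second-cluster-nn:8020/staging/hbase-export"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```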

On Jul 3, 2013, at 8:02 PM, Azuryy Yu <[EMAIL PROTECTED]> wrote:

> Hi,
> 1) You cannot feed two different clusters' data into one MR job.
> 2) If your data is located in the same cluster, then:
>
>    conf.set(TableInputFormat.SCAN,
>        TableMapReduceUtil.convertScanToString(new Scan()));
>    conf.set(TableInputFormat.INPUT_TABLE, tableName);
>
>    MultipleInputs.addInputPath(conf, new Path(input_on_hdfs),
>        TextInputFormat.class, MapperForHdfs.class);
>    MultipleInputs.addInputPath(conf, new Path(input_on_hbase),
>        TableInputFormat.class, MapperForHBase.class);
>
> but note that new Path(input_on_hbase) can be any path; the path itself
> carries no meaning there.
>
> Please refer to org.apache.hadoop.hbase.mapreduce.IndexBuilder under
> $HBASE_HOME/src/examples for how to read a table in an MR job.
>
>
> On Thu, Jul 4, 2013 at 5:19 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> You may want to pull your data from HBase first in a separate map-only
>> job and then use its output along with the other HDFS input.
>> There is a significant disparity between read speeds from HDFS and from
>> HBase.
>>
>>
>> On Jul 3, 2013, at 10:34 AM, S. Zhou <[EMAIL PROTECTED]> wrote:
>>
>>> Azuryy, I am looking at the MultipleInputs doc, but I could not figure
>> out how to add an HBase table as a Path input. Do you have some sample
>> code? Thanks!
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Azuryy Yu <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]; S. Zhou <[EMAIL PROTECTED]>
>>> Sent: Tuesday, July 2, 2013 10:06 PM
>>> Subject: Re: MapReduce job with mixed data sources: HBase table and HDFS files
>>>
>>>
>>> Hi ,
>>>
>>> Use MultipleInputs, which can solve your problem.
>>>
>>>
>>> On Wed, Jul 3, 2013 at 12:34 PM, S. Zhou <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I know how to create a MapReduce job with an HBase data source only, or
>>>> with HDFS files as the data source. Now I need to create a MapReduce job
>>>> with mixed data sources, that is, an MR job that needs to read data from
>>>> both HBase and HDFS files. Is it possible? If yes, could you share some
>>>> sample code?
>>>>
>>>> Thanks!
>>>> Senqiang
>>
>>
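[Editor's note: Azuryy's fragments above can be assembled into a complete driver using the org.apache.hadoop.mapreduce API, where MultipleInputs.addInputPath takes a Job rather than a Configuration. This is a minimal, untested sketch; the table name, paths, and mapper bodies are placeholders, not from the thread.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MixedSourceJob {

  // Mapper for the HDFS text input: passes each line through.
  public static class MapperForHdfs
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(line, NullWritable.get());
    }
  }

  // Mapper for the HBase table input: emits each row key.
  public static class MapperForHBase
      extends Mapper<ImmutableBytesWritable, Result, Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result row, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(key.get()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Tell TableInputFormat which table and Scan to use; it gets its input
    // from the configuration, not from the Path.
    conf.set(TableInputFormat.INPUT_TABLE, "my_table");     // hypothetical
    conf.set(TableInputFormat.SCAN,
        TableMapReduceUtil.convertScanToString(new Scan()));

    Job job = Job.getInstance(conf, "mixed-source");
    job.setJarByClass(MixedSourceJob.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    // HDFS side: a real path that TextInputFormat will read.
    MultipleInputs.addInputPath(job, new Path("/input/on/hdfs"),  // hypothetical
        TextInputFormat.class, MapperForHdfs.class);
    // HBase side: MultipleInputs requires a Path, but TableInputFormat
    // ignores it, so any placeholder works (Azuryy's point above).
    MultipleInputs.addInputPath(job, new Path("/ignored"),
        TableInputFormat.class, MapperForHBase.class);

    FileOutputFormat.setOutputPath(job, new Path("/output")); // hypothetical
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```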
S. Zhou 2013-07-10, 17:15
Ted Yu 2013-07-10, 17:21
S. Zhou 2013-07-10, 17:55
Ted Yu 2013-07-10, 18:21
S. Zhou 2013-07-11, 22:44
Ted Yu 2013-07-11, 22:51
S. Zhou 2013-07-12, 04:49
Ted Yu 2013-07-12, 04:54
S. Zhou 2013-07-12, 05:19
S. Zhou 2013-07-12, 15:49
S. Zhou 2013-07-03, 15:18
Ted Yu 2013-07-03, 04:57
S. Zhou 2013-07-03, 15:17