Request for suggestions
srinivasrajagopalan@... 2012-11-26, 18:54
Hi,
We have a scenario where we want a single Hadoop job to create/manage multiple mapper tasks, where each mapper task will query a subset of columns in a relational database table. We looked into DataDrivenDBInputFormat, but that only seems to facilitate partitioning where each mapper task queries a subset of rows in a relational database table.
I am not sure if Pig can help us in this case.
Appreciate any suggestions in this regard.
Thanks srinivas
Re: Request for suggestions
Jonathan Coveney 2012-11-26, 21:14
Can you flesh out what you want it to do a little more? Maybe some example queries?

2012/11/26 <[EMAIL PROTECTED]>
> Hi,
>
> We have a scenario where we want a single Hadoop job to create/manage
> multiple mapper tasks where each mapper task will query a subset of columns
> in a relational database table. We looked into DataDrivenDBInputFormat, but
> that only seems to facilitate partitioning where each mapper task can query
> a subset of rows in a relational database table.
>
> I am not sure if Pig can help us in this case.
>
> Appreciate any suggestions in this regard.
>
> Thanks
> srinivas
Re: Request for suggestions
srinivasrajagopalan@... 2012-11-26, 21:54
Let us say we have a relational table with 100 columns and 100,000 rows of data.
Using the DataDrivenDBInputFormat class, we were able to provide the min and max ids (say 1 and 100000) and let Hadoop spin off as many mapper tasks as needed, with each task handling a subset of the data (that is, some number of rows). In other words, partitioning based on rows.
But we also want to partition the columns, so that a single Hadoop job can spin off, say, 20 mapper tasks where each mapper task works with 5 columns of data. In other words, partitioning based on columns.
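A minimal sketch of the partitioning described above, written in plain Python rather than against the Hadoop API. It only computes the per-task queries that a custom InputFormat could hand out as splits; the table name `my_table`, the key column `id`, the column names `col1`..`col100`, and the helper functions are all hypothetical, and the `BETWEEN` clause is one possible way to express a row range, not necessarily what DataDrivenDBInputFormat generates.

```python
def column_groups(columns, group_size):
    """Split the column list into consecutive groups of at most group_size."""
    return [columns[i:i + group_size] for i in range(0, len(columns), group_size)]


def row_ranges(min_id, max_id, num_ranges):
    """Split the inclusive id range [min_id, max_id] into contiguous sub-ranges."""
    total = max_id - min_id + 1
    step = -(-total // num_ranges)  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]


def plan_queries(table, key, columns, group_size, min_id, max_id, num_ranges):
    """Build one query per (column group, row range) pair.

    The key column is selected in every query so that the partial rows
    produced by different tasks can be matched up again later.
    """
    queries = []
    for group in column_groups(columns, group_size):
        for lo, hi in row_ranges(min_id, max_id, num_ranges):
            queries.append(
                "SELECT {}, {} FROM {} WHERE {} BETWEEN {} AND {}".format(
                    key, ", ".join(group), table, key, lo, hi))
    return queries


# Hypothetical table: 100 non-key columns, ids 1..100000.
columns = ["col{}".format(i) for i in range(1, 101)]
queries = plan_queries("my_table", "id", columns, 5, 1, 100000, 4)

print(len(queries))  # 20 column groups x 4 row ranges
print(queries[0])
```

With `num_ranges` set to 1 this reduces to the pure column partitioning case: 20 queries, each covering 5 columns over all rows.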
If we were to use Cassandra (rather than a relational table) with Hadoop, it provides a class called ColumnFamilyInputFormat, which seems to offer what we are looking for. I am not 100% sure though. I hope this is clear.
regards,
srinivas

________________________________
From: Jonathan Coveney <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
Sent: Monday, November 26, 2012 1:14 PM
Subject: Re: Request for suggestions

Can you flesh out what you want it to do a little more? Maybe some example queries?

2012/11/26 <[EMAIL PROTECTED]>
> Hi,
>
> We have a scenario where we want a single Hadoop job to create/manage multiple mapper tasks where each mapper task will query a subset of columns in a relational database table. We looked into DataDrivenDBInputFormat, but that only seems to facilitate partitioning where each mapper task can query a subset of rows in a relational database table.
>
> I am not sure if Pig can help us in this case.
>
> Appreciate any suggestions in this regard.
>
> Thanks
> srinivas