Pig >> mail # user >> Request for suggestions


srinivasrajagopalan@... 2012-11-26, 18:54
Jonathan Coveney 2012-11-26, 21:14
Re: Request for suggestions
Let us say we have a relational table with 100 columns and 100000 rows of data.

Using the DataDrivenDBInputFormat class, we were able to provide the min and max ids (say 1 and 100000) and let Hadoop spin off as many mapper tasks as needed, with each task handling a subset of the rows, i.e. partitioning based on rows.
But we also want to partition the columns, so that a single Hadoop job can spin off, say, 20 mapper tasks where each mapper task works with 5 columns of data, i.e. partitioning based on columns.
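To make the column partitioning concrete, here is a minimal sketch of the grouping step only, assuming the column names are known up front. The class `ColumnGroupPlanner`, its methods, and the names `my_table`, `id`, and `col1`..`col100` are all hypothetical placeholders and not part of Hadoop; wiring each group into a custom InputFormat or InputSplit is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups column names so that each
// group could be assigned to one mapper task.
final class ColumnGroupPlanner {

    // Splits `columns` into consecutive groups of at most `groupSize` names.
    static List<List<String>> groupColumns(List<String> columns, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < columns.size(); start += groupSize) {
            int end = Math.min(start + groupSize, columns.size());
            groups.add(new ArrayList<>(columns.subList(start, end)));
        }
        return groups;
    }

    // Builds the query one mapper would run for its column group. The key
    // column is always selected so rows can be matched up again later.
    static String buildQuery(String table, String keyColumn, List<String> group) {
        return "SELECT " + keyColumn + ", " + String.join(", ", group)
                + " FROM " + table;
    }

    public static void main(String[] args) {
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            columns.add("col" + i); // placeholder column names (assumption)
        }
        List<List<String>> groups = groupColumns(columns, 5);
        System.out.println(groups.size() + " column groups");
        System.out.println(buildQuery("my_table", "id", groups.get(0)));
    }
}
```

With 100 columns and a group size of 5, this produces 20 groups, one per mapper task. Each generated query could in principle also carry a row-range condition on the id column, so that row and column partitioning are combined.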

If we were to use Cassandra (rather than a relational table) with Hadoop, it provides a class called ColumnFamilyInputFormat, which seems to offer what we are looking for, though I am not 100% sure.
Hope that is clear.

regards,
srinivas
________________________________
 From: Jonathan Coveney <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
Sent: Monday, November 26, 2012 1:14 PM
Subject: Re: Request for suggestions
 

Can you flesh out what you want it to do a little more? Maybe some example queries?

2012/11/26 <[EMAIL PROTECTED]>

>Hi,
>
>We have a scenario where we want a single Hadoop job to create/manage multiple mapper tasks where each mapper task will query a subset of columns in a relational database table. We looked into DataDrivenDBInputFormat, but that only seems to facilitate partitioning where each mapper task can query a subset of rows in a relational database table.
>
>I am not sure if Pig can help us in this case.
>
>Appreciate any suggestions in this regard.
>
>Thanks
>srinivas