Re: [jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage
What about Option A, but returning a map instead of a tuple of tuples?
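
To make the suggestion concrete, here is a minimal sketch that compares the two shapes. It uses plain Python tuples and dicts as stand-ins for Pig tuples and maps. The helper names and the sample row are hypothetical and are not taken from the patch.

```python
# Hypothetical sketch: Python stand-ins for Pig tuples and maps.
# The row contents are invented for illustration only.

def row_as_pairs(cells):
    """Option A: a tuple of (column descriptor, most recent value) two-tuples."""
    return tuple((name, value) for name, value in sorted(cells.items()))


def row_as_map(cells):
    """Map variant: the same data keyed by column descriptor."""
    return dict(cells)


# A sparse row: only two of the four cpu columns are present.
cells = {"cpu:sys.0": "12", "cpu:user.1": "40"}

pairs = row_as_pairs(cells)
as_map = row_as_map(cells)

# With pairs, finding one column means scanning the tuple.
sys0_from_pairs = next((v for k, v in pairs if k == "cpu:sys.0"), None)

# With a map, it is a direct lookup, and an absent column yields None.
sys0_from_map = as_map.get("cpu:sys.0")
missing = as_map.get("cpu:sys.1")
```

With a map, the sparse-table concern largely goes away, because each value is addressed by its column name rather than by position.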

Sent from my iPhone

On Jan 27, 2011, at 5:01 PM, "Bill Graham (JIRA)" <[EMAIL PROTECTED]> wrote:

>
>    [ https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987839#action_12987839 ]
>
> Bill Graham commented on PIG-1782:
> ----------------------------------
>
> Assigning this to myself, since I've got a working patch, but the design of this approach needs further vetting.
>
> One issue is that the number of columns per family per row is not constant, so with a sparse table you'd have no idea which column name goes with each value in the returned tuple. Another issue is that in HBase the column name is often dynamic, descriptive data, and there can be multiple timestamped values for a cell.
>
> * Option A:
> Instead of returning a tuple of values, the load can return a tuple of tuples. Each inner tuple is a two-tuple containing the column descriptor and the most recent value. This data structure would be returned if a 'cf:' style column exists in the column list; the default behavior would remain for explicit column names. This is the simplest approach.
>
> * Option B:
> Build out an even richer (and more complex) data structure that also takes into account multiple values and their timestamps: a tuple of tuples of tuples of tuples that captures the entire HBase KeyValue data structure. Something like this:
>
> {code}
> (
> ( column name, ( (value, ts), ... ) ), ...
> )
> {code}
>
> Either way, the variable-length tuples returned for each row, which themselves contain variable-length tuples, would probably require a number of custom UDFs to do anything useful with variable column names and multiple timestamped values.
>
> I guess I lean towards option B so we can support more use cases down the road with this refactor. Other opinions?
>
>> Add ability to load data by column family in HBaseStorage
>> ---------------------------------------------------------
>>
>>                Key: PIG-1782
>>                URL: https://issues.apache.org/jira/browse/PIG-1782
>>            Project: Pig
>>         Issue Type: New Feature
>>        Environment: Java 6, Mac OS X 10.6
>>           Reporter: Eric Yang
>>           Assignee: Bill Graham
>>
>> It would be nice to load all columns in a column family by using shorthand syntax like:
>> {noformat}
>> CpuMetrics = load 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
>> {noformat}
>> Assuming the cpu column family contains the columns cpu:sys.0, cpu:sys.1, cpu:user.0, and cpu:user.1,
>> CpuMetrics would contain something like:
>> {noformat}
>> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
>> {noformat}
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
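
For comparison, Option B's nested shape from the quoted comment can be modelled the same way. This is only a sketch under stated assumptions: the sample values and timestamps are invented, and `latest_values` is a hypothetical helper standing in for the kind of custom UDF the comment anticipates.

```python
# Hypothetical sketch of Option B: ((column name, ((value, ts), ...)), ...)
# modelled with Python tuples. Values and timestamps are made up.

def latest_values(row):
    """Collapse each column's (value, ts) versions to the newest value."""
    return {
        name: max(versions, key=lambda vt: vt[1])[0]
        for name, versions in row
        if versions  # skip columns with no versions
    }


row = (
    ("cpu:sys.0", (("11", 1000), ("12", 2000))),
    ("cpu:user.1", (("40", 1500),)),
)

# Reduces Option B's structure to the "most recent value" view of Option A,
# expressed here as a map keyed by column name.
latest = latest_values(row)
```

Collapsing the versions this way gives the same information as Option A, which suggests that the richer structure can serve both use cases at the cost of an extra step.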