Re: HBaseStorage improvements
This is awesome, thanks for taking charge on this Christoph. HBaseStorage
has grown to the point that it can use some love. A few comments:

- Yes, HBaseStorage has gotten too big and we need to think about how to
break it up a bit. Maybe we make it into a composite or we break the
storage/loader parts out somehow, but the changes will need to be backwards
compatible. I think this work should be done in its own JIRA without any
new functionality.

- There has been discussion in the past about supporting returning multiple
versions of a cell with timestamps. The thought was that this would produce
a different schema and would be in a new storage/loader class. The idea is
that you'd get one row per rk, but each descriptor field would have a tuple
of two-tuples (ts, value). Would this work for your needs instead of
producing multiple rows per rk? Producing multiple rows per rk would
require some tricky grouping to get specific fields out, especially if the
cell values don't share common timestamps. If we had one tuple per rk, that
would lend itself to UDFs that could operate on each field's cell values.
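
A rough sketch of the one-tuple-per-rk shape described above, in plain Python rather than Pig. The sample cells, column names, and helper names are all made up for illustration; this is not HBaseStorage code, only a model of the proposed schema in which each field holds a tuple of (ts, value) two-tuples.

```python
from collections import defaultdict

# Hypothetical cell versions: (rowkey, column, timestamp, value).
cells = [
    ("rk1", "cf:a", 100, "a-old"),
    ("rk1", "cf:a", 200, "a-new"),
    ("rk1", "cf:b", 150, "b-only"),
    ("rk2", "cf:a", 120, "a2"),
]


def one_tuple_per_rowkey(cells, columns):
    """Map each rowkey to one tuple with one field per requested column.

    Each field is a tuple of (ts, value) pairs, newest first. A column
    with no cells for that rowkey becomes an empty tuple.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for rk, col, ts, value in cells:
        grouped[rk][col].append((ts, value))
    return {
        rk: tuple(
            tuple(sorted(by_col.get(col, []), reverse=True)) for col in columns
        )
        for rk, by_col in grouped.items()
    }


def latest_value(versions):
    """Example of a per-field 'UDF': pick the newest value, or None."""
    return versions[0][1] if versions else None


rows = one_tuple_per_rowkey(cells, ["cf:a", "cf:b"])
```

Note that cf:a and cf:b in rk1 do not share any timestamp, yet each field can still be processed on its own (for example with latest_value) without any grouping across rows.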

- What you're proposing re the snapshot functionality is great, but I think
the syntax is a bit confusing. The term 'snapshot' might mean different
things to different people, but if we talk in terms of cell timestamps I
think it will make the implied functionality very clear. Also, speaking in
terms of greater than or less than helps. This would also align with the
current syntax. I'm thinking of options like this:

-cellTsLt
-cellTsGt
-cellTsEquals
-cellTsLte
-cellTsGte
-cellLimit

Would that work for your use case? That would allow us to support returning
multiple cell versions or just one with that syntax. cellLimit would
default to 1, but you could set it > 1 to get back multiple versions of a
cell.
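
To make the intended semantics concrete, here is a small Python model of how the proposed options might combine. It is only a sketch: the function and keyword names are hypothetical stand-ins for -cellTsLt, -cellTsGt, -cellTsEquals, -cellTsLte, -cellTsGte and -cellLimit, and keeping the newest versions first under the limit is an assumption, not something settled in this thread.

```python
import operator

# Hypothetical keyword names standing in for the proposed options.
_BOUNDS = {
    "ts_lt": operator.lt,      # -cellTsLt
    "ts_gt": operator.gt,      # -cellTsGt
    "ts_equals": operator.eq,  # -cellTsEquals
    "ts_lte": operator.le,     # -cellTsLte
    "ts_gte": operator.ge,     # -cellTsGte
}


def select_versions(versions, cell_limit=1, **bounds):
    """Filter (ts, value) pairs by timestamp bounds, then apply the limit.

    All given bounds must hold. Versions are considered newest first
    (an assumption), and at most cell_limit of them are returned.
    """
    for name in bounds:
        if name not in _BOUNDS:
            raise TypeError(f"unknown option: {name}")
    newest_first = sorted(versions, key=lambda cell: cell[0], reverse=True)
    kept = [
        cell
        for cell in newest_first
        if all(_BOUNDS[name](cell[0], bound) for name, bound in bounds.items())
    ]
    return kept[:cell_limit]


versions = [(100, "v1"), (200, "v2"), (300, "v3")]
latest = select_versions(versions)
before_300 = select_versions(versions, ts_lt=300)
window = select_versions(versions, ts_gt=100, ts_lte=300, cell_limit=5)
```

With the default limit of 1, a "snapshot as of time T" read is simply ts_lte=T (or ts_lt=T), and raising cell_limit returns several versions from the same window.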

Let me know what you think.

thanks,
Bill

On Thu, Nov 8, 2012 at 7:18 AM, Christoph Bauer <[EMAIL PROTECTED]> wrote:

> Hi,
>
> here at postdirekt we have need for a lot more timestamp handling in
> HBaseStorage than there is. We're starting on a patch to Pig.
>
> I think there are many people out there who would welcome those changes and
> we are willing to pass that patch on to the community if it is desired.
>
> So there is a short proposal here:
>
> https://cwiki.apache.org/confluence/display/PIG/HBaseStorage+Timestamp+Extensions
>
> We're also open to other changes. So please reply.
>
>
> I have a question:
> HBaseStorage is getting really big and could do with splitting up into
> smaller parts to make it readable again. Would this require a patch on its
> own?
>
> regards,
> Christoph Bauer
>

--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*