

edward choi 2011-10-04, 05:58
Jean-Daniel Cryans 2011-10-06, 17:49
Re: Adjusting column value size.
Yes, I need all of those ints at the same time. And no, there is no
streaming.

I have decided to pack 1024 ints into one cell, so that each cell is
4 KB in size.
I am already using LZO on my tables.
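
For illustration, the packing described above could look roughly like the following. This is only a minimal sketch, not the code actually used in this thread: the class name `IntCellCodec` and the constant `INTS_PER_CELL` are made up for the example, and it assumes big-endian 4-byte ints (the default byte order of `java.nio.ByteBuffer`). The resulting `byte[]` would be what gets stored as the cell value.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: packs ints into one cell value and splits them back out.
public class IntCellCodec {
    // Chunk size from the message above: 1024 ints * 4 bytes = 4096 bytes per cell.
    public static final int INTS_PER_CELL = 1024;

    /** Concatenates the ints into one byte[] (big-endian, 4 bytes each). */
    public static byte[] pack(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Splits a packed cell value back into ints. */
    public static int[] unpack(byte[] cell) {
        if (cell.length % Integer.BYTES != 0) {
            throw new IllegalArgumentException("cell length is not a multiple of 4");
        }
        int[] values = new int[cell.length / Integer.BYTES];
        ByteBuffer.wrap(cell).asIntBuffer().get(values);
        return values;
    }

    public static void main(String[] args) {
        int[] chunk = new int[INTS_PER_CELL];
        for (int i = 0; i < chunk.length; i++) {
            chunk[i] = i * 7 - 3; // arbitrary sample data
        }
        byte[] cell = pack(chunk);
        System.out.println(cell.length);                        // 4096
        System.out.println(Arrays.equals(chunk, unpack(cell))); // true
    }
}
```

The reader pays one pass over each 4 KB value to rebuild the ints, which is the "additional process" of approach 2).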

I'll do some experiments once I finish implementing both approaches.
I'll post the results in a new thread when I am done.
Thanks for the advice.

Ed.

2011/10/7 Jean-Daniel Cryans <[EMAIL PROTECTED]>

> (BCC'd common-user@ since this seems strictly HBase related)
>
> Interesting question... And you probably need all those ints at the same
> time right? No streaming? I'll assume no.
>
> So the second solution seems better due to the overhead of storing each
> cell. Basically, storing one int per cell you would end up storing more
> keys than values (size wise).
>
> Another thing is that if you pack enough ints together and there's some
> sort of repetition, you might be able to use LZO compression on that table.
>
> I'd love to hear about your experimentations once you've done them.
>
> J-D
>
> On Mon, Oct 3, 2011 at 10:58 PM, edward choi <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I have a question about performance and column value size.
> > I need to store several million integers per row. ("Several million" is
> > important here.)
> > I was wondering which method would be better performance-wise.
> >
> > 1) Store each integer in its own column, so that when a row is read,
> > several million columns are read with it. The user would then map each
> > column value to some kind of container (e.g. vector, ArrayList).
> > 2) Store, for example, a thousand integers in a single column (by
> > concatenating them), so that when a row is read, only several thousand
> > columns are read with it. The user would have to split the column value
> > into 4-byte pieces and map each integer to some kind of container (e.g.
> > vector, ArrayList).
> >
> > I am curious which approach would be better. 1) would read several
> > million columns but needs no additional processing. 2) would read only
> > several thousand columns but needs additional processing.
> > Any advice would be appreciated.
> >
> > Ed
> >
>
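
J-D's point about storing more keys than values can be put in rough numbers. The sketch below assumes the classic HBase KeyValue layout, in which every cell carries its own copy of the row key, family, qualifier, timestamp and type alongside the value. The row, family and qualifier lengths and the total number of ints are made-up figures for illustration, and the estimate ignores compression and any other on-disk encoding.

```java
// Hypothetical back-of-the-envelope estimate of uncompressed cell sizes.
public class CellOverheadEstimate {
    // Fixed bytes per cell in the assumed KeyValue layout: key length (4)
    // + value length (4) + row length (2) + family length (1)
    // + timestamp (8) + key type (1).
    public static final int FIXED_BYTES = 4 + 4 + 2 + 1 + 8 + 1;

    /** Approximate size in bytes of one cell, key and value together. */
    public static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_BYTES + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        long ints = 4_096_000L;  // assumed "several million" ints: 4000 cells of 1024
        int rowLen = 16;         // assumed row key length
        int familyLen = 1;       // assumed column family name length
        int qualifierLen = 4;    // assumed qualifier length (e.g. a 4-byte index)

        // Approach 1: one 4-byte int per cell.
        long onePerCell = ints * cellSize(rowLen, familyLen, qualifierLen, 4);
        // Approach 2: 1024 ints (4096 bytes) per cell.
        long packed = (ints / 1024) * cellSize(rowLen, familyLen, qualifierLen, 4096);

        System.out.println(onePerCell); // 184320000
        System.out.println(packed);     // 16548000
    }
}
```

Under these assumptions each one-int cell is 45 bytes, of which only 4 are the value, while the packed layout stays within about 1% of the 16,384,000 bytes of raw data.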