RE: Best practices in sizing values?
So, what are your thoughts on storing a bunch of small files on HDFS?  SequenceFiles, Avro?
I will note that these are essentially write-once, read-heavy chunks of text.
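For what it's worth, here is a minimal sketch of the SequenceFile option: pack many small, write-once text chunks into one file keyed by a document ID. The output path, key names, and sample text are hypothetical placeholders, not anything from this thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical output path; one SequenceFile holds many small text chunks.
            Path out = new Path("/data/text-chunks.seq");

            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
            try {
                // Key: a chunk/document ID; value: the raw text bytes.
                writer.append(new Text("doc-00001"),
                              new BytesWritable("some small block of text".getBytes("UTF-8")));
            } finally {
                writer.close();
            }
        }
    }

Reading the packed chunks back is then a scan over key/value pairs rather than many small-file opens, which is the usual motivation for SequenceFiles in this situation.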

> Date: Sun, 9 Jun 2013 17:08:42 -0400
> Subject: Re: Best practices in sizing values?
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
> At the very least, I would keep it under the size of your compressed
> data blocks in your RFiles (this may mean you should increase the value
> of table.file.compress.blocksize to be larger than the default of 100K).
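For reference, a rough sketch of raising that setting through the Java client API; the table name and the "1M" value are illustrative assumptions, not recommendations from this thread.

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;

    public class BlockSizeTuning {
        // Raise the compressed block size for one table; "1M" is only an example value.
        static void raiseBlockSize(Connector connector, String table)
                throws AccumuloException, AccumuloSecurityException {
            connector.tableOperations().setProperty(table,
                    "table.file.compress.blocksize", "1M");
        }
    }

The same property can also be set per table from the Accumulo shell's config command.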
>
> You could also tune this according to your application. Say, for
> example, you wanted to incur the additional work of resolving the
> pointer and retrieving from HDFS for only 5% of your reads; you could
> sample your data and choose a cutoff value that keeps 95% of your data
> in the Accumulo table.
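As an illustration of that sampling idea (the method and variable names below are made up for the sketch): sort a sample of value sizes and take the size at the chosen fraction as the inline-storage cutoff.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class CutoffChooser {
        // Given a sample of value sizes in bytes, return a cutoff size such that
        // roughly keepFraction (e.g. 0.95) of the sampled values fall at or below it.
        static long chooseCutoff(List<Long> sampledSizes, double keepFraction) {
            List<Long> sorted = new ArrayList<Long>(sampledSizes);
            Collections.sort(sorted);
            int index = (int) Math.ceil(keepFraction * sorted.size()) - 1;
            if (index < 0) index = 0;
            return sorted.get(index);
        }
    }

Values at or below the returned cutoff would stay in the Accumulo value; larger ones would go to HDFS with a pointer in the table.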
>
> Personally, I like to keep things under 1MB in the value, and under 1K
> in the key, as a crude rule of thumb, but it very much depends on the
> application.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith <[EMAIL PROTECTED]> wrote:
> > I have an application where I have a block of unstructured text.  Normally
> > that text is relatively small (<500 KB), but there are conditions where it can
> > be up to GBs of text.
> >
> > I was considering using a threshold where I simply switch from storing
> > the text in the value of my mutation to storing just a reference to its
> > HDFS location, but I wanted to get some advice on where that threshold
> > should (best practice) or must (system limitation) be?
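A hedged sketch of that threshold approach, for illustration only: if the text exceeds the cutoff, write it to HDFS and store only the path in the Accumulo value; otherwise store the bytes inline. The 1 MB threshold, column names, and HDFS layout are all hypothetical.

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;

    public class TextStore {
        static final long THRESHOLD = 1000000L;  // hypothetical ~1 MB cutoff

        static void store(BatchWriter writer, FileSystem fs, String rowId, byte[] text)
                throws Exception {
            Mutation m = new Mutation(new Text(rowId));
            if (text.length > THRESHOLD) {
                // Large value: write the text to HDFS and keep only a pointer in Accumulo.
                Path p = new Path("/text/" + rowId);  // hypothetical HDFS layout
                FSDataOutputStream out = fs.create(p);
                try {
                    out.write(text);
                } finally {
                    out.close();
                }
                m.put(new Text("content"), new Text("hdfsPath"),
                      new Value(p.toString().getBytes("UTF-8")));
            } else {
                // Small value: store the text inline in the Accumulo value.
                m.put(new Text("content"), new Text("text"), new Value(text));
            }
            writer.addMutation(m);
        }
    }

Readers would then check which column is present and, for the pointer case, open the HDFS path to read the text.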
> >
> > Also, can I stream data into a value, rather than passing a byte array?  Similar to
> > how CLOBs and BLOBs are handled in an RDBMS.
> >
> > Thanks,
> >
> > Frank