The HDFS namenode can struggle with many small files, because it keeps
the metadata for every file in memory, but whether that matters depends
on how many files you're talking about. If that number does become
problematic for HDFS, you could consider a chunking strategy: break up
files larger than your threshold and store the chunks in Accumulo.
(In response to your first post: Accumulo cannot stream content into
values, but chunking could achieve a similar result.)
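
As a rough illustration of the chunking idea, here is a minimal sketch
in plain Java (no Accumulo dependency). The names ValueChunker and
ChunkSink, and the zero-padded qualifier scheme, are hypothetical and
only there for illustration; the sink is where you would add each chunk
to a Mutation for your row. It reads from a stream, so the whole text
never has to sit in memory at once.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical helper: splits a stream into fixed-size chunks. */
final class ValueChunker {

    /** Hypothetical callback; an implementation could write each chunk to Accumulo. */
    interface ChunkSink {
        void accept(String qualifier, byte[] data);
    }

    /**
     * Reads the stream and hands chunks of at most chunkSize bytes to the sink.
     * Qualifiers are zero-padded ("00000000", "00000001", ...) so that
     * lexicographic key order matches chunk order.
     * Returns the number of chunks emitted.
     */
    static int chunk(InputStream in, int chunkSize, ChunkSink sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buf = new byte[chunkSize];
        int index = 0;
        try {
            while (true) {
                // Fill the buffer completely unless the stream ends first.
                int filled = 0;
                while (filled < chunkSize) {
                    int n = in.read(buf, filled, chunkSize - filled);
                    if (n < 0) {
                        break; // end of stream
                    }
                    filled += n;
                }
                if (filled == 0) {
                    break; // nothing left to emit
                }
                // Copy so the sink owns its bytes; buf is reused next round.
                sink.accept(String.format("%08d", index), Arrays.copyOf(buf, filled));
                index++;
                if (filled < chunkSize) {
                    break; // short chunk means the stream is exhausted
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }
}
```

Reading the value back is then a scan over that row's chunk columns,
concatenating the values in order. The chunk size itself is an
assumption you would tune, for example to stay under your compressed
block size.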
Christopher L Tubbs II
On Sun, Jun 9, 2013 at 8:21 PM, Frank Smith <[EMAIL PROTECTED]> wrote:
> So, what are your thoughts on storing a bunch of small files on the HDFS?
> Sequence Files, Avro?
> I will note that these are essentially write once and read heavy chunks of
>> Date: Sun, 9 Jun 2013 17:08:42 -0400
>> Subject: Re: Best practices in sizing values?
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>> At the very least, I would keep it under the size of your compressed
>> data blocks in your RFiles (this may mean you should increase the value
>> of table.file.compress.blocksize to be larger than the default of 100K).
>> You could also tune this to your application. Say, for example, you
>> wanted to pay the extra cost of resolving the pointer and retrieving
>> from HDFS only 5% of the time: you could sample your data and choose
>> a cutoff value that keeps 95% of your data in the Accumulo table.
>> Personally, I like to keep things under 1MB in the value, and under 1K
>> in the key, as a crude rule of thumb, but it very much depends on the
>> Christopher L Tubbs II
>> On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith <[EMAIL PROTECTED]> wrote:
>> > I have an application where I have a block of unstructured text. Normally
>> > that text is relatively small (<500k), but there are conditions where it
>> > can be up to GBs of text.
>> > I was considering using a threshold at which I stop storing the text in
>> > the value of my mutation and instead just add a reference to the HDFS
>> > location, but I wanted to get some advice on where that threshold should
>> > (best practice) or must (system limitation) be.
>> > Also, can I stream data into a value, rather than passing a byte array?
>> > Similar to how CLOBs and BLOBs are handled in an RDBMS.
>> > Thanks,
>> > Frank