I don't know that having the values being 128M chunks would make much
difference if you still need to reassemble the chunk at a later time.
The data is going to be stored in chunks smaller than that (unless the
size of the data when it's stored in HDFS is less than the block size),
meaning that you'll probably have a longer access time in storing it
in chunks than if you store it as one value (where the splits will be
On Mon, Apr 1, 2013 at 10:55 AM, Josh Elser <[EMAIL PROTECTED]> wrote:
> Ignoring the actual size constraint necessary (I'm not entirely sure how
> that all adds up; it would be affected by concurrent query load and many
> other things), placing the large chunk into the Key will affect the size of
> the index inside of RFile (the file construct actually backing the data in
> your table). This will increase your access times just to find the offset in
> the file for the Key you're looking for.
> Putting a chunk number in the Key and the actual data in the Value will
> probably net you much better results. Chunking into 128M should work with a
> 3G heap; however, I'd err on the cautious side and make many smaller chunks
> instead of a few very large chunks.
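
A minimal sketch of the layout suggested above, with the chunk number in the Key and the bytes in the Value. It uses plain Python and no Accumulo client calls; the names `chunk_entries` and `reassemble`, the `"chunk"` column family, and the 8-digit zero padding are all hypothetical choices made for illustration. The padding is there because keys sort lexicographically, so an unpadded "10" would sort before "2".

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical key shape for illustration: (row, column family, column qualifier).
KeyTuple = Tuple[str, str, str]


def chunk_entries(row: str, data: bytes, chunk_size: int) -> Iterator[Tuple[KeyTuple, bytes]]:
    """Yield (key, value) pairs, one per chunk, with the chunk number in the key."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        # Zero-pad the chunk number so lexicographic order matches numeric order.
        qualifier = f"{index:08d}"
        yield (row, "chunk", qualifier), data[offset:offset + chunk_size]


def reassemble(entries: Iterable[Tuple[KeyTuple, bytes]]) -> bytes:
    """Concatenate the values in key order, as a sorted scan would return them."""
    return b"".join(value for _, value in sorted(entries, key=lambda e: e[0]))
```

With this layout the keys stay small regardless of chunk size, and a 400M object cut into 128M pieces would produce four entries (three full chunks plus one of 16M). A smaller `chunk_size` just produces more entries with the same key shape.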
> On 4/1/13 10:33 AM, David Medinets wrote:
>> I have a chunk of data (let's say 400M) that I want to store in Accumulo.
>> I can store the chunk in the ColumnFamily or in the Value. Does it make any
>> difference to Accumulo which is used?
>> My tserver is set up to use -Xmx3g. What is the largest size that seems to
>> work? I have much more that I can allocate.
>> Or should I focus on breaking the data into smaller pieces ... say 128M