Accumulo, mail # user - Best practices in sizing values?


Re: Best practices in sizing values?
Billie Rinaldi 2013-06-10, 01:45
See also the filedata example of splitting a file into chunks in a similar
way to what Josh describes.
http://accumulo.apache.org/1.5/examples/filedata.html
There is more information about the table structure for this example under
Data Table in the dirlist example.
http://accumulo.apache.org/1.5/examples/dirlist.html
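
As a rough illustration of the chunked layout discussed in this thread, here is a minimal Python sketch. It does not use the Accumulo API and is not the filedata example's code. The function names, the "meta" column family, and the 64 KiB chunk size are all assumptions made up for illustration. It only shows how zero-padded qualifiers keep chunks in order and let you compute which chunk holds a given byte offset.

```python
# Hypothetical sketch: none of these names come from Accumulo or its examples.
CHUNK_SIZE = 64 * 1024  # assumed chunk size in bytes; tune by experiment


def chunk_entries(row, data, chunk_size=CHUNK_SIZE):
    """Return (row, family, qualifier, value) tuples for one document."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    entries = [
        # Per-document metadata, as suggested in the thread (assumed layout).
        (row, "meta", "chunk_size", str(chunk_size).encode()),
        (row, "meta", "num_chunks", str(len(chunks)).encode()),
    ]
    for n, chunk in enumerate(chunks, start=1):
        # Zero-padded qualifiers sort lexicographically in chunk order.
        entries.append((row, "data", f"{n:07d}", chunk))
    return entries


def qualifier_for_offset(offset, chunk_size=CHUNK_SIZE):
    """Return the qualifier of the chunk holding byte `offset`,
    plus the position of that byte inside the chunk."""
    return f"{offset // chunk_size + 1:07d}", offset % chunk_size
```

With the assumed 64 KiB chunks, a 150,000-byte document yields three data entries (0000001 through 0000003) plus two metadata entries, and byte offset 70000 falls in chunk 0000002. A reader could seek straight to that qualifier instead of scanning from the first chunk.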
On Sun, Jun 9, 2013 at 6:33 PM, Josh Elser <[EMAIL PROTECTED]> wrote:

> You would likely want to keep some common prefix in the key. This would
> make seeking to an arbitrary point in the file easier.
>
> e.g.
>
> doc1 data:0000001 [] _bytes_
> doc1 data:0000002 [] _bytes_
> doc1 data:0000003 [] _bytes_
>
> As for chunk size, Christopher's advice is probably better than anything
> I could offer without experimenting directly with the HDFS block size,
> Accumulo's table.file.compress.blocksize, and the size of each Value. The
> best choice for you likely depends on your usage patterns.
>
> You could even store additional metadata for each "document" you store,
> such as chunk size, number of chunks, etc. Accumulo's columns give you a
> lot of flexibility in how you approach this.
>
>
> On 06/09/2013 08:56 PM, Frank Smith wrote:
>
>> Josh,
>>
>> That is an interesting idea.  Would you link them through the keys, or
>> append the key to the end of the value of the previous part?
>>
>> Do you have thoughts on how big the chunks should be?
>>
>> I definitely agree that it would be better to keep the data in Accumulo
>> rather than references to HDFS.  Accumulo already gives me a scheme for
>> organizing files very effectively on HDFS, so rolling my own doesn't
>> make sense, unless I am underestimating the limits on a tablet server's
>> ability to manage those large files.
>>
>> Thanks,
>>
>> Frank
>>
>>  > Date: Sun, 9 Jun 2013 20:45:15 -0400
>>  > From: [EMAIL PROTECTED]
>>  > To: [EMAIL PROTECTED]
>>  > Subject: Re: Best practices in sizing values?
>>  >
>>  > One thing I wanted to add is that you will likely fare quite well
>>  > storing your very large files as a linked-list of bytes (multiple
>>  > key-value pairs make up one of your large blobs of text). You can even
>>  > use your segmentation of the large chunks of text to do more efficient
>>  > seeking within the file, if that applies to your application.
>>  >
>>  > I personally don't like the idea of storing HDFS URIs in
>>  > Accumulo. If you think about what Accumulo is providing you, one of the
>>  > things it's great at is abstracting away the notion of that underlying
>>  > filesystem. Just a thought.
>>  >
>>  > On 06/09/2013 08:21 PM, Frank Smith wrote:
>>  > > So, what are your thoughts on storing a bunch of small files on the
>>  > > HDFS? Sequence Files, Avro?
>>  > >
>>  > > I will note that these are essentially write once and read heavy
>>  > > chunks of text.
>>  > >
>>  > > > Date: Sun, 9 Jun 2013 17:08:42 -0400
>>  > > > Subject: Re: Best practices in sizing values?
>>  > > > From: [EMAIL PROTECTED]
>>  > > > To: [EMAIL PROTECTED]
>>  > > >
>>  > > > At the very least, I would keep it under the size of your
>>  > > > compressed data blocks in your RFiles (this may mean you should
>>  > > > increase the value of table.file.compress.blocksize to be larger
>>  > > > than the default of 100K).
>>  > > >
>>  > > > You could also tweak this according to your application. Say, for
>>  > > > example, you wanted to pay the additional cost of resolving the
>>  > > > pointer and retrieving from HDFS only 5% of the time. You could
>>  > > > sample your data and choose a cutoff value that keeps 95% of your
>>  > > > data in the Accumulo table.
>>  > > >
>>  > > > Personally, I like to keep things under 1MB in the value, and
>>  > > > under 1K in the key, as a crude rule of thumb, but it very much
>>  > > > depends on the application.
>>  > > >
>>  > > > --
>>  > > > Christopher L Tubbs II
>>  > > > http://gravatar.com/ctubbsii
>>  > > >
>>  > > >
>>  > > > On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith