Accumulo user mailing list: RE: EXTERNAL: Re: Large files in Accumulo


Re: EXTERNAL: Re: Large files in Accumulo
You can still index a 1GB file... you just shouldn't try to push it
all in a single mutation, nor should you try to store it using a
scheme that uses large keys.

You can even still store the whole raw file in Accumulo, particularly
if you chunk it up across multiple entries, but you may need a
2-stage lookup: you get intermediate results first, then issue another
query for the final result. It seems to me that this 2-stage lookup
would be simple enough to implement as a client-side tool, provided
you get the storage/indexes figured out.
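
As a minimal sketch of the chunked storage idea (illustrative only: the
ChunkedFileWriter class, the "chunk" column family, the zero-padded
chunk-index qualifier, and the 1MB CHUNK_SIZE are assumptions, not anything
prescribed in this thread), each chunk gets its own small Mutation so no
single mutation comes anywhere near the ~100MB guidance quoted below:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class ChunkedFileWriter {
  // Hypothetical chunk size; small enough that each mutation stays tiny.
  private static final int CHUNK_SIZE = 1 << 20; // 1MB

  // Stores a large file as many small entries:
  //   row = docId, cf = "chunk", cq = zero-padded chunk index, value = chunk bytes.
  // Each chunk is written as its own Mutation, so no single mutation is large.
  public static void writeChunks(BatchWriter writer, String docId, byte[] data)
      throws MutationsRejectedException {
    int numChunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
    for (int i = 0; i < numChunks; i++) {
      int start = i * CHUNK_SIZE;
      int len = Math.min(CHUNK_SIZE, data.length - start);
      byte[] chunk = new byte[len];
      System.arraycopy(data, start, chunk, 0, len);

      Mutation m = new Mutation(new Text(docId));
      m.put(new Text("chunk"), new Text(String.format("%08d", i)), new Value(chunk));
      writer.addMutation(m);
    }
  }
}

For files of several GB you would stream from the source (e.g. HDFS) rather
than hold the whole byte[] in memory, but the per-chunk mutation pattern is
the same; an index table can then map search terms to the docId used as the
row here.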

On Thu, Aug 23, 2012 at 5:34 PM, Cardon, Tejay E
<[EMAIL PROTECTED]> wrote:
> Thanks Eric,
>
> I was afraid that would be the case.  If I understand you correctly, putting
> a GB file into Accumulo would be a bad idea.  Given that fact, are there any
> strategies available to ensure that a given file in HDFS is co-located with
> the index info for that file in Accumulo? (I would assume not).  In my case,
> I could use Accumulo to store my indexes for fast query, but then have them
> return a URL/URI to the actual file.  However, I have to process each of
> those files further to get to my final result, and I was hoping to do the
> second stage of processing without having to return intermediate results.
> Am I correct in assuming that this can’t be done?
>
>
>
> Thanks,
>
> Tejay
>
>
>
> From: Eric Newton [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, August 23, 2012 3:06 PM
> To: [EMAIL PROTECTED]
> Subject: Re: EXTERNAL: Re: Large files in Accumulo
>
>
>
> An entire mutation needs to fit in memory several times, so you should not
> attempt to push in a single mutation larger than 100MB unless you have a
> lot of memory in your tserver/logger.
>
>
>
> And while I'm at it, large keys will create large indexes, so try to keep
> your (row,cf,cq,cv) under 100K.
>
>
>
> -Eric
>
> On Thu, Aug 23, 2012 at 4:37 PM, Cardon, Tejay E <[EMAIL PROTECTED]>
> wrote:
>
> In my case I’ll be doing a document based index store (like the wikisearch
> example), but my documents may be as large as several GB.  I just wanted to
> pick the collective brain of the group to see if I’m walking into a major
> headache.  If it’s never been tried before, then I’ll give it a shot and
> report back.
>
>
> Tejay
>
>
>
> From: William Slacum [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, August 23, 2012 2:07 PM
> To: [EMAIL PROTECTED]
> Subject: EXTERNAL: Re: Large files in Accumulo
>
>
>
> Are these RFiles as a whole? I know at some point HBase needed to have
> entire rows fit into memory; Accumulo does not have this restriction.
>
> On Thu, Aug 23, 2012 at 12:55 PM, Cardon, Tejay E <[EMAIL PROTECTED]>
> wrote:
>
> Alright, this one’s a quick question.  I’ve been told that HBase does not
> perform well if large (> 100MB) files are stored in it.  Does Accumulo have
> similar trouble?  If so, can it be overcome by storing the large files in
> their own locality group?
>
>
>
> Thanks,
>
> Tejay
>
>
>
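
And for reference, a minimal sketch of the 2-stage lookup discussed at the
top of the thread, assuming the hypothetical row = docId / "chunk" layout
from the write sketch above: stage 1 (not shown) queries an index table to
resolve search terms to docIds; stage 2 scans the chunk entries for one
docId and reassembles the bytes client-side.

import java.io.ByteArrayOutputStream;
import java.util.Map;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class ChunkedFileReader {
  // Stage 2 of the lookup: fetch all "chunk" entries for one document row and
  // concatenate them. Chunks sort by their zero-padded qualifier, so iteration
  // order is already the original byte order.
  public static byte[] readChunks(Scanner scanner, String docId) {
    scanner.setRange(new Range(new Text(docId)));
    scanner.fetchColumnFamily(new Text("chunk"));

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Map.Entry<Key, Value> entry : scanner) {
      byte[] bytes = entry.getValue().get();
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }
}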
>