Re: Wikisearch Performance Question
You should see much better ingest performance with decompressed input.
Hadoop will also 'naturally' handle the splits for you based on the HDFS
block size.
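
For reference, something along these lines (an untested sketch, not part of
the Wikisearch code; the paths are just placeholders) will decompress a
bzip2-ed dump already sitting in HDFS back out as plain text:

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths -- point these at your own dump.
        Path in = new Path("/wikipedia/enwiki-pages-articles.xml.bz2");
        Path out = new Path("/wikipedia/enwiki-pages-articles.xml");

        // Resolve the codec from the file extension (.bz2 -> BZip2Codec).
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(in);
        if (codec == null) {
            throw new IllegalArgumentException("No codec found for " + in);
        }

        // Stream-copy the decompressed bytes back into HDFS; the resulting
        // plain-text file gets split along HDFS block boundaries at ingest.
        try (InputStream is = codec.createInputStream(fs.open(in));
             OutputStream os = fs.create(out)) {
            IOUtils.copyBytes(is, os, conf, false);
        }
    }
}

Once the input is plain text, each HDFS block can become its own map task, so
the map-side parallelism follows from the file size and block size.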

On 5/21/13 2:35 PM, Patrick Lynch wrote:
> I think your description is accurate, except that I split the single
> archive into a much greater number of pieces than the number of
> different archives I ingested. Specifically, I set numGroups to a higher
> number; I didn't split the archive by hand in HDFS. The archives are
> bzip2-ed, not gzip-ed. Will decompressing still have the same benefit?
>
>
> -----Original Message-----
> From: Josh Elser <[EMAIL PROTECTED]>
> To: user <[EMAIL PROTECTED]>
> Sent: Tue, May 21, 2013 2:16 pm
> Subject: Re: Wikisearch Performance Question
>
> Let me see if I understand what you're asking: you took one mediawiki
> archive and split it into n archives of size 1/n the original. You then
> took n _different_ mediawiki archives and ingested those. You tried to
> get ingesting the split single archive to be as fast as ingesting the
> many different archives?
>
> Are you using gzip'ed input files? Have you tried just decompressing the
> gzip into plaintext? Hadoop will naturally split uncompressed text and
> give you nice balancing.
>
> I haven't looked at the ingest code in a long time. Not sure if it ever
> received much love.
>
> On 5/21/13 1:30 PM, Patrick Lynch wrote:
>> user@accumulo,
>>
>> I was working with the Wikipedia Accumulo ingest examples, and I was
>> trying to get the ingest of a single archive file to be as fast as
>> ingesting multiple archives through parallelization. I increased the
>> number of ways the job split the single archive so that all the servers
>> could work on ingesting at the same time. What I noticed, however, was
>> that having all the servers work on ingesting the same file was still
>> not nearly as fast as using multiple ingest files. I was wondering if I
>> could have some insight into the design of the Wikipedia ingest that
>> could explain this phenomenon.
>>
>> Thank you for your time,
>> Patrick Lynch
>
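
P.S. Regarding the gzip'ed-vs-plaintext question above: whether Hadoop will
split a given input file at all depends on its compression codec. Here is a
quick, untested sketch for checking that with the stock Hadoop codec
machinery (nothing Wikisearch-specific is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplitCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (String arg : args) {
            Path p = new Path(arg);
            CompressionCodec codec = factory.getCodec(p);
            // No codec (plain text) splits on HDFS block boundaries; gzip
            // never splits; bzip2 implements SplittableCompressionCodec in
            // recent Hadoop releases, but decompression is still CPU-heavy.
            boolean splittable = codec == null
                    || codec instanceof SplittableCompressionCodec;
            System.out.println(p + ": codec="
                    + (codec == null ? "none" : codec.getClass().getSimpleName())
                    + ", splittable=" + splittable);
        }
    }
}

Plain text and (in recent Hadoop releases) bzip2 are splittable, while gzip is
not; a single gzip'ed archive is therefore read by a single mapper no matter
how many servers are available.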