The files decompress remarkably fast, too. I seem to recall about 8 minutes
on our hardware.
I could not get MapReduce to split on blocks within the bzip2'd files.
That gave the job a long tail, since the English file is so much bigger.
Uncompressing the files is the way to go.
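To put rough numbers on the long tail, here is a minimal sketch in Python. It is not the wikisearch ingest code, and the block size and file sizes are made-up values for illustration. It also simplifies Hadoop's split logic to "one map task per HDFS block" for splittable input and "one map task per file" when the input is not split, which is the behaviour I saw with the bzip2 files.

```python
import math


def estimated_map_tasks(file_size_bytes, block_size_bytes, splittable):
    """Rough count of map tasks for a single input file.

    Simplification: a file that is not split is read by one map task,
    and a splittable file gets about one task per HDFS block.
    """
    if not splittable:
        return 1
    return max(1, math.ceil(file_size_bytes / block_size_bytes))


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical values, not measurements from this thread.
BLOCK_SIZE = 128 * MIB        # assumed HDFS block size
COMPRESSED_SIZE = 10 * GIB    # assumed size of one bzip2 archive
PLAINTEXT_SIZE = 40 * GIB     # assumed size of the same archive decompressed

unsplit_tasks = estimated_map_tasks(COMPRESSED_SIZE, BLOCK_SIZE, splittable=False)
split_tasks = estimated_map_tasks(PLAINTEXT_SIZE, BLOCK_SIZE, splittable=True)

print("map tasks, compressed and unsplit:", unsplit_tasks)
print("map tasks, decompressed:", split_tasks)
```

With one task for the whole archive, the job cannot finish before that single mapper has read the largest file from start to finish, no matter how many servers are idle. Decompressed, the same data is spread over many tasks.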
On Tue, May 21, 2013 at 2:58 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> You should see much better ingest performance with decompressed input.
> Hadoop will also 'naturally' handle the splits for you based on the HDFS
> block size.
> On 5/21/13 2:35 PM, Patrick Lynch wrote:
>> I think your description is accurate, except that I split the single
>> archive into a much greater number of pieces than the number of
>> different archives I ingested. Specifically, I set numGroups to a higher
>> number; I didn't split the archive by hand in HDFS. The archives are
>> bzip2-ed, not gzip-ed. Will decompressing still have the same benefit?
>> -----Original Message-----
>> From: Josh Elser <[EMAIL PROTECTED]>
>> To: user <[EMAIL PROTECTED]>
>> Sent: Tue, May 21, 2013 2:16 pm
>> Subject: Re: Wikisearch Performance Question
>> Let me see if I understand what you're asking: you took one mediawiki
>> archive and split it into n archives of size 1/n the original. You then
>> took n _different_ mediawiki archives and ingested those. You tried
>> to get the speed of ingesting many different archives to be as fast as
>> splitting an original single archive?
>> Are you using gzip'ed input files? Have you tried just decompressing the
>> gzip into plaintext? Hadoop will naturally split uncompressed text and
>> give you nice balancing.
>> I haven't looked at the ingest code in a long time. Not sure if it ever
>> received much love.
>> On 5/21/13 1:30 PM, Patrick Lynch wrote:
>>> I was working with the Wikipedia Accumulo ingest examples, and I was
>>> trying to get the ingest of a single archive file to be as fast as
>>> ingesting multiple archives through parallelization. I increased the
>>> number of ways the job split the single archive so that all the servers
>>> could work on ingesting at the same time. What I noticed, however, was
>>> that having all the servers work on ingesting the same file was still
>>> not nearly as fast as using multiple ingest files. I was wondering if I
>>> could have some insight into the design of the Wikipedia ingest that
>>> could explain this phenomenon.
>>> Thank you for your time,
>>> Patrick Lynch