Accumulo >> mail # user >> Wikisearch Performance Question


Patrick Lynch 2013-05-21, 17:30
Josh Elser 2013-05-21, 18:16
Christopher 2013-05-21, 18:11
Patrick Lynch 2013-05-21, 18:47
Patrick Lynch 2013-05-21, 18:35
Josh Elser 2013-05-21, 18:58
Eric Newton 2013-05-21, 19:54
Re: Wikisearch Performance Question
According to https://issues.apache.org/jira/browse/HADOOP-7823, it should
be possible to split bzip2 files in Hadoop 1.1.
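[Editor's note: a minimal sketch of a new-API MapReduce job that relies on the splittable bzip2 support referenced above (HADOOP-7823). The class name, mapper, and input path are illustrative, not from the thread; on Hadoop 1.1+ a single .bz2 input should fan out across several map tasks without extra configuration.]

// Sketch only: counts how a single .bz2 input is handled. With Hadoop 1.1+,
// BZip2Codec is splittable, so TextInputFormat should produce multiple map
// tasks for one large compressed file instead of a single long-running mapper.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class Bz2SplitCheck {

  // Trivial mapper that only counts lines; stands in for the real ingest mapper.
  public static class LineCounter extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx) {
      ctx.getCounter("demo", "lines").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "bz2-split-check");
    job.setJarByClass(Bz2SplitCheck.class);
    job.setMapperClass(LineCounter.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    // e.g. a wiki dump like /data/enwiki-pages-articles.xml.bz2 (hypothetical path)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // After the job finishes, the number of launched map tasks shows whether
    // the .bz2 file was actually split.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
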
On Tue, May 21, 2013 at 3:54 PM, Eric Newton <[EMAIL PROTECTED]> wrote:

> The files decompress remarkably fast, too. I seem to recall about 8
> minutes on our hardware.
>
> I could not get map/reduce to split on blocks in bzip'd files.
>
> That gave me a long tail since the English file is so much bigger.
>
> Uncompressing the files is the way to go.
>
> -Eric
>
>
> On Tue, May 21, 2013 at 2:58 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
>
>> You should see much better ingest performance having decompressed input.
>> Hadoop will also 'naturally' handle the splits for you based on the HDFS
>> block size.
>>
>>
>> On 5/21/13 2:35 PM, Patrick Lynch wrote:
>>
>>> I think your description is accurate, except that I split the single
>>> archive into a much greater number of pieces than the number of
>>> different archives I ingested. Specifically, I set numGroups to a higher
>>> number; I didn't split the archive by hand in HDFS. The archives are
>>> bzip2-ed, not gzip-ed. Will decompressing still have the same benefit?
>>>
>>>
>>> -----Original Message-----
>>> From: Josh Elser <[EMAIL PROTECTED]>
>>> To: user <[EMAIL PROTECTED]>
>>> Sent: Tue, May 21, 2013 2:16 pm
>>> Subject: Re: Wikisearch Performance Question
>>>
>>> Let me see if I understand what you're asking: you took one mediawiki
>>> archive and split it into n archives of size 1/n the original. You then
>>> took n _different_ mediawiki archives and ingested those. You tried
>>> to get the speed of ingesting many different archives to be as fast as
>>> splitting an original single archive?
>>>
>>> Are you using gzip'ed input files? Have you tried just decompressing the
>>> gzip into plaintext? Hadoop will naturally split uncompressed text and
>>> give you nice balancing.
>>>
>>> I haven't looked at the ingest code in a long time. Not sure if it ever
>>> received much love.
>>>
>>> On 5/21/13 1:30 PM, Patrick Lynch wrote:
>>>
>>>> user@accumulo,
>>>>
>>>> I was working with the Wikipedia Accumulo ingest examples, and I was
>>>> trying to get the ingest of a single archive file to be as fast as
>>>> ingesting multiple archives through parallelization. I increased the
>>>> number of ways the job split the single archive so that all the servers
>>>> could work on ingesting at the same time. What I noticed, however, was
>>>> that having all the servers work on ingesting the same file was still
>>>> not nearly as fast as using multiple ingest files. I was wondering if I
>>>> could have some insight into the design of the Wikipedia ingest that
>>>> could explain this phenomenon.
>>>>
>>>> Thank you for your time,
>>>> Patrick Lynch
>>>>
>>>
>>>
>
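[Editor's note: a hedged sketch of the block-size splitting Josh Elser describes in the quoted thread above. The class name and input path are hypothetical; it simply prints how many input splits TextInputFormat would generate for a file already in HDFS. A decompressed dump yields roughly one split per HDFS block, while a non-splittable compressed file yields a single split.]

// Sketch only: reports the split count for a given HDFS path.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-count");
    // e.g. the decompressed enwiki XML already copied into HDFS (hypothetical path)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    // Plain text: roughly ceil(fileSize / dfs.block.size) splits, one mapper each.
    System.out.println("input splits: " + splits.size());
  }
}
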
Patrick Lynch 2013-05-22, 18:36