Hadoop >> mail # user >> hadoop streaming and a directory containing large number of .tgz files
Re: hadoop streaming and a directory containing large number of .tgz files
Sunil,

You could use identity mappers, a single identity reducer, and no output compression.
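A minimal hadoop-streaming invocation along those lines might look like the sketch below. The streaming jar location and the HDFS input/output paths are assumptions, not from the thread; adjust them for your distribution (property names shown are the Hadoop 1.x-era ones current when this thread was written):

```shell
# Sketch only -- jar path and HDFS paths are assumptions, not from the thread.
# cat as both mapper and reducer gives identity map/reduce; a single reduce
# task produces one output file, and output compression is explicitly off.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=1 \
  -D mapred.output.compress=false \
  -input /user/sunil/tgz-file-list \
  -output /user/sunil/merged-output \
  -mapper /bin/cat \
  -reducer /bin/cat
```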

Raj

>________________________________
> From: Sunil S Nandihalli <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Tuesday, April 24, 2012 7:01 AM
>Subject: Re: hadoop streaming and a directory containing large number of .tgz files
>
>Sorry for re-sending this email; I was not sure whether it actually got
>through, since I had only just received confirmation of my membership to
>the mailing list.
>Thanks,
>Sunil.
>
>On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli <
>[EMAIL PROTECTED]> wrote:
>
>> Hi Everybody,
>>  I am a newbie to hadoop. I have about 40K .tgz files each of
>> approximately 3MB . I would like to process this as if it were a single
>> large file formed by
>> "cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
>> how can I achieve this using hadoop-streaming or some-other similar
>> library..
>>
>>
>> thanks,
>> Sunil.
>>
>
>
>