Hadoop >> mail # user >> hadoop streaming and a directory containing large number of .tgz files


Re: hadoop streaming and a directory containing large number of .tgz files
Sunil

You could use identity mappers and a single identity reducer, with output compression turned off.
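
A .tgz is a gzipped tar archive rather than line-oriented text, so one way to apply this idea is to give the job a text file listing the archive paths and let the mapper unpack each one. The following is a minimal sketch under that assumption, not a tested recipe. The function name `extract_without_header` and the `FETCH_CMD` variable are hypothetical names introduced here; `hadoop fs -cat`, `tar`, and `sed 1d` are the only real commands it relies on.

```shell
#!/bin/sh
# Hypothetical streaming mapper body: each input line is the path of one .tgz.
# FETCH_CMD is an assumption of this sketch; it defaults to reading from HDFS
# and can be set to "cat" to try the logic on local files.
FETCH_CMD=${FETCH_CMD:-"hadoop fs -cat"}

extract_without_header() {
  while IFS= read -r path; do
    # Skip blank lines in the path list.
    [ -n "$path" ] || continue
    # Stream the archive, unpack its members to stdout, and drop the first
    # line, mirroring "tar -Oxvf {} | sed 1d" from the original question.
    $FETCH_CMD "$path" | tar -xzOf - | sed 1d
  done
}
```

Saved as a script with a final line that calls `extract_without_header`, this could be passed to Hadoop streaming with `-mapper` and `-file`, with `-input` pointing at the path list, a single reducer, and `mapred.output.compress=false`. Keep in mind that the shuffle sorts lines before they reach the reducer, so the output will not keep the original line order.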

Raj

>________________________________
> From: Sunil S Nandihalli <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Tuesday, April 24, 2012 7:01 AM
>Subject: Re: hadoop streaming and a directory containing large number of .tgz files
>
>Sorry for reforwarding this email. I was not sure if it actually got
>through since I just got the confirmation regarding my membership to the
>mailing list.
>Thanks,
>Sunil.
>
>On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli <
>[EMAIL PROTECTED]> wrote:
>
>> Hi Everybody,
>> I am a newbie to Hadoop. I have about 40K .tgz files, each
>> approximately 3 MB. I would like to process them as if they were a single
>> large file formed by
>> "cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
>> How can I achieve this using hadoop-streaming or some other similar
>> library?
>>
>>
>> thanks,
>> Sunil.
>>
>
>
>