-
hadoop streaming and a directory containing large number of .tgz files
Sunil S Nandihalli 2012-04-24, 13:42
Hi Everybody,
I am a newbie to Hadoop. I have about 40K .tgz files, each approximately 3 MB. I would like to process them as if they were a single large file formed by
"cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
How can I achieve this using hadoop-streaming or some other similar library?
Thanks,
Sunil.
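For reference, a rough single-machine Python equivalent of the shell pipeline above might look like the sketch below. It assumes the `sed 1d` is meant to drop the first line of each archive's extracted contents (for example a header row), and that every archive member ends with a newline. The function names `lines_without_header` and `concatenate` are made up for illustration.

```python
import tarfile
from typing import Iterable, Iterator


def lines_without_header(tgz_path: str) -> Iterator[bytes]:
    """Yield every line stored in one .tgz archive except the first.

    Roughly mirrors `tar -Oxf ARCHIVE | sed 1d`: members are read in
    archive order and only the first line of the combined stream is dropped.
    Assumes each member ends with a newline.
    """
    first = True
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            for line in archive.extractfile(member):
                if first:
                    first = False  # the equivalent of `sed 1d`
                    continue
                yield line


def concatenate(tgz_paths: Iterable[str], out_path: str) -> None:
    """Write the header-stripped contents of every archive to one file."""
    with open(out_path, "wb") as out:
        for path in tgz_paths:
            for line in lines_without_header(path):
                out.write(line)
```

Calling `concatenate(paths, "output.txt")` with the names from list-of-files would produce the single large file described above, only without the parallelism that gnuparallel provides.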
-
Re: hadoop streaming and a directory containing large number of .tgz files
Sunil S Nandihalli 2012-04-24, 14:01
Sorry for resending this email. I was not sure whether it had actually got through, since I had only just received the confirmation of my membership of the mailing list.
Thanks,
Sunil.
-
Re: hadoop streaming and a directory containing large number of .tgz files
Raj Vishwanathan 2012-04-24, 14:29
Sunil
You could use identity mappers and a single identity reducer, with output compression turned off.
Raj
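An identity mapper passes its input through unchanged, and as far as I know Hadoop does not unpack tar archives by itself, so the .tgz contents would still need to be extracted somewhere. One possible approach, which is not what was proposed above, is a hadoop-streaming mapper whose input is a text file listing the archive paths, one per line. The sketch below assumes the archives live in HDFS and that `hadoop fs -cat` is available on the task nodes; `archive_path`, `emit_archive` and `main` are illustrative names.

```python
import subprocess
import sys
import tarfile


def archive_path(record: str) -> str:
    """Return the archive path from one input record.

    Takes the last tab-separated field, in case the input format
    prepends a key (such as a byte offset) to each line.
    """
    return record.rstrip("\n").split("\t")[-1].strip()


def emit_archive(fileobj, out) -> None:
    """Decompress a .tgz byte stream and write its lines, minus the first."""
    first = True
    # "r|gz" reads the archive sequentially, so a non-seekable pipe works.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # ignore directories and other non-file entries
            for line in archive.extractfile(member):
                if first:
                    first = False  # drop the first line, like `sed 1d`
                    continue
                out.write(line)


def main(fetch_cmd=("hadoop", "fs", "-cat")) -> None:
    """Streaming mapper body: each stdin line names one archive in HDFS."""
    for record in sys.stdin:
        path = archive_path(record)
        if not path:
            continue
        proc = subprocess.Popen([*fetch_cmd, path], stdout=subprocess.PIPE)
        emit_archive(proc.stdout, sys.stdout.buffer)
        proc.stdout.close()
        if proc.wait() != 0:
            raise RuntimeError("could not read " + path)
```

To use it as a mapper, the script would end with a call to `main()` and be passed to hadoop-streaming through its `-mapper` option, with the list of paths as the job input. With a single reducer the lines would arrive in sorted order rather than in their original order, which may or may not matter for the downstream processing.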
-
RE: hadoop streaming and a directory containing large number of .tgz files
Devaraj k 2012-04-24, 17:37
Hi Sunil,
Please take a look at HarFileSystem (the Hadoop Archive filesystem); it may help with your problem.
Thanks,
Devaraj