Hadoop >> mail # user >> hadoop streaming and a directory containing large number of .tgz files


RE: hadoop streaming and a directory containing large number of .tgz files
Hi Sunil,

    Please take a look at HarFileSystem (the Hadoop Archive filesystem); it should help with your problem.

Thanks
Devaraj
________________________________________
From: Sunil S Nandihalli [[EMAIL PROTECTED]]
Sent: Tuesday, April 24, 2012 7:12 PM
To: [EMAIL PROTECTED]
Subject: hadoop streaming and a directory containing large number of .tgz files

Hi Everybody,
 I am a newbie to Hadoop. I have about 40K .tgz files, each approximately
3 MB. I would like to process them as if they were a single large file formed
by
"cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
How can I achieve this using hadoop-streaming or some other similar
library?
Thanks,
Sunil.
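
For illustration, the pipeline in the question could be approximated with a Hadoop Streaming mapper along the following lines. This is only a sketch under several assumptions that are not stated in the thread: the job input is a text file listing one HDFS path to a .tgz file per line, the `hadoop` command is available on every task node, and the intent of `sed 1d` is to drop the first line of each archive's extracted output. The function names `archive_lines` and `main` are made up for this example.

```python
#!/usr/bin/env python3
import subprocess
import sys
import tarfile


def archive_lines(fileobj):
    """Yield the lines of every regular file in a gzip-compressed tar
    stream, skipping the first line of the archive's combined output
    (roughly what `tar -Oxf ARCHIVE | sed 1d` produces)."""
    first = True
    # "r|gz" reads the archive as a forward-only stream, so it also works
    # on a pipe that cannot seek.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and so on
            for line in tar.extractfile(member):
                if first:
                    first = False
                    continue
                yield line


def main():
    """Treat each stdin line as an HDFS path to a .tgz file (assumption)."""
    for path in sys.stdin:
        path = path.strip()
        if not path:
            continue
        # Stream the archive out of HDFS rather than copying it locally.
        proc = subprocess.Popen(["hadoop", "fs", "-cat", path],
                                stdout=subprocess.PIPE)
        for line in archive_lines(proc.stdout):
            sys.stdout.buffer.write(line)
        proc.wait()

# When saved as a streaming mapper script, finish the file with: main()
```

Such a script would typically be passed to the streaming jar with `-mapper`, shipped with `-file`, and run with the file list as `-input` and the number of reduce tasks set to zero, so that the mapper output is written out directly. The output is then spread across one part file per map task rather than a single output.txt.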