Hadoop >> mail # user >> hadoop streaming and a directory containing large number of .tgz files


Sunil S Nandihalli 2012-04-24, 13:42
RE: hadoop streaming and a directory containing large number of .tgz files
Hi Sunil,

    Please take a look at HarFileSystem (the Hadoop Archive filesystem); it should help with your problem.

Thanks
Devaraj
________________________________________
From: Sunil S Nandihalli [[EMAIL PROTECTED]]
Sent: Tuesday, April 24, 2012 7:12 PM
To: [EMAIL PROTECTED]
Subject: hadoop streaming and a directory containing large number of .tgz files

Hi Everybody,
 I am new to Hadoop. I have about 40K .tgz files, each approximately
3 MB. I would like to process them as if they were a single large file
formed by
"cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
How can I achieve this using hadoop-streaming or some other similar
library?
thanks,
Sunil.
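
For illustration only, here is a minimal sketch of a streaming mapper that mirrors the shell pipeline in the question. This is not the HarFileSystem route suggested in the reply. It assumes the job input is the text file "list-of-files" (one archive path per line) and that each path can be opened as a local file on the node running the task, for example via a shared mount. Archives stored in HDFS would have to be fetched first. The function name archive_payload is hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop streaming mapper: stdin carries one .tgz path per line."""
import sys
import tarfile


def archive_payload(path):
    """Concatenate the regular-file members of one .tgz and drop the first line.

    This corresponds to `tar -Oxf ARCHIVE | sed 1d` for a single archive.
    """
    chunks = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                chunks.append(tar.extractfile(member).read())
    data = b"".join(chunks)
    # Everything after the first newline; empty if the data has no newline.
    _, _, rest = data.partition(b"\n")
    return rest


def main():
    out = sys.stdout.buffer
    for line in sys.stdin:
        path = line.strip()
        if path:  # assumption: the path is readable from the local filesystem
            out.write(archive_payload(path))


if __name__ == "__main__":
    main()
```

A script like this would typically be passed to the streaming jar as the -mapper with the number of reduce tasks set to zero, so the mapper output is written directly. The jar location and the exact option spelling depend on the Hadoop version installed.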
Sunil S Nandihalli 2012-04-24, 14:01
Raj Vishwanathan 2012-04-24, 14:29