Pig >> mail # user >> XML files and Sequencefile


XML files and Sequencefile
Hi There,

I have a lot of small (~0.5 MB to 3 MB) XML files that I would like to
process using Apache Pig. Since dealing with a lot of small files is problematic, I was thinking of creating SequenceFiles such that each one is between 60 and 64 MB and no XML file is split across two SequenceFiles. Is there any utility that handles storing and loading these files from Pig? For example, I could write one Pig job that reads the XML files and generates a few large SequenceFiles, such that no XML file is split across two of them, and then a second Pig job that loads these SequenceFiles and analyzes them. Each XML file contains a lot of information for a given entity, and the nesting can be quite deep. Any help with this would be great.
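
To make the packing rule concrete, here is a minimal sketch in Python of how whole files could be grouped into batches of at most 64 MB, so that no XML file ends up split across two outputs. The function name `plan_batches` and the (name, size) input format are hypothetical, chosen only for illustration; the sketch plans the grouping and does not write actual SequenceFiles, and it ignores any per-record or header overhead the SequenceFile format would add.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024


def plan_batches(files: Iterable[Tuple[str, int]],
                 max_bytes: int = 64 * MB) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches of at most max_bytes.

    Files are taken in the order given and are never split: when the next
    file would push the current batch over the limit, the batch is closed
    and a new one is started. A single file larger than max_bytes gets a
    batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then correspond to one SequenceFile, for instance with the file name as the key and the full XML content as the value. This simple first-fit-in-order approach does not guarantee that every batch reaches 60 MB; with inputs of 0.5 MB to 3 MB, though, each closed batch should land within about 3 MB of the limit.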

     
Shahab Yunus 2013-10-23, 22:34