Pig, mail # user - XML files and Sequencefile


XML files and Sequencefile
Sameer Tilak 2013-10-23, 19:30
Hi There,

I have a lot of small (~0.5 MB to 3 MB) XML files that I would like to process using Apache Pig. Since dealing with a lot of small files is problematic, I was thinking of creating SequenceFiles such that each one is between 60 and 64 MB and no XML file is split across two SequenceFiles. Is there any utility that handles storing and loading such files from Pig? For example, I could create a Pig job that reads these XML files and generates a few large sequence files such that no XML file is split across two of them. I would then write another Pig job that loads these sequence files and analyzes them. Each of these XML files contains a lot of information about a given entity, and the nesting can be quite deep. Any help with this would be great.
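
To make the packing constraint concrete, here is a minimal sketch of how whole files could be grouped into batches that never exceed 64 MB. It only plans the grouping from (name, size) pairs. The function name `plan_batches` and the sample file names are hypothetical, and actually writing each batch out (for instance through Hadoop's SequenceFile.Writer) is not shown.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]], max_bytes: int = 64 * MB
) -> List[List[str]]:
    """Group whole files, in input order, into batches of at most max_bytes.

    A file is never split: if adding it would push the current batch past
    the cap, the current batch is closed and the file starts a new one.
    A single file larger than max_bytes still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical input: 50 XML files of 3 MB each.
    sample = [(f"entity_{i}.xml", 3 * MB) for i in range(50)]
    for i, batch in enumerate(plan_batches(sample)):
        print(f"sequence file {i}: {len(batch)} XML files")
```

Because no input file is larger than 3 MB, a batch is only closed once it already holds more than 61 MB, so every batch except possibly the last lands in the 60 to 64 MB range.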

     
Shahab Yunus 2013-10-23, 22:34