Pig >> mail # user >> XML files and Sequencefile


XML files and Sequencefile
Hi There,

I have a lot of small XML files (~0.5 MB to 3 MB each) that I would like to process using Apache Pig. Since dealing with a lot of small files is problematic, I was thinking of creating SequenceFiles such that each sequence file is between 60 and 64 MB and no XML file is split across two sequence files.

Is there any utility that handles storing and loading such files from Pig? For example, I could create a Pig job that reads these XML files and generates a few large sequence files such that no XML file is split across two sequence files. I would then write another Pig job that loads these sequence files and analyzes them.

Each of these XML files contains a lot of information for a given entity, and the nesting can be quite deep. Any help with this would be great.
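
To make the packing constraint concrete (sequence files of at most about 64 MB, with no XML file split across two of them), here is a minimal sketch of a greedy grouping step in Python. It is only an illustration of the planning logic, not a Pig or Hadoop utility: the function name `plan_batches`, the `target_bytes` parameter, and the sample file names and sizes are all hypothetical, and it assumes each XML file would be stored whole as a single record in whichever sequence file its batch maps to.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]],
    target_bytes: int = 64 * MB,
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches of at most target_bytes.

    Files are taken in the order given and never split. A file that is
    larger than target_bytes on its own ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for name, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size

    # Flush the last, possibly partially filled, batch.
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show where a batch boundary falls.
    sample = [
        ("a.xml", 30 * MB),
        ("b.xml", 30 * MB),
        ("c.xml", 10 * MB),
        ("d.xml", 3 * MB),
    ]
    for i, batch in enumerate(plan_batches(sample)):
        print(i, batch)
```

Each resulting batch would correspond to one output sequence file. With files of 0.5 MB to 3 MB, a greedy pass like this leaves at most about 3 MB unused per batch, which is roughly in line with the 60 to 64 MB range mentioned above.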