I have a lot of small XML files (~0.5 MB to 3 MB each) that I would like to process using Apache Pig. Since dealing with a lot of small files is problematic, I was thinking of creating SequenceFiles such that each one is between 60 and 64 MB and no XML file is split across two SequenceFiles. Is there any utility that handles storing and loading these files from Pig? For example, I could create a Pig job that reads these XML files and generates a few large SequenceFiles such that no XML file is split across two of them. I would then write another Pig job that loads these SequenceFiles and analyzes them. Each of these XML files contains a lot of information for a given entity, and the nesting can be quite deep. Any help with this would be great.
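To make the packing constraint concrete, here is a minimal sketch in Python of how whole XML files could be grouped into batches that stay under a size limit, so that no file is ever split across two batches. It is only a planning step, not a Pig or Hadoop utility: the function name `plan_batches`, the 64 MB constant, and the `(name, size)` input format are assumptions made for illustration. It also uses raw file sizes, so the actual size of a written SequenceFile would differ somewhat because of record headers, sync markers, and any compression.

```python
from typing import Iterable, List, Tuple

# Assumed upper bound per output file (hypothetical, taken from the 60-64 MB target).
MAX_BATCH_BYTES = 64 * 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]],
    max_bytes: int = MAX_BATCH_BYTES,
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches of whole files.

    A file is never split: when adding the next file would push the
    current batch past max_bytes, the batch is closed and a new one starts.
    A single file larger than max_bytes ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for name, size in files:
        # Close the current batch if this file would not fit in it.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size

    # Keep the last, possibly partially filled, batch.
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    mb = 1024 * 1024
    # Fifty hypothetical 3 MB files.
    sample = [(f"entity_{i}.xml", 3 * mb) for i in range(50)]
    for i, batch in enumerate(plan_batches(sample)):
        print(i, len(batch))
```

Each resulting batch could then be written out as one SequenceFile, for example with the file name as the key and the full XML content as the value. This simple first-fit approach does not guarantee the 60 MB lower bound; the last batch in particular may be much smaller.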
Shahab Yunus 2013-10-23, 22:34