I am trying to puzzle this out and am hoping for some insight. I have an
IMAP inbox dump that I am analyzing, and I need to track how many times a
given item is referred to in the inbox, i.e. how many emails came in about
that thing and over what period of time. I can load it into MapReduce with
TextInputFormat and parse it properly, and have managed to crudely
concatenate lines that represent an email together as my final output, so,
basically, it is working now, but my program is seeing each line as a
separate record, and so it is only working reliably when the whole file is
a single InputSplit. If I had a bigger file that was divided into multiple
InputSplits, each read line by line, I have no way to be sure that the lines
that make up one email will not end up in two different splits - does that
make sense?
Someone I work with suggested that I attempt to read each email as a
record, since they have their MIME encoding intact in the text dump, rather
than each line as a record.
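
To make the "one email per record" idea concrete, here is a rough sketch of the grouping logic in plain Java, outside Hadoop. It assumes (and this is only an assumption, since I have not confirmed the dump's layout) that the dump is mbox-style, with each message starting at a line that begins with "From ". The class name EmailRecordSplitter and the method splitIntoMessages are hypothetical names made up for this illustration. In a real job the same grouping would presumably have to live in a custom InputFormat/RecordReader so that a message is never cut at a split boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups the lines of a text dump into whole messages.
class EmailRecordSplitter {

    // ASSUMPTION: mbox-style separator; each message starts with "From ".
    static final String MESSAGE_START = "From ";

    /** Returns one string per message, each line terminated by '\n'. */
    static List<String> splitIntoMessages(List<String> lines) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // null until the first separator is seen

        for (String line : lines) {
            if (line.startsWith(MESSAGE_START)) {
                // A new message begins: flush the previous one, if any.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            // Lines before the first separator are ignored.
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        // Flush the last message.
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        // Made-up sample data, only to exercise the method.
        List<String> dump = Arrays.asList(
            "From alice@example.com Mon Jan  1 00:00:00 2012",
            "Subject: widget 42",
            "",
            "First body line",
            "From bob@example.com Tue Jan  2 00:00:00 2012",
            "Subject: re: widget 42",
            "",
            "Second body line");

        for (String message : splitIntoMessages(dump)) {
            System.out.println("---- record ----");
            System.out.print(message);
        }
    }
}
```

The sketch only works if a body line never starts with "From " itself; mbox writers usually escape such lines (for example as ">From "), but that would need checking against the actual dump.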
Does anyone know of a MIME-aware MapReduce input format? I can't be sure
this will help anyway, since the file is already text-encoded - I may have
to get the email from the original inbox as individual messages somehow to
make use of the MIME header information.
Googling this has been challenging, mainly because the words you have to
use are somewhat overloaded - but I am finding some good clown schools in
my research...so, any help is appreciated.
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com