I think you've more-or-less outlined the pros and cons of each format
(though do see Alex's important point regarding SequenceFiles and
compression). If everyone who worked with Hadoop clearly favored one or the
other, we probably wouldn't include support for both formats by default. :)
Neither format is "right" or "wrong" in the general case. The decision will
I would point out, though, that you may be underestimating the processing
cost of parsing records. If you've got a really dead-simple problem like
"each record is just a set of integers", you could probably split a line of
text on commas/tabs/etc. into fields and then convert those to proper
integer values in a relatively efficient fashion. But if you may have
delimiters embedded in free-form strings, you'll need to build up a much
more complex DFA to process the data, and it's not too hard to find yourself
CPU-bound. (Java regular expressions can be very slow.) Yes, you can always
throw more nodes at the problem, but you may find that your manager is
unwilling to sign off on purchasing more nodes at some point :) Also,
writing/maintaining parser code is its own challenge.
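
To make the parsing point concrete, here is a minimal sketch in Python (not Hadoop code) that contrasts the cheap "split on commas" path with the small state machine you end up needing once delimiters can appear inside quoted strings. The function names, the sample records, and the CSV-style quoting convention (doubled quotes inside a quoted field) are all assumptions made for illustration.

```python
def split_naive(line, delim=","):
    """Fast path: only correct when fields never contain the delimiter."""
    return line.split(delim)


def split_quoted(line, delim=",", quote='"'):
    """Small state machine: tolerates delimiters and doubled quotes
    inside quoted fields (an assumed CSV-style convention)."""
    fields, buf, in_quotes = [], [], False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and line[i + 1] == quote:
                    buf.append(quote)  # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)
        elif ch == quote:
            in_quotes = True           # opening quote
        elif ch == delim:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))
    return fields


# "Each record is just a set of integers": the naive split is enough.
simple = "1,2,3"
ints = [int(f) for f in split_naive(simple)]

# Free-form strings with embedded delimiters: the naive split miscounts fields.
tricky = '42,"Smith, John","said ""hi"""'
naive_fields = split_naive(tricky)
quoted_fields = split_quoted(tricky)
```

The second parser inspects every character and carries state between them, which is the kind of per-record work that adds up across billions of records.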
If your data is essentially text in nature, you might just store it in text
files and be done with it for all the reasons you've stated.
But for complex record types, SequenceFiles will be faster. Especially if
you have to work with raw byte arrays at any point, escaping that (e.g.,
BASE64 encoding) into text and then back is hardly worth the trouble. Just
store it in a binary format and be done with it. Intermediate job data
should probably live as SequenceFiles all the time. They're only ever going
to be read by more MapReduce jobs, right? For data at either "edge" of your
problem--either input or final output data--you might want the greater
ubiquity of text-based files.
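
As a rough illustration of the byte-array point above, the following Python sketch shows why raw bytes cannot go into a line-oriented text record as-is, and what the BASE64 round trip costs in size. The payload is made up for the example, and none of this is SequenceFile code; it only demonstrates the escaping overhead a binary format avoids.

```python
import base64

# Hypothetical binary payload: every byte value, repeated (3072 bytes).
raw = bytes(range(256)) * 12

# Raw bytes can contain the newline byte, which would break
# one-record-per-line text files.
has_newline = b"\n" in raw

# Escaping to text and back: encode on write, decode on every read.
encoded = base64.b64encode(raw)
decoded = base64.b64decode(encoded)

# BASE64 emits 4 output bytes for every 3 input bytes.
overhead = len(encoded) / len(raw)
```

Beyond the roughly one-third size increase before compression, the encode and decode steps are extra CPU work on every write and read.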
On Fri, Jul 2, 2010 at 3:35 PM, Joe Stein <[EMAIL PROTECTED]> wrote:
> You can also set compression to occur on your data between your map &
> reduce tasks (this data can be large, and it is often quicker to compress
> and transfer it than to just transfer it when the copy gets going).
> Setting this value to *true* should speed up the reducers copy greatly
> especially when working with large data sets.
> When we load in our data we use the HDFS API and get the data in to begin
> with as SequenceFiles (compressed by block) and never look back from there.
> We have a custom SequenceFileLoader so we can still use Pig also against
> SequenceFiles. It is worth the little bit of engineering effort to save
> Joe Stein
> Twitter: @allthingshadoop
> On Fri, Jul 2, 2010 at 6:14 PM, Alex Loddengaard <[EMAIL PROTECTED]>
> > Hi David,
> > On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch <[EMAIL PROTECTED]> wrote:
> > >
> > > * We should use a SequenceFile (binary) format as it's faster for the
> > > machine to read than parsing text, and the files are smaller.
> > >
> > > * We should use a text file format as it's easier for humans to read,
> > > easier to change, text files can be compressed quite small, and a) if
> > > text format is designed well and b) given the context of a distributed
> > > system like Hadoop where you can throw more nodes at a problem, the
> > > parsing time will wind up being negligible/irrelevant in the overall
> > > processing time.
> > >
> > SequenceFiles can also be compressed, either per record or per block. This
> > is advantageous if you want to use gzip, because gzip isn't splittable. A
> > SF compressed by blocks is therefore splittable, because each block is
> > gzipped vs. the entire file being gzipped.
> > As for readability, "hadoop fs -text" is the same as "hadoop fs -cat" for
> > SequenceFiles.
> > Lastly, I promise that eventually you'll run out of space in your cluster