Those are all great questions, and mostly difficult to answer. I haven't
played with the serialization APIs in some time, but let me try to give some
guidance. With respect to your bulleted questions above:
1) Serialization is file system independent: any Hadoop-compatible file
system should support any kind of serialization.
2) See (1). The "default serialization" is Writables, but you can easily
add your own by modifying the io.serializations configuration parameter.
3) I doubt anything significant changed in the way serialization works: the
main thrust of 1->2 was in the way services are deployed, not changing the
internals of how data is serialized. After all, the serialization APIs
need to remain stable even as the architecture of Hadoop changes.
4) It depends on the implementation. If you have a custom Writable that is
really good at compressing your data, that will be better than using a
Thrift auto-generated serialization API that is uncustomized out of the
box. Example: say you are writing "strings" and you know each string is
at most 3 characters. A "smart" Writable serializer with custom
implementations optimized for your data will beat a Thrift serialization
approach. But I think in general, the advantage of Thrift/Avro is that it's
easier to get really good compression natively out of the box, because
many different data types are strongly supported by the way they apply
the schemas (for example, a Thrift struct can contain a "boolean", two
"strings", and an "int"). These types will all be optimized for you by
Thrift, whereas with Writables you cannot as easily create sophisticated
types with optimization of nested properties.
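
To make the 3-character example in (4) a bit more concrete, here is a rough
sketch of what I mean. It uses only java.io so it runs without Hadoop on the
classpath; ShortCode and ShortCodeDemo are made-up names, and I am assuming
the strings are ASCII with no NUL characters. The write/readFields methods
have the same shape as the ones on org.apache.hadoop.io.Writable, so a real
version would simply implement that interface. The generic baseline here is
DataOutputStream.writeUTF, not Thrift or Avro, so treat the numbers as an
illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class ShortCodeDemo {

    /** Hypothetical value type: an ASCII string of at most 3 characters. */
    static final class ShortCode {
        private String value = "";

        ShortCode() {}

        ShortCode(String value) {
            set(value);
        }

        void set(String value) {
            if (value.length() > 3) {
                throw new IllegalArgumentException("max 3 characters");
            }
            for (int i = 0; i < value.length(); i++) {
                char c = value.charAt(i);
                // Assumption: ASCII only, and NUL is reserved as padding.
                if (c == 0 || c > 127) {
                    throw new IllegalArgumentException("ASCII, non-NUL only");
                }
            }
            this.value = value;
        }

        String get() {
            return value;
        }

        // Same shape as Writable.write(DataOutput): always exactly 3 bytes,
        // padded with NUL, so no length prefix is needed.
        void write(DataOutput out) throws IOException {
            for (int i = 0; i < 3; i++) {
                out.writeByte(i < value.length() ? value.charAt(i) : 0);
            }
        }

        // Same shape as Writable.readFields(DataInput): read 3 bytes and
        // drop the NUL padding.
        void readFields(DataInput in) throws IOException {
            StringBuilder sb = new StringBuilder(3);
            for (int i = 0; i < 3; i++) {
                byte b = in.readByte();
                if (b != 0) {
                    sb.append((char) b);
                }
            }
            value = sb.toString();
        }
    }

    /** Encodes with the custom fixed-width format. */
    public static byte[] encodeCompact(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new ShortCode(s).write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    /** Decodes the custom fixed-width format. */
    public static String decodeCompact(byte[] data) throws IOException {
        ShortCode code = new ShortCode();
        code.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return code.get();
    }

    /** Generic baseline: 2-byte length prefix plus the encoded characters. */
    public static byte[] encodeGeneric(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encodeCompact("abc").length); // 3
        System.out.println(encodeGeneric("abc").length); // 5
        System.out.println(decodeCompact(encodeCompact("ab")));
    }
}
```

The saving comes purely from knowing something about the data (fixed maximum
length, ASCII only) that a general-purpose string encoding cannot assume.
The trade-off is that the format breaks as soon as that assumption does.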
On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe