This is consistent with my experience. As a user of HDFS, I would find data
produced by others and not know the semantics well enough to use it.
Onboard schemas, with comments, make this data more usable, although a
system like HCatalog is also useful in facilitating this kind of discovery.
Avro enables and encourages the preparation of shared data sets among
users, which saves cycles and improves productivity.
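
As a rough sketch of what I mean by schemas with comments: Avro lets you
attach a "doc" attribute to a record and to each of its fields. The
PageView record, its field names and the describe() helper below are all
made up for illustration, and the snippet only uses Python's json module
to read the schema text, not the Avro library itself.

```python
import json

# Hypothetical Avro record schema; the record and field names are
# illustrative assumptions, not taken from any real data set.
PAGE_VIEW_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "PageView",
  "doc": "One row per page view, written by the web tier.",
  "fields": [
    {"name": "user_id", "type": "long",
     "doc": "Internal numeric user id."},
    {"name": "url", "type": "string",
     "doc": "Requested URL, including query string."},
    {"name": "referrer", "type": ["null", "string"], "default": null,
     "doc": "Referring URL, or null when the browser sent none."}
  ]
}
""")


def describe(schema):
    """Return one 'name (type): doc' line per field of a record schema."""
    lines = []
    for field in schema["fields"]:
        lines.append("%s (%s): %s" % (
            field["name"], json.dumps(field["type"]), field.get("doc", "")))
    return lines


if __name__ == "__main__":
    for line in describe(PAGE_VIEW_SCHEMA):
        print(line)
```

Someone who finds the file later can read the field descriptions straight
from the schema instead of tracking down whoever wrote the data.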
Russell Jurney http://datasyndrome.com
On Aug 13, 2012, at 4:00 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> It is worth keeping in mind that explicit external schema is another
> cost in not just designing but also maintaining the system. As such,
> it is most useful for closely-coupled internal systems, where one
> controls both ends. This may be the case for computing pipelines a
> single team owns.
Our experiences have been quite the opposite. When the developer producing
data was the same as the developer writing code to consume it, JSON worked
fine since the developer knew what fields to expect. As our company grew,
this turned into tribal knowledge and the approach did not scale. That's
when having schemas is critical: when one team produces data and many
others consume it. The cost is that the producer needs to publish the
schema for others to discover.
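
To make that concrete, here is a minimal sketch of the kind of check a
consuming team can run once the producer has published a schema. The
field list and the check_record() helper are hypothetical, and the
schema is reduced to a plain name-to-type mapping for brevity; in
practice you would lean on the Avro library rather than hand-rolling
this.

```python
import json

# Hypothetical published schema, reduced to field name -> expected Python
# type. A real Avro schema carries much more (unions, defaults, docs).
PUBLISHED_FIELDS = {"user_id": int, "url": str}


def check_record(raw_json, fields=PUBLISHED_FIELDS):
    """Return a list of problems found in one JSON-encoded record."""
    record = json.loads(raw_json)
    problems = []
    for name, expected in fields.items():
        if name not in record:
            problems.append("missing field: %s" % name)
        elif type(record[name]) is not expected:
            # Exact type match, so a JSON boolean is not accepted as an int.
            problems.append("field %s: expected %s, got %s" % (
                name, expected.__name__, type(record[name]).__name__))
    for name in record:
        if name not in fields:
            problems.append("unexpected field: %s" % name)
    return problems


if __name__ == "__main__":
    print(check_record('{"user_id": 7, "url": "/home"}'))
    print(check_record('{"user_id": "7"}'))
```

Without the published field list, the second record's problems only show
up when some downstream job breaks on it.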
On Mon, Aug 13, 2012 at 10:50 AM, Tatu Saloranta <[EMAIL PROTECTED]> wrote:
> On Sun, Aug 12, 2012 at 8:03 PM, Russell Jurney
> <[EMAIL PROTECTED]> wrote:
> > To be fair, you can test types as you parse JSON. But only a few.
> Difference between external/explicit schema typed formats and
> schema-free (optional schema, as in JSON) formats is similar to that
> between statically and dynamically typed languages.
> Testing and handling differ, as well as trade-offs.
> -+ Tatu +-