John Kristian 2011-02-04, 03:04
Have you thought about extending schema resolution, so that an int or long can be promoted to a string? The string would be the ASCII decimal representation of the number, I expect. Similarly, an enum could be promoted to its symbol (as a string).
I’ve seen this sort of thing used for evolving a schema: you start out thinking a number is all you need, and then discover you need a richer format. Or vice-versa.
JavaScript and some other languages do this, and people mostly like it. They also do other conversions that I don’t suggest for Avro, such as string to number and float to string. (The string representation of a float depends on your locale.)
I’d be happy to put this into JIRA, if you think that’s appropriate.
- John Kristian
Scott Carey 2011-02-04, 18:02
I have been thinking about more advanced type promotion in Avro after facing more complicated schema evolution issues. I think that we need to draw the line of what is in the 'basic' promotion concept versus more advanced things that need metadata decoration. We recently added aliases, which are an example of schema evolution that requires some metadata.
Int > String is one that has options. Decimal? Hex? Etc. Therefore it is a candidate for something different than the intrinsic promotion. In some sense, it is not type promotion at all, but type conversion. One can say that a float was promoted to a double, and that the opposite move is a demotion. There is only one way that each direction is handled.
Int to string and back: which direction is promotion? Neither; it is a conversion, with multiple ways to go in each direction.
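To make the ambiguity concrete, here is a minimal Python sketch. The helper names `int_to_string` and `string_to_int` and the `style` parameter are hypothetical and are not part of any Avro API; they only illustrate that the same integer has several string forms and that a string read back under a different convention yields a different number.

```python
# Hypothetical helpers for illustration only; nothing here is Avro API.
BASES = {"decimal": 10, "hex": 16}


def int_to_string(value: int, style: str = "decimal") -> str:
    """Render an integer under an explicitly named convention."""
    if style == "decimal":
        return str(value)
    if style == "hex":
        return format(value, "x")
    raise ValueError(f"unknown style: {style}")


def string_to_int(text: str, style: str = "decimal") -> int:
    """Parse a string back, using the same named convention."""
    return int(text, BASES[style])


print(int_to_string(255, "decimal"))   # 255
print(int_to_string(255, "hex"))       # ff
# The writer and reader must agree on the convention:
print(string_to_int("10", "decimal"))  # 10
print(string_to_int("10", "hex"))      # 16
```

By contrast, float to double needs no such parameter, which is the sense in which it is an intrinsic promotion.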
More advanced schema transformations that I have faced in real schema evolution are:
1. Nesting groups of fields into a record.
2. "Flattening" fields from a record into a container record.
3. Breaking a union into components: [A, B, C] --> [Null, A], [Null, B], [Null, C]
4. Converting a union into an array of options: [A, B, C] --> Array([A, B, C])
#3 is needed when data must go to a system that does not support unions. The client may still enforce that only one of the three exists, and a fourth field indicating which is active may be added. #4 happens when your data model changes and you now want multiple instances of a branch, or concurrent existence of branches. It is painful to write client code to adapt, but it could be handled by advanced schema transformation in Avro.
None of these are simple and they often require additional information from the user to achieve.
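As a rough illustration of transformation #3, the following Python sketch works on Avro schemas held as plain JSON-style dicts. The function names, the `payload` field, and the `<field>_<branch>` naming rule are all assumptions made for the example; an actual implementation would need the user to supply or confirm such choices.

```python
def branch_label(branch):
    """Short label for a union branch: the primitive name or the named type's name."""
    if isinstance(branch, str):
        return branch
    return branch.get("name", branch["type"])


def split_union_schema(field_name, branches):
    """Turn a union field [A, B, C] into optional fields [null, A], [null, B], [null, C]."""
    return [
        {
            "name": f"{field_name}_{branch_label(b)}",
            "type": ["null", b],
            "default": None,  # null default, matching the first branch of each new union
        }
        for b in branches
        if b != "null"  # an existing null branch needs no field of its own
    ]


def split_union_value(field_name, branches, active_index, value):
    """Fill only the field for the active branch; every other field stays None."""
    out = {f"{field_name}_{branch_label(b)}": None for b in branches if b != "null"}
    active = branches[active_index]
    if active != "null":
        out[f"{field_name}_{branch_label(active)}"] = value
    return out


branches = ["string", "long", {"type": "record", "name": "Point", "fields": []}]
fields = split_union_schema("payload", branches)
record = split_union_value("payload", branches, 1, 42)
print([f["name"] for f in fields])  # ['payload_string', 'payload_long', 'payload_Point']
print(record)  # {'payload_string': None, 'payload_long': 42, 'payload_Point': None}
```

The "only one of the three exists" rule is preserved here by construction, but nothing in the resulting schema enforces it, which is why the client may still need to check it.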
In Avro Java API language, we have a ResolvingDecoder that handles all the basic schema reader/writer evolution and promotion. A new 'TransformingDecoder' could be supplied with more advanced type transformation options. Each type of transformation would need to be well defined. If it was a general Avro tool and not only Java, it would require additions to the spec.
Doug Cutting 2011-02-04, 18:49
On 02/04/2011 10:02 AM, Scott Carey wrote:
> In Avro Java API language, we have a ResolvingDecoder that handles all the
> basic schema reader/writer evolution and promotion. A new
> 'TransformingDecoder' could be supplied with more advanced type
> transformation options. Each type of transformation would need to be well
> defined. If it was a general Avro tool and not only Java, it would
> require additions to the spec.
I think it makes sense to implement something like this first (e.g., in Java) to work out the details before adding it to the spec. Ideally we might implement it in two languages before adding it to the spec if it seems like details might be language-dependent.
We should probably better define features in the spec so that we can better describe implementations. I think the distinct specification features that an implementation might implement are roughly:
- binary-format i/o of data corresponding to a schema
- json-format i/o
- sort-order comparison
- aliases
- container file format
- rpc over http
I'd hope to soon add RPC over sockets using SASL to the specification, but am waiting for another implementation or two.
All of these are optional, though most depend on binary i/o. I don't think that, e.g., schema resolution, default values and the deflate codec are optional, even though they may not be implemented in every implementation. When they're not implemented I think that's a bug in the implementation of a specification feature.
Do folks agree with this categorization? The following table should probably better correspond to this:
https://cwiki.apache.org/confluence/display/AVRO/Supported+Languages
Doug
John Kristian 2011-02-04, 19:22
If Avro were to support conversions to and from string, it would be good to specify them tightly enough to ensure round-trip interoperability. It could follow an existing standard, for example the XML Schema canonical representation.
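A minimal Python sketch of what "tight enough to round-trip" could mean for integers, assuming a rule modeled on the XML Schema canonical form (no leading "+", no leading zeros). The function names are hypothetical, and rejecting "-0" is my own choice for the sketch rather than something taken from a standard.

```python
import re

# Assumed canonical form: optional "-", then "0" or a digit string with no leading zero.
_CANONICAL_INT = re.compile(r"-?(0|[1-9][0-9]*)")


def int_to_canonical(value: int) -> str:
    """Python's str() already emits this form for ints."""
    return str(value)


def canonical_to_int(text: str) -> int:
    """Accept only the canonical form, so each number has exactly one string."""
    if text == "-0" or not _CANONICAL_INT.fullmatch(text):
        raise ValueError(f"not a canonical integer: {text!r}")
    return int(text)


for n in (0, 7, -42, 2**40):
    assert canonical_to_int(int_to_canonical(n)) == n

# A lenient parser such as int() accepts several spellings of the same number,
# so a string-to-int-to-string round trip would not return the original string.
for spelling in ("+7", "007", " 7"):
    assert int(spelling) == 7
    try:
        canonical_to_int(spelling)
    except ValueError:
        print("rejected", repr(spelling))
```

With a strict reader, int to string to int and string to int to string both return the value they started from, which is the interoperability property at stake.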
Philip Zeyliger 2011-02-04, 21:16
On Fri, Feb 4, 2011 at 10:02 AM, Scott Carey <[EMAIL PROTECTED]> wrote:
> I have been thinking about more advanced type promotion in Avro after
> facing more complicated schema evolution issues.
My two cents:
This way lies madness. Avro (and PB and Thrift) give you some basic tools to evolve an API without doing much extra code. At some point, you end up forking and creating an APIv2, and eventually deprecate APIv1. If you try to make that magical, you'll end up building a programming language.
By all means define a language that converts from one Avro record into another. An Avro expression language would be quite useful, actually. Putting it in the core, however, strikes me as feature creep.
-- Philip
Scott Carey 2011-02-10, 18:28
On 2/4/11 1:16 PM, "Philip Zeyliger" <[EMAIL PROTECTED]> wrote:
>On Fri, Feb 4, 2011 at 10:02 AM, Scott Carey <[EMAIL PROTECTED]> wrote:
>
>> I have been thinking about more advanced type promotion in Avro after
>> facing more complicated schema evolution issues.
>
>My two cents:
>
>This way lies madness. Avro (and PB and Thrift) give you some basic tools
>to evolve an API without doing much extra code. At some point, you end up
>forking and creating an APIv2, and eventually deprecate APIv1. If you try
>to make that magical, you'll end up building a programming language.
I agree that protocol APIv1 versus APIv2 is an example where exotic conversions don't make a lot of sense. The schemas in a protocol API aren't persisted long term; they exist only on the wire.
My use cases are in long-term persisted file data, where schema evolution spans a much longer time window (forever, unless I can rewrite all data). File format v1 not being compatible with file format v2 is a lot harder to swallow than API v1 not being compatible with API v2.
I have another use case in mind as well. Schema transformation is a common need for interoperation with other frameworks. Cascading doesn't support nested records (or it didn't last I looked), so a Cascading Tap has to either flatten them or not support them. Pig doesn't support unions, so they are either not supported or manipulated into non-union structures. Schema transformation is a common use case when integrating Avro with pre-existing systems.
When working on Pig and Hive adapter prototypes, there turned out to be a lot of overlap and repeated work -- and it's almost all in schema transformation (flattening, unions, etc.), classification (recursive?), and translation. If there were a general helper library for this sort of work, then the remaining adapter would be rather small and not require so much Avro domain knowledge.
>By all means define a language that converts from one Avro record into
>another. An Avro expression language would be quite useful, actually.
>Putting it in the core, however, strikes me as feature creep.
Core should definitely remain simple. Anything like this should be an optional library. Support for each transformation should be optional as well -- many languages might have string <> int, while only a couple have union branch materialization.
The more complicated transforms are mostly useful for frameworks that want to use Avro in a way that can interop with other frameworks using Avro.
The initial reaction to the above statement is probably, "If they are both using Avro already, shouldn't they automatically be able to share data?" The answer is no. They aren't using Avro as their internal schema system. They are _translating_ between their internal schema system and Avro, potentially applying various transformation rules. So for schemas within the lowest common denominator it works fine; for anything more complicated it won't. This is not a fault of Avro; it is the nature of compatibility between two non-Avro schema systems. Hive supports Maps with integers as keys. Pig does not. These can be made to interop through Avro if both systems share their schema translation techniques, but not otherwise.
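To illustrate with the integer-keyed map case: Avro maps have string keys, so a translator has to pick a representation. The Python sketch below shows two plausible choices. The function names, the `IntMapEntry` record, and both schemas are assumptions for illustration; this is not how the Hive or Pig adapters actually behave.

```python
# Translation A (assumed): keep an Avro map and stringify the integer keys.
SCHEMA_A = {"type": "map", "values": "string"}


def int_map_as_string_keys(data):
    return {str(k): v for k, v in data.items()}


# Translation B (assumed): an array of key/value records, keeping int keys typed.
SCHEMA_B = {
    "type": "array",
    "items": {
        "type": "record",
        "name": "IntMapEntry",  # hypothetical record name
        "fields": [
            {"name": "key", "type": "int"},
            {"name": "value", "type": "string"},
        ],
    },
}


def int_map_as_entry_array(data):
    return [{"key": k, "value": v} for k, v in sorted(data.items())]


source = {2: "two", 10: "ten"}
print(int_map_as_string_keys(source))  # {'2': 'two', '10': 'ten'}
print(int_map_as_entry_array(source))
# [{'key': 2, 'value': 'two'}, {'key': 10, 'value': 'ten'}]
```

Both outputs are valid Avro data, yet a reader expecting one shape cannot consume the other without knowing which rule the writer applied. A shared translation library would fix that choice in one place.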
Philip Zeyliger 2011-02-14, 02:17
Scott,
Thanks for your response.
I completely agree that your use cases are valuable. I think we also agree that the right place to layer this is as a separate "translation" or "transformation" library. I think madness lies in pushing those transformations into schema JSON; that's not what you're proposing, however, so all is good.
-- Philip