|
Christophe Taton
2012-05-30, 06:04
Philip Zeyliger
2012-05-30, 16:34
Scott Carey
2012-05-30, 21:52
Christophe Taton
2012-05-30, 23:26
Michael Armbrust
2012-06-05, 21:53
|
-
Scala API
Christophe Taton 2012-05-30, 06:04
Hi people,

Is there interest in a custom Scala API for Avro records and protocols?
I am currently working on a schema compiler for Scala, but before I go deeper, I would really like to have external feedback.
I would especially like to hear from anyone who has opinions on how to map Avro types onto Scala types.

Here are a few hints on what I've been trying so far:
- Records are compiled into two forms: mutable and immutable.
- To avoid collisions with Java generated classes, Scala classes are generated in a .scala sub-package.
- Avro arrays are translated to Seq/List when immutable and Buffer/ArrayBuffer when mutable.
- Avro maps are translated to immutable or mutable Map/HashMap.
- Bytes/Fixed are translated to Seq[Byte] when immutable and Buffer[Byte] when mutable.
- Avro unions are currently translated into Any, but I plan to:
  - translate union{null, X} into Scala Option[X]
  - compile union {T1, T2, T3} into custom case classes to have proper type checking and pattern matching.
- Scala records provide a method encode(encoder) to serialize as binary into a byte stream (appears ~30% faster than SpecificDatumWriter).
- Scala mutable records provide a method decode(decoder) to deserialize a byte stream (appears ~25% faster than SpecificDatumReader).
- Scala records implement the SpecificRecord Java interface (with some overhead), so one may still use the SpecificDatumReader/Writer when the custom encoder/decoder methods cannot be used.
- Mutable records can be converted to immutable (i.e. can act as builders).

Thanks,
Christophe
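The mapping proposed in this message could look roughly like the sketch below. It is only an illustration under stated assumptions: the `User` schema, the class names, and the `build()` method are hypothetical and are not taken from the actual compiler output. It shows `union { null, X }` as `Option[X]`, an Avro array as `Seq` (immutable) or `Buffer` (mutable), and a mutable record acting as a builder for the immutable one.

```scala
import scala.collection.mutable

// Hypothetical Avro schema, for illustration only:
//   record User { string name; union { null, int } age; array<string> tags; }

// Immutable form: Option for union { null, X }, Seq for arrays.
final case class User(name: String, age: Option[Int], tags: Seq[String])

// Mutable form, doubling as a builder for the immutable one.
// (The name `build` is an assumption, not the compiler's API.)
final class MutableUser(
    var name: String = "",
    var age: Option[Int] = None,
    val tags: mutable.Buffer[String] = mutable.ArrayBuffer.empty[String]
) {
  // Copy the buffer so that later mutation cannot leak into the immutable record.
  def build(): User = User(name, age, tags.toList)
}

object UserExample {
  // Fill in a mutable record, then freeze it.
  def sample(): User = {
    val m = new MutableUser()
    m.name = "ada"
    m.age = Some(36)
    m.tags += "admin"
    m.build()
  }
}
```

Copying the buffer in `build()` is one possible design choice; sharing the underlying collection would be cheaper but would let later mutation show through the "immutable" record.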
-
Re: Scala API
Philip Zeyliger 2012-05-30, 16:34
On Tue, May 29, 2012 at 11:04 PM, Christophe Taton <[EMAIL PROTECTED]> wrote:
> Hi people,
>
> Is there interest in a custom Scala API for Avro records and protocols?

Sure! I know Michael Armbrust over at Berkeley has been using Scala with Avro; you might send him an e-mail (he's a grad student, so I'm sure you could look him up) to see whether he developed anything in this area.

> I am currently working on a schema compiler for Scala, but before I go
> deeper, I would really like to have external feedback.

You might look into extending the existing compiler to produce Scala files in addition to Java files. It uses templates, so it's not too tricky to do more languages.
-
Re: Scala API
Scott Carey 2012-05-30, 21:52
This would be fantastic. I would be excited to see it. It would be great to see a Scala language addition to the project if you wish to contribute.

I believe there have been a few other Scala Avro attempts by others over time. I recall one where all records were case classes (but this broke at 22 fields).
Another thing to look at is: http://code.google.com/p/avro-scala-compiler-plugin/

Perhaps we can get a few of the other people who have developed Scala Avro tools to review/comment or contribute as well?

On 5/29/12 11:04 PM, "Christophe Taton" <[EMAIL PROTECTED]> wrote:
> Hi people,
>
> Is there interest in a custom Scala API for Avro records and protocols?
> I am currently working on a schema compiler for Scala, but before I go
> deeper, I would really like to have external feedback.
> I would especially like to hear from anyone who has opinions on how to map
> Avro types onto Scala types.
> Here are a few hints on what I've been trying so far:
> * Records are compiled into two forms: mutable and immutable.

Very nice.

> * To avoid collisions with Java generated classes, Scala classes are generated
>   in a .scala sub-package.
> * Avro arrays are translated to Seq/List when immutable and Buffer/ArrayBuffer
>   when mutable.
> * Avro maps are translated to immutable or mutable Map/HashMap.
> * Bytes/Fixed are translated to Seq[Byte] when immutable and Buffer[Byte] when
>   mutable.
> * Avro unions are currently translated into Any, but I plan to:
>   * translate union{null, X} into Scala Option[X]
>   * compile union {T1, T2, T3} into custom case classes to have proper type
>     checking and pattern matching.

If you have a record R1, it compiles to a Scala class. If you put it in a union of {T1, String}, what does the case class for the union look like? Is it basically a wrapper like a specialized Either[T1, String]?
Maybe Scala will get Union types later to push this into the compiler instead of object instances :)

> * Scala records provide a method encode(encoder) to serialize as binary into a
>   byte stream (appears ~30% faster than SpecificDatumWriter).
> * Scala mutable records provide a method decode(decoder) to deserialize a byte
>   stream (appears ~25% faster than SpecificDatumReader).

I have some plans to improve {Generic,Specific}Datum{Reader,Writer} in Java, and I would be interested in seeing how the Scala one here works. The Java one is slowed by traversing too many data structures that represent decisions that could be pre-computed rather than repeatedly parsed for each record.

> * Scala records implement the SpecificRecord Java interface (with some
>   overhead), so one may still use the SpecificDatumReader/Writer when the custom
>   encoder/decoder methods cannot be used.
> * Mutable records can be converted to immutable (i.e. can act as builders).
> Thanks,
> Christophe
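The remark about pre-computing decisions can be illustrated with a toy example. This is only a sketch of the general idea, not how Avro's datum writers are implemented: the `ToySchema` types, the `Writers` object, and the text output format are all hypothetical. `interpret` walks the schema again for every datum, while `compile` walks it once and returns a closure that can be reused for every record.

```scala
// Toy schema model; these types are hypothetical and are not Avro's Schema classes.
sealed trait ToySchema
case object IntType extends ToySchema
case object StringType extends ToySchema
final case class RecordType(fields: List[(String, ToySchema)]) extends ToySchema

object Writers {
  type Writer = (Any, StringBuilder) => Unit

  // Interpreting: the schema is pattern-matched again for every datum written.
  def interpret(schema: ToySchema, datum: Any, out: StringBuilder): Unit = schema match {
    case IntType    => out.append(datum.asInstanceOf[Int]).append(';')
    case StringType => out.append(datum.asInstanceOf[String]).append(';')
    case RecordType(fields) =>
      val values = datum.asInstanceOf[Map[String, Any]]
      fields.foreach { case (name, s) => interpret(s, values(name), out) }
  }

  // Precomputed: the schema is walked once; the returned closure holds the decisions.
  def compile(schema: ToySchema): Writer = schema match {
    case IntType    => (d, out) => out.append(d.asInstanceOf[Int]).append(';')
    case StringType => (d, out) => out.append(d.asInstanceOf[String]).append(';')
    case RecordType(fields) =>
      val fieldWriters: Array[(String, Writer)] =
        fields.map { case (n, s) => (n, compile(s)) }.toArray
      (d, out) => {
        val values = d.asInstanceOf[Map[String, Any]]
        var i = 0
        while (i < fieldWriters.length) {
          val (name, w) = fieldWriters(i)
          w(values(name), out)
          i += 1
        }
      }
  }
}
```

Both paths produce the same output; the difference is only where the schema traversal happens. Whether this pays off in practice depends on the workload and would need to be measured.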
-
Re: Scala API
Christophe Taton 2012-05-30, 23:26
Thanks a lot for your replies!

On Wed, May 30, 2012 at 2:52 PM, Scott Carey <[EMAIL PROTECTED]> wrote:
> This would be fantastic. I would be excited to see it. It would be great
> to see a Scala language addition to the project if you wish to contribute.
>
> I believe there have been a few other Scala Avro attempts by others over
> time. I recall one where all records were case classes (but this broke at
> 22 fields).
> Another thing to look at is:
> http://code.google.com/p/avro-scala-compiler-plugin/
>
> Perhaps we can get a few of the other people who have developed Scala Avro
> tools to review/comment or contribute as well?

That would be great!
I just filed https://issues.apache.org/jira/browse/AVRO-1105 to record feedback there.
I will file more targeted issues and post an initial patch soon.

> On 5/29/12 11:04 PM, "Christophe Taton" <[EMAIL PROTECTED]> wrote:
>
>> Hi people,
>>
>> Is there interest in a custom Scala API for Avro records and protocols?
>> I am currently working on a schema compiler for Scala, but before I go
>> deeper, I would really like to have external feedback.
>> I would especially like to hear from anyone who has opinions on how to map
>> Avro types onto Scala types.
>> Here are a few hints on what I've been trying so far:
>>
>> - Records are compiled into two forms: mutable and immutable.
>
> Very nice.
>
>> - To avoid collisions with Java generated classes, Scala classes are
>>   generated in a .scala sub-package.
>> - Avro arrays are translated to Seq/List when immutable and
>>   Buffer/ArrayBuffer when mutable.
>> - Avro maps are translated to immutable or mutable Map/HashMap.
>> - Bytes/Fixed are translated to Seq[Byte] when immutable and
>>   Buffer[Byte] when mutable.
>> - Avro unions are currently translated into Any, but I plan to:
>>   - translate union{null, X} into Scala Option[X]
>>   - compile union {T1, T2, T3} into custom case classes to have
>>     proper type checking and pattern matching.
>
> If you have a record R1, it compiles to a Scala class. If you put it in a
> union of {T1, String}, what does the case class for the union look like?
> Is it basically a wrapper like a specialized Either[T1, String]? Maybe
> Scala will get Union types later to push this into the compiler instead of
> object instances :)

I was thinking of using Either[X,Y] but this does not scale.
Assuming this union schema:

    record Rec {
      union { int, array<int>, Record1 } field1;
    }

If unions are compiled to Any, Scala can match on simple types:

    field1 match {
      case value: Int => ...
      case value: Array[Int] => ...
      case value: Record1 => ...
    }

But this does not work in all cases because of type erasure. Maybe this would work with Scala 2.10 and runtime type reification.
In all cases, Any would not provide type safety...

For now, I am planning on generating the following:

    abstract class Field1Union
    case class Field1Int(data: Int) extends Field1Union
    case class Field1ArrayInt(data: Array[Int]) extends Field1Union
    case class Field1Record1(data: Record1) extends Field1Union

Each case class only has one constructor parameter, so this should not hit the 22 constructor parameters limit of case classes.

Constructing a record would look like:

    val rec = new Rec(field1=new Field1Int(1))

or

    val rec = new Rec(field1=new Field1ArrayInt(...))

And reading the union field would look like:

    rec.field1 match {
      case Field1Int(intValue) => ...
      case Field1ArrayInt(array) => ...
      case Field1Record1(rec1) => ...
    }

Thoughts?

>> - Scala records provide a method encode(encoder) to serialize as
>>   binary into a byte stream (appears ~30% faster than SpecificDatumWriter).
>> - Scala mutable records provide a method decode(decoder) to
>>   deserialize a byte stream (appears ~25% faster than SpecificDatumReader).
>
> I have some plans to improve {Generic,Specific}Datum{Reader,Writer} in
> Java, I would be interested in seeing how the Scala one here works. The
> Java one is slowed by traversing too many data structures that represent
> decisions that could be pre-computed rather than repeatedly parsed for each
> record.

The Scala reader/writer is very straightforward. It is a shortcut that most likely does not work in all cases (especially when decoding from another schema version).
If you want to have a look, I attached one schema I am using for testing and the generated code.

C.
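The per-field union wrappers discussed in this message can be written out as a compilable sketch. It departs from the message in a few labeled ways: it uses a `sealed trait` rather than an abstract class so the compiler can check matches for exhaustiveness, it uses `Seq[Int]` for the array branch, and `Record1`, `Rec`, and `UnionExample.describe` are hypothetical stand-ins rather than generated code.

```scala
// Hypothetical stand-in for the generated Record1 class.
final case class Record1(id: Int)

// One sealed parent per union field; `sealed` lets the compiler
// warn when a match does not cover every branch of the union.
sealed trait Field1Union
final case class Field1Int(data: Int) extends Field1Union
final case class Field1ArrayInt(data: Seq[Int]) extends Field1Union
final case class Field1Record1(data: Record1) extends Field1Union

// Hypothetical record holding the union-typed field.
final case class Rec(field1: Field1Union)

object UnionExample {
  // Reading the union field: each branch is statically typed,
  // so no runtime type test on erased type parameters is needed.
  def describe(rec: Rec): String = rec.field1 match {
    case Field1Int(i)       => s"int $i"
    case Field1ArrayInt(xs) => s"array of ${xs.length} ints"
    case Field1Record1(r)   => s"record ${r.id}"
  }
}
```

Because each wrapper carries exactly one value, the approach stays clear of the 22-parameter limit mentioned earlier in the thread, at the cost of one extra allocation per union value.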
-
Re: Scala API
Michael Armbrust 2012-06-05, 21:53
On Wed, May 30, 2012 at 9:34 AM, Philip Zeyliger <[EMAIL PROTECTED]> wrote:
> Sure! I know Michael Armbrust over at Berkeley has been using Scala with
> Avro; you might send him an e-mail (he's a grad student, so I'm sure you
> could look him up) to see whether he developed anything in this area.

We have a plugin for the Scala compiler that takes case classes that extend a special marker trait (AvroRecord) and generates the code needed for Avro serialization. It has mostly been used for research thus far, but we use it quite a bit as the serialization for our K/V store, storing experimental results, as well as our own homegrown message passing system.

Details can be found here: https://github.com/radlab/SCADS/wiki/Avro-Plugin

Let me know if you have any questions!

Michael