question about completely untagged data...
David Jeske 2010-11-29, 02:39
I have a storage project for which I'm considering adopting Thrift or Avro for record packing, and I have a couple of questions.
Other than type-ids and field-ids, Avro and Thrift's designs seem isomorphic. *Is the binary format's omission of field-type info set in stone, or is it open for feedback?*
I prefer the philosophy of Avro, namely to expect schemas to be available, use those schemas for compatibility mapping, and support dynamic schema parsing in any supported language. In fact, being able to parse schemas dynamically in any language is the real draw of Avro for me. (Personally, I'd prefer it if schemas were written in Avro IDL instead of JSON, but I understand that might complicate implementing client stubs.)
However, the fact that data is not tagged with any type information is unacceptably dangerous for my application. There will be mechanisms for mapping records to schemas, and schemas will be stored, but if a schema were ever lost or corrupted, I can't afford for the data to turn into total junk. Unless the data is trivially small, encoding a field type wouldn't change the size of the encoding much, but it would provide some sanity checking and make it possible to recover the raw data even if a schema was lost or the schema ID for a piece of data was corrupted.
Since Avro is relatively new, I'm asking to find out whether this is anathema to the entire concept of Avro, or something that was chosen but might eventually be reconsidered.
Going the Thrift route will mean injecting a bit of the Avro philosophy into Thrift, namely adding a Thrift IDL parser to the language I need so I can save Thrift IDLs and then read them dynamically. However, doing this as a one-off for my language is different from having a supported mechanism for all client languages, as Avro does.
Re: question about completely untagged data...
Philip Zeyliger 2010-11-29, 04:09
Hi David,
Your assessment of Thrift and Avro being isomorphic is correct, and you've correctly identified the major philosophical difference. (It's in fact a little bit deeper than you suggest: at read time, there are always two schemas available: the reader's schema and the original schema that the data was written with.)
Where are you storing the Avro records? Avro's binary format for records is unlikely to change: it's pretty stable and changing it would be a big deal. On the other hand, Avro already has multiple ways of passing schema information along. Avro's RPC implementations do one thing. Avro Data Files store the schema in the header. You could, in your system, always store (schema, data) tuples. That's what Sam is doing in HAvroBase (http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/).
-- Philip
Re: question about completely untagged data...
Bruce Mitchener 2010-11-29, 04:44
To be clear, HAvroBase stores tuples of (schema ID, data) and then looks up the schema from that ID. It doesn't store each schema separately / entirely alongside the corresponding data records / entries.
HAvroBase is really pretty nice and has backends for storing data into things other than HBase...
- Bruce
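The (schema ID, data) layout described above can be sketched roughly as follows. This is not HAvroBase's code: the `SchemaDirectory` class, its method names, and the choice to derive the ID from a SHA-256 hash of the schema JSON are all assumptions made for illustration. The sketch also uses a simple sorted-key JSON dump rather than Avro's own canonical form.

```python
import hashlib
import json
from typing import Dict


class SchemaDirectory:
    """Hypothetical directory mapping schema IDs to schema text."""

    def __init__(self) -> None:
        self._schemas: Dict[str, str] = {}

    @staticmethod
    def _fingerprint(canonical: str) -> str:
        # Assumption: a truncated SHA-256 of the JSON text serves as the ID.
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    def register(self, schema: dict) -> str:
        # Sorted keys and tight separators give a stable text for hashing.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        schema_id = self._fingerprint(canonical)
        self._schemas[schema_id] = canonical
        return schema_id

    def lookup(self, schema_id: str) -> dict:
        canonical = self._schemas[schema_id]
        # Because the ID is derived from the content, a damaged schema
        # entry no longer matches its own ID and can be detected.
        if self._fingerprint(canonical) != schema_id:
            raise ValueError("schema text does not match its ID")
        return json.loads(canonical)


directory = SchemaDirectory()
schema = {"type": "record", "name": "Event",
          "fields": [{"name": "count", "type": "long"}]}
schema_id = directory.register(schema)
stored_row = (schema_id, b"\x02")  # each record carries only the short ID
```

A content-derived ID detects a corrupted schema entry, but it cannot bring back a schema that is lost outright, which is the failure case raised later in the thread.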
Re: question about completely untagged data...
David Jeske 2010-11-29, 04:50
On Sun, Nov 28, 2010 at 8:44 PM, Bruce Mitchener <[EMAIL PROTECTED]> wrote:
> To be clear, HAvroBase stores tuples of (schema ID, data) and then looks up the schema from that ID. It doesn't store each schema separately / entirely alongside the corresponding data records / entries.
Ahh, yes, that's analogous to what I'm planning to do as well. The schema ID points to a directory of user-supplied schemas. However, it's important for me to have a contingency plan in case somehow, someday, corruption disconnects the schema ID from the actual schema.
I think putting a packed binary form of the field-type info into each record would give me what I want, with space usage comparable to Thrift's overall. It also seems like the kind of thing that could (possibly) one day be a supported mechanism of Avro without actually changing the existing binary format. Best of all worlds.
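A minimal sketch of what such a packed type signature might look like, assuming only primitive field types and at most 255 fields. The 4-bit type codes, the one-byte field count, and the function names are invented for this example; none of this is part of Avro.

```python
from typing import List, Tuple

# Hypothetical 4-bit codes for Avro's primitive types (not defined by Avro).
TYPE_CODES = {"null": 0, "boolean": 1, "int": 2, "long": 3,
              "float": 4, "double": 5, "bytes": 6, "string": 7}
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}


def pack_signature(field_types: List[str]) -> bytes:
    """Pack field types as a count byte followed by two codes per byte."""
    if len(field_types) > 255:
        raise ValueError("this sketch supports at most 255 fields")
    out = bytearray([len(field_types)])
    for i in range(0, len(field_types), 2):
        high = TYPE_CODES[field_types[i]]
        low = TYPE_CODES[field_types[i + 1]] if i + 1 < len(field_types) else 0
        out.append((high << 4) | low)
    return bytes(out)


def unpack_signature(blob: bytes) -> Tuple[List[str], bytes]:
    """Return (field types, remaining payload) from a prefixed record."""
    count = blob[0]
    types = []
    for i in range(count):
        byte = blob[1 + i // 2]
        code = byte >> 4 if i % 2 == 0 else byte & 0x0F
        types.append(CODE_TYPES[code])
    signature_len = 1 + (count + 1) // 2
    return types, blob[signature_len:]


payload = b"\x02\x06abc"  # stand-in for an Avro-encoded record body
record = pack_signature(["long", "string", "double"]) + payload
```

For three fields the prefix costs three bytes: one for the count and two for the packed codes. Nested records, arrays, maps, and unions would need a richer encoding than this.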
As a bonus, there are situations where the schemas I'll be using are so unchanging and common (i.e. embedded in code) that there really isn't any fear of them being lost. In these cases it's nice that Avro can be used to pack and unpack things without any field-type overhead.
Thanks for the comments.
Re: question about completely untagged data...
Bruce Mitchener 2010-11-29, 05:12
If your schemas are next to your data and part of the same storage system, aren't you also similarly worried about protecting your data against loss and corruption?
I'm not sure why one would be separate from the other in terms of backups, disaster prevention or recovery?
And you may well want to look at just adding a separate backend (if needed) to HAvroBase ... it sounds like it is already most of the way towards what you want.
- Bruce
Re: question about completely untagged data...
David Jeske 2010-11-29, 06:26
On Sun, Nov 28, 2010 at 9:12 PM, Bruce Mitchener <[EMAIL PROTECTED]> wrote:
> If your schemas are next to your data and part of the same storage system, aren't you also similarly worried about protecting your data against loss and corruption?
Absolutely. Still, things happen. It may be the number of times I've seen odd corruption cause trouble in supposedly reliable systems, but I like being careful.
> I'm not sure why one would be separate from the other in terms of backups, disaster prevention or recovery?
Think of every MySQL, Oracle, and MS-SQL Server instance out there. What percentage of them are backed up properly? Certainly not 100%. What about embedded storage systems on phones, cameras, iPads, field devices, everywhere? What about filesystems? (We have redundant superblocks and fsck for a reason, even with journaling.)
Every one of those systems, at some point, has had corruption that could have caused major pain for a user, but was mitigated by some sensible safety checks on the part of a developer. I have seen enough small and larger corruptions to know that the world is a nasty nasty place for real software operations, and desire some sensible safety nets.
I hope those give you some insights into my motivations. I understand someone choosing to make a different choice, but those are my sensibilities.
Re: question about completely untagged data...
Doug Cutting 2010-11-29, 18:25
On 11/28/2010 08:50 PM, David Jeske wrote:
> However, it's important for me to have a contingency plan in case somehow, someday there is ever corruption that disconnected the schema-ID from the actual schema.
If this worst-case transpired, I don't think it would be too difficult for most datasets to reconstruct the schema by examining the data. With ProtocolBuffers and Thrift, if the IDL is lost you'd be in a similar, although simpler, situation of having to figure out field names and types. Folks regularly reverse-engineer much more complex stuff than this.
That said, you could store the Id->Schema mapping in multiple places. Among other places, it could be in your source code repository.
Doug
Re: question about completely untagged data...
David Jeske 2010-11-29, 19:04
On Mon, Nov 29, 2010 at 10:25 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> If this worst-case transpired, I don't think it would be too difficult for most datasets to reconstruct the schema by examining the data. With ProtocolBuffers and Thrift, if the IDL is lost you'd be in a similar, although simpler, situation of having to figure out field names and types. Folks regularly reverse-engineer much more complex stuff than this.
I don't follow how this would be possible with Avro. With no type information, how would you tell the difference between an array of ints, a bunch of enums, a binary chunk of data, or even just a string? Thrift and Protobufs have the types so understanding the structure would be trivial, it's only the meaning that would need to be re-derived.
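To make the ambiguity concrete, here is a small sketch that hand-rolls two pieces of Avro's binary encoding: longs written as zigzag varints, and strings written as a long length followed by UTF-8 bytes. The helper names are made up for this example and it does not use the Avro library. The same four bytes decode without error either as one string or as four longs.

```python
from typing import List, Tuple


def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit value, then write it as a varint."""
    z = (n << 1) ^ (n >> 63)  # small negatives become small unsigned values
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_long(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next position) for the varint starting at pos."""
    z = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos
        shift += 7


def encode_string(text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def decode_all_longs(buf: bytes) -> List[int]:
    """Interpret the whole buffer as a sequence of longs."""
    values = []
    pos = 0
    while pos < len(buf):
        value, pos = decode_long(buf, pos)
        values.append(value)
    return values


payload = encode_string("hi!")
print(payload)                    # b'\x06hi!'
print(decode_all_longs(payload))  # [3, 52, -53, -17]
```

Nothing in the bytes themselves says which reading is the intended one. Only the writer's schema settles it.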
> That said, you could store the Id->Schema mapping in multiple places. Among other places, it could be in your source code repository.
These are user-supplied schemas. Rather than get lost in details of my project, let me frame this with a shared-context example.
Imagine we were going to replace the row-packing format of MySQL with Avro. Currently MySQL has one "row-packing format" per table. To add or remove a column, one must rewrite the whole table. Therefore, to mimic this functionality, we would keep a copy of the schema in system metadata. However, if that copy were corrupted, the table contents would be completely unintelligible. Storing the Avro schema in source code isn't something the client would always do, because he used SQL to create the schema and might not even care that Avro exists inside.
Of course one might argue that this is a use case where Avro doesn't provide value over custom-coded packings like MySQL's, but I have a different opinion. One significant performance limit in databases is the need for the database to always unpack and repack record values to deliver them. By using a standardized format that others can understand as the native on-disk format, a "fast path" around the SQL parser can give users a way around that overhead if it's useful.
> If you really want to keep a bit more descriptive information, you could also just consider formats that do include property names, like JSON (with compression). Depending on exactly what you plan to store, it might be a competitive choice all around.
Compressing a single copy of the JSON schema would not produce nearly a small enough representation, because the names of the fields would not be present in the stream to be compressed out. Consider a schema that is just 3 integers with names. The field names would make the compressed schema much bigger than the data (compressed or not). I obviously could remove the field names, which is the idea I mentioned in my earlier email: storing a packed form of just the type structure.
That means that to get good compression of the field names, the output would need to be block-compressed across multiple records. I don't want this requirement.
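The size argument can be checked with a quick sketch. The schema here (a record named `Point` with int fields `x`, `y`, `z`) is an invented example, and zlib stands in for whatever compressor would actually be used.

```python
import json
import zlib

# Hypothetical schema: three named integers.
schema = {
    "type": "record",
    "name": "Point",
    "fields": [
        {"name": "x", "type": "int"},
        {"name": "y", "type": "int"},
        {"name": "z", "type": "int"},
    ],
}

schema_json = json.dumps(schema, separators=(",", ":")).encode("utf-8")
compressed_schema = zlib.compress(schema_json, 9)

# Avro writes ints as zigzag varints, so 1, 2, 3 become one byte each.
record = bytes([2, 4, 6])

# Three ints never need more than five bytes apiece in this encoding.
largest_possible_record = 3 * 5

print(len(schema_json), len(compressed_schema), len(record))
```

Even at the maximum of 15 bytes for three ints, the record stays smaller than the compressed schema, so a per-record copy of the schema would dominate storage for rows like this.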
> I don't think either Avro or Thrift is actually aimed so much for storing data as for transferring data; since the issue of persisting schemas does complicate things significantly (same is true with protobuf too, just even more so).
I'm not sure why you say this. I'm no longer at Google, but while I was there protobuf was used extensively for data schemas. If I could share how much "protobuf schema data" was stored, you'd probably think "extensively" is an understatement. The Google Sawzall paper gives a glimpse of how this was used. The schemas were recorded in a central repository (the source control tree). However, if one was ever lost, the types in the binary format would allow some limited ability to read the data.
> And Avro specifically seems like best fit for sequences of homogenous data entries (rows of DB, log entries etc). This may or may not be similar to your use case. But maybe there are other reasons why you have limited choice to just these two formats?
Actually, Thrift and Protobufs are both perfect binary formats for my application, so if that were the only issue I wouldn't need to look beyond them. As I stated in my first post, I also want an implementation that allows clients in a variety of languages to read schemas and dynamically interpret data. Hive/Pig is a good example. From the Google Sawzall paper you can see that Google has this internally as part of Sawzall, but neither the public protobuf project nor Thrift has this capability built in.
Avro does have this capability to dynamically read schemas in the multi-language client-code, so I came around to ask if there was a way to get the slightly-better data-safety I'd like. I believe the workaround that I mentioned earlier in the thread might be acceptable (storing a packed form of the type-structure).
Thanks again for the comments!
Re: question about completely untagged data...
Doug Cutting 2010-11-29, 19:37
On 11/29/2010 11:04 AM, David Jeske wrote:
> I don't follow how this would be possible with Avro. With no type information, how would you tell the difference between an array of ints, a bunch of enums, a binary chunk of data, or even just a string? Thrift and Protobufs have the types so understanding the structure would be trivial, it's only the meaning that would need to be re-derived.
Protobuf binary only has sizes, not types. Thrift's efficient encodings probably also just have sizes.
If you have a file with 1M records of the same structure, it's usually not hard to find patterns. Strings stand out and punctuate things. Byte arrays are also often easy to identify, especially since they're length-prefixed. Fields that often have the same value (e.g., zero or one) also help punctuate. However, a record that contains only four random single-precision floating-point values versus an Avro "fixed" containing 16 random bytes could be hard to distinguish. In my experience, structures like these are less common. In this case, protobuf and Thrift would let you know that one had four four-byte values and the other one 16-byte value, which would be helpful but not definitive. Also, if you have a table, you often have some idea what it contains.
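As a rough illustration of how strings "stand out", the sketch below scans untagged bytes for a plausible zigzag-varint length followed by that many printable ASCII bytes. It is a toy heuristic with invented function names and an arbitrary minimum length, not a recovery tool. The sample data is hand-built as the long 1234, the string "alice@example.com", and the long 42.

```python
from typing import List, Optional, Tuple


def read_long(buf: bytes, pos: int) -> Optional[Tuple[int, int]]:
    """Decode a zigzag varint at pos; return None if it is truncated."""
    z = 0
    shift = 0
    while pos < len(buf) and shift < 70:  # a 64-bit varint is at most 10 bytes
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos
        shift += 7
    return None


def find_strings(buf: bytes, min_len: int = 4) -> List[Tuple[int, str]]:
    """Return (offset, text) for each run that looks like a prefixed string."""
    hits = []
    pos = 0
    while pos < len(buf):
        decoded = read_long(buf, pos)
        if decoded is not None:
            length, start = decoded
            end = start + length
            chunk = buf[start:end]
            looks_like_text = all(32 <= c < 127 for c in chunk)
            if min_len <= length and end <= len(buf) and looks_like_text:
                hits.append((pos, chunk.decode("ascii")))
                pos = end  # skip past the string we just recognised
                continue
        pos += 1
    return hits


# long 1234 -> A4 13, length 17 -> 22, then the text, long 42 -> 54
sample = bytes([0xA4, 0x13, 0x22]) + b"alice@example.com" + bytes([0x54])
print(find_strings(sample))  # [(2, 'alice@example.com')]
```

On a single record a heuristic like this can easily produce false positives. The point made above is that across many records with the same layout, the offsets at which candidates recur become strong evidence of field boundaries.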
Doug
Re: question about completely untagged data...
David Jeske 2010-11-29, 20:16
On Mon, Nov 29, 2010 at 11:37 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Protobuf binary only has sizes, not types.
I've always thought of the encoding as having types, but overloading ints and floats and letting the decoder figure it out (possibly because at one point I think it was actually types, but then new features were overloaded on existing bits). But yes, thinking of it as having sizes seems more correct. The most important part for me is that the fields are delimited.
> Thrift's efficient encodings probably also just have sizes.
Do you have a reference for this? What I've been told (mostly by ex-Googlers turned Facebookers) is that it has a byte for types. The only documentation I'm able to find is this new compact format design <http://wiki.apache.org/thrift/New_compact_binary_protocol>, which still has types but encodes them more compactly. (Of course there is always the code, but I haven't read that.) I suppose Thrift is a bit of a wildcard in attempting to support multiple binary formats.
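The "sizes, not types" view of protobuf can be seen in how a field tag is laid out: the low three bits are a wire type and the remaining bits are the field number. The sketch below splits an already-decoded tag value. The function name and the descriptive strings in the table are this example's own, while the numeric wire-type values follow the protobuf encoding documentation.

```python
from typing import Tuple

# Protobuf wire types: they say how many bytes to read, not the declared type.
WIRE_TYPES = {
    0: "varint",            # int32, int64, uint32, uint64, sint*, bool, enum
    1: "64-bit",            # fixed64, sfixed64, double
    2: "length-delimited",  # string, bytes, embedded messages, packed repeated
    3: "start group",       # deprecated
    4: "end group",         # deprecated
    5: "32-bit",            # fixed32, sfixed32, float
}


def split_tag(tag: int) -> Tuple[int, str]:
    """Split a decoded protobuf tag into (field number, wire type name)."""
    return tag >> 3, WIRE_TYPES[tag & 0x07]


print(split_tag(0x08))  # (1, 'varint')
print(split_tag(0x12))  # (2, 'length-delimited')
print(split_tag(0x1D))  # (3, '32-bit')
```

A reader without the .proto file can therefore walk field boundaries reliably, but cannot tell a float from a fixed32, or a string from a nested message, without further guessing.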
Re: question about completely untagged data...
David Jeske 2010-11-29, 04:40
On Sun, Nov 28, 2010 at 8:09 PM, Philip Zeyliger <[EMAIL PROTECTED]> wrote:
> Where are you storing the Avro records?
This is part of a database/storage project. To avoid the overhead of a schema per record, I can store a schema ID per record, and then have a directory of schemas in the system. However, if somehow the place where the schema is stored gets botched (ID gets corrupted, schema file gets corrupted or lost, etc.), the records would become completely unintelligible. That sounds like a scary prospect.
> You could, in your system, always store (schema, data) tuples. That's what Sam is doing in HAvroBase (http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/).
This sounds fine for something where records are documents and record sizes are quite big. However, in my application there are going to be too many records comparable in size to (or smaller than) the schema for this to be practical. Without compression this would double or triple the data size. With compression it's a bunch of unnecessary extra work decoding and re-encoding schemas.
I suppose I could use Avro's API to dump a "dense binary type-only schema" that wouldn't have the names of types, only a packed format of the types themselves. This would essentially be the same as Thrift, except that the types would be packed at the beginning of the record (or end) instead of interspersed with the field values. In the common case Avro would be handed the "real" schemas anyway (old and new), so it wouldn't even be looking at this. It would just be in there for safety's sake in case we needed to do some disaster recovery.
Re: question about completely untagged data...
Tatu Saloranta 2010-11-29, 18:04
On Sun, Nov 28, 2010 at 6:39 PM, David Jeske <[EMAIL PROTECTED]> wrote:
> I have a storage project considering adding Thrift or Avro to for record packing, and I have a couple questions.
> Other than than type-id and field-ids, Avro and Thrift's designs seem isomorphic. Is the binary format not including field-type-info something that's set in stone, or something that's open for feedback?
...
> Going the thrift route for me will mean injecting a bit of the Avro philosophy into Thrift, namely, adding a Thrift IDL parser to the language I need, so I can save Thrift IDLs and then dynamically read them. However, doing this as a one-off for my language different than having a supported mechanism for all client languages -- like in Avro.
If you really want to keep a bit more descriptive information, you could also just consider formats that do include property names, like JSON (with compression). Depending on exactly what you plan to store, it might be a competitive choice all around.
I don't think either Avro or Thrift is actually aimed so much at storing data as at transferring data, since the issue of persisting schemas does complicate things significantly (the same is true of protobuf too, just even more so). And Avro specifically seems like the best fit for sequences of homogeneous data entries (rows of a DB, log entries, etc.). This may or may not be similar to your use case. But maybe there are other reasons why your choice is limited to just these two formats?
-+ Tatu +-