-
Reserving more keywords in Avro Object Container Files? Philip Zeyliger 2010-01-22, 01:37
Hey folks,

I currently reserve "schema", "codec", and "sync" in the container files. As part of AVRO-135, I could imagine an application wanting to set "codec.compression_level" or some such. One thing we could do (before 1.3 is released) is reserve "avro.*", forever.

Thoughts?

-- Philip
-
Re: Reserving more keywords in Avro Object Container Files? Ryan King 2010-01-22, 01:52
On Thu, Jan 21, 2010 at 5:37 PM, Philip Zeyliger <[EMAIL PROTECTED]> wrote:
> Hey folks,
>
> I currently reserve "schema", "codec", and "sync" in the container
> files. As part of AVRO-135, I could imagine an application wanting to
> set "codec.compression_level" or some such. One thing we could do
> (before 1.3 is released) is reserve "avro.*", forever. Thoughts?

We should reserve some space, like avro.*, rather than having to do it one at a time.

-ryan
-
Re: Reserving more keywords in Avro Object Container Files? Doug Cutting 2010-01-22, 17:27
Ryan King wrote:
> We should reserve some space, like avro.*, rather than having to do it
> one at a time.

+1 This sounds like a good idea.

Questions:
- Should we rename all existing keywords to this namespace? My vote is yes, now is the time to do this.
- Should the namespace be "avro.*" or "org.apache.avro.*"? The fully-qualified name would be more consistent with Avro schema and protocol namespaces, but it might prove awkward should Avro ever become a standard independent of the ASF's implementation. As a vaguely related example, I expect that Avro will become a top-level project at Apache this year, and am pleased that our Java implementation is already in org.apache.avro, not org.apache.hadoop.avro. I think we should thus try to keep the format specification independent of the ASF, and use "avro.*" here.

Doug
-
Re: Reserving more keywords in Avro Object Container Files? Ryan King 2010-01-22, 18:26
On Fri, Jan 22, 2010 at 9:27 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Ryan King wrote:
>>
>> We should reserve some space, like avro.*, rather than having to do it
>> one at a time.
>
> +1 This sounds like a good idea.
>
> Questions:
> - Should we rename all existing keywords to this namespace? My vote is
> yes, now is the time to do this.
> - Should the namespace be "avro.*" or "org.apache.avro.*"? The
> fully-qualified name would be more consistent with Avro schema and protocol
> namespaces, but it might prove awkward should Avro ever become a standard
> independent of the ASF's implementation. As a vaguely related example, I
> expect that Avro will become a top-level project at Apache this year, and am
> pleased that our Java implementation is already in org.apache.avro, not
> org.apache.hadoop.avro. I think we should thus try to keep the format
> specification independent of the ASF, and use "avro.*" here.

Java has java.lang, we can have just avro.

-ryan
-
Re: Reserving more keywords in Avro Object Container Files? Philip Zeyliger 2010-01-22, 18:33
I'm +1 avro.*, and I'm also +1 doing this before 1.3. It sounds like we're approaching consensus.

I've filed https://issues.apache.org/jira/browse/AVRO-368. If you wish to pipe up, now's your chance!

-- Philip
-
Re: Reserving more keywords in Avro Object Container Files? Scott Carey 2010-01-22, 19:18
It makes sense to just reserve avro.*. There is no need for this namespace to exactly line up with the Java code namespace, and I like that it is succinct.

On the specific needs for compression options, I would rather have avro.codec.options as a general-purpose container for codec options than avro.codec.compression_level. Some codecs, like gzip, have compression levels from 0 to 9. Others have a set of flags or multiple dimensions of options. Each codec can do what it will with avro.codec.options. Deflate can have "level=[0-9]" for values.

Additionally, the Codec API can incorporate a

public String getOptions();
public void SetOptions(String options);

interface so that file appends can pick up the options that the file was created with.

-Scott

On Jan 22, 2010, at 10:33 AM, Philip Zeyliger wrote:
> I'm +1 avro.*, and I'm also +1 doing this before 1.3. It sounds like
> we're approaching consensus.
>
> I've filed https://issues.apache.org/jira/browse/AVRO-368. If you
> wish to pipe up, now's your chance!
>
> -- Philip
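The two-method interface proposed in this message could look roughly like the following. This is only a sketch under stated assumptions: the names OptionsAwareCodec and DeflateOptionsSketch are hypothetical and not part of Avro, the setter is lower-cased to setOptions per Java convention, and the single "level=N" grammar is just the deflate example from the message.

```java
// Hypothetical interface; the two methods follow the proposal in this thread.
interface OptionsAwareCodec {
    String getOptions();
    void setOptions(String options);
}

// Hypothetical deflate-style codec that understands one option, "level=N".
class DeflateOptionsSketch implements OptionsAwareCodec {
    // Default chosen for this sketch only; not taken from Avro.
    private int level = 6;

    @Override
    public String getOptions() {
        // Serialize the current settings in the same form setOptions accepts.
        return "level=" + level;
    }

    @Override
    public void setOptions(String options) {
        String[] parts = options.trim().split("=", 2);
        if (parts.length != 2 || !parts[0].trim().equals("level")) {
            throw new IllegalArgumentException("unsupported options: " + options);
        }
        int parsed = Integer.parseInt(parts[1].trim());
        if (parsed < 0 || parsed > 9) {
            throw new IllegalArgumentException("level must be 0-9: " + parsed);
        }
        level = parsed;
    }

    int getLevel() {
        return level;
    }
}
```

Under this sketch, a writer that appends to an existing file would read the stored avro.codec.options string and pass it to setOptions before writing new blocks.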
-
Re: Reserving more keywords in Avro Object Container Files? Doug Cutting 2010-01-22, 19:39
Scott Carey wrote:
> On the specific needs for compression options, I would rather have avro.codec.options as a general purpose container for codec options than
> avro.codec.compression_level. Some codecs have compression levels like gzip, 0 to 9. Others have a set of flags or multiple dimensions of options. Each codec can do what it will with avro.codec.options. Deflate can have "level=[0-9]" for values.
> Additionally, the Codec API can incorporate a
>
> public String getOptions();
> public void SetOptions(String options);
>
> interface so that file appends can pick up the options that the file was created with.

Strictly speaking, we don't need to include options in the file, since they don't affect the format. They could even be misleading, since one might use different compression levels in different append operations, and I don't see any strong reason to prohibit that.

A given application could always store its options and re-use them when appending, e.g., my.gzip.level=5. If they're included in the spec then would we then prohibit one to override them? If not, what would be the purpose of putting them in the spec?

Also, rather than packing all options into a single string that must be parsed, we might instead reserve avro.codec.<codecName>.* for codec-specific options. So one might specify avro.codec.deflate.level as 5. The codec name is actually redundant, since only a single codec name is permitted per file. So this could just instead perhaps be avro.codec.level without much fear of confusion.

Doug
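The per-key alternative, avro.codec.<codecName>.*, can be sketched as a prefix lookup over the file metadata. The class and method names below are hypothetical, and the metadata is modeled as a Map<String, String> purely for illustration; this is an assumption of the sketch, not a description of Avro's actual API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Avro.
class CodecMetadataSketch {
    // Collects every metadata entry under "avro.codec.<codecName>." and
    // returns it keyed by the short option name (e.g. "level").
    static Map<String, String> optionsFor(String codecName, Map<String, String> metadata) {
        String prefix = "avro.codec." + codecName + ".";
        Map<String, String> options = new TreeMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                options.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return options;
    }
}
```

With avro.codec.deflate.level set to 5 in the metadata, optionsFor("deflate", metadata) would yield a single entry mapping level to 5, and no string parsing is needed beyond the key prefix.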
-
Re: Reserving more keywords in Avro Object Container Files? Scott Carey 2010-01-22, 22:27
On Jan 22, 2010, at 11:39 AM, Doug Cutting wrote:
> Scott Carey wrote:
>> On the specific needs for compression options, I would rather have avro.codec.options as a general purpose container for codec options than
>> avro.codec.compression_level. Some codecs have compression levels like gzip, 0 to 9. Others have a set of flags or multiple dimensions of options. Each codec can do what it will with avro.codec.options. Deflate can have "level=[0-9]" for values.
>> Additionally, the Codec API can incorporate a
>>
>> public String getOptions();
>> public void SetOptions(String options);
>>
>> interface so that file appends can pick up the options that the file was created with.
>
> Strictly speaking, we don't need to include options in the file, since
> they don't affect the format. They could even be misleading, since one
> might use different compression levels in different append operations,
> and I don't see any strong reason to prohibit that.
>

It could be misleading for codec formats like gzip/deflate where all parameters are optional. For some codecs, however, it may not be optional. LZO, for example, has several formats, and a header indicates which one is used. This can be in the data block, or metadata. I think the Codec API and metadata namespace should not restrict that choice up front.

> A given application could always store its options and re-use them when
> appending, e.g., my.gzip.level=5. If they're included in the spec then
> would we then prohibit one to override them? If not, what would be the
> purpose of putting them in the spec?

Those semantics are codec dependent. It's simply a namespace for codecs to store parameters. We do not know in advance what the semantics of these parameters are.

>
> Also, rather than packing all options into a single string that must be
> parsed, we might instead reserve avro.codec.<codecName>.* for
> codec-specific options. So one might specify avro.codec.deflate.level
> as 5. The codec name is actually redundant, since only a single codec
> name is permitted per file. So this could just instead perhaps be
> avro.codec.level without much fear of confusion.
>

Reserving a single name (avro.codec.options), or an entire namespace (avro.codec.*) is fine. The former is just a simpler interface. The latter would mean that the Codec API would have

String getOption(String optionName)

instead of

String getOptions()

Either way, the Codec needs a way to read and store options in a file. Gzip/Deflate can live without it since all streams are read-compatible. LZF is the same. Not all codecs are, however. I've been thinking of trying an LZP-class algorithm (faster encode, slower decode, smaller compressed size than LZ types like LZO), but the size of the hash table and the hash algorithm are needed at decode time.

Passing options is better than exploding the number of codecs, hard-coding parameters needed at decompress time, or (usually) storing the parameters in the data portion of each block. Exposing parameters to the Codec API means the decision on which of the above is the right thing to do for a given codec is up to the codec.

None of the above matters that much at this time from the public spec perspective since it is all within avro.*. But for use internal to the avro namespace, I think it is useful to have a general rule that if a name is reserved for a feature, its subspaces are also reserved. E.g., avro.codec is used by the Codec API/Feature, and thus avro.codec.* is implicitly reserved for future use by that API/Feature.

-Scott
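The general rule suggested at the end of this message (a reserved name implicitly reserves its subspaces) can be sketched as a dotted-prefix check. The class and method names are hypothetical, and treating the bare name "avro" as reserved alongside "avro.*" is an assumption of this sketch rather than something the thread settles.

```java
// Hypothetical helper; not part of Avro.
class ReservedKeysSketch {
    // True if key is the namespace itself or lies in its dotted subspace,
    // so "avro.codec.level" is inside "avro.codec" but "avro.codecs" is not.
    static boolean inNamespace(String key, String namespace) {
        return key.equals(namespace) || key.startsWith(namespace + ".");
    }

    // Applications would avoid writing any key for which this returns true.
    static boolean isReserved(String key) {
        return inNamespace(key, "avro");
    }
}
```

Matching on the namespace plus a trailing dot, rather than on the raw prefix, keeps unrelated application keys such as "avrox.foo" outside the reserved space.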