Avro >> mail # user >> Record sort order is "lexicographically by field" -- what does that mean?

Jeremy Kahn 2013-03-27, 23:45
Re: Record sort order is "lexicographically by field" -- what does that mean?
Hey Jeremy,

On Thu, Mar 28, 2013 at 5:15 AM, Jeremy Kahn <[EMAIL PROTECTED]> wrote:
> According to the documentation
> http://avro.apache.org/docs/current/spec.html#order , the sort order for
> records is:
> record data is ordered lexicographically by field. If a field specifies that
> its order is:
>   - "ascending", then the order of its values is unaltered.
>   - "descending", then the order of its values is reversed.
>   - "ignore", then its values are ignored when sorting.
> What does "ordered lexicographically by field" mean?  I can see two
> interpretations.  Consider a record of the following schema:
> {"name": "ZooInventory",
>  "type": "record",
>  "fields": [
>    {"name": "city", "type": "string", "order": "ignore"},
>    {"name": "zebras", "type": "int", "order": "descending"},
>    {"name": "anacondas", "type": "int", "order": "ascending"},
>    {"name": "baboons", "type": "int"}
>  ]
> }
> I can read "ordered lexicographically by field" in two ways:
> (1) the names of the fields are sorted lexicographically, and the field that
>     goes lexicographically first (not marked as "order":"ignore") dominates.
> (2) the records are sorted by the sort order of each field, with the first
>     fields (not marked "order": "ignore") taking sort priority.

The second one is correct. The comparison does not reorder the fields;
it just walks through them in the order the schema defines them.

I've always read this as: compare "in the provided order (of the read
schema)" and "based on each field's order attribute (ascending,
descending, or ignore)", and that matches what I've seen using it in
Hadoop MR as well.
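
To make that concrete, here is a minimal pure-Python sketch of such a
comparison. It is not Avro's implementation: compare_records and the
(name, order) field list are hypothetical names made up for this
illustration, and it assumes every compared field holds a plain
comparable value such as an int.

```python
def compare_records(a, b, fields):
    """Compare two dict records field by field, in schema order.

    `fields` is a list of (name, order) pairs in the order the schema
    declares them; order is "ascending", "descending", or "ignore".
    Returns a negative, zero, or positive int, like a classic comparator.
    """
    for name, order in fields:
        if order == "ignore":
            continue  # ignored fields never influence the result
        c = (a[name] > b[name]) - (a[name] < b[name])
        if c != 0:
            # The first differing field decides; descending flips the sign.
            return -c if order == "descending" else c
    return 0  # every compared field is equal


# Field order copied from the ZooInventory schema quoted above.
# "baboons" declares no order, so it is treated as ascending here.
ZOO_FIELDS = [
    ("city", "ignore"),
    ("zebras", "descending"),
    ("anacondas", "ascending"),
    ("baboons", "ascending"),
]
```

Note that the field names never get sorted: only their position in the
list matters, which is why reordering the schema changes the sort.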

> So suppose I have my ZooInventory objects, and I sort them according to the
> sort order specification.
> Under interpretation (1), cities with low anaconda counts would go first in
> the sorted list, and within a given value of anacondas, sort by baboon
> count.
> Under interpretation (2), large zebra-count zoos would go first, and within
> a given value of zebras, sort ascending by anacondas.

Yes, (2) is the result you'll see. The baboons field would also be
compared, in ascending order (the default), since you haven't marked
it as ignored, btw.
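
As a rough worked example of the ordering under (2), here is a small
Python sketch. The zoo records are invented sample data, and sort_key
is a hypothetical helper that only mimics the schema's ordering with a
tuple key; it does not call Avro.

```python
# Invented sample data matching the shape of the ZooInventory schema.
zoos = [
    {"city": "Austin", "zebras": 2, "anacondas": 5, "baboons": 7},
    {"city": "Boston", "zebras": 9, "anacondas": 4, "baboons": 1},
    {"city": "Chicago", "zebras": 9, "anacondas": 1, "baboons": 3},
    {"city": "Denver", "zebras": 9, "anacondas": 1, "baboons": 2},
]


def sort_key(zoo):
    # city is ignored; zebras is descending (negated); anacondas and
    # baboons are ascending, in the order the schema lists them.
    return (-zoo["zebras"], zoo["anacondas"], zoo["baboons"])


ordered = [zoo["city"] for zoo in sorted(zoos, key=sort_key)]
print(ordered)  # ['Denver', 'Chicago', 'Boston', 'Austin']
```

The three 9-zebra zoos come first, ties on zebras are broken by
anacondas, and the remaining tie between Chicago and Denver is broken
by baboons.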

> It seems to me that (2), in which the zebras field values dominate the sort
> descending, is the "right" way to behave, but I can't seem to square that
> with my understanding of "ordered lexicographically by field" -- or maybe
> "lexicographically" means something different to me than to you, or maybe
> (2) just isn't really "right" after all.
> Behavior (2) -- relative to behavior (1) -- offers the ability to adjust the
> order of the schema to express a different sort order, but might present
> problems for schema negotiation.

What kind of problems are you describing here? Sorry, I can't tell
what you mean from the words "schema negotiation" alone.

Harsh J
Jeremy Kahn 2013-03-28, 17:57
Harsh J 2013-03-28, 18:15