Avro >> mail # user >> Record sort order is "lexicographically by field" -- what does that mean?

Record sort order is "lexicographically by field" -- what does that mean?
According to the documentation
http://avro.apache.org/docs/current/spec.html#order , the sort order for
records is:

record data is ordered lexicographically by field. If a field specifies
that its order is:
   - "ascending", then the order of its values is unaltered.
   - "descending", then the order of its values is reversed.
   - "ignore", then its values are ignored when sorting.

What does "ordered lexicographically by field" mean?  I can see two
interpretations.  Consider a record of the following schema:

{"name": "ZooInventory",
 "type": "record",
 "fields": [
   {"name": "city", "type": "string", "order": "ignore"},
   {"name": "zebras", "type": "int", "order": "descending"},
   {"name": "anacondas", "type": "int", "order": "ascending"},
   {"name": "baboons", "type": "int"}
 ]}

I can read "ordered lexicographically by field" in two ways:

   1. the names of the fields are sorted lexicographically, and the field
   whose name comes first (and is not marked "order": "ignore") takes sort
   priority, followed by the remaining fields in name order.
   2. the records are sorted by the sort order of each field, with the
   fields declared first in the schema (and not marked "order": "ignore")
   taking sort priority.

So suppose I have my ZooInventory objects, and I sort them according to the
sort order specification.

Under interpretation (1), cities with low anaconda counts would go first in
the sorted list, and within a given value of anacondas, sort by baboon
count.

Under interpretation (2), large zebra-count zoos would go first, and within
a given value of zebras, sort ascending by anacondas.
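
To make the difference concrete, here is a minimal Python sketch of the two
readings. It is not Avro's implementation: the sample zoos, the FIELDS list,
and the helper names (sort_key, sort_by_field_name, sort_by_declaration) are
all made up for illustration. It assumes that a field with no "order"
attribute sorts ascending (the default the spec gives), and that under
reading (1) the remaining fields follow in name order.

```python
# (name, order) pairs in schema declaration order, mirroring ZooInventory.
FIELDS = [
    ("city", "ignore"),
    ("zebras", "descending"),
    ("anacondas", "ascending"),
    ("baboons", "ascending"),  # assumption: omitted "order" means ascending
]


def sort_key(record, fields):
    """Build a tuple key; fields earlier in `fields` take priority."""
    key = []
    for name, order in fields:
        if order == "ignore":
            continue
        value = record[name]
        # Negation reverses the order; this shortcut only works for numbers.
        key.append(-value if order == "descending" else value)
    return tuple(key)


def sort_by_field_name(records):
    # Reading (1): priority follows the alphabetical order of field names.
    return sorted(records, key=lambda r: sort_key(r, sorted(FIELDS)))


def sort_by_declaration(records):
    # Reading (2): priority follows the order of fields in the schema.
    return sorted(records, key=lambda r: sort_key(r, FIELDS))


# Hypothetical sample data.
zoos = [
    {"city": "Austin", "zebras": 2, "anacondas": 5, "baboons": 1},
    {"city": "Boston", "zebras": 9, "anacondas": 7, "baboons": 3},
    {"city": "Chicago", "zebras": 9, "anacondas": 1, "baboons": 8},
]

print([z["city"] for z in sort_by_field_name(zoos)])
print([z["city"] for z in sort_by_declaration(zoos)])
```

With this data, reading (1) gives Chicago, Austin, Boston (anacondas
ascending), while reading (2) gives Chicago, Boston, Austin (zebras
descending, ties broken by anacondas), so the two readings can be told apart
even on a small sample.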

It seems to me that (2), in which the zebras field values dominate the sort
descending, is the "right" way to behave, but I can't seem to square that
with my understanding of "ordered lexicographically by field" -- or maybe
"lexicographically" means something different to me than to you, or maybe
(2) just isn't really "right" after all.

Behavior (2) -- relative to behavior (1) -- offers the ability to adjust
the order of the schema to express a different sort order, but might
present problems for schema negotiation.
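
As a rough illustration of that point, the Python sketch below sorts the same
records under two field orderings, using behavior (2). The function
sort_records and the sample records are hypothetical, not part of any Avro
API, and the example is limited to numeric fields.

```python
def sort_records(records, fields):
    """Sort by the non-ignored fields, in the order given (behavior 2)."""
    def key(record):
        return tuple(
            -record[name] if order == "descending" else record[name]
            for name, order in fields
            if order != "ignore"
        )
    return sorted(records, key=key)


# Hypothetical sample data.
zoos = [
    {"zebras": 2, "anacondas": 5},
    {"zebras": 9, "anacondas": 7},
    {"zebras": 4, "anacondas": 1},
]

# Same fields and per-field orders; only the declaration order differs.
zebras_first = [("zebras", "descending"), ("anacondas", "ascending")]
anacondas_first = [("anacondas", "ascending"), ("zebras", "descending")]

by_zebras = sort_records(zoos, zebras_first)
by_anacondas = sort_records(zoos, anacondas_first)

print([z["zebras"] for z in by_zebras])
print([z["zebras"] for z in by_anacondas])
```

Under behavior (2), swapping the two field declarations changes the result.
Under behavior (1), both orderings would sort identically, because the field
names are the same.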

Which is the expected behavior?