Thanks Pradeep - none of our logs currently use Proto Buf/Thrift/Avro and
we were somewhat trying to stay away from these guys but they may be a good
On Thu, Dec 12, 2013 at 6:35 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
> It seems like what you're asking for is Versioned Schema management. Pig is
> not designed for that. Pig is only a scripting language to manipulate
> I'd recommend you look into Thrift, Protocol Buffers and Avro. They are
> compact serialization libraries that do versioned schema management.
> On Thu, Dec 12, 2013 at 2:06 PM, Mike Sukmanowsky <[EMAIL PROTECTED]>
> > We're playing around with options to what I'm sure is a common problem -
> > changing schemas in our log data.
> > Specifically we collect pixel data via nginx servers. These pixels
> > currently have a pretty static list of parameters in the query string. We
> > have eventual plans to change this and support many different types of
> > parameters in the query string.
> > Our current logs have a static number of fields separated by a \u0001
> > delimiter. So to support "dynamic fields" we have two options:
> > 1. Store data using a Java/Pig Map of key:chararray and val:chararray
> > 2. Stick with static fields, and version the log format so that we know
> > exactly how many fields to expect and what the schema is per line
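
To make the two options concrete, here is a minimal Python sketch (not Pig) of how a single log line might be parsed under each one. Only the \u0001 field delimiter comes from the description above. The \u0002 and \u0003 separators for map entries, the field names, and the sample values are assumptions made up for illustration.

```python
FIELD_SEP = "\u0001"  # field delimiter described in the thread
ENTRY_SEP = "\u0002"  # assumed separator between map entries
KV_SEP = "\u0003"     # assumed separator between a key and its value

# Hypothetical static schema for option 2; the real field names will differ.
STATIC_FIELDS = ["timestamp", "url", "referrer"]


def parse_map_line(line):
    """Option 1: every line carries its own key names."""
    result = {}
    for entry in line.split(ENTRY_SEP):
        key, _, value = entry.partition(KV_SEP)
        result[key] = value
    return result


def parse_static_line(line):
    """Option 2: position carries the meaning, so the field count must match."""
    values = line.split(FIELD_SEP)
    if len(values) != len(STATIC_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(STATIC_FIELDS), len(values))
        )
    return dict(zip(STATIC_FIELDS, values))


# The same hypothetical record encoded both ways.
pairs = [("timestamp", "1386889500"), ("url", "http://example.com/a"), ("referrer", "")]
map_line = ENTRY_SEP.join(k + KV_SEP + v for k, v in pairs)
static_line = FIELD_SEP.join(v for _, v in pairs)
```

Both parsers return the same dictionary for these two lines. The difference is where the schema lives: in every line for option 1, in the script for option 2.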
> > *Option 1 Pros:*
> > No versioning needed. If we add a new param, it's automatically picked up
> > in the map and is available for all scripts to use. Old scripts don't have
> > to worry about new params being added.
> > *Option 1 Cons:*
> > Adds significantly to our file sizes. Compression will help big time as
> > many of the keys in the map are repeated string values which will benefit
> > largely from compression. But eventually when logs are decompressed for
> > analysis, they'll eat up significantly more disk space. Also, we're not
> > sure about this but dealing with a ton of Map objects in Pig could be way
> > more inefficient and have more overhead than just a bunch of
> > chararrays/Strings. Anyone know if this is true?
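
One way to put rough numbers on the file-size concern is to encode the same synthetic records both ways and compare raw and gzip-compressed sizes. The sketch below is plain Python with made-up field names, values, and map separators, so the figures it prints say nothing about real pixel logs. It also does not address the separate question of Map overhead inside Pig.

```python
import gzip

FIELD_SEP = "\u0001"  # field delimiter described in the thread
ENTRY_SEP = "\u0002"  # assumed separator between map entries
KV_SEP = "\u0003"     # assumed separator between a key and its value

# Hypothetical field names; real logs will have their own.
FIELDS = ["timestamp", "url", "referrer", "user_agent"]


def make_rows(n):
    """Build n synthetic records with made-up values."""
    return [
        [str(1386889500 + i), "http://example.com/page/%d" % i,
         "http://example.org/", "Mozilla/5.0"]
        for i in range(n)
    ]


def encode_static(rows):
    """Option 2 style: values only, one record per line."""
    return "\n".join(FIELD_SEP.join(r) for r in rows).encode("utf-8")


def encode_map(rows):
    """Option 1 style: key and value stored for every field."""
    return "\n".join(
        ENTRY_SEP.join(k + KV_SEP + v for k, v in zip(FIELDS, r)) for r in rows
    ).encode("utf-8")


def sizes(data):
    """Return (raw size, gzip-compressed size) in bytes."""
    return len(data), len(gzip.compress(data))


rows = make_rows(10000)
static_raw, static_gz = sizes(encode_static(rows))
map_raw, map_gz = sizes(encode_map(rows))
print("static: raw=%d gzip=%d" % (static_raw, static_gz))
print("map:    raw=%d gzip=%d" % (map_raw, map_gz))
```

The raw map encoding is always larger here, by exactly the length of the key names plus one separator per field on every line. How much of that gap survives compression depends on the data, which is why it is worth running something like this against a sample of real logs.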
> > *Option 2 Pros:*
> > Basically smaller file size is the big one here since we don't have to
> > store the field name in our raw logs, only the value and probably a version
> > number also.
> > *Option 2 Cons:*
> > Becomes harder for scripts to work with different versions and we need to
> > explicitly state which log file version the script depends on somewhere.
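
A possible way to soften this con is to put the version number in the first field of every line and keep the per-version schemas in one shared place, so each script only names the fields it needs. The following Python sketch (not Pig) assumes that layout. The version numbers, field names, and the helper names parse_versioned_line and project are all hypothetical.

```python
FIELD_SEP = "\u0001"  # field delimiter described in the thread

# Hypothetical schemas keyed by the version number stored in each line's first field.
SCHEMAS = {
    "1": ["timestamp", "url"],
    "2": ["timestamp", "url", "referrer"],
}


def parse_versioned_line(line):
    """Look up the schema for the line's version and name the fields."""
    version, *values = line.split(FIELD_SEP)
    if version not in SCHEMAS:
        raise ValueError("unknown log version: %r" % version)
    fields = SCHEMAS[version]
    if len(values) != len(fields):
        raise ValueError(
            "version %s expects %d fields, got %d"
            % (version, len(fields), len(values))
        )
    record = dict(zip(fields, values))
    record["version"] = version
    return record


def project(record, wanted, default=None):
    """Return only the fields a script asks for, filling absent ones with a default."""
    return {name: record.get(name, default) for name in wanted}


old_line = FIELD_SEP.join(["1", "1386889500", "http://example.com/a"])
new_line = FIELD_SEP.join(["2", "1386889501", "http://example.com/b", "http://example.org/"])

for line in (old_line, new_line):
    print(project(parse_versioned_line(line), ["url", "referrer"]))
```

With this arrangement an old script keeps working when a new version adds fields, and a new script reading old lines sees a default for fields that did not exist yet. This is roughly the behaviour that schema-aware formats provide out of the box, done by hand.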
> > Was hoping to get a few opinions on this, what are people doing to solve
> > this in the wild?
> > --
> > Mike Sukmanowsky
> > Product Lead, http://parse.ly
> > 989 Avenue of the Americas, 3rd Floor
> > New York, NY 10018
> > p: +1 (416) 953-4248
> > e: [EMAIL PROTECTED]