Re: Log File Versioning and Pig
Pradeep Gollakota 2013-12-12, 23:35
It seems like what you're asking for is versioned schema management. Pig is
not designed for that. Pig is only a scripting language for manipulating data.
I'd recommend you look into Thrift, Protocol Buffers and Avro. They are
compact serialization libraries that do versioned schema management.
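To make the idea concrete, here is a minimal plain-Python sketch of what versioned schema handling boils down to. It does not use the Thrift, Protocol Buffers, or Avro APIs; it only illustrates the general pattern of reading records written under an older schema and filling in defaults for fields that did not exist yet. The field names, the defaults, and the convention of putting the version number first on each \u0001-delimited line are all assumptions made up for this example.

```python
# Hypothetical example: field names, defaults, and the "version first"
# line layout are assumptions, not taken from any real log format.
DELIMITER = "\u0001"

# Writer-side schemas: the ordered field list for each log format version.
SCHEMAS = {
    1: ["timestamp", "url"],
    2: ["timestamp", "url", "referrer"],
}

# Reader-side view: every field the analysis code expects, with the default
# used when a record was written before that field existed.
READER_DEFAULTS = {"timestamp": "", "url": "", "referrer": ""}


def parse_line(line):
    """Parse one version-prefixed log line into a dict with the reader's fields."""
    version_text, *values = line.rstrip("\n").split(DELIMITER)
    fields = SCHEMAS[int(version_text)]  # KeyError means an unknown version
    if len(values) != len(fields):
        raise ValueError(
            "version %s expects %d fields, got %d"
            % (version_text, len(fields), len(values))
        )
    record = dict(READER_DEFAULTS)      # start from the defaults
    record.update(zip(fields, values))  # overlay what the writer stored
    return record
```

A script written against the reader-side view can then consume old and new lines alike, for example `parse_line("1\u00011386890000\u0001http://example.com/a")["referrer"]` yields the default empty string. The serialization libraries above give you this kind of resolution without hand-maintaining the version table.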
On Thu, Dec 12, 2013 at 2:06 PM, Mike Sukmanowsky <[EMAIL PROTECTED]> wrote:
> We're playing around with options for what I'm sure is a common problem:
> changing schemas in our log data.
> Specifically we collect pixel data via nginx servers. These pixels
> currently have a pretty static list of parameters in the query string. We
> have eventual plans to change this and support many different types of
> parameters in the query string.
> Our current logs have a static number of fields separated by a \u0001
> delimiter. So to support "dynamic fields" we have two options:
> 1. Store data using a Java/Pig Map of key:chararray and val:chararray
> 2. Stick with static fields, and version the log format so that we know
> exactly how many fields to expect and what the schema is per line
> *Option 1 Pros:*
> No versioning needed. If we add a new param, it's automatically picked up
> in the map and is available for all scripts to use. Old scripts don't have
> to worry about new params being added.
> *Option 1 Cons:*
> Adds significantly to our file sizes. Compression will help a lot, since
> many of the keys in the map are repeated strings. But eventually, when logs
> are decompressed for analysis, they'll eat up significantly more disk
> space. Also, we're not sure about this, but dealing with a ton of Map
> objects in Pig could be much less efficient and carry more overhead than
> just a bunch of chararrays/Strings. Anyone know if this is true?
> *Option 2 Pros:*
> Smaller file size is the big one here, since we don't have to store the
> field names in our raw logs, only the values and probably a version
> number.
> *Option 2 Cons:*
> It becomes harder for scripts to work with different versions, and we need
> to explicitly state somewhere which log file version each script depends on.
> Was hoping to get a few opinions on this, what are people doing to solve
> this in the wild?
> Mike Sukmanowsky
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY 10018
> p: +1 (416) 953-4248
> e: [EMAIL PROTECTED]