Pig >> mail # user >> Log File Versioning and Pig

Mike Sukmanowsky 2013-12-12, 22:06
Re: Log File Versioning and Pig
It seems like what you're asking for is Versioned Schema management. Pig is
not designed for that. Pig is only a scripting language to manipulate data.

I'd recommend you look into Thrift, Protocol Buffers and Avro. They are
compact serialization libraries that do versioned schema management.
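
To make the idea concrete, here is a minimal Python sketch of what versioned
schema handling could look like for \u0001-delimited lines that carry a version
number as their first field. It is only an illustration under assumptions, not
how Thrift, Protocol Buffers or Avro work internally: the SCHEMAS table, the
field names and parse_line are all made up for this example.

```python
# Hypothetical field lists per log-format version; names are illustrative only.
SCHEMAS = {
    1: ["timestamp", "url", "referrer"],
    2: ["timestamp", "url", "referrer", "user_agent"],
}
DELIMITER = "\u0001"
LATEST_VERSION = max(SCHEMAS)


def parse_line(line):
    """Parse one log line whose first field is the schema version number."""
    version_str, *values = line.rstrip("\n").split(DELIMITER)
    version = int(version_str)
    if version not in SCHEMAS:
        raise ValueError(f"unknown log format version: {version}")
    fields = SCHEMAS[version]
    if len(values) != len(fields):
        raise ValueError(
            f"version {version} expects {len(fields)} fields, got {len(values)}"
        )
    record = dict(zip(fields, values))
    # Fields introduced in later versions default to None, so every record
    # has the same shape regardless of the version it was written with.
    for name in SCHEMAS[LATEST_VERSION]:
        record.setdefault(name, None)
    return record
```

With something like this in front of the analysis scripts, old and new lines
come out with the same keys, and a script only has to cope with a field being
None rather than with a different column count per version.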
On Thu, Dec 12, 2013 at 2:06 PM, Mike Sukmanowsky <[EMAIL PROTECTED]> wrote:

> We're playing around with options for what I'm sure is a common problem -
> changing schemas in our log data.
> Specifically we collect pixel data via nginx servers.  These pixels
> currently have a pretty static list of parameters in the query string.  We
> have eventual plans to change this and support many different types of
> parameters in the query string.
> Our current logs have a static number of fields separated by a \u0001
> delimiter.  So to support "dynamic fields" we have two options:
>    1. Store data using a Java/Pig Map of key:chararray and val:chararray
>    2. Stick with static fields, and version the log format so that we know
>    exactly how many fields to expect and what the schema is per line
> *Option 1 Pros:*
> No versioning needed.  If we add a new param, it's automatically picked up
> in the map and is available for all scripts to use.  Old scripts don't have
> to worry about new params being added.
> *Option 1 Cons:*
> Adds significantly to our file sizes.  Compression will help a lot, since
> many of the keys in the map are repeated string values.  But eventually,
> when logs are decompressed for analysis, they'll eat up significantly more
> disk space.  Also, we're not sure about this, but dealing with a ton of Map
> objects in Pig could be far less efficient and carry more overhead than
> just a bunch of chararrays/Strings.  Anyone know if this is true?
> *Option 2 Pros:*
> Smaller file size is the big one here, since we don't have to store the
> field names in our raw logs, only the values and probably a version
> number.
> *Option 2 Cons:*
> Becomes harder for scripts to work with different versions and we need to
> explicitly state which log file version the script depends on somewhere.
> Was hoping to get a few opinions on this, what are people doing to solve
> this in the wild?
> --
> Mike Sukmanowsky
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
Mike Sukmanowsky 2013-12-13, 14:42