Pig >> mail # user >> Log File Versioning and Pig


Re: Log File Versioning and Pig
Thanks Pradeep. None of our logs currently use Protocol Buffers, Thrift, or
Avro, and we were trying to stay away from them, but they may be a good
option.
On Thu, Dec 12, 2013 at 6:35 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> It seems like what you're asking for is Versioned Schema management. Pig is
> not designed for that. Pig is only a scripting language to manipulate
> datasets.
>
> I'd recommend you look into Thrift, Protocol Buffers and Avro. They are
> compact serialization libraries that do versioned schema management.
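
To make "versioned schema management" concrete: Avro, for instance, can read a record written under an older schema by resolving it against the reader's schema and filling in declared defaults for fields the writer did not have. The sketch below imitates only that idea in plain Python. It does not use the Avro API, and the schema layout, the field names, and the `resolve` helper are all made up for illustration.

```python
# Hypothetical reader schema: (field name, default) pairs.
# This imitates the idea of schema resolution; it is NOT the Avro API.
READER_V2 = [("url", None), ("referrer", None), ("campaign", "none")]


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    Fields missing from the record take the reader's default;
    fields the reader does not know about are dropped.
    """
    return {name: record.get(name, default) for name, default in reader_schema}


# A record written before the hypothetical "campaign" field existed.
old_record = {"url": "http://example.com/a", "referrer": "http://example.com/"}
resolved = resolve(old_record, READER_V2)
print(resolved)
```

A script written against the newer schema can then read old and new records the same way, which is the property the libraries above provide out of the box.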
>
>
> On Thu, Dec 12, 2013 at 2:06 PM, Mike Sukmanowsky <[EMAIL PROTECTED]>
> wrote:
>
> > We're playing around with options for what I'm sure is a common problem -
> > changing schemas in our log data.
> >
> > Specifically we collect pixel data via nginx servers.  These pixels
> > currently have a pretty static list of parameters in the query string.  We
> > have eventual plans to change this and support many different types of
> > parameters in the query string.
> >
> > Our current logs have a static number of fields separated by a \u0001
> > delimiter.  So to support "dynamic fields" we have two options:
> >
> >    1. Store data using a Java/Pig Map of key:chararray and val:chararray
> >    2. Stick with static fields, and version the log format so that we know
> >    exactly how many fields to expect and what the schema is per line
> >
> > *Option 1 Pros:*
> > No versioning needed.  If we add a new param, it's automatically picked up
> > in the map and is available for all scripts to use.  Old scripts don't have
> > to worry about new params being added.
> >
> > *Option 1 Cons:*
> > Adds significantly to our file sizes.  Compression will help a lot, since
> > many of the keys in the map are repeated strings that compress well.  But
> > once logs are decompressed for analysis, they'll take up significantly more
> > disk space.  Also, we're not sure about this, but handling a large number
> > of Map objects in Pig could be less efficient and carry more overhead than
> > a handful of chararrays/Strings.  Does anyone know if this is true?
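
One way to put numbers on the size concern is to encode the same records both ways and compare raw and gzip-compressed sizes. The following is a rough sketch only: the field names and sample values are invented, and a simple `key=value` encoding stands in for whatever map serialization would actually be used.

```python
import gzip

DELIM = "\u0001"
# Hypothetical pixel parameters; the real field names are not given in the thread.
FIELDS = ["url", "referrer", "user_agent", "ts"]


def encode_positional(record):
    """Option 2 style: values only, in a fixed order."""
    return DELIM.join(record[f] for f in FIELDS)


def encode_map(record):
    """Option 1 style: key and value stored for every field."""
    return DELIM.join(f"{k}={v}" for k, v in record.items())


def sizes(lines):
    """Return (raw bytes, gzip-compressed bytes) for a list of log lines."""
    raw = "\n".join(lines).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Synthetic records, far more uniform than real traffic would be.
records = [
    {"url": f"http://example.com/page/{i}", "referrer": "http://example.com/",
     "user_agent": "Mozilla/5.0", "ts": str(1386880000 + i)}
    for i in range(10000)
]

pos_raw, pos_gz = sizes([encode_positional(r) for r in records])
map_raw, map_gz = sizes([encode_map(r) for r in records])
print(f"positional: raw={pos_raw} gzip={pos_gz}")
print(f"map:        raw={map_raw} gzip={map_gz}")
```

The raw difference is simply the key bytes repeated on every line. How much of that survives compression depends on the data, so running something like this over a sample of real logs would be more telling than the synthetic records used here. It says nothing about Pig's in-memory overhead for maps, which would need to be measured separately.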
> >
> > *Option 2 Pros:*
> > Smaller file size is the main one, since we don't have to store the field
> > names in our raw logs, only the values and probably a version number.
> >
> > *Option 2 Cons:*
> > It becomes harder for scripts to work with different versions, and each
> > script needs to state explicitly somewhere which log file version it
> > depends on.
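
A minimal sketch of what per-line versioning could look like on the reading side, assuming the version number is stored as the first field of each line (the thread does not say where it would go). The `SCHEMAS` table and its field names are hypothetical.

```python
DELIM = "\u0001"

# Hypothetical field lists per log format version.
SCHEMAS = {
    "1": ["url", "referrer"],
    "2": ["url", "referrer", "campaign"],
}


def parse_line(line):
    """Parse one delimited line whose first field is the format version."""
    version, *values = line.rstrip("\n").split(DELIM)
    fields = SCHEMAS.get(version)
    if fields is None:
        raise ValueError(f"unknown log format version: {version!r}")
    if len(values) != len(fields):
        raise ValueError(
            f"version {version} expects {len(fields)} fields, got {len(values)}"
        )
    record = dict(zip(fields, values))
    record["version"] = version
    return record


v1 = parse_line(DELIM.join(["1", "http://example.com/a", "http://example.com/"]))
v2 = parse_line(DELIM.join(["2", "http://example.com/b", "", "spring"]))
print(v1)
print(v2)
```

Keeping the version-to-fields table in one shared place means individual scripts do not have to hard-code field positions, though each script still has to decide what to do when a field it needs is absent from an older version.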
> >
> > Was hoping to get a few opinions on this. What are people doing to solve
> > this in the wild?
> >
> > --
> > Mike Sukmanowsky
> >
> > Product Lead, http://parse.ly
> > 989 Avenue of the Americas, 3rd Floor
> > New York, NY  10018
> > p: +1 (416) 953-4248
> > e: [EMAIL PROTECTED]
> >
>

--
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: [EMAIL PROTECTED]