Pig Meetup Notes
Russell Jurney 2012-06-13, 04:45
Tuesday, Pig Meetup
Alan Gates - upcoming improvements in operators/backend physical plan. De-spaghettification. Reworking the UDF interface while keeping backward compatibility. Hadoop 2 is coming; adoption will be slow.

Bill Graham, Julien & Twitter - Optimization oriented. Cluster is at capacity. Detect skew, cost-based optimizers, dynamic tuning. Gathering performance metrics, which will be in HCatalog. Look at previous executions of the same job to optimize on the fly.

Companies: Yahoo, consultants, Salesforce, Twitter, Hortonworks, Cloudera, Zocalo Systems?, Trend Micro

Bill presented Ambrose. Motivation: Pig scripts with 40 MR jobs; added a DAG view. Shows you the progress of your script as a percentage and as a stepwise view. Helps with debugging and optimization. Major progress.

Pig users talk - using Pig in local mode on a sample, then pushing to the cluster. Using illustrate to cut developer iterations. No counters in local mode. Embedded Pig in loops for ML. Java embedding. Java API PigServer to run scripts from apps. Macros are helping remove ugly blocks of code, but the UDF problem is more fully solved by JRuby. Mortar Data fixed Python UDFs.

Reducing friction around using Pig with tools is important. Slowness of batch is hard for new users. It is hard to prepare a sample that will work with joins. Illustrate was invented for this purpose.

Scheduling Pig jobs is still a problem. Oozie is unpopular and too hard. Azkaban is inadequate for the enterprise. People hack things together. It sucks.

HCatalog is maturing. REST API. Hive and Pig together. The REST interface is for metadata so far. People want to extend it to grab UDFs, etc.

Russell Jurney http://datasyndrome.com