Tuesday, Pig Meetup
Alan Gates - upcoming improvements in operators/backend physical plan.
Reworking the UDF interface while keeping backward compatibility.
Hadoop 2 is coming; adoption will be slow.
Bill Graham, Julien & Twitter - Optimization oriented. Cluster is at
capacity. Detect skew, cost based optimizers, dynamic tuning. Gathering
performance metrics, will be in HCatalog. Look at previous executions of
same job to optimize on the fly.
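
A rough sketch of what "detect skew" can mean in practice, not Twitter's implementation: count records per join/group key and flag any key that alone would overload a reducer. The function name, the `factor` threshold and the even-split assumption are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def find_skewed_keys(keys, num_reducers, factor=2.0):
    """Return {key: count} for keys holding more than `factor` times
    the per-reducer fair share of records.

    Assumption (illustrative only): records would ideally be spread
    evenly, so the fair share is total_records / num_reducers.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0 or num_reducers <= 0:
        return {}
    fair_share = total / num_reducers
    return {k: c for k, c in counts.items() if c > factor * fair_share}


# Example: one hot key dominates the input.
keys = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
skewed = find_skewed_keys(keys, num_reducers=4)
```

In a real job the counts would come from a sample or from metrics of earlier runs rather than a full pass over the data. Keys flagged this way are candidates for Pig's skewed join (JOIN ... USING 'skewed').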
Companies: Yahoo, consultants, Salesforce, Twitter, Hortonworks, Cloudera,
Zocalo Systems?, Trend Micro
Bill presented Ambrose. Motivation: Pig scripts that run as 40 MapReduce
jobs, so they added a DAG view. Shows the progress of your script as a
percentage and as a stepwise view. Helps with debugging and optimization.
Major progress.
Pig users talk - using Pig in local mode on a sample, then pushing to the
cluster. Using illustrate to cut developer iterations. No counters in local
mode. Embedded Pig in loops for ML. Java embedding.
Java API PigServer to run scripts from apps. Macros are helping remove ugly
blocks of code, but UDF pain is better solved by JRuby. Mortar Data fixed Python
Reducing friction around using Pig with tools is important. The slowness of
batch is hard on new users. It is hard to prepare a sample on which joins
still produce output. Illustrate was invented for this purpose.
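
Why samples break joins: sampling each input independently rarely keeps matching keys on both sides. A minimal sketch of one workaround, assuming small in-memory inputs represented as lists of dicts: sample the shared keys first, then filter both sides by them. The function and parameter names are hypothetical, and this is not how illustrate works internally.

```python
import random


def sample_for_join(left, right, key, fraction, seed=0):
    """Return (left_sample, right_sample) restricted to the same
    randomly chosen subset of join keys, so a join on `key` over the
    samples is non-empty whenever the full join is non-empty.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    shared = sorted(left_keys & right_keys)  # keys that actually join
    if not shared:
        return [], []
    k = max(1, int(len(shared) * fraction))  # keep at least one key
    kept = set(rng.sample(shared, k))
    return (
        [row for row in left if row[key] in kept],
        [row for row in right if row[key] in kept],
    )


# Example: ids 50..99 appear on both sides; keep 10% of those keys.
left = [{"id": i, "name": "user%d" % i} for i in range(100)]
right = [{"id": i, "clicks": i * 2} for i in range(50, 150)]
left_s, right_s = sample_for_join(left, right, "id", fraction=0.1)
```

The trade-off is that key-based sampling needs a pass over both inputs to find the shared keys, which is itself a join-sized job on real data.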
Scheduling Pig jobs is still a problem. Oozie is unpopular and too hard.
Azkaban is inadequate for the enterprise. People hack things together. It
HCatalog is maturing. REST API. Hive and Pig together. The REST interface
is for metadata only so far. People want to extend it to grab UDFs, etc.
Russell Jurney http://datasyndrome.com