Notes of interest from Apache Pig Hackday, Austin edition
Jeremy Hanna 2012-05-12, 18:23
Thanks again to Twitter for doing their event and inspiring ours. I just wanted to report on some things we did in Austin for anyone interested. We had a good turnout of about 30 people.
Kevin Safford presented an introduction to Pig, or Pig 101. The slides are available here: http://www.slideshare.net/ktsafford/dachis-group-pigout101-12895911
Timothy Potter, down from Colorado, gave a presentation on intermediate Pig, or Pig 202. His slides are available here: http://www.slideshare.net/thelabdude/dachis-group-pig-hackday-pig-202
Clint Miller gave an introduction to unit testing with Pig with these slides: http://www.slideshare.net/clintmiller1/unit-testing-pig
After that we had some lunch and linked up remotely for a bit with the Twitter hackday in the Bay Area. Their group is mostly Pig committers and contributors, so they worked on Pig tickets. One thing that Twitter open sourced as part of the event was a workflow visualization tool called Ambrose: https://github.com/twitter/ambrose
Also mentioned was Alan Gates's excellent reference, Programming Pig. The web version can be found here: http://ofps.oreilly.com/titles/9781449302641/index.html
We started the afternoon with a list of things we could work on:
• Pig Mahout integration (pigout) led by Timothy Potter
• PigUnit improvements led by Clint Miller
• David Boney wanted to get his KDD data preparation going with Pig for a competition
• Kevin wanted to help people get the presentation examples running
• Brandon Kearby led a group on helping get the IntelliJ IDEA Pig plugin working.
• Josh Levy wanted to see about getting grunt to recognize parameters passed in.
• Josh also wanted to look more at the Python UDF scripting and see if it could be improved.
• John Prior wanted to see if there could be a grunt pretty print when using describe
• John also wanted to see if bash command history facilities could be added to grunt
• John also brought up that KNIME is a really cool visual workflow creator for machine learning, and that something similar could be developed for Pig.
• The CassandraStorage loadstorefunc was also brought up as something Brandon Williams might work on, specifically the way to have it automatically use secondary indexes.
What actually happened?
Tim is going to continue working on the pig-vector integration into Mahout pending some feedback from Tim and the Mahout folks.
Clint worked on getting the Pig 0.10 branch downloaded and built locally in order to have something to patch against for the PigUnit improvements outlined on this ticket: https://issues.apache.org/jira/browse/PIG-2692
David Boney got his data loaded up in CFS, the Cassandra file system, and made some progress there.
Several people talked about Pig generally and about getting things running on their own laptops and environments.
Brandon Kearby and others forked https://github.com/brandonkearby/three-little-piggies and the jar in that project can now be added to your IntelliJ IDEA plugins directory to associate .pig files and provide source coloring. There's still some work to do there, but it's nice to have that working and available for IntelliJ 11 users.
Josh Levy got some ideas together with a couple of other attendees on how to improve the Pig/Python UDF scripting. Josh and Jeremy contacted Julien from Twitter, who had written the Python UDF support, and he is reviewing Josh's proposed changes with the possibility of creating a ticket for it.
Grunt pretty print? Coincidentally, someone in the Bay Area had the same thought and, independently of our efforts, created a ticket and submitted a patch to do just that: https://issues.apache.org/jira/browse/PIG-2697
Brandon Williams is working on the CassandraStorage ticket - https://issues.apache.org/jira/browse/CASSANDRA-4238
Besides that, there was great interaction among everyone until people went their own ways around 4 PM. Thanks to Twitter for doing their hackathon. We didn't interact too much with them because their group was more advanced and we didn't want to slow them down. Several of us chatted in the #hadoop-pig channel on freenode (IRC) with Russell Jurney and Jonathan Coveney from the Bay Area.