I'm developing a quite complex system using Pig and I'd like to confirm some ideas if possible. They are not really questions. They are more like thoughts.
1) I'm creating my "input" data using Pig itself. That is, the actual input is a small file with a few rows (few meaning not Big Data), and for each of these rows I create lots of data (my real input).
Since the creation of the real input is CPU-bound, I decided to create a separate file for each row and LOAD them separately. This allows Pig to fire a different Map process for each of them, hopefully obtaining some parallelization. Is it OK?
2) I have a UDF that I call in a projection. This UDF communicates with my S3 bucket, and the relation produced by this projection is never used. It seems that the Pig optimizer simply discards this UDF. What I did was make the UDF return a boolean value, which I store on S3 (a lightweight file). This way it gets executed. Any thoughts on this?
Thank you. I'll come back later on with other ideas. I hope this reasoning may help someone :)
On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <[EMAIL PROTECTED]> wrote:
Seems totally reasonable to me, albeit laborious. Be sure to set pig.splitCombination to false. Alternatively, you could try the approach here and write your own simple InputFormat: http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html The idea is similar in that the "input" is actually just a very small file and numerous simulations are run in parallel using Pig.
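For what it's worth, the one-file-per-row setup described above could be prepared with a small script along these lines. This is only a sketch: the function name `split_rows` and the `part-NNNNN.txt` naming scheme are made up for illustration, and it assumes the seed file is plain text with one row per line.

```python
import os
import tempfile


def split_rows(seed_path, out_dir):
    """Write each non-empty line of seed_path to its own file in out_dir.

    Returns the list of created file paths, in row order.
    """
    os.makedirs(out_dir, exist_ok=True)
    created = []
    with open(seed_path, encoding="utf-8") as seed:
        rows = [line.rstrip("\n") for line in seed if line.strip()]
    for index, row in enumerate(rows):
        # Hypothetical naming scheme; any unique name per row would do.
        path = os.path.join(out_dir, "part-%05d.txt" % index)
        with open(path, "w", encoding="utf-8") as out:
            out.write(row + "\n")
        created.append(path)
    return created


if __name__ == "__main__":
    # Small demo with a throwaway three-row seed file.
    work = tempfile.mkdtemp()
    seed_file = os.path.join(work, "seed.txt")
    with open(seed_file, "w", encoding="utf-8") as f:
        f.write("a\t1\nb\t2\nc\t3\n")
    for p in split_rows(seed_file, os.path.join(work, "rows")):
        print(p)
```

Each resulting file would then be LOADed by the Pig script; with pig.splitCombination set to false, the intent is that Pig does not merge the small files back into a single split.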
Can you explain further? It's not clear what you're trying to do/what isn't working.
Regarding your UDF, you're creating a lot of overhead for storing something outside of the Hadoop ecosystem, imho.
Why not create a dump of your booleans and then have a separate script push them all to S3 at one time after your Pig script is complete? That way you wouldn't be waiting on puts to S3 to complete your script.
On Wed, Jul 16, 2014 at 5:20 AM, Rodrigo Ferreira <[EMAIL PROTECTED]> wrote:
Pig takes the network of relations you define and computes only what it needs in order to produce the outputs you ask for.
Pig is all about creating the desired outputs, so it reserves the right to create a query plan that is entirely different from the network of relations you defined. If you create a job with multiple outputs, however, it is smart enough to share intermediate steps between them.
For instance, if you never use relation B, it won't compute relation B. If relation Z depends on B, it will compute B on demand (or do something equivalent) in the process of computing Z.
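The behaviour described above can be pictured with a toy model. This is not Pig's actual planner, just an illustrative sketch: the function `relations_to_compute` and the example relation names are hypothetical. Starting from the relations that are stored, it walks the dependencies backwards and keeps only what is reachable.

```python
def relations_to_compute(depends_on, stored):
    """Return the set of relations reachable from the stored outputs.

    depends_on maps a relation name to the relations it is derived from.
    Anything not reachable from a stored output is never needed.
    """
    needed = set()
    stack = list(stored)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(depends_on.get(name, []))
    return needed


if __name__ == "__main__":
    # Hypothetical script: A is loaded, B and C derive from A, Z derives from B.
    graph = {"A": [], "B": ["A"], "C": ["A"], "Z": ["B"]}
    print(sorted(relations_to_compute(graph, ["Z"])))  # ['A', 'B', 'Z']
    print(sorted(relations_to_compute(graph, ["C"])))  # ['A', 'C']
```

In this model, a relation whose only purpose is a side effect (such as a UDF call talking to S3) drops out unless something stored depends on it, which matches what was observed with the discarded UDF.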
You certainly can materialize a relation, store it in HDFS or S3, then load it later. This isn't hard to do, but then you have to write different code depending on whether you compute the relation or LOAD it. It's one of the many "missing features" in Pig that would make it easier to maintain bigger Pig systems.
On Wed, Jul 16, 2014 at 10:23 AM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
Paul Houle Expert on Freebase, DBpedia, Hadoop and RDF (607) 539 6254 paul.houle on Skype [EMAIL PROTECTED]