Re: Why does Pig not use default resources from the Configuration object?
Sounds good. Here is a doc on contributing a patch (for some pointers):
https://cwiki.apache.org/confluence/display/PIG/HowToContribute
On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <[EMAIL PROTECTED]> wrote:

> Hey Prashant,
>
> Yup, I can take a stab at it. This is the first time I am looking at Pig
> code, so I might take some time to get started. Will get back to you if I
> have questions in the meantime. And yes, I will write it so it reads a pig
> property.
>
> -
> Bhooshan.
>
>
> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <[EMAIL PROTECTED]
> > wrote:
>
>> Hi Bhooshan,
>>
>> This makes more sense now. I think the fs implementation override should go
>> into core-site.xml, but it would be useful to be able to add extra resources
>> when you have a bunch of other properties.
>>
>> Would you like to submit a patch? It should be based on a Pig property
>> that specifies the additional resource names (myfs-site.xml in your case).
>>
>> -Prashant
>>
>>
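A minimal sketch of what such a property-driven change could look like, purely for
illustration: the property name pig.additional.config.resources and the helper class
below are hypothetical and not part of Pig.

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;

public class AdditionalResourceLoader {
    // Reads a comma-separated list of extra resource file names from the Pig
    // properties and adds each one to the Hadoop Configuration before job setup.
    public static void addUserResources(Properties pigProps, Configuration conf) {
        String resources = pigProps.getProperty("pig.additional.config.resources");
        if (resources == null || resources.trim().isEmpty()) {
            return;
        }
        for (String name : resources.split(",")) {
            // Each entry is resolved from the classpath, just like core-site.xml.
            conf.addResource(name.trim());
        }
    }
}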
>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Prashant,
>>>
>>>
>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>> scenario that I am trying to test -
>>>
>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
>>> filesystem I am trying to implement - Let's call it MyFileSystem.class.
>>> This filesystem uses the scheme myfs:// for its URIs
>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>> made the class available through a jar file that is part of
>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>> 3. In MyFileSystem.class, I have a static block as -
>>> static {
>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>     Configuration.addDefaultResource("myfs-site.xml");
>>> }
>>> Both these files are on the classpath. To be safe, I have also added
>>> myfs-site.xml in the constructor of MyFileSystem as
>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>> resources and the non-default resources in the Configuration object (a
>>> short stand-alone illustration of this follows the list).
>>> 4. I am trying to access the filesystem in my pig script as -
>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>> (name:chararray, age:int); -- loading data
>>> B = FOREACH A GENERATE name;
>>> store B into 'myfs://myhost.com:8999/testoutput';
>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>> invoked as expected. Inside MyFileSystem.class, I can also see that
>>> myfs-site.xml is loaded and the properties defined in it are available.
>>> 6. However, when Pig tries to submit the job, it cannot find these
>>> properties and the submission fails.
>>> 7. If I move all the properties defined in myfs-site.xml to
>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>> However, this is not ideal, as I do not want to clutter core-site.xml
>>> with all of the properties for a separate filesystem.
>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>> that while creating the JobConf object for a job, pig adds very specific
>>> resources to the job object, and ignores the resources that may have been
>>> added already (eg myfs-site.xml) in the Configuration object.
>>> 9. I have tested this with native map-reduce code as well as hive, and
>>> this approach of having a separate config file for MyFileSystem works fine
>>> in both those cases.
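The Configuration behavior described in steps 3 and 8 can be reproduced outside of
Pig with a small stand-alone program. The sketch below shows generic Hadoop
Configuration semantics only, not what Pig's job-submission code actually does, and
the key myfs.example.key is made up for the example.

import org.apache.hadoop.conf.Configuration;

public class ResourceLoadingDemo {
    public static void main(String[] args) {
        // What MyFileSystem's static block does in step 3: register the file as a
        // default resource for every Configuration instance that loads defaults.
        Configuration.addDefaultResource("myfs-site.xml");

        Configuration withDefaults = new Configuration();         // loads core-default.xml,
                                                                   // core-site.xml, myfs-site.xml
        Configuration withoutDefaults = new Configuration(false); // loads nothing automatically

        // addResource() affects only this one instance (and copies made from it),
        // unlike the global addDefaultResource() call above.
        withoutDefaults.addResource("myfs-site.xml");

        System.out.println(withDefaults.get("myfs.example.key"));
        System.out.println(withoutDefaults.get("myfs.example.key"));
    }
}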
>>>
>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>> from my own config file before submitting a job.
>>>
>>> Thanks,
>>> -
>>> Bhooshan.
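Before changing Pig, one way to confirm that steps 1, 2 and 5 behave as described on
the client side is a stand-alone check along these lines. It assumes the MyFileSystem
jar and myfs-site.xml are on the classpath and that fs.myfs.impl is set as in step 2;
myfs.example.key is again a made-up property name.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class MyFsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // core-default.xml + core-site.xml
        conf.addResource("myfs-site.xml");         // same call as in MyFileSystem's constructor

        // Resolves the myfs:// scheme via fs.myfs.impl and instantiates MyFileSystem.
        FileSystem fs = FileSystem.get(URI.create("myfs://myhost.com:8999/"), conf);
        System.out.println("Implementation: " + fs.getClass().getName());

        // A property defined only in myfs-site.xml should be visible here.
        System.out.println("myfs.example.key = " + conf.get("myfs.example.key"));
    }
}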
>>>
>>>
>>>
>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> +User group
>>>>
>>>> Hi Bhooshan,
>>>>
>>>> By default you should be running in MapReduce mode unless specified
>>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>>> provide your code here?
>>>>
>>>> Sent from my iPhone
>>>>
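For reference, a minimal sketch of driving Pig through a PigServer object in
MapReduce mode, which is what the question above refers to; the script name is
illustrative.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class RunPigScript {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE is also the default execution mode when none is specified.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerScript("myscript.pig");  // or registerQuery(...) per statement
    }
}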
>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[EMAIL PROTECTED]>