Pig >> mail # user >> Re: Why does Pig not use default resources from the Configuration object?


Re: Why does Pig not use default resources from the Configuration object?
Sounds good. Here is a doc on contributing patches, for some pointers:
https://cwiki.apache.org/confluence/display/PIG/HowToContribute
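
As a starting point, the property-reading half of such a patch could look roughly like the sketch below. It is only an illustration under stated assumptions: the property name pig.additional.resources and the class ExtraResources are made up here (the thread does not settle on a name), and the sketch deliberately uses only java.util so it runs without Hadoop on the classpath. In a real patch, each parsed name would presumably be passed to Hadoop's Configuration.addResource(String) on the configuration used to build the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class ExtraResources {
    // Hypothetical property name, chosen for illustration only.
    static final String EXTRA_RESOURCES_KEY = "pig.additional.resources";

    /** Returns the comma-separated resource names from the property, trimmed, skipping blanks. */
    static List<String> parse(Properties pigProps) {
        List<String> names = new ArrayList<>();
        String raw = pigProps.getProperty(EXTRA_RESOURCES_KEY, "");
        for (String part : raw.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(EXTRA_RESOURCES_KEY, "myfs-default.xml, myfs-site.xml");
        for (String name : parse(props)) {
            // Assumption: a real patch would hand each name to
            // Configuration.addResource(String) here instead of printing it.
            System.out.println(name);
        }
    }
}
```

With the property unset, parse returns an empty list, so existing setups would see no change in behavior.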
On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <[EMAIL PROTECTED]> wrote:

> Hey Prashant,
>
> Yup, I can take a stab at it. This is the first time I am looking at Pig
> code, so I might take some time to get started. Will get back to you if I
> have questions in the meantime. And yes, I will write it so it reads a pig
> property.
>
> -
> Bhooshan.
>
>
> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <[EMAIL PROTECTED]>
> wrote:
>
>> Hi Bhooshan,
>>
>> This makes more sense now. I think overriding fs implementation should go
>> into core-site.xml, but it would be useful to be able to add resources if
>> you have a bunch of other properties.
>>
>> Would you like to submit a patch? It should be based on a pig property
>> that lists the additional resource names (myfs-site.xml in your case).
>>
>> -Prashant
>>
>>
>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Prashant,
>>>
>>>
>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>> scenario that I am trying to test -
>>>
>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
>>> filesystem I am trying to implement - Let's call it MyFileSystem.class.
>>> This filesystem uses the scheme myfs:// for its URIs.
>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>> made the class available through a jar file that is part of
>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>> 3. In MyFileSystem.class, I have a static block as -
>>> static {
>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>     Configuration.addDefaultResource("myfs-site.xml");
>>> }
>>> Both these files are in the classpath. To be safe, I have also added
>>> myfs-site.xml in the constructor of MyFileSystem as
>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>> resources as well as the non-default resources in the Configuration object.
>>> 4. I am trying to access the filesystem in my pig script as -
>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>> (name:chararray, age:int); -- loading data
>>> B = FOREACH A GENERATE name;
>>> store B into 'myfs://myhost.com:8999/testoutput';
>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>>> is loaded and the properties defined in it are available.
>>> 6. However, when Pig tries to submit the job, it cannot find these
>>> properties and the job fails to submit successfully.
>>> 7. If I move all the properties defined in myfs-site.xml to
>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>> However, this is not ideal as I do not want to clutter core-site.xml
>>> with all of the properties for a separate filesystem.
>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>> that while creating the JobConf object for a job, pig adds very specific
>>> resources to the job object, and ignores the resources that may have been
>>> added already (e.g. myfs-site.xml) in the Configuration object.
>>> 9. I have tested this with native map-reduce code as well as hive, and
>>> this approach of having a separate config file for MyFileSystem works fine
>>> in both those cases.
>>>
>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>> from my own config file before submitting a job.
>>>
>>> Thanks,
>>> -
>>> Bhooshan.
>>>
>>>
>>>
>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> +User group
>>>>
>>>> Hi Bhooshan,
>>>>
>>>> By default you should be running in MapReduce mode unless specified
>>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>>> provide your code here?
>>>>
>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[EMAIL PROTECTED]>