Pig, mail # user - Question regarding a custom LoadFunc implementation


Prashant Kommireddi 2012-12-11, 09:10
Bill Graham 2012-12-11, 16:12
Prashant Kommireddi 2012-12-11, 16:20
Re: Question regarding a custom LoadFunc implementation
Bill Graham 2012-12-11, 23:06
We had a YAML file that mapped physical datasources to the concrete loaders
that the generic one serves as a facade for. Now we're moving to an
HCatalog-based solution that handles that mapping as well as the
logical-to-physical resolution. Basically, the mappings are stored in a DB.
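
A minimal sketch of that kind of mapping, for illustration only. It assumes
the logical name is the authority part of a location such as
'scheme://SampleTable', and that each entry carries a physical HDFS path plus
the name of a concrete loader. The class DatasourceRegistry, the example path,
and com.example.TsvLoader are all hypothetical; this is not the setup
described above, and it stays on the plain JDK rather than the Pig API.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry: logical table name -> physical source details. */
public class DatasourceRegistry {

    /** Physical details that stay hidden from the user writing the Pig script. */
    public static final class Entry {
        public final String hdfsPath;     // assumed physical location on HDFS
        public final String loaderClass;  // assumed name of the concrete loader

        public Entry(String hdfsPath, String loaderClass) {
            this.hdfsPath = hdfsPath;
            this.loaderClass = loaderClass;
        }
    }

    // In the thread this mapping lives in a YAML file or a DB; a map stands in here.
    private final Map<String, Entry> entries = new HashMap<>();

    public void register(String logicalName, String hdfsPath, String loaderClass) {
        entries.put(logicalName, new Entry(hdfsPath, loaderClass));
    }

    /** Resolves a location such as "scheme://SampleTable" to its entry. */
    public Entry resolve(String location) {
        // The authority component of the URI is treated as the logical name.
        String logicalName = URI.create(location).getAuthority();
        Entry entry = entries.get(logicalName);
        if (entry == null) {
            throw new IllegalArgumentException("Unknown datasource: " + location);
        }
        return entry;
    }

    public static void main(String[] args) {
        DatasourceRegistry registry = new DatasourceRegistry();
        registry.register("SampleTable", "/data/exports/sample_table",
                "com.example.TsvLoader");

        Entry entry = registry.resolve("scheme://SampleTable");
        System.out.println(entry.loaderClass + " <- " + entry.hdfsPath);
    }
}
```

A facade LoadFunc could call something like resolve() from its
setLocation(String location, Job job) method and hand the physical path to the
loader it wraps, so the script only ever mentions the logical name.
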
On Tue, Dec 11, 2012 at 8:20 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> Thanks Bill. Any ideas on how to hide the location of HDFS files from the
> end user?
>
> On Tue, Dec 11, 2012 at 9:42 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
>> I think the latter would be better. Since the LoadFunc would be decoupled
>> from the data exporter you could schedule the exporting independent of the
>> loading. We do something similar, without the $query part.
>>
>>
>> On Tue, Dec 11, 2012 at 1:10 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>>
>> > I was working on a LoadFunc and needed some ideas/second opinion on the
>> > best way to do this:
>> >
>> >
>> >    1. We use an API to download data from database as flat-files.
>> >       - A query is given with table name and fields required to extract data
>> >    2. Once 1. is done upload data to HDFS
>> >    3. Upload the schema file to HDFS
>> >    4. LoadFunc to read the schema file and parse data
>> >
>> > A strict requirement is to hide the details of the location of these HDFS
>> > files from the user issuing the pig query. For a user it could look as
>> > simple as:
>> >
>> > A = load 'scheme://SampleTable' using CustomLoader('$query');
>> >
>> > User here only issues the load statement on table with a query and API
>> > calls for importing from database could happen in the background.
>> >
>> > What would be the best way to do this? Is it better to do the above as part
>> > of LoadFunc, or would it rather be beneficial to do it separate and somehow
>> > communicate the location from API import to LoadFunc?
>> >
>> > Thanks,
>> >
>> > Prashant
>> >
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> [EMAIL PROTECTED] going forward.*
>>
>
>
--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*