Pig >> mail # user >> Question regarding a custom LoadFunc implementation


Prashant Kommireddi 2012-12-11, 09:10
Bill Graham 2012-12-11, 16:12
Prashant Kommireddi 2012-12-11, 16:20
Re: Question regarding a custom LoadFunc implementation
We had a YAML file that mapped each physical data source to its concrete
loader, with the generic loader serving as a facade in front of them. Now
we're moving to an HCatalog-based solution that handles that mapping as well
as the logical-to-physical resolution. Basically, the mappings are stored in
a DB.
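
For illustration, here is a minimal sketch of that kind of logical-to-physical lookup. Everything in it is hypothetical: the class name `LogicalLocationResolver`, the example path, and the in-memory map that stands in for the YAML file or DB. In a real Pig loader the lookup would presumably be called from `LoadFunc.setLocation` before the path is handed to the input format, but that wiring is an assumption and is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper: maps a logical location such as
 * "scheme://SampleTable" to the physical HDFS path that holds its data,
 * so the user never sees or types the physical path.
 */
final class LogicalLocationResolver {
    private static final String SEPARATOR = "://";

    // Stand-in for the YAML file or DB table that stores the mappings.
    private final Map<String, String> physicalPaths;

    LogicalLocationResolver(Map<String, String> physicalPaths) {
        // Defensive copy so later changes by the caller do not leak in.
        this.physicalPaths = new HashMap<>(physicalPaths);
    }

    /** Returns the physical path registered for the logical location. */
    String resolve(String logicalLocation) {
        int sep = logicalLocation.indexOf(SEPARATOR);
        if (sep < 0) {
            throw new IllegalArgumentException(
                "Expected scheme://TableName, got: " + logicalLocation);
        }
        String table = logicalLocation.substring(sep + SEPARATOR.length());
        String path = physicalPaths.get(table);
        if (path == null) {
            throw new IllegalArgumentException(
                "No physical location registered for table: " + table);
        }
        return path;
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<>();
        // Assumed example mapping; the real path would come from the exporter.
        mappings.put("SampleTable", "/warehouse/exports/sample_table");

        LogicalLocationResolver resolver = new LogicalLocationResolver(mappings);
        System.out.println(resolver.resolve("scheme://SampleTable"));
    }
}
```

Because the exporter only has to update the mapping store, it can run on its own schedule, and the loader just resolves whatever location is currently registered.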
On Tue, Dec 11, 2012 at 8:20 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> Thanks Bill. Any ideas on how to hide the location of HDFS files from the
> end user?
>
> On Tue, Dec 11, 2012 at 9:42 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
>> I think the latter would be better. Since the LoadFunc would be decoupled
>> from the data exporter, you could schedule the exporting independently of
>> the loading. We do something similar, without the $query part.
>>
>>
>> On Tue, Dec 11, 2012 at 1:10 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>>
>> > I was working on a LoadFunc and needed some ideas/second opinion on the
>> > best way to do this:
>> >
>> >
>> >    1. We use an API to download data from the database as flat files.
>> >       - A query is given with the table name and the fields required
>> >         to extract data.
>> >    2. Once 1. is done, upload the data to HDFS.
>> >    3. Upload the schema file to HDFS.
>> >    4. A LoadFunc reads the schema file and parses the data.
>> >
>> > A strict requirement is to hide the details of the location of these
>> > HDFS files from the user issuing the Pig query. For a user it could
>> > look as simple as:
>> >
>> > A = load 'scheme://SampleTable' using CustomLoader('$query');
>> >
>> > The user only issues the load statement on a table with a query, and
>> > the API calls for importing from the database could happen in the
>> > background.
>> >
>> > What would be the best way to do this? Is it better to do the above as
>> > part of the LoadFunc, or would it be more beneficial to do it separately
>> > and somehow communicate the location from the API import to the LoadFunc?
>> >
>> > Thanks,
>> >
>> > Prashant
>> >
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> [EMAIL PROTECTED] going forward.*
>>
>
>
--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*