Pig >> mail # user >> Question regarding a custom LoadFunc implementation

Question regarding a custom LoadFunc implementation
I am working on a LoadFunc and would like some ideas or a second opinion on
the best way to do this:
   1. We use an API to download data from a database as flat files.
      - A query specifies the table name and the fields to extract.
   2. Once step 1 is done, upload the data to HDFS.
   3. Upload the schema file to HDFS.
   4. A LoadFunc reads the schema file and parses the data.
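
To make step 4 concrete, here is a minimal sketch in plain Java of the parsing part only, with no Pig or HDFS dependencies. The class name `FlatFileSchema`, the one-`name:type`-per-line schema format, and the tab-delimited data format are all assumptions for illustration; the real formats depend on what the export API produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: holds field names read from a schema file. */
final class FlatFileSchema {
    private final List<String> fieldNames;

    private FlatFileSchema(List<String> fieldNames) {
        this.fieldNames = Collections.unmodifiableList(fieldNames);
    }

    /**
     * Parses schema lines of the ASSUMED form "name:type", one field per line.
     * Blank lines are skipped; the type part is ignored in this sketch.
     */
    static FlatFileSchema parse(List<String> schemaLines) {
        List<String> names = new ArrayList<>();
        for (String line : schemaLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.indexOf(':');
            names.add(colon < 0 ? trimmed : trimmed.substring(0, colon).trim());
        }
        return new FlatFileSchema(names);
    }

    List<String> fieldNames() {
        return fieldNames;
    }

    /** Splits one ASSUMED tab-delimited record and checks the field count. */
    List<String> parseRecord(String line) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] values = line.split("\t", -1);
        if (values.length != fieldNames.size()) {
            throw new IllegalArgumentException(
                "Expected " + fieldNames.size() + " fields but got " + values.length);
        }
        return Arrays.asList(values);
    }
}
```

In an actual loader, the values returned by `parseRecord` would presumably be copied into a Pig tuple inside `getNext()`, with the schema lines read from HDFS beforehand.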

A strict requirement is to hide the location of these HDFS files from the
user issuing the Pig query. For a user, it could look as simple as:

A = load 'scheme://SampleTable' using CustomLoader('$query');

Here the user only issues the load statement against a table with a query,
and the API calls that import from the database happen in the background.
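
One way to keep the HDFS paths hidden is to translate the logical URI inside the loader. The sketch below is plain Java and only illustrates that mapping; `TableLocationResolver`, the staging root, and the `data`/`schema` file names are hypothetical, and the scheme name is simply copied from the load statement above.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: maps a logical table URI to hidden HDFS paths. */
final class TableLocationResolver {
    // Scheme name taken from the example load statement; an assumption.
    private static final String SCHEME = "scheme";

    // Assumed HDFS staging directory that the import job writes into.
    private final String importRoot;

    TableLocationResolver(String importRoot) {
        // Drop a trailing slash so joined paths do not contain "//".
        this.importRoot = importRoot.endsWith("/")
            ? importRoot.substring(0, importRoot.length() - 1)
            : importRoot;
    }

    /** Extracts the table name from a location such as scheme://SampleTable. */
    String tableName(String location) {
        URI uri;
        try {
            uri = new URI(location);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed location: " + location, e);
        }
        if (!SCHEME.equals(uri.getScheme())) {
            throw new IllegalArgumentException("Unsupported scheme in: " + location);
        }
        // getAuthority() is used rather than getHost() so that table names
        // that are not valid host names (e.g. with underscores) still work.
        String table = uri.getAuthority();
        if (table == null || table.isEmpty()) {
            throw new IllegalArgumentException("No table name in: " + location);
        }
        return table;
    }

    /** Hidden HDFS path of the imported flat file (assumed layout). */
    String dataPath(String location) {
        return importRoot + "/" + tableName(location) + "/data";
    }

    /** Hidden HDFS path of the schema file (assumed layout). */
    String schemaPath(String location) {
        return importRoot + "/" + tableName(location) + "/schema";
    }
}
```

A custom loader could presumably call something like this from its `relativeToAbsolutePath` or `setLocation` override, so the user only ever sees `scheme://SampleTable`. Whether the import itself should also be triggered from there is exactly the open question.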

What would be the best way to do this? Is it better to do the above as part
of the LoadFunc, or would it be better to do it separately and somehow
communicate the location from the API import to the LoadFunc?