Question regarding a custom LoadFunc implementation
Prashant Kommireddi 2012-12-11, 09:10
I was working on a LoadFunc and needed some ideas/second opinion on the best way to do this:
1. We use an API to download data from database as flat-files.
   - A query is given with table name and fields required to extract data
2. Once 1. is done upload data to HDFS
3. Upload the schema file to HDFS
4. LoadFunc to read the schema file and parse data
A strict requirement is to hide the location of these HDFS files from the user issuing the Pig query. For a user it could look as simple as:
A = load 'scheme://SampleTable' using CustomLoader('$query');
The user only issues the load statement on a table with a query, and the API calls for importing from the database could happen in the background.
What would be the best way to do this? Is it better to do the above as part of the LoadFunc, or would it be more beneficial to do it separately and somehow communicate the location from the API import to the LoadFunc?
Thanks,
Prashant
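Step 4 of the question (a loader that reads a schema file and then parses the data) might look roughly like the sketch below. The thread does not describe the schema file format, so the one-`name:type`-per-line layout, the tab delimiter, and the class and method names are all assumptions made for illustration. The sketch uses only the Java standard library so it can run without Pig on the classpath; in an actual LoadFunc, logic like this would presumably sit behind `getNext()` and build a Pig `Tuple` instead of a `Map`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical schema-driven record parser; not part of Pig. */
final class SchemaSketch {

    /** Parses schema text with one "name:type" pair per line (assumed format). */
    static List<String[]> parseSchema(String schemaText) {
        List<String[]> fields = new ArrayList<>();
        for (String line : schemaText.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split(":", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad schema line: " + line);
            }
            fields.add(new String[] {parts[0].trim(), parts[1].trim()});
        }
        return fields;
    }

    /** Splits one tab-delimited line and converts each value per the schema. */
    static Map<String, Object> parseRecord(String line, List<String[]> schema) {
        String[] values = line.split("\t", -1); // -1 keeps trailing empty fields
        if (values.length != schema.size()) {
            throw new IllegalArgumentException(
                "Expected " + schema.size() + " fields but got " + values.length);
        }
        Map<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            String name = schema.get(i)[0];
            String type = schema.get(i)[1];
            record.put(name, convert(values[i], type));
        }
        return record;
    }

    /** Converts a raw string to a Java value; type names follow Pig's simple types. */
    private static Object convert(String raw, String type) {
        if (raw.isEmpty()) {
            return null; // assumption: an empty field means null
        }
        switch (type) {
            case "int":       return Integer.valueOf(raw);
            case "long":      return Long.valueOf(raw);
            case "double":    return Double.valueOf(raw);
            case "chararray": return raw;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }
}
```

Keeping the schema parsing separate from the record parsing means the schema file only has to be read once per task, and the same parsed schema could also back a `LoadMetadata.getSchema` implementation if the loader needs to report its schema to Pig.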
Re: Question regarding a custom LoadFunc implementation
Bill Graham 2012-12-11, 16:12
I think the latter would be better. Since the LoadFunc would be decoupled from the data exporter, you could schedule the exporting independently of the loading. We do something similar, without the $query part.
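One way to picture the decoupled approach is a small registry that the exporter writes to after each export and that the loader only reads. The sketch below is an illustration, not the setup Bill describes: the class name, the use of a `java.util.Properties` file, and the local-filesystem path standing in for a shared location are all assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical hand-off point between an exporter job and a loader. */
final class LocationRegistrySketch {

    /** Exporter side: record where the latest export of a table landed. */
    static void publish(Path registry, String table, String dataLocation) {
        Properties props = load(registry);
        props.setProperty(table, dataLocation);
        try (OutputStream out = Files.newOutputStream(registry)) {
            props.store(out, "logical table -> physical location");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Loader side: find the data for a table without the user supplying a path. */
    static String lookup(Path registry, String table) {
        String location = load(registry).getProperty(table);
        if (location == null) {
            throw new IllegalStateException("No export registered for " + table);
        }
        return location;
    }

    /** Reads the registry file, or returns an empty registry if none exists yet. */
    private static Properties load(Path registry) {
        Properties props = new Properties();
        if (Files.exists(registry)) {
            try (InputStream in = Files.newInputStream(registry)) {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }
}
```

Because the exporter and loader share only the registry, the export can run on its own schedule, and the loader fails fast with a clear message if a table has not been exported yet.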
Re: Question regarding a custom LoadFunc implementation
Prashant Kommireddi 2012-12-11, 16:20
Thanks Bill. Any ideas on how to hide the location of HDFS files from the end user?
Re: Question regarding a custom LoadFunc implementation
Bill Graham 2012-12-11, 23:06
We had a YAML file that mapped each physical data source to its loader, with the generic loader serving as a facade in front of them. Now we're moving to an HCatalog-based solution that handles that as well as the logical-to-physical resolution. Basically, the mappings are stored in a DB.
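The logical-to-physical resolution mentioned here can be sketched as a lookup keyed by the table name in the load location. This is a minimal illustration under stated assumptions: the resolver class is hypothetical, an in-memory `Map` stands in for the YAML file or database, and it resolves only the location, not which concrete loader to delegate to. In a Pig LoadFunc, a call like `resolve` would most likely be made from `setLocation(String, Job)`, with `relativeToAbsolutePath` overridden to return the logical location unchanged so Pig does not treat it as a filesystem path.

```java
import java.net.URI;
import java.util.Map;

/** Hypothetical resolver from a logical location such as scheme://SampleTable to a physical path. */
final class LogicalLocationResolver {
    private final String expectedScheme;
    private final Map<String, String> mappings; // table name -> physical location

    LogicalLocationResolver(String expectedScheme, Map<String, String> mappings) {
        this.expectedScheme = expectedScheme;
        this.mappings = mappings;
    }

    /** Returns the physical location for a logical one, or throws if it is unknown. */
    String resolve(String location) {
        URI uri = URI.create(location); // throws IllegalArgumentException on bad syntax
        if (!expectedScheme.equals(uri.getScheme())) {
            throw new IllegalArgumentException("Unsupported scheme in " + location);
        }
        // In scheme://SampleTable the table name is the URI authority.
        String table = uri.getAuthority();
        String physical = (table == null) ? null : mappings.get(table);
        if (physical == null) {
            throw new IllegalArgumentException("Unknown table in " + location);
        }
        return physical;
    }
}
```

With this shape the user's script only ever contains `scheme://SampleTable`, and moving the files on HDFS means updating the mapping rather than every script.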