Pig, mail # user - Input and output path


Re: Input and output path
Ruslan Al-Fakikh 2012-09-11, 15:12
Mohit,

I am suggesting setting up a whole Hive warehouse. This way your
folders will look like
/user/hive/warehouse/yourdataset/date=2012-09-11
/user/hive/warehouse/yourdataset/date=2012-09-12
...
All the partitions' metadata will be kept in an RDBMS (the Hive
metastore), so when you query them with Hive it will look like
select * from yourdataset where date = '2012-09-11'
and it will be fast, because Hive reads only the matching partition folder

HCatalog is a layer that exposes this Hive functionality to Pig and
MapReduce, so in Pig you can FILTER by those dates.
http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html#Load+Examples
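
A minimal sketch of such a filter, written as a shell helper that prints
the Pig script. It assumes the table is registered in HCatalog as
yourdataset with a partition column named date, and it uses the
HCatLoader class from the HCatalog 0.4 docs linked above. The function
name and the DAY parameter are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a Pig script that reads one date partition
# of 'yourdataset' through HCatalog. Table and column names follow the
# example in this thread; DAY is an assumed parameter name.
write_filter_script() {
    # The quoted 'EOF' keeps $DAY literal, so Pig (not the shell)
    # substitutes it at run time.
    cat <<'EOF'
A = LOAD 'yourdataset' USING org.apache.hcatalog.pig.HCatLoader();
-- Filter on the partition column right after the LOAD.
B = FILTER A BY date == '$DAY';
DUMP B;
EOF
}

write_filter_script
```

Saved to a file, say filter_by_date.pig (a made-up name), the script
could be run with something like
pig -param DAY=2012-09-11 filter_by_date.pig
provided the HCatalog jars are on Pig's classpath.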

Best Regards

On Tue, Sep 11, 2012 at 3:29 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> On Mon, Sep 10, 2012 at 4:17 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>
>> Mohit,
>>
>> I guess you could use parameter substitution here
>> http://wiki.apache.org/pig/ParameterSubstitution
>>
> Thanks, this works.
>
>
>> Also, a note about your architecture:
>>
>
> Are you suggesting a change to the path names, or is your suggestion to
> use HCatalog with Pig?
>
>
>> You can consider using Hive partitions to effectively select
>> appropriate dates in the folder names. But as your tool is Pig, not
>> Hive, you can use HCatalog as a layer
>>
>> Best Regards
>>
>> On Tue, Sep 11, 2012 at 3:11 AM, Mohit Anchlia <[EMAIL PROTECTED]>
>> wrote:
>> > Our input path is something like YYYY/MM/DD/HH/input and we would like
>> > to write to YYYY/MM/DD/HH/output. Is it possible to get the input path
>> > as a String and convert it to YYYY/MM/DD/HH/output that I can use in a
>> > "store into" clause?
>>
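
For the original question, the parameter substitution approach can be
sketched as a small shell wrapper that derives the output path from the
input path and passes both to Pig. This is only an illustration: the
parameter names INPUT and OUTPUT, the script name process.pig, and both
function names are assumptions, and the command is printed rather than
executed.

```shell
#!/bin/sh
# Derive the output path from an input path of the form
# YYYY/MM/DD/HH/input by swapping the last path component.
derive_output_path() {
    # ${1%/input} strips a trailing "/input" from the first argument.
    printf '%s/output\n' "${1%/input}"
}

# Build (but do not run) the pig command line. INPUT and OUTPUT are
# assumed parameter names; the Pig script would refer to them as
# '$INPUT' in its LOAD statement and '$OUTPUT' in its STORE ... INTO.
build_pig_command() {
    input_path=$1
    script=$2
    printf 'pig -param INPUT=%s -param OUTPUT=%s %s\n' \
        "$input_path" "$(derive_output_path "$input_path")" "$script"
}

build_pig_command 2012/09/11/15/input process.pig
# prints: pig -param INPUT=2012/09/11/15/input -param OUTPUT=2012/09/11/15/output process.pig
```

Keeping the path logic outside Pig means the script itself only ever
sees two plain strings.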