Pig, mail # user - Loading data from ranges of ordered subdirs


Loading data from ranges of ordered subdirs
Rodrick Megraw 2013-06-10, 20:54
Let's say I have my input data from the past 12 months organized into subdirs by date:

/data/2012-06-10
/data/2012-06-11
...
/data/2013-06-09

Now suppose I want to run a Pig script over a range of dates within those 12 months, for example 2012-11-07 through 2013-05-26. The regex I would need to specify for this date range gets quite complicated.

Is there a way that I can get my Pig script to load data from such a range without a regex?

I could load all the data in /data/*, and then FILTER by the date field in each record, but this is not desirable if the range of dates is small compared to the entire dataset.
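
To make the range concrete, here is a minimal sketch of one possible workaround: expanding the dates outside of Pig into an explicit Hadoop-style glob instead of writing the pattern by hand. The helper name date_range_glob is hypothetical, and the sketch assumes every subdir is named by its ISO date (YYYY-MM-DD) directly under the base dir, as in the listing above.

```python
from datetime import date, timedelta


def date_range_glob(base, start, end):
    """Build a glob like /data/{2012-11-07,2012-11-08,...} for start..end inclusive.

    Hypothetical helper: assumes one subdir per day, named by ISO date.
    """
    if end < start:
        raise ValueError("end date precedes start date")
    # Number of days in the range, counting both endpoints.
    days = (end - start).days + 1
    # date.isoformat() yields YYYY-MM-DD, matching the subdir names.
    names = [(start + timedelta(days=i)).isoformat() for i in range(days)]
    return "%s/{%s}" % (base.rstrip("/"), ",".join(names))


if __name__ == "__main__":
    # The example range from this message.
    print(date_range_glob("/data", date(2012, 11, 7), date(2013, 5, 26)))
```

The printed string could then be handed to the script through Pig's parameter substitution (for example -param INPUT=... on the command line and LOAD '$INPUT' in the script). I have not checked how this behaves when a date in the range has no subdir, so that would need testing.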