Pig >> mail # user >> Loading data from ranges of ordered subdirs


Loading data from ranges of ordered subdirs
Let's say I have my input data from the past 12 months organized into subdirs by date:

/data/2012-06-10
/data/2012-06-11
...
/data/2013-06-09

Now say that I want to run a Pig script to process data from a range of dates within the last 12 months, say 2012-11-07 through 2013-05-26. The regex (glob pattern) that I would have to specify for this date range gets quite complicated.

Is there a way that I can get my Pig script to load data from such a range without a regex?

I could load all the data in /data/*, and then FILTER by the date field in each record, but this is not desirable if the range of dates is small compared to the entire dataset.
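
To make the problem concrete, here is a minimal sketch of one possible workaround: generate the path pattern outside of Pig and pass it in as a parameter. It relies on LOAD accepting Hadoop-style globs with {a,b,c} alternatives. The function name date_range_glob, the file name date_glob.py, and the parameter name INPUT are all hypothetical, and this is an illustration under those assumptions rather than a confirmed answer from the list.

```python
from datetime import date, timedelta


def date_range_glob(base_dir, start, end):
    """Build a Hadoop-style glob matching every daily subdir from start to end, inclusive.

    Assumes the subdirs are named YYYY-MM-DD directly under base_dir,
    as in the /data/2012-06-10 layout described above.
    """
    if end < start:
        raise ValueError("end date must not precede start date")
    base = base_dir.rstrip("/")
    days = (end - start).days + 1
    names = [(start + timedelta(days=i)).isoformat() for i in range(days)]
    if len(names) == 1:
        # A single day needs no brace alternatives.
        return "%s/%s" % (base, names[0])
    return "%s/{%s}" % (base, ",".join(names))


if __name__ == "__main__":
    # Print the glob so that a wrapper script can hand it to Pig.
    print(date_range_glob("/data", date(2012, 11, 7), date(2013, 5, 26)))
```

The output could then be passed on the command line, for example pig -param INPUT="$(python date_glob.py)" myscript.pig, with the script doing LOAD '$INPUT'. This lists every day explicitly (201 directories for the range above), so the pattern is long but trivial to generate, and only the requested directories are read.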
     