Pig >> mail # user >> Loading data from ranges of ordered subdirs


Rodrick Megraw 2013-06-10, 20:54
Re: Loading data from ranges of ordered subdirs
There are two possibilities that come to mind.

1. Write a custom LoadFunc that handles the date-range matching itself, so
the script does not need the regular expressions. *Not the ideal solution*
2. Use HCatalog. The example they have in their documentation seems to fit
your use case perfectly.
(http://incubator.apache.org/hcatalog/docs/r0.5.0/)

There might be other ways to do this, but I'm not aware of them.
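
One such alternative, offered only as a rough sketch: Pig's LOAD accepts
Hadoop glob patterns, including {a,b,c} alternation, so the list of daily
subdirs can be generated outside the script and passed in as a parameter.
The helper name date_range_glob and the script names in the usage note are
hypothetical, and the sketch assumes the /data/YYYY-MM-DD layout from the
question. How missing days are handled is not covered here.

```python
from datetime import date, timedelta


def date_range_glob(base_dir, start, end):
    """Build a brace glob naming every daily subdir from start to end, inclusive.

    Hypothetical helper: assumes subdirs are named YYYY-MM-DD under base_dir.
    """
    if end < start:
        raise ValueError("end date must not precede start date")
    day_count = (end - start).days + 1
    # isoformat() yields YYYY-MM-DD, matching the directory names in the question.
    names = [(start + timedelta(days=i)).isoformat() for i in range(day_count)]
    # Doubled braces emit literal '{' and '}' around the comma-separated names.
    return "{}/{{{}}}".format(base_dir.rstrip("/"), ",".join(names))


if __name__ == "__main__":
    # The range from the original question.
    print(date_range_glob("/data", date(2012, 11, 7), date(2013, 5, 26)))
```

The output could then be handed to Pig through parameter substitution, for
example pig -param input="$(python make_glob.py)" myscript.pig, with the
script using LOAD '$input'. This keeps the date arithmetic out of the Pig
script and avoids reading directories outside the range.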

Hope this helps.
On Mon, Jun 10, 2013 at 4:54 PM, Rodrick Megraw <[EMAIL PROTECTED]> wrote:

> Let's say I have my input data from the past 12 months organized into
> subdirs by date:
>
> /data/2012-06-10
> /data/2012-06-11
> ...
> /data/2013-06-09
>
> And now say that I want to run a Pig script to process data from a range
> of dates within the last 12 months, say 2012-11-07 through 2013-05-26. The
> regex that I could specify for this date range is going to get quite
> complicated.
>
> Is there a way that I can get my Pig script to load data from such a range
> without a regex?
>
> I could load all the data in /data/*, and then FILTER by the date field in
> each record, but this is not desirable if the range of dates is small
> compared to the entire dataset.
>