Carl-Daniel Hailfinger 2013-11-07, 13:47
You're pretty much stuck with options 1 and 2, with option 1 being the accepted
solution. The whole idea of MapReduce is that you're not able to use a
single machine to compute your answers. You can put an 'fs -put' command in
your script to stage the input on HDFS first, before the rest of the
script runs in MR mode.
Local mode is mainly there for testing purposes, not for production use.
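
A rough sketch of what I mean by staging with 'fs -put', untested and with
made-up names: the local log path, the HDFS file names, the piggybank.jar
location and the regex are all placeholders, so substitute your own.

```pig
-- Copy the raw Squid log from the local disk of the submitting node to HDFS.
-- Relative HDFS paths resolve against your HDFS home directory.
-- Note: the put fails if the destination already exists, e.g. on a re-run.
fs -put /var/log/squid/access.log squid_access.log

-- MyRegexLoader ships with piggybank; the jar path here is an assumption.
REGISTER piggybank.jar;

-- Placeholder pattern: each capture group becomes one field of the tuple.
logs = LOAD 'squid_access.log'
       USING org.apache.pig.piggybank.storage.MyRegexLoader('^(\\S+)\\s+(\\S+)\\s+(\\S+)');

-- The parsed result lands on HDFS, ready for the later processing steps.
STORE logs INTO 'squid_parsed' USING PigStorage('\t');
```

Run it in MR mode (pig -x mapreduce). This is still your variant 1; the copy
just happens inside the script instead of as a separate manual step.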
On Thu, Nov 7, 2013 at 5:47 AM, Carl-Daniel Hailfinger <
[EMAIL PROTECTED]> wrote:
> I'm processing squid log files with Pig courtesy of MyRegexLoader. After
> a first processing step (saving with PigStorage) there's quite a lot of
> data processing to do.
> There's a catch, though. A superfluous copy operation:
> 1. variant: Copy the original Squid logs manually to HDFS with "hdfs dfs
> -copyFromLocal", then read them in Pig (distributed mode) from HDFS with
> MyRegexLoader, then store them in HDFS with PigStorage.
> 2. variant: Read the original logs from local filesystem in Pig (local
> mode) with MyRegexLoader, store them on the local filesystem with
> PigStorage, then copy the result to HDFS with "hdfs dfs -copyFromLocal".
> Is there a way to have Pig read files from local fs, but store the
> result in HDFS? Given that reading files from local fs can't be done in
> distributed mode, I'd be totally happy to have that operation only run
> on the local node as long as the stored file is accessible via HDFS.
> I tried various ways to specify file locations as hdfs:// and file://,
> but that didn't work out. AFAICS the documentation is pretty silent on
> Any ideas or hints about what to do?