Re: pig script - failed reading input from s3
Hello Vitalii,

The 5TB limit only applies if you are using the EMR framework to run your
jobs in a job flow.
I don't think I can use that in my case, as I have a CDH4 cluster on EC2. But
thanks for the tip.
Reference:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html
On Tue, Apr 9, 2013 at 9:09 AM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote:

> Have you tried it with the native filesystem? AFAIR the limit was raised to 5TB a
> few years ago.
> On Apr 8, 2013, at 18:30, "Panshul Whisper" <[EMAIL PROTECTED]> wrote:
>
> > Thank you for the advice David.
> >
> > I tried this and it works with the native system. But my problem is not
> > solved yet, because I have to work with files much bigger than 5GB. My test
> > data file is 9GB. How do I make it read from s3://?
> >
> > Thanking You,
> >
> > Regards,
> >
> >
> > On Mon, Apr 8, 2013 at 3:27 PM, David LaBarbera <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Try
> > > fs.s3n.aws...
> > >
> > > and also load from s3
> > > data = load 's3n://...'
> > >
> > > The "n" stands for native. I believe S3 also supports block device
> > storage
> > > (s3://) which allows bigger files to be stored. I don't know how (if at
> > > all) the two types interact.
> > >
> > > David
> > >
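Putting that suggestion together, a minimal sketch of the change (assuming the standard Hadoop property names fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey, which are abbreviated above, and reusing the bucket path from the script further down with only the URI scheme switched):

    -- Sketch only: s3n ("native") credentials and load path.
    -- Property names are assumed from standard Hadoop S3 configuration.
    set fs.s3n.awsAccessKeyId 'xxxxxxxxxxxxxxxxxx';
    set fs.s3n.awsSecretAccessKey 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
    -- Same object, read through the native S3 filesystem.
    data = load 's3n://steamdata/nysedata/NYSE_daily.txt' as
        (exchange:chararray, symbol:chararray, date:chararray, open:float,
         high:float, low:float, close:float, volume:int, adj_close:float);
    -- The rest of the script (group, count, order, store) stays the same,
    -- with the store path also switched to the s3n:// scheme.

If the native filesystem's old 5GB ceiling has indeed been raised, as Vitalii suggests, the 9GB test file may work through s3n as well.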
> > > On Apr 7, 2013, at 1:11 PM, Panshul Whisper <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello
> > > >
> > > > I am trying to run a Pig script which is supposed to read input from s3
> > > > and write back to s3. The cluster scenario is as follows:
> > > > * Cluster is installed on EC2 using the Cloudera Manager 4.5 Automatic
> > > > Installation
> > > > * Installed version: CDH4
> > > > * Script location: on one of the nodes of the cluster
> > > > * Running as: $ pig countGroups_daily.pig
> > > >
> > > > *The Pig Script*:
> > > > set fs.s3.awsAccessKeyId xxxxxxxxxxxxxxxxxx
> > > > set fs.s3.awsSecretAccessKey xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > > > --load the sample input file
> > > > data = load 's3://steamdata/nysedata/NYSE_daily.txt' as
> > > > (exchange:chararray, symbol:chararray, date:chararray, open:float,
> > > > high:float, low:float, close:float, volume:int, adj_close:float);
> > > > --group data by symbols
> > > > symbolgrp = group data by symbol;
> > > > --count data in every group
> > > > symcount = foreach symbolgrp generate group,COUNT(data);
> > > > --order the counted list by count
> > > > symcountordered = order symcount by $1;
> > > > store symcountordered into 's3://steamdata/nyseoutput/daily';
> > > >
> > > > *Error:*
> > > >
> > > > Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118:
> > > > Input path does not exist: s3://steamdata/nysedata/NYSE_daily.txt
> > > >
> > > > Input(s):
> > > > Failed to read data from "s3://steamdata/nysedata/NYSE_daily.txt"
> > > >
> > > > Please help me, what am I doing wrong? I can assure you that the input
> > > > path/file exists on s3 and the AWS access key and secret key entered are
> > > > correct.
> > > >
> > > > Thanking You,
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Ouch Whisper
> > > > 010101010101
> > >
> > >
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
>

--
Regards,
Ouch Whisper
010101010101