Query regarding Hadoop version 0.20.203
Piyush Kansal 2012-03-05, 09:46
Hi,

I am quite new to Hadoop and to Java, and I have two questions.

*Ques 1:*
=====
I have an HDFS directory which contains the output files of a reducer. I want to read all the part-r-* files present in this directory.

I have already tried the following options, but with no luck:
- FileSystem.listStatus
- FileSystem.getContentSummary
- FileSystem.globStatus
- FileUtil.stat2Paths
- I can't use FileUtil.listFiles as it is not present in 0.20.203

Can you please suggest how I can do it?

*Ques 2:*
=====
Since MultipleOutputs/MultipleOutputFormat is not there in 0.20.203, can we achieve the functionality provided by these classes in some other way?

--
Regards,
Piyush Kansal
-
Re: Query regarding Hadoop version 0.20.203
Harsh J 2012-03-05, 10:58
Piyush,

On Mon, Mar 5, 2012 at 3:16 PM, Piyush Kansal <[EMAIL PROTECTED]> wrote:
> Ques 1:
> =====
> I have a HDFS directory which contains the o/p files of reducer. I want to
> read all the part-r-* files present in this directory.
>
> I have already tried following options as follows but no luck:
> - FileSystem.listStatus
>
> Can you please suggest how can I do it?

Iterate over the FileStatus objects returned by listStatus (they'll be in the right order), and read them one by one. Does that not work for you?

> Ques 2:
> =====
> Since MultipleOutputs/MultipleOutputFormat is not there in 0.20.203, so can
> we achieve the same functionality provided by these classes.

Upgrade either to 1.0.1 to get MultipleOutputs for the new API (it was only recently released with that backport from 0.21), or to any alternative distribution that offers it back-ported, or perhaps switch back to the stable (old) API, which is still the recommended one for MR.

Alternatively, read http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

--
Harsh J
-
Re: Query regarding Hadoop version 0.20.203
Piyush Kansal 2012-03-05, 12:49
Thanks Harsh. It worked.
--
Regards,
Piyush Kansal
-
Re: Query regarding Hadoop version 0.20.203
Piyush Kansal 2012-03-05, 14:03
Harsh,

When I try to call readFields as follows:

    FileStatus origFStatus[] = ipFs.listStatus( ip );
    DataInput dataIp;
    origFStatus[ 0 ].readFields( dataIp );

I get the compilation error "variable dataIp might not have been initialized".

How do we initialize it? Is there a direct method by which I can read the fields easily?

--
Regards,
Piyush Kansal
-
Re: Query regarding Hadoop version 0.20.203
Joey Echeverria 2012-03-05, 14:08
You don't need to call readFields(); the FileStatus objects are already initialized. You should just be able to call the various getters to get the fields that you're interested in.

-Joey

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
-
Re: Query regarding Hadoop version 0.20.203
Harsh J 2012-03-05, 17:37
What Joey said.

What you'll want is:

    // fs is your FileSystem; somePath is the job output directory.
    FileStatus[] fileStatuses = fs.listStatus(somePath);
    for (FileStatus fstat : fileStatuses) {
      Path file = fstat.getPath();
      // Do other read/etc. logic here with Path and FileSystem as you want.
    }

Also read the FileStatus API at http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/FileStatus.html for more information.

--
Harsh J
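
A possible follow-on to the loop above, as a minimal sketch: the listing of a job output directory may contain entries other than reducer output (for example a `_logs` directory), so the names can be filtered before any file is read. The class and method names (`PartFileNames`, `isReducerPart`, `selectParts`) are hypothetical, and the naming pattern "part-r-" plus digits is an assumption based on the file names mentioned in this thread. The code works on plain strings so that it runs without Hadoop on the classpath; inside the loop the name would come from `fstat.getPath().getName()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: pick reducer part files out of a directory listing by name.
class PartFileNames {
    // Assumed naming: "part-r-" followed by digits, e.g. part-r-00000.
    private static final Pattern PART = Pattern.compile("part-r-\\d+");

    // True only if the whole name matches the assumed part-file pattern.
    static boolean isReducerPart(String name) {
        return PART.matcher(name).matches();
    }

    // Returns the matching names sorted lexicographically, which equals
    // numeric order as long as the numbers are zero-padded to equal width.
    static List<String> selectParts(List<String> names) {
        List<String> parts = new ArrayList<>();
        for (String name : names) {
            if (isReducerPart(name)) {
                parts.add(name);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) {
        // A made-up listing, standing in for the names from listStatus.
        List<String> listing =
            Arrays.asList("_logs", "part-r-00001", "part-r-00000");
        System.out.println(selectParts(listing));
    }
}
```

Against HDFS, each selected file could then be opened with `fs.open(file)` and read through a `BufferedReader`. The explicit sort is only a defensive step; it should not change anything if the listing already arrives in order.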
-
Query regarding Hadoop version 0.20.203
Piyush Kansal 2012-03-14, 21:44
Hi,

Since MultipleOutputs is not supported in version 0.20.203, while using the Partitioner class, key-value pairs belonging to partition 1 may end up in file part-r-00000 or part-r-00002. To handle this, I am currently *prefixing all the records* in a file with a "*partition number*".

So, let's say 4 files get created on HDFS as follows:
part-r-00000: let's say it contains all records for partition 2
part-r-00001: let's say it contains all records for partition 1
part-r-00002: let's say it contains all records for partition 3
part-r-00003: let's say it contains all records for partition 0

Now, I am creating a new command to append all these files into a single file on the local file system in "*increasing order of partition number*". While doing this, I have to remove the partition number from all the records.

I can do it by reading all the files line by line, using substring to extract the required data, and writing it to the output file. But this approach will take too much time, as this functionality is intended to be run on very large files (GBs in size).

So, can you please suggest an alternative way to implement this functionality so that it gets done in minimum time?

--
Regards,
Piyush Kansal
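
The merge described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tuned implementation: the record layout "partition number, tab, record" is an assumption (the thread does not say how the prefix is delimited), each input is assumed to hold exactly one partition, and the names `PartitionMerger`, `partitionOf`, `stripPrefix` and `merge` are hypothetical. It reads one line per input to learn its partition number, then streams the inputs in increasing partition order while dropping the prefix, so memory use stays constant. It works on plain `Reader`/`Writer` objects, so it runs without Hadoop; the readers could wrap the streams returned by `fs.open(...)`. Requires Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class PartitionMerger {
    // Assumed record layout (not from the thread): "<partition>\t<record>".
    static final char SEP = '\t';

    // An opened input plus the first line already consumed from it.
    private static final class Source {
        final BufferedReader reader;
        final String firstLine;
        Source(BufferedReader reader, String firstLine) {
            this.reader = reader;
            this.firstLine = firstLine;
        }
    }

    // Parses the partition number in front of the separator.
    static int partitionOf(String line) {
        return Integer.parseInt(line.substring(0, line.indexOf(SEP)));
    }

    // Returns the record without its partition prefix.
    static String stripPrefix(String line) {
        return line.substring(line.indexOf(SEP) + 1);
    }

    // Writes all inputs to out in increasing partition order; empty inputs
    // are skipped and two inputs with the same partition are rejected.
    static void merge(List<? extends Reader> inputs, Writer out) {
        try {
            TreeMap<Integer, Source> byPartition = new TreeMap<>();
            for (Reader in : inputs) {
                BufferedReader br = new BufferedReader(in);
                String first = br.readLine();
                if (first == null) {
                    continue;
                }
                Source old = byPartition.put(partitionOf(first), new Source(br, first));
                if (old != null) {
                    throw new IllegalStateException("partition seen in two inputs");
                }
            }
            for (Source s : byPartition.values()) {
                for (String line = s.firstLine; line != null; line = s.reader.readLine()) {
                    out.write(stripPrefix(line));
                    out.write('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Reader> parts = new ArrayList<>();
        parts.add(new StringReader("2\tgamma\n"));
        parts.add(new StringReader("0\talpha\n"));
        parts.add(new StringReader("1\tbeta\n"));
        StringWriter out = new StringWriter();
        merge(parts, out);
        System.out.print(out);
    }
}
```

This still touches every record once, which is hard to avoid as long as the prefix is stored inside each record. The only extra work for ordering is one line read per input.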
-
Re: Query regarding Hadoop version 0.20.203
Harsh J 2012-03-16, 07:36
Piyush,

On Thu, Mar 15, 2012 at 3:14 AM, Piyush Kansal <[EMAIL PROTECTED]> wrote:
> Since MultipleOutputs is not supported in version 0.20.203, so while using

Let's be clear here and avoid confusion for others. MultipleOutputs is present in 0.20.203's stable API and is perfectly supported. You are using the new, unstable API, which did not have MultipleOutputs backported from trunk until Apache Hadoop 1.0.1.

> So, can you please suggest if there can be an alternative way to implement
> this functionality so as to get it done in minimum time.

Options are:
1. Upgrade to a higher version that does have your required library.
2. Follow http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F carefully to write your own "multiple outputs" version.

I'd do (1) because it's supported by the framework, instead of having to reinvent the wheel - and I'd only gain more goodness by updating :)

--
Harsh J