-
job taking input file, which "is being" written by its preceding job's map phase
Vamshi Krishna 2012-02-08, 06:58
Hi all,

I have an important question about MapReduce. I have 2 Hadoop MapReduce jobs. Job1 has only a mapper and no reducer. Job1 has started, and in its map() it is writing to a "file1" using context.write(Arg1, Arg2). I want to start job2 immediately after job1 starts. Job2 should take "file1" (the output still being written by job1's map phase) as input, process it in its own map/reduce phases, and keep picking up the data newly written to "file1" until job1 finishes. What should I do?

How can I do that? Can anybody please help?

--
Regards
Vamshi Krishna
-
Re: job taking input file, which "is being" written by its preceding job's map phase
Harsh J 2012-02-08, 07:17
Vamshi,

Is it not possible to express your M-M-R phase chain as a simple, single M-R?

Perhaps look at the ChainMapper class @ http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/ChainMapper.html

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
-
Re: job taking input file, which "is being" written by its preceding job's map phase
Vamshi Krishna 2012-02-09, 06:45
Thank you, Harsh, for your reply.

As I understand it, what ChainMapper does is this: only once the first mapper finishes does the second map start, using the file written by the first mapper. It is just like a chain. What I want is more like pipelining, i.e. after the first map starts and before it finishes, the second map has to start and keep on reading from the same file that is being written by the first map. It is almost a producer-consumer scenario, where the first map writes into the file and the second map keeps on reading the same file, so that a pipelining effect is seen between the two maps.

I hope you got what I am trying to say. Please help.
-
Re: job taking input file, which "is being" written by its preceding job's map phase
Harsh J 2012-02-09, 07:15
Vamshi,

What problem exactly are you trying to solve by attempting this? If you are only interested in records being streamed from one mapper into another, why can't they be chained together? Remember that map-only jobs do not sort their output, so I still see no benefit in consuming record-by-record from a whole new task when it could be done from the very same one.

Btw, ChainMapper is an API abstraction to run several mapper implementations in sequence (a chain) for each input record, transforming it along the way (helpful if you have several utility mappers and want to build composites). It does not touch disk.
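To make the per-record chaining described above concrete, here is a minimal plain-Java sketch. It does not use the Hadoop API; the class and method names (ChainSketch, applyChain, runChain) and the two example stages are made up for illustration. It only shows the idea that each record passes through every stage in memory before the next record is read, with no intermediate file between stages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only: plain Java, not Hadoop's ChainMapper.
final class ChainSketch {

    // Passes one record through every stage in order, entirely in memory.
    static String applyChain(List<Function<String, String>> stages, String record) {
        String current = record;
        for (Function<String, String> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Processes the input record by record; the second stage sees each record
    // as soon as the first stage has produced it, not after the whole input.
    static List<String> runChain(List<Function<String, String>> stages, List<String> input) {
        List<String> output = new ArrayList<>();
        for (String record : input) {
            output.add(applyChain(stages, record));
        }
        return output;
    }

    public static void main(String[] args) {
        List<Function<String, String>> stages = new ArrayList<>();
        stages.add(String::trim);          // stands in for the "first mapper"
        stages.add(String::toUpperCase);   // stands in for the "second mapper"
        System.out.println(runChain(stages, Arrays.asList(" a ", "b")));
    }
}
```

Because the stages run back to back for each record, the "pipelining" effect asked about earlier in the thread falls out of the chain itself, which is why a second job reading a half-written file is not needed for this case.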
-
Re: job taking input file, which "is being" written by its preceding job's map phase
Wellington Chevreuil 2012-02-09, 14:19
Hi Harsh,

I noticed that this ChainMapper belongs to the old API package (org.apache.hadoop.mapred instead of org.apache.hadoop.mapreduce). Although it takes generic Class types as its method arguments, is this class able to work with Mappers from the new API package (org.apache.hadoop.mapreduce)?

Thanks,
Wellington.
-
Re: job taking input file, which "is being" written by its preceding job's map phase
Harsh J 2012-02-09, 15:26
The new API ChainMapper/ChainReducer came in with the 0.21 release and are available in 0.22 and 0.23 at present, but not in the 0.20.x/1.x releases.

You can grab a patch from https://issues.apache.org/jira/browse/MAPREDUCE-372 though. Or, if you need a future Apache stable release to carry it, perhaps reopen https://issues.apache.org/jira/browse/MAPREDUCE-3673 with a backport patch, as https://issues.apache.org/jira/browse/MAPREDUCE-3607 didn't cover this one (it was not demanded/provided). I'll be happy to review and commit it for you.
-
Re: job taking input file, which "is being" written by its preceding job's map phase
Vamshi Krishna 2012-02-11, 06:51
Hi Harsh,

I am trying to find all the row keys that are present in two tables. If userid is the row key for two different tables, I want to find all the row keys present in both tables. For that I need to read from two tables in a MapReduce job, i.e. I want to take multiple tables as input to a MapReduce job so that I can check for the intersection. How can I do that?

One more doubt: if two jobs have Htable=new HTable(config, "HT"); (HT is the HBase table I have created) in their respective maps, and these two jobs read from other tables T1 and T2 and put into the HT table, will there be any problem? Can I do that? It is a scenario where the data of two tables is being put into a single table by 2 different jobs.

I am getting the following errors and the jobs are killed automatically.

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:569)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
    ... 3 more
Caused by: org.apache.hadoop.hbase.TableNotFoundException: HsetSIintermediate
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:725)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:594)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:559)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:173)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:147)
    at Setintersection.SetIntersectionMRFINAL$setIntersectionMapper1.<init>*(SetIntersectionMRFINAL.java:49)*
    ... 8 more

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:569)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
    ... 3 more
Caused by: org.apache.hadoop.hbase.TableNotFoundException: HsetSIintermediate
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:725)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:594)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:559)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:173)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:147)
    at Setintersection.SetIntersectionMRFINAL$setIntersectionMapper2.<init>*(SetIntersectionMRFINAL.java:83)*
    ... 8 more

The frames I bolded correspond to the line Htable=new HTable(config, "HT"); in both jobs.

Please help.
-
Re: job taking input file, which "is being" written by its preceding job's map phase
Harsh J 2012-02-11, 13:54
Vamshi,

On Sat, Feb 11, 2012 at 12:21 PM, Vamshi Krishna wrote:
> Hi Harsh, I am trying to find all the row keys that are present in two
> tables. If userid is the row key for two different tables, I want to find
> all the row keys present in both tables. For that I need to read from two
> tables in a MapReduce job, i.e. I want to take multiple tables as input to
> a MapReduce job so that I can check for the intersection. How can I do
> that?

You should probably revisit your schema to eke out a better design if you've come to a point where joins are required. Two tables carrying the same row keys seems like doing it wrong (it depends). Try going over the schema design portions of "HBase: The Definitive Guide"; it's a good read.

> One more doubt: if two jobs have Htable=new HTable(config, "HT");
> (HT is the HBase table I have created) in their respective maps, and these
> two jobs read from other tables T1 and T2 and put into the HT table, will
> there be any problem?

No, there shouldn't be a problem, but the process may be slow (you're doing a join of sorts).

> Caused by: org.apache.hadoop.hbase.TableNotFoundException:
> HsetSIintermediate

Regarding your stack trace: apparently one of the tables you requested does not exist yet. ^^
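For readers who still need the intersection after reconsidering the schema, the "join of sorts" mentioned above can be sketched as a reduce-side join. This is not something proposed in the thread and it does not use the HBase or Hadoop APIs; it is a plain-Java illustration of the logic only. The class name RowKeyIntersectionSketch and the sample keys are made up, and T1/T2 simply reuse the table names from the question. Each row key is tagged with the table it came from (the "map" step), keys are grouped (the "shuffle"), and a key is emitted only if both tags were seen (the "reduce" step).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: simulates the join logic in memory.
final class RowKeyIntersectionSketch {

    static List<String> intersect(List<String> keysT1, List<String> keysT2) {
        // "Map" + "shuffle": group the source-table tags under each row key.
        // A TreeMap keeps keys sorted, much like reducer input is sorted by key.
        Map<String, Set<String>> sourcesByKey = new TreeMap<>();
        for (String key : keysT1) {
            sourcesByKey.computeIfAbsent(key, k -> new HashSet<>()).add("T1");
        }
        for (String key : keysT2) {
            sourcesByKey.computeIfAbsent(key, k -> new HashSet<>()).add("T2");
        }

        // "Reduce": keep only the row keys that were seen in both tables.
        List<String> common = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : sourcesByKey.entrySet()) {
            if (entry.getValue().size() == 2) {
                common.add(entry.getKey());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<String> t1 = Arrays.asList("u3", "u1", "u2");
        List<String> t2 = Arrays.asList("u2", "u4", "u3");
        System.out.println(intersect(t1, t2));
    }
}
```

In a real job the grouping would be done by the framework's shuffle rather than an in-memory map, so the sketch only conveys the tagging and filtering logic, not how to wire two tables in as job input.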