Re: WholeFileInputFormat format
Mohammad Tariq 2012-07-11, 22:31
Hello Harsh,
Does Hadoop 0.20.205.0 (new API) have Avro support?

Regards,
Mohammad Tariq

On Wed, Jul 11, 2012 at 1:57 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> Hello Harsh,
>
> I am sorry to be a pest of questions. Actually I am kinda
> stuck. I have to write my MapReduce job such that the comparisons
> between each output from both the mappers must be in order. I mean I
> have to read one line from the file and extract the desired fields
> from the line in one mapper, and in the second mapper I have to read
> the values from the Hbase table and compare those values with the fields
> read in the first mapper. I am wondering how to achieve that since
> the reducer phase will not start until all the mappers are done.
> Maybe a bit of elaboration of my use case would be helpful
> in understanding the problem in a better fashion. I have a file that
> contains several fields. I have created columns for these fields in my
> Hbase table. After that I am extracting the value of each field from the
> file and storing it in the corresponding Hbase column. Now, I have a
> 'support file' for the same file whose values are already stored in
> Hbase, but with a totally different format. But the order of fields in
> the original file and the order of lines (containing corresponding
> fields) in the support file is exactly the same. So I am trying to read
> one line from the support file, extract the field of interest in one
> mapper and read the same field from the Hbase table in the second mapper
> and send these values to the reducer where the comparison will be made
> to conclude the test.
> Please help me out by providing your able guidance, as being
> a novice I am not able to tackle the situation. (Pardon my
> ignorance)
>
> Many thanks.
>
> Regards,
> Mohammad Tariq
>
>
> On Tue, Jul 10, 2012 at 8:34 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> I don't see why you'd have to use the WholeFileInputFormat for such a
>> task. Your task is very similar to joins, and you can see the section
>> "General reducer-side join" for what your overall logic should look
>> like, under Ricky's
>> http://horicky.blogspot.in/2010/08/designing-algorithmis-for-map-reduce.html
>> article.
>>
>> On Tue, Jul 10, 2012 at 7:46 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>> Hello Harsh,
>>>
>>> Thank you so much for the quick response. Actually I have a
>>> use case wherein I have to compare values that are coming from 2
>>> mappers to one reducer. For that I am planning to use the MultipleInputs
>>> class. In one mapper I have a text file (these files may contain
>>> 1,00,000 to 2,00,000 lines), and I have to extract bytes from 2-13,
>>> 20-25, 32-38 and so on from each line of this file. In the second
>>> mapper I have to read values from an Hbase table. The columns of this
>>> table correspond to the fields which I am reading from the text file
>>> in the first mapper.
>>> In the reducer I have to compare the results coming from both
>>> the mappers and generate the final result. Need your guidance. Many
>>> thanks.
>>>
>>> Regards,
>>> Mohammad Tariq
>>>
>>>
>>> On Tue, Jul 10, 2012 at 6:55 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>> It depends on what you need. If your file is not splittable, or if you
>>>> need to read the whole file from a single mapper itself (i.e. you do
>>>> not _want_ it to be split), then use WholeFileInputFormat. Otherwise,
>>>> you get more parallelism with regular splitting.
>>>>
>>>> On Tue, Jul 10, 2012 at 6:31 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>> Hello list,
>>>>>
>>>>> What could be the approximate maximum size of the files that
>>>>> can be handled using WholeFileInputFormat? I mean, if the file
>>>>> is very big, then is it feasible to use WholeFileInputFormat as the
>>>>> entire load will go to one mapper? Many thanks.
>>>>>
>>>>> Regards,
>>>>> Mohammad Tariq
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>
>>
>>
>> --
>> Harsh J
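
For readers following the thread, the reduce-side join that Harsh points to can be sketched in plain Java, without any Hadoop or HBase classes. This is only an illustration of the logic, not code from the thread: the class name, the tags "F" and "H", and the helper names are all hypothetical. It assumes the "bytes 2-13" range is 1-based and inclusive, that the data is single-byte text, and that both sides can be keyed by the same record number. That last assumption matters in a real job, because a mapper reading through TextInputFormat receives byte offsets as keys rather than line numbers, so a shared join key would have to come from the data itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of a reduce-side join; no Hadoop classes are used. */
final class ReduceSideJoinSketch {

    // Hypothetical tags marking which "mapper" produced a value.
    static final String FILE_TAG = "F";
    static final String HBASE_TAG = "H";

    /** Returns the 1-based, inclusive character range [from, to] of a line. */
    static String field(String line, int from, int to) {
        return line.substring(from - 1, to);
    }

    /** Groups a value under its key, standing in for the shuffle phase. */
    static void emit(Map<Long, List<String>> shuffle, long key, String value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** "Mapper" for the support file: emits the tagged field at positions 2-13. */
    static void mapFileLine(long recordNo, String line, Map<Long, List<String>> shuffle) {
        emit(shuffle, recordNo, FILE_TAG + ":" + field(line, 2, 13));
    }

    /** "Mapper" for the table side: emits the tagged stored value. */
    static void mapTableRow(long recordNo, String stored, Map<Long, List<String>> shuffle) {
        emit(shuffle, recordNo, HBASE_TAG + ":" + stored);
    }

    /** "Reducer": receives every value for one key and compares the two sides. */
    static boolean reduce(List<String> values) {
        String fromFile = null;
        String fromTable = null;
        for (String v : values) {
            if (v.startsWith(FILE_TAG + ":")) {
                fromFile = v.substring(FILE_TAG.length() + 1);
            } else if (v.startsWith(HBASE_TAG + ":")) {
                fromTable = v.substring(HBASE_TAG.length() + 1);
            }
        }
        // A record matches only if both sides are present and equal.
        return fromFile != null && fromFile.equals(fromTable);
    }

    /** Runs both "mappers", then reduces each key; the result maps record number to match. */
    static Map<Long, Boolean> run(List<String> lines, List<String> tableValues) {
        Map<Long, List<String>> shuffle = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            mapFileLine(i, lines.get(i), shuffle);
        }
        for (int i = 0; i < tableValues.size(); i++) {
            mapTableRow(i, tableValues.get(i), shuffle);
        }
        Map<Long, Boolean> result = new TreeMap<>();
        for (Map.Entry<Long, List<String>> e : shuffle.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("XABCDEFGHIJKLYYY", "X000000000001ZZZ");
        List<String> table = Arrays.asList("ABCDEFGHIJKL", "000000000002");
        System.out.println(run(lines, table));
    }
}
```

Because values are grouped by key before the comparison, the order in which the two sides finish does not matter: the comparison for each record happens once, in the reducer, after both values have arrived. This is why the reducer starting only after all mappers finish is not an obstacle here.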