Re: How to split a sequence file
Ajay Srivastava 2012-09-12, 05:35
Hi Jason,

I am wondering about the use case for distributing records to mappers on the basis of key. If possible, could you please share your scenario? Is it a map-only job? Why not distribute the records using a partitioner and do the processing in the reducers?

Regards,
Ajay Srivastava

On 12-Sep-2012, at 8:45 AM, Jason Yang wrote:
> Hi,
>
> I have a sequence file written by SequenceFileOutputFormat with key/value type of <Text, BytesWritable>, like below:
>
> Text       BytesWritable
> -------------------------------------------------------------
> id_A_01    7F2B3C687F2B3C687F2B3C68
> id_A_02    2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
> id_A_03    5F2B3C68D77F2B3C687F2B3A
> ...
> id_B_01    1AB23C68D73C68D76AB23C68D73C68D7
> id_B_02    5AB23C68D73C68D76AB68D76A1
> id_B_03    F2B23C68D7B23C68D7B23C68D7
>
> If I want all the records with the same key prefix to be processed by a same mapper, say records with key id_A_XX are processed by a mapper and records with key id_B_XX are processed by another mapper, what should I do?
>
> Should I implement our own InputFormat inherited from SequenceFileInputFormat?
>
> Any help would be appreciated.
> --
> YANG, Lin
Re: How to split a sequence file
Jason Yang 2012-09-12, 05:57
hey guys,
Thanks for all your suggestions.
To wrap up, there are two ways to achieve this:
1. Use multiple sequence files, then write a WholeFileInputFormat that treats each file as a single split by overriding isSplitable() to return false.
2. Distribute the records using a partitioner and do the processing in the reducers; however, the shuffle would incur some network and I/O cost.

BTW, since the computation could be parallelized in either the Mapper or the Reducer, what's the difference between them?
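
For the partitioner route, the core of it is just a function from key to partition number that depends only on the key prefix. Below is a rough sketch of that logic in plain Java, without the Hadoop classes so it can be run on its own. The class name PrefixPartitioning and the rule "prefix = everything before the last underscore" are my assumptions based on the sample keys (id_A_01, id_B_02), not anything from the job itself. In a real job the same arithmetic would go inside getPartition(key, value, numPartitions) of a custom Partitioner<Text, BytesWritable>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: mirrors what a custom Hadoop Partitioner's
// getPartition() could compute, but uses plain Strings so it runs standalone.
final class PrefixPartitioning {

    // Assumption: the grouping prefix is everything before the last '_',
    // e.g. "id_A_01" -> "id_A". Keys without '_' are their own prefix.
    static String prefixOf(String key) {
        int cut = key.lastIndexOf('_');
        return cut < 0 ? key : key.substring(0, cut);
    }

    // Hash only the prefix, so all keys sharing a prefix get the same
    // partition. The mask clears the sign bit to keep the result non-negative.
    static int partitionFor(String key, int numPartitions) {
        return (prefixOf(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {
            "id_A_01", "id_A_02", "id_A_03",
            "id_B_01", "id_B_02", "id_B_03"
        };
        int numPartitions = 4; // stands in for the number of reduce tasks

        // Group the sample keys by the partition they would be sent to.
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>())
                .add(key);
        }
        byPartition.forEach((p, ks) -> System.out.println("partition " + p + ": " + ks));
    }
}
```

Note that two different prefixes can still hash to the same partition, so a reducer may see more than one group; what this guarantees is only that a single prefix group is never spread over several reducers.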
--
YANG, Lin