|
Cardon, Tejay E
2012-12-12, 18:02
Radim Kolar
2012-12-13, 03:00
Aloke Ghoshal
2012-12-13, 07:47
Robert Evans
2012-12-13, 14:56
Radim Kolar
2012-12-14, 01:03
|
-
One output file per nodeCardon, Tejay E 2012-12-12, 18:02
First, I hope I'm posting this to the right list. I wasn't sure if developer questions belonged here or on the user list.
Second, thanks for your thoughts. So I have a situation in which I'm building an index across many files. I don't want to send ALL the data across the wire by using reducers, so I'd like to use a map only job. However, I don't want one file per mapper, I'd like to consolidate them to only one file per node. Effectively, I'd like to have the output of the combiner go to file, but I know I can't trust combiner to always run on all outputs for the map. Is this possible? Perhaps some crafty partitioner that somehow sends all records to a reducer on the local node?? (I don't see this working) Thanks, Tejay [cid:[EMAIL PROTECTED]8760] Follow me on Eureka<https://eureka.isgs.lmco.com/#people/cardonte> and Brainstorm<http://brainstorm.isgs.lmco.com/Person.aspx?id=1200>
-
Re: One output file per nodeRadim Kolar 2012-12-13, 03:00
you need custom outputcomitter
-
Re: One output file per nodeAloke Ghoshal 2012-12-13, 07:47
Hi Tejay,
Building a consolidated index file for all your source files (for terms within the source files) may not be doable this way. On the other hand, building one index file per node is doable if you run a Reducer per Node & use a Partitioner. - Run one Reducer per node - Let Mapper output carry *NodeHostName:Term* as the key - Use a Partitioner based on the NodeHostName portion of the key (KeyFieldBasedPartitioner) & a GroupingComparator based on the Term portion Regards, Aloke On Wed, Dec 12, 2012 at 11:32 PM, Cardon, Tejay E <[EMAIL PROTECTED]>wrote: > First, I hope I’m posting this to the right list. I wasn’t sure if > developer questions belonged here or on the user list.**** > > Second, thanks for your thoughts.**** > > ** ** > > So I have a situation in which I’m building an index across many files. I > don’t want to send ALL the data across the wire by using reducers, so I’d > like to use a map only job. However, I don’t want one file per mapper, > I’d like to consolidate them to only one file per node. Effectively, I’d > like to have the output of the combiner go to file, but I know I can’t > trust combiner to always run on all outputs for the map.**** > > ** ** > > Is this possible? Perhaps some crafty partitioner that somehow sends all > records to a reducer on the local node?? (I don’t see this working)**** > > > Thanks, > Tejay**** > > ** ** > > ** ** > > > ** ** > > Follow me on Eureka <https://eureka.isgs.lmco.com/#people/cardonte> and > Brainstorm <http://brainstorm.isgs.lmco.com/Person.aspx?id=1200>**** > > ** ** > > ** ** > > ** ** >
-
Re: One output file per nodeRobert Evans 2012-12-13, 14:56
Tejay,
The way the scheduler works you are not guaranteed to get one reducer per node. Reducers are not scheduled based off of locality of any kind, and even if they were the scheduler typically treats rack local the same as node local. The partitioner interface only allows you to say what numeric partition an entry should go to, nothing else. There is no way to map that numeric partition to a particular machine. You could try to play games but they would be very difficult to get right, especially for the corner cases where a task can fail and may be rerun. If your partitioner is not exactly deterministic, you could lose some data, and double count other data in the case of a failure. Why don't you want to send all of the data over the wire? When you write it out to HDFS it will all be sent over the wire. How do you plan on using these indexes after they are generated? Do you plan to read from all of the indexes in parallel to search for a single entry or do you want to merge them together again before actually using them? You could do your original proposal by simulating the combiner within the map itself. If your data is small enough you could aggregate the data within the mapper and then only output the aggregate when all the entries have been processed. If it is too big to fit into memory you could look at having a disk backed data structure with in memory caching, or even simulate Map/Reduce itself and write all of the data out to a local file, sort the data and read it back in already partitioned. --Bobby On 12/13/12 1:47 AM, "Aloke Ghoshal" <[EMAIL PROTECTED]> wrote: >Hi Tejay, > >Building a consolidated index file for all your source files (for terms >within the source files) may not be doable this way. On the other hand, >building one index file per node is doable if you run a Reducer per Node & >use a Partitioner. > >- Run one Reducer per node >- Let Mapper output carry *NodeHostName:Term* as the key >- Use a Partitioner based on the NodeHostName portion of the key >(KeyFieldBasedPartitioner) & a GroupingComparator based on the Term >portion > >Regards, >Aloke > >On Wed, Dec 12, 2012 at 11:32 PM, Cardon, Tejay E ><[EMAIL PROTECTED]>wrote: > >> First, I hope I¹m posting this to the right list. I wasn¹t sure if >> developer questions belonged here or on the user list.**** >> >> Second, thanks for your thoughts.**** >> >> ** ** >> >> So I have a situation in which I¹m building an index across many files. >> I >> don¹t want to send ALL the data across the wire by using reducers, so >>I¹d >> like to use a map only job. However, I don¹t want one file per mapper, >> I¹d like to consolidate them to only one file per node. Effectively, >>I¹d >> like to have the output of the combiner go to file, but I know I can¹t >> trust combiner to always run on all outputs for the map.**** >> >> ** ** >> >> Is this possible? Perhaps some crafty partitioner that somehow sends >>all >> records to a reducer on the local node?? (I don¹t see this working)**** >> >> >> Thanks, >> Tejay**** >> >> ** ** >> >> ** ** >> >> >> ** ** >> >> Follow me on Eureka <https://eureka.isgs.lmco.com/#people/cardonte> and >> Brainstorm <http://brainstorm.isgs.lmco.com/Person.aspx?id=1200>**** >> >> ** ** >> >> ** ** >> >> ** ** >>
-
Re: One output file per nodeRadim Kolar 2012-12-14, 01:03
if you have strong data locality demands, then try
http://peregrine_mapreduce.bitbucket.org/ Its 2x faster then hadoop for multipass job types. It has also very fast node recovery. I plan to do this for hdfs, concept is similar to "virtual nodes". Its not hadoop or HDFS compatible and it has no ecosystem. I am not sure if it is still under development, no commits in last months. https://bitbucket.org/burtonator/peregrine/commits |