MapReduce, mail # user - unsort algorithmus in map/reduce


unsort algorithmus in map/reduce
Radim Kolar 2011-10-25, 10:45
Hi, I am having a problem implementing an unsort for a crawler in map/reduce.

I have a list of URLs waiting to be fetched; they need to be reordered for
maximum distance between URLs from the same domain.

The idea is to do
  map URL -> domain, URL

  test.com, http://www.test.com/page1.html
  test.com, http://www.test.com/page2.html
  test.com, http://www.test.com/page3.html
  test2.com, http://www.test2.com/page1.html
  test2.com, http://www.test2.com/page2.html
  test2.com, http://www.test2.com/page3.html

  reduce test.com, <list> -> priority, URL

  10, http://www.test.com/page1.html
   9, http://www.test.com/page2.html
   8, http://www.test.com/page3.html
  10, http://www.test2.com/page1.html
   9, http://www.test2.com/page2.html
   8, http://www.test2.com/page3.html
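
To make the two steps above concrete, here is a minimal local sketch in plain Python, not Hadoop code. The names map_url and reduce_domain, the starting priority of 10, and stripping a leading "www." to get the domain are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_url(url):
    """Map step: emit (domain, url).

    Assumption: the domain is the host name without a leading "www.".
    """
    host = urlparse(url).hostname or ""
    domain = host[4:] if host.startswith("www.") else host
    return domain, url


def reduce_domain(urls, top_priority=10):
    """Reduce step: the n-th URL of one domain gets priority top_priority - n."""
    return [(top_priority - i, url) for i, url in enumerate(urls)]


urls = [
    "http://www.test.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page3.html",
    "http://www.test2.com/page1.html",
    "http://www.test2.com/page2.html",
    "http://www.test2.com/page3.html",
]

# Simulate the shuffle: group the mapped values by key (domain).
groups = defaultdict(list)
for url in urls:
    domain, value = map_url(url)
    groups[domain].append(value)

# Run the reduce step once per domain and collect (priority, url) pairs.
prioritized = []
for domain in sorted(groups):
    prioritized.extend(reduce_domain(groups[domain]))
```

Every domain restarts at the same top priority, which is what later lets a sort on priority interleave the domains.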

Now I need to order the output by key (highest priority first):

  10, http://www.test.com/page1.html
  10, http://www.test2.com/page1.html
   9, http://www.test.com/page2.html
   9, http://www.test2.com/page2.html
   8, http://www.test.com/page3.html
   8, http://www.test2.com/page3.html

and write the list of URLs in this order to output files: the first 50k URLs
to file1, the next 50k to file2, and so on.
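
The end result I am after can be sketched locally in plain Python (again not Hadoop code). The names sort_by_priority and split_into_chunks, the tie-break on URL, and the chunk size of 4 standing in for 50k are assumptions made for illustration.

```python
from itertools import islice


def sort_by_priority(pairs):
    """Order (priority, url) pairs by priority, highest first.

    Assumption: ties are broken by URL so the order is deterministic.
    """
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))


def split_into_chunks(items, chunk_size):
    """Yield consecutive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


pairs = [
    (10, "http://www.test.com/page1.html"),
    (9, "http://www.test.com/page2.html"),
    (8, "http://www.test.com/page3.html"),
    (10, "http://www.test2.com/page1.html"),
    (9, "http://www.test2.com/page2.html"),
    (8, "http://www.test2.com/page3.html"),
]

ordered = sort_by_priority(pairs)
ordered_urls = [url for _, url in ordered]

# A chunk size of 4 keeps the example small; the real job would use 50000.
chunks = list(split_into_chunks(ordered_urls, chunk_size=4))

# Name the chunks file1, file2, ... without touching the file system.
files = {f"file{i}": chunk for i, chunk in enumerate(chunks, start=1)}
```

This only works when the whole list fits on one machine, which is the part I do not know how to carry over to map/reduce.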

Can you give me an idea of how to sort using MapReduce, and how to process
the sorted data and split it into files?