DoomUs 2011-04-01, 04:48
It would help to get a good book. There are several.
For your program, there are several things that will trip you up:
a) Lots of little files are going to be slow. You want input files that are >100 MB
each if you want speed.
b) That file format is a bit cheesy, since it is hard to tell URLs from text
if you concatenate lots of files. Better to use a format like protobufs,
Avro, or even SequenceFiles to separate the key and the data unambiguously.
c) I suspect that what you are asking for is to run a mapper so that each
invocation of map gets the URL as key and the text as data. That map
invocation can then tokenize the data and emit records with the URL as key
and each word as data. That isn't much use since the reducer will get the
URL and all the words that were emitted for that URL. If each URL appears
exactly once, then the input already had that. Perhaps you mean to emit the
word as key and URL as data. Then the reducer will see the word as key and
an iterator over all the URLs that mentioned the word.
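To make (c) concrete, here is a minimal sketch in plain Java with no Hadoop dependencies (the class name, URLs, and text are made up for illustration): map() emits (word, URL) pairs, and shuffleAndReduce() simulates what the framework does between the phases, grouping by word so each "reduce" sees a word with all the URLs that mentioned it. In a real job these would be a Mapper and Reducer running against input splits, not in-memory lists.

```java
import java.util.*;

// Dependency-free sketch of the inverted-index map/reduce flow.
public class InvertedIndexSketch {

    // map(): input key = URL, input value = page text.
    // Emits one (word, url) pair per token, as suggested above.
    static List<Map.Entry<String, String>> map(String url, String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, url));
            }
        }
        return out;
    }

    // Simulated shuffle + reduce: group emitted pairs by word, collecting
    // the URLs for each word (what the reducer's iterator would hand you).
    static Map<String, Set<String>> shuffleAndReduce(
            List<Map.Entry<String, String>> emitted) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : emitted) {
            index.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                 .add(e.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        emitted.addAll(map("http://a.example", "hadoop makes mapreduce easy"));
        emitted.addAll(map("http://b.example", "hadoop scales mapreduce jobs"));
        Map<String, Set<String>> index = shuffleAndReduce(emitted);
        // "hadoop" was seen on both pages, "easy" on only one.
        System.out.println("hadoop -> " + index.get("hadoop"));
        System.out.println("easy   -> " + index.get("easy"));
    }
}
```

The key point the sketch shows: the grouping happens for free in Hadoop's shuffle, so the reducer only has to iterate and write out Word -> {URLs}.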
On Thu, Mar 31, 2011 at 9:48 PM, DoomUs <[EMAIL PROTECTED]> wrote:
> I'm just starting out using Hadoop. I've looked through the java examples,
> and have an idea about what's going on, but don't really get it.
> I'd like to write a program that takes a directory of files. Each file
> contains a URL to a website on the first line, and the text of that
> website on the second line.
> The mapper should map each word in the text to that URL, so every word
> found on the website would map to the URL.
> The reducer then, would collect all of the URLs that are mapped to via a
> given word.
> Each Word->URL is then written to a file.
> So, it's "simple" as a program designed to run on a single system, but I
> want to be able to distribute the computation and whatnot using Hadoop.
> I'm extremely new to Hadoop. I'm not even sure how to ask all of the
> questions I'd like answers for; I have zero experience in MapReduce and
> little experience with functional programming. Any programming tips, or
> corrections if I have my "Mapper" or "Reducer" defined incorrectly, etc.,
> would be greatly appreciated.
> How do I read (and write) files from hdfs?
> Once I've read them, How do I distribute the files to be mapped?
> I know I need a class to implement the mapper, and one to implement the
> reducer, but how does the class have a return type to output the map?
> Thanks a lot for your help.
> Sent from the Hadoop core-user mailing list archive at Nabble.com.