-
Is it possible to write file output in Map phase once and write another file output in Reduce phase?edward choi 2010-12-10, 07:27
Hi,
I'm trying to crawl numerous news sites.
My plan is to make a file containing a list of all the news rss feed urls, and the path to save the crawled news article.
So it would be like this:

nytimes_nation, /user/hadoop/nytimes
nytimes_sports, /user/hadoop/nytimes
latimes_world, /user/hadoop/latimes
latimes_nation, /user/hadoop/latimes
...
...
...

Each mapper would get a single line and crawl the assigned url, process text, and save the result.
So this job does not need any Reducing process.

But what I'd also like to do is to create a dictionary at the same time.
This could definitely take advantage of Reduce phase. Each mapper can generate output as "KEY:term, VALUE:term_frequency"
Then Reducer can merge them all together and create a dictionary. (Of course I would be using many Reducers so the dictionary would be partitioned)

I know that I can do this by creating two separate jobs (one for crawling, the other for making dictionary), but I'd like to do this in one-pass.

So my design is:
Map phase ==> crawl news articles, process text, write the result to a file.
    II
    II pass (term, term_frequency) pair to the Reducer
    II
    V
Reduce phase ==> Merge the (term, term_frequency) pair and create a dictionary

Is this at all possible? Or is it inherently impossible due to the structure of Hadoop?
If it's possible, could anyone tell me how to do it?

Ed.
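
Editor's sketch: the data flow proposed above (each mapper emits (term, term_frequency) pairs, the reducer merges them into a dictionary) can be illustrated in plain Java without any Hadoop classes. This is only a minimal single-process sketch of the idea, not a MapReduce job; the names TermCountSketch, mapArticle and reduceCounts are hypothetical, and the tokenizer (lowercase, split on non-alphanumerics) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the proposed data flow; no Hadoop classes are used.
// All names here (TermCountSketch, mapArticle, reduceCounts) are hypothetical.
final class TermCountSketch {

    // "Map" side: count term occurrences in the text of one crawled article.
    static Map<String, Integer> mapArticle(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" side: sum the per-article counts into one dictionary.
    static Map<String, Integer> reduceCounts(List<Map<String, Integer>> perArticle) {
        Map<String, Integer> dictionary = new TreeMap<>();
        for (Map<String, Integer> counts : perArticle) {
            counts.forEach((term, n) -> dictionary.merge(term, n, Integer::sum));
        }
        return dictionary;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> mapped = new ArrayList<>();
        mapped.add(mapArticle("Senate passes budget bill"));
        mapped.add(mapArticle("Budget talks stall in Senate"));
        System.out.println(reduceCounts(mapped));
    }
}
```

In a real job the summing step would run once per key on whichever reducer owns that term, which is why the resulting dictionary ends up partitioned across reducers.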
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Harsh J 2010-12-10, 08:30
Hi,
You can use the MultipleOutputs class to achieve this, with tagged names and free indicators of whether the output came from a map or a reduce task.

--
Harsh J
www.harshj.com
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Edward Choi 2010-12-10, 12:23
Wow thanks for the info. I'll definitely try that.
One question though... Is that "tagged name" and "free indicator" some kind of special class variable provided by the MultipleOutputs class?

Ed
-
RE: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Jones, Nick 2010-12-10, 13:36
It might be worth looking into Nutch; it can probably be configured to do the type of crawling you need.
Nick Jones
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Ted Dunning 2010-12-10, 16:15
That is definitely possible, but may not be very desirable.
Take a look at the Bixo project for a full-scale crawler. There is a lot of subtlety in the fetching of URLs due to the varying quality of different sites and the interaction with crawl choking due to robots.txt considerations.

http://bixo.101tec.com/
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Harsh J 2010-12-10, 17:41
Hi again,
Not sure if you are still on this approach after the previous suggestions, but since you asked:

2010/12/10 Edward Choi <[EMAIL PROTECTED]>:
> One question though...
> Is that "tagged name" and "free indicator" some kind of special class variable provided by MultipleOutputs class?

To add a multiple-output collector to your Mapper, you need to call something like MultipleOutputs.addNamedOutput, where you give a name (a string identifier, what I referred to as a "tag"). Then, when you use this collector to write your file from the mapper, you will get files named <tag>-m-00000, <tag>-m-00001 and so on, apart from the usual part-00000 stuff.

Notice that the file name also tells you the output was created by a mapper, since there's an "m" in the name itself. This is the free indicator that comes along with no extra config.

What's more, you also get counters for the multiple output collector you defined just by enabling them (and using a reporter)!

--
Harsh J
www.harshj.com
-
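Editor's sketch: the file naming Harsh J describes in the message above (the tag, then "m" for map-side output, then a zero-padded partition number) can be mimicked in plain Java to make the pattern concrete. This sketch does not call Hadoop at all; NamedOutputFiles and fileName are hypothetical names, and the "r" marker for reduce-side output is stated here from general knowledge of MultipleOutputs rather than from the thread.

```java
// Plain-Java sketch of the file-name pattern described in the thread.
// It does not call Hadoop; NamedOutputFiles and fileName are hypothetical names.
final class NamedOutputFiles {

    // tag: the name given when registering the named output
    // fromMapper: true for map-side output ("m"), false for reduce-side ("r")
    // partition: the task's partition number, zero-padded to five digits
    static String fileName(String tag, boolean fromMapper, int partition) {
        return String.format("%s-%s-%05d", tag, fromMapper ? "m" : "r", partition);
    }

    public static void main(String[] args) {
        System.out.println(fileName("articles", true, 0));
        System.out.println(fileName("articles", true, 1));
        System.out.println(fileName("dictionary", false, 0));
    }
}
```

Seen this way, the crawled articles written from the map phase and the dictionary written from the reduce phase would land in the same output directory but under clearly distinguishable file names.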
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Edward Choi 2010-12-10, 18:12
Thanks for the detailed answer! I also need to check out the suggested approaches. But since my goal is just to crawl RSS feeds, I might be better off just making a small crawler myself :-). Thanks again for the reply.

Ed
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Edward Choi 2010-12-10, 18:15
God I never knew that they had a project like this.
I should definitely check it out. I may even be able to use it at my workplace. Thanks for the tip!!
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Edward Choi 2010-12-10, 18:20
Thanks for the tip. I guess it's a little different project from Nutch. My understanding is that while Nutch tries to implement a whole web search package, Bixo focuses on the crawling part. I should look into both projects more deeply. Thanks again!!
Ed
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Ted Dunning 2010-12-10, 18:42
Regarding the idea of doing word counts during the crawl, I think you are motivated by the best of principles (read input only once), but in practice, you will be doing many small crawls and saving the content. Word counting should probably not be tied too closely to the crawl because the crawl can be delayed arbitrarily. Better to have a good content repository that is updated as often as crawls complete and run other processing against the repository whenever it seems like a good idea.
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Edward Choi 2010-12-11, 02:42
Thanks for the advice. But my plan is to crawl news RSS feeds every 30 minutes, so I'd be downloading at most 5 to 10 news articles per map task (since news isn't published that often). So I guess I won't have to worry too much about the crawling delay.

I thought it would be a good idea to make a dictionary during the crawling process, because I will need a dictionary to calculate tf-idf and I didn't want to have to go through the whole repository every time a news article is added. If I crawl and make a dictionary at the same time, all I need to do is merge the new dictionaries (which are generated every 30 minutes) with the existing dictionary, which I guess will be computationally cheap.

Ed
-
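Editor's sketch: the incremental merge Edward describes in the message above (fold each 30-minute batch dictionary into the existing one, then use it for tf-idf) might look roughly like this in plain Java. The thread does not say what the dictionary stores or which tf-idf variant is intended, so two assumptions are made here: the dictionary maps each term to its document frequency, and tf-idf is computed as tf * ln(N / df). All class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of merging a 30-minute batch dictionary into a running dictionary.
// Assumptions (not from the thread): the dictionary maps term -> document
// frequency, and tf-idf is tf * ln(N / df). All names are hypothetical.
final class DictionaryMergeSketch {
    private final Map<String, Integer> documentFrequency = new HashMap<>();
    private long totalDocuments = 0;

    // Fold in one batch: its per-term document counts and its article count.
    void mergeBatch(Map<String, Integer> batchDocumentFrequency, int batchDocuments) {
        batchDocumentFrequency.forEach(
                (term, df) -> documentFrequency.merge(term, df, Integer::sum));
        totalDocuments += batchDocuments;
    }

    // tf-idf for a term that occurs termFrequency times in one article.
    double tfIdf(String term, int termFrequency) {
        int df = documentFrequency.getOrDefault(term, 0);
        if (df == 0) {
            return 0.0; // term is not in the dictionary yet
        }
        return termFrequency * Math.log((double) totalDocuments / df);
    }

    public static void main(String[] args) {
        DictionaryMergeSketch dictionary = new DictionaryMergeSketch();

        Map<String, Integer> firstBatch = new HashMap<>();
        firstBatch.put("senate", 2);
        firstBatch.put("budget", 1);
        dictionary.mergeBatch(firstBatch, 2);

        Map<String, Integer> secondBatch = new HashMap<>();
        secondBatch.put("senate", 1);
        secondBatch.put("bill", 1);
        dictionary.mergeBatch(secondBatch, 2);

        System.out.println(dictionary.tfIdf("budget", 2));
    }
}
```

The merge touches only the terms present in the new batch, so its cost grows with the batch size rather than with the size of the whole repository, which is the "computationally cheap" property the message is counting on.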
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?Ted Dunning 2010-12-11, 07:34
If you are only loading articles at that rate, I would suggest that a simple Java or Perl or Ruby program would be MUCH easier to write and debug than a full-on map-reduce program.
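
Editor's sketch: to give a sense of how small the "simple Java program" route can be, here is a minimal example that pulls the article links out of an RSS document using only the JDK's built-in XML parser. It assumes the common RSS 2.0 layout (item elements with a link child) and takes the feed as an in-memory string; fetching, scheduling and politeness toward the sites are left out. RssLinksSketch and itemLinks are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Minimal sketch: extract the <link> of every <item> in an RSS document.
// Assumes an RSS 2.0-style layout; RssLinksSketch and itemLinks are hypothetical.
final class RssLinksSketch {

    static List<String> itemLinks(String rssXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse feed", e);
        }
        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Only links nested inside an item count; the channel's own link is skipped.
            NodeList linkNodes = item.getElementsByTagName("link");
            if (linkNodes.getLength() > 0) {
                links.add(linkNodes.item(0).getTextContent().trim());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String feed = "<rss><channel><link>http://example.com/</link>"
                + "<item><title>A</title><link>http://example.com/a</link></item>"
                + "<item><title>B</title><link>http://example.com/b</link></item>"
                + "</channel></rss>";
        System.out.println(itemLinks(feed));
    }
}
```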
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?edward choi 2010-12-11, 08:30
I'd start with only a few RSS feeds at first, but I plan to eventually expand it to the scale of thousands of RSS feeds every 30 minutes. That's why I am so eager to implement my system in Hadoop.

I skimmed through Nutch and Bixo, but I feel that eventually I'm going to have to build the system from scratch. I'm going to need a very specific index structure to do what I want. Customizing Nutch or Bixo seems to require more effort and time than writing the code from scratch. But I can certainly refer to their methodology.

Ed
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?edward choi 2010-12-11, 09:32
Excuse me but could I ask one more question?
Can I operate Bixo on a cluster other than Amazon EC2? I am already running a Hadoop cluster of my own, so I'd like to run Bixo on top of it. But I don't see how to do that on Bixo's "Getting Started" page. All I see are "running locally", "running locally with eclipse", and "running in Amazon EC2".

Ed

On December 11, 2010, at 4:34 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> If you are only loading articles at that rate, I would suggest that a simple
> java or perl or ruby program would be MUCH easier to write and debug than a
> full on map-reduce program.
>
> 2010/12/10 Edward Choi <[EMAIL PROTECTED]>
>
> > Thanks for the advice. But my plan is to crawl news rss feeds every 30
> > minutes. So I'd be downloading at most 5 to 10 news articles per map task
> > (since news aren't published that often). So I guess I won't have to worry
> > too much about the crawling delay.
> > I thought it would be a good idea to make a dictionary during the crawling
> > process. Because I will be needing a dictionary to calculate tf-idf and
> > I didn't want to have to go through the whole repository every time a news
> > article is added.
> > If I crawl and make a dictionary at the same time, all I need to do to make
> > a dictionary is to merge the new ones (which are generated every 30 minutes)
> > with the existing dictionary which I guess will be computationally cheap.
> >
> > Ed
> >
> > From mp2893's iPhone
> >
> > On 2010. 12. 11., at 3:42 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > > Regarding the idea of doing word counts during the crawl, I think you are
> > > motivated by the best of principles (read input only once), but in practice,
> > > you will be doing many small crawls and saving the content. Word counting
> > > should probably not be tied too closely to the crawl because the crawl can
> > > be delayed arbitrarily. Better to have a good content repository that is
> > > updated as often as crawls complete and run other processing against the
> > > repository whenever it seems like a good idea.
> > >
> > > 2010/12/10 Edward Choi <[EMAIL PROTECTED]>
> > >
> > >> Thanks for the tip. I guess it's a little different project from Nutch. My
> > >> understanding is that while Nutch tries to implement a whole web search
> > >> package, Bixo focuses on the crawling part. I should look into both projects
> > >> more deeply. Thanks again!!
> > >>
> > >> Ed
> > >>
> > >> From mp2893's iPhone
> > >>
> > >> On 2010. 12. 11., at 1:15 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> That is definitely possible, but may not be very desirable.
> > >>>
> > >>> Take a look at the Bixo project for a full-scale crawler. There is a lot of
> > >>> subtlety in the fetching of URL's due to the varying quality of different
> > >>> sites and the interaction with crawl choking due to robots.txt considerations.
> > >>>
> > >>> http://bixo.101tec.com/
> > >>>
> > >>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[EMAIL PROTECTED]> wrote:
> > >>>
> > >>>> So my design is:
> > >>>> Map phase ==> crawl news articles, process text, write the result to a file.
> > >>>> II
> > >>>> II pass (term, term_frequency) pair to the Reducer
> > >>>> II
> > >>>> V
> > >>>> Reduce phase ==> Merge the (term, term_frequency) pair and create a dictionary
> > >>>>
> > >>>> Is this at all possible? Or is it inherently impossible due to the structure
> > >>>> of Hadoop?
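
For reference, the incremental merge discussed in the quoted messages (term counts from each 30-minute crawl folded into an existing dictionary) might look roughly like the sketch below. It is plain Python rather than a Hadoop job, in the spirit of the "simple program" suggestion. The function names, the whitespace tokenizer, and the sample data are all hypothetical; nothing here comes from Bixo or Hadoop.

```python
from collections import Counter


def count_terms(text):
    """Return term -> frequency for one article (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def merge_dictionaries(existing, new_batch):
    """Fold a batch of fresh term counts into the existing dictionary.

    Both arguments map term -> frequency; neither input is modified.
    """
    merged = Counter(existing)
    merged.update(new_batch)  # Counter.update adds counts instead of replacing
    return merged


# Hypothetical data: the dictionary so far, plus two newly crawled articles.
existing = Counter({"hadoop": 3, "news": 5})
batch = count_terms("Hadoop news today") + count_terms("more news")
dictionary = merge_dictionaries(existing, batch)
```

The cost of each merge depends only on the size of the new batch, not on the size of the repository, which is the reason the merge was expected to be cheap.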
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase? Ted Dunning 2010-12-11, 18:00
Of course. It is just a set of Hadoop programs.
2010/12/11 edward choi <[EMAIL PROTECTED]>

> Can I operate Bixo on a cluster other than Amazon EC2?
-
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase? Edward Choi 2010-12-12, 03:09
Thanks. Then I should definitely try that. Thanks for all the info :-)
Ed

From mp2893's iPhone

On 2010. 12. 12., at 3:00 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Of course. It is just a set of Hadoop programs.
>
> 2010/12/11 edward choi <[EMAIL PROTECTED]>
>
>> Can I operate Bixo on a cluster other than Amazon EC2?