-
Fully distribute TextInputFormat...
Pierre ANCELOT 2010-05-10, 12:21
Hi folks :) I have one big file. I read it with FileInputFormat, which generates only one task, so of course the work doesn't get distributed across the cluster nodes. Should I use another input class, or do I have a bug in my implementation? The desired behavior is one task per line. Thanks.
-- http://www.neko-consulting.com Ego sum quis ego servo "Je suis ce que je protège" "I am what I protect"
-
Re: Fully distribute TextInputFormat...
Jeff Zhang 2010-05-10, 12:52
What's the format of this file? gzip can be split.
-- Best Regards, Jeff Zhang
-
Re: Fully distribute TextInputFormat...
Pierre ANCELOT 2010-05-10, 13:05
Simple, pure raw ASCII text. One line == one treatment to do.
-
Re: Fully distribute TextInputFormat...
Pierre ANCELOT 2010-05-10, 13:19
The idea is that I want to share the lines of the file equally between nodes...
-
Re: Fully distribute TextInputFormat...
Ted Yu 2010-05-10, 16:35
NLineInputFormat seems a good fit for your need.
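To make the suggestion concrete, here is a minimal sketch of the idea behind it: group the input lines into splits of N lines each, so that N = 1 gives one task per line. This is plain Java for illustration only, not Hadoop's implementation; the class name LineSplitSketch and the method splitByLines are hypothetical. In the old mapred API the number of lines per split is, as far as I know, read from the mapred.line.input.format.linespermap property, so please check that against the Hadoop version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Hadoop code: groups input lines into
// fixed-size "splits", where each split would be handed to one map task.
class LineSplitSketch {

    // Returns consecutive groups of at most linesPerSplit lines each.
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            // Copy the sublist so each split is independent of the input list.
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("job-1", "job-2", "job-3", "job-4", "job-5");
        // One line per split corresponds to the "one task per line" behavior.
        System.out.println(splitByLines(lines, 1).size() + " splits with N=1");
        // Larger N trades parallelism for fewer, bigger tasks.
        System.out.println(splitByLines(lines, 2).size() + " splits with N=2");
    }
}
```

With one line per split every line becomes its own task, which only pays off when each line represents a substantial amount of work; otherwise a larger N keeps per-task overhead down.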
-
Re: Fully distribute TextInputFormat...
Edward Capriolo 2010-05-10, 19:06
If you're curious, I found out this morning that NLineInputFormat has not been ported to the new mapreduce API yet. (It might be in trunk.) So using NLineInputFormat forces you into the older mapred API. Edward
-
Re: Fully distribute TextInputFormat...
Alex Baranov 2010-05-10, 20:27
If I'm not mistaken, LZO compression is better suited than gzip when splitting is needed. Alex Baranau http://sematext.com
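A rough sketch of why splittability matters for the number of map tasks: a file that can be split is cut into byte ranges, while a file that cannot be split (a single gzip stream, for instance) has to be read by one task from start to finish. This is a deliberately simplified model and not Hadoop's actual split computation; the class SplitCountSketch, the method countSplits and the 64 MB split size are assumptions chosen for illustration.

```java
// Hypothetical, simplified model of how many input splits (and therefore
// map tasks) a single file yields. Real Hadoop split logic has more detail.
class SplitCountSketch {

    static long countSplits(long fileBytes, long splitBytes, boolean splittable) {
        if (splitBytes < 1) {
            throw new IllegalArgumentException("splitBytes must be >= 1");
        }
        if (fileBytes <= 0) {
            return 0; // choice made for this sketch: an empty file yields no splits
        }
        if (!splittable) {
            return 1; // the whole file goes to a single task
        }
        // Ceiling division: the last split may be shorter than splitBytes.
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long splitSize = 64L * 1024 * 1024; // assumed split size of 64 MB
        System.out.println("splittable:     " + countSplits(oneGiB, splitSize, true));
        System.out.println("not splittable: " + countSplits(oneGiB, splitSize, false));
    }
}
```

Under this model a small plain-text file that fits inside one split also yields a single task, regardless of compression.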
-
Re: Fully distribute TextInputFormat...
himanshu chandola 2010-05-11, 03:13
Actually, would you ever have a case where no splitting is needed? Just curious. It seems that you would either use LZO or not use any compression at all. H
-
Re: Fully distribute TextInputFormat...
Alex Baranov 2010-05-11, 05:27
I meant splitting a very large file in order to distribute it over multiple Map jobs. Alex.