It seems logical, too, that launching 4000 map tasks on a 20-node cluster is going to carry a lot of overhead. Still, 20 does not seem like the ideal number, but I don't know the internals of Cassandra that well. You might want to post this question on the Cassandra list to see if they can help you identify a way to increase the number of map tasks.
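One thing worth knowing when hunting for where a property like mapred.map.tasks comes from: Hadoop loads its configuration resources in order, and a property marked final in an earlier (cluster-side) resource cannot be overridden by the job. The sketch below is a simplified model of that precedence, not Hadoop's actual implementation; the resource names and dict layout are illustrative only.

```python
# Simplified model of Hadoop-style configuration precedence (illustrative,
# not the real Configuration class): resources are applied in order, later
# ones override earlier ones, EXCEPT that a key marked "final" in an
# earlier resource is locked and later writes to it are silently ignored.

def resolve(resources):
    """resources: list of dicts mapping key -> (value, is_final)."""
    effective = {}   # key -> effective value
    finals = set()   # keys locked by an earlier "final" declaration
    for res in resources:
        for key, (value, is_final) in res.items():
            if key in finals:
                continue  # earlier final value wins; this write is dropped
            effective[key] = value
            if is_final:
                finals.add(key)
    return effective

site = {"mapred.map.tasks": ("20", True)}     # cluster-side file, final
job  = {"mapred.map.tasks": ("4000", False)}  # what the job itself asks for
conf = resolve([site, job])
# conf["mapred.map.tasks"] is "20": the final site-level value silently wins
```

So if some *-site.xml on the cluster declared the property with `<final>true</final>`, the job's own setting would be ignored without any error, which matches the "it's getting re-read in from somewhere" symptom.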
On 11/5/11 9:33 AM, "Brendan W." <[EMAIL PROTECTED]> wrote:
Yeah, that's my guess now, that somebody must have hacked the Cassandra
libs on me...just wanted to see if there were other possibilities for where
that parameter was being set.
Thanks a lot for the help.
On Fri, Nov 4, 2011 at 2:11 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> Could it just be that Cassandra has changed the way its splits are generated?
> Were the Cassandra client libs changed at any point? Have you looked at its
> input format's sources?
> On 04-Nov-2011, at 10:05 PM, Brendan W. wrote:
> > Plain Java MR, using the Cassandra inputFormat to read out of Cassandra.
> > Perhaps somebody hacked the inputFormat code on me...
> > But what's weird is that the parameter mapred.map.tasks didn't appear in
> > the job confs before at all. Now it does, with a value of 20 (happens to
> > be the # of machines in the cluster), and that's without the jobs or the
> > mapred-site.xml files themselves changing.
> > The inputSplitSize is set specifically in the jobs, and has not been
> > changed (except I subsequently fiddled with it a little to see if it
> > affected the fact that I was getting 20 splits, and it didn't affect
> > that...just the split size, not the number).
> > After I submit the job, I get a message "TOTAL NUMBER OF SPLIT = 20"
> > before a list of the input splits...sort of looks like a hack, but I can't
> > find where it is.
> > On Fri, Nov 4, 2011 at 11:58 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >> Brendan,
> >> Are these jobs (whose split behavior has changed) via Hive/etc. or plain
> >> Java MR?
> >> In case it's the former, do you have users using newer versions of them?
> >> On 04-Nov-2011, at 8:03 PM, Brendan W. wrote:
> >>> Hi,
> >>> On my cluster of 20 machines, I used to run jobs ("hadoop jar ...")
> >>> that would spawn around 4000 map tasks. Now when I run the same jobs,
> >>> that number is 20; and I notice that in the job configuration, the
> >>> parameter mapred.map.tasks is set to 20, whereas it never used to be
> >>> present at all in the configuration file.
> >>> Changing the input split size in the job doesn't affect this--I get the
> >>> split size I ask for, but the *number* of input splits is still capped at
> >>> 20--i.e., the job isn't reading all of my data.
> >>> The mystery to me is where this parameter could be getting set. It is
> >>> not present in the mapred-site.xml file in <hadoop home>/conf on any
> >>> machine in the cluster, and it is not being set in the job (I'm running
> >>> out of the same jar I always did; no updates).
> >>> Is there *anywhere* else this parameter could possibly be getting set?
> >>> I've stopped and restarted map-reduce on the cluster with no
> >>> effect...it's getting re-read in from somewhere, but I can't figure out
> >>> where.
> >>> Thanks a lot,
> >>> Brendan