Re: How does sqoop distribute it's data evenly across HDFS?
Aaron Kimball 2011-03-17, 17:37
Sqoop operates in parallel by spawning multiple map tasks that each read
different rows of the database, based on the split key. Each map task
submits a different SQL query which uses the split key column to get a
subset of the rows you want to read. Each map task then writes an output
file to HDFS, covering its subset of the total rows. Hopefully, Sqoop's
partitioning heuristic has resulted in each of these output files being of
roughly the same size. If your row range is "lumpy" (e.g., you have a whole
lot of rows with a column ID=0...10000, then a blank space, then a whole lot
more rows where 5000000 < ID < 6000000), you'll see a bunch of output files
where some may be empty, and one or two contain all the rows. If your row
range is more uniform (e.g., the range of the ID column is more-or-less
fully-occupied between its maximum and minimum values), you'll get much more
evenly sized output files.
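
To make the "lumpy" versus uniform cases concrete, here is a minimal Python sketch of range-based splitting. It is not Sqoop's actual code: it assumes the split key is an integer column and that the range between its minimum and maximum is simply cut into equal-width pieces, one per map task. The function names and the my_table / id identifiers are made up for illustration.

```python
from bisect import bisect_left


def split_ranges(lo, hi, num_splits):
    """Cut the closed ID range [lo, hi] into num_splits half-open ranges
    [start, end) of near-equal width. Only the min and max are consulted,
    not how the rows are actually distributed in between."""
    width = hi - lo + 1
    bounds = [lo + (width * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(table, column, start, end):
    """Illustrative per-task query; table and column names are hypothetical."""
    return f"SELECT * FROM {table} WHERE {column} >= {start} AND {column} < {end}"


def rows_per_split(ids, ranges):
    """Count how many IDs fall into each [start, end) range."""
    ordered = sorted(ids)
    return [bisect_left(ordered, end) - bisect_left(ordered, start)
            for start, end in ranges]


# "Lumpy" IDs from the message: 0..10000, a gap, then 5000001..5999999.
lumpy = list(range(0, 10001)) + list(range(5000001, 6000000))
ranges = split_ranges(min(lumpy), max(lumpy), 4)
for start, end in ranges:
    print(split_query("my_table", "id", start, end))
print(rows_per_split(lumpy, ranges))  # [10001, 0, 0, 999999]

# Fully occupied IDs 0..999 split evenly.
uniform = range(0, 1000)
print(rows_per_split(uniform, split_ranges(0, 999, 4)))  # [250, 250, 250, 250]
```

With four tasks, the lumpy range leaves two tasks with nothing to read and one task with nearly all the rows, while the fully occupied range gives each task the same share.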
But assuming the number of rows read in by each map task is more or less
the same, then the files will be distributed across the cluster using the
underlying platform. In practice, Sqoop relies on MapReduce and HDFS to make
things "just work out." For instance, by spawning 4 map tasks (Sqoop's
default), it is likely that your cluster will have four separate nodes where
there is a free task slot, and that these tasks will be allocated across
those four different nodes. HDFS' replica placement algorithm is "one
replica on the same node as the writer, and the other two replicas
elsewhere" (assuming a single rack -- if you've configured a multiple rack
topology, HDFS goes further and ensures allocation on at least two racks).
So the map tasks will "probably" be spread onto four different nodes, and
HDFS will "probably" put 3 replicas * 4 tasks' output on a reasonably
diverse set of machines in the cluster. Note that HDFS block placement is
actually on a per-block, not a per-file basis, so if each task is writing
multiple blocks of output, the number of datanodes which are candidates for
replica targets goes up substantially.
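
A rough way to see why the data "probably" spreads out is to simulate the placement rule described above. The sketch below is a toy model, not HDFS's implementation: it assumes a single rack, picks the two remote replicas uniformly at random, and ignores node load and free disk space. The node names and function names are hypothetical.

```python
import random


def place_replicas(writer, nodes, rng, replication=3):
    """First replica goes on the writer's own node; the remaining ones go
    to distinct other nodes chosen at random (single-rack assumption)."""
    others = [n for n in nodes if n != writer]
    return [writer] + rng.sample(others, replication - 1)


def simulate_import(nodes, writers, blocks_per_task, rng):
    """Return the set of nodes holding at least one replica after every
    task (one writer node per task) has written blocks_per_task blocks."""
    used = set()
    for writer in writers:
        for _ in range(blocks_per_task):
            used.update(place_replicas(writer, nodes, rng))
    return used


rng = random.Random(42)  # fixed seed so repeated runs give the same result
nodes = [f"node{i}" for i in range(20)]

# Four tasks on four different nodes, one block each, then ten blocks each.
print(len(simulate_import(nodes, nodes[:4], 1, rng)))
print(len(simulate_import(nodes, nodes[:4], 10, rng)))

# All four tasks on the same node, one block each.
print(len(simulate_import(nodes, ["node0"] * 4, 1, rng)))
```

Because placement is decided per block, raising blocks_per_task tends to increase the number of distinct nodes that end up holding data, even when the number of tasks stays at four.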
In theory, you are right: pathological cases can occur, where all Sqoop
tasks run serially on a single node, making that node hold replica #1 of
each task's output. The HDFS namenode could then pick the same two foreign
replica nodes for each of these four tasks' output and have only three nodes
in the cluster hold all the data from an import job. But this is a very
unlikely case, and not one worth worrying about in practice. If this
happens, it's likely because the majority of your nodes are already in a
disk-full condition, or are otherwise unavailable, and only a specific
subset of the nodes are capable of actually receiving the writes. But in
this regard, Sqoop is no different than any other custom MapReduce program
you might write; it's not particularly more or less resilient to any
pathological conditions of the underlying system which might arise.
Hope that helps.
On Thu, Mar 17, 2011 at 2:41 AM, Andy Doddington <[EMAIL PROTECTED]> wrote:
> Ok, I understand about the balancer process which can be run manually, but
> the sqoop documentation seems to imply that it does balancing for you, based
> on the split key, as you note.
> But what causes the various sqoop data import map jobs to write to
> different data nodes? I.e. What stops them all writing to the same node, in
> the ultimate pathological case?
> Andy D
> On 17 Mar 2011, at 00:28, Harsh J <[EMAIL PROTECTED]> wrote:
> > There's a balancer available to re-balance DNs across the HDFS cluster
> > in general. It is available in the $HADOOP_HOME/bin/ directory as
> > start-balancer.sh
> > But what I think sqoop implies is that your data is balanced due to
> > the map jobs it runs for imports (using a provided split factor
> > between maps), which should make it write chunks of data out to
> > different DataNodes.
> > I guess you could get more information on the Sqoop mailing list
> > [EMAIL PROTECTED],