Sqoop, mail # user - Ports for data returned by MSSQL import


Re: Ports for data returned by MSSQL import
Abraham Elmahrek 2013-10-01, 22:53
Doug,

I'm going to assume you're using Sqoop V1.

The MapReduce tasks that Sqoop starts act as the intermediary: they read
the data from the database and write it to HDFS. Hadoop's network
communication is layered on top of TCP/IP, so each client connects from a
random (ephemeral) port while each server listens on a fixed port. You can
configure the Hadoop servers to listen on any port you'd like (see
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml),
but I believe the client port is always random. Also, assuming you're using
MapReduce version 1, the sqoop client polls the JobTracker for job status
on a configurable port (http://hadoop.apache.org/docs/r1.2.1/mapred-default.html).
As usual, that polling connection also leaves the client from a random
port.
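
To make the port behaviour concrete, here is a minimal Python sketch on the
loopback interface. It is not Sqoop or Hadoop code, and the function names
(serve_once, demo) are made up for illustration. It shows a client
connecting from an OS-chosen port to a server's listening port, with the
reply travelling back over that same connection rather than over a newly
opened one.

```python
import socket
import threading


def serve_once(listener):
    """Accept one connection and reply with the client port the server saw."""
    conn, peer = listener.accept()
    with conn:
        conn.recv(1024)
        # The reply goes back on the already-established connection.
        conn.sendall(str(peer[1]).encode())


def demo():
    # Port 0 asks the OS for any free port so the demo never collides with
    # a running service. A real server (SQL Server on 1433, for example)
    # would bind to its fixed, well-known port instead.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server_port = listener.getsockname()[1]

    worker = threading.Thread(target=serve_once, args=(listener,))
    worker.start()

    with socket.create_connection(("127.0.0.1", server_port)) as client:
        client_port = client.getsockname()[1]  # ephemeral, chosen by the OS
        remote_port = client.getpeername()[1]  # the server's listening port
        client.sendall(b"ping")
        seen_by_server = int(client.recv(1024).decode())

    worker.join()
    listener.close()
    return server_port, client_port, remote_port, seen_by_server
```

Calling demo() returns the four port numbers. The remote port always equals
the server's listening port, and the port the server reports back matches
the client's ephemeral port, which changes from run to run.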

I'm not sure if you can communicate this over email, but what does your
data center setup look like? I would think that the entire Hadoop cluster
would sit behind a firewall and that clients would simply start jobs from
outside it. In that case you'll need to configure your firewall to allow
clients to reach the JobTracker (that is, allow inbound traffic to the
JobTracker port). The rest should be taken care of for you?
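
One way to check such a firewall rule from a client machine is a plain TCP
connection attempt. The sketch below assumes nothing about Hadoop itself;
the helper name port_is_reachable is hypothetical, and the host and port
in the usage note are placeholders for whatever your JobTracker address is
configured to be.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # The outbound connection leaves from an OS-chosen client port;
        # only the destination port has to be allowed by the firewall.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

For example, port_is_reachable("jobtracker.example.com", 8021) would report
whether that (placeholder) address accepts connections from where you run
it. A True result only shows the port is open, not that the service behind
it is healthy.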

-Abe
On Tue, Oct 1, 2013 at 11:26 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:

> Yes, that is really the problem.
>
> We need to be able to control, or at least predict, which ports are used,
> since we need to be able to supply credentials to a given client database
> and have the sqoop import deliver it to our cluster behind our firewall.
> The TaskNodes request the data from MSSQL Server listening on port 1433,
> but when the MSSQL Server sends the data back, is there a sqoop argument or
> proxy method so we can control what port the data goes back to HDFS on?
> According to Microsoft, winsock client calls are answered via 3-way
> handshake on a random port between 1024-5000 for data delivery. So:
> TaskNode requests connection SQL on 1433, SQL acks on 1433, then opens a
> random port between 1024 - 5000 to the IP it just acknowledged to send the
> data. For a firewall, we cannot leave all ports 1024-5000 open and still
> pass vulnerability scans for compliance to our auditing bodies.
>
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Tue, Oct 1, 2013 at 1:21 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>
>> Hey There,
>>
>> Your TaskNodes and JobTracker node will be contacting your RDBMS.
>> Check out
>> http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_connecting_to_a_database_server
>> for more information.
>>
>> -Abe
>>
>>
>> On Tue, Oct 1, 2013 at 7:22 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all!
>>>
>>> I have a broad question that is proving difficult to answer conclusively.
>>>
>>> When you import from MSSQL, we (my co-workers and I) understand that the
>>> initial connector communicates on port 1433 by default. However, when the
>>> map task created by sqoop imports the data to the data nodes, are the data
>>> nodes connecting to MSSQL via port 1433, or are arbitrary ports opened
>>> between the data nodes and the SQL Server?
>>>
>>> We need to know because we are interested in hosting data for a variety
>>> of clients, and need to be able to place firewall rules for our data center
>>> to manage access to our cluster while still connecting to various
>>> environments.
>>>
>>> Thank you,
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>
>>
>