Yes, that is exactly the problem.
We need to control, or at least predict, which ports are used, because we
need to supply credentials to a given client database and have the sqoop
import deliver the data to our cluster behind our firewall.
The TaskNodes request the data from MSSQL Server, which listens on port 1433.
When MSSQL Server sends the data back, is there a sqoop argument or proxy
method that lets us control which port the data comes back to HDFS on?
According to Microsoft, Winsock client calls are answered via 3-way
handshake on a random port in the range 1024-5000 for data delivery. So:
the TaskNode requests a connection to SQL Server on 1433, SQL Server
acknowledges on 1433, and then opens a random port between 1024 and 5000 to
the IP address it just acknowledged in order to send the data. We cannot
leave ports 1024-5000 open on the firewall and still pass the vulnerability
scans that our auditing bodies require for compliance.
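
A small loopback experiment can help check the port sequence described above
before any firewall rules are written. The following is only a minimal sketch
using Python's standard socket module: the echo server is a stand-in and not
SQL Server, and the names run_echo_server and demo are made up for
illustration. In this sketch the reply travels back over the connection the
client opened, and the random port is the client's own local port. Whether
SQL Server and the JDBC driver used by sqoop behave the same way should be
confirmed with a packet capture in the real environment.

```python
import socket
import threading


def run_echo_server(server_sock, seen):
    """Accept one connection, record the peer address, and answer one message."""
    conn, peer = server_sock.accept()
    with conn:
        seen["client_addr"] = peer        # (ip, port) as the server sees the client
        data = conn.recv(1024)
        conn.sendall(b"rows:" + data)     # the reply is written to the same socket


def demo():
    # Stand-in for the database listener; port 0 lets the OS pick a free port.
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))
    server_sock.listen(1)
    server_port = server_sock.getsockname()[1]

    seen = {}
    worker = threading.Thread(target=run_echo_server, args=(server_sock, seen))
    worker.start()

    # Stand-in for a task node opening a connection to the database port.
    client = socket.create_connection(("127.0.0.1", server_port))
    with client:
        local_port = client.getsockname()[1]   # port chosen on the client side
        remote_port = client.getpeername()[1]  # fixed listener port
        client.sendall(b"query")
        reply = client.recv(1024)

    worker.join()
    server_sock.close()
    return {
        "server_port": server_port,
        "remote_port": remote_port,
        "local_port": local_port,
        "client_port_seen_by_server": seen["client_addr"][1],
        "reply": reply,
    }


if __name__ == "__main__":
    for key, value in demo().items():
        print(key, value)
```

If the real traffic matches this pattern, the firewall question becomes one
of allowing outbound connections from the cluster to port 1433 and their
established return traffic, rather than opening an inbound port range.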
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Tue, Oct 1, 2013 at 1:21 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
> Hey There,
> Your TaskNodes and JobTracker node will be contacting your RDBMS. Check out
> http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_connecting_to_a_database_server for more information.
> On Tue, Oct 1, 2013 at 7:22 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>> Hi all!
>> I have a broad question that is proving difficult to answer conclusively.
>> When importing from MSSQL, we (my co-workers and I) understand that the
>> initial connector communicates on port 1433 by default. However, when the
>> map tasks created by sqoop import the data to the data nodes, do the data
>> nodes connect to MSSQL on port 1433, or are arbitrary ports opened
>> between the data nodes and the SQL Server?
>> We need to know because we are interested in hosting data for a variety
>> of clients, and need to be able to place firewall rules for our data center
>> to manage access to our cluster while still connecting to various
>> Thank you,
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com