Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Sqoop >> mail # user >> Intermittent problems with sqoop using Oracle JDBC driver


Copy link to this message
-
RE: Intermittent problems with sqoop using Oracle JDBC driver
I think I worked out why my emails weren't getting through to the mailing list - my email has changed from @quest.com to @software.dell.com.

Have you tried OraOop? It would be interesting to know if it works out of the box - as I have hopefully fixed it so this wouldn't occur. It may still not get set for the initial connection - as some Sqoop code was calling the security manager before I could get a chance to set the option - but it should print a warning if that happens so you know to update your config. It will also fix a lot of throughput issues that you would get with the standard connector for large data volumes / large number of mappers.

David Robson
Software Developer
Dell | R&D, Quest Software
office +61 3 9811 8082

Quest Software is now part of Dell

From: Andre Araujo [mailto:[EMAIL PROTECTED]]
Sent: Monday, 15 July 2013 1:12 PM
To: David Robson
Cc: [EMAIL PROTECTED]
Subject: Re: Intermittent problems with sqoop using Oracle JDBC driver

Thanks, David.

My blog post is pending revision and should be published soon. I'll post the final link when it does.
For the time being, please see below a copy of it without the formatting. What worked for me was a combination of two things:

  *   passing the "-D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/urandom"" parameter to sqoop
  *   setting the java.security.egd parameter in the HADOOP_OPTS variable, so that it was passed to "${HADOOP_COMMON_HOME}/bin/hadoop  org.apache.sqoop.Sqoop"
Regards,
Andre

-----------------------------------

I've been using Sqoop to load data into HDFS from Oracle. I'm using version 1.4.3 of Sqoop, running on a Linux machine and using the Oracle JDBC driver with JDK 1.6.

I was getting intermittent connection resets when trying to import data. After much troubleshooting, I eventually found the problem to be related to a known issue with the JDBC driver and found a way to work around it, which is described in the post
The problem

I noticed that when I was importing data at times where the machine I was running the sqoop client at was mostly idle, everything would run just fine. However, at times when others started to work on the same machine and it became a bit busier, I would start to get the errors below intermittently:

[araujo@client ~]$ sqoop import --connect jdbc:oracle:thin:user/pwd@host/orcl -m 1 --query 'select 1 from dual where $CONDITIONS' --target-dir test
13/07/12 09:35:39 INFO manager.SqlManager: Using default fetchSize of 1000
13/07/12 09:35:39 INFO tool.CodeGenTool: Beginning code generation
13/07/12 09:37:53 ERROR manager.SqlManager: Error executing statement: java.sql.SQLRecoverableException: IO Error: Connection reset
            at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:467)
            at oracle.jdbc.driver.PhysicalConnection.(PhysicalConnection.java:546)
        ...
Caused by: java.net.SocketException: Connection reset
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
            ... 24 more
After some troubleshooting and googling, I found that the problem seemed to be related to the issue described in the following articles:

http://stackoverflow.com/questions/2327220/oracle-jdbc-intermittent-connection-issue/
https://forums.oracle.com/message/3701989/

Confirming the problem

To ensure the problem was the same as the one described in the articles, and not something else intrinsic to Sqoop, I created a small Java program that simply connected to the database. I was able to reproduce the issue using it:

[araujo@client TestConn]$ time java TestConn
Exception in thread "main" java.sql.SQLRecoverableException: IO Error: Connection reset
...
Caused by: java.net.SocketException: Connection reset
...
            ... 8 more

real      1m20.481s
user      0m0.491s
sys       0m0.051s
The workaround suggested in the articles also worked:

[araujo@client TestConn]$ time java -Djava.security.egd=file:/dev/../dev/urandom TestConn
Connection successful!

real      0m0.419s
user      0m0.498s
sys       0m0.036s
Applying the fix to Sqoop

It took me a while to figure out how to use the workaround above with Sqoop. Many tentatives to specify the parameter in the Sqoop command line, in many different forms, didn't work as expected.

The articles mention that the java.security.egd parameter can be centrally set in the $JAVA_HOME/jre/lib/security/java.security file. Unfortunately, this didn't work for me. Using strace, I confirmed that Sqoop was actually reading the java.security file but the setting just didn't take effect. I couldn't figure out why not and eventually gave up that alternative.

After a bit more of stracing and troubleshooting, though, I finally figured a way out.

Sqoop seems to use the JDBC driver in two different ways:

First, it connects to the Oracle database directly. It does that to gather more information about the tables (or query) from where the data is extracted and generate the map reduce job that it will run.
Second, the map reduce job generated by Sqoop uses the JDBC driver to connect to the database and perform the actual data import.
I was hitting the problem in the first case above, but I believe in both cases there's a potential for the problem to occur. So, ideally, we should apply the workaround to both cases.

The Sqoop documentation clearly gives us an option to address the second case: using the following parameter to Sqoop allows us to pass Java command line options to the map reduce job:

sqoop import -D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/urandom" ...
Even though I couldn't fully prove the above, since I couldn't consistently reproduce the problem for the map reduce tasks, I believe (and hope) it should work well.

The Sqoop direct connection to Oracle

The problem with the direct connection from Sqoop to Oracle, though, wasn't resolved by that option. Trying to pass the "-Djava.s
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB