Pig times out when talking to busy hcatalog
In Pig I am running a query like this:

sdp1 = load 'db1.table1' using org.apache.hcatalog.pig.HCatLoader;
sdp = FILTER sdp1 BY key1=='value1' AND key2=='value2';
ll = LIMIT sdp 100;
dump ll;

hcatalog then spends several minutes querying MySQL for metadata.
Meanwhile, after a few seconds, Pig fails with:
org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out

Number of partitions I have:
hive -e 'use db1; show partitions table1' |wc -l
Time taken: 1.467 seconds

When I run the same query on a different environment where I have only
~1000 partitions all works fine.

Also, the problem does not occur on CDH3 with hcatalog-0.4.0.

In hcatalog's logs I can see:
(note the timestamps; I ran the query at 17:10:45,216)

2013-08-27 17:10:46,275 INFO  DataNucleus.MetaData
(Log4JLogger.java:info(77)) - Listener found initialisation for persistable
class org.apache.hadoop.hive.metastore.model.MPartition

2013-08-27 17:14:23,661 DEBUG metastore.ObjectStore
(ObjectStore.java:listMPartitionsByFilter(1832)) - Done retrieving all
objects for listMPartitionsByFilter

2013-08-27 17:22:32,410 INFO  metastore.ObjectStore
(ObjectStore.java:getPartitionsByFilter(1699)) - # parts after pruning 37748

After that, hcatalog keeps going until:
2013-08-27 17:30:14,631 DEBUG DataNucleus.Transaction
(Log4JLogger.java:debug(58)) - Transaction committed in 462221 ms
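
To make the timeline easier to read, the elapsed times implied by these timestamps can be worked out with a short script. This is only a minimal Python sketch using the timestamps quoted above; the event labels and variable names are my own shorthand, not anything hcatalog prints.

```python
from datetime import datetime, timedelta

# log4j-style timestamp with a comma before the milliseconds
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse(ts):
    """Parse a timestamp as it appears in the hcatalog log."""
    return datetime.strptime(ts, FMT)


# Time at which the Pig query was started (from the note above)
query_start = parse("2013-08-27 17:10:45,216")

# (label, timestamp) pairs; the labels are informal descriptions
events = [
    ("MPartition listener initialised", "2013-08-27 17:10:46,275"),
    ("listMPartitionsByFilter done", "2013-08-27 17:14:23,661"),
    ("parts after pruning logged", "2013-08-27 17:22:32,410"),
    ("transaction committed", "2013-08-27 17:30:14,631"),
]

# Seconds elapsed since the query started, per event
elapsed = {
    label: (parse(ts) - query_start).total_seconds() for label, ts in events
}

for label, seconds in elapsed.items():
    print(f"{label}: {seconds:.3f} s after query start")

# The commit line reports 462221 ms; subtract it from the commit
# timestamp to see when that transaction began.
commit_duration = timedelta(milliseconds=462221)
transaction_start = parse("2013-08-27 17:30:14,631") - commit_duration
print("transaction began at", transaction_start.strftime("%H:%M:%S.%f")[:-3])
```

The 462221 ms commit is about 7.7 minutes, and subtracting it from the commit timestamp lands on 17:22:32,410, the same instant as the "# parts after pruning" line. Overall, roughly 19.5 minutes pass between starting the query and the commit.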

Please note that I have datanucleus set to DEBUG, which slows things down
significantly. Without that, it still takes around 7 minutes for hcatalog to

Here are the datanucleus settings from hcatalog's logs:

 datanucleus.autoStartMechanismMode = checked
 javax.jdo.option.Multithreaded = true
 datanucleus.identifierFactory = datanucleus
 datanucleus.transactionIsolation = read
 datanucleus.validateTables = false
 javax.jdo.option.ConnectionURL = jdbc:mysql://XXX
 javax.jdo.option.DetachAllOnCommit = true
 javax.jdo.option.NonTransactionalRead = true
 datanucleus.validateConstraints = false
 javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
 javax.jdo.option.ConnectionUserName = hive
 datanucleus.validateColumns = false
 datanucleus.cache.level2 = false
 datanucleus.plugin.pluginRegistryBundleCheck = LOG
 datanucleus.cache.level2.type = none
 javax.jdo.PersistenceManagerFactoryClass = org.datanucleus.jdo.JDOPersistenceManagerFactory
 datanucleus.autoCreateSchema = true
 datanucleus.storeManagerType = rdbms
 datanucleus.connectionPoolingType = DBCP

This runs on CDH4 4.3.0
hcatalog version: 0.5.0+9-1.cdh4.3.0.p0.12~precise-cdh4.3.0
Does anyone know whether it is possible to increase Pig's timeout?
I already have hive.metastore.client.socket.timeout set to 3600, and Pig
still times out after about 5-8 seconds.