Pig times out when talking to busy hcatalog
In Pig I am running a query like this:

sdp1 = load 'db1.table1' using org.apache.hcatalog.pig.HCatLoader;
sdp = FILTER sdp1 BY key1=='value1' AND key2=='value2';
ll = LIMIT sdp 100;
dump ll;

HCatalog then spends several minutes querying MySQL for metadata.
Meanwhile, after only a few seconds, Pig fails with:
org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out

Number of partitions I have:
hive -e 'use db1; show partitions table1' |wc -l
Time taken: 1.467 seconds
37748

When I run the same query in a different environment where I have only
~1000 partitions, everything works fine.

The problem also does not occur on CDH3 and hcatalog-0.4.0.

In HCatalog's logs I can see:
(note the timestamps; I ran the query at 17:10:45,216)

2013-08-27 17:10:46,275 INFO  DataNucleus.MetaData
(Log4JLogger.java:info(77)) - Listener found initialisation for persistable
class org.apache.hadoop.hive.metastore.model.MPartition

2013-08-27 17:14:23,661 DEBUG metastore.ObjectStore
(ObjectStore.java:listMPartitionsByFilter(1832)) - Done retrieving all
objects for listMPartitionsByFilter

2013-08-27 17:22:32,410 INFO  metastore.ObjectStore
(ObjectStore.java:getPartitionsByFilter(1699)) - # parts after pruning 37748

After that, HCatalog keeps running until:
2013-08-27 17:30:14,631 DEBUG DataNucleus.Transaction
(Log4JLogger.java:debug(58)) - Transaction committed in 462221 ms
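
To make the timeline easier to read, the gaps between the log entries above can be worked out with a short script. This is only a sketch: the timestamps are copied from the post, while the event labels and helper names (`parse`, `elapsed_ms`) are made up for illustration.

```python
from datetime import datetime, timedelta

# Log4j-style timestamp layout used in the excerpts above.
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the post; the labels are informal descriptions.
events = [
    ("query started in Pig", "2013-08-27 17:10:45,216"),
    ("MPartition listener initialised", "2013-08-27 17:10:46,275"),
    ("listMPartitionsByFilter done", "2013-08-27 17:14:23,661"),
    ("# parts after pruning 37748", "2013-08-27 17:22:32,410"),
    ("transaction committed", "2013-08-27 17:30:14,631"),
]


def parse(ts):
    """Parse one log timestamp into a datetime."""
    return datetime.strptime(ts, FMT)


def elapsed_ms(start, end):
    """Whole milliseconds between two log timestamps."""
    return (parse(end) - parse(start)) // timedelta(milliseconds=1)


query_start = events[0][1]
for label, ts in events:
    print(f"{elapsed_ms(query_start, ts) / 1000:9.3f} s  {label}")

# Gap between the pruning line and the commit line, in milliseconds.
print(elapsed_ms(events[3][1], events[4][1]))
```

The gap between the pruning line and the commit line is 462221 ms, the same figure the commit line reports. That suggests, though the logs do not say so directly, that the committed transaction covers exactly that interval. From the query start to the commit, about 19.5 minutes pass with DEBUG logging enabled.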

Please note that I have DataNucleus logging set to DEBUG, which slows things
down significantly. Without it, HCatalog still takes around 7 minutes to
settle.

Here are the DataNucleus settings from HCatalog's logs:

 datanucleus.autoStartMechanismMode = checked
 javax.jdo.option.Multithreaded = true
 datanucleus.identifierFactory = datanucleus
 datanucleus.transactionIsolation = read
 datanucleus.validateTables = false
 javax.jdo.option.ConnectionURL = jdbc:mysql://XXX
 javax.jdo.option.DetachAllOnCommit = true
 javax.jdo.option.NonTransactionalRead = true
 datanucleus.validateConstraints = false
 javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
 javax.jdo.option.ConnectionUserName = hive
 datanucleus.validateColumns = false
 datanucleus.cache.level2 = false
 datanucleus.plugin.pluginRegistryBundleCheck = LOG
 datanucleus.cache.level2.type = none
 javax.jdo.PersistenceManagerFactoryClass = org.datanucleus.jdo.JDOPersistenceManagerFactory
 datanucleus.autoCreateSchema = true
 datanucleus.storeManagerType = rdbms
 datanucleus.connectionPoolingType = DBCP

This runs on CDH4 4.3.0
HCatalog version: 0.5.0+9-1.cdh4.3.0.p0.12~precise-cdh4.3.0

Does anyone know whether it is possible to increase Pig's timeout?
I already have hive.metastore.client.socket.timeout set to 3600, yet Pig
times out in about 5-8 seconds.
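
The symptom described here, a client giving up after a few seconds while the server is still working, is what a client-side socket read timeout looks like. The sketch below reproduces only that general mechanism with plain Python sockets. It is not Pig, Thrift, or HCatalog code. The names `slow_server`, `fetch`, and `demo` are hypothetical, and the delays are arbitrary values chosen so the example finishes quickly.

```python
import socket
import threading
import time


def slow_server(listener, delay_seconds):
    """Accept one connection, wait, then reply (stands in for a busy server)."""
    conn, _ = listener.accept()
    with conn:
        time.sleep(delay_seconds)
        try:
            conn.sendall(b"metadata")
        except OSError:
            pass  # the client may already have closed its end


def fetch(port, read_timeout_seconds):
    """Connect and read one reply, giving up after read_timeout_seconds."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.settimeout(read_timeout_seconds)
        try:
            return client.recv(1024)
        except socket.timeout:
            return None  # the read timed out before the server answered


def demo(server_delay, client_timeout):
    """Run one request against a server that answers after server_delay."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=slow_server, args=(listener, server_delay))
    worker.start()
    try:
        return fetch(port, client_timeout)
    finally:
        worker.join()
        listener.close()


# Client timeout shorter than the server delay: the read times out.
print(demo(server_delay=0.5, client_timeout=0.1))
# Client timeout longer than the server delay: the reply arrives.
print(demo(server_delay=0.1, client_timeout=0.5))
```

The point of the sketch is that the limit is enforced by whichever timeout the client socket actually carries, regardless of how long the server is prepared to work. Which setting supplies that value in the Pig and HCatalog client path is the open question of this post and is not answered by the example.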