Hive, mail # user - Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors


Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
agateaaa 2013-08-29, 18:22
Sorry hit send too soon ...

Hi All:

I put some debugging code in TUGIContainingTransport.getTransport() and
tracked the problem down to:

@Override
public TUGIContainingTransport getTransport(TTransport trans) {

  // UGI information is not available at connection setup time, it will be
  // set later via set_ugi() rpc.
  transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

  // return transMap.get(trans); // <- change
  TUGIContainingTransport retTrans = transMap.get(trans);

  if (retTrans == null) {
    LOGGER.error(" cannot find transport that was in map !!");
  } else {
    LOGGER.debug(" found transport that was in map");
  }
  return retTrans;
}

When we run this in our test environment, we see that we run into the problem
just after GC runs, and the "cannot find transport that was in map !!" message
gets logged.

Could the GC be collecting entries from transMap just before we get them?

I tried a minor change, which seems to work:

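public TUGIContainingTransport getTransport(TTransport trans) {

  TUGIContainingTransport retTrans = transMap.get(trans);

  if (retTrans == null) {
    // UGI information is not available at connection setup time, it will be
    // set later via set_ugi() rpc.
    // Create the wrapper first and keep it in a local variable, so a strong
    // reference exists while it is put into the map and returned.
    retTrans = new TUGIContainingTransport(trans);
    transMap.putIfAbsent(trans, retTrans);
  }
  return retTrans;
}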
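
The point of the change above is to keep the wrapper strongly reachable for the whole lookup, so a weak-valued map cannot lose it between the put and the return. Below is a minimal sketch of that get-or-create pattern. It assumes hypothetical stand-in types (Transport, UgiTransport) and a plain ConcurrentHashMap instead of the Thrift classes and the MapMaker weak map, so it illustrates only the control flow and is not the Hive code. It also checks the return value of putIfAbsent(), which the change above does not do.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for TTransport.
class Transport {}

// Hypothetical stand-in for TUGIContainingTransport.
class UgiTransport {
    final Transport wrapped;

    UgiTransport(Transport wrapped) {
        this.wrapped = wrapped;
    }
}

class TransportCacheSketch {
    // Stand-in for the weak-keyed, weak-valued map; a strong map is used
    // here only so the sketch runs without Guava.
    private final ConcurrentMap<Transport, UgiTransport> transMap =
            new ConcurrentHashMap<>();

    UgiTransport getTransport(Transport trans) {
        UgiTransport ret = transMap.get(trans);
        if (ret == null) {
            // The local variable is a strong reference to the new wrapper,
            // so the returned value never depends on a second map lookup.
            ret = new UgiTransport(trans);
            UgiTransport prev = transMap.putIfAbsent(trans, ret);
            if (prev != null) {
                // Another caller registered a wrapper first; use that one.
                ret = prev;
            }
        }
        return ret;
    }
}
```

With this shape, repeated calls for the same transport return the same wrapper for as long as an entry exists, and a missing entry is simply recreated.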
My questions for Hive and Thrift experts:

1.) Do we need to use a ConcurrentMap?
ConcurrentMap<TTransport, TUGIContainingTransport> transMap =
    new MapMaker().weakKeys().weakValues().makeMap();
It uses == to compare keys (which might be the problem). Also, in this
case we can't rely on trans always being in transMap, even after a put,
so the change above probably makes sense.
2.) Is it a better idea to use a WeakHashMap with WeakReference instead? (I was
looking at org.apache.thrift.transport.TSaslServerTransport, especially the
change made by THRIFT-1468.)

e.g.
private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap3 =
    Collections.synchronizedMap(
        new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

getTransport() would be something like

public TUGIContainingTransport getTransport(TTransport trans) {
  WeakReference<TUGIContainingTransport> ret = transMap3.get(trans);
  if (ret == null || ret.get() == null) {
    ret = new WeakReference<TUGIContainingTransport>(
        new TUGIContainingTransport(trans));
    transMap3.put(trans, ret); // No need for putIfAbsent().
    // Concurrent calls to getTransport() will pass in different TTransports.
  }
  return ret.get();
}
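
For comparison, here is a self-contained sketch of the WeakHashMap variant using only java.util and java.lang.ref. Conn and UgiConn are hypothetical stand-ins for TTransport and TUGIContainingTransport. A WeakHashMap holds its values strongly, and the wrapper holds a strong reference to its own key, so storing the wrapper directly would keep the entry alive forever; that is the reason for the WeakReference around the value. The sketch differs from the snippet above in one respect: it dereferences the WeakReference once into a local variable, so the method cannot return null if the GC clears the reference between the check and the return.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for TTransport.
class Conn {}

// Hypothetical stand-in for TUGIContainingTransport.
class UgiConn {
    final Conn wrapped; // strong reference back to the map key

    UgiConn(Conn wrapped) {
        this.wrapped = wrapped;
    }
}

class WeakCacheSketch {
    // Keys are weak; values are wrapped in WeakReference so the wrapper's
    // reference to its key does not pin the entry.
    private final Map<Conn, WeakReference<UgiConn>> transMap =
            Collections.synchronizedMap(
                    new WeakHashMap<Conn, WeakReference<UgiConn>>());

    UgiConn getTransport(Conn trans) {
        WeakReference<UgiConn> ref = transMap.get(trans);
        // Dereference exactly once; from here on "ret" is a strong reference.
        UgiConn ret = (ref == null) ? null : ref.get();
        if (ret == null) {
            ret = new UgiConn(trans);
            transMap.put(trans, new WeakReference<UgiConn>(ret));
        }
        return ret;
    }
}
```

As long as the caller holds on to the returned wrapper, later lookups for the same connection return that same object; once nothing references it, the entry can be collected and a fresh wrapper is created on the next call.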
I did try 1.) above in our test environment and it does seem to resolve the
problem, though I am not sure whether I am introducing any other problem.
Can someone help?
Thanks,
Agatea

On Thu, Aug 29, 2013 at 10:57 AM, agateaaa <[EMAIL PROTECTED]> wrote:

> Hi All:
>
> Put some debugging code in TUGIContainingTransport.getTransport() and I
> tracked it down to
>
> @Override
> public TUGIContainingTransport getTransport(TTransport trans) {
>
> // UGI information is not available at connection setup time, it will be
> // set later via set_ugi() rpc.
> transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
>
> //return transMap.get(trans); <-change
>           TUGIContainingTransport retTrans = transMap.get(trans);
>
>           if ( retTrans == null ) {
>
>
>
> }
>
> On Wed, Jul 31, 2013 at 9:48 AM, agateaaa <[EMAIL PROTECTED]> wrote:
>
>> Thanks Nitin
>>
>> There aren't too many connections in CLOSE_WAIT state, only one or two when
>> we run into this. Most likely it's because of a dropped connection.
>>
>> I could not find any read or write timeouts we can set for the Thrift
>> server which will tell Thrift to hold on to the client connection.
>> See https://issues.apache.org/jira/browse/HIVE-2006, but it doesn't
>> seem to have been implemented yet. We have set a client connection
>> timeout but cannot find
>> an equivalent setting for the server.
>>
>> We have a suspicion that this happens when we run two client processes