On Jan 28, 2011, at 15:50 , Greg Roelofs wrote:
> Does your .so depend on any other potentially thread-unsafe .so that other
> (non-Hadoop) processes might be using? System libraries like zlib are safe
> (else they wouldn't make very good system libraries), but maybe some other
> research library or something? (That's a long shot, but I'm pretty much
> grasping at straws here.)
Yeah, I dunno. It's a very complicated system that pulls in all kinds of popular libraries: Boost, Eigen, countless other things. I doubt any of it is being accessed concurrently, however. This is a dedicated cluster, so if my task is the only one running, then it's only concurrent with the OS itself (and the JVM and Hadoop).
>> Yes, not thread safe, but what difference could that make if I
>> don't use the library in a multi-threaded fashion. One map task,
>> one node, one Java thread calling JNI and using the native code?
>> How do thread safety issues factor into this? I admit, it's
>> my theory that threads might be involved somehow, but I don't
>> understand how, I'm just shooting in the dark since I can't
>> solve this problem any other way yet.
> Since you can reproduce it in standalone mode, can you enable core dumps
> so you can see the backtrace of the code that segfaults? Knowing what
> specifically broke and how it got there is always a big help.
Yep, I've got core dumps and I've run them through gdb. I know that the code often dies very deep inside ostensibly standard libraries, like Eigen for example, which leads me to believe the memory corruption happened long before execution reached that point.
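For anyone following along, a minimal sketch of the core-dump workflow being described (the paths and the `core.<pid>` filename are hypothetical; the exact core filename depends on /proc/sys/kernel/core_pattern):

```shell
# Allow the crashing JVM process to write a core file, then re-run the
# failing standalone-mode job:
ulimit -c unlimited 2>/dev/null

# Open the core against the exact java binary that produced it; gdb will
# resolve frames inside the JNI .so as long as it was built with symbols.
# (Echoed here rather than run, since the core file is hypothetical.)
echo "gdb \$JAVA_HOME/bin/java core.<pid>"

# Useful commands once inside gdb:
#   bt                   -> backtrace of the faulting thread
#   thread apply all bt  -> backtraces of every thread
#   info sharedlibrary   -> which .so versions were actually mapped
```

Note that `thread apply all bt` matters here: HotSpot raises SIGSEGV internally for its own purposes, so the thread gdb lands on first is not always the one that actually corrupted memory.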
> Btw, keep in mind that there are memory-related bugs that don't show up
> until there's something big in memory that pushes the code in question
> up into a region with different data patterns in it (most frequently zero
> vs. non-zero, but others are possible). IOW, maybe the code is dependent
> on uninitialized memory, but you were getting lucky when you ran it outside
> of Hadoop. Have you run it through valgrind or Purify or similar?
Valgrind has turned out to be almost useless: it can't "reach" through the JVM and JNI to the .so code. If I don't tell valgrind to follow children, it obviously produces no relevant output, but if I do tell it to follow children, it can't successfully launch a VM to run Java in:
Error occurred during initialization of VM
Unknown x64 processor: SSE2 not supported
Sigh...any thoughts on running Valgrind on Hadoop->JVM->JNI->native code?
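One hedged suggestion: the "SSE2 not supported" bail-out is typically HotSpot reading valgrind's emulated CPUID, which in older valgrind builds doesn't advertise SSE2, so a newer valgrind may get past it. If it does, a sketch of an invocation that tends to work for JVM+JNI code in standalone mode (`myjob.jar` is a stand-in for the real job):

```shell
# Follow the forked JVM, and tolerate the self-modifying code HotSpot
# generates; -Xint forces interpreted mode so the JIT produces far less
# of it and valgrind's reports stay readable.
VALGRIND_OPTS="--trace-children=yes --smc-check=all-non-file --error-limit=no"

# Echoed rather than executed, since the jar is hypothetical:
echo "valgrind $VALGRIND_OPTS java -Xint -jar myjob.jar"
```

Even then, expect a flood of false positives from the JVM itself; a suppression file (valgrind's `--suppressions=` option) is usually needed before the native library's own errors become visible.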
Keith Wiley [EMAIL PROTECTED] www.keithwiley.com
"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
-- Edwin A. Abbott, Flatland