Debugging Hadoop Core Dumps

Eric Abbott

A long time pain point for HubSpot Engineering has been intermittent core dumps with Hadoop jobs. The issue wasn't destructive; all jobs are idempotent and missing processes are restarted and rerun. Occasionally, however, the resulting queue processing delay would be enough to trigger a PagerDuty alert in the middle of the night. Having tried several times to resolve the issue, engineering was unhappily resigned to living with it -- until our recently-formed SRE team decided to take a shot at resolving the issue.

Tl;dr

  1. As expected, java is pretty flexible with minor versions in regards to compiling and runtime even with JNI involved.
  2. The -Xcheck:jni command line option can cause Hadoop seg faults if native libs are used.
  3. Don't use obscure 3rd party libs in production.

Workaround

The known workaround was to rename a hadoop directory containing *.so dynamic libraries such that they weren't found upon jvm startup. The native *.so libraries are optimized compression algorithms to be used in favor of their java equivalents. 

mv /usr/lib/hadoop/lib/native /usr/lib/hadoop/lib/native.disabled

That’s good, bad and awful. Good because there is a workaround. Bad because using the workaround adds minutes to jobs and is not viable long term. Engineers end up using the workaround for an hour or so then undo it once they think the job queue has been unclogged. Awful because its generally considered fruitless to debug 3rd party native C jni code. On the plus side the hadoop code is open source and available.

Information Gathering

Ok, so the workaround is insufficient and we need an actual fix. What are the facts on the ground?

  • Disabling the native libs allows the job to work, albiet slower.
  • Job failures are intermittent and quite often clustered.
  • Happens to hadoop jobs for multiple teams (not isolated one-off code usage of a random barely supported feature).
  • An experienced engineer already spent a week on the issue several months ago.
  • Recompiling the native .so files did not resolve the issue.
  • The trace files left behind as the jvm crashed were varied and mostly implicated the linux loader. One would assume the linux loader above reproach.
    # C  [ld-linux-x86-64.so.2+0x17412]  short+0x2

hs_err_pid4967.log

A typical thread dump on jvm failure is shown below. Traces such as these are most useful if they can pinpoint the problematic code.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fec8f7e7412, pid=4967, tid=140653696374528
#
# JRE version: 6.0_39-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.14-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [ld-linux-x86-64.so.2+0x17412]  short+0x2
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
 
---------------  T H R E A D  ---------------
 
Current thread (0x00007fec80124000):  JavaThread "pool-7-thread-1" [_thread_in_native, id=6973, stack(0x00007fec7d8bd000,0x00007fec7d9be000)]
 
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x0000000000000706
 
….
Stack: [0x00007fec7d8bd000,0x00007fec7d9be000],  sp=0x00007fec7d9b8d18,  free space=1007k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [ld-linux-x86-64.so.2+0x17412]  short+0x2
 
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;)V+0
j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+300
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+347
j  java.lang.Runtime.loadLibrary0(Ljava/lang/Class;Ljava/lang/String;)V+54
j  java.lang.System.loadLibrary(Ljava/lang/String;)V+7
j  org.apache.hadoop.util.NativeCodeLoader.()V+36
v  ~StubRoutines::call_stub
j  org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.()V+4
v  ~StubRoutines::call_stub
j  sun.reflect.NativeConstructorAccessorImpl.newInstance0(Ljava/lang/reflect/Constructor;[Ljava/lang/Object;)Ljava/lang/Object;+0
j  sun.reflect.NativeConstructorAccessorImpl.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+72
j  sun.reflect.DelegatingConstructorAccessorImpl.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+5
j  java.lang.reflect.Constructor.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+92
j  org.apache.hadoop.util.ReflectionUtils.newInstance(Ljava/lang/Class;Lorg/apache/hadoop/conf/Configuration;)Ljava/lang/Object;+46
j  org.apache.hadoop.security.Groups.(Lorg/apache/hadoop/conf/Configuration;)V+29
j  org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/security/Groups;+32
j  org.apache.hadoop.security.UserGroupInformation.initUGI(Lorg/apache/hadoop/conf/Configuration;)V+80
j  org.apache.hadoop.security.UserGroupInformation.initialize(Lorg/apache/hadoop/conf/Configuration;Z)V+1
j  org.apache.hadoop.security.UserGroupInformation.ensureInitialized()V+16
j  org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled()Z+0
j  org.apache.hadoop.security.UserGroupInformation.getLoginUser()Lorg/apache/hadoop/security/UserGroupInformation;+14
j  org.apache.hadoop.security.UserGroupInformation.getCurrentUser()Lorg/apache/hadoop/security/UserGroupInformation;+28
v  ~StubRoutines::call_stub
j  sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+0
j  sun.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+87
j  sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+6
j  java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+161
j  org.apache.hadoop.hbase.util.Methods.call(Ljava/lang/Class;Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Class;[Ljava/lang/Object;)Ljava/lang/Object;+13
j  org.apache.hadoop.hbase.security.User.call(Lorg/apache/hadoop/security/UserGroupInformation;Ljava/lang/String;[Ljava/lang/Class;[Ljava/lang/Object;)Ljava/lang/Object;+6
j  org.apache.hadoop.hbase.security.User.callStatic(Ljava/lang/String;)Ljava/lang/Object;+4
j  org.apache.hadoop.hbase.security.User.access$400(Ljava/lang/String;)Ljava/lang/Object;+1
j  org.apache.hadoop.hbase.security.User$SecureHadoopUser.()V+7
j  org.apache.hadoop.hbase.security.User$SecureHadoopUser.(Lorg/apache/hadoop/hbase/security/User$1;)V+1
j  org.apache.hadoop.hbase.security.User.getCurrent()Lorg/apache/hadoop/hbase/security/User;+11
j  org.apache.hadoop.hbase.client.HConnectionManager$HConnectionKey.(Lorg/apache/hadoop/conf/Configuration;)V+78
j  org.apache.hadoop.hbase.client.HConnectionManager.getConnection(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/hbase/client/HConnection;+5
j  org.apache.hadoop.hbase.client.HTable.(Lorg/apache/hadoop/conf/Configuration;[B)V+42
j  org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(Lorg/apache/hadoop/conf/Configuration;[B)Lorg/apache/hadoop/hbase/client/HTableInterface;+6
j  org.apache.hadoop.hbase.client.HTablePool.createHTable(Ljava/lang/String;)Lorg/apache/hadoop/hbase/client/HTableInterface;+12
j  org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(Ljava/lang/String;)Lorg/apache/hadoop/hbase/client/HTableInterface;+18
j  org.apache.hadoop.hbase.client.HTablePool.getTable(Ljava/lang/String;)Lorg/apache/hadoop/hbase/client/HTableInterface;+2
j  com.hubspot.hbase.utils.AbstractHBaseDataStore.getTable(Ljava/lang/String;)Lorg/apache/hadoop/hbase/client/HTableInterface;+62
j  com.performable.contacts.hbase.ContactsHBaseDataStore.getContactListsMetadataTable()Lorg/apache/hadoop/hbase/client/HTableInterface;+3
j  com.performable.contacts.lists.BaseContactListSegByListTask.doWork(Lcom/hubspot/taskserviceapi/client/PullTask;)Z+40
j  com.performable.contacts.lists.BaseContactListSegByListTask.executeTask(Lcom/hubspot/taskserviceapi/client/PullTask;)Z+2
j  com.hubspot.taskserviceapi.client.pull.PullTaskRunnable.run()V+94
j  java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4
j  java.util.concurrent.FutureTask$Sync.innerRun()V+30
j  java.util.concurrent.FutureTask.run()V+4
j  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V+66
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+33
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

Another with:

Current thread (0x00007f3fe25d1800):  JavaThread "ContactListSegByListTaskWorker" [_thread_in_native, id=7333, stack(0x00007f3fd64ce000,0x00007f3fd65cf000)]
 
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;)V+0
j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+300
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+347
j  java.lang.Runtime.loadLibrary0(Ljava/lang/Class;Ljava/lang/String;)V+54
j  java.lang.System.loadLibrary(Ljava/lang/String;)V+7
j  org.apache.hadoop.util.NativeCodeLoader.()V+36
v  ~StubRoutines::call_stub
j  org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.()V+4
v  ~StubRoutines::call_stub
j  sun.reflect.NativeConstructorAccessorImpl.newInstance0(Ljava/lang/reflect/Constructor;[Ljava/lang/Object;)Ljava/lang/Object;+0
j  sun.reflect.NativeConstructorAccessorImpl.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+72
j  sun.reflect.DelegatingConstructorAccessorImpl.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+5
j  java.lang.reflect.Constructor.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+92
j  org.apache.hadoop.util.ReflectionUtils.newInstance(Ljava/lang/Class;Lorg/apache/hadoop/conf/Configuration;)Ljava/lang/Object;+46
j  org.apache.hadoop.security.Groups.(Lorg/apache/hadoop/conf/Configuration;)V+29
j  org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/security/Groups;+32
j  org.apache.hadoop.security.UserGroupInformation.initUGI(Lorg/apache/hadoop/conf/Configuration;)V+80
j  org.apache.hadoop.security.UserGroupInformation.initialize(Lorg/apache/hadoop/conf/Configuration;Z)V+1
j  org.apache.hadoop.security.UserGroupInformation.ensureInitialized()V+16

Another with:

Current thread (0x00007f30cc307800):  JavaThread "ContactsFutureListSegEnqueuerTaskWorker" [_thread_in_native, id=6602, stack(0x00007f30c16ba000,0x00007f30c17bb000)]
 
C  [ld-linux-x86-64.so.2+0x17412]  short+0x2
 
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;)V+0
j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+300
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+347
j  java.lang.Runtime.loadLibrary0(Ljava/lang/Class;Ljava/lang/String;)V+54
j  java.lang.System.loadLibrary(Ljava/lang/String;)V+7
j  org.apache.hadoop.util.NativeCodeLoader.()V+36
v  ~StubRoutines::call_stub
j  org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.()V+4
v  ~StubRoutines::call_stub
j  sun.reflect.NativeConstructorAccessorImpl.newInstance0(Ljava/lang/reflect/Constructor;[Ljava/lang/Object;)Ljava/lang/Object;+0
j  sun.reflect.NativeConstructorAccessorImpl.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+72
j  sun.reflect.DelegatingConstructorAccessorImpl.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+5
j  java.lang.reflect.Constructor.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+92
j  org.apache.hadoop.util.ReflectionUtils.newInstance(Ljava/lang/Class;Lorg/apache/hadoop/conf/Configuration;)Ljava/lang/Object;+46
j  org.apache.hadoop.security.Groups.(Lorg/apache/hadoop/conf/Configuration;)V+29
j  org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/security/Groups;+32
j  org.apache.hadoop.security.UserGroupInformation.initUGI(Lorg/apache/hadoop/conf/Configuration;)V+80
j  org.apache.hadoop.security.UserGroupInformation.initialize(Lorg/apache/hadoop/conf/Configuration;Z)V+1
j  org.apache.hadoop.security.UserGroupInformation.ensureInitialized()V+16

The traces all pointed to the linux dynamic loader -- which is stock Centos 6 and does not have any issues of this type reported against it. The currently executing java thread varies, implying there are multiple ways to get to the failure situation. The java stack traces shown end after loading a native library. Somewhere inside the native code something blew up, we'll have no opportunity to debug on the java side. With no corrosponding core dumps, the trail ended here.

Native libraries

With the utility lsof we can examine the open files of a running process. Here's a partial list of open *.so files for one of our hadoop processes.

$ sudo lsof -p 5521 | grep "\.so"
java    5521 root  mem    REG              202,1    110960    132442 /lib64/libresolv-2.12.so
java    5521 root  mem    REG              202,1     27424    132430 /lib64/libnss_dns-2.12.so
java    5521 root  mem    REG              202,1     65928    132432 /lib64/libnss_files-2.12.so
java    5521 root  mem    REG              202,1     88600    132811 /lib64/libz.so.1.2.3
java    5521 root  mem    REG              202,1    234934    160387 /usr/java/jdk1.6.0_39/jre/lib/amd64/libjava.so
java    5521 root  mem    REG              202,1      8236    160386 /usr/java/jdk1.6.0_39/jre/lib/amd64/libjaas_unix.so
java    5521 root  mem    REG              202,1     78856    153875 /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
java    5521 root  mem    REG              202,1    991112     12839 /tmp/snappy-1.0.4.1-libsnappyjava.so
java    5521 root  mem    REG              202,1     48338     12653 /tmp/libjansi-64-1.9.so
java    5521 root  mem    REG              202,1     43832    132444 /lib64/librt-2.12.so

The native libs converge to four major groupings: linux system libs, jvm libs, hadoop libs and some libs in /tmp. The system (/lib64/) libs and jvm libs are expected and seem normal. The hadoop libs are certainly expected. What do stand out are the /tmp libs. It's unexpected that a library would be stored there, however snappy compression is used and ansi certainly doesn't raise any red flags. The files exist in the filesystem and have appropriate permissions.

$ ls -altr /tmp/*.so
-rwxr-xr-x 1 root root 991112 Aug  5 18:15 /tmp/snappy-1.0.4.1-libsnappyjava.so
-rwxr-xr-x 1 root root  48340 Oct 11 15:24 /tmp/libjansi-64-1.9.so

Running objdump on the libs confirmed all requirements/dependencies were present and accounted for.

-Xcheck:jni, and bugs in -Xcheck:jni

In addition to enabling core dumps we hoped the command line option -Xcheck:jni would provide some insight.

Unfortunately changing the system in two ways caused a minor amount of confusion since the -Xcheck:jni feature exposed a different bug in Hadoop. It took awhile to sift through the data and determine there were two types of core dumps. One type of core dump was similar to what the devs have been seeing all along, and the other a new one associated with the -Xcheck:jni feature.

The -Xcheck:jni core dumps were basically identical:

(gdb) bt
#0  0x00007fe11032f8a5 in raise () from /lib64/libc.so.6
#1  0x00007fe110331085 in abort () from /lib64/libc.so.6
#2  0x00007fe10fdfaa37 in os::abort(bool) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#3  0x00007fe10fc0be03 in jniCheck::validate_class(JavaThread*, _jclass*, bool) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#4  0x00007fe10fc27809 in checked_jni_GetStaticObjectField () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#5  0x00007fe0ff0d5f79 in Java_org_apache_hadoop_io_compress_zlib_ZlibCompressor_deflateBytesDirect () from /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0

Paired with the core dumps were java exceptions and stack traces in the hadoop logs.

FATAL ERROR in native method: JNI received a class argument that is not a class
at org.apache.hadoop.io.compress.zlib.ZlibCompressor.deflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibCompressor.compress(ZlibCompressor.java:362)
locked <0x00000000f8eb10e8> - locked <0x00000000f8eb10e8> (a org.apache.hadoop.io.compress.zlib.ZlibCompressor)
(a org.apache.hadoop.io.compress.zlib.ZlibCompressor)
at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:80)
at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:90)
at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1418)
locked <0x00000000f8eb0e58> - locked <0x00000000f8eb0e58> (a org.apache.hadoop.io.SequenceFile$BlockCompressWriter)
(a org.apache.hadoop.io.SequenceFile$BlockCompressWriter)
at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1433)
locked <0x00000000f8eb0e58> - locked <0x00000000f8eb0e58> (a org.apache.hadoop.io.SequenceFile$BlockCompressWriter)
(a org.apache.hadoop.io.SequenceFile$BlockCompressWriter)
at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.close(SequenceFile.java:1456)
locked <0x00000000f8eb0e58> - locked <0x00000000f8eb0e58> (a org.apache.hadoop.io.SequenceFile$BlockCompressWriter)
(a org.apache.hadoop.io.SequenceFile$BlockCompressWriter)
at com.performable.contacts.imports.csv.ContactsImportJobInputWriter.reallyWriteInputToPath(ContactsImportJobInputWriter.java:136)
at com.performable.contacts.imports.csv.ContactsImportJobInputWriter.writeInputToPath(ContactsImportJobInputWriter.java:93)

This mapped back to the following function in the Hadoop source code:

public synchronized int compress(byte[] b, int off, int len) throws IOException {
  ...
  // Re-initialize the zlib's output direct buffer
  compressedDirectBuf.rewind();
  compressedDirectBuf.limit(directBufferSize);
  // Compress data
  n = deflateBytesDirect();
  compressedDirectBuf.limit(n);
  // Check if zlib consumed all input buffer
  // set keepUncompressedBuf properly
  if (uncompressedDirectBufLen <= 0)
  ...
  return n;
}
There are no line numbers for the ZlibCompression.c code. However we are happy to observe there is only one reference to GetStaticObjectField.
JNIEXPORT jint JNICALL
Java_org_apache_hadoop_io_compress_zlib_ZlibCompressor_deflateBytesDirect(
JNIEnv *env, jobject this) {
  // Get members of ZlibCompressor
  z_stream *stream = ZSTREAM((*env)->GetLongField(env, this, ZlibCompressor_stream));
  if (!stream)
    { THROW(env, "java/lang/NullPointerException", NULL); return (jint)0; }
  // Get members of ZlibCompressor
  jobject clazz = (*env)->GetStaticObjectField(env, this, ZlibCompressor_clazz);
  jobject uncompressed_direct_buf = (*env)->GetObjectField(env, this, ZlibCompressor_uncompressedDirectBuf);
  jint uncompressed_direct_buf_off = (*env)->GetIntField(env, this, ZlibCompressor_uncompressedDirectBufOff);
  jint uncompressed_direct_buf_len = (*env)->GetIntField(env, this, ZlibCompressor_uncompressedDirectBufLen);
  jobject compressed_direct_buf = (*env)->GetObjectField(env, this, ZlibCompressor_compressedDirectBuf);
  jint compressed_direct_buf_len = (*env)->GetIntField(env, this, ZlibCompressor_directBufferSize);
  jboolean finish = (*env)->GetBooleanField(env, this, ZlibCompressor_finish);
  // Get the input direct buffer
  LOCK_CLASS(env, clazz, "ZlibCompressor");
  Bytef* uncompressed_bytes = (*env)->GetDirectBufferAddress(env,
  uncompressed_direct_buf);
  UNLOCK_CLASS(env, clazz, "ZlibCompressor");
  if (uncompressed_bytes == 0)
    { return (jint)0; }

Locating the jni method docs we find the signature for GetStaticObjectField

NativeType GetStatic<type>Field(JNIEnv *env, jclass clazz, jfieldID fieldID);

The value for jfieldID is staticly stored in field ZlibCompressor_clazz

JNIEXPORT void JNICALL
Java_org_apache_hadoop_io_compress_zlib_ZlibCompressor_initIDs( JNIEnv *env, jclass class)
{
  ...
  ZlibCompressor_clazz = (*env)->GetStaticFieldID(env, class, "clazz", "Ljava/lang/Class;");

That looks like a real issue. The fieldID ZlibCompressor_clazz is pulled from the Class object, while the call to get the field is against the object instance 'this'. Does the call always return the class object? Also, all sync in the java class is done on the instance. Syncing in the C code on the Class objects broadens the lock. All in all this seems like a potential problem.

We spent some time altering this code to properly pass the Class object through to the C code. After hacking in what appeared to be a functional replacement the original core dumps persisted.

While this may be an issue, it's not the one we're after.

Version(s) Compatibility

Back when core dumps were being enabled I poked around and determined that CDH has a recommended java version of jdk1.6.0_31 and that we were on a mixture of jdk1.6.0_27 and jdk1.6.0_39. Our java code was all compiled with jdk1.6.0_27, the CDH code was all compiled with their certified version jdk1.6.0_31.

We spent some time compiling our code on jdk1.6.0_31 and running the code on jvm _31. Still no luck, core dumps kept happening.

We had also lagged a bit on updates to our Linux distro. Presumably an OS upgrade will involve an upgrade to libc, which is certainly in the the picture and implicated by the non-Xcheck:jni core dumps.

We installed a newer distro, recompiled where needed, and still encountered issues. This did have the positive side effect of proving that our relatively lax adherence to specific versions wasn't hurting us.

Libc based core dump analysis

The non-Xcheck:jni core dumps were a bit more varied and didn't follow as precise a pattern. While they all died in the linux linker, the originating calls from java came from all over the place. There really wasn't a pattern as to what java/hadoop code could be triggering some kind of buffer overflow or other hard to find C error. The last thing we could do is to track down as closely as possible what exactly happened via gdb and hope something comes out of it.

(Un?)fortunately my previous experience with gdb never made it further than ‘bt<enter>’, so there was a good deal of thrashing around to get even rudimentary information like function parameters, register values and local variables.

The gdb stack traces all had the following pattern to them, and as is evident to your average linux hacker there are no debug symbols.

(gdb) 
#0 0x00007ff98eb9b8a5 in raise () from /lib64/libc.so.6
#1 0x00007ff98eb9d085 in abort () from /lib64/libc.so.6
#2 0x00007ff98e66c317 in os::abort(bool) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#3 0x00007ff98e7bf898 in VMError::report_and_die() () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#4 0x00007ff98e7c0411 in crash_handler(int, siginfo*, void*) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#5 
#6 0x00007ff98ec8f044 in _dl_addr () from /lib64/libc.so.6
#7 0x00007ff98e67021f in os::find(unsigned char*, outputStream*) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#8 0x00007ff98e669667 in os::print_location(outputStream*, long, bool) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#9 0x00007ff98e6735ff in os::print_register_info(outputStream*, void*) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#10 0x00007ff98e7beee3 in VMError::report(outputStream*) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#11 0x00007ff98e7bf775 in VMError::report_and_die() () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#12 0x00007ff98e672cb5 in JVM_handle_linux_signal () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#13 0x00007ff98e66efde in signalHandler(int, siginfo*, void*) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#14 
#15 0x00007ff98f334392 in strcmp () from /lib64/ld-linux-x86-64.so.2
#16 0x00007ff98f3258d7 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#17 0x00007ff98f32f9b4 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#18 0x00007ff98f32b196 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#19 0x00007ff98f32f46a in _dl_open () from /lib64/ld-linux-x86-64.so.2
#20 0x00007ff98eefcf66 in dlopen_doit () from /lib64/libdl.so.2
#21 0x00007ff98f32b196 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#22 0x00007ff98eefd29c in _dlerror_run () from /lib64/libdl.so.2
#23 0x00007ff98eefcee1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#24 0x00007ff98e66c87f in os::dll_load(char const*, char*, int) () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#25 0x00007ff98e4c9a54 in JVM_LoadLibrary () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/server/libjvm.so
#26 0x00007ff98d7a7c94 in Java_java_lang_ClassLoader_00024NativeLibrary_load () from /usr/java/jdk64_1.6.0_27/jre/lib/amd64/libjava.so
#27 0x00007ff985010d6e in ?? ()
#28 0x00007ff97f4e5508 in ?? ()

With a new test server as a playground, we grabbed the glibc debug info package. With flags like that it's hard not to come to the conclusion someone was trying to hide the debug symbols.

sudo yum --nogpgcheck --enablerepo=debug install glibc-debuginfo

Now here's a trace with the debug symbols. Arguments to functions, and even the occasional line number!

(gdb) bt
#0  0x00007f7aceca28a5 in raise () from /lib64/libc.so.6
#1  0x00007f7aceca4085 in abort () from /lib64/libc.so.6
#2  0x00007f7ace76da37 in os::abort(bool) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#3  0x00007f7ace8c1f28 in VMError::report_and_die() () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#4  0x00007f7ace8c2aa1 in crash_handler(int, siginfo*, void*) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#5  
#6  0x00007f7aced96044 in _dl_addr () from /lib64/libc.so.6
#7  0x00007f7ace77195f in os::find(unsigned char*, outputStream*) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#8  0x00007f7ace76ad87 in os::print_location(outputStream*, long, bool) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#9  0x00007f7ace774d3f in os::print_register_info(outputStream*, void*) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#10 0x00007f7ace8c1573 in VMError::report(outputStream*) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#11 0x00007f7ace8c1e05 in VMError::report_and_die() () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#12 0x00007f7ace7743f5 in JVM_handle_linux_signal () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#13 0x00007f7ace7706fe in signalHandler(int, siginfo*, void*) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#14 
#15 strcmp () at ../sysdeps/x86_64/strcmp.S:136
#16 0x00007f7acf42c8e7 in _dl_map_object (loader=0x410817a0, name=0x7f7ac04e6460 "/usr/lib/hadoop/lib/native/libhadoop.so.1.0.0", type=2, trace_mode=0, mode=-1879048191, nsid=0)
   at dl-load.c:2016
#17 0x00007f7acf436a34 in dl_open_worker (a=0x7f7abe2218c0) at dl-open.c:227
#18 0x00007f7acf4321a6 in _dl_catch_error (objname=0x7f7abe221910, errstring=0x7f7abe221908, mallocedp=0x7f7abe22191f, operate=0x7f7acf436910 <dl_open_worker>, args=0x7f7abe2218c0)
   at dl-error.c:178
#19 0x00007f7acf4364ea in _dl_open (file=0x7f7ac04e6460 "/usr/lib/hadoop/lib/native/libhadoop.so.1.0.0", mode=-2147483647, caller_dlopen=0x7f7ace76df9f, nsid=-2, argc=23,
   argv=, env=0x7f7ac8082090) at dl-open.c:569
#20 0x00007f7acf003f66 in dlopen_doit (a=0x7f7abe221ae0) at dlopen.c:67
#21 0x00007f7acf4321a6 in _dl_catch_error (objname=0x7f7ac06c5270, errstring=0x7f7ac06c5278, mallocedp=0x7f7ac06c5268, operate=0x7f7acf003f00 <dlopen_doit>, args=0x7f7abe221ae0)
   at dl-error.c:178
#22 0x00007f7acf00429c in _dlerror_run (operate=0x7f7acf003f00 <dlopen_doit>, args=0x7f7abe221ae0) at dlerror.c:164
#23 0x00007f7acf003ee1 in __dlopen (file=, mode=) at dlopen.c:88
#24 0x00007f7ace76df9f in os::dll_load(char const*, char*, int) () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#25 0x00007f7ace5cac94 in JVM_LoadLibrary () from /usr/java/jdk1.6.0_39/jre/lib/amd64/server/libjvm.so
#26 0x00007f7acd8a6d34 in Java_java_lang_ClassLoader_00024NativeLibrary_load () from /usr/java/jdk1.6.0_39/jre/lib/amd64/libjava.so

In a perfect world, the cause of the failure would be the top line of the stack trace. However raise(), abort(), os::abort() don't sound like causes but symptoms. Scanning through the entire trace, there are two signals raised. On the assumption that signals are for special occasions only, it's reasonable to assume whatever raised the signal is where a failure happened. There are two signals raised, which one should we target? The first signal could only have happened because of our error, the second could be our error or a complication from our error. Starting with the first signal should get us closest to a clean view of the error. So what's going in in frame 15 that raises a signal.

(gdb) f 15
#15 strcmp () at ../sysdeps/x86_64/strcmp.S:136
136 cmpb (%rsi), %al
Strcmp() takes two char arrays as arguments and returns an integer. Vegas odds would be on strcmp() only raising a signal if one of the passed in arguments was bad. As confirmation, the line of assembly shows the failure was on instruction cmpb (compare bytes). To find the arguments and where they came from we head to frame 16. 
(gdb) f 16
#16 0x00007f7acf42c8e7 in _dl_map_object (loader=0x410817a0, name=0x7f7ac04e6460 "/usr/lib/hadoop/lib/native/libhadoop.so.1.0.0", type=2, trace_mode=0, mode=-1879048191, nsid=0)   at dl-load.c:2016
2016  if (strcmp (name, soname) != 0)

Showing the frame in gdb with 'f 16' prints a mess of information including the method name, it's arguments, the file its from and the last executed line number in that frame. The other piece of data shown is this example, '2016 if (strcmp(name, soname) != 0)', shows the particular line of code that triggered the next frame on the stack. Lets have a look at that method.

internal_function
_dl_map_object (struct link_map *loader, const char *name,
int type, int trace_mode, int mode, Lmid_t nsid)
{
  int fd;
  char *realname;
  char *name_copy;
  struct link_map *l;
  struct filebuf fb; 

  assert (nsid >= 0);
  assert (nsid < GL(dl_nns));

  /* Look for this name among those already loaded. */
  for (l = GL(dl_ns)[nsid]._ns_loaded; l; l = l->l_next)
  {
    /* If the requested name matches the soname of a loaded object,
       use that object. Elide this check for names that have not
       yet been opened. */
    if (__builtin_expect (l->l_faked, 0) != 0
            || __builtin_expect (l->l_removed, 0) != 0)
      continue;
    if (!_dl_name_match_p (name, l))
    {
      const char *soname;

      if (__builtin_expect (l->l_soname_added, 1)
                || l->l_info[DT_SONAME] == NULL)
        continue;

        soname = ((const char *) D_PTR (l, l_info[DT_STRTAB])
                + l->l_info[DT_SONAME]->d_un.d_val);
      if (strcmp (name, soname) != 0)
        continue;

The code is a little on the dense side, but the method is called _dl_map_object and the comment before the loop is 'Look for this name among those already loaded'. 'this name' is probably referring to the name argument passed in, "/usr/lib/hadoop/lib/native/libhadoop.so.1.0.0". The things already loaded are probably the other stuff in the native libraries. So far so good. The strcmp line takes the name argument to the method along with soname. Soname appears to be generated from the stuff being looped through. So lets see what the value of soname is.

(gdb) info locals
soname = <value optimized out>
fd = <value optimized out>
realname = <value optimized out>
name_copy = <value optimized out>
l = 0x7fba082bd670
fb = {len = 140436846644544, 
buf = "7}\306\r\272\177\000\000\200\340\240\375\271\177\000\000@\215[\000\272\177\000\000\000\b?\n\272\177\000\000\000\000\000\000\000\000\000\000\260\340\240\375\271\177\000\000s\374\306\r\272\177\000\000\000\b?\n\272\177\000\000\000\000\000\000\000\000\000\000\300\340\240\375\271\177\000\000\357\372\306\r\272\177\000\000\330\363\036\016\272\177\000\000\000\020\323\016\272\177\000\000\360\340\240\375\271\177\000\000s\374\306\r\272\177\000\000\000\b?\n\272\177\000\000\330\363\036\016\272\177\000\000\340X\036\016\272\177\000\000\000\020\323\016\272\177\000\000P\341\240\375\271\177\000\000$.\307\r\272\177\000\000\000\b?\n\272\177\000\000'\000\000\000\000\000\000\000\000\b?\n\272\177\000\000 \000\000\000\272\177\000\000\340\210\257\t\272\177\000\000'\000\000\000\000\000\000\000@\215[\000\272\177\000\000\000\000\000\000\272\177", '\000' "\272, \177\000\000\060\362\240\375\271\177\000\000\320\t?\n\272\177\000\000\220\t\030@\a\000\000\000\360\210\257\t\272\177\000\000\000"...}
found_other_class = <value optimized out>
stack_end = <value optimized out>
(gdb) print soname
$1 = <value optimized out>

Soname's value is <value optimized out> and according to a question on stackoverflow <value optimized out> means the value resides in a register. Printing out the registers didn't reveal much, but then again we're not really looking for the value soname which we know caused strcmp() to blow up. We want to see the structures and data that soname was created from. Looking back at the C code soname was generated from l and l_info. Gdb seems pretty smart and will do a great job of printing out the structure of l for us.

(gdb) print l
$1 = (struct link_map *) 0x7fec89ba3ad0

(gdb) print *l
$2 = {l_addr = 140165160083456, l_name = 0x7f7ac9d52210 "/tmp/libjansi-64-1.9.so", l_ld = 0x7f7abeb3ade0, l_next = 0x7f7ac9db7a50, l_prev = 0x7f7ac9d94920, l_real = 0x7f7ac9e836f0, l_ns = 0, l_libname = 0x7f7ac9e83b60, l_info = {0x0, 0x7f7abeb3ade0, 0x7f7abeb3ae80, 0x7f7abeb3ae70, 0x0, 0x7f7abeb3ae30, 0x7f7abeb3ae40, 0x7f7abeb3aeb0, 0x7f7abeb3aec0, 0x7f7abeb3aed0, 0x7f7abeb3ae50, 0x7f7abeb3ae60, 0x7f7abeb3ae00, 0x7f7abeb3ae10, 0x7f7abeb3adf0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f7abeb3ae90, 0x0, 0x0, 0x7f7abeb3aea0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f7abeb3aef0, 0x7f7abeb3aee0, 0x0, 0x0, 0x0, 0x0, 0x7f7abeb3af10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f7abeb3af00, 0x0 , 0x7f7abeb3ae20}, l_phdr = 0x7f7abe93a040, l_entry = 140165160085632, l_phnum = 5, l_ldnum = 25, l_searchlist = {r_list = 0x7f7ac9d523a0, r_nlist = 3}, l_symbolic_searchlist = {r_list = 0x7f7ac9e83b58, r_nlist = 0}, l_loader = 0x0, l_versions = 0x7f7ac9c85dc0, l_nversions = 3, l_nbuckets = 17, l_gnu_bitmask_idxbits = 1, l_gnu_shift = 7, l_gnu_bitmask = 0x7f7abe93a168, {l_gnu_buckets = 0x7f7abe93a178, l_chain = 0x7f7abe93a178}, {l_gnu_chain_zero = 0x7f7abe93a1a4, l_buckets = 0x7f7abe93a1a4}, l_direct_opencount = 1, l_type = lt_loaded, l_relocated = 1, l_init_called = 1, l_global = 0, l_reserved = 0, l_phdr_allocated = 0, l_soname_added = 0, l_faked = 0, l_need_tls_init = 0, l_auditing = 0, l_audit_any_plt = 0, l_removed = 0, l_contiguous = 1, l_symbolic_in_local_scope = 0, l_free_initfini = 1, l_rpath_dirs = {dirs = 0x0, malloced = 0}, l_reloc_result = 0x0, l_versyms = 0x7f7abe93a722, l_origin = 0x7f7ac8ac9310 "/tmp", l_map_start = 140165160083456, l_map_end = 140165162184664, l_text_end = 140165160087552, l_scope_mem = {0x7f7acf645440, 0x7f7ac9e839a8, 0x0, 0x0}, l_scope_max = 4, l_scope = 0x7f7ac9e83a48, l_local_scope = {0x7f7ac9e839a8, 0x0}, l_dev = 51713, l_ino = 12837, l_runpath_dirs = {dirs = 0x0, malloced = 0}, l_initfini = 0x7f7ac9d52380, l_reldeps = 0x0, l_reldepsmax = 0, l_used = 1, l_feature_1 = 0, l_flags_1 = 0, l_flags = 0, l_idx = 0, l_mach = {plt = 0, gotplt = 0, tlsdesc_table = 0x0}, l_lookup_cache = {sym = 0x7f7abe93a328, type_class = 0, value = 0x7f7ac9e836f0, ret = 0x7f7abe93a328}, l_tls_initimage = 0x0, l_tls_initimage_size = 0, l_tls_blocksize = 0, l_tls_align = 0, l_tls_firstbyte_offset = 0, l_tls_offset = 0, l_tls_modid = 0, l_relro_addr = 0, l_relro_size = 0, l_serial = 15, l_audit = 0x7f7ac9e836f0}

And now we've got a view of the struct and the field names and values/addresses. At first glance its all a bit incomprehensible, except for the char array l_name = "/tmp/libjansi-64-1.9.so" sticking out. So while looking for a native hadoop lib, it loops through all native stuff loaded and seg faults on a non-hadoop native library? What is that ansi lib for anyway? Google finds the library at jansi.fusesource.org, and it isn't some internal jvm ansi lib as originally assumed. According to the website jansi is for "Eliminating boring console output". Why would a java library to eliminate boring console output have a native dynamic shared object? More practically, how could a java jar have a packaged shared object? A quick look at the other core dumps verifies they all died in the _dl_map_object method while examining the libjansi.so object.

A bit more research turns up the pattern for including native code in a .jar file. Our codebase had two such libraries, snappy compression and jansi. During static initialization a java class will reach into the jar file with the class loader and extract an appropriate dynamic library dropping it onto the file system, typically in the tmp directory. After extraction the java code will run System.load() or System.loadLibrary() making the native code available. One integral part of the pattern that jansi appears to have missed is to only extract the binary if it's not already present on the filesystem. Looking at /tmp we note the time differences in the two *.so files. The libjansi timestamp being younger than the libsnappy is a clue the jansi code overwrites its binary each time its loaded.

$ ls -altr /tmp/*.so
-rwxr-xr-x 1 root root 991112 Aug 5 18:15 /tmp/snappy-1.0.4.1-libsnappyjava.so
-rwxr-xr-x 1 root root 48340 Oct 11 15:24 /tmp/libjansi-64-1.9.so

So at this point the working theory is that the linux dynamic loader gets fatally tripped up because a native library loaded by the jvm is overwritten with a binary duplicate of itself. The failure would appear to be triggered by the hadoop library attempting to load/access it's own native routines. Almost certainly there is a race condition involved, and the longer a server is unchanged the more likely it is to be stable and vice versa.   

While my education of shared objects and linking in linux is more than a decade out of date, I was pretty firm in my belief that once an object was loaded you couldn't get rid of it, kind of like how you can delete a file from the file system but the file will stick around and be visible with lsof if there is an open filehandle. In this case however verification was straightforward; on a system that hasn't been cored in several hours overwrite the file with itself and wait. Since there's also a race condition involved, we'd probably have to run the test several times w/o causing a core dump to be confident this theory was incorrect. On a server running w/o incident for 8hrs the libjansi binary was copied over itself; 15m later there was a core dump.

Boom we have our culprit. It took an hour to remove jansi and redeploy, and the core dumps were gone.

Jansi's Purpose

Jansi was originally added to provide colored consoled output in a developers IDE. Adoption never made it past the initial use case of identifying the origin source for configuration parameters. Only useful in development and never used after container startup, no one complained after it was removed.

(ZK) abemail.variant.delete.error.output=true
(ZK) abemail.wait.on.missing.metadata.millis=5000
(DEF) addons.cache.prefix=aA
(DEF) addons.memory.expiry.millis=60000
(DEF) airbrake.apikey=**(REDACTED)**
(DEF) airbrake.notice.api.url=https://airbrake.io/notifier_api/v2/notices
(ZK) analytics.api.batchSize=100000
(ZK) analytics.api.cache.endpoints=true

And for a bit of HubSpot color, there was a spirited debate as to the merits of the original PR, with the cons argument now getting the last laugh.

Conclusion

Don't include obscure third party libraries in production. Since QA and development will often mirror production, unnecessary libs are right out.

All the time spent tweaking different compilation and runtime java minor versions had no noticeable effect. Overall that's a good thing to have a positive experience with java's Compile Once, Run Anywhere claim. This held true even with the complication of third party native C libs.

And finally we learned that native jni C libs are pretty scary when compared to/combined with the java world. The addition of the jvm command line arg -Xcheck:jni showed that several year old native C code was not written correctly according to the jni spec. Yet the code has been battle tested in the field and is working well enough.