Out of Kernel Memory:
An Analysis Session

 

This writeup was originally an article written for the Sun Catalyst newsletter (January 96). I've edited out some of the introduction so that we can jump right into the technical details.

Due to the age of the original article, some of the technical data limits mentioned are no longer applicable. And I've certainly seen crash dumps larger than 200mb since 1996!     :-)

...Let's focus on one example of system panics and hangs. Using a crash dump from a real customer, I will show you how I was able to debug and resolve a situation where the customer's server was occasionally hanging...

System History

When analysing a crash dump, there are two common approaches used. Some engineers prefer to jump right into the crash dump and have a dig around first, while others like to get a feel for the history of the system, then look at the crash. Personally, I usually tend to ask about the system beforehand.

The system history can make a big difference in how I approach a crash dump. If the system only crashes or hangs at 3am on Tuesdays, I'm going to want to look at the cron jobs. If the system has a recent history of random crashes, hangs, watchdog resets, and sudden drops down to the OK prompt, I'm going to be very suspicious about the reliability of the hardware.

Unfortunately, too many customers spend hundreds of thousands of dollars on a large computer system, but won't invest the additional $1.95 and few minutes a day to maintain a system log note pad kept near the console. So, all too often, getting a good idea of the history of a system is not always easily achieved.

System Hangs

System hangs are generally nasty situations to debug. They can be caused by everything from faulty hardware to kernel-level bugs. When the hangs occur, they can be instantaneous, or the system might creep and crawl for a while and then finally lock up.

To analyze a system hang, one first needs to get a snapshot of the hung system. To do this, we convert the hang into a panic by forcing the system to an abrupt halt (L1-A) and then running the boot prom command: sync. Hopefully, depending on the reason for the hang, the system will cooperate, allow you to reach the boot prom, and the panic() routine will dump the image of memory to the primary swap device.

Not all hangs can be converted into system panics. When a system is suffering from hard, uninterruptable hangs, you are stuck with cycling power, and losing the contents of memory and thus the evidence as to why the system was hung.

Preparing for a Panic

If you know the system is hanging or panic'ing, get savecore ready to go. On Solaris 1 systems (SunOS 4.X), edit the /etc/rc.local shell script. On Solaris 2 systems (SunOS 5.X), savecore is called from the /etc/rc2.d/S20sysetup script.

Always make sure that savecore is pointing to a directory which has enough space to hold a few crash dumps without causing trouble. For example, do not use /var (which is the default) if /var is on your root filesystem. Have savecore put the crashes somewhere where you've got room to spare.

Also, when preparing for a panic, look at your primary swap device. When a system is panic'ing, it gets back to basics. Fancy disk mirroring and other disk management techniques can sometimes baffle the panic() routine. Also, panic() doesn't know anything about how multiple swap partitions can be used to make up one large swap area. Remember, keep the primary swap partition simple and straight forward!

Make sure your primary swap partition is large enough to hold an image of your system's memory. If you have a massive server with 2gb of memory, you can probably safely limit the primary swap partition to 300mb. To date, I've never seen a crash dump larger than that.

A Hung SC2000

Let's look at the customer's system hang now.

This server was hanging occasionally and also panic'ing at times. The customer turned on savecore and soon captured a crash image of one of the panics. I found that the panic was caused by a known kernel bug, one which was addressed in a recent version of the jumbo kernel patch. The patch was installed soon after. However, the hangs continued.

After forcing one of the hangs into a panic via L1-A and sync, the customer forwarded the files to me for analysis. The vmcore file from this crash was quite large, over 200mb in size. The unix file was around 1 mb, quite normal.

Initial Analysis - The Strings Command

Using strings, I always take a look at the vmcore file first. This way I can see right away what the panic message was and usually which version of the OS is running. Here are the important "clips" from the strings output.

  Hiya...  strings vmcore.1 | more 
  Generic_101945-33                   <--- Kernel patch
  zero
  rfsproccnt_v2
  misc
  With SMT7.2 Version 2.3.1  04189
  Network Peripherals SBus FDDI
  nfs_server
  misc
  unix
  rpc_server
  ....
  .... (stuff edited out)
  ....
  nf0 at sbi3:  SBus3 slot 1 0x7ff0 SBus level 4 sparc ipl 7
  nf0 is /io-unit@f,e3200000/sbi@0,0/nf@1,7ff0
  dump on /dev/dsk/c1t2d0s1 size 1024452K
  panic[cpu0]/thread=0xe3901ec0: zero         <--- Panic message
  syncing file systems... 4830 4830 4830 4830 4830 done
  49479 static and sysmap kernel pages
    546 dynamic kernel data pages
    620 kernel-pageable pages
      7 segkmap kernel pages
  ....
  ...  (output continues)

This core dump actually contains 604,179 lines of strings output. Fortunately, the kernel message buffer is up near the front of crash dumps.

As we see here, the jumbo kernel patch for Solaris 2.4 is installed. At the time of the crash, this was a fairly recent release. However, it is interesting to note that rev -33 never was released formally, so the customer was trying a "T" or temporary patch. Revision -33 was not found to be a problem during its testing, so I have no complaint with the customer running this revision for the moment.

Also in this output, we see the panic message which appeared on the console. A "panic: zero" is indicative of someone forcing a hang to become a panic. So, I now know that this is a hung system, which means I will not be doing a stack traceback. Instead, I'm going to have to look at ALL of the threads of the system to see who's doing what. This is not a difficult task, however, considering the massive size of the crash dump, this will be a long, tedious one.

Uptime

Before we look at the threads, just out of curiousity, let's see how long the system was up before it hung. We can do this using the absolute debugger, adb. To do this, we need to be on a sun4d system running Solaris 2.4, a system similar to the one which crashed.

  Hiya...  adb -k unix.1 vmcore.1
  physmem 23d48
  time/Y
  time:
  time:            1995 Oct 30 00:56:50
  *time-(*lbolt%0t100)=Y
                   1995 Oct 25 07:16:58

Time and lbolt are two kernel symbols used to maintain the current time, time, and the number of hundredths of a second since boot time, lbolt. Using a simple equation within adb, we can get the time of the crash, and thus the uptime of the system. In this case, this server had been up less than 5 days.

Tracing All Threads

If this system had simply panic'ed, we would go look at the thread which triggered the panic. However, we already know that the system was hung. Thus, the thread which panic'ed in this crash dump wasn't necessarily one of the threads which were hung. In fact, chances are excellent that we will find several hung threads.

Okay, it's time to get a stack trace of all of the threads. Since this is guaranteed to generate a lot of output (1937 lines in this case), we will tell adb to redirect output for the moment into a file called fred. Note that we are still in the adb session started above.


  $> fred
  $< threadlist
  $>
  time/Y
  time:
  time:           1995 Oct 30 00:56:50

The $>fred command sends adb output to file fred. The next command says to run the threadlist macro which generates the output we require. (For Solaris 1 systems, we would use the traceall macro instead.) Typing ahead, $> says to stop sending output to fred. Then, just so I know adb is done, I've asked to see time again.

Here are some snippets from the threadlist output.


                  thread_id e3922ec0
  ?(?) + effff4a8
  cv_wait(0xe01c5b7c,0xe01c5b7c,0x2705bef,0x80000000,0xe01c8ac4,0x0) + 44
  callout_thread(0x0,0x0,0xe01d6950,0xe01d6950,0xe01c5b90,0x0) + 1c
  
                  thread_id e3b08ec0
  ?(?) + effff4a8
  poll_obp_mbox(0xfc,0xfc,0x0,0x404010e6,0x5b0c0,0x0) + bc
  level14_handler(0xe3b08d54) + 4
  L14_front_end(?) + 68
  spl5(0x404000e0)
  disp_getwork(0xf7660d20,0x404000e0,0x0,0xffffffff,0xe01c13f0,0x0) + c
  idle(0xf7660d20,0x0,0xf7660d24,0xf7660e3c,0xe,0x0) + cc
 
                  thread_id f76eda40
  ?(?) + effff4a8
  cv_wait_sig(0x0,0xe01ccf00,0xf76fa9a0,0xf76faa0c,0xf775caa8,0x0)
  strwaitbuf(0xf4c8,0x1,0xffffffff,0xfffffffc,0xf7854198,0x0) + 50
  strmakemsg(0x0,0x1,0x0,0x1,0xe3d2ed74,0x58) + 1bc
  strputmsg(0x0,0xf76ae500,0xe3d2ee70,0x0,0xf76ae54c,0x3) + 294
  msgio(0xe3d2efe0,0xe3d2ef50,0x2,0xe3d2eeef,0xe3d2eee8,0x3) + 34c
  putmsg(0xe3d2efe0,0xe3d2ef50,0x7ff,0x3,0x0,0x29ea8) + 4c
  syscall_ap(0x3) + 6c

                  thread_id f8ae7480
  ?(?) + effff4a8
  cv_wait_sig_swap(0x0,0xf8c83f24,0xf8db4000,0xf8c83ed8,0xf8ac4048,0x0)
  fifo_read(0xf7443264,0xe3eb3e48,0xf8c83ed8,0xf8c7fd28,0x0,0x0) + 12c
  rw(0xe3eb3fe0,0xe3eb3f50,0x1,0x0,0x1,0x204) + 2f8
  syscall_ap(0x0) + 6c

These are pretty common threads. Nothing odd going on here.

The 2nd thread shows an idle CPU waking up to acknowledge that the system was panic'ing. In fact, on this system, 19 out of 20 CPUs were idle. One was actually processing the "panic: zero". The fact that the system was completely idle is a clue to the problem we will find.

Tucked away in the output, we find some not so normal looking threads. As a matter of fact, there were quite a few threads which could be added to this list.


                  thread_id fc35ec60
  ?(?) + effff4a8
  cv_wait(0xf7204010,0xf7204010,0xe1af63ff,0x0,0xf89b5564,0x0)
  rmalloc_wait(0xf7204000,0x1,0xf7204018,0x1,0xf7204010,0x0) + 3c
  kmem_getpages(0x1,0x0,0xe01ccc00,0x1,0x1,0x0) + 78
  kmem_page_alloc(0x1,0x0,0xe1af63ff,0x1000,0x0,0xe01d7044) + 154
  kmem_slab_create(0xf720d8a0,0x0,0x0,0x100,0x80,0xe01b92fc) + 5c
  kmem_cache_alloc(?) + 64
  kmem_alloc(0xf720d8a0,0x0,0xe1af63ff,0xfa73ccea,0x0,0xf720d8c8)
  vol_enqueue(0x3,0xe3fa1e28,0x0,0x1,0x0,0x0) + 8
  volclose(?) + 1a0
  device_close(0xfa73cca8,0x3,0x2,0xf7f05558,0xf89b54f4,0xfcd1ea4c)
  spec_close(0xf89b5b54,0x3,0xf89b5564,0xf89b556c,0xf89b5564,0x0) + d8
  closef(0xf9d45540,0xe3fa1f50,0x16c0000,0xe3fa1f50,0x0,0xf89b5b54) + 68
  syscall_ap(0x7) + 6c

  
                  thread_id f935e140
  ?(?) + effff4a8
  cv_wait(0xf7204010,0xf7204010,0x400000e0,0x0,0x400000e0,0x0)
  rmalloc_wait(0xf7204000,0x1,0xf7204018,0x1,0xf7204010,0x0) + 3c
  ppmapin(0xe0930a7c,0x3,0xffffffff,0xe01c5578,0xe0930a7c,0xf6828000) + c
  pagezero(0xe0930a7c,0xc00,0x400,0xf6827000,0xe0930a7c,0xe3e3bb24) + 38
  ufs_getpage_miss(0xf8187a68,0x9000,0xe0930a7c,0xf73d7fa0,0x0,0x0) + 190
  ufs_getpage(0xf8187a68,0xf6827000,0x9000,0x1000,0x1000,0xe3e3bc58) + 588
  segmap_fault(0xf729b000,0xf73d7fa0,0xf6827000,0x1000,0x0,0x2) + e4
  segmap_getmapflt(0xf73d7fa0,0x1,0x1800,0xf7224fc0,0xf6826000,0x1bff) + 308
  wrip(?) + 434
  rwip(0xf8187a60,0x8000,0x1800,0x0,0x400,0x1)
  ufs_write(0xf8187a68,0xe3e3be48,0x0,0xf8febf68,0x8058,0xf8187ac8) + 1a8
  rw(0xe3e3bfe0,0xe3e3bf50,0x2,0x0,0x0,0x400) + 2f8
  syscall_ap(0x4) + 6c

Threads shouldn't have to wait for kernel memory, but that is what we see happening here.

Crash & kmastat

At this point, we get of out adb and use another utility. The crash program is a good tool for dumping kernel memory allocation statistics. Again, the output is a bit lengthy, so I'll show you the things that caught my eye.

  Hiya...  crash -d vmcore.1 -n unix.1 
  dumpfile = vmcore.1, namelist = unix.1, outfile = stdout
  > kmastat 
                         buf   buf   buf   memory    #allocations
  cache name            size avail total   in use    succeed fail
  ----------           ----- ----- ----- --------    ------- ----
  ....
  kmem_alloc_240         240    15    16     4096       7667    0
  kmem_alloc_256         256   421   765   208896     721890    0
  kmem_alloc_320         320     2 261312 89194496     328393    0
  kmem_alloc_384         384  1265  6430  2633728     127825    0
  ....
  inode_cache            320    41 20340  6942720      30188    0
  fnode_cache            132    20    28     4096      41553    0
  pipe_cache             252    38   105    28672     206902    0
  ----------           ----- ----- ----- --------    ------- ----
  permanent                -     -     -   155648       1358   43
  oversize                 -     -     -  4788224     114667   10
  ----------           ----- ----- ----- --------    ------- ----
  Total                    -     -     - 143306752 1688297597 31308688
  > 

On this system, the kernel map is huge! The hard limit on the kernel map is around 140mb for this older release of the Solaris. We've hit the limit. But why? The kernel isn't supposed to eat up memory!

The kmem_alloc_320 pool usage is much too heavy. Digging through SunSolve, we find that there were a couple of kernel memory leak problems, but the of one most interesting was a bug which talks about how the volume manager used to eat up kmem_alloc_320. This was fixed in the volume manager jumbo patch, which the customer now has installed. No system hangs of this type have been seen on this server since installing the patch provided.