High Load Average (SUMMARY)
2007-12-25 7:22:00
>System: 4/380
>OS: 4.1.1
>
>Patches:
>
>100173-03 NFS Jumbo Patch.
>100174-01 Fix tmpfs bugs.
>100188-01 TIOCCONS bug.
>100192-01 Fix color problem (white on white instead of white on black).
>
>Problem:
>One of our users reported the CONSOLE froze up early this morning while using X
>(MIT version). He tried L1-A, and nothing happened. He had to power the system
>off and on to get it to reboot.
>
>Later this morning, I started up X, and xterm'd a few windows, when the same
>thing happened to me. I tried *lots* of L1-A's, and after a while, I got a
>response. I tried continuing, to see what would happen. However, the
>system gave me abort messages when I tried to continue. So, I decided to force
>a dump. I was able to bring up the system in single user mode, and save the
>dump to a file. I ran a "ps" command on the dump and found 28 actively running
>processes, including:
>
> swapper ypserv in.routed syslogd
> nfsd update cron
After a few more occurrences, I remembered that there had been some mention of
a high load average problem on a previous sun-managers posting. Upon checking
around on my system, I found the following:
>Date: Wed, 21 Nov 90 17:11:21 EST
>From: Kennedy Lemke <Kennedy_J_Lemke@princeton.edu>
>Subject: Load peaking problem
>
>We've been having a strange problem; I hope someone else has
>seen this and knows how to fix it: we have a Sun 4/490 with
>one local IPI disk attached (a 1.2 GB CDC 9720). We are running
>SunOS 4.1. On this system, we have on the average about 50
>users running mostly interactive programs, with a few long-running
>number-crunching programs as well. We have a total of around
>3000 users, whose files live on a server machine.
>
>Soon after we made the system available, we started seeing this
>problem: occasionally, the load average of the machine shoots up
>very high (to anywhere between 10 and 100--we see this clearly
>with xnetload). The load stays high for perhaps 10 to 60 seconds,
>then returns to normal almost as quickly. During the time that
>the load is rising, all processes on the machine seem to be "hung".
>For example, if I press "return" at a prompt, nothing echoes on
>my screen, and I don't get another prompt.
>
>Once when this was happening, we halted the machine and got a
>core dump (with "g 0"). We examined the processes from the dump,
>and sure enough there were about 70 runnable processes (with an
>"R" in the state column from ps).
>
>After a while, a user noticed that this seemed to be happening
>whenever he did "ls -l" on /dev. We confirmed that this was the
>problem. trace showed that the machine was hanging when stat(2)
>was called with /dev/id000b as the first argument; we do paging
>to the local IPI disk, of course on this partition.
>
>So today I brought the machine down, removed id000b, and did
>"MAKEDEV id000" (creating a new id000b node), but I get the
>same results.
>
>Have any of you experienced a similar problem? Anybody suspect
>this is a hardware problem? We have not noticed any problems with
>paging activity and the like--this all seems normal. This only
>occurs when stat(2) is called on /dev/id000b (and only when the
>machine is in multiuser mode).
The summary was:
>Date: Thu, 29 Nov 90 00:18:22 EST
>From: Kennedy Lemke <Kennedy_J_Lemke@princeton.edu>
>Subject: Re: Load peaking problem (SUMMARY)
>
>About a week ago I posted a query about the load average on my
>Sun 4/490 going out of control whenever stat(2) was called on
>/dev/id000b. I received 5 responses to the query with various
>good advice (installing the NFS jumbo patch, which I had done,
>installing the PMEG patch, which I hadn't done, increasing the
>number of maxusers, etc.).
>
>The easiest and most obvious solution came from trinkle@cs.purdue.edu
>who suggested simply removing the device node altogether (which
>I didn't know I could do). I did so, and now I haven't seen the
>problem since. I don't know the exact cause of the problem, nor
>the "perfect" fix, but this has done the trick for us.
>
>Perhaps this problem won't appear in 4.1.1 :-) [and maybe it will...kwt]
It then occurred to me that I made some extra tty and pty devices yesterday
as we seemed to be running low. I didn't check to see if the kernel was
configured to use them. I accessed them via stat(2), which at least didn't
hang in 4.1.1. However, when I ran X, and the device files were there, I would
always hang the system. When I removed the device files, and ran X, I would
have no problem.
Whether it was X itself or an application that was trying to do something with
the ttys? or ptys? devices, I don't know.
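
For anyone wanting to check which tty and pty nodes exist before comparing them
against the kernel configuration, here is a minimal sketch of how the nodes could
be listed with their major and minor numbers. It is written in Python, which was
not part of SunOS 4.1.1, so treat it as an illustration for a modern Unix system
and not as what was run here. The function name list_device_nodes and the
patterns "tty*" and "pty*" are assumptions made for the example. It only calls
lstat on each node and never opens a device. Note that on the 4.1 system in the
quoted report even stat(2) on one node caused the hang, so this is only safe
where stat itself is known to return.

```python
import fnmatch
import os
import stat


def list_device_nodes(directory="/dev", patterns=("tty*", "pty*")):
    """Return (name, kind, major, minor) for matching device nodes.

    Only lstat() is used, so no device is opened. Entries that are not
    character or block devices are skipped.
    """
    nodes = []
    for name in sorted(os.listdir(directory)):
        # Keep only names matching one of the (hypothetical) patterns.
        if not any(fnmatch.fnmatch(name, p) for p in patterns):
            continue
        path = os.path.join(directory, name)
        try:
            st = os.lstat(path)
        except OSError:
            # The node vanished or is unreadable; skip it.
            continue
        if stat.S_ISCHR(st.st_mode):
            kind = "char"
        elif stat.S_ISBLK(st.st_mode):
            kind = "block"
        else:
            continue
        nodes.append((name, kind, os.major(st.st_rdev), os.minor(st.st_rdev)))
    return nodes


if __name__ == "__main__":
    for name, kind, major, minor in list_device_nodes():
        print(f"{name:12s} {kind:5s} {major:4d},{minor:4d}")
```

The resulting list can be compared by hand against the pseudo-tty count the
kernel was built with; nodes beyond that count are candidates for removal.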
If there is an OS patch to avoid this problem, I'd appreciate hearing about it.
Kevin W. Thomas
National Severe Storms Laboratory
Norman, Oklahoma

