(2): File system full

2007-12-25 8:51:00

Dear managers,

I have received many answers to my question, so I will summarize them,

and inform you of the cause of my problems.

What happened was:

The machine is a print spool server. One client sends big files to the
server as root. After the files are printed, the temporary spool files on
the server are not removed, because of a bug in the implementation of the
lp protocol. Someone then tried to rm the spool files, and rm hung. rm was
killed, and that left the filesystem in a mess. I had to run fsck on /
about six times before it stopped complaining. I have installed the newest
printer patch, 101317-12, on the server. I hope this fixes the problem.

Thanks to

Jim Wright <jwright@phy.ucsf.edu>

Ed Haggerty <haggerty_edward@jpmorgan.com>

jerryr@gcm.com (Jerry Ratner)

"Jan L. Peterson" <jlp@math.byu.edu>

Wolli Steiner <Wolli.Steiner@Rhein.DE>

John Benjamins <johnb@blas.cis.mcmaster.ca>

konc@fnts07.fnal.gov (John Konc - Fermi National Accelerator Lab.)

stlee@alc.com

Steven Overhauser <spo@ee.duke.edu>

Graeme Robertson <graemer@unisys.co.nz>

Stuart.Roe@ncl.ac.uk (Stuart Roe)

Mike Rembis 66520 <ebumfr@ebu.ericsson.se>

thomas@wiwi.hu-berlin.de (Thomas Koetter)

mike@trdlnk.com (Michael Sullivan)

Ric Anderson <ric@seagull.rtd.com>

lar@trib.com (Larry Ash)

vic@raven1.imatron.com (Victor Churchill)

rscott@otter.wsipc.wednet.edu (Rob Scott)

George Pallas <gpallas@freenet.columbus.oh.us>

dav@ipc.litronic.com (David L. Markowitz)

Gene Rackow <rackow@mcs.anl.gov>

John Goggin - LTX Tech Support <jgoggin@ltx.com>

stuart@TO.mobil.com (Stuart Pearlman - RDR)

They gave me the following advice. If you are interested in the
complete answers, I'll be glad to send them to you.

1)

If I had removed a file that is still in use by another process, its
disk space remains unavailable until the process terminates. To fix
this, find the process and kill it, or reboot. More on finding the
process below.
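
To see this behaviour for yourself, here is a minimal sketch, assuming a
POSIX shell and the mktemp utility. The function name read_after_unlink
and the file contents are made up for illustration. It keeps a file open
on descriptor 3, removes the name, and then still reads the data back:

```shell
# Show that rm only removes the name: the data stays on disk
# for as long as some process keeps the file open.
read_after_unlink() {
    f=$(mktemp) || return 1
    printf 'spool data\n' > "$f"
    exec 3< "$f"          # keep the file open on descriptor 3
    rm -f "$f"            # the directory entry is gone now
    if [ -e "$f" ]; then  # sanity check: the name must not exist
        exec 3<&-
        return 1
    fi
    cat <&3               # the contents are still readable
    exec 3<&-             # closing the descriptor lets the space be freed
}

read_after_unlink
```

Until the last descriptor is closed, df keeps counting the blocks as
used, while du no longer finds the file in any directory.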

2)

In SunOS 4, files sometimes lose their connection to their parent
directory. The Openwin file manager is known for this. To fix this, run
fsck -f /dev/sd0a (or sd0g or .....)

3)

About files with holes, I got two pieces of advice:

--If I had sparse files (files with holes), my problem would be

reversed: df would report *less* space than du.

--Files with holes would not cause this effect. df, du and ls -ls all

correctly report the actual disk usage of files with holes.
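
Since the two answers differ, it may be easiest to check on your own
system. The following is a rough sketch, assuming a POSIX shell with dd,
wc, du and awk. The function names are hypothetical. It creates a file
with a hole and prints its apparent size next to the space actually
allocated to it:

```shell
# Create a file of the given size that consists almost entirely of a hole:
# only the very last byte is written.
make_sparse_file() {
    # $1: path, $2: size in bytes (must be at least 1)
    dd if=/dev/zero of="$1" bs=1 count=1 seek=$(( $2 - 1 )) 2>/dev/null
}

# Size as reported by the byte count (what ls -l shows).
apparent_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Space actually allocated on disk, in KB (what du counts).
allocated_kb() {
    du -k "$1" | awk '{ print $1 }'
}

# Example:
#   make_sparse_file /tmp/holey 1048576
#   apparent_bytes /tmp/holey     # 1048576
#   allocated_kb /tmp/holey       # usually far less than 1024, if the
#                                 # filesystem supports holes
```

Whether the allocated figure is small depends on the filesystem; not
every filesystem stores holes sparsely.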

4)

A remote printer was not working; as a result, huge amounts of messages
were being sent to /var/lp/logs/lpNet and /var/lp/logs/lpSched,
filling up the filesystem. Removing the files did not free up the
disk space! It wasn't until I killed the lpNet and lpSched daemons
and restarted lpSched that the disk space appeared. If this is the
problem you're having, a simple reboot should fix it.

5)

Memory problems:

I had a Sparc II do something similar, but I was using 4.1.3. It turned
out to be a problem with memory. I swapped the RAM with a different
motherboard and the problem went away, and never came back in either
machine. Really weird.

6)

The filesystem reserves some space for defragmentation algorithms
etc., and for root, who can therefore usually get more than 100% on a
filesystem. The size of this space (*minfree*) can be set with 'tunefs
-m <arg>', where arg is the percentage. The default is 10%. df does
not count this space as available.
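
To get a feeling for the numbers, here is a small arithmetic sketch in
POSIX shell. The function names are hypothetical, and the exact rounding
that df and the filesystem use may differ; this only illustrates the
percentage calculation described above:

```shell
# Space held back by minfree, in KB (integer arithmetic, rounds down).
reserved_kb() {
    # $1: filesystem size in KB, $2: minfree percentage
    echo $(( $1 * $2 / 100 ))
}

# Space ordinary users can fill before df reports the filesystem as full.
usable_kb() {
    # $1: filesystem size in KB, $2: minfree percentage
    echo $(( $1 - $1 * $2 / 100 ))
}

# Example with the default minfree of 10% on a 100000 KB filesystem:
#   reserved_kb 100000 10    # 10000
#   usable_kb 100000 10      # 90000
```

So on such a filesystem df would show 100% once 90000 KB are in use,
and only root could go on writing into the remaining 10000 KB.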

7)

Nonexistent device:

The number one consumer of space on / that I know of is people with
root access trying to write to a device that is not there.

Try this:

        ls -ls /dev | sort -n | tail

There should not be any files larger than the file MAKEDEV. If there
are, they will be the last file or two in the list, and they are
very likely to be the whole problem. I will even bet you that they
have names something like sto (almost st0) or fdo (almost fd0). Just
blow those accidents away and / should be fine again.
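
As a complement to sorting the ls output by size, one could list only
the regular files in the device directory, since real devices are
special files. This is a sketch only, assuming a POSIX find. The
function name list_regular_files is hypothetical:

```shell
# Print every regular file under a device directory (default /dev).
# Device nodes, directories, symlinks and FIFOs are skipped by -type f;
# -xdev keeps find from crossing into other mounted filesystems.
list_regular_files() {
    dir=${1:-/dev}
    find "$dir" -xdev -type f -print
}

# Example:
#   list_regular_files /dev
```

MAKEDEV itself is a regular file and will show up in the list; anything
else deserves a closer look with ls -ls before it is removed.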

8)

Check whether any files are hidden under a mount point.

About finding a process holding an open file:
---------------------------------------------

First scan the file system to identify the removed file:

        fsck -n /dev/rsd0a

This should report an unreferenced file. Note the inode number. The

-n option is very important since the file system is mounted and must

not be modified.

Now run the command:

        lsof /

lsof is a freely distributable utility available via anonymous FTP

from ftp.cc.purdue.edu. It will list open files on the root file

system. Match the inode number reported by fsck to identify the

guilty process. Based on the command the process is running, you can

then decide whether there is a way to get the process to close the

file, or if you should just kill the process. In either case, the disk

space occupied by the unreferenced file will then be freed.
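
The matching step can be scripted. The sketch below assumes an lsof
version that supports field output via -F, where lines starting with p,
c and i carry the process ID, command name and inode number. Check your
lsof manual page, as older releases may differ. The function name
pids_holding_inode and the inode number in the example are made up:

```shell
# Read "lsof -F pci" output on stdin and print "PID COMMAND" for every
# process that has the given inode open.
pids_holding_inode() {
    # $1: inode number reported by fsck -n
    awk -v want="$1" '
        /^p/ { pid = substr($0, 2) }   # start of a new process set
        /^c/ { cmd = substr($0, 2) }   # command name of that process
        /^i/ { if (substr($0, 2) == want) print pid, cmd }
    '
}

# Example (4711 stands in for the inode number you noted):
#   lsof -F pci / | pids_holding_inode 4711
```

Newer lsof releases also document a +L1 option that selects open files
which have been unlinked; if your version has it, that may save the
fsck step altogether.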

Others suggest the programs ofiles or fuser instead of lsof.

One says that running "fsck -n" >>might<< show you what's happening.
