BSD

  • server locking up

    6 answers - 745 bytes

    One of our file servers (running 3.99.11) has started locking up every couple
    of days. It gets into a state where any process will run fine until it tries
    to access the disk, at which point it stops responding. Updating the kernel
    to a current from a couple of weeks ago makes no difference.
    I've had zero luck in tracking down what's causing this.
    However, I set up an external-mode watchdog to panic the machine if a loop of
      sleep 20; ls -l /a/local/directory > /dev/null; wdogctl -t
    failed to tickle the watchdog for a minute, so I now have a core dump from
    such a panic. I'd like some suggestions on what to look for/how to poke at
    this core dump to try to find what's happening.
    cheers
    mark
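
For readers who want to reproduce the setup, the probe loop described above can be sketched as a small shell script. This is only a sketch: `tickle_if_readable` is a hypothetical helper name, the directory is the placeholder path from the post, and the command that arms the watchdog in external mode is not shown because the post does not give it. Unlike the one-liner in the post, this variant also skips the tickle when the listing fails outright rather than hangs.

```shell
#!/bin/sh
# Sketch of the disk-probe watchdog loop described in the post.
# tickle_if_readable is a hypothetical helper name, not part of NetBSD.

# Usage: tickle_if_readable DIR TICKLE_COMMAND [ARGS...]
# Lists DIR (forcing a disk access) and runs the tickle command only if the
# listing succeeds. If the listing hangs, the tickle never happens and the
# armed watchdog eventually fires.
tickle_if_readable() {
    probe_dir=$1
    shift
    if ls -l "$probe_dir" > /dev/null 2>&1; then
        "$@"
    else
        return 1
    fi
}

# Example loop (commented out so that sourcing this file does not block):
# while :; do
#     sleep 20
#     tickle_if_readable /a/local/directory wdogctl -t
# done
```

The key property is that the tickle sits behind the disk access, so a process stuck in the kernel on disk I/O stops the tickles without any extra logic.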
  • No.1 | | 844 bytes | |

    On Thu, Jul 06, 2006 at 11:27:09PM +1200, Mark Davies wrote:
    > One of our file servers (running 3.99.11) has started locking up every couple
    > of days. It gets into a state where any process will run fine until it tries
    > to access the disk, at which point it stops responding. Updating the kernel
    > to a current from a couple of weeks ago makes no difference.
    > I've had zero luck in tracking down what's causing this.
    > However, I set up an external-mode watchdog to panic the machine if a loop of
    >   sleep 20; ls -l /a/local/directory > /dev/null; wdogctl -t
    > failed to tickle the watchdog for a minute, so I now have a core dump from
    > such a panic. I'd like some suggestions on what to look for/how to poke at
    > this core dump to try to find what's happening.

    The output of `ps -axl -N /netbsd -M core' is a good starting point.
  • No.2 | | 3787 bytes | |

    On Friday 07 July 2006 01:21, Juergen Hannken-Illjes wrote:
    > The output of `ps -axl -N /netbsd -M core' is a good starting point.

    Sorry, I should have included that. Nothing leapt out at me:

      UID   PID PPID   CPU PRI NI  VSZ RSS WCHAN    STAT TTY         TIME COMMAND
        0     0    0     0 -18  0    0   0 schedule RWKs ?        0:00.00 [swapper]
        0     1    0     0  10  0   68   0 wait     RWs  ?        0:00.00 init
        0     2    0     0  14  0    0   0 crypto_w RWK  ?        0:00.00 [cryptoret
        0     3    0     0  -6  0    0   0 sccomp   RWK  ?        0:00.00 [scsibus0]
        0     4    0     0  -6  0    0   0 sccomp   RWK  ?        0:00.00 [scsibus1]
        0     5    0     0  10  0    0   0 usbevt   RWK  ?        0:00.00 [usb0]
        0     6    0     0  10  0    0   0 usbtsk   RWK  ?        0:00.00 [usbtask]
        0     7    0     0  -6  0    0   0 atath    RWK  ?        0:00.00 [atabus0]
        0     8    0     0  -6  0    0   0 atath    RWK  ?        0:00.00 [atabus1]
        0     9    0     0  10  0    0   0 pmsreset RWK  ?        0:00.00 [pms0]
        0    10    0     0  -6  0    0   0 sccomp   RWK  ?        0:00.00 [atapibus0
        0    11    0     0 -18  0    0   0 pgdaemon RWK  ?        0:00.00 [pagedaemo
        0    12    0     0  18  0    0   0 syncer   RWK  ?        0:00.00 [ioflush]
        0    13    0     0 -18  0    0   0 -        RWK  ?        0:00.00 [aiodoned]
        0    50    0     0  -6  0    0   0 physiod  RWK  ?        0:00.00 [physiod]
        0   160    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 ssh -e non
        0   161    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 (nfsd)
        0   162    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   163    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   164    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 nfsd: mast
        0   165    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   166    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 nfsd: serv
        0   167    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 nfsd: serv
        0   168    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 nfsd: serv
        0   169    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   170    0     0   2  0  216   0 select   RWs  ?        0:00.00 (rpc.statd
        0   171    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 (nfsd)
        0   172    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 (nfsd)
        0   302    0     0  18  0 1052   0 pause    RWs  ?        0:00.00 (ntpd)
        0   351    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   365    0  1318   2  0  304   0 select   RWs  ?       21:58.00 (sshd)
        0   399    0     0   2  0  184   0 -        RWs  ?        0:00.00 (syslogd)
        0   408    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   442    0     0   2  0  340   0 poll     RWs  ?        0:00.00 (rpcbind)
        0   444    0     0  18  0 7100   0 sigwait  RWsa ?        0:00.00 (named)
        0   454    0     0   2  0  440   0 select   RWs  ?        0:00.00 (sshd)
        0   473    0     0  10  0  220   0 mfsidl   RWs  ?        0:00.00 (mount_mfs
        0   528    0     0   2  0  632   0 select   RWs  ?        0:00.00 (amd)
        0   533    0     0 -22  0    0   0 actwat   RWK  ?        0:00.00 [acctwatch
        0   534    0     0   2  0  608   0 select   RWs  ?        0:00.00 (mountd)
        0   541    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   563    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   599    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   603    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 (nfsd)
        0   625    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   628    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   637    0     0   2  0  172   0 poll     RWs  ?        0:00.00 (nfsd)
        0   657    0     0   2  0  224   0 select   RWs  ?        0:00.00 (rpc.lockd
        0   831    0 54760  18  0 3120   0 pause    RW   ?      912:40.05 (smbd)
        0   879    0     0   2  0 1064   0 select   RWs  ?        0:00.00 (nmbd)
        0   894    0  2406   2  0 3132   0 select   RWs  ?       40:06.00 (smbd)
        0   924    0     0   2  0  508   0 select   RW   ?        0:00.00 (afpd)
        0   979    0 59576   2  0   72   0 select   RWs  ?      992:56.06 (mntauthd)
        0  1015    0     0   2  0  216   0 kqread   RWs  ?        0:00.00 (inetd)
        0  1072    0     0  10  0  232   0 nanoslee RWs  ?        0:00.00 (cron)
    32767  1083    0     0   2  0   60   0 select   RW   ?        0:00.00 (rpc.ruser
        0  1118    0     0   2  0  440   0 select   RWs  ?        0:00.00 (sshd)
        0   193    0     0  10  0  156   0 ppwait   RW   ttyp0    0:00.00 (sh)
     1002   986    0  4523  18  0 1176   0 pause    RWs  ttyp0   75:23.00 (tcsh)
        0  1142    0     0   3  0 1284   0 ttyin    RW+  ttyp0    0:00.00 (tcsh)
        0 26986    0     0 -20  0  156   0 temp     RWV  ttyp0    0:00.00 (sh)
     1111   452    0   374  18  0 1172   0 pause    RWs  ttyp1    6:14.00 (tcsh)
        0   976    0     0   3  0 1292   0 ttyin    RW+  ttyp1    0:00.00 (tcsh)
        0  1077    0 47737   3  0   52   0 ttyin    RWs+ ttyE0  795:37.05 (getty)

    cheers
    mark
  • No.3 | | 1020 bytes | |

    Mark Davies wrote:
    > One of our file servers (running 3.99.11) has started locking up every couple
    > of days. It gets into a state where any process will run fine until it tries
    > to access the disk, at which point it stops responding. Updating the kernel
    > to a current from a couple of weeks ago makes no difference.
    > I've had zero luck in tracking down what's causing this.
    > However, I set up an external-mode watchdog to panic the machine if a loop of
    >   sleep 20; ls -l /a/local/directory > /dev/null; wdogctl -t
    > failed to tickle the watchdog for a minute, so I now have a core dump from
    > such a panic. I'd like some suggestions on what to look for/how to poke at
    > this core dump to try to find what's happening.
    >
    > cheers
    > mark

    Are you using filesystem snapshots? I've had this problem on FreeBSD
    when I used filesystem snapshots with certain RAID controllers (Dell
    PERC 3/di and Mylex AcceleRAID). I never tracked it down, and ended up
    giving up on snapshots altogether.
  • No.4 | | 756 bytes | |

    On Fri, Jul 07, 2006 at 02:27:14AM +1200, Mark Davies wrote:
    > On Friday 07 July 2006 01:21, Juergen Hannken-Illjes wrote:
    > > The output of `ps -axl -N /netbsd -M core' is a good starting point.
    >
    > Sorry, I should have included that. Nothing leapt out at me:
    >
    >   UID   PID PPID   CPU PRI NI  VSZ RSS WCHAN    STAT TTY         TIME COMMAND
    > [snip]
    >     0   160    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 ssh -e non
    >     0   161    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 (nfsd)
    >     0   162    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
    >     0   163    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)

    Looks like an "out of memory" situation. WCHAN == "temp" is waiting for
    memory (malloc type M_TEMP) to become available. What does
    `vmstat -N /netbsd -M core -s' report for "pages free"?
    Do you use tmpfs?
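
When scanning a long `ps -axl` listing for a pattern like this, it can help to count processes per wait channel. The following is a minimal sketch, assuming the 13-column layout shown in this thread (WCHAN is the ninth field); `wchan_tally` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Count processes per wait channel in `ps -axl` output read from stdin.
# Assumes the column order shown in this thread:
#   UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
# wchan_tally is a hypothetical helper name.
wchan_tally() {
    # Skip the header line and any fragment with fewer than 13 fields
    # (for example, command names wrapped onto their own line by a mailer).
    awk 'NR > 1 && NF >= 13 { count[$9]++ }
         END { for (w in count) print count[w], w }' |
        sort -k1,1nr -k2,2
}

# Usage, with the command from the thread:
#   ps -axl -N /netbsd -M core | wchan_tally
```

Each output line is a count followed by a WCHAN name, most frequent first, so a pile-up on "temp" or "vnlock" stands out immediately.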
  • No.5 | | 150 bytes | |

    On Friday 07 July 2006 02:46, Skylar Thompson wrote:
    > Are you using filesystem snapshots?

    No, I'm not.
    cheers
    mark
  • No.6 | | 4510 bytes | |

    On Friday 07 July 2006 04:28, Juergen Hannken-Illjes wrote:

    > Looks like an "out of memory" situation. WCHAN == "temp" is waiting for
    > memory (malloc type M_TEMP) to become available. What does
    > `vmstat -N /netbsd -M core -s' report for "pages free"?

    37893 pages free

    > Do you use tmpfs?

    No (/tmp is mfs).

    The machine did it again last night, so here are the pages free and ps output
    for that core:

    2554 pages free

      UID   PID PPID   CPU PRI NI  VSZ RSS WCHAN    STAT TTY         TIME COMMAND
        0     0    0     0 -18  0    0   0 schedule RWKs ?        0:00.00 [swapper]
        0     1    0     0  10  0   68   0 wait     RWs  ?        0:00.00 init
        0     2    0     0  14  0    0   0 crypto_w RWK  ?        0:00.00 [cryptoret
        0     3    0     0  -6  0    0   0 sccomp   RWK  ?        0:00.00 [scsibus0]
        0     4    0     0  -6  0    0   0 sccomp   RWK  ?        0:00.00 [scsibus1]
        0     5    0     0  10  0    0   0 usbevt   RWK  ?        0:00.00 [usb0]
        0     6    0     0  10  0    0   0 usbtsk   RWK  ?        0:00.00 [usbtask]
        0     7    0     0  -6  0    0   0 atath    RWK  ?        0:00.00 [atabus0]
        0     8    0     0  -6  0    0   0 atath    RWK  ?        0:00.00 [atabus1]
        0     9    0     0  10  0    0   0 pmsreset RWK  ?        0:00.00 [pms0]
        0    10    0     0  -6  0    0   0 sccomp   RWK  ?        0:00.00 [atapibus0
        0    11    0     0 -18  0    0   0 pgdaemon RWK  ?        0:00.00 [pagedaemo
        0    12    0     0  18  0    0   0 syncer   RWK  ?        0:00.00 [ioflush]
        0    13    0     0 -18  0    0   0 -        RWK  ?        0:00.00 [aiodoned]
        0    50    0     0  -6  0    0   0 physiod  RWK  ?        0:00.00 [physiod]
        0   146    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 (nfsd)
        0   160    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 ssh -e non
        0   161    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   162    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   163    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 (nfsd)
        0   164    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 nfsd: mast
        0   167    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 nfsd: serv
        0   168    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 nfsd: serv
        0   169    0     0  -2  0   52   0 vnlock   RWL  ?        0:00.00 (nfsd)
        0   170    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 (nfsd)
        0   171    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   172    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 (nfsd)
        0   376    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   399    0     0   2  0  184   0 -        RWs  ?        0:00.00 (syslogd)
        0   442    0     0   2  0  340   0 poll     RWs  ?        0:00.00 (rpcbind)
        0   444    0     0  18  0 7924   0 sigwait  RWsa ?        0:00.00 (named)
        0   447    0     0   2  0  172   0 poll     RWs  ?        0:00.00 (nfsd)
        0   473    0     0  10  0  220   0 mfsidl   RWs  ?        0:00.00 (mount_mfs
        0   474    0     0   2  0  608   0 select   RWs  ?        0:00.00 /usr/sbin/
        0   510    0     0   2  0  632   0 select   RWs  ?        0:00.00 /usr/sbin/
        0   529    0     0 -22  0    0   0 actwat   RWK  ?        0:00.00 [acctwatch
        0   560    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   563    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   565    0     0  10  0    0   0 nfsidl   RWK  ?        0:00.00 [nfsio]
        0   594    0     0   2  0  440   0 select   RWs  ?        0:00.00 (sshd)
        0   598    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 (nfsd)
        0   613    0     0   2  0  508   0 select   RW   ?        0:00.00 (afpd)
        0   635    0     0 -20  0   52   0 temp     RWL  ?        0:00.00 (nfsd)
        0   656    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 (nfsd)
        0   659    0     0   2  0  216   0 select   RWs  ?        0:00.00 (rpc.statd
        0   661    0     0   2  0  224   0 select   RWs  ?        0:00.00 (rpc.lockd
        0   664    0  1207   2  0  304   0 select   RWs  ?       20:07.00 (sshd)
        0   689    0     0   2  0   52   0 nfsd     RWL  ?        0:00.00 (nfsd)
        0   695    0     0  18  0 1052   0 pause    RWs  ?        0:00.00 (ntpd)
        0   860    0 61381   2  0   72   0 select   RWs  ?     1023:01.06 (mntauthd)
        0   919    0 49642  18  0 3120   0 pause    RW   ?      827:22.05 (smbd)
        0   957    0  1935   2  0 3132   0 select   RWs  ?       32:15.00 (smbd)
        0   980    0     0   2  0 1064   0 select   RWs  ?        0:00.00 (nmbd)
        0  1051    0     0  10  0  232   0 nanoslee RWs  ?        0:00.00 (cron)
        0  1076    0     0   2  0  208   0 kqread   RWs  ?        0:00.00 (inetd)
        0  7752    0     0   2  0 4684   0 select   RW   ?        0:00.00 (smbd)
        0  8479    0     0   2  0  440   0 select   RWs  ?        0:00.00 (sshd)
        0 15340    0     0   2  0  440   0 select   RWs  ?        0:00.00 (sshd)
        0 22082    0     0  10  0  148   0 ppwait   RWs  ?        0:00.00 (sh)
        0 23243    0     0 -20  0  148   0 temp     RWV  ?        0:00.00 (sh)
        0 25253    0     0   2  0  248   0 piperd   RW   ?        0:00.00 (cron)
        0 28157    0     0   2  0  440   0 select   RWs  ?        0:00.00 (sshd)
        0  1109    0     0   3  0 1240   0 ttyin    RWs+ ttyp0    0:00.00 (tcsh)
        0  1133    0     0  10  0  156   0 ppwait   RW   ttyp0    0:00.00 (sh)
        0 23901    0     0 -20  0  156   0 temp     RWV  ttyp0    0:00.00 (sh)
     1111 15595    0  1972  18  0 1172   0 pause    RWs  ttyp1   32:52.00 (tcsh)
        0 16441    0     0   3  0 1172   0 ttyin    RW+  ttyp1    0:00.00 (tcsh)
        0 29002    0     0  10  0   64   0 -        TW   ttyp2    0:00.00 (man)
        0 29044    0     0   3  0 1348   0 ttyin    RWs+ ttyp2    0:00.00 (tcsh)
        0 29153    0     0  10  0  152   0 -        TW   ttyp2    0:00.00 (sh)
        0 29240    0     0  28  0  180   0 -        TW   ttyp2    0:00.00 (more)
     1002 12726    0     0   3  0 1336   0 ttyin    RWs+ ttyp3    0:00.00 (tcsh)
        0  1007    0 50966   3  0   52   0 ttyin    RWs+ ttyE0  849:26.05 (getty)

    cheers
    mark
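
To compare the two "pages free" figures in more familiar units, the count can be converted to KiB. This is a rough sketch that assumes a 4096-byte page size, which the thread does not state; `pages_free_kib` is a hypothetical helper name, and the page size can be passed as an argument if the machine uses a different one.

```shell
#!/bin/sh
# Convert the "pages free" line of `vmstat -s` output (read from stdin)
# to KiB. The default page size of 4096 bytes is an assumption; pass the
# real value as the first argument if it differs.
# pages_free_kib is a hypothetical helper name.
pages_free_kib() {
    page_size=${1:-4096}
    awk -v ps="$page_size" '/pages free$/ { printf "%d\n", $1 * ps / 1024 }'
}

# Usage, with the command from the thread:
#   vmstat -N /netbsd -M core -s | pages_free_kib
```

Under the 4096-byte assumption, 37893 pages is 151572 KiB (about 148 MiB) and 2554 pages is 10216 KiB (about 10 MiB).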
