INFO: task docker blocked for more than 120 seconds.
Issue
- The Docker daemon is stuck on one of the OpenShift nodes, so the OpenShift masters see the node as "not ready" and deployments are failing.
- dmesg contains several messages reporting the hung docker task:
[ 4082.854242] INFO: task docker:111571 blocked for more than 120 seconds.
[ 4082.855441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4082.856124] docker D 0000000000000000 0 111571 1 0x00000080
[ 4082.856127] ffff881c01527ab0 0000000000000086 ffff881c332f5080 ffff881c01527fd8
[ 4082.856130] ffff881c01527fd8 ffff881c01527fd8 ffff881c332f5080 ffff881c01527bf0
[ 4082.856132] ffff881c01527bf8 7fffffffffffffff ffff881c332f5080 0000000000000000
[ 4082.856135] Call Trace:
[ 4082.856142] [<ffffffff8163a909>] schedule+0x29/0x70
[ 4082.856144] [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[ 4082.856149] [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 4082.856151] [<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[ 4082.856156] [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[ 4082.856159] [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[ 4082.856166] [<ffffffff810ab190>] ? call_srcu+0x70/0x70
[ 4082.856171] [<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[ 4082.856173] [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 4082.856191] [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[ 4082.856197] [<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[ 4082.856202] [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 4082.856207] [<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[ 4082.856211] [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 4082.856215] [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 4082.856220] [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 4082.856224] [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 4082.856227] [<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[ 4082.856229] [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 4082.856232] [<ffffffff816408d9>] ? do_async_page_fault+0x29/0xe0
[ 4082.856235] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
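When several of these reports pile up in dmesg, it can help to reduce each one to the task name, the PID and the chain of reliable stack frames. The following is a minimal sketch, not an official tool: the function name `parse_hung_tasks` and the regular expressions are assumptions written against the log format shown above. Frames the kernel prefixes with "?" are skipped, since the kernel marks them as unverified stack entries.

```python
import re

# Header line, e.g.
# "[ 4082.854242] INFO: task docker:111571 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One stack frame, e.g. "[<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]".
# An optional "? " prefix marks a frame the kernel could not verify.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?(?P<func>[\w.]+)\+0x")


def parse_hung_tasks(dmesg_text):
    """Return one dict per hung-task report: comm, pid, secs, frames."""
    reports = []
    current = None
    for line in dmesg_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "secs": int(header.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Keep only verified frames that belong to an open report.
        if frame and current is not None and not frame.group("unreliable"):
            current["frames"].append(frame.group("func"))
    return reports


# A few lines copied from the trace above, used as sample input.
SAMPLE = """\
[ 4082.854242] INFO: task docker:111571 blocked for more than 120 seconds.
[ 4082.856135] Call Trace:
[ 4082.856142] [<ffffffff8163a909>] schedule+0x29/0x70
[ 4082.856149] [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 4082.856173] [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 4082.856191] [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
"""

for report in parse_hung_tasks(SAMPLE):
    print(report["comm"], report["pid"], " <- ".join(report["frames"]))
```

On a live node the same function could be fed the captured output of `dmesg`. In the trace above the reliable frames run from `dm_ctl_ioctl` down to `synchronize_srcu`, which shows the task waiting inside a device-mapper suspend.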
Following this guide (https://access.redhat.com/solutions/31453), I tried to reproduce the problem by marking the node unschedulable and then stopping the docker service with "systemctl stop docker". The prompt hung, but from another SSH session I was able to collect the files that the guide asks for. These are the journal logs for the docker service:
dic 01 16:40:16 hostname.example.com systemd[1]: Stopping Docker Application Container Engine...
dic 01 16:40:16 hostname.example.com docker[38182]: time="2015-12-01T16:40:16.387328403+01:00" level=info msg="Processing signal 'terminated'"
dic 01 16:41:46 hostname.example.com systemd[1]: docker.service stop-final-sigterm timed out. Killing.
dic 01 16:43:16 hostname.example.com systemd[1]: docker.service still around after final SIGKILL. Entering failed mode.
dic 01 16:43:16 hostname.example.com systemd[1]: Stopped Docker Application Container Engine.
dic 01 16:43:16 hostname.example.com systemd[1]: Unit docker.service entered failed state.
dic 01 16:43:16 hostname.example.com systemd[1]: docker.service failed.
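The gaps between these journal entries can be checked with a few lines of arithmetic. The sketch below uses only the wall-clock times from the excerpt; the helper name `seconds_between` is hypothetical. Both gaps come out at 90 seconds, which matches systemd's default stop timeout, assuming docker.service does not override `TimeoutStopSec`. A process in uninterruptible sleep (the "D" state shown in dmesg) cannot act on SIGKILL until it leaves that state, which would be consistent with the "still around after final SIGKILL" message.

```python
from datetime import datetime


def seconds_between(first, second):
    """Seconds between two HH:MM:SS stamps taken on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds())


# Timestamps copied from the journal excerpt above.
stop_requested = "16:40:16"   # "Stopping Docker Application Container Engine..."
sigkill_sent = "16:41:46"     # "stop-final-sigterm timed out. Killing."
gave_up = "16:43:16"          # "still around after final SIGKILL."

print(seconds_between(stop_requested, sigkill_sent))  # 90
print(seconds_between(sigkill_sent, gave_up))         # 90
```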
Environment
- OpenShift 3.1