Sunday, July 19, 2015

Failed CoreOS Services

I'm not going to describe how fleetd and systemd work together; that's better researched on the CoreOS site. What I am going to describe is a condition that I often find myself in. My rig looks like this:
  • I have a Chromebook running "terminal"
  • my remote CoreOS servers are typically at Google Compute Engine
  • and I use a blend of tools installed in a dedicated docker container
However, from time to time when I log into my CoreOS instance, I get this crazy error message about some "failed unit". After a quick investigation:
  • systemctl status
  • systemctl list-units
  • systemctl --failed
I determined that my ssh session had terminated and left some breadcrumbs behind. I'm not exactly sure why the session died, but it happens often. There must be some idle timer in the Chrome terminal application that I have not configured properly, because a similar session opened from my MacBook stays open longer.

On one occasion I noticed that I had multiple failed units, and I was never able to determine what happened. After some additional systemd research I was able to find a way to remove the symptoms.
  • systemctl stop sshd@60338-x.y.z.a:22-a.b.c.d:54566.service
  • systemctl unload sshd@60338-x.y.z.a:22-a.b.c.d:54566.service
  • systemctl reset-failed sshd@60338-x.y.z.a:22-a.b.c.d:54566.service
I'm fairly certain it was the last one that actually cleared the dead unit. The other two did not seem to do anything meaningful.  Also, I needed to prefix the commands with the 'sudo' command.

UPDATE: Two one-liners that stop, and then reset, every failed sshd unit in one pass:

systemctl --failed | grep "service loaded" | sed -e 's/^.*\(sshd.*service\).*$/\1/' | xargs -n 1 -I {} sudo systemctl stop {}

systemctl --failed | grep "service loaded" | sed -e 's/^.*\(sshd.*service\).*$/\1/' | xargs -n 1 -I {} sudo systemctl reset-failed {}
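
For anyone who would rather not paste a long pipeline blind, here is a minimal sketch of the same idea split into two small shell functions. The function names and the DRY_RUN variable are my own inventions for illustration, not part of systemd or CoreOS, and the sketch assumes the unit name appears as a whitespace-separated field in the `systemctl --failed` output. It only runs reset-failed, since that was the command that appeared to clear the dead unit.

```shell
#!/bin/sh

# Hypothetical helper: read `systemctl --failed` style output on stdin and
# print the name of each failed sshd@ unit, one per line.
failed_sshd_units() {
    # Check every field so a leading status marker or indentation is harmless.
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^sshd@.*\.service$/) { print $i; break }
        }
    }'
}

# Hypothetical helper: reset each failed sshd@ unit read from stdin.
# DRY_RUN defaults to 1, which only prints the commands it would run.
clear_failed_sshd_units() {
    failed_sshd_units | while read -r unit; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo systemctl reset-failed $unit"
        else
            sudo systemctl reset-failed "$unit"
        fi
    done
}
```

A dry run would be `systemctl --failed | clear_failed_sshd_units`, which prints the commands without touching anything. Once the list looks right, `systemctl --failed | DRY_RUN=0 clear_failed_sshd_units` actually resets the units.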

UPDATE: In the previous UPDATE I wrote two one-liners. They both work, and that's all you need. However, while testing the results I executed the two scripts from the clipboard, then logged off and reconnected very quickly. I noticed that the CoreOS motd still displayed the 10 failed units I had just cleaned up. I ran 'systemctl --failed' as root and it returned '0 failed'. I then logged out and back in a second time, and the failed units were no longer displayed in the motd. I must have been too fast on the keyboard after sending the initial cleanup one-liners.
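
Since the motd can apparently lag behind the real state, a quick way to confirm the cleanup without relying on it is to count the failed units directly. This is only a sketch: the function name is hypothetical, and it assumes `systemctl --failed --no-legend` prints one line per failed unit and nothing else.

```shell
#!/bin/sh

# Hypothetical helper: count failed units from `systemctl --failed --no-legend`
# output on stdin. Blank lines are ignored; prints 0 when there is no input.
count_failed_units() {
    awk 'NF { n++ } END { print n + 0 }'
}
```

Running `systemctl --failed --no-legend | count_failed_units` right after the cleanup should print 0, even if the motd on the next login still shows the old list.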
