Suggestions for ways to avoid this in the future:

  • Restart Zincati periodically

    • Allows the process to get out of any stuck state it may be in
      • I think there have been at least two issues where this would have helped
    • Should have almost no risk / no cost
  • Switch Zincati to a periodic systemd timer

    • Instead of having a permanently running background daemon, use a systemd timer to trigger zincati checks at a regular interval
    • DWM: One problem with this approach may be the periodic timer logic for finalizing the update and rebooting.
      • TR: The timer would still be triggered every 5 minutes by default, which should cover this case.
    • JL: I think this would require a rework of Zincati, and we don't have much Zincati expertise currently. We would also need to sanity-check that it meshes well with other update strategies like fleet_lock. Overall, there is a definite risk of regressions in trying to do this.
  • Prepare Zincati for the container-first workflow

    • Something we need to do anyway
    • JL: That doesn't necessarily address this specific issue. The leak happened in sd_notify, which Zincati would presumably still call.
  • Bake Zincati updates in the next stream first

  • Add monitoring to the persistent systems our team uses

    • build nodes
    • archive-repo-manager
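
As a rough illustration of the "restart Zincati periodically" suggestion above, one option might be a systemd drop-in that caps the service's runtime and lets systemd start it again. This is only a sketch: the drop-in file name and the 24h interval are assumptions, not something agreed in the discussion, and whether a restart in the middle of an update is safe would still need checking.

```ini
# /etc/systemd/system/zincati.service.d/10-periodic-restart.conf
# Hypothetical drop-in: the file name and the 24h interval are assumptions.
[Service]
# Terminate the service once it has been active for this long.
RuntimeMaxSec=24h
# Start it again afterwards, regardless of how it exited.
Restart=always
```

A drop-in like this would take effect after `systemctl daemon-reload` and a restart of zincati.service.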