# Docker loading test note
###### tags: `docker`

1024 containers in one single docker network

ref:
https://success.mirantis.com/article/maximum-containers-per-engine
https://stackoverflow.com/questions/21799382/is-there-a-maximum-number-of-containers-running-on-a-docker-host
https://blog.widodh.nl/2015/10/maximum-amount-of-docker-containers-on-a-single-host/

****

HPC: 100.74.53.2
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic

Tried cases:
1. `docker network create -d macvlan --subnet=100.74.53.0/24 --gateway=100.74.53.1 mlsteam-net`
   Each container gets a different private IP. The master node cannot match the server IP, and the repository cannot be connected.
2. `docker network create -d macvlan --subnet=172.10.0.0/16 --gateway=172.10.0.1 mlsteam-net`
   The master node matches the server IP and the repository can be connected, but none of the containers can ping 8.8.8.8.

If we want to use swarm mode with macvlan, this page may be a useful reference:
https://collabnix.com/docker-17-06-swarm-mode-now-with-macvlan-support/ (not tried yet)

> 2020/11/10 Update

## Docker loading test

### Case 1
1. Create and start at once
2. More than 2000 containers

```python=
import subprocess

num = 2000

for i in range(num):
    cmd = "docker create ...."
    subprocess.Popen(cmd, shell=True)
    cmd = "docker start ...."
    subprocess.Popen(cmd, shell=True)
```

:::danger
Error messages
1. `docker container state OCI runtime create failed: container with id exists:unknown`
2. `Error response from daemon: failed to create endpoint u9999 on network mlsteam-net: adding interface vethfb3035c to bridge br-16b2b4e4cc93 failed: exchange full`
3. `Oct 27 08:51:50 hpc1 python3[17041]: runtime/cgo: pthread_create failed: Resource temporarily unavailable`
   `Oct 27 08:51:50 hpc1 python3[17041]: SIGABRT: abort`
4. `Error response from daemon: ttrpc: closed: unknown Error: failed to start containers: n267`
:::

### Case 2
1. Pre-create 2000 containers at once
2. After 60 secs, start all containers at once

```python=
import subprocess
import time

num = 2000

for i in range(num):
    cmd = "docker create ...."
    subprocess.Popen(cmd, shell=True)

time.sleep(60)

for i in range(num):
    cmd = "docker start ...."
    subprocess.Popen(cmd, shell=True)
```

:::danger
Error messages
1. ~~`docker container state OCI runtime create failed: container with id exists:unknown`~~
2. `Error response from daemon: failed to create endpoint u9999 on network mlsteam-net: adding interface vethfb3035c to bridge br-16b2b4e4cc93 failed: exchange full`
3. `Oct 27 08:51:50 hpc1 python3[17041]: runtime/cgo: pthread_create failed: Resource temporarily unavailable`
   `Oct 27 08:51:50 hpc1 python3[17041]: SIGABRT: abort`
4. ~~`Error response from daemon: ttrpc: closed: unknown Error: failed to start containers: n267`~~
:::

### Case 3
1. Pre-create 1000 containers
2. After 60 secs, start all containers at once

```python=
import subprocess
import time

num = 1000

for i in range(num):
    cmd = "docker create ...."
    subprocess.Popen(cmd, shell=True)

time.sleep(60)

for i in range(num):
    cmd = "docker start ...."
    subprocess.Popen(cmd, shell=True)
```

:::danger
Error messages
1. ~~`docker container state OCI runtime create failed: container with id exists:unknown`~~
2. ~~`Error response from daemon: failed to create endpoint u9999 on network mlsteam-net: adding interface vethfb3035c to bridge br-16b2b4e4cc93 failed: exchange full`~~
3. `Oct 27 08:51:50 hpc1 python3[17041]: runtime/cgo: pthread_create failed: Resource temporarily unavailable`
   `Oct 27 08:51:50 hpc1 python3[17041]: SIGABRT: abort`
4. ~~`Error response from daemon: ttrpc: closed: unknown Error: failed to start containers: n267`~~
:::
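The `exchange full` error above is what the kernel returns when a Linux bridge has no free ports (a bridge accepts at most 1024 interfaces, which is where the "1024 containers per docker network" ceiling comes from), and `pthread_create failed: Resource temporarily unavailable` points at process/thread limits. A minimal pre-flight check before a run might look like this (a sketch; the bridge name `br-16b2b4e4cc93` is just the one from the error message above and differs per network):

```python=
import os
import resource

BRIDGE = "br-16b2b4e4cc93"   # bridge backing the docker network

# Interfaces already attached to the bridge; a Linux bridge tops out at 1024 ports.
ports = len(os.listdir("/sys/class/net/{}/brif".format(BRIDGE)))
print("bridge ports used: {}/1024".format(ports))

# Per-user process limit behind "pthread_create failed: Resource temporarily unavailable".
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("ulimit -u (soft/hard): {}/{}".format(soft, hard))

# Kernel-wide ceilings that also bound how many containers/threads can run.
for knob in ("/proc/sys/kernel/pid_max", "/proc/sys/kernel/threads-max"):
    with open(knob) as f:
        print(knob, f.read().strip())
```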
### Case 4
1. Pre-create 1000 containers at once
2. Start about 300 containers per batch

```python=
import subprocess
import time

num = 1000
batches = 3
batch_size = num // batches + 1   # ~334 containers per batch

for i in range(num):
    cmd = "docker create ...."
    subprocess.Popen(cmd, shell=True)

time.sleep(60)

# batches: 0-334, 334-667, 667-1000
for b in range(batches):
    for i in range(b * batch_size, min((b + 1) * batch_size, num)):
        cmd = "docker start ...."
        subprocess.Popen(cmd, shell=True)
```

:::success
Success!! 1000 containers in total.
:::

### Case 5
1. Pre-create 1000 containers at once
2. Start about 300 containers per batch
3. Connect to different docker networks (default, mlsteam-net)

```python=
import subprocess
import time

num = 1000
batches = 3
batch_size = num // batches + 1   # ~334 containers per batch

for i in range(num):
    cmd = "docker create ...."
    subprocess.Popen(cmd, shell=True)
    cmd = "docker create --network=mlsteam-net ...."
    subprocess.Popen(cmd, shell=True)

# default network
time.sleep(60)
# batches: 0-334, 334-667, 667-1000
for b in range(batches):
    for i in range(b * batch_size, min((b + 1) * batch_size, num)):
        cmd = "docker start ...."
        subprocess.Popen(cmd, shell=True)

# mlsteam-net
time.sleep(60)
# batches: 0-334, 334-667, 667-1000
for b in range(batches):
    for i in range(b * batch_size, min((b + 1) * batch_size, num)):
        cmd = "docker start ...."
        subprocess.Popen(cmd, shell=True)
```

:::success
All succeeded!! 2000 containers in total.
:::
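Cases 4 and 5 suggest the daemon copes as long as starts are throttled. A reusable form of that batched start (a sketch; the `containers` list, batch size, and pause are placeholders — using `subprocess.run` instead of fire-and-forget `Popen` additionally waits for each `docker start` to return, which is an assumption on my part, not something measured above):

```python=
import subprocess
import time

def start_in_batches(containers, batch_size=300, pause=60):
    """Start pre-created containers in throttled batches."""
    for offset in range(0, len(containers), batch_size):
        for name in containers[offset:offset + batch_size]:
            # block until the CLI returns, so starts cannot pile up
            subprocess.run("docker start {}".format(name), shell=True)
        time.sleep(pause)  # give dockerd/containerd time to settle
```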
## MLSteam loading test 2020/11/02

```
Percentage of the requests completed within given times
 Type     Name              # reqs     50%    66%    75%    80%    90%    95%    98%    99%   99.9%  99.99%   100%
------------------------------------------------------------------------------------------------------------------
 POST     /api/auth/login      466     410    440    560    640    880   1100   1200   1300    1400    1400   1400
 POST     /api/labs            463     170    220    370    440    680    870   1200   1400    1800    1800   1800
------------------------------------------------------------------------------------------------------------------
 None     Aggregated           929     380    420    460    560    800   1000   1200   1300    1800    1800   1800
```

### Case 1

:::warning
After 700 containers, Python catches:
`error waiting for container: context canceled"\nError response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \\"rootfs_linux.go:109: jailing process inside rootfs caused \\\\\\"pivot_root invalid argument\\\\\\"\\"": unknown\n`
:::

:::warning
After 800 containers
```
2020-11-02 06:41:07 [DEBUG] b'time="2020-11-02T06:40:58Z" level=error msg="error waiting for container: context canceled"\nError response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \\"rootfs_linux.go:109: jailing process inside rootfs caused \\\\\\"pivot_root invalid argument\\\\\\"\\"": unknown\n'
2020-11-02 06:43:08 [ERROR] [Lab.wait_alive] : lab ue0d23ac response timeout=15.
2020-11-02 06:43:08 [DEBUG] b'time="2020-11-02T06:42:59Z" level=error msg="error waiting for container: context canceled"\nError response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \\"process_linux.go:432: running prestart hook 1 caused \\\\\\"error running hook: exit status 2, stdout: , stderr: runtime: failed to create new OS thread (have 2 already; errno=11)\\\\\\\\nruntime: may need to increase max user processes (ulimit -u)\\\\\\\\nfatal error: newosproc\\\\\\\\n\\\\\\\\nruntime stack:\\\\\\\\nruntime.throw(0x53e895, 0x9)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/panic.go:1116 +0x72\\\\\\\\nruntime.newosproc(0xc00002e000)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/os_linux.go:161 +0x1ba\\\\\\\\nruntime.newm1(0xc00002e000)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:1753 +0xdc\\\\\\\\nruntime.newm(0x548970, 0x0)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:1732 +0x8f\\\\\\\\nruntime.main.func1()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:134 +0x36\\\\\\\\nruntime.systemstack(0x45bd74)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/asm_amd64.s:370 +0x66\\\\\\\\nruntime.mstart()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:1041\\\\\\\\n\\\\\\\\ngoroutine 1 [running]:\\\\\\\\nruntime.systemstack_switch()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc00002a788 sp=0xc00002a780 pc=0x45be70\\\\\\\\nruntime.main()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:133 +0x70 fp=0xc00002a7e0 sp=0xc00002a788 pc=0x4332d0\\\\\\\\nruntime.goexit()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00002a7e8 sp=0xc00002a7e0 pc=0x45de01\\\\\\\\n\\\\\\"\\"": unknown\n'
2020-11-02 06:43:08 [DEBUG] [lab.run] : stop lab ue0d23ac, exception, wait alive failed
```
:::

:::warning
It shows a lot of error messages after 900 containers.
:::

:::danger
docker : Error response from daemon: removal of container ### is already in progress
![](https://i.imgur.com/KNko1s6.png)
:::

:::danger
After 840 containers
```
  File "/usr/lib/python3/dist-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.40/containers/68d86854380b0b4111cd5dd5467499d3217a577ecf753272237b3d660eb5601f?v=False&link=False&force=False

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
  File "/build/mlsteam_agent/mlsteam_agent/lab.py", line 451, in run
    self.remove()
  File "/build/mlsteam_agent/mlsteam_agent/lab.py", line 133, in remove
    raise e
  File "/build/mlsteam_agent/mlsteam_agent/lab.py", line 130, in remove
    docker.from_env().containers.get(self.uuid).remove()
  File "/usr/local/lib/python3.6/dist-packages/docker/models/containers.py", line 351, in remove
    return self.client.api.remove_container(self.id, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/container.py", line 1009, in remove_container
    self._raise_for_status(res)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 261, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 409 Client Error: Conflict ("You cannot remove a running container 68d86854380b0b4111cd5dd5467499d3217a577ecf753272237b3d660eb5601f. Stop the container before attempting removal or force remove")
2020-11-02T06:47:28Z <Greenlet at 0x7fb0334fdb48: <bound method Lab.run of <mlsteam_agent.lab.Lab object at 0x7fb0334d5320>>> failed with APIError
```
:::
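The 409 above comes from calling `remove()` on a container that is still running. A more defensive teardown is to stop first and fall back to a forced remove; a sketch with the docker SDK (the `safe_remove` helper and its timeout are illustrative, not part of the agent code):

```python=
import docker
from docker.errors import APIError, NotFound

def safe_remove(container_id, stop_timeout=30):
    """Stop a container if needed, then remove it (sketch)."""
    cli = docker.from_env()
    try:
        container = cli.containers.get(container_id)
    except NotFound:
        return  # already gone
    try:
        container.stop(timeout=stop_timeout)
    except APIError:
        pass  # e.g. already stopping / removal in progress
    container.remove(force=True)
```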
:::danger
```
After 900 containers
runtime: failed to create new OS thread (have 24 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x1af5fbe, 0x9)
    /usr/local/go/src/runtime/panic.go:616 +0x81
runtime.newosproc(0xc4204eac00, 0xc420526000)
    /usr/local/go/src/runtime/os_linux.go:164 +0x1af
runtime.newm1(0xc4204eac00)
    /usr/local/go/src/runtime/proc.go:1879 +0x113
runtime.newm(0x1bc48b0, 0x0)
    /usr/local/go/src/runtime/proc.go:1858 +0x9b
runtime.startTheWorldWithSema(0x1, 0x4579f3)
    /usr/local/go/src/runtime/proc.go:1155 +0x1d0
runtime.gcMarkTermination.func3()
    /usr/local/go/src/runtime/mgc.go:1647 +0x26
runtime.systemstack(0x0)
    /usr/local/go/src/runtime/asm_amd64.s:409 +0x79
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1175

goroutine 36 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:363 fp=0xc4200bb550 sp=0xc4200bb548 pc=0x453e00
runtime.gcMarkTermination(0x3fed900e52bfd36e)
    /usr/local/go/src/runtime/mgc.go:1647 +0x407 fp=0xc4200bb720 sp=0xc4200bb550 pc=0x4187d7
runtime.gcMarkDone()
    /usr/local/go/src/runtime/mgc.go:1513 +0x22c fp=0xc4200bb748 sp=0xc4200bb720 pc=0x41836c
runtime.gcBgMarkWorker(0xc420053400)
    /usr/local/go/src/runtime/mgc.go:1912 +0x2e7 fp=0xc4200bb7d8 sp=0xc4200bb748 pc=0x4192e7
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200bb7e0 sp=0xc4200bb7d8 pc=0x456811
created by runtime.gcBgMarkStartWorkers
    /usr/local/go/src/runtime/mgc.go:1723 +0x79

goroutine 1 [runnable]:
net/http.(*persistConn).roundTrip(0xc4206da000, 0xc420686e70, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:2033 +0x5a7
net/http.(*Transport).RoundTrip(0xc4201260f0, 0xc420688500, 0xc4201260f0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:422 +0x8f2
net/http.send(0xc420688500, 0x1c91720, 0xc4201260f0, 0x0, 0x0, 0x0, 0xc42000c870, 0xc42024d710, 0xc4206a36c8, 0x1)
    /usr/local/go/src/net/http/client.go:252 +0x185
net/http.(*Client).send(0xc420686d50, 0xc420688500, 0x0, 0x0, 0x0, 0xc42000c870, 0x0, 0x1, 0x417e1a)
    /usr/local/go/src/net/http/client.go:176 +0xfa
net/http.(*Client).Do(0xc420686d50, 0xc420688500, 0xc42003c0c0, 0xc420688500, 0xc420023c20)
    /usr/local/go/src/net/http/client.go:615 +0x28d
github.com/docker/cli/vendor/golang.org/x/net/context/ctxhttp.Do(0x1cc40c0, 0xc42003c0c0, 0xc420686d50, 0xc420688400, 0x0, 0x27c1be0, 0x2)
    /go/src/github.com/docker/cli/vendor/golang.org/x/net/context/ctxhttp/ctxhttp.go:30 +0x6e
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).doRequest(0xc42023de80, 0x1cc40c0, 0xc42003c0c0, 0xc420688400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/docker/cli/vendor/github.com/docker/docker/client/request.go:132 +0xbe
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).Ping(0xc42023de80, 0x1cc40c0, 0xc42003c0c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/docker/cli/vendor/github.com/docker/docker/client/ping.go:17 +0x17c
github.com/docker/cli/cli/command.(*DockerCli).initializeFromClient(0xc420411b00)
    /go/src/github.com/docker/cli/cli/command/cli.go:213 +0x8b
github.com/docker/cli/cli/command.(*DockerCli).Initialize(0xc420411b00, 0xc420481400, 0xc420686930, 0x3)
    /go/src/github.com/docker/cli/cli/command/cli.go:197 +0x192
main.newDockerCommand.func2(0xc420672500, 0xc420686930, 0x3, 0x3, 0x0, 0x0)
    /go/src/github.com/docker/cli/cmd/docker/docker.go:43 +0x71
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).execute(0xc420672500, 0xc420038160, 0x3, 0x3, 0xc420672500, 0xc420038160)
    /go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:741 +0x579
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc420416c80, 0xc420583fb0, 0x17fc840, 0xc420583fc0)
    /go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:852 +0x30a
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).Execute(0xc420416c80, 0xc420416c80, 0x1c91a00)
    /go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
    /go/src/github.com/docker/cli/cmd/docker/docker.go:180 +0xdc

goroutine 5 [syscall]:
os/signal.signal_recv(0x0)
    /usr/local/go/src/runtime/sigqueue.go:139 +0xa6
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 52 [chan receive]:
github.com/docker/cli/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x27c0f20)
    /go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:882 +0x8b
created by github.com/docker/cli/vendor/github.com/golang/glog.init.0
    /go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:410 +0x203

goroutine 114 [select]:
net/http.(*persistConn).readLoop(0xc4206da000)
    /usr/local/go/src/net/http/transport.go:1717 +0x743
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1237 +0x95a

goroutine 115 [runnable]:
internal/poll.(*FD).writeUnlock(0xc4206e4080)
    /usr/local/go/src/internal/poll/fd_mutex.go:246 +0x5a
internal/poll.(*FD).Write(0xc4206e4080, 0xc4206f7000, 0x50, 0x1000, 0x50, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:261 +0x2de
net.(*netFD).Write(0xc4206e4080, 0xc4206f7000, 0x50, 0x1000, 0x6, 0x6, 0xc4204a2f88)
    /usr/local/go/src/net/fd_unix.go:220 +0x4f
net.(*conn).Write(0xc420400018, 0xc4206f7000, 0x50, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:188 +0x6a
net/http.persistConnWriter.Write(0xc4206da000, 0xc4206f7000, 0x50, 0x1000, 0x6, 0x1aecc32, 0x3)
    /usr/local/go/src/net/http/transport.go:1253 +0x52
bufio.(*Writer).Flush(0xc420356040, 0x1c8f0e0, 0xc420356040)
    /usr/local/go/src/bufio/bufio.go:573 +0x7e
net/http.(*persistConn).writeLoop(0xc4206da000)
    /usr/local/go/src/net/http/transport.go:1838 +0x38d
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1238 +0x97f

2020-11-02 06:48:41 [INFO ] => (Server) : {'title': 'task_heartbeat_update', 'data': {'uuid': 'uca0661a'}, 'host': 'hpc1'}
Error response from daemon: Container 1b94f2b8ce91f6276c62814e8e
```
:::

:::info
After deleting all projects, almost 200 containers failed and were left dead.
:::
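To clear out the containers that end up dead after a run like this, a bulk force-remove via the docker SDK can help (a sketch; which states to filter on is an assumption, not taken from the runs above):

```python=
import docker
from docker.errors import APIError

cli = docker.from_env()
# list(all=True) also returns non-running containers; filter by state
for status in ('dead', 'exited'):
    for container in cli.containers.list(all=True, filters={'status': status}):
        try:
            container.remove(force=True)
        except APIError as err:
            print('skip {}: {}'.format(container.name, err))
```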
The following cases patch `mlsteam_agent/lab.py` to narrow down which part of the agent's per-lab loop contributes to the load.

### Case 1
```python=414
# mlsteam_agent/lab.py
        if not self.wait_alive():
            raise Exception('wait alive failed')
        self.post_run()
        self.set_status(RunState.RUN)
        ### add: park the lab here and skip the monitor loop below
        while True:
            time.sleep(1)
        return
        ###
        while self.docker_process.poll() is None:
            self.heartbeat()
            if self.should_stop.is_set():
                self.set_status(RunState.SAVING)
                self.close(timeout=STOP_TIMEOUT, err=False)
                break
            if len(self.commit_que):
                self.set_status(RunState.SAVING)
                self.commit()
                self.set_status(RunState.RUN)
            time.sleep(1)
```

:::warning
Docker did not core dump this time, but the agent gets stuck after about 500 containers and the socket.io connection drops.
:::

### Case 2
```python=414
# mlsteam_agent/lab.py
        if not self.wait_alive():
            raise Exception('wait alive failed')
        self.post_run()
        self.set_status(RunState.RUN)
        while self.docker_process.poll() is None:
            self.heartbeat()
            if self.should_stop.is_set():
                self.set_status(RunState.SAVING)
                self.close(timeout=STOP_TIMEOUT, err=False)
                break
            ### Remove: skip committing entirely
            '''
            if len(self.commit_que):
                self.set_status(RunState.SAVING)
                self.commit()
                self.set_status(RunState.RUN)
            '''
            ###
            time.sleep(1)
```

:::warning
It shows a lot of error messages after 900 containers.
:::

:::danger
docker : Error response from daemon: removal of container ### is already in progress
![](https://i.imgur.com/KNko1s6.png)
:::

### Case 3
```python=156
# mlsteam_agent/lab.py
        with self.commit_que_lock:
            while True:
                try:
                    image, tag = self.commit_que.pop()
                    logger.debug("[Lab.commit] : lab {} start commit {}:{}".format(self.uuid, image, tag))
                    self.sync()
                    if image_obj is None:
                        self.send_container_size()
                        cli.containers.get(self.uuid).commit(image, tag)
                        image_obj = cli.images.get(image + ':' + tag)
                    else:
                        image_obj.tag(image + ':' + tag)
                    logger.debug("[Lab.commit] : lab {} image saved: {}:{}".format(
                        self.uuid, image, tag))
                    if self.repository:
                        self.set_status(RunState.PUSH)
                        logger.debug("[lab.push] : lab {} pushing...".format(self.uuid))
                        from mlsteam_agent.imagepusher import ImagePusher
                        username = self.repository.get('username')
                        password = self.repository.get('password')
                        server = self.repository.get('server')
                        uuid = "{}{}".format("p", self.uuid)
                        auth = {'username': username, 'password': password}
                        data = {
                            'uuid': uuid,
                            'server': server,
                            'image': image,
                            'tag': tag,
                            'auth': auth
                        }
                        ### Remove: skip pushing the committed image
                        '''
                        imgpush = ImagePusher(agent=self.agent, **data)
                        self.agent.runs[imgpush.uuid] = imgpush
                        if not imgpush.run():
                            raise Exception('push failed')
                        logger.debug("[lab.push] : lab {} pushed".format(self.uuid))
                        '''
                        ###
```

:::warning
After 700 containers, Python catches:
`error waiting for container: context canceled"\nError response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \\"rootfs_linux.go:109: jailing process inside rootfs caused \\\\\\"pivot_root invalid argument\\\\\\"\\"": unknown\n`
:::

:::warning
After 800 containers
```
2020-11-02 06:41:07 [DEBUG] b'time="2020-11-02T06:40:58Z" level=error msg="error waiting for container: context canceled"\nError response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \\"rootfs_linux.go:109: jailing process inside rootfs caused \\\\\\"pivot_root invalid argument\\\\\\"\\"": unknown\n'
2020-11-02 06:43:08 [ERROR] [Lab.wait_alive] : lab ue0d23ac response timeout=15.
2020-11-02 06:43:08 [DEBUG] b'time="2020-11-02T06:42:59Z" level=error msg="error waiting for container: context canceled"\nError response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \\"process_linux.go:432: running prestart hook 1 caused \\\\\\"error running hook: exit status 2, stdout: , stderr: runtime: failed to create new OS thread (have 2 already; errno=11)\\\\\\\\nruntime: may need to increase max user processes (ulimit -u)\\\\\\\\nfatal error: newosproc\\\\\\\\n\\\\\\\\nruntime stack:\\\\\\\\nruntime.throw(0x53e895, 0x9)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/panic.go:1116 +0x72\\\\\\\\nruntime.newosproc(0xc00002e000)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/os_linux.go:161 +0x1ba\\\\\\\\nruntime.newm1(0xc00002e000)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:1753 +0xdc\\\\\\\\nruntime.newm(0x548970, 0x0)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:1732 +0x8f\\\\\\\\nruntime.main.func1()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:134 +0x36\\\\\\\\nruntime.systemstack(0x45bd74)\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/asm_amd64.s:370 +0x66\\\\\\\\nruntime.mstart()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:1041\\\\\\\\n\\\\\\\\ngoroutine 1 [running]:\\\\\\\\nruntime.systemstack_switch()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc00002a788 sp=0xc00002a780 pc=0x45be70\\\\\\\\nruntime.main()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/proc.go:133 +0x70 fp=0xc00002a7e0 sp=0xc00002a788 pc=0x4332d0\\\\\\\\nruntime.goexit()\\\\\\\\n\\\\\\\\t/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00002a7e8 sp=0xc00002a7e0 pc=0x45de01\\\\\\\\n\\\\\\"\\"": unknown\n'
2020-11-02 06:43:08 [DEBUG] [lab.run] : stop lab ue0d23ac, exception, wait alive failed
```
:::

:::danger
After 840 containers
```
  File "/usr/lib/python3/dist-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.40/containers/68d86854380b0b4111cd5dd5467499d3217a577ecf753272237b3d660eb5601f?v=False&link=False&force=False

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
  File "/build/mlsteam_agent/mlsteam_agent/lab.py", line 451, in run
    self.remove()
  File "/build/mlsteam_agent/mlsteam_agent/lab.py", line 133, in remove
    raise e
  File "/build/mlsteam_agent/mlsteam_agent/lab.py", line 130, in remove
    docker.from_env().containers.get(self.uuid).remove()
  File "/usr/local/lib/python3.6/dist-packages/docker/models/containers.py", line 351, in remove
    return self.client.api.remove_container(self.id, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/container.py", line 1009, in remove_container
    self._raise_for_status(res)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 261, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 409 Client Error: Conflict ("You cannot remove a running container 68d86854380b0b4111cd5dd5467499d3217a577ecf753272237b3d660eb5601f. Stop the container before attempting removal or force remove")
2020-11-02T06:47:28Z <Greenlet at 0x7fb0334fdb48: <bound method Lab.run of <mlsteam_agent.lab.Lab object at 0x7fb0334d5320>>> failed with APIError
```
:::

:::danger
```
After 900 containers
runtime: failed to create new OS thread (have 24 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x1af5fbe, 0x9)
    /usr/local/go/src/runtime/panic.go:616 +0x81
runtime.newosproc(0xc4204eac00, 0xc420526000)
    /usr/local/go/src/runtime/os_linux.go:164 +0x1af
runtime.newm1(0xc4204eac00)
    /usr/local/go/src/runtime/proc.go:1879 +0x113
runtime.newm(0x1bc48b0, 0x0)
    /usr/local/go/src/runtime/proc.go:1858 +0x9b
runtime.startTheWorldWithSema(0x1, 0x4579f3)
    /usr/local/go/src/runtime/proc.go:1155 +0x1d0
runtime.gcMarkTermination.func3()
    /usr/local/go/src/runtime/mgc.go:1647 +0x26
runtime.systemstack(0x0)
    /usr/local/go/src/runtime/asm_amd64.s:409 +0x79
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1175

goroutine 36 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:363 fp=0xc4200bb550 sp=0xc4200bb548 pc=0x453e00
runtime.gcMarkTermination(0x3fed900e52bfd36e)
    /usr/local/go/src/runtime/mgc.go:1647 +0x407 fp=0xc4200bb720 sp=0xc4200bb550 pc=0x4187d7
runtime.gcMarkDone()
    /usr/local/go/src/runtime/mgc.go:1513 +0x22c fp=0xc4200bb748 sp=0xc4200bb720 pc=0x41836c
runtime.gcBgMarkWorker(0xc420053400)
    /usr/local/go/src/runtime/mgc.go:1912 +0x2e7 fp=0xc4200bb7d8 sp=0xc4200bb748 pc=0x4192e7
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200bb7e0 sp=0xc4200bb7d8 pc=0x456811
created by runtime.gcBgMarkStartWorkers
    /usr/local/go/src/runtime/mgc.go:1723 +0x79

goroutine 1 [runnable]:
net/http.(*persistConn).roundTrip(0xc4206da000, 0xc420686e70, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:2033 +0x5a7
net/http.(*Transport).RoundTrip(0xc4201260f0, 0xc420688500, 0xc4201260f0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:422 +0x8f2
net/http.send(0xc420688500, 0x1c91720, 0xc4201260f0, 0x0, 0x0, 0x0, 0xc42000c870, 0xc42024d710, 0xc4206a36c8, 0x1)
    /usr/local/go/src/net/http/client.go:252 +0x185
net/http.(*Client).send(0xc420686d50, 0xc420688500, 0x0, 0x0, 0x0, 0xc42000c870, 0x0, 0x1, 0x417e1a)
    /usr/local/go/src/net/http/client.go:176 +0xfa
net/http.(*Client).Do(0xc420686d50, 0xc420688500, 0xc42003c0c0, 0xc420688500, 0xc420023c20)
    /usr/local/go/src/net/http/client.go:615 +0x28d
github.com/docker/cli/vendor/golang.org/x/net/context/ctxhttp.Do(0x1cc40c0, 0xc42003c0c0, 0xc420686d50, 0xc420688400, 0x0, 0x27c1be0, 0x2)
    /go/src/github.com/docker/cli/vendor/golang.org/x/net/context/ctxhttp/ctxhttp.go:30 +0x6e
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).doRequest(0xc42023de80, 0x1cc40c0, 0xc42003c0c0, 0xc420688400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/docker/cli/vendor/github.com/docker/docker/client/request.go:132 +0xbe
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).Ping(0xc42023de80, 0x1cc40c0, 0xc42003c0c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/docker/cli/vendor/github.com/docker/docker/client/ping.go:17 +0x17c
github.com/docker/cli/cli/command.(*DockerCli).initializeFromClient(0xc420411b00)
    /go/src/github.com/docker/cli/cli/command/cli.go:213 +0x8b
github.com/docker/cli/cli/command.(*DockerCli).Initialize(0xc420411b00, 0xc420481400, 0xc420686930, 0x3)
    /go/src/github.com/docker/cli/cli/command/cli.go:197 +0x192
main.newDockerCommand.func2(0xc420672500, 0xc420686930, 0x3, 0x3, 0x0, 0x0)
    /go/src/github.com/docker/cli/cmd/docker/docker.go:43 +0x71
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).execute(0xc420672500, 0xc420038160, 0x3, 0x3, 0xc420672500, 0xc420038160)
    /go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:741 +0x579
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc420416c80, 0xc420583fb0, 0x17fc840, 0xc420583fc0)
    /go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:852 +0x30a
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).Execute(0xc420416c80, 0xc420416c80, 0x1c91a00)
    /go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
    /go/src/github.com/docker/cli/cmd/docker/docker.go:180 +0xdc

goroutine 5 [syscall]:
os/signal.signal_recv(0x0)
    /usr/local/go/src/runtime/sigqueue.go:139 +0xa6
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 52 [chan receive]:
github.com/docker/cli/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x27c0f20)
    /go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:882 +0x8b
created by github.com/docker/cli/vendor/github.com/golang/glog.init.0
    /go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:410 +0x203

goroutine 114 [select]:
net/http.(*persistConn).readLoop(0xc4206da000)
    /usr/local/go/src/net/http/transport.go:1717 +0x743
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1237 +0x95a

goroutine 115 [runnable]:
internal/poll.(*FD).writeUnlock(0xc4206e4080)
    /usr/local/go/src/internal/poll/fd_mutex.go:246 +0x5a
internal/poll.(*FD).Write(0xc4206e4080, 0xc4206f7000, 0x50, 0x1000, 0x50, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:261 +0x2de
net.(*netFD).Write(0xc4206e4080, 0xc4206f7000, 0x50, 0x1000, 0x6, 0x6, 0xc4204a2f88)
    /usr/local/go/src/net/fd_unix.go:220 +0x4f
net.(*conn).Write(0xc420400018, 0xc4206f7000, 0x50, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:188 +0x6a
net/http.persistConnWriter.Write(0xc4206da000, 0xc4206f7000, 0x50, 0x1000, 0x6, 0x1aecc32, 0x3)
    /usr/local/go/src/net/http/transport.go:1253 +0x52
bufio.(*Writer).Flush(0xc420356040, 0x1c8f0e0, 0xc420356040)
    /usr/local/go/src/bufio/bufio.go:573 +0x7e
net/http.(*persistConn).writeLoop(0xc4206da000)
    /usr/local/go/src/net/http/transport.go:1838 +0x38d
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1238 +0x97f

2020-11-02 06:48:41 [INFO ] => (Server) : {'title': 'task_heartbeat_update', 'data': {'uuid': 'uca0661a'}, 'host': 'hpc1'}
Error response from daemon: Container 1b94f2b8ce91f6276c62814e8e
```
:::

:::info
After deleting all projects, almost 200 containers failed and were left dead.
:::

## SystemCtl

Default agent.service state:
```bash
Loaded: loaded (/etc/systemd/system/mlsteam_agent_100.74.53.2_10004.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2020-11-03 03:35:08 UTC; 29min ago
Main PID: 5057 (python3)
Tasks: 7356 (limit: 7372)
```

```bash=
Type=simple
Restart=always
NotifyAccess=none
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
PermissionsStartOnly=no
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
MainPID=0
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=timeout
UID=[not set]
GID=[not set]
NRestarts=0
ExecMainStartTimestamp=Mon 2020-11-02 08:06:09 UTC
ExecMainStartTimestampMonotonic=425474543069
ExecMainExitTimestamp=Mon 2020-11-02 08:23:26 UTC
ExecMainExitTimestampMonotonic=426511587165
ExecMainPID=44247
ExecMainCode=2
ExecMainStatus=9
ExecStart={ path=/usr/bin/python3 ; argv[]=/usr/bin/python3 -m mlsteam_agent.cli -c /etc/mlsteam_agent/mlsteam_agent_100.74.53.2_10004.ini ; ignore_errors=no ; start_time=[Mon 2020-11-02 08:06:09 UTC] ; stop_t
Slice=system.slice
MemoryCurrent=[not set]
CPUUsageNSec=[not set]
TasksCurrent=[not set]
IPIngressBytes=18446744073709551615
IPIngressPackets=18446744073709551615
IPEgressBytes=18446744073709551615
IPEgressPackets=18446744073709551615
Delegate=no
CPUAccounting=no
CPUWeight=[not set]
StartupCPUWeight=[not set]
CPUShares=[not set]
StartupCPUShares=[not set]
CPUQuotaPerSecUSec=infinity
IOAccounting=no
IOWeight=[not set]
StartupIOWeight=[not set]
BlockIOAccounting=no
BlockIOWeight=[not set]
StartupBlockIOWeight=[not set]
MemoryAccounting=no
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
DevicePolicy=auto
TasksAccounting=yes
TasksMax=7372
IPAccounting=no
Environment=PYTHONUNBUFFERED=1
UMask=0022
LimitCPU=infinity
LimitCPUSoft=infinity
LimitFSIZE=infinity
LimitFSIZESoft=infinity
LimitDATA=infinity
LimitDATASoft=infinity
LimitSTACK=infinity
LimitSTACKSoft=8388608
LimitCORE=infinity
LimitCORESoft=0
LimitRSS=infinity
LimitRSSSoft=infinity
LimitNOFILE=4096
LimitNOFILESoft=1024
LimitAS=infinity
LimitASSoft=infinity
LimitNPROC=1030855
LimitNPROCSoft=1030855
LimitMEMLOCK=16777216
LimitMEMLOCKSoft=16777216
LimitLOCKS=infinity
LimitLOCKSSoft=infinity
LimitSIGPENDING=1030855
LimitSIGPENDINGSoft=1030855
LimitMSGQUEUE=819200
LimitMSGQUEUESoft=819200
LimitNICE=0
LimitNICESoft=0
LimitRTPRIO=0
LimitRTPRIOSoft=0
LimitRTTIME=infinity
LimitRTTIMESoft=infinity
OOMScoreAdjust=0
Nice=0
IOSchedulingClass=0
IOSchedulingPriority=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardInputData=
StandardOutput=journal
StandardError=inherit
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SyslogLevel=6
SyslogFacility=3
LogLevelMax=-1
SecureBits=0
CapabilityBoundingSet=cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable cap_net_bind_service cap_net_broadcast cap_net_admin ca
AmbientCapabilities=
DynamicUser=no
RemoveIPC=no
MountFlags=
PrivateTmp=no
PrivateDevices=no
ProtectKernelTunables=no
ProtectKernelModules=no
ProtectControlGroups=no
PrivateNetwork=no
PrivateUsers=no
ProtectHome=no
ProtectSystem=no
SameProcessGroup=no
UtmpMode=init
IgnoreSIGPIPE=yes
NoNewPrivileges=no
SystemCallErrorNumber=0
LockPersonality=no
RuntimeDirectoryPreserve=no
RuntimeDirectoryMode=0755
StateDirectoryMode=0755
CacheDirectoryMode=0755
LogsDirectoryMode=0755
ConfigurationDirectoryMode=0755
MemoryDenyWriteExecute=no
RestrictRealtime=no
RestrictSUIDSGID=no
RestrictNamespaces=no
MountAPIVFS=no
KeyringMode=private
KillMode=control-group
KillSignal=15
SendSIGKILL=yes
SendSIGHUP=no
Id=mlsteam_agent_100.74.53.2_10004.service
Names=mlsteam_agent_100.74.53.2_10004.service
Requires=sysinit.target system.slice
WantedBy=graphical.target
Conflicts=shutdown.target
Before=graphical.target shutdown.target
After=systemd-journald.socket basic.target sysinit.target system.slice
Description=MLSteam worker Agent
LoadState=loaded
ActiveState=failed
SubState=failed
FragmentPath=/etc/systemd/system/mlsteam_agent_100.74.53.2_10004.service
UnitFileState=enabled
UnitFilePreset=enabled
StateChangeTimestamp=Mon 2020-11-02 08:23:26 UTC
StateChangeTimestampMonotonic=426511587322
InactiveExitTimestamp=Mon 2020-11-02 08:06:09 UTC
InactiveExitTimestampMonotonic=425474543147
ActiveEnterTimestamp=Mon 2020-11-02 08:06:09 UTC
ActiveEnterTimestampMonotonic=425474543147
ActiveExitTimestamp=Mon 2020-11-02 08:21:56 UTC
ActiveExitTimestampMonotonic=426421553033
InactiveEnterTimestamp=Mon 2020-11-02 08:23:26 UTC
InactiveEnterTimestampMonotonic=426511587322
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=no
NeedDaemonReload=no
JobTimeoutUSec=infinity
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none
ConditionResult=yes
AssertResult=yes
ConditionTimestamp=Mon 2020-11-02 08:06:09 UTC
ConditionTimestampMonotonic=425474538816
AssertTimestamp=Mon 2020-11-02 08:06:09 UTC
AssertTimestampMonotonic=425474538816
Transient=no
Perpetual=no
StartLimitIntervalUSec=10s
StartLimitBurst=5
StartLimitAction=none
FailureAction=none
SuccessAction=none
InvocationID=a1e86ac1404c4b50a7ade1ff53093e50
CollectMode=inactive
```

![](https://i.imgur.com/nmJPqG0.png)
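`Tasks: 7356 (limit: 7372)` together with `Result=timeout` / `ActiveState=failed` above suggests the agent unit is hitting its systemd task limit, and the soft `LimitNOFILE` of 1024 is also low for this load. One way to raise these is a drop-in override for the unit (a sketch; the values are illustrative and not tested here):

```bash
# opens /etc/systemd/system/mlsteam_agent_100.74.53.2_10004.service.d/override.conf
sudo systemctl edit mlsteam_agent_100.74.53.2_10004.service
# add:
#   [Service]
#   TasksMax=infinity
#   LimitNOFILE=65536
#   LimitNPROC=infinity
sudo systemctl daemon-reload
sudo systemctl restart mlsteam_agent_100.74.53.2_10004.service
```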