Troubleshooting¶
Check if docker swarm/stack runs successfully¶
docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
j7b533zxmzg5 gnes-swarm-2654_encoder replicated 0/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:master
0vlxu4acg1ph gnes-swarm-2654_income-proxy replicated 0/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:master *:4962->4962/tcp
equqrhsn7pky gnes-swarm-2654_indexer replicated 0/3 ccr.ccs.tencentyun.com/gnes/aipd-gnes:master
nd7euo7mcpa9 gnes-swarm-2654_middleman-proxy replicated 0/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:master
ssdlk9gzmggw gnes-swarm-2654_outgoing-proxy replicated 0/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:master *:4963->4963/tcp
xgxeetyhos6t my-gnes_encoder replicated 1/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:a799a0f
zny37400p225 my-gnes_income-proxy replicated 1/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:a799a0f *:8598->8598/tcp
taqqg6qwrxlw my-gnes_indexer replicated 3/3 ccr.ccs.tencentyun.com/gnes/aipd-gnes:a799a0f
j96gnny8ysbn my-gnes_middleman-proxy replicated 1/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:a799a0f
e28spnuksjw8 my-gnes_outgoing-proxy replicated 1/1 ccr.ccs.tencentyun.com/gnes/aipd-gnes:a799a0f *:8599->8599/tcp
In the above example, we started two swarms, i.e. gnes-swarm-2654
and my-gnes
. Unfortunately, gnes-swarm-2654
fails to start and is not running at all. But how can one tell that?
Note the column REPLICAS
, which indicates the number of running service (versus the number of required services). gnes-swarm-2654
gives 0/0
for all services. This suggests the swarm fails to start. The next step is to investigate the reason.
Investigate the reason of a failed service¶
One can not print out all logs of a docker swarm. Instead, one can inspect service by service, e.g.
docker service ps gnes-swarm-2654_encoder --format "{{json .Error}}" --no-trunc
"\"invalid mount config for type \"bind\": bind source path does not exist: /data/han/test-shell/output_data\""
"\"invalid mount config for type \"bind\": bind source path does not exist: /data/han/test-shell/output_data\""
"\"invalid mount config for type \"bind\": bind source path does not exist: /data/han/test-shell/output_data\""
"\"invalid mount config for type \"bind\": bind source path does not exist: /data/han/test-shell/output_data\""
Now the reason is clear, output_data
does not exist when starting the swarm. But why there are duplicated lines there? This is because docker swarm did three retries before giving up on starting this service, where each time it met the same problem. Thus four duplicated lines in total.
Delete a failed service¶
Now that the reason is clear, we can delete the failed service and release the resources.
docker stack rm gnes-swarm-2654
Removing service gnes-swarm-2654_encoder
Removing service gnes-swarm-2654_income-proxy
Removing service gnes-swarm-2654_indexer
Removing service gnes-swarm-2654_middleman-proxy
Removing service gnes-swarm-2654_outgoing-proxy
Removing network gnes-swarm-2654_gnes-net
Locate internal errors by looking at logs¶
Sometime the service fails to start but docker service ps
gives no error,
docker service ps gnes-swarm-4254_encoder --format "{{json .Error}}" --no-trunc
""
Or it shows an error that is not explanatory.
"\"task: non-zero exit (2)\""
Often in this case, the service fails to start not due tothe docker config, but due to the GNES internal error. To see that,
docker service logs gnes-swarm-4254_income-proxy
gnes-swarm-4254_income-proxy.1.yj5v8n4dhfgv@VM-0-3-ubuntu | [--proxy_type {BS,Dict,MapProxyService,Message,MessageHandler,ProxyService,ReduceProxyService,defaultdict}]
gnes-swarm-4254_income-proxy.1.yj5v8n4dhfgv@VM-0-3-ubuntu | [--batch_size BATCH_SIZE] [--num_part NUM_PART]
gnes-swarm-4254_income-proxy.1.kmgk21qo6m0n@VM-0-3-ubuntu | [--proxy_type {BS,Dict,MapProxyService,Message,MessageHandler,ProxyService,ReduceProxyService,defaultdict}]
gnes-swarm-4254_income-proxy.1.w04d552cuj93@VM-0-3-ubuntu | gnes proxy: error: argument --batch_size: invalid int value: ''
gnes-swarm-4254_income-proxy.1.kmgk21qo6m0n@VM-0-3-ubuntu | [--batch_size BATCH_SIZE] [--num_part NUM_PART]
One can now clearly see that the error comes from an incorrectly given --batch_size
, which throws from GNES CLI.