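As the title says: after a multi-node training run fails, a libai process stays alive on the non-master node and keeps printing logs to the console. The logs look like this: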
Got it, we will try to reproduce the problem.
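Hello, by "multi-node training failed", do you mean that you stopped the program manually with CTRL + C, or that it failed because the code raised an exception?
I ran the 2-node BERT demo from https://libai.readthedocs.io/en/latest/tutorials/get_started/quick_run.html on my side. After CTRL + C, once the master (node0) exits, the program on node1 terminates normally.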
It failed because the code raised an exception.