Multi-slicing training jobs fail due to Ray timeout error #684

Open
Ivan-Zhou opened this issue Aug 6, 2024 · 0 comments

Of the 4 jobs that were launched, 3 crashed in the middle of training, despite the babysitting workflow.

Two of them failed due to a Ray timeout (see below). My suspicion is that multi-slicing introduces more complexity to babysitting: if one slice is down but the main one is up, babysitting won't be triggered, and the training job will be stuck until it fails with a Ray timeout error.

  1. https://wandb.ai/stanford-mercury/marin/runs/ttt-1b-llama-tok-us-central2-b-v4-256-d66a86ad-0803/logs
  2. https://wandb.ai/stanford-mercury/marin/runs/ttt-1b-llama-tok-us-central2-b-v4-256-ea56bf3b-0803/logs
(raylet) The node with node id: 666057938a8d86afc6080e71a6d3e58791a056c9364e63c8f4e55a9b and address: 10.130.1.251 and node name: 10.130.1.251 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a  (1) raylet crashes unexpectedly (OOM, etc.)
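
To make the suspected failure mode concrete, here is a minimal sketch of the difference between a babysitter that only watches the main slice and one that watches every slice. This is an illustration under assumptions, not the project's actual babysitting code: `SliceStatus`, `main_only_needs_restart`, and `any_slice_needs_restart` are hypothetical names, and how the health of each slice would really be obtained is left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SliceStatus:
    """Hypothetical health record for one slice (not a Ray or project type)."""
    name: str
    healthy: bool


def main_only_needs_restart(statuses: Sequence[SliceStatus]) -> bool:
    """Suspected current behavior: only the first (main) slice is checked."""
    return not statuses[0].healthy


def unhealthy_slices(statuses: Sequence[SliceStatus]) -> List[str]:
    """Return the names of all slices reported as unhealthy."""
    return [s.name for s in statuses if not s.healthy]


def any_slice_needs_restart(statuses: Sequence[SliceStatus]) -> bool:
    """Possible fix: trigger babysitting if any slice is down."""
    return len(unhealthy_slices(statuses)) > 0


# Scenario from this issue: the main slice is up, a secondary slice is down.
statuses = [
    SliceStatus(name="slice-0", healthy=True),   # main slice
    SliceStatus(name="slice-1", healthy=False),  # secondary slice
]
main_only_result = main_only_needs_restart(statuses)
all_slices_result = any_slice_needs_restart(statuses)
```

In this scenario the main-only check reports that no restart is needed, so the job would hang until Ray marks the node dead, while the all-slices check flags `slice-1` and would trigger babysitting right away.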