
tokens routing #8

Open · wkml opened this issue Mar 4, 2024 · 6 comments

@wkml commented Mar 4, 2024

Thanks for your work! It is very valuable! I would like to know how you reached your conclusion about token routing. Since the input is affected by attention and RoPE, it does not seem logical that there should be a fixed routing for each token. How should I reproduce your results for this part?

@ilyalasy commented Mar 8, 2024

Hey! Not a paper author here, but I'm currently working on reproducing the results of the OpenMoE paper, specifically on token routing.
Take a look: https://github.com/Misterion777/moe-routing/blob/main/notebooks/routing_eda.ipynb
Would appreciate any collaboration!

Also would be grateful for a review from paper author @XueFuzhao whether what I'm doing makes sense.

@wkml (Author) commented Mar 8, 2024 via email

@XueFuzhao (Owner)

Thank you for your interest!!
May I know whether your OpenMoE can generate readable sentences?

@XueFuzhao (Owner)

My analysis code is a bit dirty, but in general the core code is in this file: https://github.com/XueFuzhao/OpenMoE/blob/main/analysis/colossalai_replace/layer.py
You can compare ColossalAI's SparseMLP class with mine, and you will see the difference.

I went through your code very quickly (sorry, I'm totally overwhelmed these days). My two concerns:

  1. The context-independent specialization is not that clear. I am not sure whether the output sentences are normal; if not, the model may have some bugs, e.g. the checkpoint loading may not be correct. Please have a check on the model output.
  2. In your hook code, it seems that you are using the argmax value directly? However, the routing decision depends on both the argmax value and the expert capacity, so a more reliable implementation is to check the actual routing decision, e.g. this line (see the sketch below):
    "dispatch_mask": dispatch_mask_np.tolist(),

Thanks again for your interest! Looking forward to your results on other MoE models like Mistral and Deepseek-MoE. That would be very interesting.
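The dispatch-mask point in item 2 can be illustrated with a small, self-contained toy. This is not the paper's analysis code; the capacity-limited top-1 routing, the function name, and the confidence-based tie-breaking order below are assumptions for illustration only:

```python
# Sketch: why argmax alone is not the routing decision.
# Toy top-1 router with an expert-capacity limit: tokens whose chosen expert
# is already full get dropped, so the dispatch mask (not the raw argmax) tells
# you where each token was actually sent.
import torch

def route_top1_with_capacity(logits: torch.Tensor, capacity: int):
    """logits: (num_tokens, num_experts) router scores.
    Returns (argmax_choice, dispatch_mask)."""
    num_tokens, num_experts = logits.shape
    choice = logits.argmax(dim=-1)                     # what plain argmax reports
    dispatch_mask = torch.zeros(num_tokens, num_experts, dtype=torch.bool)
    load = torch.zeros(num_experts, dtype=torch.long)  # tokens assigned so far
    # Tokens are processed here by router confidence; real implementations may
    # prioritize differently (e.g. by sequence position) -- this is an assumption.
    order = logits.max(dim=-1).values.argsort(descending=True)
    for t in order.tolist():
        e = int(choice[t])
        if load[e] < capacity:          # expert still has room
            dispatch_mask[t, e] = True
            load[e] += 1
        # otherwise the token is dropped: argmax says expert e, dispatch says nowhere
    return choice, dispatch_mask

torch.manual_seed(0)
logits = torch.randn(16, 4)             # 16 tokens, 4 experts
choice, mask = route_top1_with_capacity(logits, capacity=3)
dropped = int((~mask.any(dim=-1)).sum())
print("argmax choices:", choice.tolist())
print("tokens dropped by the capacity limit:", dropped)
```

In an actual analysis, the idea would be to record the layer's own dispatch mask (e.g. via a forward hook) rather than recomputing argmax from the router logits.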

@wkml (Author) commented Mar 13, 2024

> Hey! Not a paper author here, but I'm currently working on reproducing the results of the OpenMoE paper, specifically on token routing. Take a look: https://github.com/Misterion777/moe-experiments/blob/main/notebooks/routing_eda.ipynb Would appreciate any collaboration!
>
> Also would be grateful for a review from paper author @XueFuzhao whether what I'm doing makes sense.

Thanks for your code! I have run into some tricky things recently, so I have had less energy to push this research forward. I will study your code carefully, and thank you for your efforts. Thank you all! @Misterion777 @XueFuzhao

@ilyalasy
> My analysis code is a bit dirty, but in general the core code is in this file: https://github.com/XueFuzhao/OpenMoE/blob/main/analysis/colossalai_replace/layer.py You can compare ColossalAI's SparseMLP class with mine, and you will see the difference.
>
> I went through your code very quickly (sorry, I'm totally overwhelmed these days). My two concerns:
>
> 1. The context-independent specialization is not that clear. I am not sure whether the output sentences are normal; if not, the model may have some bugs, e.g. the checkpoint loading may not be correct. Please have a check on the model output.
> 2. In your hook code, it seems that you are using the argmax value directly? However, the routing decision depends on both the argmax value and the expert capacity, so a more reliable implementation is to check the actual routing decision, e.g. this line:
>    "dispatch_mask": dispatch_mask_np.tolist(),
>
> Thanks again for your interest! Looking forward to your results on other MoE models like Mistral and Deepseek-MoE. That would be very interesting.

I changed the hook so that it now takes expert capacity into consideration.
Besides, the ColossalAI checkpoint is indeed buggy and doesn't output valid text, so I am using the OrionZheng/... checkpoint instead.
Now the plot looks much more similar to what you reported in the paper.
Thank you very much for your help!
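For anyone reproducing this, a minimal sketch of the kind of output sanity check discussed above, assuming the checkpoint is published on the Hugging Face Hub and loads with transformers; the model id below is a placeholder, not the actual repository name:

```python
# Sketch: verify the checkpoint generates readable text before trusting any
# routing statistics. "your-org/your-openmoe-checkpoint" is a placeholder;
# substitute the Hugging Face repository you are actually using.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-openmoe-checkpoint"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

If the decoded text is gibberish, the routing plots are not meaningful, which is the point raised in concern 1 above.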
