Llama 3 #565

Open
peregilk opened this issue Apr 25, 2024 · 21 comments
@peregilk

Do you have any plans for adding support for Llama 3? Any idea how complex this would be, apart from new configs?

@dlwh
Member

dlwh commented Apr 25, 2024

@Helw150 said it worked out of the box. Just configs I think

@peregilk
Author

That's fantastic, @dlwh! It would be great if you could share your configs, @Helw150.

I must admit I have not dug into the details here yet, but I understood the biggest architectural changes were a larger tokenizer and adding GQA to the smaller models. I haven't seen GQA used in any of the Levanter models, but I found a post saying it is supported. Can this also just be enabled through the configs?

I also read a post about them doing some masking on longer sequences so that the attention did not "spill over" to new documents.
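
For my own understanding, grouped-query attention just shares each key/value head across several query heads. Here is a rough numpy sketch of the shapes involved (head counts taken from the 8B config; this is only an illustration, not Levanter's actual implementation):

# Toy grouped-query attention: 32 query heads share 8 key/value heads,
# so each KV head serves num_heads // num_kv_heads = 4 query heads.
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 32, 8, 128, 16
q = np.random.randn(num_heads, seq, head_dim)
k = np.random.randn(num_kv_heads, seq, head_dim)
v = np.random.randn(num_kv_heads, seq, head_dim)

q_per_group = num_heads // num_kv_heads
k_full = np.repeat(k, q_per_group, axis=0)  # broadcast each KV head to its query group
v_full = np.repeat(v, q_per_group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full  # (num_heads, seq, head_dim)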

@peregilk
Author

The model seems to start training with:

data:  
  tokenizer: "meta-llama/Meta-Llama-3-8B"
model:
  type: llama
initialize_from_hf: "meta-llama/Meta-Llama-3-8B"
use_hf_model_config: true

However, I keep getting the message: "The tokenizers appear to be different. You may want to check this.".

Not really sure what is causing this.
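
A quick sanity check outside Levanter (plain transformers) might at least confirm whether the tokenizer and the HF model config agree on vocab size; this is just a diagnostic sketch, not something Levanter itself runs:

# Check that the Llama 3 tokenizer and model config agree on vocab size.
from transformers import AutoConfig, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(len(tok), cfg.vocab_size)  # both should be 128256 for Llama 3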

@peregilk
Author

@dlwh Unfortunately, I cannot seem to get it to work right out of the box. The model is training, but when training on a domain-specific corpus, the loss starts way too high and never fully recovers.

I am pretty sure the issue is the vocab size. I cannot seem to override the vocab size in the model config.

This line seems to return the default Llama tokenizer:

converter = config.model.default_hf_checkpoint_converter

Although it is overwritten later, I think this is the main issue.

I have tried both reading the configs from HF and creating them from scratch.

Please advise.
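
If the default converter really does fall back to the original Llama tokenizer, the mismatch would be dramatic. A rough comparison (repo names are the usual HF ones; this is only meant to illustrate the size gap):

# Compare the original Llama tokenizer (~32k vocab) with Llama 3's (~128k vocab).
from transformers import AutoTokenizer

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(len(llama2), len(llama3))  # roughly 32000 vs 128256

text = "Grouped-query attention and a much bigger vocabulary."
print(llama2(text)["input_ids"])
print(llama3(text)["input_ids"])  # different ids, typically fewer tokens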

@dlwh
Member

dlwh commented Apr 26, 2024

OK, I'll try to take a look this weekend. Do you have a full config I can use as a reproducer, by any chance?

@peregilk
Author

peregilk commented Apr 27, 2024

Awesome. Here is the config I have been using. I have just replaced the URLs.

data:
  train_urls:
    - "gs://mydatabucket/train-shard-{0001..0147}-of-0147.json.gz"
  validation_urls:
    - "gs://mydatabucket/NCC_plus_scandi/validation-shard-0001-of-0001.json.gz"
  cache_dir: "gs://mycachebucket/tokenized/llama3hfconfigfalse/"
  tokenizer: "meta-llama/Meta-Llama-3-8B"
model:
  type: llama
  seq_len: 2048
  hidden_dim: 4096
  intermediate_dim: 14336
  num_layers: 32
  num_heads: 32
  num_kv_heads: 8
  initializer_range: 0.02
  use_flash_attention: true
initialize_from_hf: "meta-llama/Meta-Llama-3-8B"
use_hf_model_config: false
trainer:
  wandb:
    entity: "myentity"
    project: "myproject"
    tags: ["llama3"]
    name: north-llamatre-hfconfigfalse
  mp: p=f32,c=bfloat16
  train_batch_size: 256 
  num_train_steps: 10000
  steps_per_eval: 250
  tensor_parallel_axes: ["mlp", "heads"]
  fsdp_axis: "embed"
  batch_axis: "batch"
  checkpointer:
    base_path: "gs://mycheckpointbucket/north-llama3-hfconfigfalse/checkpoints"
    keep:
      - every: 1000
optimizer:
  learning_rate: 1.2e-5
  weight_decay: 0.1
  min_lr_ratio: 0.1
  warmup: 1000
hf_save_steps: 5000
hf_save_path: "gs://myhfbucket/north-llama3-hfconfigfalse/hf"

I have also tried setting
use_hf_model_config: true

This gave the same result.

What I am seeing can be illustrated here:
[image: loss curves, Mistral (red) vs. Llama 3 (grey)]

The red line is the loss of a Mistral model. The grey line is from Llama 3. Apart from that, the settings are identical, and they are both trying to use the HF tokenizer. The pattern is very similar to what we see when just hot-swapping to a new tokenizer.

@Helw150
Collaborator

Helw150 commented May 1, 2024

Do you have a reproduction of a case where the Levanter implementation gives you a different prediction from the HuggingFace implementation? As an example, here's a round-trip test I used to verify the Whisper implementation:

def test_hf_roundtrip():

The only architectural change in Llama 3 is grouped-query attention, which is supported here:

QHeadsPerGroup = hax.Axis("q_heads_per_group", config.num_heads // config.num_kv_heads)

I've exported a few Llama 3 finetunes from Levanter to HuggingFace successfully and the models seem to work as expected for inference, so it's unclear to me whether the above case suggests a bug or is a function of the much larger vocab size of Llama 3 vs. Mistral. I'm not sure what the data mix is above, but if it's multilingual, it's also likely Mistral starts from a lower loss because it's more explicitly designed for multilinguality.

If you send over a case where HuggingFace and Levanter output different logits for the Llama 3 weights, I'd be happy to take on the debugging from there!
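
Something along these lines would already help (the export path is a placeholder, and note this only checks that weights exported from Levanter reproduce the original HF logits, not Levanter's own forward pass):

# Compare logits from the original HF weights and from a Levanter -> HF export
# of the same (unmodified) Llama 3 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ref = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float32)
exported = AutoModelForCausalLM.from_pretrained("/path/to/levanter/hf_export", torch_dtype=torch.float32)

inputs = tok("The capital of Norway is", return_tensors="pt")
with torch.no_grad():
    diff = (ref(**inputs).logits - exported(**inputs).logits).abs().max()
print(diff)  # should be ~0 if the round trip is faithful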

@peregilk
Author

peregilk commented May 4, 2024

I am trying to debug this and test on downstream tasks by exporting to HF. However, I noticed that for Llama 3, no tokenizer.model file is created when saving to HF. Have you experienced this, @Helw150?

Edit: I see the reason for this is that the HF repo does not contain a tokenizer.model file.
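
(Listing the repo files confirms this; the Llama 3 repo ships tokenizer.json rather than a sentencepiece tokenizer.model:)

# List tokenizer-related files in the Llama 3 repo.
from huggingface_hub import list_repo_files

files = list_repo_files("meta-llama/Meta-Llama-3-8B")
print([f for f in files if "tokenizer" in f])  # expect tokenizer.json, tokenizer_config.json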

@peregilk peregilk closed this as completed May 5, 2024
@peregilk
Author

Reopening this. I have trained a bit more, and I am really not satisfied with the results, even though the train/eval loss looks fine.

Do you have a working Llama 3 config file, @Helw150? I want to double-check whether I have made any mistakes here.

@peregilk peregilk reopened this May 21, 2024
@Helw150
Collaborator

Helw150 commented May 21, 2024

Hi!

My use case is a bit non-standard (training multi-modal encoders), so I'm not sure my configs will help so much. If you want to check them anyway, you can find them on the will/distill branch, tagged with via_*! In these cases, I'm leaving Llama frozen but still need to get gradients from it. I've done runs with both Llama 2 and Llama 3 and haven't seen any surprising-looking issues when switching to Llama 3!

Could you give a bit more detail about the issue you are facing? Does it seem like the model isn't training properly? Or is it that the results aren't satisfactory?

If it's the latter, additional context (e.g. specific symptoms, expected behavior) would help me understand whether there's an underlying bug that could cause this or whether it's a matter of hyperparameters/underlying capabilities!

@dlwh
Member

dlwh commented May 21, 2024

What revision/commit were you using to train? My usage of the TPU splash attention had/has a bug that messed everything up. I'm like 60% sure I know how to fix it (and you can probably fix your checkpoints post-hoc), but I need another day or so. If you want to try something, can you pre-multiply all of the q_proj weights by sqrt(head_dim)? I haven't verified that yet, but I strongly suspect that's the fix.
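
If it helps, here is an untested sketch of that post-hoc fix applied to an HF-format checkpoint (the paths are placeholders, and the layer/attribute names assume the standard HF Llama layout):

# Untested sketch: scale every q_proj weight by sqrt(head_dim) in an HF-format
# Llama checkpoint. Verify logits against a known-good model before trusting it.
import math
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/path/to/broken/checkpoint", torch_dtype=torch.float32)
head_dim = model.config.hidden_size // model.config.num_attention_heads  # 128 for Llama-3-8B

with torch.no_grad():
    for layer in model.model.layers:
        layer.self_attn.q_proj.weight.mul_(math.sqrt(head_dim))

model.save_pretrained("/path/to/repaired/checkpoint")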

@Helw150
Collaborator

Helw150 commented May 21, 2024

Ah yes, worth noting that I haven't pulled in the Splash Attention changes yet

@dlwh
Member

dlwh commented May 21, 2024

Splash attention is currently disabled, so main is fine 🤞 right now.

@peregilk
Author

I was using splash attention, so that might have caused the error.

However, I suspected this was a tokenizer-size issue. I remember also getting some warnings about non-matching tokenizers here.

But I can retry this without splash and see if that is related.

@dlwh
Member

dlwh commented May 22, 2024

I believe splash is now fixed in the latest main, but it's now off by default.

Can you try

--model.attn_backend splash

and

--model.attn_backend jax_flash

and let me know if things seem ok?

@peregilk
Author

Awesome! I have not been training for long, but in general my good runs have started with an eval loss of around 2.5, while the broken runs have started at 6. On the latest main, training starts with a 2.5 loss both with and without flash attention. Looks very good.

For reference (in case others are having the same issue), the correct flags are uppercase:
--model.attn_backend SPLASH
--model.attn_backend JAX_FLASH

Splash automatically upcasts to 32-bit, since bf16 is not working. I understand this is expected.

@dlwh
Member

dlwh commented May 22, 2024

Awesome, thanks for your patience.

Yeah, for whatever reason they don't support bf16 for attention with that kernel yet.

The uppercase thing can be fixed by upgrading draccus to >=0.8.0.

@dlwh dlwh closed this as completed May 22, 2024
@Aphoh
Contributor

Aphoh commented Jun 25, 2024

@peregilk Llama 3 shouldn't work out of the box nicely, as it uses a different theta for the RoPE scaling, and configuring that isn't yet supported in Levanter. This issue should probably be re-opened. Even when I use the correct rope theta I don't get reasonable results in Levanter (i.e. eval_lm gives me a loss of ~7 on neutral pretraining datasets like SlimPajama). @dlwh any ideas?
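
For context on why theta matters: RoPE inverse frequencies depend directly on it, and Llama 2 defaults to 10,000 while the Llama 3 HF config uses 500,000, so picking up the old default shifts every rotary frequency. A quick sketch of the standard formula:

# RoPE inverse frequencies for two values of rope_theta.
import numpy as np

def rope_inv_freq(theta, head_dim=128):
    return 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))

print(rope_inv_freq(10_000)[:4])   # Llama 2 default
print(rope_inv_freq(500_000)[:4])  # Llama 3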

@dlwh dlwh reopened this Jun 25, 2024
@dlwh
Member

dlwh commented Jun 25, 2024

that's not great. Probably need to spend some time in a debugger.

@dlwh
Member

dlwh commented Jun 25, 2024

I probably won't get to this for at least a few days myself, but I'm happy to provide some support.

@mayankjobanputra

Any progress on this one? I was thinking of switching to Levanter from Composer.
