
Releases: Natooz/MidiTok

v2.1.3 New tokenization workflow, speedups, time signature and PyTorch data loading module

17 Aug 11:17

This big update brings a few important changes and improvements.

A new common tokenization workflow for all tokenizers.

We now distinguish three types of tokens:

  1. Global MIDI tokens, which represent attributes and events affecting the music globally, such as the tempo or time signature;
  2. Track tokens, representing the content of individual tracks, such as notes, chords or effects;
  3. Time tokens, which serve to structure and place the previous categories of tokens in time.

All tokenizations now follow this pattern:

  1. Preprocess the MIDI;
  2. Gather global MIDI events (tempo...);
  3. Gather track events (notes, chords);
  4. If "one token stream", concatenate all global and track events and sort them by time of occurrence. Else, concatenate the global events to each sequence of track events;
  5. Deduce the time events for all the sequences of events (only one if "one token stream");
  6. Return the tokens, as a combination of lists of strings and lists of integers (token ids).

This considerably cleans up the code (DRY, fewer redundant methods) while bringing speedups, as the number of calls to sorting methods has been reduced.
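
From the user side, this workflow stays behind a single call. A minimal sketch, assuming a REMI tokenizer with default settings and a MIDI file at the hypothetical path song.mid:

```python
from miditok import REMI, TokenizerConfig
from miditoolkit import MidiFile

tokenizer = REMI(TokenizerConfig())  # the workflow above runs inside midi_to_tokens()

midi = MidiFile("song.mid")  # hypothetical input file
tokens = tokenizer.midi_to_tokens(midi)  # or simply: tokenizer(midi)

# "one token stream" tokenizers return a single TokSequence,
# the others return one TokSequence per track
seq = tokens if not isinstance(tokens, list) else tokens[0]
print(seq.tokens[:10])  # token strings
print(seq.ids[:10])     # corresponding integer ids
```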

TL;DR: other changes

  • New submodule pytorch_data offering PyTorch Dataset objects and a data collator to be used when training a PyTorch model (see the sketch after this list). Learn more in the module's documentation;
  • MIDILike, CPWord and Structured now handle natively Program tokens in a multitrack / one_token_stream way;
  • Time signature changes are now handled by TSD, MIDILike and CPWord;
  • The time_signature_range config option is now more flexible / convenient.
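
A minimal sketch of how the new module is meant to be used when training a model; the exact parameter names (min_seq_len, max_seq_len, pad_token) are assumptions from the release description, so check the module's documentation:

```python
from pathlib import Path
from torch.utils.data import DataLoader
from miditok.pytorch_data import DatasetTok, DataCollator

# JSON token files previously saved by tokenizer.tokenize_midi_dataset()
token_files = list(Path("tokens").glob("**/*.json"))

dataset = DatasetTok(token_files, min_seq_len=64, max_seq_len=512)
collator = DataCollator(pad_token=0)  # assumed: the tokenizer's PAD token id

loader = DataLoader(dataset, batch_size=16, collate_fn=collator)
for batch in loader:
    ...  # feed the batch to your PyTorch model
```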

Changelog

  • #61 new pytorch_data submodule, with DatasetTok and DatasetJsonIO classes. This module is only loaded if torch is installed in the Python environment;
  • #61 the tokenize_midi_dataset() method now has a tokenizer_config_file_name argument, allowing you to save the tokenizer config with a custom file name;
  • #61 "all-in-one" DataCollator object to be used with PyTorch DataLoaders;
  • #62 Structured and MIDILike now natively handle Program tokens. When config.use_programs is True, a Program token will be added before each Pitch/NoteOn/NoteOff token to indicate its instrument, and MIDIs will be treated as a single stream of tokens; otherwise each track is converted into an independent token sequence (see the sketch after this changelog);
  • #62 miditok.utils.remove_duplicated_notes method can now remove notes with the same pitch and onset time, regardless of their offset time / duration;
  • #62 miditok.utils.merge_same_program_tracks is now called in preprocess_midi when config.use_programs is True;
  • #62 Big refactor of the REMI codebase, which now has all the features of REMIPlus, with cleaner code and speedups (fewer calls to sorting). The REMIPlus class is now essentially a wrapper around REMI with programs and time signatures enabled;
  • #62 TSD and MIDILike now encode and decode time signature changes;
  • #63 @ilya16 Tempos can now be created on a logarithmic scale, instead of the default linear scale;
  • c53a008 and 5d1c12e the track_to_tokens and tokens_to_track methods are now partially removed: they are protected in the classes that still rely on them, and removed from the others. These methods were meant for internal calls and their use is not recommended; use midi_to_tokens instead;
  • #65 @ilya16 changes time_signature_range into a dictionary mapping each denominator to its numerators, given either as a list [num_i1, ..., num_in] or as a range tuple (min_num_i, max_num_i);
  • #65 @ilya16 fix in the formula computing the number of ticks per bar.
  • #66 Adds an option to TokenizerConfig to delete successive tempo / time signature changes carrying the same value during MIDI preprocessing;
  • #66 now using xdist for tests, a big speedup on GitHub Actions (ty @ilya16!);
  • #66 CPWord and Octuple now follow the common tokenization workflow;
  • #66 As a consequence of the previous point, OctupleMono has been removed, as there was no record of its use. It is equivalent to Octuple without config.use_programs;
  • #66 CPWord now handles time signature changes;
  • #66 tests for tempo and time signature changes are now more robust; exceptions were removed and the underlying issues fixed;
  • 5a6378b save_tokens now by default doesn't save programs if config.use_programs is False.
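
To illustrate the native Program token handling mentioned above, a hedged sketch, assuming a MIDI file at the hypothetical path song.mid:

```python
from pathlib import Path
from miditok import TSD, TokenizerConfig

# With use_programs=True, a Program token precedes each Pitch/NoteOn/NoteOff
# token, and the whole MIDI is tokenized as a single stream of tokens.
config = TokenizerConfig(use_programs=True, use_time_signatures=True)
tokenizer = TSD(config)

tokens = tokenizer(Path("song.mid"))  # one TokSequence for all tracks
```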

Compatibility

  • Calls to the track_to_tokens and tokens_to_track methods are not supported anymore. If you used these methods, you can replace them with midi_to_tokens and tokens_to_midi (or just call the tokenizer) while selecting the appropriate token sequences / tracks;
  • time_signature_range now needs to be given as a dictionary (see the example below);
  • Due to changes in the order of the vocabularies of Octuple (programs are now optional), tokenizers and tokens made with previous versions will not be compatible unless the vocabulary order is swapped (index 3 moved to index 5).
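
A short sketch of the new time_signature_range format; the values are only illustrative:

```python
from miditok import TokenizerConfig

# Each denominator maps to either a list of numerators
# or a (min_numerator, max_numerator) range.
config = TokenizerConfig(
    use_time_signatures=True,
    time_signature_range={8: [3, 6, 12], 4: (1, 6)},
)
```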

v2.1.2 I/O fixes

24 Jul 18:16

Thanks to @Kapitan11, who spotted bugs when decoding tokens given as ids / integers (#59), this update brings a few fixes that solve them, alongside tests ensuring that the input / output (i/o) formats of the tokenizers are handled correctly in every case.
The documentation on this subject, which was unclear until now, has also been updated.

Changes

  • 394dc4d Fix in the MuMIDI and Octuple token encodings, which performed the preprocessing steps twice;
  • 394dc4d the single-track test code has been improved and now covers tempos for most tokenizations;
  • 394dc4d MuMIDI can now decode tempo tokens;
  • 394dc4d _in_as_seq decorator now used solely for the tokens_to_midi() method, and removed from tokens_to_track() which explicitly expects a TokSequence object as argument (089fa74);
  • 089fa74 _in_as_seq decorator now handling all token ids input formats as it should;
  • 9fe7639 Fix in TSD decoding with multiple input sequences when not in one_token_stream mode;
  • 9fe7639 Adding i/o input ids tests;
  • 8c2349b unique_track property renamed to one_token_stream as it is more explicit and accurate;
  • 8c2349b new convert_sequence_to_tokseq method, which can convert any input sequence holding ids (integer), tokens (string) or events (Event) data into a TokSequence or list of TokSequences objects, with the appropriate format depending on the tokenizer. This method is used by the _in_as_seq decorator;
  • 8c2349b new io_format tokenizer property, returning the tokenizer's i/o format as a tuple of strings: I stands for instrument (for non one_token_stream tokenizers), T for token, and C for sub-token class (for multi-vocabulary tokenizers);
  • Minor code lint improvements;
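
As a hedged sketch of what these changes mean in practice, assuming tokenizer is any already created MidiTok tokenizer not in one_token_stream mode (the ids below are placeholders):

```python
# Raw integer ids are now accepted directly: the _in_as_seq decorator
# converts them to TokSequence objects via convert_sequence_to_tokseq.
ids = [[12, 45, 78, 45], [12, 45, 78, 45]]  # one list of ids per track
midi = tokenizer.tokens_to_midi(ids)

print(tokenizer.one_token_stream)  # renamed from unique_track
print(tokenizer.io_format)         # e.g. ("I", "T"): instrument, token
```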

Compatibility

  • All good 🙌

v2.1.1 Minor fixes

06 Jul 13:22

Changes

  • 220f384 Fix in learn_bpe() for tokenizers in unique_track mode;
  • 30d5546 Fixes in data augmentation (on tokens) in unique_track mode: 1) it was skipping files (detected as drums) and 2) it now augments all pitches except drum ones (as opposed to all pitches before);
  • 30d5546 The tokenizer now creates Program tokens from the tokenizer.config.programs values given by the user.

Compatibility

  • If you used custom Program tokens, make sure to give (-1, 128) as the programs argument of your tokenizer's config (TokenizerConfig). This is already the default value, so this message only applies if you gave something else.

v2.1.0 TokenizerConfig

03 Jul 14:47

Major change

This "mid-size" update brings a new TokenizerConfig object, holding any tokenizer's configuration. This object is now used to instantiate all tokenizers, and replaces the now removed beat_res, nb_velocities, pitch_range and additional_tokens arguments. It allows to simplify the code, reduce exceptions, and expose a simplified way to custom tokenizers.
You can read the documentation and example to see how to use it.
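
A minimal sketch of the new instantiation pattern; the values shown mirror the library's defaults and are only illustrative:

```python
from miditok import REMI, TokenizerConfig

# The former pitch_range, beat_res, nb_velocities and additional_tokens
# arguments are now fields of TokenizerConfig.
config = TokenizerConfig(
    pitch_range=(21, 109),
    beat_res={(0, 4): 8, (4, 12): 4},
    nb_velocities=32,
    use_chords=True,
    use_tempos=True,
)
tokenizer = REMI(config)
```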

Changes

  • e586b1f New TokenizerConfig object to hold config and instantiate tokenizers
  • 26a67a6 @tingled Fix in __repr__
  • 9970ec4 Fix in CPWord token type graph
  • 69e64a7 max_bar_embedding argument for REMIPlus is now by default set to False
  • 62292d6 @Kapitan11 load_params now private method, and documentation updated for this feature
  • 3aeb7ff Removing the depreciated "slow" BPE methods
  • f8ca854 @ilya16 Fixing PitchBend time attribute in merge_tracks method
  • b12d270 TSD now natively handles Program tokens, the same way REMIPlus does. Using the use_programs option will convert MIDIs into a single token sequence for all tracks, instead of one sequence per track;
  • Other minor code, lint and docstring improvements

Compatibility

  • On your current / previous projects, you will need to update your code, specifically the way you create tokenizers, to use this update. This doesn't apply to code creating tokenizers from a config file (params argument);
  • Slow BPE removed. If you still use these methods, we encourage you to switch to the new fast ones. Models trained with the old slow tokenizers will still need to be used with them.

v2.0.6 MMM tokenizer

16 May 18:41

Changes

  • 811bd68 New MMM (Multi-Track Music Machine) tokenizer.

Compatibility

  • All good 🙌

v2.0.5 Bug fixes and safety checks

04 May 16:32

Changes

  • f9f63d0 (related to #37) adding a compatibility check to the learn_bpe method;
  • f1af66a fixing an issue when loading tokens in learn_bpe with a unique_track-compatible tokenizer (e.g. REMIPlus) that resulted in no BPE being learned;
  • f1af66a in learn_bpe: checking that the total number of unique base tokens (chars) is lower than the target vocabulary size;
  • 47b6166 handling multi-vocabulary indexing with tokens present in all vocabularies, e.g. special tokens.
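
For context, a hedged sketch of the learn_bpe call these checks apply to, assuming tokenizer is an already created tokenizer and tokens/ holds previously saved JSON token files (the tokens_paths parameter name is an assumption):

```python
from pathlib import Path

# learn_bpe now verifies the tokenizer is compatible and that vocab_size
# is larger than the number of unique base tokens (chars).
tokenizer.learn_bpe(
    vocab_size=500,
    tokens_paths=list(Path("tokens").glob("**/*.json")),
)
```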

Compatibility

  • All good 🙌

v2.0.4 Bugfix

02 May 11:39

Changes

  • 456a6ce bugfix on the velocity feature when performing data augmentation at the token level

v2.0.3 Minor improvements

04 Apr 09:06

Changes

  • ff1bb5e and 195cb65 the __call__ magic method can now load MIDI and JSON files before converting them (see the sketch after this list)
  • c045630 TokSequences are now subscriptable! (you can do tok_seq[id_])
  • a632214 Special tokens are now stored without the None value
  • Minor code and documentation improvements
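
A small sketch of these conveniences, assuming tokenizer is an already created tokenizer and the file paths are hypothetical:

```python
from pathlib import Path

# __call__ now loads files itself before converting them
tokens = tokenizer(Path("song.mid"))   # MIDI file -> tokens
midi = tokenizer(Path("tokens.json"))  # JSON token file -> MIDI

# TokSequence objects are now subscriptable
seq = tokens[0] if isinstance(tokens, list) else tokens
first_token = seq[0]
```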

Compatibility

  • If you use token_type_graph and tokens_errors: previous config files store special tokens with the None value (e.g. PAD_None); they have to be modified to remove it (e.g. just PAD) in the special_tokens entry only. No change in vocabulary / tokens.

v2.0.2 Fix _ids_are_bpe_encoded

21 Mar 08:29
  • 63110d7 fix in _ids_are_bpe_encoded method

v2.0.1 REMI+ and new Chord params

18 Mar 09:48

Changes

  • e26b088 from @atsukoba + help from @muthissar: REMI+ is now implemented! 🎉 This multitrack tokenization can be seen as an extension of REMI.
  • 2962211 Chord tokens can now represent the root note within the token (versus only the chord quality previously). Chord parameters have to be specified in the additional_tokens argument, with the keys chord_maps, chord_tokens_with_root_note and chord_unknown (see the sketch after this list). You can use the default values as an example.
  • e402b0d _in_as_seq decorator now automatically checks if the input ids are encoded with BPE
  • 2064ee9 fix with BPE containing spaces in merges, which prevented loading tokenizers after training
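
A hedged sketch of the new chord parameters; CHORD_MAPS and ADDITIONAL_TOKENS are assumed to be the library's default constants, and the flag values are only illustrative:

```python
from miditok import REMI
from miditok.constants import ADDITIONAL_TOKENS, CHORD_MAPS

additional_tokens = dict(ADDITIONAL_TOKENS)
additional_tokens.update(
    Chord=True,
    chord_maps=CHORD_MAPS,             # e.g. {"maj": (0, 4, 7), ...}
    chord_tokens_with_root_note=True,  # include the root note in the token
    chord_unknown=False,               # don't tokenize unmatched chords
)
tokenizer = REMI(additional_tokens=additional_tokens)
```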

Compatibility

  • due to 2064ee9, bytes and merges are shifted from v2.0.0. BPE tokenizers will be incompatible and would have to be retrained, or the bytes in their vocabularies and merges would have to be shifted. This only applies to BPE.