fakhirali/Compression

Compression = Coding + Modelling

Coding: How to store symbols.
Modelling: Which symbols are more probable.

We want to code the most probable symbols in the fewest bits, and the least probable ones in the most.

Coding is solved: arithmetic coding stores a symbol of model probability p in essentially -log2(p) bits, as close to the model's entropy as we like. Modelling is provably unsolvable in general: a perfect model would amount to computing Kolmogorov complexity, which is uncomputable.
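
To make that concrete (a minimal sketch, not code from this repo): under an ideal coder, a symbol the model assigns probability p costs about -log2(p) bits, so better predictions directly mean smaller output.

```python
import math

def ideal_bits(p: float) -> float:
    """Ideal code length in bits for a symbol of model probability p."""
    return -math.log2(p)

print(ideal_bits(0.5))    # 1.0 bit: a coin flip costs one bit
print(ideal_bits(0.99))   # ~0.014 bits: near-certain symbols are almost free
print(ideal_bits(0.001))  # ~9.97 bits: surprises are expensive
```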

On the first 1000 bytes of Wikipedia:
gzip :: 508 bytes
16 bit n-gram :: 522 bytes
16 bit n-gram + backoff :: 513 bytes
16 bit learned weighted n-gram :: 587 bytes

You can test your own models too! See the Docs.
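
The n-gram models above predict the next byte from the previous bytes, backing off to shorter contexts when the long one is unseen. A rough sketch of the idea (assumed structure, not the repo's actual implementation):

```python
from collections import defaultdict

class BackoffNgram:
    """Byte-level n-gram model: predict the next byte from the previous n
    bytes, backing off to shorter contexts when the long one is unseen."""

    def __init__(self, n: int):
        self.n = n
        # counts[k][context][byte]: how often `byte` followed a context of length k
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n + 1)]

    def update(self, history: bytes, symbol: int) -> None:
        for k in range(min(self.n, len(history)) + 1):
            self.counts[k][history[len(history) - k:]][symbol] += 1

    def prob(self, history: bytes, symbol: int) -> float:
        # Try the longest context first; fall back to shorter ones.
        for k in range(min(self.n, len(history)), -1, -1):
            ctx = self.counts[k][history[len(history) - k:]]
            total = sum(ctx.values())
            if total > 0:
                # +1 smoothing so no byte ever gets probability zero.
                return (ctx[symbol] + 1) / (total + 256)
        return 1 / 256  # nothing seen yet: uniform over all bytes
```

Feeding each predicted probability to an arithmetic coder gives the compressed sizes above; summing -log2(p) over the file gives the theoretical sizes in the results table below.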

TODO:

  • Use more bits in the coder to improve precision (see the sketch below)
  • Test the compression ratio with more data
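
On the first item: a 16-bit coder can only represent probabilities on a 1/65536 grid, and the quantization error costs extra bits. A back-of-envelope sketch of that overhead (my own illustration, assuming simple count quantization):

```python
import math

def kl_overhead_bits(p: list[float], bits: int) -> float:
    """Expected extra bits per symbol when distribution p is quantized to
    integer counts on a 2**bits grid, as a fixed-precision coder must."""
    scale = 1 << bits
    q = [max(1, round(pi * scale)) for pi in p]  # quantized counts, never zero
    total = sum(q)
    # KL(p || q/total): the per-symbol cost of coding with q instead of p
    return sum(pi * math.log2(pi * total / qi) for pi, qi in zip(p, q) if pi > 0)

# A skewed byte distribution: one dominant byte, 255 rare ones.
dist = [0.99] + [0.01 / 255] * 255
for b in (8, 16, 32):
    print(f"{b:2d}-bit coder: ~{kl_overhead_bits(dist, b):.6f} extra bits/symbol")
```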

Results:

Averaged over 3 random samples of 1000 bytes from enwik9:

Model            Compressed size (bytes)   Theoretical size (bytes)
zip              532.3                     N/A
Ngram            722.3                     717.5
Backoff          685.0                     680.7
LearnedWeighted  718.3                     712.4
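
My reading of the theoretical column: it is the model's cross-entropy on the sample, i.e. the size an ideal coder would reach given the model's probabilities, so the gap to the measured size is coder overhead. The usual formula:

```python
import math

def theoretical_bytes(probs: list[float]) -> float:
    """probs[i] is the probability the model assigned to the byte that
    actually occurred at position i. An ideal coder spends -log2(p) bits
    per byte, so the total size in bytes is:"""
    return sum(-math.log2(p) for p in probs) / 8
```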

About

compression == intelligence
