Merge pull request #5 from mideind/lookupid
Version 0.4.0; added lookup_id() function; new KRISTINsnid.csv data
vthorsteinsson committed Nov 3, 2021
2 parents 40dccf1 + d344884 commit 3b7d53a
Showing 7 changed files with 367 additions and 157 deletions.
34 changes: 33 additions & 1 deletion README.md
@@ -22,7 +22,7 @@ and about 300,000 distinct lemmas.

Miðeind has encapsulated the database in an easy-to-install Python package,
compressing it
from a 400+ megabyte CSV file into an ~80 megabyte indexed binary structure.
from a 400+ megabyte CSV file into an ~82 megabyte indexed binary structure.
The package maps this structure directly into memory (via `mmap`) for fast lookup.
An algorithm for handling compound words is an important additional feature
of the package.
@@ -233,6 +233,17 @@ Here we see, perhaps unexpectedly, that the word form *laga* has five possible l
four nouns (*lag*, *lög*, *lagi* and *lögur*, neuter (`hk`) and masculine (`kk`)
respectively), and one verb (`so`), having the infinitive (*nafnháttur*) *að laga*.

## Lookup by BÍN identifier

Given a BÍN identifier (id number), BinPackage can return all entries for that id:

```python
>>> from islenska import Bin
>>> b = Bin()
>>> b.lookup_id(495410)
[<Ksnid: bmynd='sko', ord/ofl/hluti/bin_id='sko'/uh/alm/495410, mark=OBEYGJANLEGT, ksnid='1;;;;K;1;;;'>]
```

## Grammatical variants

With BinPackage, it is easy to obtain grammatical variants
@@ -472,6 +483,27 @@ and the second element is the list of matching entries, each represented
by an instance of class `Ksnid`.


## `lookup_id()` function

If you have a BÍN identifier (integer id) and need to look up the associated
augmented format (*Kristínarsnið*) entries, call the `lookup_id()` function:

```python
>>> b.lookup_id(495410)
[<Ksnid: bmynd='sko', ord/ofl/hluti/bin_id='sko'/uh/alm/495410, mark=OBEYGJANLEGT, ksnid='1;;;;K;1;;;'>]

```

`lookup_id()` has a single mandatory parameter:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| bin_id | `int` | | The BÍN identifier of the entries to look up. |

The function returns a list of type `List[Ksnid]`. If the given id number is not found
in BÍN, an empty list is returned.
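Because an unknown id yields an empty list rather than an exception, callers need to check for emptiness before indexing the result. The following is a minimal sketch of one way to do that. The helper name `describe_bin_id` is hypothetical, and the attribute names `ord`, `ofl` and `bmynd` are assumed from the `Ksnid` representation shown above. A stand-in lookup function is used so that the sketch runs without the package installed; in real use you would pass `Bin().lookup_id` instead.

```python
from types import SimpleNamespace
from typing import Any, Callable, List


def describe_bin_id(lookup_id: Callable[[int], List[Any]], bin_id: int) -> str:
    """Summarize the entries returned for a BÍN id (hypothetical helper)."""
    entries = lookup_id(bin_id)
    if not entries:
        # An empty list means the id was not found
        return f"No BÍN entries found for id {bin_id}"
    # Attribute names (ord, ofl, bmynd) are assumed from the Ksnid repr;
    # the lemma and category are read from the first entry
    first = entries[0]
    forms = sorted({e.bmynd for e in entries})
    return f"{first.ord} ({first.ofl}): {len(entries)} entries, forms: {', '.join(forms)}"


def fake_lookup_id(bin_id: int) -> List[Any]:
    """Stand-in for Bin().lookup_id, for illustration only."""
    if bin_id == 495410:
        return [SimpleNamespace(ord="sko", ofl="uh", bmynd="sko")]
    return []


print(describe_bin_id(fake_lookup_id, 495410))
print(describe_bin_id(fake_lookup_id, 1))
```
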


## `lookup_cats()` function

To look up the possible classes/categories of a word (*orðflokkar*),
16 changes: 8 additions & 8 deletions src/islenska/basics.py
@@ -60,27 +60,27 @@
UINT32 = struct.Struct("<I")

# BÍN compressed file format version (used in tools/binpack.py and bincompress.py)
BIN_COMPRESSOR_VERSION = b"Greynir 02.00.00"
BIN_COMPRESSOR_VERSION = b"Greynir 03.00.00"
assert len(BIN_COMPRESSOR_VERSION) == 16
BIN_COMPRESSED_FILE = "compressed.bin"

# The following are encoded with each word form
# Bits allocated for the lemma index (currently max 310787)
LEMMA_BITS = 19
LEMMA_MAX = 2 ** LEMMA_BITS
# Bits allocated for the bin_id number (currently max 513582)
BIN_ID_BITS = 23
BIN_ID_MAX = 2 ** BIN_ID_BITS
BIN_ID_MASK = BIN_ID_MAX - 1
# Bits allocated for the meaning index (currently max 968)
MEANING_BITS = 10
MEANING_MAX = 2 ** MEANING_BITS
MEANING_MASK = MEANING_MAX - 1
# Make sure that we have at least three high bits available for other
# purposes in a 32-bit word that already contains a lemma index and a meaning index
assert LEMMA_BITS + MEANING_BITS <= 29
# assert BIN_ID_BITS + MEANING_BITS <= 29
# Bits allocated for the ksnid-string index (currently max 5826)
KSNID_BITS = 13
KSNID_MAX = 2 ** KSNID_BITS
KSNID_MASK = KSNID_MAX - 1

# The following are encoded with each lemma
# Bits allocated for the bin_id number (currently max 513582)
UTG_BITS = 23
# Bits allocated for the subcategory index (hluti) (currently max 49)
SUBCAT_BITS = 8
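The constants in this hunk suggest why the `assert` on the combined width is now commented out: 23 bits for `bin_id` plus 10 bits for the meaning index is 33 bits, which exceeds the 29-bit budget of the old check. The sketch below illustrates the mask arithmetic with a hypothetical packing layout (`bin_id` in the high bits, meaning index in the low bits). The `pack` and `unpack` functions are illustrative only and are not the package's actual encoding.

```python
# Constants copied from the hunk above
BIN_ID_BITS = 23
BIN_ID_MAX = 2 ** BIN_ID_BITS
BIN_ID_MASK = BIN_ID_MAX - 1
MEANING_BITS = 10
MEANING_MAX = 2 ** MEANING_BITS
MEANING_MASK = MEANING_MAX - 1

# Combined field width; this is why a "<= 29" check can no longer hold
TOTAL_BITS = BIN_ID_BITS + MEANING_BITS


def pack(bin_id: int, meaning: int) -> int:
    """Hypothetical layout: bin_id in the high bits, meaning index in the low bits."""
    if not 0 <= bin_id < BIN_ID_MAX:
        raise ValueError("bin_id out of range")
    if not 0 <= meaning < MEANING_MAX:
        raise ValueError("meaning index out of range")
    return (bin_id << MEANING_BITS) | meaning


def unpack(packed: int) -> tuple:
    """Recover (bin_id, meaning) using the masks defined above."""
    return (packed >> MEANING_BITS) & BIN_ID_MASK, packed & MEANING_MASK


# Round-trip the current maxima quoted in the comments (513582 and 968)
packed = pack(513582, 968)
assert unpack(packed) == (513582, 968)
print(TOTAL_BITS)
```
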
