Pattern_recognition

Script for automatically generating a regex from a given list of strings.

Requirements

We use only built-in Python libraries. Some calculations could be done with NumPy or pandas, but I decided to use only the fast built-ins.
The data that we are going to process has a fixed structure and a specified length, e.g. IBAN numbers, postal codes, telephone numbers, etc.
We try to find the exact character at a given position; if that is not possible, we look for the most accurate character category.

Find the most common entry length.
Reject entries that do not have that length (we assume they are erroneous for various reasons).
Transpose the cleaned data matrix. Example:

[DE123, DE456, DE789]

Will be:

[DDD, EEE, 147, 258, 369]

This allows us to calculate how often each character occurs at each position.
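The length filtering, transposition, and counting steps can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from pattern_recognition.py; the function names `filter_by_common_length`, `transpose`, and `column_frequencies` are hypothetical, and the sketch assumes a non-empty input list.

```python
from collections import Counter


def filter_by_common_length(entries):
    """Split entries into those of the most common length and the rest.

    Hypothetical helper; assumes `entries` is non-empty.
    """
    most_common_length = Counter(len(e) for e in entries).most_common(1)[0][0]
    kept = [e for e in entries if len(e) == most_common_length]
    rejected = [e for e in entries if len(e) != most_common_length]
    return kept, rejected


def transpose(entries):
    """Turn equal-length rows into columns, one string per position."""
    return ["".join(column) for column in zip(*entries)]


def column_frequencies(columns):
    """Count how often each character occurs in each column."""
    return [Counter(column) for column in columns]


kept, rejected = filter_by_common_length(["DE123", "DE456", "DE789", "DE12"])
columns = transpose(kept)
print(columns)   # ['DDD', 'EEE', '147', '258', '369']
print(rejected)  # ['DE12']
```

Here `zip(*entries)` does the transposition, so no matrix library is needed. If two lengths are tied for most common, `Counter.most_common` returns the one seen first.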
Knowing the character frequencies, we determine what belongs at a given position:
- If a single character accounts for 98% of the values at a position, that character is assigned to the position.
- If a character group accounts for 90% of the values at a position, that group is assigned to the position.
- If neither condition is met, a dot (any character) is returned.
The mechanism allows further conditions to be added, e.g. selecting several candidates at chosen probabilities.
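A possible shape for this decision rule is sketched below. The group definitions (digits, uppercase letters, lowercase letters), the names `GROUPS`, `pattern_for_column`, and `build_regex`, and the choice to treat the thresholds as "at least 98%" and "at least 90%" are all assumptions made for illustration; the actual script may define its groups and comparisons differently.

```python
import re
import string
from collections import Counter

CHAR_THRESHOLD = 0.98   # share needed to fix a single character (assumed >=)
GROUP_THRESHOLD = 0.90  # share needed to fix a character group (assumed >=)

# Hypothetical groups, mapped from regex token to member characters.
GROUPS = {
    r"\d": set(string.digits),
    "[A-Z]": set(string.ascii_uppercase),
    "[a-z]": set(string.ascii_lowercase),
}


def pattern_for_column(column):
    """Return the regex token for one position, given all its characters."""
    total = len(column)
    counts = Counter(column)

    # Rule 1: one character dominates the position.
    char, char_count = counts.most_common(1)[0]
    if char_count / total >= CHAR_THRESHOLD:
        return re.escape(char)

    # Rule 2: one character group dominates the position.
    for token, members in GROUPS.items():
        group_count = sum(n for c, n in counts.items() if c in members)
        if group_count / total >= GROUP_THRESHOLD:
            return token

    # Rule 3: nothing matched, fall back to "any character".
    return "."


def build_regex(columns):
    """Concatenate the per-position tokens into one regex string."""
    return "".join(pattern_for_column(column) for column in columns)


print(build_regex(["DDD", "EEE", "147", "258", "369"]))  # DE\d\d\d
```

Keeping the rules as an ordered sequence of checks makes it straightforward to add another condition between the group rule and the dot fallback.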
After generating the regex, we check which entries it does not match and add them to the pool of rejected data.
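This final check could look something like the sketch below. The function name `split_by_match` is hypothetical, and using `re.fullmatch` (so that the whole entry must match) is an assumption about how the script applies the generated pattern.

```python
import re


def split_by_match(regex, entries):
    """Split entries into those the regex fully matches and the rest."""
    compiled = re.compile(regex)
    matched, rejected = [], []
    for entry in entries:
        if compiled.fullmatch(entry):
            matched.append(entry)
        else:
            rejected.append(entry)
    return matched, rejected


matched, rejected = split_by_match(
    r"DE\d\d\d", ["DE123", "DE456", "DEX89", "DE12"]
)
print(matched)   # ['DE123', 'DE456']
print(rejected)  # ['DEX89', 'DE12']
```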
