-
Notifications
You must be signed in to change notification settings - Fork 1
Home
You must write a program in the Python programming language that can build a non-deterministic finite automaton (NFA) from a regular expression and can use the NFA to check if the regular expression matches any given string of text.
$ git clone https://github.com/johnshields/Python-RegEx-Parser
$ python regex_parser.py
A regular expression is a string containing a series of characters, some of which may have a special meaning. For example, the three characters ., |, and * have the special meanings concatenate, or, and Kleene star respectively. For example, the regular expression 0.1 means a 0 followed by a 1, 0|1 means a 0 or a 1, and 1* means any number of 1’s. These special characters must be used in your submission.
When you want to perform string matching operations that are more complex than the operations, you use regular expressions.
., |, *, +, ?, (), $
. concatenates two characters. So, a.b means an a followed by a b
| means or. So, a|b means an a or a b
- (Kleene Star) means a character appears 0 or more times
- means a character appears 1 or more times
? means a character appears 0 or 1 time
() are used to group characters
$ means a character that does not appear
The Shunting Yard Algorithm, created by Edsger Dijkstra converts an Infix Expression to a Postfix Expression using a stack that holds operators. The stack is used to reverse the order of the operators in the expression. Since no operator can be printed until both of its operands have appeared, it also serves as a storage structure.
Takes regular expressions from infix notation to postfix notation
Concatenates operations Infix:
a.b = a followed by b
a|b = a or b
a* = any number of a's (including 0's)
Postfix:
ab. = a followed by b
ab| = an or b
a* = any number of a's (including 0's)
E.g. - (a|b).(a*|b*) becomes ab|a*b*|
set up a specials dictionary:
specials = {'*': 50, '.': 40, '|': 30, '+':40, '?':35}
set up a for loop to search for the Special Characters
- looks for c in specials dictionary, if not found, give value 0
- looks for the top of the stack in specials dictionary, if not found, give value 0
put statements together
Thompson's Construction, by Ken Thompson, is used to convert a Postfix Expression into a Non-deterministic Finite Automata. These NFAs can be used to match strings against Regular Expressions.
- reads Regular Expressions from left to right to create NFA's
An NFA is represented by its initial and accepts states set up classes
In Python a function is defined using the def keyword:
def my_function():
print("Hello from a function")
Algorithm: A function has a stack of NFAs, the pofix allows to loop one character at a time until the regular expression is complete.
The stack works as last in first out
- pop 2 NFAs off the stack.
- merges them together
- Connect 1st NFAs accept state to the 2nd's initial
- Push NFA to stack
- Push newnfa
- Create a new initial and accept states
- Creates new NFA
- Join the initial states to the accept state using an arrow labeled c.
- Going to create a new instance of the NFA class.
- Set it's initial state to the initial state
- Push new NFA to the stack # ¬ returns an instance of the NFA class
- push NFA
- nfastack should only have a single NFA on it at the point.
- Return the set of states that can be reached from the state following e arrows
- Check if the state has arrows labeled e from it
- Check if edge1 is a state
- If there's an edge1, follow it
- Check if edge2 is a state
- If there's an edge2, follow it
- Return the set of states
'current' = THE CURRENT SET
'upcomin' = THE NEXT SET UPCOMING
Shunt and compile the regular expression from the functions in the code
- Call the shunt and compiletom function
- Current set of states and the next set of the states
- Add the initial state to the current set
- Loop through each character in the string
- Check of that state is labeled s
- Add edge1 state to the upcomin set
- Set current to upcomin, and clear out next
- Check if the accept state is in the set of current states
Testing the Shunting Yard Algorithm to see if actually takes a Regular Expression from infix notion to postfix notation.
regex e.g. = (a.b)|(c*.d)
print(shunt("(a.b)|(c*.d)"))
Output:
(a.b)|(c*.d) to ab.c*d.|
Test to see if the compiletom function can actually make a regex into a NFA.
print(compiletom("ab.cd.|"))
Output:
<__main__.nfa object at 0x0000028BACDB97F0>
- Create list of infix regex + Strings to match against the regex infixes.
infixes = ["a.b.c" ,"a.b.c*, a.(b|d).c*", "(a.(b|d))*", "a.(b.b)*.c"]
- Double for loop. For loops through the regex and strings.
strings = ["abc", "abbc", "abcc", "abad", "abbbc"]
- The match function prints out True/False if the infix regex matches the string.
infix = a.b.c
string = abc
print(match(exp, res), exp, res)
Output:
True a.b.c abc
Type in any Infix and string to be matched and tested
infix = (a.(b|d))*
string = abc
while loop prints out True/False if the infix regex matches the string
print(match(infixes, string), infixes, string)
Output:
False (a.(b|d))* abc
crtl+c and then press the 'Enter' key to exit prompt
Besides the typical errors of misspelling and missing :, ), etc. I had one problem that stopped my progress with the project. When the match function was called all the matches would come out as 'False' and 'None'. After attempting to find errors that are not clear and do not cause the code to crash I tracked it down to my indentation in the for loop in the match function. After I fixed the indentation the output displayed both 'False' and 'True'.