Home

Project Statement

You must write a program in the Python programming language that can build a non-deterministic finite automaton (NFA) from a regular expression and can use the NFA to check if the regular expression matches any given string of text.

Run the Project:

Requirements:

Git
Python 3

Clone the Repository:

Open a directory of your choice in Command-Line and enter:

$ git clone https://github.com/johnshields/Python-RegEx-Parser

Open the repository directory in Command-Line and enter:

$ python regex_parser.py

Regular Expressions

A regular expression is a string containing a series of characters, some of which may have a special meaning. For example, the three characters ., |, and * have the special meanings concatenate, or, and Kleene star respectively. For example, the regular expression 0.1 means a 0 followed by a 1, 0|1 means a 0 or a 1, and 1* means any number of 1’s. These special characters must be used in your submission.

When you want to perform string matching operations that are more complex than the operations, you use regular expressions.

Special Characters

., |, *, +, ?, (), $

. concatenates two characters. So, a.b means an a followed by a b

| means or. So, a|b means an a or a b

(Kleene Star) means a character appears 0 or more times

means a character appears 1 or more times

? means a character appears 0 or 1 time

() are used to group characters

$ means a character that does not appear

Shunting Yard Algorithm

The Shunting Yard Algorithm, created by Edsger Dijkstra converts an Infix Expression to a Postfix Expression using a stack that holds operators. The stack is used to reverse the order of the operators in the expression. Since no operator can be printed until both of its operands have appeared, it also serves as a storage structure.

Takes regular expressions from infix notation to postfix notation

Concatenates operations Infix:

a.b = a followed by b
a|b = a or b
a* = any number of a's (including 0's)

Postfix:

ab. = a followed by b
ab| = an or b
a* = any number of a's (including 0's)

E.g. - (a|b).(a*|b*) becomes ab|a*b*|

set up a specials dictionary:

specials = {'*': 50, '.': 40, '|': 30, '+':40, '?':35}

set up a for loop to search for the Special Characters

looks for c in specials dictionary, if not found, give value 0
looks for the top of the stack in specials dictionary, if not found, give value 0

put statements together

Thompson's Construction

Thompson's Construction, by Ken Thompson, is used to convert a Postfix Expression into a Non-deterministic Finite Automata. These NFAs can be used to match strings against Regular Expressions.

reads Regular Expressions from left to right to create NFA's

An NFA is represented by its initial and accepts states set up classes

In Python a function is defined using the def keyword:

def my_function():
  print("Hello from a function")

Algorithm: A function has a stack of NFAs, the pofix allows to loop one character at a time until the regular expression is complete.

The stack works as last in first out

pop 2 NFAs off the stack.
merges them together
Connect 1st NFAs accept state to the 2nd's initial
Push NFA to stack
Push newnfa
Create a new initial and accept states
Creates new NFA
Join the initial states to the accept state using an arrow labeled c.
Going to create a new instance of the NFA class.
Set it's initial state to the initial state
Push new NFA to the stack # ¬ returns an instance of the NFA class
push NFA

nfastack should only have a single NFA on it at the point.

Following e arrows

Return the set of states that can be reached from the state following e arrows
Check if the state has arrows labeled e from it
Check if edge1 is a state
If there's an edge1, follow it
Check if edge2 is a state
If there's an edge2, follow it
Return the set of states

Matching Regular Expressions

'current' = THE CURRENT SET
'upcomin' = THE NEXT SET UPCOMING

Shunt and compile the regular expression from the functions in the code

Call the shunt and compiletom function

Current set of states and the next set of the states

Add the initial state to the current set
Loop through each character in the string
Check of that state is labeled s
Add edge1 state to the upcomin set
Set current to upcomin, and clear out next
Check if the accept state is in the set of current states

Testing the code

Testing the Shunting Yard Algorithm

Testing the Shunting Yard Algorithm to see if actually takes a Regular Expression from infix notion to postfix notation.

regex e.g. = (a.b)|(c*.d)

Print the calling of the fuction 'shunt'

print(shunt("(a.b)|(c*.d)"))

Output:

(a.b)|(c*.d) to ab.c*d.|

Testing Thompson's Construction

Test to see if the compiletom function can actually make a regex into a NFA.

Print the calling of the fuction 'compiletom'

print(compiletom("ab.cd.|"))

Output:

<__main__.nfa object at 0x0000028BACDB97F0>

Testing Matching Regular Expressions

Create list of infix regex + Strings to match against the regex infixes.

infixes = ["a.b.c" ,"a.b.c*, a.(b|d).c*", "(a.(b|d))*", "a.(b.b)*.c"]

Double for loop. For loops through the regex and strings.

strings = ["abc", "abbc", "abcc", "abad", "abbbc"]

The match function prints out True/False if the infix regex matches the string.

infix = a.b.c
string = abc

Print the calling of the function 'match'

print(match(exp, res), exp, res)

Output:

True a.b.c abc

This tells that this regex infix matches with this string

Prompt for tests to match Regular Expressions

Type in any Infix and string to be matched and tested

infix = (a.(b|d))* 
string = abc

while loop prints out True/False if the infix regex matches the string

Function prints out True/False if the infix regex matches the string

print(match(infixes, string), infixes, string)

Output:

False (a.(b|d))* abc

This tells that this regex infix does not match with this string

crtl+c and then press the 'Enter' key to exit prompt

Problems encountered

Besides the typical errors of misspelling and missing :, ), etc. I had one problem that stopped my progress with the project. When the match function was called all the matches would come out as 'False' and 'None'. After attempting to find errors that are not clear and do not cause the code to crash I tracked it down to my indentation in the for loop in the match function. After I fixed the indentation the output displayed both 'False' and 'True'.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Home

Project Statement

Run the Project:

Requirements:

Clone the Repository:

Open a directory of your choice in Command-Line and enter:

Open the repository directory in Command-Line and enter:

Regular Expressions

Special Characters

Shunting Yard Algorithm

Thompson's Construction

Following e arrows

Matching Regular Expressions

Testing the code

Testing the Shunting Yard Algorithm

Print the calling of the fuction 'shunt'

Testing Thompson's Construction

Print the calling of the fuction 'compiletom'

Testing Matching Regular Expressions

Print the calling of the function 'match'

This tells that this regex infix matches with this string

Prompt for tests to match Regular Expressions

Function prints out True/False if the infix regex matches the string

This tells that this regex infix does not match with this string

Problems encountered

All references can be viewed here.

END OF HOME

Clone this wiki locally