
Part of Speech Tagging from Scratch

Naive Bayes Method

(Figure: Bayes model)

  • We use Bayes' rule to compute the probability of a tag Si given a word Wi: P(Si|Wi) = P(Wi|Si) * P(Si) / P(Wi)
  • Since the denominator is the same for a given word, we drop P(Wi): P(Si|Wi) ∝ P(Wi|Si) * P(Si)
  • Each word Wi in the test sentence is assigned the part-of-speech tag Si that maximizes P(Si|Wi)
  • Handling unseen words in the simple model: if a test word has no emission probability estimated from the training data, we assign it an emission probability of min(all emission probabilities) * 0.1 (see the sketch after this list)
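
A minimal sketch of this per-word classifier, assuming tag priors and per-tag emission probabilities have already been estimated from the training counts (the dictionary names `prior_prob` and `emission_prob` are illustrative, not the repository's actual identifiers):

```python
def bayes_tag(sentence, emission_prob, prior_prob):
    """Tag each word independently with argmax P(W|S) * P(S)."""
    # Fallback for unseen words: a fraction of the smallest seen emission.
    min_emission = min(p for probs in emission_prob.values() for p in probs.values())
    tags = []
    for word in sentence:
        best_tag, best_score = None, -1.0
        for tag in prior_prob:
            e = emission_prob[tag].get(word, min_emission * 0.1)
            score = e * prior_prob[tag]  # P(W) is the same for every tag, so it is ignored
            if score > best_score:
                best_tag, best_score = tag, score
        tags.append(best_tag)
    return tags
```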

Hidden Markov Model (HMM) using Viterbi

(Figure: HMM model)

  • Viterbi is a dynamic-programming algorithm
  • The Viterbi trellis has one column of states per word in the test sentence
  • For the first word, we compute emission probability * prior probability for each of the 12 parts of speech and store the values in a dictionary v["state, part of speech"]
  • From the second word onwards, for each of the 12 parts of speech we take max(previous Viterbi state * transition probability); the chosen tag and its previous tag are stored in a list of lists called tag. We then multiply this maximum by the corresponding emission probability to get the value of the next Viterbi state
  • This process continues until the last word of the sentence
  • We take the maximum value in the last Viterbi column stored in the dictionary v; its part-of-speech tag is the tag for the last word of the sentence. We then backtrack from this tag to recover the tags for all preceding words
  • Emission probabilities of unseen words are handled by assigning min(emission probability) * 0.1
  • To avoid numerical underflow, we take the log of the individual probabilities (emission, transition) and add them instead of multiplying as the formula would literally suggest (see the log-space sketch after this list)
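
A compact log-space Viterbi sketch under the same illustrative assumptions as above (`transition_prob[prev][cur]` stands in for the learned tag-transition table; these are not the repository's actual identifiers):

```python
import math

def viterbi_tag(sentence, emission_prob, transition_prob, prior_prob):
    """Most likely tag sequence via dynamic programming in log space."""
    tags = list(prior_prob)
    min_emission = min(p for probs in emission_prob.values() for p in probs.values())

    def log_e(tag, word):
        # Unseen words get a fraction of the smallest seen emission.
        return math.log(emission_prob[tag].get(word, min_emission * 0.1))

    # First column: log P(W1|S) + log P(S) for every tag S.
    v = [{t: log_e(t, sentence[0]) + math.log(prior_prob[t]) for t in tags}]
    back = [{}]
    for word in sentence[1:]:
        col, ptr = {}, {}
        for t in tags:
            # Best predecessor state under the transition model.
            prev = max(tags, key=lambda p: v[-1][p] + math.log(transition_prob[p][t]))
            col[t] = v[-1][prev] + math.log(transition_prob[prev][t]) + log_e(t, word)
            ptr[t] = prev
        v.append(col)
        back.append(ptr)

    # Backtrack from the best state in the last column.
    best = max(tags, key=lambda t: v[-1][t])
    path = [best]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```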

Markov Chain Monte Carlo (MCMC)

(Figure: MCMC model)

  • We start from an initial sample in which every word is tagged as a noun; from this sample we generate new samples with the create_sample() function
  • To generate a sample, we compute the probability of each tag for a word conditioned on all the words and all the other tags, i.e. P(S_i | S − {S_i}, W_1, ..., W_n) (Gibbs sampling)
  • For the first word, the score is the product of the emission and prior probabilities
  • For the second word, it is the product of the emission probability and the transition probability from the first tag to the second
  • From the third word onwards, it is the product of the emission probability, the transition probability from the previous tag to the current tag, and the transition probability from the tag two positions back to the current tag
  • We normalize these scores over all tags of the word and draw a tag with np.random.choice according to the normalized probabilities, assigning the drawn tag to that word in the sample; this is repeated for every word of every sample (see the sketch after this list)
  • Unseen words again receive the small emission probability min(emission probability) * 0.1. The first 1000 samples are discarded as burn-in to improve accuracy
  • To estimate the marginal distribution, we generate 3000 samples using MCMC (after discarding the first 1000 burn-in samples)
  • From these samples we estimate, for each word in the test sentence, the probability of each part-of-speech tag
  • Each word is assigned the tag with the maximum marginal probability, and these tags together form the tagging of the whole sentence
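
A sketch of one Gibbs sweep using the factors described above, plus the per-position vote that approximates the max-marginal tag (all names are illustrative; the real create_sample() in the repository may differ):

```python
import numpy as np
from collections import Counter

def create_sample(sentence, sample, emission_prob, transition_prob, prior_prob):
    """Resample every tag in turn, conditioned on the current tags of its neighbors."""
    tags = list(prior_prob)
    min_emission = min(p for probs in emission_prob.values() for p in probs.values())

    def e(tag, word):
        return emission_prob[tag].get(word, min_emission * 0.1)

    for i, word in enumerate(sentence):
        scores = []
        for t in tags:
            s = e(t, word)
            if i == 0:
                s *= prior_prob[t]                      # first word: emission * prior
            else:
                s *= transition_prob[sample[i - 1]][t]  # previous tag -> current tag
            if i >= 2:
                s *= transition_prob[sample[i - 2]][t]  # tag two back -> current tag
            scores.append(s)
        probs = np.array(scores) / sum(scores)          # normalize over all tags
        sample[i] = np.random.choice(tags, p=probs)     # draw proportional to probability
    return sample

def marginal_tags(sentence, samples):
    """Majority vote per position approximates the tag with maximum marginal probability."""
    return [Counter(s[i] for s in samples).most_common(1)[0][0]
            for i in range(len(sentence))]
```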

Results

Command to run the code: `./label.py bc.train.txt bc.test.txt`

| Model         | Words correct | Sentences correct |
|---------------|---------------|-------------------|
| Naive Bayes   | 93.92%        | 47.45%            |
| HMM (Viterbi) | 94.70%        | 53.25%            |
| MCMC          | 95.08%        | 55.65%            |
