This project was created to fulfil the course requirements for CS265-F17, Topics in Artificial Intelligence.
The Intelligent Career Agent takes search keywords from the user, scrapes ieee.org, acm.org and indeed.com for related jobs, and returns a list of the closest matches. It uses KNN to find the K closest sites and K-means to cluster similar sites together within the data.
Usage: in your terminal, run
python CareerAgent.py
and the interactive command line interface will ask for your input.
The script then scrapes the websites mentioned above, generates a lookup table, and stores it on your
system as a shelve file the first time you search for a keyword.
careerlookup - this file is created by shelve to store the lookup table. shelve is a Python standard-library module that uses pickle to serialize the stored values.
The script uses shelve as a lookup table that stores search results and their links, so that repeated searches
return as quickly as possible. If a word is searched for again more than 24 hours later, the lookup table is updated.
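The caching behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in CareerAgent.py: the function name lookup, the scrape callback, the entry layout (a dict with saved_at and links) and the now parameter are all assumptions made for the example.

```python
import shelve
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # entries older than 24 hours are refreshed


def lookup(keyword, scrape, path="careerlookup", now=None):
    """Return cached links for keyword, re-scraping if missing or stale.

    scrape is a hypothetical callback that takes a keyword and returns a
    list of links; now can be passed in to make the age check testable.
    """
    now = time.time() if now is None else now
    with shelve.open(path) as table:
        entry = table.get(keyword)
        if entry is None or now - entry["saved_at"] > MAX_AGE_SECONDS:
            # First search for this keyword, or the entry is too old.
            entry = {"saved_at": now, "links": scrape(keyword)}
            table[keyword] = entry
        return entry["links"]
```

A repeated search within 24 hours never calls scrape, which is where the speed-up comes from.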
CareerResults.html - this file is generated by the script to output the result of a query in the browser
A word about the algorithms:
KNN - I have embedded the KNN algorithm into the lookup table generation: the script calculates the Jaccard
similarities and sorts the list in descending order of similarity, so that the K nearest neighbours are available
extremely quickly if the key is present in the lookup table. The script simply has to take the top K links.
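As a rough illustration of the idea, the sketch below computes the Jaccard similarity (size of the intersection divided by size of the union) and keeps the links sorted so that the K nearest are just the first K entries. The function names, the pages dictionary, and the choice to compare the query words against each page's words are assumptions for this example, not necessarily what the script does.

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def rank_links(query_words, pages):
    """Return links sorted by descending similarity to the query.

    pages is a hypothetical dict mapping each link to the words found on it.
    """
    scored = [(jaccard(query_words, words), link) for link, words in pages.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in scored]


def k_nearest(ranked_links, k):
    """With the list already sorted, the K nearest are the first K links."""
    return ranked_links[:k]
```

Because the sorting happens once when the table is built, answering a query for a known key is only a slice of the stored list.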
K-means clustering algorithm - I have used the Jaccard similarity to cluster vectors together.
As the stopping measure I have used difflib, a Python standard-library module that uses
Gestalt pattern matching to produce a similarity ratio between two vectors in the range [0, 1].
My stopping parameters are as follows. The program returns the given clusters if either condition holds:
1. The average similarity value (averagesm) for all clusters is greater than 0.7,
   that is, the similarity is above about 70%.
2. The absolute difference between the current average similarity and the old average
   similarity is less than 0.1, that is, the two differ by less than 10%.
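The two stopping conditions can be sketched as below. The thresholds 0.7 and 0.1 come from the description above; everything else is an assumption made for illustration, including the function names, the (centroid, members) layout of clusters, and the choice to average the difflib ratio of every member against its centroid.

```python
from difflib import SequenceMatcher


def average_similarity(clusters):
    """Mean difflib ratio between each member and its cluster's centroid.

    clusters is a hypothetical list of (centroid, members) pairs, where the
    centroid and every member are sequences such as strings or word lists.
    """
    ratios = [
        SequenceMatcher(None, centroid, member).ratio()
        for centroid, members in clusters
        for member in members
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


def should_stop(current_avg, previous_avg, good_enough=0.7, min_change=0.1):
    """Stop when clusters are similar enough or have stopped improving."""
    if current_avg > good_enough:
        return True
    # previous_avg is None on the first iteration, so only the first test applies.
    return previous_avg is not None and abs(current_avg - previous_avg) < min_change
```

In a K-means loop, should_stop would be checked once per iteration, after the clusters have been reassigned and the average recomputed.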
The script displays its output in a new page in the browser.
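One way to produce such a page with only the standard library is sketched here. The function names and the HTML layout are hypothetical; only the file name CareerResults.html comes from this README.

```python
import html
import pathlib
import webbrowser


def render_results(keyword, links):
    """Build a minimal HTML page listing the result links."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(html.escape(link, quote=True))
        for link in links
    )
    return "<html><body><h1>Results for {}</h1>\n<ul>\n{}\n</ul></body></html>".format(
        html.escape(keyword), items
    )


def show_results(keyword, links, path="CareerResults.html"):
    """Write the page to disk and ask the default browser to open it."""
    target = pathlib.Path(path)
    target.write_text(render_results(keyword, links), encoding="utf-8")
    webbrowser.open_new_tab(target.resolve().as_uri())
```

Escaping the keyword and links with html.escape keeps characters such as & from breaking the generated page.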