Build a Simple Chatbot in Python with Naive Bayes

In the previous post, we built a well-performing chatbot in no more than 70 lines of code. In this post, we cover a slightly more complicated algorithm, although one that is still considerably simpler than a Sequence-to-Sequence (Seq2Seq) model. One thing that differentiates this model from the previous one is that it needs a decent amount of data to be accurate. Our file has only 41 rows, so the accuracy of the model will be limited.

The file used in this exercise can be downloaded here.

The pandas package is used to import the CSV file, and the first field, category, is dropped. The to_dict function then transforms the dataset into a list of dictionaries, one per row.

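import pandas as pd

# Load the dataset; the first row holds the column names.
mydata = pd.read_csv('car_dealership.csv', header=0)
# Drop the unused 'category' column.
mydata2 = mydata.drop('category', axis=1)
# One dictionary per row; recent pandas versions require the full name 'records'.
training_data = mydata2.to_dict('records')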
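
To see what the resulting structure looks like, here is a minimal sketch that builds a tiny DataFrame in memory instead of reading car_dealership.csv. The two rows are made up for illustration and are not taken from the real file; only the column names (category, sentence, response) follow the ones used in this post.

```python
import pandas as pd

# Hypothetical rows standing in for car_dealership.csv (not the real data).
sample = pd.DataFrame({
    "category": ["hours", "greeting"],
    "sentence": ["What time do you open?", "Hello there"],
    "response": ["We open at 9 am.", "Hello! How may I help you?"],
})

# Drop the unused column and convert each remaining row to a dictionary.
records = sample.drop("category", axis=1).to_dict("records")

for row in records:
    print(row)
```

Each element of the list is one question/response pair, which is the shape the later loops iterate over.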


NLTK is a very large and valuable library that has been used for many years. It is built specifically for natural language processing (NLP) and offers many functions. The first one we are going to use is the Lancaster Stemmer. In NLP, stemming is the process of reducing a word to its stem so that similar words are treated alike by a model. For example, the words “thankful”, “thanking”, and “thanks” would be stemmed to “thank”. As a result, “thank” shows up more frequently in the data, which can help the prediction.

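import nltk
from nltk.stem.lancaster import LancasterStemmer
# word_tokenize (used below) needs NLTK's Punkt tokenizer data; run this once.
# Newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')
stemmer = LancasterStemmer()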
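
As a quick check of how the stemmer behaves, the sketch below stems the three example words and prints the results. The Lancaster stemmer is rule-based and fairly aggressive, so the exact stems may differ from what you expect; running it is the easiest way to find out. The names words and stems are only used for this illustration.

```python
from nltk.stem.lancaster import LancasterStemmer

stemmer = LancasterStemmer()

# The stemmer is rule-based, so it needs no extra NLTK data downloads.
words = ["thankful", "thanking", "thanks"]
stems = {word: stemmer.stem(word.lower()) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```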


The dataset is now split into questions (the sentence column) and responses. The responses are de-duplicated, and each question is tokenized, lower-cased, and stemmed. If a stemmed word is not yet in question_words, it is given a count of 1; if it is already there, its count is incremented by 1. Each stemmed word is also added to the word list of the response that answers the question.

question_words = {}   # stemmed word -> number of times it appears in all questions
response_words = {}   # response -> list of stemmed words from its questions

# Start every distinct response with an empty word list.
responses = list(set(a['response'] for a in training_data))
for r in responses:
    response_words[r] = []

for data in training_data:
    for word in nltk.word_tokenize(data['sentence']):
        # Skip tokens that carry no meaning for matching.
        if word not in ["?", "'s"]:
            stemmed_word = stemmer.stem(word.lower())
            if stemmed_word not in question_words:
                question_words[stemmed_word] = 1
            else:
                question_words[stemmed_word] += 1
            response_words[data['response']].append(stemmed_word)
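
To make the two dictionaries concrete, here is a small sketch on made-up data. It assumes a simplified stand-in for tokenizing and stemming (lower-casing and splitting on whitespace) so that it runs without NLTK. The names toy_data, simple_tokens, toy_question_words, and toy_response_words are hypothetical and exist only for this example.

```python
# Hypothetical training rows (not from car_dealership.csv).
toy_data = [
    {"sentence": "what time do you open", "response": "We open at 9 am."},
    {"sentence": "what cars do you sell", "response": "We sell sedans and trucks."},
]

def simple_tokens(text):
    # Simplified stand-in for nltk.word_tokenize plus stemming.
    return text.lower().split()

toy_question_words = {}   # word -> number of occurrences across all sentences
toy_response_words = {}   # response -> list of words seen in its sentences

for row in toy_data:
    for word in simple_tokens(row["sentence"]):
        toy_question_words[word] = toy_question_words.get(word, 0) + 1
        toy_response_words.setdefault(row["response"], []).append(word)

print(toy_question_words)
print(toy_response_words)
```

Words shared by both questions, such as "what", end up with a count of 2, while words that appear in only one question, such as "open", keep a count of 1.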


A function called calculate_response_score will be created with the following steps:

  1. Tokenize, lower-case, and stem the user input.
  2. Check whether each stemmed word appears among the words associated with the response.
  3. Weight each matching word by the inverse of its overall count and sum the weights to get the score.

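def calculate_response_score(question, response_name, show_details=True):
    # show_details is kept from the original signature but is not used here.
    score = 0
    for word in nltk.word_tokenize(question):
        stemmed_word = stemmer.stem(word.lower())
        if stemmed_word in response_words[response_name]:
            # Rare words count more: the weight is 1 / overall frequency.
            score += 1 / question_words[stemmed_word]
    return score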
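
As a worked example of the weighting, a word that appears once across the training questions contributes 1/1 to the score, while a word that appears four times contributes only 1/4. The sketch below uses hand-written counts and whitespace splitting instead of NLTK; word_counts, words_by_response, and toy_score are hypothetical names, and the numbers are invented for illustration.

```python
# Hypothetical counts: how often each word appears across all training questions.
word_counts = {"what": 4, "do": 4, "you": 4, "time": 1, "open": 1, "sell": 2}

# Hypothetical words associated with each response.
words_by_response = {
    "hours": ["what", "time", "do", "you", "open"],
    "inventory": ["what", "do", "you", "sell"],
}

def toy_score(question, response_name):
    # Sum 1/count for each question word linked to this response.
    score = 0.0
    for word in question.lower().split():
        if word in words_by_response[response_name]:
            score += 1 / word_counts[word]
    return score

hours_score = toy_score("what time do you open", "hours")
inventory_score = toy_score("what time do you open", "inventory")
print(hours_score, inventory_score)  # prints: 2.75 0.75
```

The common words "what", "do", and "you" add 0.25 each to both responses, so the two rare words "time" and "open" are what separate the scores.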

A function called translate will be created that builds two lists: one with the responses themselves and one with the corresponding scores (calculated with the calculate_response_score function). The response with the highest score is chosen and printed. If all of the scores are zero, a fallback message is printed instead.

def translate(question):
    # Score every known response against the question.
    rawdata_r = list(response_words.keys())
    rawdata_response = [calculate_response_score(question, r) for r in rawdata_r]
    df = pd.DataFrame(list(zip(rawdata_r, rawdata_response)), columns=['Response', 'Score'])
    # Pick the row with the highest score.
    out1 = df.loc[df['Score'].idxmax()]
    out2 = out1['Response']
    if out1['Score'] == 0:
        print("Sorry, I don't quite understand. Could you rephrase that?")
    else:
        print(out2)
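
The same selection can also be done without a DataFrame. The sketch below is an alternative, not the approach used in this post: it accepts any scoring function and returns the reply instead of printing it, which makes it easier to test. The names FALLBACK, best_response, overlap_score, and candidates are hypothetical, and overlap_score is a deliberately simple stand-in for calculate_response_score.

```python
FALLBACK = "Sorry, I don't quite understand. Could you rephrase that?"

def best_response(question, responses, score_fn):
    # Score every candidate response, then keep the highest-scoring one.
    scores = {r: score_fn(question, r) for r in responses}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else FALLBACK

def overlap_score(question, response):
    # Hypothetical scoring function: counts words shared with the response text.
    return len(set(question.lower().split()) & set(response.lower().split()))

candidates = ["we open at nine", "we sell trucks"]
print(best_response("when do you open", candidates, overlap_score))
print(best_response("hello", candidates, overlap_score))
```

Returning the string keeps the matching logic separate from the input/output loop, so the caller decides whether to print it.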


Finally, we run the chatbot in a while loop. When the user types ‘quit’, the program ends.

print("\n Welcome to our dealership! How may I help you? \n")
while True:
    # Lower-case the input so that 'Quit' and 'QUIT' also end the loop.
    question = input().lower()
    if question == 'quit':
        break
    translate(question)