Introduction

Naive Bayes is a popular machine learning algorithm for classification problems. It is a probabilistic algorithm that estimates the probability that an input, described by its features, belongs to each class. In this article, we will discuss two variants of the Naive Bayes algorithm: Multinomial Naive Bayes and Bernoulli Naive Bayes.

Multinomial Naive Bayes

Multinomial Naive Bayes is a variant of the Naive Bayes algorithm used for text classification problems. It is called Multinomial because it assumes that the input features are generated from a multinomial distribution, which means the features represent the frequencies of words in a document or a corpus. It is called Naive Bayes because it assumes that the input features are conditionally independent of each other given the class, which is not always true in practice.

In Multinomial Naive Bayes, each input feature represents the number of times a particular word occurs in a document or a corpus. The algorithm calculates the probability of each class given the input features and assigns the class with the highest posterior probability to the document.

Application of Multinomial Naive Bayes

Multinomial Naive Bayes is primarily used for text classification problems. It is used to classify documents or text data into different categories based on the frequency of words in the documents. Some examples of applications of Multinomial Naive Bayes include spam detection, sentiment analysis, and topic classification.

The mathematical explanation of Multinomial Naive Bayes involves calculating the probability of each class given the input features. The algorithm assumes that the input features are generated from a multinomial distribution.

Let’s assume that we have a set of input features X = {x1, x2, …, xn} and a set of classes C = {c1, c2, …, cm}. The probability of a class ci given the input features X can be calculated using Bayes’ theorem:

P(ci|X) = P(X|ci) * P(ci) / P(X)

Where P(ci|X) is the posterior probability of class ci given the input features X, P(X|ci) is the likelihood of the input features X given the class ci, P(ci) is the prior probability of class ci, and P(X) is the probability of the input features X.

In Multinomial Naive Bayes, the likelihood of the input features X given the class ci can be calculated as:

P(X|ci) = ∏(P(xi|ci)^n(xi))

Where P(xi|ci) is the probability of the i-th feature xi given the class ci, and n(xi) is the frequency of the i-th feature xi in the input features X.

The prior probability of class ci can be calculated as:

P(ci) = N(ci) / N

Where N(ci) is the number of times class ci appears in the training data, and N is the total number of training examples.

The probability of the input features X can be calculated as:

P(X) = ∑(P(X|ci) * P(ci))

Where the sum is over all the classes in C.
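
To make this concrete, here is a minimal base-R sketch of the calculation above for a toy two-class problem; the per-word probabilities, priors, and word counts are invented purely for illustration.

# Toy example: two classes and a vocabulary of three words.
# P(word | class), assumed already estimated from training counts.
p_word_given_class <- rbind(
  spam     = c(buy = 0.5, discount = 0.3, offer = 0.2),
  not_spam = c(buy = 0.2, discount = 0.3, offer = 0.5)
)
priors <- c(spam = 0.4, not_spam = 0.6)   # P(ci) = N(ci) / N

# A new document represented by its word counts n(xi).
doc_counts <- c(buy = 2, discount = 1, offer = 0)

# Likelihood P(X | ci) = product over words of P(xi | ci)^n(xi)
likelihood <- apply(p_word_given_class, 1,
                    function(p) prod(p ^ doc_counts))

# Unnormalized posteriors, then divide by P(X) = sum of P(X | ci) * P(ci)
unnorm    <- likelihood * priors
posterior <- unnorm / sum(unnorm)

posterior                    # P(ci | X) for each class
names(which.max(posterior))  # predicted class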

Bernoulli Naive Bayes

Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm used for text classification problems where the input features are binary, i.e., they represent the presence or absence of a word in a document. It is called Bernoulli because it assumes that the input features follow a Bernoulli distribution.

The Bernoulli distribution is a probability distribution that describes a random experiment with only two possible outcomes, such as flipping a coin. In the context of text classification, the experiment is whether a particular word is present or absent in a document, and the outcome is 1 or 0, respectively.

The Bernoulli Naive Bayes algorithm assumes that the input features are conditionally independent of each other given the class, and it calculates the probability of each class given the input features. The class with the highest posterior probability is then assigned to the document.

Let’s consider an example to understand how Bernoulli Naive Bayes works. Suppose we have a training dataset with two classes: “spam” and “not spam.” We also have a set of input features that represent the presence or absence of certain words in an email. For simplicity, let’s assume that there are only three input features: “buy,” “discount,” and “offer.”

The Bernoulli Naive Bayes algorithm calculates the probability of each class given the input features as follows:

P(spam | X) = P(X | spam) * P(spam) / P(X)

P(not spam | X) = P(X | not spam) * P(not spam) / P(X)

Where X represents the input features, and P(spam) and P(not spam) are the prior probabilities of the two classes.

The likelihood of the input features X given the class spam can be calculated as:

P(X | spam) = P(buy | spam) * P(discount | spam) * P(offer | spam)

Where P(buy | spam) represents the probability of the word “buy” appearing in a spam email, P(discount | spam) the probability of the word “discount” appearing in a spam email, and P(offer | spam) the probability of the word “offer” appearing in a spam email. These expressions assume that all three words are present in the email; for any word that is absent, the corresponding factor becomes 1 − P(word | spam).

Similarly, the likelihood of the input features X given the class not spam can be calculated as:

P(X | not spam) = P(buy | not spam) * P(discount | not spam) * P(offer | not spam)

The probability of the input features X can be calculated as:

P(X) = P(X | spam) * P(spam) + P(X | not spam) * P(not spam)

Finally, the Bernoulli Naive Bayes algorithm assigns the class with the highest posterior probability to the input X.
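
As a minimal base-R sketch of this spam example, the calculation can be written out as follows; the per-word probabilities and priors below are invented numbers, and the email is assumed to contain “buy” and “offer” but not “discount”.

# Toy Bernoulli Naive Bayes with binary features for "buy", "discount", "offer".
# P(word present | class), assumed already estimated from training data.
p_present <- rbind(
  spam     = c(buy = 0.80, discount = 0.70, offer = 0.60),
  not_spam = c(buy = 0.10, discount = 0.20, offer = 0.15)
)
priors <- c(spam = 0.4, not_spam = 0.6)

# A new email: "buy" and "offer" present, "discount" absent.
x <- c(buy = 1, discount = 0, offer = 1)

# Bernoulli likelihood: present words contribute p, absent words contribute (1 - p).
likelihood <- apply(p_present, 1,
                    function(p) prod(p ^ x * (1 - p) ^ (1 - x)))

unnorm    <- likelihood * priors
posterior <- unnorm / sum(unnorm)   # P(spam | X) and P(not spam | X)
names(which.max(posterior))         # predicted class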

One potential issue with Bernoulli Naive Bayes is the “zero-frequency problem.” If a particular word does not appear in the training data for a particular class, the probability of that word given the class will be zero, which will cause the entire likelihood to be zero. To address this problem, a smoothing technique such as Laplace smoothing can be applied to avoid zero probabilities.

Laplace Smoothing

Laplace smoothing, also known as add-one smoothing, is a technique used to address the “zero-frequency problem” in Bayesian statistics and machine learning. The zero-frequency problem occurs when a particular event or feature has zero occurrence in the training data, leading to a probability estimate of zero for that event or feature. This can result in an inaccurate or unstable model, especially for small datasets.

Laplace smoothing involves adding a small amount of “pseudo-count” to the observed counts for each event or feature. This has the effect of “smoothing” the probability estimates and reducing the impact of zero counts. Specifically, for each event or feature, Laplace smoothing adds one to the observed count and adds a total of “k” to the denominator of the probability estimate, where “k” is the number of possible outcomes or categories.

The formula for Laplace smoothing can be expressed as:

P(x) = (count(x) + 1) / (N + k)

Where:

  • P(x) is the smoothed probability estimate of the event or feature x
  • count(x) is the observed count of the event or feature x in the training data
  • N is the total number of observations in the training data
  • k is the number of possible outcomes or categories

For example, suppose we have a binary classification problem where the input features are words in a document, and we want to estimate the probability of each class given the input features using Naive Bayes. If a particular word does not appear in the training data for a particular class, the probability estimate for that word will be zero, leading to a zero probability estimate for the entire document. To address this problem, we can apply Laplace smoothing by adding a count of one to the observed counts for each word and adding two to the denominator of the probability estimate (since each binary word feature has two possible outcomes: present or absent). This has the effect of “smoothing” the probability estimates and reducing the impact of zero counts.
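
A short base-R illustration of the formula, with made-up counts:

# Observed counts for one word within one class of the training data.
count_x <- 0    # the word never appears for this class
N       <- 10   # total observations for the class
k       <- 2    # possible outcomes for a binary feature: present / absent

# The unsmoothed estimate collapses to zero...
count_x / N               # 0

# ...while Laplace (add-one) smoothing keeps it strictly positive.
(count_x + 1) / (N + k)   # 1/12, roughly 0.083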

Laplace smoothing is a simple and effective technique for addressing the zero-frequency problem in Bayesian statistics and machine learning. However, it is not always the best choice, and other smoothing techniques, such as Good-Turing smoothing and Jelinek-Mercer smoothing, may be more appropriate for specific applications.

Data Prep

Multinomial Naive Bayes is specifically designed to handle discrete data, such as word frequencies or document term matrices. The following are the possible options and formats of data that can be used with Multinomial Naive Bayes:

  1. Bag-of-words (BoW) representation: In this format, the text data is first tokenized and then converted into a matrix where each row represents a document, and each column represents a unique word in the corpus. The values in the matrix represent the frequency of each word in the corresponding document. This format can be used directly with Multinomial Naive Bayes.
  2. Term Frequency-Inverse Document Frequency (TF-IDF) representation: This is a variant of the BoW representation that takes into account the importance of each word in the corpus. In this format, the term frequencies are weighted by the inverse document frequency, which down-weights words that appear in many documents. Although Multinomial Naive Bayes is formulated for counts, TF-IDF matrices are also commonly used with it in practice.
  3. Count-based features: In addition to word frequencies, other discrete features can also be used with Multinomial Naive Bayes, such as the presence or absence of certain keywords or phrases, or the frequency of specific characters or n-grams in the text.
  4. Categorical data: Multinomial Naive Bayes can also be used with other types of categorical data, such as survey responses, user ratings, or product reviews, where each observation represents a category or label and a set of discrete features or attributes.

Overall, the data used with Multinomial Naive Bayes should be in a format that represents discrete features or attributes that are relevant to the classification task. The choice of data format will depend on the specific problem and the nature of the available data.
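
As a minimal sketch of the bag-of-words option, a document-term matrix can be built in base R without any text-mining packages; the three short documents below are purely illustrative.

# Three toy documents.
docs <- c("buy now and get a discount",
          "special offer just for you",
          "meeting notes for the quarterly review")

# Tokenize on whitespace and build the vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per word,
# values are word frequencies.
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))
dtm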

Project-Specific Data Prep

For this project, the continuous variables are converted to categorical variables, and Multinomial Naive Bayes is then used for the analysis.

Naive Bayes requires a label and categorical features; it performs better with a frequency matrix and categorical variables. Before proceeding with Naive Bayes, we also need to check the correlation of the independent variables, since Naive Bayes assumes independence between them.

Fig 1. Correlation Plot of the independent variables and dependent variable

From the correlation plot, it can be seen that the features in the model are not independent and are highly correlated. This makes sense given the nature of the data, which is population-centric: the features in the analysis directly relate to the number of people in a state. This is a significant issue for the analysis, and the independent variables also do not show good correlation with the dependent variable, the number of registrations. Even so, it is not difficult to apply Naive Bayes to this data. To do so, each feature was converted from numeric and continuous to categorical. In the following images, the transformation of the data from continuous numeric to discrete categorical can be seen. The categorical data used for Naive Bayes can be found here.

Fig 2. Electric Vehicle data with continuous variables
Fig 3. Electric vehicle data with categorical variables
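
A sketch of the two steps described above, the correlation check and the discretization, is shown below; ev is a hypothetical data frame of the continuous state-level data, and the column names Population, GDP, and Registrations are placeholders for the actual features in the dataset.

# ev is assumed to be a data frame of continuous, state-level features;
# the column names below are placeholders.

# Correlation among the independent variables and with the target.
round(cor(ev[, c("Population", "GDP", "Registrations")]), 2)

# Discretize each continuous variable into three categories
# (Low / Medium / High) using tercile cut points.
to_cat <- function(x) {
  cut(x,
      breaks = quantile(x, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
      labels = c("Low", "Medium", "High"),
      include.lowest = TRUE)
}
ev_cat <- as.data.frame(lapply(ev[, c("Population", "GDP")], to_cat))
ev_cat$Registrations <- to_cat(ev$Registrations)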

Now that the format of the data is appropriate for Naive Bayes, the data can be split into a training and a testing set. It is crucial that the training and testing sets are disjoint. If they overlapped, the model would be exposed to the test observations and would appear to perform exceptionally well; a model trained on intersecting training and testing sets can be quite useless in real-world applications.

In R, the data can be split into training and testing sets using the sample.split() function from the caTools package. Ensure that the split is made without replacement. The code for implementing Naive Bayes can be found here. A glimpse of the training and testing sets can be seen below:

Fig 4. Training set
Fig 5. Test set
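
A minimal sketch of the split-and-train step follows, assuming the caTools package for sample.split() and the e1071 package for naiveBayes(); ev_cat and the Registrations label are the placeholder names introduced in the earlier sketch.

library(caTools)   # sample.split()
library(e1071)     # naiveBayes()

set.seed(42)
# Stratified split on the label; SplitRatio is the fraction used for training.
split_flag <- sample.split(ev_cat$Registrations, SplitRatio = 0.7)
train_set  <- ev_cat[split_flag, ]
test_set   <- ev_cat[!split_flag, ]

# Train Naive Bayes on the categorical features, with Laplace smoothing.
nb_model <- naiveBayes(Registrations ~ ., data = train_set, laplace = 1)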

The training set is used to train the model, and the classes for the test set are then predicted. Applying Naive Bayes to this dataset yielded the following results.

Fig 6. Confusion Matrix and accuracy of Naive Bayes Model
Fig 7. Class wise metrics
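
The kind of confusion matrix and class-wise metrics shown above can be produced with a sketch like the following, continuing from the model trained earlier; the caret package's confusionMatrix() is assumed for the accuracy and per-class statistics.

# Predict classes for the held-out test set.
pred <- predict(nb_model, newdata = test_set)

# Simple confusion matrix with base R.
table(Predicted = pred, Actual = test_set$Registrations)

# Accuracy and class-wise metrics via the caret package.
library(caret)
confusionMatrix(pred, test_set$Registrations)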

In this model, class 1 represents a low number of electric vehicles, class 2 a medium number, and class 3 a high number. It can be seen that the model performs poorly in predicting class 3 correctly; it predicts most class 3 observations as class 1. This suggests that the features used in the model might not be the best predictors of the number of electric vehicles in a state.

Conclusion

The model does fairly well in predicting class 1, but poorly in predicting class 2 and class 3. From applying Naive Bayes, it is clear that the features of the model need to be reconsidered. A feature representing the adoption threshold point of electric vehicles needs to be introduced in the form of a binary variable. GDP needs to be converted to GDP per capita. The population metrics need to be compressed into a single metric; a percentage would likely be more applicable than an absolute value.