Exploratory Data Analysis (EDA) not only provides tools for gaining insight from data; it is also a good way to find flaws in the data. The original DataFrame, created from the Kaggle Electric Vehicle population data, can be found here.
The code for EDA can be found here.
The first step was to get an idea of the Electric Vehicle population in each state. Fig 1 suggested that Washington was the only state in the whole country with any Electric Vehicles registered. Domain knowledge says this cannot be right. Digging into the data showed that the Kaggle dataset, although advertised as a country-wide dataset, only contained data for Washington, which is not the data that is needed. This visualization made the error in the dataset easy to find.
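The same flaw can also be caught without a plot by counting rows per state. The sketch below is a minimal illustration, not the code used for this project: the column name `State` and the toy rows are assumptions made for the example.

```python
import pandas as pd

def registrations_per_state(df: pd.DataFrame, state_col: str = "State") -> pd.Series:
    """Count how many registration rows each state contributes."""
    return df[state_col].value_counts()

# Toy rows standing in for the Kaggle data (hypothetical values).
sample = pd.DataFrame(
    {
        "State": ["WA", "WA", "WA", "WA"],
        "Model Year": [2018, 2019, 2020, 2020],
    }
)

counts = registrations_per_state(sample)

# A "country-wide" dataset covering a single state is a red flag.
single_state = counts.size == 1
print(counts)
print("Only one state present:", single_state)
```

A check like this is cheap to run right after loading a dataset, before any modelling depends on it.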
The next step was to correct this mistake and obtain data for each state. Multiple sources were used to find this data; the majority of it was assembled using this website. The values were adjusted to remove the 2021 and 2022 vehicle registrations. Using the new dataset, Fig 2 gives a better picture of the distribution of Electric Vehicle registrations from 2008 to 2020.
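Restricting the data to 2008 through 2020 could be done along the following lines. This is a sketch under assumptions: the column names `State`, `Year` and `Registrations`, and the toy values, are hypothetical and do not come from the actual dataset.

```python
import pandas as pd

def keep_years(df: pd.DataFrame, year_col: str = "Year",
               first: int = 2008, last: int = 2020) -> pd.DataFrame:
    """Keep only rows whose year lies in [first, last], inclusive."""
    mask = df[year_col].between(first, last)
    return df.loc[mask].reset_index(drop=True)

# Toy per-state, per-year registrations (hypothetical values).
raw = pd.DataFrame(
    {
        "State": ["CA", "CA", "CA", "TX", "TX"],
        "Year": [2019, 2020, 2021, 2020, 2022],
        "Registrations": [100, 150, 200, 40, 90],
    }
)

trimmed = keep_years(raw)
print(trimmed)
```

`Series.between` is inclusive on both ends by default, so both 2008 and 2020 are kept.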
After the correction, the new DataFrame can be seen in Fig 3:
Now that the Electric Vehicle data is correct, it is important to check whether the features associated with the label (number of Electric Vehicles registered per state per year) are mostly free of flaws.
Let’s take a look at Bachelor’s education attainment and GDP, and at how electric vehicle registrations relate to each feature. In the figure on the left below, the number of electric vehicles registered does not appear to depend strongly on education level, although the plot needs to be broken down by state to reveal a clearer pattern. In the figure on the right below, the normalized number of electric vehicles shows a slight dependence on normalized GDP.
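A visual impression such as "slight dependence" can be backed by a number. The sketch below assumes min-max normalization and a Pearson correlation; the text does not say which normalization was actually used, and the column names `GDP` and `EV` and the toy values are hypothetical.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Rescale a numeric series to the range [0, 1]."""
    span = s.max() - s.min()
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return s * 0.0
    return (s - s.min()) / span

# Toy feature/label pairs (hypothetical values).
toy = pd.DataFrame(
    {
        "GDP": [1.0, 2.0, 3.0, 4.0],
        "EV": [10.0, 30.0, 20.0, 40.0],
    }
)

norm = toy.apply(min_max)

# Pearson correlation between the normalized feature and the normalized label.
r = norm["GDP"].corr(norm["EV"])
print(norm)
print("Pearson r:", r)
```

Pearson correlation is unchanged by min-max scaling, so normalizing helps mainly when plotting both quantities on comparable axes rather than when computing the coefficient itself.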