Data has historically been used to answer questions, gain insights and communicate information. But is data always readily available for us to use however we choose? Sometimes, sure. But most of the time? Not really.

To find out which factors affect electric vehicle sales in a state, a lot of data needs to be collected. For this analysis, state-wise data is necessary. Before collecting the data, it is good to have an idea of the scope of the analysis. This analysis uses historical data from 2008-2020. Topics that would be interesting to include are the economy, population, population change, education level, temperature, energy consumption and production, GDP, fuel prices, total jobs and more. This list is in no way exhaustive. It is important to stay flexible while collecting and analyzing data. Why? Some factors that seem like a good fit for the analysis at first might not make sense later on.

There are several techniques for collecting data: using APIs, web scraping, downloading files and, occasionally, manual entry. Any method is fine as long as it suits the purpose of the analysis. For this project, all of the aforementioned methods were used. This page explains each step, followed by code snippets, so the reader can see how to gather data from multiple sources. The reader will also learn how to convert the data into a usable format, on which certain machine learning models can be applied.

For convenience, the full Python code can be found here.

Loading Libraries

To perform the required transformations and cleaning, the following libraries need to be loaded:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import json
import string
import re

APIs

Using APIs can speed up the data collection process tremendously. Below is the code to get state-wise total jobs per year from the BEA API. Collecting this manually would take a long time; Python makes transformations like the one below much faster.

Data Source: www.bea.gov

Using APIs makes the process streamlined. Below is the code to transform the response from raw JSON into a pandas DataFrame.

## Total employment in each state from 2008-2020


# The request below returns all possible line codes;
# we need the line code for total employment (LineCode 10, used later)
get_line_code = 'http://apps.bea.gov/api/data?&UserID=21B05CEF-CC78-40C4-BF59-752A897A0DAF&method=GetParameterValuesFiltered&datasetname=Regional&TargetParameter=LineCode&TableName=CAEMP25N&ResultFormat=JSON'
response = requests.get(get_line_code)
print(response.content)


# Creating an endpoint for the API call
endpoint_jobs = 'https://apps.bea.gov/api/data'

# Creating  a BEA API dictionary to access tables
bea_API_dic = {'TableName' : 'CAEMP25N', 
'GeoFips' : 'STATE',
'Year' : '2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020',
'datasetname' : 'Regional',
'LineCode' : '10',
'method' : 'GetData',
'UserID' : '21B05CEF-CC78-40C4-BF59-752A897A0DAF',
'ResultFormat' : 'json'
    }


## Creating csv file to add data to
job_filename = 'tot_employment.csv'
file = open(job_filename, 'w')

# Adding headers to the csv file; note the \n at the end
write_this = 'State,Year,num_jobs_tot\n'
file.write(write_this)
file.close()


## Getting data from API
response = requests.get(endpoint_jobs,bea_API_dic)
print(response)
json_t = response.json()
print(json_t)  # inspect the raw response before cleaning

## Saving json file to computer
save_file = open('employment.json', 'w')
json.dump(json_t, save_file, indent=4)  # indent=4 saves the file in a readable format
save_file.close()

## Extracting relevant information from files
file = open(job_filename, 'a')

for item in json_t['BEAAPI']['Results']['Data']:
    #print(item, '\n\n')
    State = item['GeoName']
    #print(State)
    
    Year = item['TimePeriod']
    
    # Removing commas because they can break the csv file
    num_jobs_total = item['DataValue']
    num_jobs_total = re.sub(',','', num_jobs_total)
        
    #writing each observation in the csv file
    write_new = str(State)+','+str(Year)+','+str(num_jobs_total)+'\n'
    print(write_new)
    file.write(write_new)
    
file.close()


# Converting csv to pandas dataframe
tot_jobs = pd.read_csv(job_filename)
tot_jobs.head()

# Removing all State values that are not US states
# (List_of_state_names, defined earlier in the full script, holds the 50 US state names)
delete_later_2 = []
for j in range(len(tot_jobs['State'])):
    if tot_jobs.iloc[j]['State'] in List_of_state_names:
        delete_later_2.append(1)
    else:
        delete_later_2.append(0)
        
tot_jobs['Crap'] = delete_later_2

tot_jobs.drop(tot_jobs[tot_jobs['Crap'] != 1].index, inplace =True)

tot_jobs.head()

tot_jobs = tot_jobs[['State', 'Year', 'num_jobs_tot']]
tot_jobs.shape
tot_jobs['Year'] = tot_jobs['Year'].astype(object)
tot_jobs.dtypes
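As an alternative to writing an intermediate csv file, the list of records in the JSON response can be flattened directly into a DataFrame with pandas' json_normalize. The sketch below assumes the same json_t response and the same three fields as the code above; jobs_alt is just an illustrative name.

# Alternative sketch: build the jobs DataFrame straight from the JSON response,
# without an intermediate csv file (assumes json_t from the request above)
records = json_t['BEAAPI']['Results']['Data']
jobs_alt = pd.json_normalize(records)[['GeoName', 'TimePeriod', 'DataValue']]
jobs_alt.columns = ['State', 'Year', 'num_jobs_tot']
jobs_alt['num_jobs_tot'] = jobs_alt['num_jobs_tot'].str.replace(',', '')  # strip thousands separators
jobs_alt.head()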

Using an API to get article data and transform it into a list of lists

A better version of the code can be found on Dr. Gates’ website here. The goal was to get the descriptions of all articles related to electric vehicles, renewable energy and energy. This data is collected to perform Association Rule Mining later on, to find the closest associations with electric vehicles. To perform Association Rule Mining, the data needs to be in a list-of-lists format. The transformation can be seen below: the JSON text is converted into a neat list of lists.

Data Source: www.newsapi.org

The code below performs this transformation. In this code the title, description, date and source are extracted. Since JSON is a dictionary-like format, its keys can be used to extract the information stored in their values. The data is text, and text comes with punctuation. If this data is going to be used for machine learning, punctuation, stopwords and unnecessary whitespace need to be removed (a stopword-removal sketch follows the code below).

#Getting news data about 'EV', 'electric vehicles' and 'renewable energy'

filename = 'Headlines_for_analysis.csv'
this_file = open(filename, 'w')

write_stuff = 'Label,Date,Source,Title,Description\n'
this_file.write(write_stuff)
this_file.close()
q_topics = ['electric vehicle', 'renewable energy', 'energy']
end_point = 'https://newsapi.org/v2/everything'

for topic in q_topics:
    
    URL_dic = {'apiKey': 'e814794a812449d1b7c7d485b5e4525a',
              'q': topic,
              'pageSize': 100,
              'sortBy' : 'top',
              'totalRequests': 100
              }
    response = requests.get(end_point, URL_dic)
    print(response)
    json_text = response.json()
    print(json_text)
    
    this_file = open(filename, 'a')
    LABEL = topic
    for item in json_text['articles']:
        print(item, "\n\n")
        
        #Source
        Source = item['source']['name']
        print(Source)
        
        #Date
        A_date = item['publishedAt']
        B_date = A_date.split('T')
        date = B_date[0]
        print(date)
        
        #Title
        title = item['title']
        title = title.lower()
        title = re.sub(r'\d+', '', title)
        title = re.sub('—',' ', str(title))
        table = title.maketrans('','',string.punctuation)
        title = title.translate(table)
        title = re.sub('\n|\r', '', title)
        title = re.sub(r' +', ' ', title)
                    
        title = title.strip()
        print(title)
    
        description = item['description']
        print(description,'\n')
        description = str(description)
        description = description.lower()
        description = re.sub(r'\d+', '', description)
        description = re.sub(r'[,‘’.—;@%_#?!&$/()="<>:\-\']+', ' ', description, flags = re.IGNORECASE)
        description = re.sub('\n|\r', '', description)
        description = re.sub(r' +', ' ', description)
                    
        description = description.strip()
        print(description)
        
        write_stuff = str(LABEL)+','+str(date)+','+str(Source)+','+str(title)+','+str(description)+'\n'
        print(write_stuff)
        
        this_file.write(write_stuff)
    
    this_file.close()
    

# Note: newer pandas versions replace error_bad_lines=False with on_bad_lines='skip'
text_df = pd.read_csv(filename, error_bad_lines = False, encoding = 'cp1252')
text_df.head()

text_df.columns

text_df = text_df.dropna()
text_df.drop_duplicates(inplace=True)
print(len(text_df['Description']))

## Building a list of lists (one description per inner list) for Association Rule Mining
ARM_description_list = []
for i in range(len(text_df['Description'])):
    x = text_df.iloc[i]['Description']
    print([x])
    ARM_description_list.append([x])

## Use this list for ARM    
print(ARM_description_list)
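
The cleaning above removes digits and punctuation but not stopwords. Below is a minimal stopword-removal sketch; the short stopword set is only an illustration (a fuller list such as NLTK's would normally be used), and ARM_transactions is an illustrative name. It also splits each description into a list of words, the transaction format typically used for Association Rule Mining.

# Minimal stopword-removal sketch; the stopword set here is illustrative only
stop_words = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'on', 'for', 'is', 'are', 'with'}

ARM_transactions = []
for row in ARM_description_list:
    words = row[0].split()  # each row is a one-element list holding one description
    ARM_transactions.append([w for w in words if w not in stop_words])

print(ARM_transactions[:3])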

Downloading CSV data and converting it to the desired DataFrame

Often in data collection, downloading CSV files and transforming the data is more feasible. To get the age-group-wise population for each state, multiple CSV files were downloaded from kff.org. Then, using loops and pandas, the data was transformed into a usable format.

Data Source: www.kff.org

In this data, some values under ‘Location’ needed to be removed because they were not US states. A column named ‘Year’ was also added.

#Age distribution by state: Adults 19-25, Adults 26-34, Adults 35-54


# The year list is assumed here; the full script defines the years used in the analysis (2008-2020)
year = [str(y) for y in range(2008, 2021)]
num_years = len(year)
old_pop_data= pd.read_csv('C:/Users/chaub/Documents/CU_Boulder/Spring 2023/CSCI 5622 Machine Learning/CSCI5662 Project/pop_data_2018.csv', skiprows=2)
print(old_pop_data.shape)


def get_pop_data():
    file_1 = 'C:/Users/chaub/Documents/CU_Boulder/Spring 2023/CSCI 5622 Machine Learning/CSCI5662 Project/pop_data_'
    frames =[]
    for i in range(num_years):
        filename= file_1+year[i]+'.csv'
        
        doc = pd.read_csv(filename, skiprows=(2))
        #print(doc.shape)
        
        doc = doc[['Location', 'Adults 19-25', 'Adults 26-34', 'Adults 35-54']]
               
        print(doc.shape)
        
        delete_later =[]
        
        for j in range(len(doc['Location'])):
            
            
            if doc.iloc[j]['Location'] in List_of_state_names:
                delete_later.append(1)
            else:
                delete_later.append(0)
            
        
        doc['Check'] = delete_later
        doc.drop(doc[doc['Check'] != 1].index, inplace=True)

        doc = doc[['Location',  'Adults 19-25',  'Adults 26-34',  'Adults 35-54']]
        mul = len(doc[['Location']])
        Year_pop_data =[year[i]]*mul
        doc['Year'] = Year_pop_data
        doc['Year'] = doc['Year'].astype(object)
        doc.rename(columns={'Location':'State'}, inplace = True)
        
        frames.append(doc)
    return pd.concat(frames)
        

Population_data = get_pop_data()
print(Population_data)
print(Population_data.shape)
print(Population_data.dtypes)
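
As a side note, the delete_later loop inside get_pop_data() can be replaced by pandas' isin(), which does the same filtering in one line. A sketch, assuming the same List_of_state_names list of US state names:

# Equivalent state filter using isin() (could replace the delete_later loop inside get_pop_data())
doc = doc[doc['Location'].isin(List_of_state_names)].copy()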

Getting Data using BeautifulSoup

Web scraping using BeautifulSoup is another method of getting data. It’s really fun too. Through this, the user also learns about the HTML structure of a website. This method was used to scrape median salary data from fred.stlouisfed.org. On this website the data is presented so that the user needs to click the tab for each year to get that year’s data. To avoid downloading multiple csv files, BeautifulSoup along with loops can be used. Below is what the transformation from raw to the desired format looks like.

Data Source: fred.stlouisfed.org

The above transformation and cleaning can be performed using the code below.

#Median income by state for each year in the analysis


salary_url_pre = 'https://fred.stlouisfed.org/release/tables?rid=249&eid=259515&od='
salary_url_post = '-01-01#'

def get_salary_data():
    
    #defining vectors to add data to
    state_name_inc = []
    median_income_temp = []
    Year_inc = []

    for i in range(num_years):
        
        url = salary_url_pre+year[i]+salary_url_post
        #print(url)
        res = requests.get(url)
        #print(res)
        
        soup_median_income = BeautifulSoup(res.content, 'html.parser')
        print(soup_median_income.find_all('td', class_ = 'fred-rls-elm-nm-cntnr fred-rls-elm-nm-td'))
        find_state_name = soup_median_income.find_all('td', class_ = 'fred-rls-elm-nm-cntnr fred-rls-elm-nm-td')
        for state_name in find_state_name:
            #print(state_name.get_text().strip())
            state_name_inc.append(state_name.get_text().strip())
            Year_inc.append(year[i])
            
        #print(state_name_inc)
        #print(len(state_name_inc))
        
        find_inc = soup_median_income.find_all('td', class_ = 'fred-rls-elm-vl-td')    
        for inc in find_inc:
            to_text = inc.get_text().strip()
            to_text = re.sub(',','',to_text)
            median_income_temp.append(to_text)
            #print(to_text)
        #print(median_income_temp) 
        #print(len(median_income_temp))
        median_inc = median_income_temp[0::3]  # keep every third scraped value
        #print(len(median_inc))
        
         
    final_inc = [state_name_inc, Year_inc, median_inc]
    return final_inc

med_inc = get_salary_data()
print(med_inc)
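
# Added sanity-check sketch: the three scraped lists (states, years, incomes) should be
# the same length before they are combined into a DataFrame; a mismatch points to a parsing issue
assert len(med_inc[0]) == len(med_inc[1]) == len(med_inc[2])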

median_salary_data = pd.DataFrame()
median_salary_data['State'] = med_inc[0]
median_salary_data['Year'] = med_inc[1]
median_salary_data['Median_income'] = med_inc[2]
median_salary_data.dtypes
median_salary_data['Median_income'] = median_salary_data['Median_income'].astype(float)
median_salary_data.dtypes
median_salary_data.drop(median_salary_data[median_salary_data['State']=='The United States'].index, inplace = True)
median_salary_data.head()

More data was collected for this project; the sources for the raw data are:

  1. Energy consumption raw data: www.eia.gov
  2. Bachelors attainment raw data: fred.stlouisfed.org
  3. Median income raw data: fred.stlouisfed.org
  4. Age distribution population for Adults 19-25, 26-34, 35-54 raw data: www.kff.org
  5. EV, renewable energy, energy text raw data: newsapi.org
  6. Total employment raw data: bea.gov
  7. Average temperature raw data: www.ncei.noaa.gov

The above methods were used to extract data on state-wise energy consumption, GDP per state, bachelor’s degree attainment per state and average temperature. All the pandas DataFrames were then combined using the pd.merge() method.

pd_1 = pd.merge(education_data,Population_data, how = 'outer', on=['State', 'Year'])
pd_1.head()
pd_2 = pd.merge(pd_1,State_energy_con, how = 'outer', on=['State', 'Year'])
pd_2.tail()
pd_3 = pd.merge(pd_2,tot_jobs, how = 'outer', on=['State', 'Year'])
pd_3.columns
pd_4 = pd.merge(pd_3,temperature_data, how = 'outer', on=['State', 'Year'])
pd_4.columns
pd_5 = pd.merge(pd_4, median_salary_data, how = 'outer', on=['State', 'Year'])
pd_5.columns
pd_6 = pd.merge(pd_5, GDP_data, how = 'outer', on = ['State', 'Year'])
pd_6.columns
pd_x = pd.merge(pd_6, EV_1, how='outer',  on=['State', 'Year'])
pd_x.columns

The above code resulted in a full DataFrame in record format, which looks like this:

It can be seen that there are some NaN (Not a Number) values in the data; these need further cleaning. Please refer to the EDA tab for the next steps.
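
A quick way to gauge how much cleaning remains is to count the missing values per column in the merged DataFrame. A minimal sketch using the pd_x frame built above:

# Count NaN values per column in the merged DataFrame
print(pd_x.isna().sum())
print(pd_x.shape)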