# Data Pipeline
Chris Weber
School of Government and Public Policy
University of Arizona
chrisweber@arizona.edu
## Introduction
There are three general data sources for the voter project: 1) the SoS Voter File, 2) public opinion data, and 3) precinct-level data. Let's start with (1). The data are delivered in two forms: the SoS maintains a large file that is updated irregularly, and it also distributes individual county files that can be cleaned and structured to create a date-specific voter file.
All the original files are stored in a Dropbox folder. I then upload them to GCP. Prior to doing this, I clean things up a little, creating common headers.
## The Voter File
There is a file from every county, released at different time points. They're accessed through GCP using the wildcard character.
```
import datetime
import glob
import os
import re

import numpy as np
import pandas as pd

path = r"/Users/Chris/Dropbox/batch_upload/"
os.chdir(path)
csv_files = glob.glob(os.path.join(path, "*.csv"))

def check_header(df):
    # Counties label the 2022 primary inconsistently; add whichever
    # variant is missing so every file shares a common set of headers.
    if '08/02/2022-2022 PRIMARY ELECTION' not in df.columns:
        df['08/02/2022-2022 PRIMARY ELECTION'] = np.nan
    if '08/02/2022-PRIMARY 2022' not in df.columns:
        df['08/02/2022-PRIMARY 2022'] = np.nan
    return df

for f in csv_files:
    path_name = os.path.split(f)[-1]
    # Each file name embeds its delivery date as YYYY-MM-DD.
    match = re.search(r'(\d{4}-\d{2}-\d{2})', path_name)
    build_date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
    df = pd.read_csv(f)
    df = check_header(df)
    df["build_date"] = build_date
    df.to_csv("/Users/Chris/Dropbox/batch_upload/fix_headers/clean_" + path_name)
```
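The upload itself can use the wildcard mentioned above. A minimal sketch of a single load job over all the cleaned files, assuming they've been synced to a GCS bucket (the bucket name here is a placeholder, not the project's real one):
```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,          # header row
    autodetect=True,              # headers are now common across counties
    write_disposition="WRITE_APPEND",
)
# The wildcard picks up every cleaned county file at once.
# gs://az-voter-uploads is a placeholder bucket.
client.load_table_from_uri(
    "gs://az-voter-uploads/fix_headers/clean_*.csv",
    "az-voter-file.registration_pipeline.combined",
    job_config=job_config,
).result()
```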
The **full** file (<az-voter-file.registration_pipeline.combined>) is large and shouldn't be processed regularly (I pay for this). I restrict the data to the most recent delivery, shortly before the election, and perform operations on this smaller table. As you'll see, I alternate between creating views and creating tables. Everything relevant may be found in Data_AVP_001.
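The restriction is just a filter on the build_date column added during cleaning; a sketch (the destination table name is illustrative):
```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")
# Sketch: keep only the most recent delivery. `latest` is an illustrative name.
client.query("""
CREATE OR REPLACE TABLE `az-voter-file.registration_pipeline.latest` AS (
  SELECT *
  FROM `az-voter-file.registration_pipeline.combined`
  WHERE build_date = (SELECT MAX(build_date)
                      FROM `az-voter-file.registration_pipeline.combined`)
)
""").result()
```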
From this smaller table, the first thing to do is geocode.
## A Detour: Geocoding
The Voter File includes addresses, but not latitudes and longitudes. Since the addresses are well-formatted and regularly updated, we maintain an active table called "gold_addresses".
Here's how that's built:
```
CREATE OR REPLACE TABLE registration_pipeline.addresses_10_22 as(
with t1 as(
SELECT
distinct(Residence_Address || ', ' || Residence_City || ', ' || Residence_State || ', ' || Residence_Zip || ', ' || County) as address
FROM `az-voter-file.registration_pipeline.combined`
)
SELECT
address,
split(address, ",")[SAFE_OFFSET(0)] as street,
split(address, ",")[SAFE_OFFSET(1)] as city,
split(address, ",")[SAFE_OFFSET(2)] as state,
split(address, ",")[SAFE_OFFSET(3)] as zip,
split(address, ",")[SAFE_OFFSET(4)] as county,
from t1
)
```
This computes the unique addresses in the massive combined file. We manually export these to a .csv file and geocode them in ArcGIS; after geocoding, I re-upload the full set of coded addresses. Both steps could be automated. Some addresses fail to match.
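The export step, for instance, could be scripted with the BigQuery client; a sketch (the destination bucket is a placeholder):
```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")
# Sketch: dump the unique-address table to GCS for geocoding.
# The trailing wildcard shards large exports; the bucket is a placeholder.
client.extract_table(
    "az-voter-file.registration_pipeline.addresses_10_22",
    "gs://az-voter-uploads/addresses_10_22-*.csv",
).result()
```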
The geocoder is free, and perhaps not as good as others. You can extract the missing items and run them through a different geocoder (check for usage restrictions).
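One free option with no API key is the Census Bureau's geocoder (it has request limits, so batch accordingly). A sketch for the unmatched rows, assuming they've been exported to a local .csv with the address column from above (the file name is a placeholder):
```
import pandas as pd
import requests

# Sketch: re-geocode unmatched addresses with the Census Bureau geocoder.
# The input file name is a placeholder.
unmatched = pd.read_csv("unmatched_10_31_22.csv")

def census_geocode(address):
    resp = requests.get(
        "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress",
        params={"address": address, "benchmark": "Public_AR_Current", "format": "json"},
        timeout=30,
    )
    matches = resp.json()["result"]["addressMatches"]
    if not matches:
        return pd.Series([None, None])
    coords = matches[0]["coordinates"]
    return pd.Series([coords["x"], coords["y"]])  # longitude, latitude

unmatched[["longitude", "latitude"]] = unmatched["address"].apply(census_geocode)
```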
```
-- For unmatched addresses, use a different geocoder.
-- David coded these in ArcGIS Pro.
-- Create a table of unmatched/tied records to supplement.
CREATE OR REPLACE TABLE registration_pipeline.unmatched_10_31_22 as(
SELECT * FROM `az-voter-file.registration_pipeline.gold_addresses_geocoded_10_31_22` where Status = "T" OR Status = "U"
)
```
You could merge the two geocoders' output into a single file, though I'll just go ahead and use the free one.
```
CREATE OR REPLACE TABLE `az-voter-file.registration_pipeline.gold_addresses_1` as(
-- SELECT
-- *
-- FROM `az-voter-file.registration_pipeline.finalgeo`
-- UNION ALL
SELECT
*,
FROM `az-voter-file.registration_pipeline.matched_geo1031`
)
```
This creates the table <az-voter-file.registration_pipeline.gold_addresses_1>: basically a master set of addresses, last updated 10-31-2022. Any subsequent delivery can be compared against it and the set updated.
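The comparison can be an anti-join, so only genuinely new addresses ever go out for geocoding. A sketch, where the newer build's table name is a placeholder:
```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")
# Sketch: `addresses_new_build` is a placeholder for a later delivery's table.
new_addresses = client.query("""
SELECT new.address
FROM `az-voter-file.registration_pipeline.addresses_new_build` AS new
LEFT JOIN `az-voter-file.registration_pipeline.gold_addresses_1` AS gold
  ON new.address = gold.address
WHERE gold.address IS NULL
""").to_dataframe()
```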
The next query cleans up that file and constructs the address view used throughout the analysis.
```
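-- Attach the 2022 legislative district, census block group, and congressional
-- district to each geocoded address via point-in-polygon (ST_CONTAINS) joins.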
CREATE OR REPLACE VIEW `az-voter-file.Data_AVP_001.gold_addresses_2`as(
with geo_address as (
SELECT
*,
ST_GEOGPOINT(longitude, latitude ) AS location_1,
FROM `az-voter-file.Data_AVP_001.gold_addresses_1`),
lds as (
SELECT
*,
from geo_address as updates cross join `az-voter-file.Data_AVP_001.geo_legislative_districts` as geo
WHERE ST_CONTAINS(st_geogfromgeojson(geo.geo),updates.location_1) ),
blocs as(
SELECT
address,
longitude,
latitude,
location,
cd as legislative_district_22,
geo as geo,
bg.geo_id as census_block
from lds CROSS JOIN `bigquery-public-data.geo_census_blockgroups.blockgroups_04` AS bg
WHERE
ST_CONTAINS(bg.blockgroup_geom, lds.location)
)
SELECT
blocs.*,
geo.cd as congressional_district_22
FROM blocs cross join `az-voter-file.public_data.geo_cd` as geo
WHERE ST_CONTAINS(st_geogfromgeojson(geo.geo),blocs.location)
)
```
## Construct an ML Model
I did this in TensorFlow (in Python). Since TensorFlow is a Google product, it could be incorporated into the workflow in a number of other ways. The download step below creates a pickled file.
```
import os

import pandas as pd
import pandas_gbq

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/Users/Chris/Dropbox/Keys/az-voter-file-30395362c45b.json"
### Formulate the SQL query to pull the data from BigQuery
query = """
SELECT
Registrant_ID,
census_block,
party_identification,
GENERAL_2016,
PRIMARY_2016,
GENERAL_2018,
PRIMARY_2018,
GENERAL_2020,
PRIMARY_2020,
PRIMARY_2022,
early_list,
early_voters,
absentee_voters,
provisional_voters,
polling_voters,
CD,
LD,
respondent_age,
median_age,
median_income,
SAFE_DIVIDE(white_pop, total_pop) as white,
SAFE_DIVIDE(black_pop, total_pop) as black,
SAFE_DIVIDE(asian_pop, total_pop) as asian,
SAFE_DIVIDE(hispanic_pop, total_pop) as hispanic,
SAFE_DIVIDE(amerindian_pop, total_pop) as indian,
SAFE_DIVIDE(housing_units, total_pop) as housing,
SAFE_DIVIDE(employed_pop, total_pop) as employed,
SAFE_DIVIDE(armed_forces, total_pop) as armed,
SAFE_DIVIDE(pop_in_labor_force, total_pop) as pop
FROM `az-voter-file.Data_AVP_001.voters_actionable_preML`
"""
df = pandas_gbq.read_gbq(query, project_id="az-voter-file")
df.to_pickle('voter_file_build_001.pkl') ## For later load, not to sync.
```
Next, compile the model:
```
import numpy as np
import pandas as pd
import tensorflow as tf
import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Apple-silicon install notes:
# https://caffeinedev.medium.com/how-to-install-tensorflow-on-m1-mac-8e9b91d93706
df = pd.read_pickle('voter_file_build_001.pkl')

# One-hot party flags.
df['Republican'] = np.where(df['party_identification'] == 'Republican', 1, 0)
df['Democrat'] = np.where(df['party_identification'] == 'Democrat', 1, 0)
df['Independent'] = np.where(df['party_identification'] == 'Independent', 1, 0)

# "Engaged" = voted in the 2018 primary and general and the 2022 primary.
df['engaged'] = np.where((df["PRIMARY_2018"] == 1)
                         & (df["GENERAL_2018"] == 1)
                         & (df["PRIMARY_2022"] == 1), 1, 0)

full_data_variables = ['engaged', 'Registrant_ID', 'census_block', 'Republican', 'Democrat', 'GENERAL_2016',
                       'PRIMARY_2016', 'GENERAL_2018', 'PRIMARY_2018', 'GENERAL_2020',
                       'PRIMARY_2020', 'PRIMARY_2022', 'early_list', 'CD', 'LD',
                       'respondent_age', 'median_age', 'median_income', 'white', 'black',
                       'asian', 'hispanic', 'indian', 'housing', 'employed', 'armed', 'pop']
features = ['Republican', 'Democrat', 'GENERAL_2016',
            'PRIMARY_2016', 'GENERAL_2020',
            'PRIMARY_2020', 'early_list', 'CD', 'LD',
            'respondent_age', 'median_age', 'median_income', 'white', 'black',
            'asian', 'hispanic', 'indian', 'housing', 'employed', 'armed', 'pop']

# Drop incomplete rows and rescale everything to [0, 1].
st_dat = df[full_data_variables].dropna(how='any')
scaler = MinMaxScaler()
st_dat = pd.DataFrame(scaler.fit_transform(st_dat), columns=st_dat.columns)

train, test = train_test_split(st_dat, test_size=0.2)
features_train = train[features]
labels_train = pd.DataFrame({"engaged": train['engaged'], "not_engaged": 1 - train['engaged']})
features_train_array = np.array(features_train, np.float64)
labels_train_array = np.array(labels_train, np.float64)
features_test = test[features]
labels_test = pd.DataFrame({"engaged": test['engaged'], "not_engaged": 1 - test['engaged']})
features_test_array = np.array(features_test, np.float64)
labels_test_array = np.array(labels_test, np.float64)

# Five dense layers, ending in a two-unit softmax over engaged/not engaged.
model = tf.keras.Sequential()
model.add(keras.layers.Dense(15, activation='softmax',
                             input_shape=(features_train.shape[1],)))
model.add(keras.layers.Dense(13, activation='softmax'))
model.add(keras.layers.Dense(7, activation='softmax'))
model.add(keras.layers.Dense(5, activation='softmax'))
model.add(keras.layers.Dense(2, activation='softmax'))
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
```
print("TensorFlow version:", tf.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.list_physical_devices('GPU')
model.fit(features_train_array,
labels_train_array, epochs=5, batch_size=1000,
validation_split=0.20)
features_test = test[features]
labels_test = pd.DataFrame({"engaged":test['engaged'], "not_engaged":1-test['engaged']})
features_test_array = np.array(features_test, np.float64)
labels_test_array = np.array(labels_test, np.float64)
outcome = model.predict(features_test_array) > 0.5
preds1 = model.predict(features_test_array) > 0.5
preds2 = model.predict(features_train_array)
# Evaluate the model
from tensorflow.keras.metrics import Accuracy, Precision, Recall
acc = Accuracy()
prec = Precision()
recall = Recall()
acc.update_state(labels_test_array, preds1)
acc.result().numpy()
# prec.result().numpy()
# recall.result().numpy()
```
The accuracy on the test data is 0.899738.
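If engagement is unbalanced (likely, given it requires primary turnout), accuracy alone can flatter the model. A minimal sketch of precision and recall for the "engaged" class (column 0 of the one-hot labels), reusing the test arrays above:
```
from tensorflow.keras.metrics import Precision, Recall

# Per-class precision/recall for "engaged", with the same 0.5 threshold
# used for the accuracy check above.
probs = model.predict(features_test_array)
engaged_pred = (probs[:, 0] > 0.5).astype("float32")

prec, rec = Precision(), Recall()
prec.update_state(labels_test_array[:, 0], engaged_pred)
rec.update_state(labels_test_array[:, 0], engaged_pred)
print("precision:", prec.result().numpy(), "recall:", rec.result().numpy())
```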
```
### Score the full registrant file with the trained model ###
full_data = df[full_data_variables].dropna(how='any')
registrant_id = full_data['Registrant_ID']  # keep IDs before scaling
full_data = pd.DataFrame(scaler.fit_transform(full_data), columns=full_data.columns)

# Min-max scaling leaves the 0/1 turnout flags unchanged, so "engaged"
# can be recomputed on the scaled frame.
full_data['engaged'] = np.where((full_data["PRIMARY_2018"] == 1)
                                & (full_data["GENERAL_2018"] == 1)
                                & (full_data["PRIMARY_2022"] == 1), 1, 0)
labels_full = pd.DataFrame({"engaged": full_data['engaged'], "not_engaged": 1 - full_data['engaged']})
features_full_array = np.array(full_data[features], np.float64)
labels_full_array = np.array(labels_full, np.float64)

preds = pd.DataFrame(model.predict(features_full_array))
upload_data = pd.DataFrame({"engaged_pr": preds.iloc[:, 0],
                            "not_engaged_pr": preds.iloc[:, 1],
                            # A hard label drawn from the predicted probability.
                            "point": np.random.binomial(1, preds.iloc[:, 0]),
                            "engaged_true": full_data["engaged"],
                            "registrant_id": registrant_id.tolist()})

import pandas_gbq

project_id = "az-voter-file"
pandas_gbq.to_gbq(upload_data, "az-voter-file.registration.nn06", project_id=project_id, if_exists="replace")
```
The data are now uploaded to BigQuery. I then create a view that appends these scores to the voter file; the table we'll regularly use is data_merged.
```
CREATE OR REPLACE VIEW `az-voter-file.Data_AVP_001.data_merged` AS (
SELECT
*
except( int64_field_0, registrant_id),
FROM `az-voter-file.Data_AVP_001.voters_actionable_preML` as m1 LEFT JOIN
`az-voter-file.AZ_Data_001.neural_network_01` as m2 ON
m1.Registrant_ID = m2.registrant_id
)
```
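Since the scores come in via a LEFT JOIN, it's worth checking that the view neither drops nor duplicates registrants. A quick sanity-check sketch:
```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")
# Sketch: row counts should match one-to-one; a larger merged count means
# duplicate registrant_ids on the scores side fanned the join out.
print(client.query("""
SELECT
  (SELECT COUNT(*) FROM `az-voter-file.Data_AVP_001.voters_actionable_preML`) AS pre_ml_rows,
  (SELECT COUNT(*) FROM `az-voter-file.Data_AVP_001.data_merged`) AS merged_rows
""").to_dataframe())
```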
From here, construct census block aggregates.
```
-- This script creates a summary census block file.
-- Construct the census blocks from the neural networks
-- This creates an aggregated score.
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.block_aggregates` as(
WITH
aggregates AS (
SELECT
census_block,
SUM(CASE
WHEN party_identification = "Republican (REP)" THEN 1
ELSE
0
END
) AS Republicans,
SUM(CASE
WHEN party_identification = "Democrat (DEM)" THEN 1
ELSE
0
END
) AS Democrats,
SUM(CASE
WHEN party_identification = "Libertarian (LIB)" THEN 1
ELSE
0
END
) AS Libertarian,
SUM(CASE
WHEN party_identification <> "Republican (REP)" AND party_identification <> "Democrat (DEM)" AND party_identification <> "Libertarian (LIB)" THEN 1
ELSE
0
END
) AS Independents,
AVG(SAFE_DIVIDE(white_pop, total_pop)) AS white,
AVG(SAFE_DIVIDE(hispanic_pop, total_pop)) AS hispanic,
AVG(SAFE_DIVIDE(bachelors_degree, total_pop)) AS college_block,
AVG(SAFE_DIVIDE(employed_pop, total_pop)) AS employed_block,
AVG(SAFE_DIVIDE(asian_pop, total_pop)) AS asian,
AVG(SAFE_DIVIDE(amerindian_pop, total_pop)) AS indian,
AVG(SAFE_DIVIDE(housing_units, total_pop)) AS housing,
AVG(SAFE_DIVIDE(employed_pop, total_pop)) AS employed,
AVG(SAFE_DIVIDE(armed_forces, total_pop)) AS armed,
AVG(SAFE_DIVIDE(pop_in_labor_force, total_pop)) AS in_labor_force,
AVG(income_per_capita) AS income_per_capita,
AVG(owner_occupied_housing_units_median_value) AS owner_occupied_median,
AVG(SAFE_DIVIDE(mobile_homes, total_pop)) AS mobile_homes,
AVG(SAFE_DIVIDE(housing_built_2005_or_later, total_pop)) AS new_homes,
AVG(median_year_structure_built) AS median_home_year,
AVG(SAFE_DIVIDE(rent_over_50_percent, total_pop)) AS rent_50_burden,
AVG(respondent_age) AS registrant_age,
AVG(engaged_pr)*100 AS engagement_score,
SUM(CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
) AS threeXthree,
SUM(CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
) AS twoXtwo,
SUM(PRIMARY_2022) AS primary_22,
AVG(absentee_voters) AS absentee,
AVG(provisional_voters) AS provisional,
AVG(polling_voters) AS polling,
AVG(early_voters) AS early,
SUM(early_list) AS early_list,
COUNT(*) AS total_registered
FROM
`az-voter-file.Data_AVP_001.data_merged`
GROUP BY
census_block ),
g2 as (SELECT
aggregates.*,
SAFE_DIVIDE(aggregates.Republicans, aggregates.total_registered) AS republican_percentage,
SAFE_DIVIDE(aggregates.Democrats, aggregates.total_registered) AS democrat_percentage,
SAFE_DIVIDE(aggregates.Libertarian, aggregates.total_registered) AS libertarian_percentage,
SAFE_DIVIDE(aggregates.Independents, aggregates.total_registered) AS independent_percentage,
SAFE_DIVIDE(aggregates.twoXtwo, aggregates.total_registered) AS twoXtwo_percentage,
SAFE_DIVIDE(aggregates.threeXthree, aggregates.total_registered) AS threeXthree_percentage,
SAFE_DIVIDE(aggregates.primary_22, aggregates.total_registered) AS primary_22_percentage,
SAFE_DIVIDE(aggregates.early_list, aggregates.total_registered) AS early_list_percentage,
centroids.geo_id,
centroids.internal_point_lat AS latitude,
centroids.internal_point_lon AS longitude,
FROM
`bigquery-public-data.geo_census_blockgroups.blockgroups_04` AS centroids
JOIN
aggregates
ON
centroids.geo_id = aggregates.census_block
)
SELECT
g2.*,
ld_shapes.legislative_district as LD_CLEAN,
from g2 as g2 CROSS JOIN
`az-voter-file.public_data.ld_shape` as ld_shapes
WHERE
ST_CONTAINS(ld_shapes.geometry,ST_GEOGPOINT(g2.longitude, g2.latitude) )
)
```
Here are the aggregates for LDs and CDs:
```
-- This script creates summary files for legislative and congressional districts.
-- Same engagement and party aggregates as the census block file,
-- grouped by LD, then CD.
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.ld_aggregates` as(
WITH
aggregates AS (
SELECT
LD,
SUM(CASE
WHEN party_identification = "Republican (REP)" THEN 1
ELSE
0
END
) AS Republicans,
SUM(CASE
WHEN party_identification = "Democrat (DEM)" THEN 1
ELSE
0
END
) AS Democrats,
SUM(CASE
WHEN party_identification <> "Republican (REP)" AND party_identification <> "Democrat (DEM)" THEN 1
ELSE
0
END
) AS Independents,
AVG(respondent_age) AS registrant_age,
AVG(engaged_pr)*100 AS engagement_score,
SUM(CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
) AS threeXthree,
SUM(CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
) AS twoXtwo,
SUM(PRIMARY_2022) AS primary_22,
AVG(absentee_voters) AS absentee,
AVG(provisional_voters) AS provisional,
AVG(polling_voters) AS polling,
AVG(early_voters) AS early,
SUM(early_list) AS early_list,
COUNT(*) AS total_registered
FROM
`az-voter-file.Data_AVP_001.data_merged`
GROUP BY
LD )
SELECT
aggregates.*,
SAFE_DIVIDE(aggregates.Republicans, aggregates.total_registered) AS republican_percentage,
SAFE_DIVIDE(aggregates.Democrats, aggregates.total_registered) AS democrat_percentage,
SAFE_DIVIDE(aggregates.Independents, aggregates.total_registered) AS independent_percentage,
SAFE_DIVIDE(aggregates.twoXtwo, aggregates.total_registered) AS twoXtwo_percentage,
SAFE_DIVIDE(aggregates.threeXthree, aggregates.total_registered) AS threeXthree_percentage,
SAFE_DIVIDE(aggregates.primary_22, aggregates.total_registered) AS primary_22_percentage,
SAFE_DIVIDE(aggregates.early_list, aggregates.total_registered) AS early_list_percentage,
FROM
aggregates
WHERE LD IS NOT NULL
);
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.cd_aggregates` as(
WITH
aggregates AS (
SELECT
CD,
SUM(CASE
WHEN party_identification = "Republican (REP)" THEN 1
ELSE
0
END
) AS Republicans,
SUM(CASE
WHEN party_identification = "Democrat (DEM)" THEN 1
ELSE
0
END
) AS Democrats,
SUM(CASE
WHEN party_identification <> "Republican (REP)" AND party_identification <> "Democrat (DEM)" THEN 1
ELSE
0
END
) AS Independents,
AVG(respondent_age) AS registrant_age,
AVG(engaged_pr)*100 AS engagement_score,
SUM(CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
) AS threeXthree,
SUM(CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
) AS twoXtwo,
SUM(PRIMARY_2022) AS primary_22,
AVG(absentee_voters) AS absentee,
AVG(provisional_voters) AS provisional,
AVG(polling_voters) AS polling,
AVG(early_voters) AS early,
SUM(early_list) AS early_list,
COUNT(*) AS total_registered
FROM
`az-voter-file.Data_AVP_001.data_merged`
GROUP BY
CD )
SELECT
aggregates.*,
SAFE_DIVIDE(aggregates.Republicans, aggregates.total_registered) AS republican_percentage,
SAFE_DIVIDE(aggregates.Democrats, aggregates.total_registered) AS democrat_percentage,
SAFE_DIVIDE(aggregates.Independents, aggregates.total_registered) AS independent_percentage,
SAFE_DIVIDE(aggregates.twoXtwo, aggregates.total_registered) AS twoXtwo_percentage,
SAFE_DIVIDE(aggregates.threeXthree, aggregates.total_registered) AS threeXthree_percentage,
SAFE_DIVIDE(aggregates.primary_22, aggregates.total_registered) AS primary_22_percentage,
SAFE_DIVIDE(aggregates.early_list, aggregates.total_registered) AS early_list_percentage,
FROM
aggregates
WHERE CD IS NOT NULL
)
```
Finally, use a star schema to construct the data for analysis.
```
-- Star table: one row per census block x CD x LD x party cell, with
-- block-, CD-, and LD-level aggregates joined on.
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.star_table`as (
with s1 as(
SELECT
count(*) as voters,
census_block as geoid,
cast(CD as int64) as CD,
cast(LD as int64) as LD,
party_identification as pid,
FROM `az-voter-file.Data_AVP_001.data_merged`
WHERE party_identification is not null AND LD is not null and CD is not null
group by 2, 3, 4, 5
),
s2 as(
SELECT
*,
FROM s1 JOIN `az-voter-file.Data_AVP_001.block_aggregates` as block ON
block.census_block = s1.geoid
),
s3 as (
SELECT
s2.*,
SAFE_DIVIDE(CD_1.Republicans, CD_1.total_registered) as Republicans_CD,
SAFE_DIVIDE(CD_1.Democrats, CD_1.total_registered) as Democrats_CD,
SAFE_DIVIDE(CD_1.Independents, CD_1.total_registered) as Independents_CD,
CD_1.registrant_age as registrant_age_CD,
CD_1.engagement_score as engagement_score_CD,
SAFE_DIVIDE(CD_1.threeXthree, CD_1.total_registered) as threeXthree_CD,
SAFE_DIVIDE(CD_1.twoXtwo, CD_1.total_registered) as twoXtwo_CD,
SAFE_DIVIDE(CD_1.primary_22, CD_1.total_registered) as primary_22_CD,
CD_1.absentee as absentee_CD,
CD_1.provisional as provisional_CD,
CD_1.polling as polling_CD,
CD_1.early as early_CD,
SAFE_DIVIDE(CD_1.early_list, CD_1.total_registered) as early_list_CD,
CD_1.total_registered as total_registered_CD,
FROM s2 JOIN `az-voter-file.Data_AVP_001.cd_aggregates` as CD_1
ON s2.CD = cast(CD_1.CD as int64)
),
final_merge as(
SELECT
s3.*,
SAFE_DIVIDE(LD_1.Republicans, LD_1.total_registered) as Republicans_LD,
SAFE_DIVIDE(LD_1.Democrats, LD_1.total_registered) as Democrats_LD,
SAFE_DIVIDE(LD_1.Independents, LD_1.total_registered) as Independents_LD,
LD_1.registrant_age as registrant_age_LD,
LD_1.engagement_score as engagement_score_LD,
SAFE_DIVIDE(LD_1.threeXthree, LD_1.total_registered) as threeXthree_LD,
SAFE_DIVIDE(LD_1.twoXtwo, LD_1.total_registered) as twoXtwo_LD,
SAFE_DIVIDE(LD_1.primary_22, LD_1.total_registered) as primary_22_LD,
LD_1.absentee as absentee_LD,
LD_1.provisional as provisional_LD,
LD_1.polling as polling_LD,
LD_1.early as early_LD,
SAFE_DIVIDE(LD_1.early_list, LD_1.total_registered) as early_list_LD,
LD_1.total_registered as total_registered_LD,
from s3 JOIN `az-voter-file.Data_AVP_001.ld_aggregates` as LD_1
ON s3.LD = cast(LD_1.LD as int64)
),
cds as (
SELECT
fm.*,
FROM final_merge as fm JOIN `az-voter-file.Data_AVP_001.cd_geometries` as cd_geo ON
cd_geo.cd = fm.cd
)
SELECT
cds.*,
FROM cds JOIN `az-voter-file.Data_AVP_001.ld_geometries` as ld_geo ON
ld_geo.cd = cds.LD -- ld_geometries stores the legislative district in its cd column
)
```
And as a brief aside: the Tableau workbook includes information about candidates. I created several .csv files by scraping the Arizona Legislature's member roster pages.
```
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrape_roster(url):
    """Pull district, name, party, and portrait URL from each roster table row."""
    page = requests.get(url)
    page_soup = BeautifulSoup(page.content, 'html.parser')
    portraits, party, district, names = [], [], [], []
    for row in page_soup.find_all('tr'):
        row_text = str(row)
        portraits.append(re.findall(r"http\S+MemberPhotos\S+\.[a-z]{3}", row_text))
        party.append(re.findall(r"Democrat|Republican", row_text))
        tmp = re.findall(r"District\S\d+", row_text)
        district.append(re.findall(r"[0-9].", str(tmp)))
        tmp = re.findall(r"(?=https)(.*)(?=<strong>)", row_text)
        names.append(re.sub(r"http\S+>", "", str(tmp)))
    members = pd.DataFrame(list(zip(district, names, party, portraits)),
                           columns=["district", "legislator", "party", "portraits"])
    members["district"] = members["district"].str[0]
    members["party"] = members["party"].str[0]
    members["legislator"] = members["legislator"].str.extract(r"(\w+\s\w+)")
    return members


senators = scrape_roster("https://www.azleg.gov/MemberRoster/?body=S")
```
```
# The House roster page has the same structure; reuse scrape_roster().
house = scrape_roster("https://www.azleg.gov/MemberRoster/?body=H")

house.to_csv("house.csv")
senators.to_csv("senators.csv")
```
Complete: Data Analysis in Tableau, D3.js, etc.
## Precinct Analysis
The precinct returns arrive as an XML file, which I parsed and uploaded to BigQuery. The query below reproduces that data: it builds a file called **precinct_aggregated**, which I use in BigQuery.
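The parsing step isn't reproduced here. A minimal ElementTree sketch, under the assumption that the file nests contest → candidate → precinct counts (every tag, attribute, file, and table name below is a guess or placeholder, not the real schema):
```
import xml.etree.ElementTree as ET

import pandas as pd
import pandas_gbq

# Sketch only: tag and attribute names are assumptions about the SoS export.
tree = ET.parse("precinct_results.xml")
rows = []
for contest in tree.getroot().iter("Contest"):
    for candidate in contest.iter("Candidate"):
        for precinct in candidate.iter("Precinct"):
            rows.append({
                "contest": contest.get("name"),
                "ballot_candidate": candidate.get("name"),
                "precinct": precinct.get("name"),
                "votes": int(precinct.get("votes", 0)),
            })

precincts = pd.DataFrame(rows)
# Destination table is a placeholder.
pandas_gbq.to_gbq(precincts, "public_data.precinct_counts",
                  project_id="az-voter-file", if_exists="replace")
```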
```
WITH
t1 AS (
SELECT
latitude,
longitude,
party_identification,
CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
AS threeXthree,
CASE
WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1 AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1 THEN 1
ELSE
0
END
AS twoXtwo,
ST_GEOGPOINT(longitude, latitude) AS location
FROM
`az-voter-file.Data_AVP_001.data_merged`),
# This is the precinct file, built from the state's xml file.
geo AS(
SELECT
geo.precinct ,
geo.geo,
FROM
`az-voter-file.Data_AVP_001.precincts_actionable` AS geo
WHERE
ballot_candidate = "Hobbs_Katie" ),
voter_file AS(
SELECT
SUM(CASE
WHEN party_identification = "Republican (REP)" THEN 1
ELSE
0
END
) AS Republicans,
SUM(CASE
WHEN party_identification = "Democrat (DEM)" THEN 1
ELSE
0
END
) AS Democrats,
SUM(CASE
WHEN party_identification = "Libertarian (LIB)" THEN 1
ELSE
0
END
) AS Libertarian,
SUM(CASE
WHEN party_identification <> "Republican (REP)" AND party_identification <> "Democrat (DEM)" AND party_identification <> "Libertarian (LIB)" THEN 1
ELSE
0
END
) AS Independents,
SUM(twoXtwo) AS two_two,
SUM(threeXthree) AS three_three,
geo2.precinct
FROM
t1 AS t1
CROSS JOIN
geo AS geo2
WHERE
ST_CONTAINS(SAFE.st_geogfromgeojson(geo2.geo),location)
GROUP BY
precinct ),
voter_file_and_ld AS (
SELECT
voter_file.*,
precincts.* EXCEPT (precinct),  -- avoid duplicating the join key
ST_CENTROID(SAFE.st_geogfromgeojson(precincts.geo)) AS point,
ST_X(ST_CENTROID(SAFE.st_geogfromgeojson(precincts.geo))) AS longitude,
ST_Y(ST_CENTROID(SAFE.st_geogfromgeojson(precincts.geo))) AS latitude,
FROM
voter_file AS voter_file
JOIN
`az-voter-file.Data_AVP_001.precincts_actionable` AS precincts
ON
voter_file.precinct = precincts.precinct ),
voter_file_with_cd AS(
SELECT
voter_file_and_ld.*,
geo.cd AS legislative_district
FROM
voter_file_and_ld
CROSS JOIN
`az-voter-file.Data_AVP_001.ld_geometries` AS geo
WHERE
ST_CONTAINS(ST_GEOGFROMGEOJSON(geo.geo),voter_file_and_ld.point) ),
cds AS (
SELECT
voter_file_with_cd.*,
geo.cd AS congressional_district
FROM
voter_file_with_cd AS voter_file_with_cd
CROSS JOIN
`az-voter-file.Data_AVP_001.cd_geometries` AS geo  -- congressional shapes (cd_geometries, not ld_geometries)
WHERE
ST_CONTAINS(ST_GEOGFROMGEOJSON(geo.geo),voter_file_with_cd.point) ),
blocks AS (
SELECT
cds.*,
bg.geo_ID AS geographic_identifier,
bg.internal_point_lat AS block_latitude,
bg.internal_point_lon AS block_longitude
FROM
cds
CROSS JOIN
`bigquery-public-data.geo_census_blockgroups.blockgroups_04` AS bg
WHERE
ST_CONTAINS(bg.blockgroup_geom, point) )
SELECT
*,
FROM
blocks
LEFT JOIN
`bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr` AS a
ON
blocks.geographic_identifier = a.geo_id
-- SELECT
-- *,
-- safe_subtract(cast(Yee_Kimberly as int), cast(Masters_Blake as int)) as Yee_Margin,
-- safe_add(cast(Fontes_Adrian as int), cast(Finchem_Mark as int)) as SoS_Total,
-- safe_subtract(cast(Mayes_Kris as int), cast(Hamadeh_Abraham_Abe as int)) as Mayes_Margin,
-- safe_add(cast(Mayes_Kris as int), cast(Hamadeh_Abraham_Abe as int)) as AG_Total,
-- safe_subtract(cast(Quezada_Martin as int), cast(Yee_Kimberly as int)) as Quezada_margin,
-- safe_add(cast(Quezada_Martin as int), cast(Yee_Kimberly as int)) as Treasurer_Total,
-- safe_subtract(cast(Kelly_Mark as int), cast(Masters_Blake as int)) as Kelly_Margin,
-- safe_add(cast(Kelly_Mark as int), cast(Masters_Blake as int)) as Senate_Total,
-- safe_subtract(cast(Hobbs_Katie as int), cast(Lake_Kari as int)) as Hobbs_Margin,
-- safe_add(cast(Hobbs_Katie as int), cast(Lake_Kari as int)) as Governor_Total,
-- FROM `az-voter-file.public_data.precinct_counts_07`
```
## Survey Analysis