# Data Pipeline

Chris Weber
School of Government and Public Policy
University of Arizona
chrisweber@arizona.edu

## Introduction

There are three general data sources for the voter project: 1) the SoS voter file, 2) public opinion data, and 3) precinct-level data.

Let's start with (1). The data are delivered in two forms: the SoS maintains a large file that is updated irregularly, and it also distributes individual county files that can be cleaned and structured to create a date-specific voter file. All the original files are stored in a Dropbox folder. I then upload them to GCP. Prior to doing this, I clean things up a little, creating common headers.

# The Voter File

There is a file from every county, released at different time points. They're accessed through GCP using the wildcard character.

```
import pandas as pd
import numpy as np
import os
import glob
import re
import datetime

path = r"/Users/Chris/Dropbox/batch_upload/"
os.chdir(path)
csv_files = glob.glob(os.path.join(path, "*.csv"))

def check_header(df):
    # The 2022 primary column is named inconsistently across counties;
    # add whichever variant is missing so every file shares a common header.
    if '08/02/2022-2022 PRIMARY ELECTION' not in df.columns:
        df['08/02/2022-2022 PRIMARY ELECTION'] = np.nan
    if '08/02/2022-PRIMARY 2022' not in df.columns:
        df['08/02/2022-PRIMARY 2022'] = np.nan
    return df

for f in csv_files:
    path_name = os.path.split(f)[-1]
    # The build date is embedded in each file name as YYYY-MM-DD.
    match = re.search(r'(\d{4}-\d{2}-\d{2})', path_name)
    build_date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
    df = pd.read_csv(f)
    df = check_header(df)
    df["build_date"] = build_date
    df.to_csv("/Users/Chris/Dropbox/batch_upload/fix_headers/clean_" + path_name)
```
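Once the headers are consistent, the cleaned files can be read into BigQuery as a single table with a wildcard URI. A minimal sketch of that load, assuming the cleaned files have been copied to a GCS bucket (the bucket name here is an assumption; the destination is the combined table used below):

```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")

# Hypothetical staging bucket; the wildcard matches every cleaned county file.
uri = "gs://az-voter-file-staging/fix_headers/clean_*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,                     # infer the (now common) schema
    write_disposition="WRITE_TRUNCATE",  # rebuild the combined table each run
)

load_job = client.load_table_from_uri(
    uri, "az-voter-file.registration_pipeline.combined", job_config=job_config
)
load_job.result()  # wait for the load to finish
```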
The **full** file (<az-voter-file.registration_pipeline.combined>) is large and shouldn't be regularly processed (I pay for this). I restrict the data to the most recent delivery, shortly before the election, and perform operations on this smaller table. As you'll see, I alternate between views and creating tables. Everything relevant may be found in Data_AVP_001.
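The restriction step itself isn't scripted above; a minimal sketch of the idea, using the `build_date` column added during cleaning (the destination table name is an assumption):

```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")

# Keep only the most recent delivery of the combined file.
query = """
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.combined_recent` AS (
  SELECT *
  FROM `az-voter-file.registration_pipeline.combined`
  WHERE build_date = (
    SELECT MAX(build_date)
    FROM `az-voter-file.registration_pipeline.combined`
  )
)
"""
client.query(query).result()
```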
From this smaller table, the first thing to do is geocode.

# A Detour: Geocoding

The voter file includes addresses, but not latitudes and longitudes. Since the addresses are well formatted and regularly updated, we maintain an active table called "gold_addresses". Here's how that's built:

```
CREATE OR REPLACE TABLE registration_pipeline.addresses_10_22 as(
with t1 as(
  SELECT
    distinct(Residence_Address || ', ' || Residence_City || ', ' ||
             Residence_State || ', ' || Residence_Zip || ', ' || County) as address
  FROM `az-voter-file.registration_pipeline.combined`
)
SELECT
  address,
  split(address, ",")[SAFE_OFFSET(0)] as street,
  split(address, ",")[SAFE_OFFSET(1)] as city,
  split(address, ",")[SAFE_OFFSET(2)] as state,
  split(address, ",")[SAFE_OFFSET(3)] as zip,
  split(address, ",")[SAFE_OFFSET(4)] as county,
from t1
)
```

This calculates the unique addresses in the massive combined file. We manually export these to a .csv file and geocode in ArcGIS; this could be automated. After geocoding, I re-upload the full set of coded addresses -- again, this could be automated. Some data are missing: the geocoder is free, and perhaps not as good as others. You can extract the missing items and use a different geocoder (check for usage restrictions).

```
-- What happens for unmatched addresses: use a different geocoder.
-- David coded these in ArcGIS Pro.
-- Create a table of unmatched/tied records to supplement.
CREATE OR REPLACE TABLE registration_pipeline.unmatched_10_31_22 as(
  SELECT
    *
  FROM `az-voter-file.registration_pipeline.gold_addresses_geocoded_10_31_22`
  where Status = "T" OR Status = "U"
)
```
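One free option for these unmatched records is the Census Bureau's geocoding API. A minimal sketch, assuming the unmatched table has been exported to a local .csv with an `address` column (the endpoint is real, but check its terms and rate behavior before batch use):

```
import requests
import pandas as pd

URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"

def census_geocode(address):
    """Return (longitude, latitude) for one address, or (None, None) if unmatched."""
    params = {"address": address, "benchmark": "Public_AR_Current", "format": "json"}
    resp = requests.get(URL, params=params, timeout=30)
    matches = resp.json()["result"]["addressMatches"]
    if not matches:
        return None, None
    coords = matches[0]["coordinates"]
    return coords["x"], coords["y"]

# Assumes unmatched_10_31_22 was exported locally; file name is hypothetical.
unmatched = pd.read_csv("unmatched_10_31_22.csv")
unmatched[["longitude", "latitude"]] = unmatched["address"].apply(
    lambda a: pd.Series(census_geocode(a))
)
```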
You could merge these into the file, though I'll just go ahead and use the free one.

```
CREATE OR REPLACE TABLE `az-voter-file.registration_pipeline.gold_addresses_1` as(
  -- SELECT
  --   *
  -- FROM `az-voter-file.registration_pipeline.finalgeo`
  -- UNION ALL
  SELECT
    *,
  FROM `az-voter-file.registration_pipeline.matched_geo1031`
)
```

This creates a file called <az-voter-file.registration_pipeline.gold_addresses_1>. It's basically just a master set of addresses, updated 10-31-2022. Any subsequent datasets can be compared against it and only the new addresses geocoded.
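That comparison step isn't scripted here; a minimal sketch of the idea, assuming a newer address table built the same way (the `addresses_new` table name is an assumption):

```
from google.cloud import bigquery

client = bigquery.Client(project="az-voter-file")

# Addresses in the new delivery that aren't already in the gold table;
# only these need to be sent to the geocoder.
query = """
SELECT new.address
FROM `az-voter-file.registration_pipeline.addresses_new` AS new
LEFT JOIN `az-voter-file.registration_pipeline.gold_addresses_1` AS gold
  ON new.address = gold.address
WHERE gold.address IS NULL
"""
to_geocode = client.query(query).to_dataframe()
```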
This cleans up that file and constructs the address file used throughout the analysis.

```
CREATE OR REPLACE VIEW `az-voter-file.Data_AVP_001.gold_addresses_2` as(
with geo_address as (
  SELECT
    *,
    ST_GEOGPOINT(longitude, latitude) AS location_1,
  FROM `az-voter-file.Data_AVP_001.gold_addresses_1`),
lds as (
  SELECT
    *,
  from geo_address as updates
  cross join `az-voter-file.Data_AVP_001.geo_legislative_districts` as geo
  WHERE ST_CONTAINS(st_geogfromgeojson(geo.geo), updates.location_1)
),
blocs as(
  SELECT
    address,
    longitude,
    latitude,
    location_1 as location,
    cd as legislative_district_22,
    geo as geo,
    bg.geo_id as census_block
  from lds
  CROSS JOIN `bigquery-public-data.geo_census_blockgroups.blockgroups_04` AS bg
  WHERE ST_CONTAINS(bg.blockgroup_geom, lds.location_1)
)
SELECT
  blocs.*,
  geo.cd as congressional_district_22
FROM blocs
cross join `az-voter-file.public_data.geo_cd` as geo
WHERE ST_CONTAINS(st_geogfromgeojson(geo.geo), blocs.location)
)
```

## Construct an ML Model

I did this in TensorFlow (in Python). As a Google product, it could be incorporated into the workflow in a number of other ways. The download creates a pickled file.

```
import os
import pandas as pd

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/Users/Chris/Dropbox/Keys/az-voter-file-30395362c45b.json"
import pandas_gbq

### Formulate the SQL query to pull the data from BigQuery
query = """
SELECT
  Registrant_ID, census_block, party_identification,
  GENERAL_2016, PRIMARY_2016, GENERAL_2018, PRIMARY_2018,
  GENERAL_2020, PRIMARY_2020, PRIMARY_2022,
  early_list, early_voters, absentee_voters, provisional_voters, polling_voters,
  CD, LD, respondent_age, median_age, median_income,
  SAFE_DIVIDE(white_pop, total_pop) as white,
  SAFE_DIVIDE(black_pop, total_pop) as black,
  SAFE_DIVIDE(asian_pop, total_pop) as asian,
  SAFE_DIVIDE(hispanic_pop, total_pop) as hispanic,
  SAFE_DIVIDE(amerindian_pop, total_pop) as indian,
  SAFE_DIVIDE(housing_units, total_pop) as housing,
  SAFE_DIVIDE(employed_pop, total_pop) as employed,
  SAFE_DIVIDE(armed_forces, total_pop) as armed,
  SAFE_DIVIDE(pop_in_labor_force, total_pop) as pop
FROM `az-voter-file.Data_AVP_001.voters_actionable_preML`
"""
df = pandas_gbq.read_gbq(query, project_id="az-voter-file")
df.to_pickle('voter_file_build_001.pkl')  ## For later loads; not synced.
```

Compile the model:

```
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import keras
# "https://caffeinedev.medium.com/how-to-install-tensorflow-on-m1-mac-8e9b91d93706"

df = pd.read_pickle('voter_file_build_001.pkl')

# Party dummies and the engagement outcome: voted in the 2018 primary,
# the 2018 general, and the 2022 primary.
df['Republican'] = np.where(df['party_identification'] == 'Republican', 1, 0)
df['Democrat'] = np.where(df['party_identification'] == 'Democrat', 1, 0)
df['Independent'] = np.where(df['party_identification'] == 'Independent', 1, 0)
df['engaged'] = np.where((df["PRIMARY_2018"] == 1) &
                         (df["GENERAL_2018"] == 1) &
                         (df["PRIMARY_2022"] == 1), 1, 0)

full_data_variables = ['engaged', 'Registrant_ID', 'census_block', 'Republican',
                       'Democrat', 'GENERAL_2016', 'PRIMARY_2016', 'GENERAL_2018',
                       'PRIMARY_2018', 'GENERAL_2020', 'PRIMARY_2020', 'PRIMARY_2022',
                       'early_list', 'CD', 'LD', 'respondent_age', 'median_age',
                       'median_income', 'white', 'black', 'asian', 'hispanic',
                       'indian', 'housing', 'employed', 'armed', 'pop']
features = ['Republican', 'Democrat', 'GENERAL_2016', 'PRIMARY_2016',
            'GENERAL_2020', 'PRIMARY_2020', 'early_list', 'CD', 'LD',
            'respondent_age', 'median_age', 'median_income', 'white', 'black',
            'asian', 'hispanic', 'indian', 'housing', 'employed', 'armed', 'pop']

# Scale everything to [0, 1] and split into training and test sets.
st_dat = df[full_data_variables].dropna(how='any')
scaler = MinMaxScaler()
st_dat = pd.DataFrame(scaler.fit_transform(st_dat), columns=st_dat.columns)
train, test = train_test_split(st_dat, test_size=0.2)

features_train_array = np.array(train[features], np.float64)
labels_train_array = np.array(pd.DataFrame({"engaged": train['engaged'],
                                            "not_engaged": 1 - train['engaged']}),
                              np.float64)
features_test_array = np.array(test[features], np.float64)
labels_test_array = np.array(pd.DataFrame({"engaged": test['engaged'],
                                           "not_engaged": 1 - test['engaged']}),
                             np.float64)

model = tf.keras.Sequential()
# Define the layers
model.add(keras.layers.Dense(15, activation='softmax', input_shape=(len(features),)))
model.add(keras.layers.Dense(13, activation='softmax'))
model.add(keras.layers.Dense(7, activation='softmax'))
model.add(keras.layers.Dense(5, activation='softmax'))
model.add(keras.layers.Dense(2, activation='softmax'))
# Finish the model compilation
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```

```
print("TensorFlow version:", tf.__version__)
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

# Train
model.fit(features_train_array, labels_train_array,
          epochs=5, batch_size=1000, validation_split=0.20)

# Evaluate on the held-out test set
from tensorflow.keras.metrics import Accuracy, Precision, Recall

preds = model.predict(features_test_array) > 0.5

acc = Accuracy()
acc.update_state(labels_test_array, preds)
print(acc.result().numpy())

prec = Precision()
prec.update_state(labels_test_array, preds)
print(prec.result().numpy())

recall = Recall()
recall.update_state(labels_test_array, preds)
print(recall.result().numpy())
```

The accuracy on the test set is 0.899738.
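Retraining is the slow step, so the fitted network can be persisted alongside the pickled data and reloaded in a later session. A minimal sketch (the file name is an assumption):

```
# Save the trained network next to the pickled training data.
model.save("engagement_model.keras")  # hypothetical file name

# In a later session, reload and score without retraining.
import keras
model = keras.models.load_model("engagement_model.keras")
```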
Next, standardize the full data set, score it with the trained model, and upload the predictions:

```
### Standardize the data and score the full file ####
full_data = df[full_data_variables].dropna(how='any')
registrant_id = full_data['Registrant_ID']

full_data = pd.DataFrame(scaler.fit_transform(full_data), columns=full_data.columns)
full_data['engaged'] = np.where((full_data["PRIMARY_2018"] == 1) &
                                (full_data["GENERAL_2018"] == 1) &
                                (full_data["PRIMARY_2022"] == 1), 1, 0)

labels_full = pd.DataFrame({"engaged": full_data['engaged'],
                            "not_engaged": 1 - full_data['engaged']})
features_full_array = np.array(full_data[features], np.float64)
labels_full_array = np.array(labels_full, np.float64)

preds = pd.DataFrame(model.predict(features_full_array))

upload_data = pd.DataFrame(
    {"engaged_pr": preds.iloc[:, 0],
     "not_engaged_pr": preds.iloc[:, 1],
     "point": np.random.binomial(1, preds.iloc[:, 0]),  # random draw from the score
     "engaged_true": full_data["engaged"],
     "registrant_id": registrant_id.tolist()}
)

import pandas_gbq
pandas_gbq.to_gbq(upload_data, "az-voter-file.registration.nn06",
                  project_id="az-voter-file", if_exists="replace")
```

The data are now uploaded to BQ. I create a view with the appended scores; the view we'll regularly use is **data_merged**.

```
CREATE OR REPLACE VIEW `az-voter-file.Data_AVP_001.data_merged` AS (
  SELECT
    * except(int64_field_0, registrant_id),
  FROM `az-voter-file.Data_AVP_001.voters_actionable_preML` as m1
  LEFT JOIN `az-voter-file.AZ_Data_001.neural_network_01` as m2
    ON m1.Registrant_ID = m2.registrant_id
)
```

From here, construct census block aggregates.

```
-- This script creates a summary census block file.
-- Construct the census blocks from the neural networks.
-- This creates an aggregated score.
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.block_aggregates` as(
WITH aggregates AS (
  SELECT
    census_block,
    SUM(CASE WHEN party_identification = "Republican (REP)" THEN 1 ELSE 0 END) AS Republicans,
    SUM(CASE WHEN party_identification = "Democrat (DEM)" THEN 1 ELSE 0 END) AS Democrats,
    SUM(CASE WHEN party_identification = "Libertarian (LIB)" THEN 1 ELSE 0 END) AS Libertarian,
    SUM(CASE WHEN party_identification <> "Republican (REP)"
          AND party_identification <> "Democrat (DEM)"
          AND party_identification <> "Libertarian (LIB)" THEN 1 ELSE 0 END) AS Independents,
    AVG(SAFE_DIVIDE(white_pop, total_pop)) AS white,
    AVG(SAFE_DIVIDE(hispanic_pop, total_pop)) AS hispanic,
    AVG(SAFE_DIVIDE(bachelors_degree, total_pop)) AS college_block,
    AVG(SAFE_DIVIDE(employed_pop, total_pop)) AS employed_block,
    AVG(SAFE_DIVIDE(asian_pop, total_pop)) AS asian,
    AVG(SAFE_DIVIDE(amerindian_pop, total_pop)) AS indian,
    AVG(SAFE_DIVIDE(housing_units, total_pop)) AS housing,
    AVG(SAFE_DIVIDE(employed_pop, total_pop)) AS employed,
    AVG(SAFE_DIVIDE(armed_forces, total_pop)) AS armed,
    AVG(SAFE_DIVIDE(pop_in_labor_force, total_pop)) AS in_labor_force,
    AVG(income_per_capita) AS income_per_capita,
    AVG(owner_occupied_housing_units_median_value) AS owner_occupied_median,
    AVG(SAFE_DIVIDE(mobile_homes, total_pop)) AS mobile_homes,
    AVG(SAFE_DIVIDE(housing_built_2005_or_later, total_pop)) AS new_homes,
    AVG(median_year_structure_built) AS median_home_year,
    AVG(SAFE_DIVIDE(rent_over_50_percent, total_pop)) AS rent_50_burden,
    AVG(respondent_age) AS registrant_age,
    AVG(engaged_pr)*100 AS engagement_score,
    SUM(CASE WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END) AS threeXthree,
    SUM(CASE WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END) AS twoXtwo,
    SUM(PRIMARY_2022) AS primary_22,
    AVG(absentee_voters) AS absentee,
    AVG(provisional_voters) AS provisional,
    AVG(polling_voters) AS polling,
    AVG(early_voters) AS early,
    SUM(early_list) AS early_list,
    COUNT(*) AS total_registered
  FROM `az-voter-file.Data_AVP_001.data_merged`
  GROUP BY census_block
),
g2 as (
  SELECT
    aggregates.*,
    SAFE_DIVIDE(aggregates.Republicans, aggregates.total_registered) AS republican_percentage,
    SAFE_DIVIDE(aggregates.Democrats, aggregates.total_registered) AS democrat_percentage,
    SAFE_DIVIDE(aggregates.Libertarian, aggregates.total_registered) AS libertarian_percentage,
    SAFE_DIVIDE(aggregates.Independents, aggregates.total_registered) AS independent_percentage,
    SAFE_DIVIDE(aggregates.twoXtwo, aggregates.total_registered) AS twoXtwo_percentage,
    SAFE_DIVIDE(aggregates.threeXthree, aggregates.total_registered) AS threeXthree_percentage,
    SAFE_DIVIDE(aggregates.primary_22, aggregates.total_registered) AS primary_22_percentage,
    SAFE_DIVIDE(aggregates.early_list, aggregates.total_registered) AS early_list_percentage,
    centroids.geo_id,
    centroids.internal_point_lat AS latitude,
    centroids.internal_point_lon AS longitude,
  FROM `bigquery-public-data.geo_census_blockgroups.blockgroups_04` AS centroids
  JOIN aggregates
    ON centroids.geo_id = aggregates.census_block
)
SELECT
  g2.*,
  ld_shapes.legislative_district as LD_CLEAN,
from g2 as g2
CROSS JOIN `az-voter-file.public_data.ld_shape` as ld_shapes
WHERE ST_CONTAINS(ld_shapes.geometry, ST_GEOGPOINT(g2.longitude, g2.latitude))
)
```
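A quick way to sanity-check the table is to pull the most engaged blocks and eyeball them against the party splits; a minimal sketch:

```
import pandas_gbq

# Ten blocks with the highest predicted engagement.
query = """
SELECT census_block, engagement_score, republican_percentage,
       democrat_percentage, total_registered
FROM `az-voter-file.Data_AVP_001.block_aggregates`
ORDER BY engagement_score DESC
LIMIT 10
"""
print(pandas_gbq.read_gbq(query, project_id="az-voter-file"))
```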
Here are the aggregates for LDs and CDs:

```
-- Construct the same aggregates at the legislative and congressional district level.
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.ld_aggregates` as(
WITH aggregates AS (
  SELECT
    LD,
    SUM(CASE WHEN party_identification = "Republican (REP)" THEN 1 ELSE 0 END) AS Republicans,
    SUM(CASE WHEN party_identification = "Democrat (DEM)" THEN 1 ELSE 0 END) AS Democrats,
    SUM(CASE WHEN party_identification <> "Republican (REP)"
          AND party_identification <> "Democrat (DEM)" THEN 1 ELSE 0 END) AS Independents,
    AVG(respondent_age) AS registrant_age,
    AVG(engaged_pr)*100 AS engagement_score,
    SUM(CASE WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END) AS threeXthree,
    SUM(CASE WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END) AS twoXtwo,
    SUM(PRIMARY_2022) AS primary_22,
    AVG(absentee_voters) AS absentee,
    AVG(provisional_voters) AS provisional,
    AVG(polling_voters) AS polling,
    AVG(early_voters) AS early,
    SUM(early_list) AS early_list,
    COUNT(*) AS total_registered
  FROM `az-voter-file.Data_AVP_001.data_merged`
  GROUP BY LD
)
SELECT
  aggregates.*,
  SAFE_DIVIDE(aggregates.Republicans, aggregates.total_registered) AS republican_percentage,
  SAFE_DIVIDE(aggregates.Democrats, aggregates.total_registered) AS democrat_percentage,
  SAFE_DIVIDE(aggregates.Independents, aggregates.total_registered) AS independent_percentage,
  SAFE_DIVIDE(aggregates.twoXtwo, aggregates.total_registered) AS twoXtwo_percentage,
  SAFE_DIVIDE(aggregates.threeXthree, aggregates.total_registered) AS threeXthree_percentage,
  SAFE_DIVIDE(aggregates.primary_22, aggregates.total_registered) AS primary_22_percentage,
  SAFE_DIVIDE(aggregates.early_list, aggregates.total_registered) AS early_list_percentage,
FROM aggregates
WHERE LD IS NOT NULL
);

CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.cd_aggregates` as(
WITH aggregates AS (
  SELECT
    CD,
    SUM(CASE WHEN party_identification = "Republican (REP)" THEN 1 ELSE 0 END) AS Republicans,
    SUM(CASE WHEN party_identification = "Democrat (DEM)" THEN 1 ELSE 0 END) AS Democrats,
    SUM(CASE WHEN party_identification <> "Republican (REP)"
          AND party_identification <> "Democrat (DEM)" THEN 1 ELSE 0 END) AS Independents,
    AVG(respondent_age) AS registrant_age,
    AVG(engaged_pr)*100 AS engagement_score,
    SUM(CASE WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END) AS threeXthree,
    SUM(CASE WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END) AS twoXtwo,
    SUM(PRIMARY_2022) AS primary_22,
    AVG(absentee_voters) AS absentee,
    AVG(provisional_voters) AS provisional,
    AVG(polling_voters) AS polling,
    AVG(early_voters) AS early,
    SUM(early_list) AS early_list,
    COUNT(*) AS total_registered
  FROM `az-voter-file.Data_AVP_001.data_merged`
  GROUP BY CD
)
SELECT
  aggregates.*,
  SAFE_DIVIDE(aggregates.Republicans, aggregates.total_registered) AS republican_percentage,
  SAFE_DIVIDE(aggregates.Democrats, aggregates.total_registered) AS democrat_percentage,
  SAFE_DIVIDE(aggregates.Independents, aggregates.total_registered) AS independent_percentage,
  SAFE_DIVIDE(aggregates.twoXtwo, aggregates.total_registered) AS twoXtwo_percentage,
  SAFE_DIVIDE(aggregates.threeXthree, aggregates.total_registered) AS threeXthree_percentage,
  SAFE_DIVIDE(aggregates.primary_22, aggregates.total_registered) AS primary_22_percentage,
  SAFE_DIVIDE(aggregates.early_list, aggregates.total_registered) AS early_list_percentage,
FROM aggregates
WHERE CD IS NOT NULL
)
```

Finally, use a star schema to construct the data for analysis.

```
---Star Table
CREATE OR REPLACE TABLE `az-voter-file.Data_AVP_001.star_table` as (
with s1 as(
  SELECT
    count(*) as voters,
    census_block as geoid,
    cast(CD as int64) as CD,
    cast(LD as int64) as LD,
    party_identification as pid,
  FROM `az-voter-file.Data_AVP_001.data_merged`
  WHERE party_identification is not null
    AND LD is not null and CD is not null
  group by 2, 3, 4, 5
),
s2 as(
  SELECT
    *,
  FROM s1
  JOIN `az-voter-file.Data_AVP_001.block_aggregates` as block
    ON block.census_block = s1.geoid
),
s3 as (
  SELECT
    s2.*,
    SAFE_DIVIDE(CD_1.Republicans, CD_1.total_registered) as Republicans_CD,
    SAFE_DIVIDE(CD_1.Democrats, CD_1.total_registered) as Democrats_CD,
    SAFE_DIVIDE(CD_1.Independents, CD_1.total_registered) as Independents_CD,
    CD_1.registrant_age as registrant_age_CD,
    CD_1.engagement_score as engagement_score_CD,
    SAFE_DIVIDE(CD_1.threeXthree, CD_1.total_registered) as threeXthree_CD,
    SAFE_DIVIDE(CD_1.twoXtwo, CD_1.total_registered) as twoXtwo_CD,
    SAFE_DIVIDE(CD_1.primary_22, CD_1.total_registered) as primary_22_CD,
    CD_1.absentee as absentee_CD,
    CD_1.provisional as provisional_CD,
    CD_1.polling as polling_CD,
    CD_1.early as early_CD,
    SAFE_DIVIDE(CD_1.early_list, CD_1.total_registered) as early_list_CD,
    CD_1.total_registered as total_registered_CD,
  FROM s2
  JOIN `az-voter-file.Data_AVP_001.cd_aggregates` as CD_1
    ON s2.CD = cast(CD_1.CD as int64)
),
final_merge as(
  SELECT
    s3.*,
    SAFE_DIVIDE(LD_1.Republicans, LD_1.total_registered) as Republicans_LD,
    SAFE_DIVIDE(LD_1.Democrats, LD_1.total_registered) as Democrats_LD,
    SAFE_DIVIDE(LD_1.Independents, LD_1.total_registered) as Independents_LD,
    LD_1.registrant_age as registrant_age_LD,
    LD_1.engagement_score as engagement_score_LD,
    SAFE_DIVIDE(LD_1.threeXthree, LD_1.total_registered) as threeXthree_LD,
    SAFE_DIVIDE(LD_1.twoXtwo, LD_1.total_registered) as twoXtwo_LD,
    SAFE_DIVIDE(LD_1.primary_22, LD_1.total_registered) as primary_22_LD,
    LD_1.absentee as absentee_LD,
    LD_1.provisional as provisional_LD,
    LD_1.polling as polling_LD,
    LD_1.early as early_LD,
    SAFE_DIVIDE(LD_1.early_list, LD_1.total_registered) as early_list_LD,
    LD_1.total_registered as total_registered_LD,
  from s3
  JOIN `az-voter-file.Data_AVP_001.ld_aggregates` as LD_1
    ON s3.LD = cast(LD_1.LD as int64)
),
cds as (
  SELECT
    fm.*,
  FROM final_merge as fm
  JOIN `az-voter-file.Data_AVP_001.cd_geometries` as cd_geo
    ON cd_geo.cd = fm.cd
)
SELECT
  cds.*,
FROM cds
JOIN `az-voter-file.Data_AVP_001.ld_geometries` as ld_geo
  ON ld_geo.cd = cds.LD
)
```

And as a brief aside -- the Tableau workbook includes information about candidates. I created several .csv files by scraping the Legislature's member rosters (azleg.gov).
```
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = "https://www.azleg.gov/MemberRoster/?body=S"
page = requests.get(URL)
roster = BeautifulSoup(page.content, 'html.parser')
rows = roster.find_all('tr')

portraits = []
party = []
district = []
names = []
for row in rows:
    # Portrait URL from the member-photo link
    portraits.append(re.findall(r"http\S+MemberPhotos\S+\.[a-z]{3}", str(row)))
    # Party label
    party.append(re.findall(r"Democrat|Republican", str(row)))
    # District number
    tmp = re.findall(r"District\S\d+", str(row))
    district.append(re.findall(r"[0-9].", str(tmp)))
    # Name: the text between the member link and the <strong> tag
    tmp = re.findall(r"(?=https)(.*)(?=<strong>)", str(row))
    names.append(re.sub(r"http\S+>", "", str(tmp)))

senators = pd.DataFrame(list(zip(district, names, party, portraits)),
                        columns=["district", "legislator", "party", "portraits"])
senators["district"] = senators["district"].str[0]
senators["party"] = senators["party"].str[0]
senators["legislator"] = senators["legislator"].str.extract(r"(\w+\s\w+)")
```

The same routine, for the House:

```
URL = "https://www.azleg.gov/MemberRoster/?body=H"
page = requests.get(URL)
roster = BeautifulSoup(page.content, 'html.parser')
rows = roster.find_all('tr')

portraits = []
party = []
district = []
names = []
for row in rows:
    portraits.append(re.findall(r"http\S+MemberPhotos\S+\.[a-z]{3}", str(row)))
    party.append(re.findall(r"Democrat|Republican", str(row)))
    tmp = re.findall(r"District\S\d+", str(row))
    district.append(re.findall(r"[0-9].", str(tmp)))
    tmp = re.findall(r"(?=https)(.*)(?=<strong>)", str(row))
    names.append(re.sub(r"http\S+>", "", str(tmp)))

house = pd.DataFrame(list(zip(district, names, party, portraits)),
                     columns=["district", "legislator", "party", "portraits"])
house["district"] = house["district"].str[0]
house["party"] = house["party"].str[0]
house["legislator"] = house["legislator"].str.extract(r"(\w+\s\w+)")

house.to_csv("house.csv")
senators.to_csv("senators.csv")
```

Complete: data analysis in Tableau, D3.js, etc.

# Precinct Analysis

The precinct returns are delivered as an XML file. I parsed it and uploaded the resulting table to BigQuery; a sketch of that step is below. The queries that follow reproduce the data, creating a file called **precinct_aggregated** that I use in BigQuery.
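The original parsing script isn't reproduced here; a minimal sketch of the approach, assuming one `<precinct>` element per precinct with candidate vote counts as children (the element names, file name, and destination table are assumptions):

```
import xml.etree.ElementTree as ET
import pandas as pd
import pandas_gbq

# Hypothetical structure:
# <precinct name="..."><candidate name="..." votes="..."/></precinct>
tree = ET.parse("precinct_results.xml")
rows = []
for precinct in tree.getroot().iter("precinct"):
    for cand in precinct.iter("candidate"):
        rows.append({"precinct": precinct.get("name"),
                     "candidate": cand.get("name"),
                     "votes": int(cand.get("votes"))})

results = pd.DataFrame(rows)
# One row per precinct, one column per candidate.
wide = results.pivot(index="precinct", columns="candidate", values="votes").reset_index()
pandas_gbq.to_gbq(wide, "public_data.precinct_counts_07",
                  project_id="az-voter-file", if_exists="replace")
```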
```
WITH t1 AS (
  SELECT
    latitude,
    longitude,
    party_identification,
    CASE WHEN GENERAL_2020 = 1 AND GENERAL_2018 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2018 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END AS threeXthree,
    CASE WHEN GENERAL_2020 = 1 AND GENERAL_2016 = 1
          AND PRIMARY_2016 = 1 AND PRIMARY_2020 = 1
         THEN 1 ELSE 0 END AS twoXtwo,
    ST_GEOGPOINT(longitude, latitude) AS location
  FROM `az-voter-file.Data_AVP_001.data_merged`),
-- This is the precinct file, built from the state's XML file.
geo AS(
  SELECT
    geo.precinct,
    geo.geo,
  FROM `az-voter-file.Data_AVP_001.precincts_actionable` AS geo
  WHERE ballot_candidate = "Hobbs_Katie"
),
voter_file AS(
  SELECT
    SUM(CASE WHEN party_identification = "Republican (REP)" THEN 1 ELSE 0 END) AS Republicans,
    SUM(CASE WHEN party_identification = "Democrat (DEM)" THEN 1 ELSE 0 END) AS Democrats,
    SUM(CASE WHEN party_identification = "Libertarian (LIB)" THEN 1 ELSE 0 END) AS Libertarian,
    SUM(CASE WHEN party_identification <> "Republican (REP)"
          AND party_identification <> "Democrat (DEM)"
          AND party_identification <> "Libertarian (LIB)" THEN 1 ELSE 0 END) AS Independents,
    SUM(twoXtwo) AS two_two,
    SUM(threeXthree) AS three_three,
    geo2.precinct
  FROM t1 AS t1
  CROSS JOIN geo AS geo2
  WHERE ST_CONTAINS(SAFE.st_geogfromgeojson(geo2.geo), location)
  GROUP BY precinct
),
voter_file_and_ld AS (
  SELECT
    voter_file.*,
    precincts.*,
    ST_CENTROID(SAFE.st_geogfromgeojson(precincts.geo)) AS point,
    ST_X(ST_CENTROID(SAFE.st_geogfromgeojson(precincts.geo))) AS longitude,
    ST_Y(ST_CENTROID(SAFE.st_geogfromgeojson(precincts.geo))) AS latitude,
  FROM voter_file AS voter_file
  JOIN `az-voter-file.Data_AVP_001.precincts_actionable` AS precincts
    ON voter_file.precinct = precincts.precinct
),
voter_file_with_ld AS(
  SELECT
    voter_file_and_ld.*,
    geo.cd AS legislative_district
  FROM voter_file_and_ld
  CROSS JOIN `az-voter-file.Data_AVP_001.ld_geometries` AS geo
  WHERE ST_CONTAINS(ST_GEOGFROMGEOJSON(geo.geo), voter_file_and_ld.point)
),
cds AS (
  SELECT
    voter_file_with_ld.*,
    geo.cd AS congressional_district
  FROM voter_file_with_ld AS voter_file_with_ld
  CROSS JOIN `az-voter-file.Data_AVP_001.cd_geometries` AS geo
  WHERE ST_CONTAINS(ST_GEOGFROMGEOJSON(geo.geo), voter_file_with_ld.point)
),
blocks AS (
  SELECT
    cds.*,
    bg.geo_ID AS geographic_identifier,
    bg.internal_point_lat AS block_latitude,
    bg.internal_point_lon AS block_longitude
  FROM cds
  CROSS JOIN `bigquery-public-data.geo_census_blockgroups.blockgroups_04` AS bg
  WHERE ST_CONTAINS(bg.blockgroup_geom, point)
)
SELECT
  *,
FROM blocks
LEFT JOIN `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr` AS a
  ON blocks.geographic_identifier = a.geo_id

-- SELECT
--   *,
--   safe_subtract(cast(Yee_Kimberly as int), cast(Masters_Blake as int)) as Yee_Margin,
--   safe_add(cast(Fontes_Adrian as int), cast(Finchem_Mark as int)) as SoS_Total,
--   safe_subtract(cast(Mayes_Kris as int), cast(Hamadeh_Abraham_Abe as int)) as Mayes_Margin,
--   safe_add(cast(Mayes_Kris as int), cast(Hamadeh_Abraham_Abe as int)) as AG_Total,
--   safe_subtract(cast(Quezada_Martin as int), cast(Yee_Kimberly as int)) as Quezada_margin,
--   safe_add(cast(Quezada_Martin as int), cast(Yee_Kimberly as int)) as Treasurer_Total,
--   safe_subtract(cast(Kelly_Mark as int), cast(Masters_Blake as int)) as Kelly_Margin,
--   safe_add(cast(Kelly_Mark as int), cast(Masters_Blake as int)) as Senate_Total,
--   safe_subtract(cast(Hobbs_Katie as int), cast(Lake_Kari as int)) as Hobbs_Margin,
--   safe_add(cast(Hobbs_Katie as int), cast(Lake_Kari as int)) as Governor_Total,
-- FROM `az-voter-file.public_data.precinct_counts_07`
```

# Survey Analysis