Fall 2022 Data Pipeline Documentation

# Fall 2022 Data Pipeline Documentation After Rongbo cleaned up the data, here is what I did to construct the datasets. First, let's just create a file that we'll use for machine learning purposes. I will save this table as **registration.clean_data_machine_learning** There's a lot going on here. I start by pulling in the updated geocode data that David sent. I then geocode voters in legislative and congressional district. ``` -- Tutorial 1: Construct Machine Learning File with geo_updates as ( SELECT lat as latitude_22, long as longitude_22, ST_GEOGPOINT(long, lat ) AS location_22, Score, cast(USER_registrant_id as string) as USERregistrant_id FROM `az-voter-file.registration.geocode_0924`), lds as ( SELECT updates.USERregistrant_id, updates.latitude_22, updates.longitude_22, updates.location_22, geo.cd as legislative_district_22, updates.Score as match_score from geo_updates as updates cross join `az-voter-file.public_data.geo_legislative_districts` as geo WHERE ST_CONTAINS(st_geogfromgeojson(geo.geo),updates.location_22) ), updated_voter as ( SELECT lds.*, geo.cd as congressional_district_22, from lds cross join `az-voter-file.public_data.geo_cd` as geo WHERE ST_CONTAINS(st_geogfromgeojson(geo.geo),lds.location_22) ), --The above block codes David's data into the CD/LD configuration registration_voters AS( SELECT gd.*, registrant_id, status, CAST(year_of_birth AS int) AS birth_year, PARSE_DATE("%m/%d/%Y", LEFT(effective_date_of_change, 10)) AS registration_change, county, party, occupation, CAST(REGEXP_EXTRACT(congressional, r"([0-9]+)") AS string) AS OLD_CONG_DIST, CAST(REGEXP_EXTRACT(legislative, r"([0-9]+)") AS string) AS OLD_LEG_DIST, zipcode, CAST(registrant_id AS string) AS ident, CASE WHEN party = "REP" THEN "Republican (REP)" WHEN party = "DEM" THEN "Democrat (DEM)" WHEN party = "LBT" THEN "Libertarian (LIB)" ELSE "Independent" END AS party_identification, CASE WHEN status LIKE "inactive" THEN 0 ELSE 1 END AS voter_status, FROM `az-voter-file.registration.full_registration_voter_May2022`as c join updated_voter as gd ON gd.USERregistrant_id = c.registrant_id WHERE status = "Active"), --- Pull in the voting data. registration_votes AS( SELECT election, CASE WHEN election = "GENERAL 2020" THEN 1 ELSE 0 END AS gen_2020, CASE WHEN election = "PRIMARY 2020" THEN 1 ELSE 0 END AS primary_2020, CASE WHEN election = "2018 PRIMARY ELECTION" THEN 1 ELSE 0 END AS primary_2018, CASE WHEN election = "2016 PRIMARY ELECTION" THEN 1 ELSE 0 END AS primary_2016, CASE WHEN election = "2014 GENERAL ELECTION" THEN 1 ELSE 0 END AS gen_2014, CASE WHEN election = "2018 GENERAL ELECTION" THEN 1 ELSE 0 END AS gen_2018, CASE WHEN election = "2016 GENERAL ELECTION" THEN 1 ELSE 0 END AS gen_2016, CAST(is_general AS int64) AS general_participation, CAST(is_primary AS int64) AS primary_participation, CAST(registrant_id AS string) AS ident, CAST(REGEXP_CONTAINS(UPPER(election), "GENERAL 2020") AS int64) AS vote2020 FROM `az-voter-file.registration.full_registration_votes_2022`), registration_count AS( SELECT ident, SUM(gen_2020) AS general_2020, SUM(primary_2020) AS primary_2020, SUM(gen_2018) AS general_2018, SUM(primary_2018) AS primary_2018, SUM(gen_2016) AS general_2016, SUM(primary_2016) AS primary_2016, SUM(gen_2014) AS general_2014, SUM(general_participation) AS general_participation, SUM(primary_participation) AS primary_participation, SUM(vote2020) AS vote2020 FROM registration_votes GROUP BY ident ), registration_merged AS ( SELECT *, FROM registration_count RIGHT JOIN registration_voters ON registration_count.ident = registration_voters.registrant_id ), blocks AS( SELECT registration_merged.*, bg.geo_ID, bg.internal_point_lat as block_latitude, bg.internal_point_lon as block_longitude FROM registration_merged CROSS JOIN `bigquery-public-data.geo_census_blockgroups.blockgroups_04` AS bg WHERE ST_CONTAINS(bg.blockgroup_geom, location_22)), t3 as (SELECT *, FROM `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr` AS a LEFT JOIN blocks ON blocks.geo_ID = a.geo_id) SELECT * --except(geo_id) from t3 ``` The final few tables geocode voters into census blocks, create summary counts, and append the data with block group information. What comes next is outside of big query -- though it need not be (*this is a project -- how to automate this better*). I use python/tensorflow, to run a machine learning model. Honestly, this could be improved. The accuracy isn't great (0.85), but also, there's just not a lot of individual data to then accurately predict things. It works, however, and it's generally correlated with other participation measures. Here is the jupyter notebook for that: https://github.com/The-Arizona-Voter/site/blob/main/tensorflow/nn.ipynb I do want to note that this is overkill. There's almost no reason to apply machine learning in this case, as far as I can tell. First, we have all the data. Second, we don't have a whole lot of x-variables. Nonetheless, it's a useful exercise. **Future projects will use these scores to predict not-yet-occurred* outcomes, or those things that haven't been incorporated into this build of the file** Now there should be a table in big query entitled **nn05**. We'll need to incorporate that into the tables in BQ and then create summary measures. Let's start with summaries for block groups. ``` -- This script creates a summary census block file. -- Construct the census blocks from the neural networks -- This creates an aggregated score. with m1 as (SELECT cast(geo_id as string) as census_block, cast(congressional_district_22 as int64) as CD, cast(legislative_district_22 as int64) as LD, party_identification, general_2020, primary_2020, general_2018, primary_2018, general_2016, primary_2016, general_2014, bachelors_degree, total_pop as tn, median_age, median_income, white_pop, hispanic_pop, housing_units, armed_forces, employed_pop, pop_in_labor_force, registrant_id as registrant_id FROM `az-voter-file.registration.clean_data_machine_learning_02`), --- Select out and construct percentages and meaningful measures. m2 as (SELECT census_block, SUM(CASE WHEN party_identification = "Republican (REP)" THEN 1 ELSE 0 END) as Republicans, SUM(CASE WHEN party_identification = "Democrat (DEM)" THEN 1 ELSE 0 END) as Democrats, SUM(CASE WHEN party_identification = "Libertarian (LIB)" THEN 1 ELSE 0 END) as Libertarian, SUM(CASE WHEN party_identification <> "Republican (REP)" AND party_identification <> "Democrat (DEM)" AND party_identification <> "Libertarian (LIB)" THEN 1 ELSE 0 END) as Independents, avg(safe_divide(point, tn)) as engaged_count_pr, avg(safe_divide(white_pop, tn)) as white_block, avg(safe_divide(hispanic_pop, tn)) as hispanic_block, avg(safe_divide(bachelors_degree, tn)) as college_block, avg(safe_divide(employed_pop, tn)) as employed_block, avg(engaged_pr)*100 as engagement_score, sum(CASE WHEN general_2020 = 1 AND general_2018 = 1 AND general_2016 = 1 AND primary_2016 = 1 AND primary_2018 = 1 AND primary_2020 = 1 THEN 1 ELSE 0 END) as threeXthree, sum(CASE WHEN general_2020 = 1 AND general_2016 = 1 AND primary_2016 = 1 AND primary_2020 = 1 THEN 1 ELSE 0 END) as twoXtwo, from m1 LEFT JOIN `az-voter-file.registration.nn06` as m2 ON m1.registrant_id = m2.registrant_id WHERE CD IS NOT NULL GROUP BY census_block ) select m2.*, centroids.geo_id, centroids.internal_point_lat as latitude, centroids.internal_point_lon as longitude, from `bigquery-public-data.geo_census_blockgroups.blockgroups_04` as centroids join m2 on centroids.geo_id = m2.census_block ``` Now what we can do is just go ahead and create summary counts for LDs and CDs, according to the 2020 Census. Just change the unit from legislative to congressional districts. ``` -- This file creates the summary measures for districts. with m1 as (SELECT geo_id as census_tract, cast(congressional_district_22 as int64) as congressional_district, cast(legislative_district_22 as int64) as legislative_district, party_identification, general_2020, primary_2020, general_2018, primary_2018, general_2016, primary_2016, general_2014, registrant_id as registrant_id, FROM `az-voter-file.registration.clean_data_machine_learning_02`) --- Select out and construct percentages and meaningful measures. SELECT m1.congressional_district, SUM(CASE WHEN party_identification = "Republican (REP)" THEN 1 ELSE 0 END) as Republicans_cd, SUM(CASE WHEN party_identification = "Democrat (DEM)" THEN 1 ELSE 0 END) as Democrats_cd, SUM(CASE WHEN party_identification = "Libertarian (LIB)" THEN 1 ELSE 0 END) as Libertarian_cd, SUM(CASE WHEN party_identification <> "Republican (REP)" AND party_identification <> "Democrat (DEM)" AND party_identification <> "Libertarian (LIB)" THEN 1 ELSE 0 END) as Independents_cd, avg(engaged_pr)*100 as engagement_score_cd, sum(CASE WHEN general_2020 = 1 AND general_2018 = 1 AND general_2016 = 1 AND primary_2016 = 1 AND primary_2018 = 1 AND primary_2020 = 1 THEN 1 ELSE 0 END) as threeXthree, sum(CASE WHEN general_2020 = 1 AND general_2016 = 1 AND primary_2016 = 1 AND primary_2020 = 1 THEN 1 ELSE 0 END) as twoXtwo, from m1 LEFT JOIN `az-voter-file.registration.nn06` as m2 ON m1.registrant_id = m2.registrant_id WHERE congressional_district IS NOT NULL GROUP BY congressional_district ``` To recap, I now have several useful tables. 1) A table of block characteristics, called **blocks_actionable** 2) a table of LD, and a table of 3) CD characteristics. The data also includes the block centroid latitude and longitude for mapping. Now the idea is to use the raw, "clean_machine_learning" data, aggregated to * tract * LD * CD * Party Identification These form the measures. Create this table, then join in the "actionable" files. ``` -- Star Table, with allowable cd ld comparisons with raw_1 as (SELECT geo_id as census_tract, congressional_district_22, legislative_district_22, party_identification, registrant_id as registrant_id, total_pop as t FROM `az-voter-file.registration.clean_data_machine_learning_02` WHERE congressional_district_22 IS NOT NULL OR legislative_district_22 IS NOT NULL), tmp_dat as (SELECT nn.engaged_pr as engaged_pr_unit, nn.point as engaged_point_unit, raw_1.*, FROM `az-voter-file.registration.nn06` as nn JOIN raw_1 ON nn.registrant_id = raw_1.registrant_id), star_voter as( SELECT count(distinct registrant_id) as voters, census_tract as geoid, cast(congressional_district_22 as int64) as cd, cast(legislative_district_22 as int64) as ld, party_identification as pid, avg(engaged_pr_unit)*100 as engagement_unit, avg(engaged_point_unit)*100 as engagement_point_unit, FROM tmp_dat WHERE party_identification is not null group by 2, 3, 4, 5), star_p2 as ( SELECT *, FROM `az-voter-file.public_data.congressional_districts_actionable` as cd LEFT JOIN star_voter on star_voter.cd = cd.congressional_district), star_p3 as ( SELECT *, FROM `az-voter-file.public_data.legislative_districts_actionable` as ld LEFT JOIN star_p2 on star_p2.ld = ld.legislative_district), -- Need to now verify geocode. 1) Select everything, create star_p4 as ( SELECT *, cast(CD as int64) as cd_int, cast(LD as int64) as ld_int, ST_GEOGPOINT(longitude, latitude) AS location FROM `az-voter-file.public_data.census_blocks_actionable` as tracts left JOIN star_p3 on star_p3.geoid = tracts.census_block ) SELECT *, from star_p4 where ld is not null or cd is not null ``` This is the main file I then use in Tableau. ## Uploading Outside Geojsons to BQ Big Query has great geocoding tools. You can easily(?) upload a geojoson and geocode respondents. Here's how this is done. 1) Download the shape files, or whatever. 2) Convert in mapshaper. Download a json. 3) Clean the json by removing non geometry attributes. For CDs, ``` import pandas as pd from pandas_geojson import read_geojson geo_json = read_geojson("/Users/Chris/desktop/cds.json") dat = pd.DataFrame(geo_json) #dat = dat["features"].to_string dat.head() geometry = [] congressional_district = [] for x in range(9): geometry.append(dat["features"][x]["geometry"]) congressional_district.append(dat["features"][x]["properties"]["DISTRICT"]) dat = pd.DataFrame({"cd": congressional_district, "geo" : geometry}) dat.to_csv("geo_cd.csv") ``` Then for LDs. ``` import pandas as pd from pandas_geojson import read_geojson geo_json = read_geojson("/Users/Chris/website_fall22/site/report/LD3.json") dat = pd.DataFrame(geo_json) #dat = dat["features"].to_string dat.head() geometry = [] legislative_district = [] for x in range(30): geometry.append(dat["features"][x]["geometry"]) legislative_district.append(dat["features"][x]["properties"]["DISTRICT"]) dat = pd.DataFrame({"cd": legislative_district, "geo" : geometry}) dat.to_csv("geo_ld.csv") ``` From here, move to a google cloud storage bucket. I uploaded these myself, but that's unnecessary. Then run this to create to files, ld-shape and cd-shape ``` -- SELECT -- cd as legislative_district, -- st_geogfromgeojson(geo) as geometry, -- FROM `az-voter-file.public_data.ld_geo` SELECT cd as legislative_district, st_geogfromgeojson(geo) as geometry, FROM `az-voter-file.public_data.geo_legislative_districts` # Creating D3 Graphics -- Star Table, with allowable cd ld comparisons with raw_1 as (SELECT geo_id as census_tract, congressional_district_22, legislative_district_22, party_identification, registrant_id as registrant_id, total_pop as t FROM `az-voter-file.registration.clean_data_machine_learning_02` WHERE congressional_district_22 IS NOT NULL OR legislative_district_22 IS NOT NULL), tmp_dat as (SELECT nn.engaged_pr as engaged_pr_unit, nn.point as engaged_point_unit, raw_1.*, FROM `az-voter-file.registration.nn06` as nn JOIN raw_1 ON nn.registrant_id = raw_1.registrant_id), star_voter as( SELECT count(distinct registrant_id) as voters, census_tract as geoid, cast(congressional_district_22 as int64) as cd, cast(legislative_district_22 as int64) as ld, party_identification as pid, avg(engaged_pr_unit)*100 as engagement_unit, avg(engaged_point_unit)*100 as engagement_point_unit, FROM tmp_dat WHERE party_identification is not null group by 2, 3, 4, 5), star_p2 as ( SELECT *, FROM `az-voter-file.public_data.congressional_districts_actionable` as cd LEFT JOIN star_voter on star_voter.cd = cd.congressional_district), star_p3 as ( SELECT *, FROM `az-voter-file.public_data.legislative_districts_actionable` as ld LEFT JOIN star_p2 on star_p2.ld = ld.legislative_district), -- Need to now verify geocode. 1) Select everything, create star_p4 as ( SELECT *, cast(CD as int64) as cd_int, cast(LD as int64) as ld_int, ST_GEOGPOINT(longitude, latitude) AS location FROM `az-voter-file.public_data.blocks_actionable` as tracts left JOIN star_p3 on star_p3.geoid = tracts.census_block ) SELECT *, from star_p4 where ld is not null or cd is not null ```