---
tags: KnoledgeFlows
---

# Algorithm to create an adjacency matrix of industry pairs

## Data

* Join 5 years of the ASHE sample, in this case 2009 to 2013. The window can change depending on which matrix we want to build (2 years, 5 years, 10 years, etc.).

## Selection:

* Select individuals with no missing data in the age, sic07, personal id or work postcode variables.
* Select individuals with age > 15 & age < 65.
* Loop over each unique individual in the dataset (unique "piden"):
    * Select all inputs observed for that individual into a table, e.g. for the individual with "piden"==1:

| piden | year | sic07 | work postcode |
|-------|------|-------|---------------|
| 1     | 2009 | A     | A1            |
| 1     | 2010 | A     | A1            |
| 1     | 2010 | B     | B1            |
| 1     | 2011 | C     | C1            |
| 1     | 2013 | D     | D1            |
| 1     | 2013 | D     | D2            |

## Algorithm:

Two possible approaches:

**EITHER**

* **Build "flows"**: Sort the table by year, and define a flow as 2 consecutive entries. Create a table like the following:

| piden | flow | years     | postcodes |
|-------|------|-----------|-----------|
| 1     | A-A  | 2009-2010 | A1-A1     |
| 1     | A-B  | 2010-2010 | A1-B1     |
| 1     | B-C  | 2010-2011 | B1-C1     |
| 1     | C-D  | 2011-2013 | C1-D1     |
| 1     | D-D  | 2013-2013 | D1-D2     |

**OR**

* **Build "connections"**: Define a connection as any pair of entries found for that individual, taking all possible combinations. Create a connection table like the following:

| piden | flow | years     | postcodes |
|-------|------|-----------|-----------|
| 1     | A-A  | 2009-2010 | A1-A1     |
| 1     | A-B  | 2009-2010 | A1-B1     |
| 1     | A-C  | 2009-2011 | A1-C1     |
| 1     | A-D  | 2009-2013 | A1-D1     |
| 1     | A-D  | 2009-2013 | A1-D2     |
| 1     | A-B  | 2010-2010 | A1-B1     |
| 1     | A-C  | 2010-2011 | A1-C1     |
| 1     | A-D  | 2010-2013 | A1-D1     |
| 1     | A-D  | 2010-2013 | A1-D2     |
| 1     | B-C  | 2010-2011 | B1-C1     |
| 1     | B-D  | 2010-2013 | B1-D1     |
| 1     | B-D  | 2010-2013 | B1-D2     |
| 1     | C-D  | 2011-2013 | C1-D1     |
| 1     | C-D  | 2011-2013 | C1-D2     |
| 1     | D-D  | 2013-2013 | D1-D2     |

## Select valid flows/connections:

The selection of valid flows/connections is the same regardless of which approach was used to build them.
* Define a time threshold: the two entries of a valid flow/connection cannot be separated by more than X years (e.g. 2 years in this case).
* A flow/connection cannot happen within the same job in 2 different years: if the industry code and the work postcode are both unchanged, the pair is not considered.
* Only consider unique flows/connections (this matters especially for the connection table).

e.g. resulting flow table:

| piden | flow | years     | postcodes |
|-------|------|-----------|-----------|
| 1     | A-B  | 2010-2010 | A1-B1     |
| 1     | B-C  | 2010-2011 | B1-C1     |
| 1     | C-D  | 2011-2013 | C1-D1     |
| 1     | D-D  | 2013-2013 | D1-D2     |

e.g. resulting connection table:

| piden | flow | years     | postcodes |
|-------|------|-----------|-----------|
| 1     | A-B  | 2009-2010 | A1-B1     |
| 1     | A-C  | 2009-2011 | A1-C1     |
| 1     | B-C  | 2010-2011 | B1-C1     |
| 1     | C-D  | 2011-2013 | C1-D1     |
| 1     | C-D  | 2011-2013 | C1-D2     |
| 1     | D-D  | 2013-2013 | D1-D2     |

## Adjacency matrix

Project the flow/connection pairs onto an adjacency matrix. Each pair counts as 1.

* Flows are considered directed:

|   | A | B | C | D |
|---|---|---|---|---|
| A |   | 1 |   |   |
| B |   |   | 1 |   |
| C |   |   |   | 1 |
| D |   |   |   | 1 |

* Connections are considered undirected, so the matrix is symmetric:

|   | A | B | C | D |
|---|---|---|---|---|
| A |   | 1 | 1 |   |
| B | 1 |   | 1 |   |
| C | 1 | 1 |   | 1 |
| D |   |   | 1 | 1 |

# Discussion on the two approaches

1. The flows are always a sub-sample of the connections.
2. The flow approach does not properly handle the case where someone has more than 1 job: it does not consider all combinations between the jobs.
3. The connection approach addresses the multiple part-time job issue. If the threshold is < 2 years, the result is almost equivalent to the "flows".
4. The connection approach creates an undirected adjacency matrix, avoiding the need to symmetrise the matrix. It also increases the size of our sample.
5. The method is different from what is shown in the literature, and implies a different definition of what is being measured.
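
# Sketch of the algorithm for one individual

The steps described in this note (build flows or connections, select the valid ones, project onto an adjacency matrix) can be sketched in plain Python for the "piden"==1 example. This is only an illustration, not the actual implementation: the function names `build_flows`, `build_connections`, `select_valid` and `adjacency`, and the parameter `max_gap`, are hypothetical. It also assumes that entries from the same year keep their input order, that "unique" means unique on the industry pair plus the postcode pair, and that each distinct industry pair counts once per individual, which is how the example matrices read.

```python
from collections import Counter
from itertools import combinations

# Observations for piden == 1, as (year, sic07, work postcode).
ROWS = [
    (2009, "A", "A1"), (2010, "A", "A1"), (2010, "B", "B1"),
    (2011, "C", "C1"), (2013, "D", "D1"), (2013, "D", "D2"),
]


def build_flows(rows):
    """Flows: pairs of consecutive entries after sorting by year."""
    ordered = sorted(rows, key=lambda r: r[0])  # stable: ties keep input order
    return list(zip(ordered, ordered[1:]))


def build_connections(rows):
    """Connections: every pair of entries, earlier entry first."""
    ordered = sorted(rows, key=lambda r: r[0])
    return list(combinations(ordered, 2))


def select_valid(pairs, max_gap=2):
    """Apply the three selection rules (threshold, same job, uniqueness)."""
    seen, valid = set(), []
    for first, second in pairs:
        (y1, s1, p1), (y2, s2, p2) = first, second
        if y2 - y1 > max_gap:          # separated by more than X years
            continue
        if s1 == s2 and p1 == p2:      # same industry and postcode: same job
            continue
        key = (s1, s2, p1, p2)         # assumed definition of "unique"
        if key in seen:
            continue
        seen.add(key)
        valid.append((first, second))
    return valid


def adjacency(pairs, directed):
    """Count each distinct industry pair once; mirror it if undirected."""
    distinct = set()
    for first, second in pairs:
        pair = (first[1], second[1])
        distinct.add(pair if directed else tuple(sorted(pair)))
    counts = Counter()
    for s1, s2 in distinct:
        counts[(s1, s2)] += 1
        if not directed and s1 != s2:
            counts[(s2, s1)] += 1
    return counts


flow_matrix = adjacency(select_valid(build_flows(ROWS)), directed=True)
connection_matrix = adjacency(select_valid(build_connections(ROWS)), directed=False)
print(sorted(flow_matrix.items()))
print(sorted(connection_matrix.items()))
```

The counters are sparse versions of the two matrices, keyed by (row industry, column industry). Under the same assumptions, adding up the counters of all individuals would give the full adjacency matrix.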