# Communication Enhancement Project
###### tags: `SacRT` `train arrival prediction`

## Project Proposal
### Objective
Enhance the information currently provided to passengers at stations by leveraging the existing DMS (Digital Messaging Signs) and PA (Public Address) system:
* identify the train type
* communicate notifications and train arrivals
### Motivation
Trains are typically in a station for less than 60 s, and efficient boarding is a prerequisite for on-time service. This can be achieved by notifying passengers in time and positioning them on the right platform within the station for each (type of) incoming train.
### Background Information
#### Light rail vehicles (LRVs)
* CAF 200, Siemens U2A 100, UTDC 300
    * high floor
    * passengers with mobility restrictions board using the mini-high platform (end of station platform)
    * 1~4 cars
* S700
    * low floor
    * passengers with mobility restrictions board on the elevated midsection
#### Stations
* a total of 55 stations
* 28 S700 trains only serve the Gold Line
* focus on the stations that serve multiple SacRT train lines
#### S700 On-Board Technology
* equipped with a wireless Train Numbering Sign (TNS)
* data sent to the backend:
    * Train ID and trip ID
    * Train consist
    * Route/heading
    * Latitude & longitude
    * Direction of travel
    * Speed
    * Time the information was sent
    * (Predefined) geofence ID
:::warning
Information on CAF, U2A, and UTDC vehicles will not be drawn from their onboard systems. It has to be inferred by exclusion from the data that the S700 vehicles provide.
:::
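
A minimal sketch of the "by exclusion" idea, assuming an approaching train is detected by some other means (for example a loop detector) and that the backend exposes the set of train IDs currently reporting through the S700 feed. The `Detection` fields, the function names, and the ID `S700-401` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One train detected approaching a station (field names are hypothetical)."""
    train_id: str
    station: str


def floor_type(detection, s700_ids):
    """Classify by exclusion: in the S700 feed means low floor, otherwise high floor."""
    return "low floor" if detection.train_id in s700_ids else "high floor"


def boarding_message(detection, s700_ids):
    """Build the boarding hint for passengers with mobility restrictions."""
    kind = floor_type(detection, s700_ids)
    if kind == "low floor":
        where = "the elevated midsection"
    else:
        where = "the mini-high platform at the end of the station platform"
    return f"{detection.station}: incoming {kind} train, accessible boarding at {where}"


s700_ids = {"S700-401"}  # illustrative ID, not taken from the data
low = boarding_message(Detection("S700-401", "13ST"), s700_ids)
high = boarding_message(Detection("unknown", "13ST"), s700_ids)
print(low)
print(high)
```

The weak point of this approach is that any S700 whose TNS feed drops out would be misclassified as high floor, so a real implementation would need a freshness check on the feed.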
:::info
Question for Modeling:
* Input: the available information for CAF, U2A, and UTDC vehicles, the detector deployment, and the rail network architecture.
* Metric for train arrival communication: heuristically, we can inform passengers as soon as we identify the incoming train, regardless of the remaining time, which reduces the problem to train identification.
* What models are currently under consideration or being tested?
* How does car detachment/attachment work, and how does it affect the modeling?
* Should we consider the train dispatching problem?
:::
:::info
each station needs to notify the corresponding boarding car number
loop + gps
research on usage of loop and geofence data
prediction models
models can help correct inaccurate GPS readings
:::
## Data Available
### Raw Data
:::spoiler Train Details

:::
:::spoiler Time Table by Train/Route

:::
:::spoiler Train Track History


:::
:::spoiler Analysis - NO CM

:::
:::spoiler Car History

:::
## Project Initialization
* Determine the location to trigger the communication message for each station/stop
* Determine the message content to be broadcast (low/high floor, location to board)
:::info
Factors that impact the calculation:
Passenger transit speed, train arrival/departure time, train speed, processing delays and propagation delays, other buffer time,...
Tricky scenario:
* Multiple trains incoming around the same time
* Different boarding locations for different train stations
:::
:::spoiler {state="open"} To do next
List the different stops (focusing on the downtown region), summarize their statistics on incoming train type and length of stay during specific time periods, and identify the stations that have complex traffic scenarios.
:::

Next train - Real Time Next Train Information by Station - History

### 13 St Station
:::warning
* (see rows 77/78, 62/63, 52/53) At station 13ST, train 41 has readings whose to-node is 13ST; these readings seem to be paired with the readings for (train 41, to-node TOWN).
* How does the Green Line work (T9 to TOWN)?
7th & Richards/Township 9 - Downtown 7th & Capitol - 13th Street Light Rail
13th Street Light Rail - Downtown 9th & K - 7th & Richards/Township 9 or St Rose of Lima Light Rail
* One way platforms at Archives Plaza/Cathedral Square
Not really
* (rows 34-35) Train 28 and train 26, both from Sac Valley to Sunrise, arrived at the same time. Are they the same train?
One train could be put away
* (rows 119-120) Two entries for train 24 with different departures and destinations
This train is likely to flip direction at this station
* For a station with two stops, can I assume that all trains at one specific stop are INBOUND and all trains at the other are OUTBOUND?
Likely
:::







### 16 St Station





### Distance to trigger the message
Input:
close-in time for stop $i$
average passenger transit time
train $j$ approaching speed at stop $i$
need to also consider: traffic light signal intervention
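
As a rough worked example of how these inputs could combine, the sketch below assumes the train approaches at a constant speed and that the message must go out early enough to cover the passenger transit time plus processing and buffer delays. The function name, the parameter names, and all numbers are illustrative assumptions; traffic signal intervention is not modeled.

```python
def trigger_distance_m(approach_speed_mps, passenger_transit_s,
                       processing_delay_s=0.0, buffer_s=0.0):
    """Distance from the stop at which the message should be triggered.

    Assumes a constant approach speed, which ignores braking and signal stops.
    """
    if approach_speed_mps < 0:
        raise ValueError("approach speed must be non-negative")
    lead_time_s = passenger_transit_s + processing_delay_s + buffer_s
    return approach_speed_mps * lead_time_s


# Illustrative values only: 10 m/s approach, 45 s for passengers to reposition,
# 5 s of processing/propagation delay, 10 s of extra buffer.
distance = trigger_distance_m(10.0, 45.0, processing_delay_s=5.0, buffer_s=10.0)
print(distance)  # 600.0
```

Because trains decelerate into a stop, a constant-speed estimate places the trigger point farther out than strictly needed, which errs on the side of notifying early.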
Can I generate distance-time plots like:



[LRT station spacing and average line speeds](https://onlinepubs.trb.org/Onlinepubs/trr/1992/1361/1361-017.pdf):

Train speed visualization from data:

When a train approaches a station, the recorded speed values show obvious latency:

Car history data: speed_061623_car107:

#### calculating train passing time at each station

Identify how many trains passed the stop on a weekday - ~~estimated to be 46~~
(Is there a schedule table to verify it?)
k-means/Gaussian Mixture Model to estimate the time intervals
:::spoiler K-Means
Cluster 1: (0:24:47.666667, 22:13:15.666667)
Cluster 2: (0:28:07, 0:29:35.666667)
Cluster 3: (5:34:32.416667, 5:36:01.083333)
Cluster 4: (5:56:40.733333, 5:57:50)
Cluster 5: (6:18:56.526316, 6:20:15.210526)
Cluster 6: (6:45:39.888889, 6:46:53.055556)
Cluster 7: (7:14:28.347826, 7:15:55.521739)
Cluster 8: (7:42:22.133333, 7:43:54.733333)
Cluster 9: (8:03:42.375000, 8:05:20.062500)
Cluster 10: (8:25:52.500000, 8:27:15.083333)
Cluster 11: (8:47:38.250000, 8:49:05.062500)
Cluster 12: (9:11:57.583333, 9:13:25.083333)
Cluster 13: (9:33:11.277778, 9:34:46.277778)
Cluster 14: (9:56:18, 9:57:47.714286)
Cluster 15: (10:21:45.631579, 10:23:09.473684)
Cluster 16: (10:48:58, 10:50:32.416667)
Cluster 17: (11:15:43.789474, 11:17:00.105263)
Cluster 18: (11:43:50.500000, 11:46:29.611111)
Cluster 19: (12:04:54.923077, 12:06:14.384615)
Cluster 20: (12:26:54.800000, 12:28:18)
Cluster 21: (12:49:00, 12:50:12.500000)
Cluster 22: (13:12:38.500000, 13:15:46.428571)
Cluster 23: (13:34:40.933333, 13:36:37.933333)
Cluster 24: (14:00:24.600000, 14:01:38.600000)
Cluster 25: (14:26:09.666667, 14:27:20.916667)
Cluster 26: (14:49:53.428571, 14:51:30.214286)
Cluster 27: (15:15:59.052632, 15:17:24.052632)
Cluster 28: (15:45:09.631579, 15:46:34.789474)
Cluster 29: (16:16:50.526316, 16:18:25.842105)
Cluster 30: (16:45:38.555556, 16:47:06.277778)
Cluster 31: (17:15:01.500000, 17:16:42.562500)
Cluster 32: (17:45:49.684211, 17:47:24.368421)
Cluster 33: (18:15:34.533333, 18:17:02.800000)
Cluster 34: (18:28:33, 23:23:16)
Cluster 35: (18:45:15.714286, 18:46:42.857143)
Cluster 36: (19:14:32.466667, 19:15:51.800000)
Cluster 37: (19:44:42.642857, 19:46:08.500000)
Cluster 38: (19:55:45, 22:27:33)
Cluster 39: (20:21:34.818182, 20:22:52.545455)
Cluster 40: (20:54:17, 20:56:35.750000)
Cluster 41: (21:23:31.500000, 21:24:45.750000)
Cluster 42: (21:54:44.571429, 21:56:07.857143)
Cluster 43: (22:26:15.500000, 22:27:28.500000)
Cluster 44: (22:53:20.300000, 22:54:39.900000)
Cluster 45: (23:25:24, 23:26:43.500000)
Cluster 46: (23:53:39.285714, 23:54:59)
:::
:::spoiler Gaussian Mixture Model
Cluster 1: (0:24:47.666667, 22:13:15.666667)
Cluster 2: (0:28:07, 0:29:35.666667)
Cluster 3: (5:34:30.316566, 5:35:59.005904)
Cluster 4: (6:00:47.637090, 6:02:02.272696)
Cluster 5: (6:22:03.258448, 6:23:25.814035)
Cluster 6: (6:40:23.650082, 6:41:39.736460)
Cluster 7: (7:10:41.709923, 7:12:10.898967)
Cluster 8: (7:35:57.471738, 7:37:17.685797)
Cluster 9: (8:01:20.500471, 8:05:29.000544)
Cluster 10: (8:30:55.328923, 8:32:16.041262)
Cluster 11: (8:43:32.631108, 8:45:18.128079)
Cluster 12: (9:20:45.066554, 9:22:17.105049)
Cluster 13: (9:38:39.216813, 9:41:02.554426)
Cluster 14: (9:52:11.560574, 9:53:30.863291)
Cluster 15: (10:13:51.841178, 10:15:28.515499)
Cluster 16: (10:39:34.014776, 10:41:05.132946)
Cluster 17: (11:24:08.814492, 11:25:30.944303)
Cluster 18: (11:37:02, 11:59:43)
Cluster 19: (11:59:37.942999, 12:00:43.807866)
Cluster 20: (12:26:09.076564, 12:27:40.578688)
Cluster 21: (13:07:21, 13:22:50.500000)
Cluster 22: (13:07:52.226695, 13:09:09.342447)
Cluster 23: (13:30:23.001545, 13:32:28.190097)
Cluster 24: (14:00:19.821305, 14:01:11.701425)
Cluster 25: (14:25:27.954169, 14:26:48.220051)
Cluster 26: (14:49:11.173228, 14:51:45.213219)
Cluster 27: (15:07:48.527964, 15:09:07.878303)
Cluster 28: (15:40:56.494098, 15:42:23.480535)
Cluster 29: (16:12:38.554135, 16:14:56.567758)
Cluster 30: (16:48:50.962442, 16:50:10.802737)
Cluster 31: (17:19:11.744553, 17:22:53.911102)
Cluster 32: (17:48:29.934363, 17:49:54.779448)
Cluster 33: (18:11:55.383449, 18:13:53.141430)
Cluster 34: (18:28:33, 23:23:16)
Cluster 35: (18:37:04.465361, 18:38:27.365942)
Cluster 36: (19:21:56.648783, 19:23:08.063083)
Cluster 37: (19:50:23.057230, 19:52:10.568498)
Cluster 38: (19:55:45, 22:27:33)
Cluster 39: (20:21:36.767103, 20:22:54.586311)
Cluster 40: (20:54:17.004093, 20:56:35.755114)
Cluster 41: (21:23:31.500023, 21:24:45.750022)
Cluster 42: (21:54:44.571429, 21:56:07.857143)
Cluster 43: (22:26:15.500000, 22:27:28.500000)
Cluster 44: (22:53:20.300000, 22:54:39.900000)
Cluster 45: (23:25:24.000012, 23:26:43.500012)
Cluster 46: (23:53:39.285716, 23:54:59.000002)
:::
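
For reference, the clustering step can be sketched in plain Python. This is a simplified one-dimensional k-means (Lloyd's algorithm) on arrival times only, whereas the listings above cluster (arrival, departure) pairs; the function names and sample times are illustrative, and the sketch does not handle wrap-around at midnight.

```python
def to_seconds(hms):
    """Convert 'H:MM:SS' to seconds after midnight."""
    h, m, s = hms.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)


def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm in one dimension with a deterministic initialization."""
    pts = sorted(values)
    # Spread the initial centers evenly over the sorted data.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


# Illustrative arrivals of the "same" three trips observed on three days.
arrivals = ["5:34:10", "5:35:00", "5:34:50",
            "5:56:30", "5:57:00", "5:56:40",
            "6:18:50", "6:19:10", "6:19:00"]
centers = kmeans_1d([to_seconds(a) for a in arrivals], k=3)
print(centers)
```

Each center is the mean arrival time of one recurring trip, so the number of clusters `k` has to match the number of trips per day for the result to be meaningful.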
Built neural network:
For specific stop
Training data: (timestamp, remaining time for next incoming train)
Input/Output: current time/ETA
* Standard Feedforward
* LSTM
Not working - accuracy is low
On whole transit map level:
* GNN
[Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting](https://www.ijcai.org/proceedings/2018/0505.pdf)
Training data: an $m\times n$ matrix, where $m$ is the number of points on the timestamp grid and $n$ is the number of stations/stops. Each cell can be univariate (ETA) or multivariate (ETA, incoming train type, ...)
Input/Output: current time/ETA
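
A minimal sketch of how such a matrix could be assembled for the univariate (ETA) case, assuming arrival times per stop are already cleaned and expressed in seconds. The function name, the stop labels, and the numbers are illustrative assumptions.

```python
from bisect import bisect_left


def eta_matrix(arrivals_by_stop, grid):
    """Rows follow the timestamp grid, columns follow the sorted stop names.

    Each cell holds the seconds until the next arrival at that stop,
    or None when no later arrival is known.
    """
    stops = sorted(arrivals_by_stop)
    sorted_arrivals = {s: sorted(arrivals_by_stop[s]) for s in stops}
    matrix = []
    for t in grid:
        row = []
        for s in stops:
            arr = sorted_arrivals[s]
            i = bisect_left(arr, t)  # first arrival at or after t
            row.append(arr[i] - t if i < len(arr) else None)
        matrix.append(row)
    return stops, matrix


# Illustrative data: seconds since the start of the observation window.
stops, matrix = eta_matrix({"13ST": [100, 400], "16ST": [250]}, grid=[0, 200, 300])
print(stops)   # ['13ST', '16ST']
print(matrix)  # [[100, 250], [200, 50], [100, None]]
```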
:::warning
current issue: data cleaning. How to correctly compute the next incoming train, identify and remove the wrong or repeated records
:::
Station data current issues:
* Timestamp recording is inconsistent
* Service Time - 24:34:15
* Arrived At - 0:32:45
* Causing issues like miscalculation of time gaps
- *timestamp cleaned*
* Some recordings are essentially the same with only slight Arrival/departure difference - *remove duplicate with threshold*
* Some trains parked for long duration, making the data outliers. - *remove outlier*
* Same trip ID - different train ID & different arrival/departure time - *keep both*
* A series of recordings have exactly the same arrival/departure time (their existence is meaningful, but their durations are wrong), for example the recordings on the 13th - *didn't touch*
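
The cleaning steps above can be sketched as follows. This assumes records are `(train_id, arrival, departure)` tuples with `H:MM:SS` strings, that clock times before a cutoff hour belong to the previous service day, and that duplicates are detected per train ID (so records with the same trip ID but different train IDs are both kept). The cutoff hour and the 60 s duplicate threshold are illustrative assumptions; the 1 hour dwell limit follows the note.

```python
def service_seconds(hms, day_cutoff_hour=3):
    """Seconds since the start of the service day.

    '24:34:15' and '0:34:15' map to the same value, so time gaps
    across midnight are computed consistently.
    """
    h, m, s = (int(x) for x in hms.split(":"))
    if h < day_cutoff_hour:
        h += 24  # after-midnight clock time still belongs to the previous service day
    return h * 3600 + m * 60 + s


def clean(records, dup_threshold_s=60, max_dwell_s=3600):
    """Drop long-parked trains and near-duplicate records of the same train."""
    rows = sorted((service_seconds(a), service_seconds(d), train)
                  for train, a, d in records)
    kept, last_kept_arrival = [], {}
    for arrive, depart, train in rows:
        if depart - arrive > max_dwell_s:
            continue  # outlier: train parked at the station
        prev = last_kept_arrival.get(train)
        if prev is not None and arrive - prev <= dup_threshold_s:
            continue  # near-duplicate of the previously kept record
        last_kept_arrival[train] = arrive
        kept.append((train, arrive, depart))
    return kept


records = [("41", "7:10:10", "7:11:00"), ("41", "7:10:25", "7:11:05"),
           ("24", "8:00:00", "9:30:00"), ("28", "0:32:45", "0:33:30")]
cleaned = clean(records)
gap = service_seconds("24:34:15") - service_seconds("0:32:45")
print(cleaned, gap)
```

With this normalization the two example timestamps from the list above (`24:34:15` and `0:32:45`) are 90 s apart instead of roughly a day apart. The cutoff hour must fall inside the overnight service gap, otherwise a dwell that straddles it would come out negative.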
*June 16th*
The data reports 95 trains passing this station; only running trains that were not coming to park at this station are considered.
Before:

After:

*June 13th*
The data reports 120 trains passing this station; only running trains that were not coming to park at this station are considered.
Before:

After dropping recordings that stayed for more than 1 hour:

Still has overlapping issue - caused by OVERPASS SACBEE?
Gaussian Mixture Model: 96 clusters


:::info
check daily stop schedule
headways - time table by route - count and sum together
533 - blue
507 - gold
519 - green
:::
:::warning
Is it possible to have the full schedule data of one stop:

example: headway --> Info - Sched --> stopid=7016
:::
The actual daily recordings contain fewer times than the schedule.
13th Street schedule on weekday:
The number of blue line trains passing by stop 13th Street Light Rail is:
67
The number of gold line trains passing by stop 13th Street Light Rail is:
67
The number of green line trains passing by stop 13th Street Light Rail is:
30
The total number of trains passing by stop 13th Street Light Rail is:
164
['00:18', ~~'04:18', '04:37', '04:52', '05:07',~~ '05:22', '05:33', '05:37', '05:48', '05:52', ~~'06:03', '06:07',~~ '06:10', '06:18', '06:22', ~~'06:33', '06:37'~~, '06:40', ~~'06:48'~~, '06:52', ~~'07:03', '07:07',~~ '07:10', ~~'07:18'~~, '07:22', '07:33', '07:37', '07:40', '07:48', '07:52', '08:03', '08:07', '08:10', '08:18', '08:22', '08:33', '08:37', '08:40', '08:48', '08:52', '09:03', '09:07', '09:10', '09:18', '09:22', '09:33', '09:37', '09:40', '09:48', '09:52', '10:03', '10:07', '10:10', '10:18', '10:22', '10:33', '10:37', '10:40', '10:48', '10:52', '11:03', '11:07', '11:10', '11:18', '11:22', '11:33', '11:37', '11:40', '11:48', '11:52', '12:03', '12:07', '12:10', '12:18', '12:22', '12:33', '12:37', '12:40', '12:48', '12:52', '13:03', '13:07', '13:10', '13:18', '13:22', '13:33', '13:37', '13:40', '13:48', '13:52', '14:03', '14:07', '14:10', '14:18', '14:22', '14:33', '14:37', '14:40', '14:48', '14:52', '15:03', '15:07', '15:10', '15:18', '15:22', '15:33', '15:37', '15:40', '15:48', '15:52', '16:03', '16:07', '16:10', '16:18', '16:22', '16:33', '16:37', '16:40', '16:48', '16:52', '17:03', '17:07', '17:10', '17:18', '17:22', '17:33', '17:37', '17:40', '17:48', '17:52', '18:03', '18:07', '18:10', '18:18', '18:22', '18:33', '18:40', '18:48', '18:52', '19:03', '19:10', '19:18', '19:22', '19:33', '19:40', '19:48', '19:52', '20:10', '20:18', '20:22', '20:40', '20:48', '20:52', '21:18', '21:22', '21:48', '21:52', '22:18', '22:22', '22:48', '22:52', '23:18', '23:22', '23:48']
A real workday (June 5th) reporting:
['0:28:20', '0:25:28', '23:55:14', '23:26:28', '23:19:30', '22:56:38', '22:51:19', '19:50:19', '21:57:37', '19:21:56', '20:57:15', '20:49:43', '20:25:28', '18:06:32', '19:44:14', '19:39:46', '19:26:28', '19:14:41', '18:54:42', '18:33:41', '18:25:41', '18:20:11', '18:09:54', '17:54:59', '17:51:27', '17:40:57', '17:36:20', '17:22:13', '17:10:57', '17:08:15', '16:56:53', '16:44:13', '16:21:00', '16:10:56', '16:12:28', '16:03:59', '15:55:40', '15:49:33', '15:42:13', '15:39:58', '15:33:42', '15:20:29', '15:11:28', '15:06:14', '14:57:28', '14:40:25', '14:36:25', '14:25:44', '14:18:59', '14:05:12', '13:55:10', '13:50:24', '13:44:14', '13:40:28', '13:34:12', '13:26:14', '13:12:24', '13:05:23', '12:49:57', '12:46:42', '12:46:31', '12:42:29', '12:21:14', '12:09:41', '12:09:44', '12:03:13', '11:50:10', '11:43:59', '11:40:13', '11:34:28', '11:26:29', '11:21:28', '11:14:55', '11:10:09', '10:44:24', '10:34:59', '10:24:28', '10:20:01', '10:12:41', '10:09:24', '10:04:23', '9:48:59', '9:42:39', '9:34:39', '9:27:39', '9:10:27', '9:04:07', '8:55:09', '8:51:00', '8:18:40', '8:09:58', '8:04:56', '7:54:56', '7:51:12', '7:42:12', '7:40:09', '7:35:44', '7:25:00', '7:12:53', '7:10:10', '6:54:59', '6:42:58', '6:39:55', '6:24:55', '6:18:25', '6:09:57', '5:48:41', '5:40:14', '5:35:10', '5:25:40']
:::info
focus on what's matching
Consider using Gaussian Process for time series interpolation - calculate the speed function
:::
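
To "focus on what's matching", one option is to pair each recorded arrival with the nearest scheduled time and keep only pairs within a tolerance. The sketch below uses a few values from the two lists above; the function names and the 3-minute tolerance are assumptions, and the matching is not one-to-one (two recordings may map to the same scheduled time).

```python
from bisect import bisect_left


def to_minutes(clock):
    """Convert 'H:MM' or 'H:MM:SS' to minutes after midnight."""
    parts = [int(p) for p in clock.split(":")]
    seconds = parts[2] if len(parts) > 2 else 0
    return parts[0] * 60 + parts[1] + seconds / 60


def match_to_schedule(observed, scheduled, tolerance_min=3.0):
    """Return (observed, matched scheduled time or None, deviation in minutes or None)."""
    sched = sorted((to_minutes(s), s) for s in scheduled)
    sched_minutes = [m for m, _ in sched]
    results = []
    for obs in observed:
        t = to_minutes(obs)
        i = bisect_left(sched_minutes, t)
        neighbours = sched[max(i - 1, 0): i + 1]  # scheduled times just before/after t
        best = min(neighbours, key=lambda pair: abs(pair[0] - t), default=None)
        if best is not None and abs(best[0] - t) <= tolerance_min:
            results.append((obs, best[1], round(t - best[0], 2)))
        else:
            results.append((obs, None, None))
    return results


scheduled = ["05:22", "05:33", "05:37", "05:48"]
observed = ["5:25:40", "5:35:10", "5:40:14", "5:48:41"]
matches = match_to_schedule(observed, scheduled)
for row in matches:
    print(row)
```

A positive deviation means the train was recorded later than the scheduled time; unmatched recordings are the ones worth inspecting for data errors or unscheduled movements.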
#### speed imputation with Archives Plaza

:::warning
The graph is filtered by direction = 0, but this value may not reliably denote inbound versus outbound.
Trains coming from the 8th & O station are outbound; trains coming from 13th Street are inbound.
:::


With the approach speed curve estimated, we approximate its definite integral from the samples using the trapezoidal rule, which gives the distance covered.
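
A minimal sketch of the trapezoidal rule applied to sampled speeds, assuming timestamps in seconds and speeds in m/s. The function name and the sample values are illustrative.

```python
def trapezoid_distance(times_s, speeds_mps):
    """Approximate the integral of speed over time with the trapezoidal rule."""
    if len(times_s) != len(speeds_mps):
        raise ValueError("times and speeds must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of one trapezoid: mean of the two speeds times the time step.
        total += 0.5 * (speeds_mps[i] + speeds_mps[i - 1]) * dt
    return total


# Illustrative deceleration into a stop, sampled every 10 s.
distance = trapezoid_distance([0, 10, 20, 30], [12.0, 8.0, 4.0, 0.0])
print(distance)  # 180.0
```

The samples do not need to be evenly spaced, which matters here because the speed recordings arrive at irregular intervals.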
#### Verify the dwell time


The prior stop is likely due to traffic signals
#### tuning the hyperparameters/kernels
$RBF$ Average MSE: -42.83191037259202

$\beta*RBF + WhiteKernel$ Average MSE: -41.73656858944329

$\beta*Matern + WhiteKernel$ Average MSE: -41.79066540058831

$\beta*ExpSineSquared + WhiteKernel$ Average MSE: -41.4875622437085

$\beta*DotProduct + WhiteKernel$ Average MSE: -49.40499007219735

Mixture $\beta_1*RBF + \beta_2*DotProduct + WhiteKernel$ Average MSE: -43.52588725060663

Mixture $\beta_1*RBF + \beta_2*RationalQuadratic + WhiteKernel$ Average MSE: -41.93242999139627
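
A kernel comparison of this kind could be set up as below. This is a sketch on synthetic data, assuming scikit-learn is the tool in use: the negative "Average MSE" values above are consistent with its `neg_mean_squared_error` scoring convention (closer to zero is better), but the note does not state this. Treating $\beta$ as a `ConstantKernel`, the synthetic speed curve, and the fold count are all assumptions, and only three of the kernels listed above are shown.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (time, speed) samples of one station approach.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 60.0, 40).reshape(-1, 1)
speed = 12.0 * np.exp(-t.ravel() / 25.0) + rng.normal(0.0, 0.3, size=40)

kernels = {
    "RBF": RBF(length_scale=10.0),
    "beta*RBF + WhiteKernel":
        ConstantKernel(1.0) * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1),
    "beta*Matern + WhiteKernel":
        ConstantKernel(1.0) * Matern(length_scale=10.0, nu=1.5) + WhiteKernel(noise_level=0.1),
}

cv = KFold(n_splits=4, shuffle=True, random_state=0)
scores = {}
for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
    fold_scores = cross_val_score(gpr, t, speed,
                                  scoring="neg_mean_squared_error", cv=cv)
    scores[name] = fold_scores.mean()  # negative MSE, averaged over folds

for name, score in scores.items():
    print(f"{name}: average neg-MSE = {score:.4f}")
```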

#### scaling issue

90s: 0.10884642300474637
120s: 0.18630325401384573