# Vessel clustering

## Sketch of a possible approach

Compute the time associated with each row of the dataframe with respect to an absolute reference system:

```python
# time_diff_sec: sequence of time differences between consecutive rows, in seconds
time = [0] * len(time_diff_sec)
for i in range(1, len(time_diff_sec)):
    time[i] = time[i - 1] + time_diff_sec[i]
```

## Data visualization using a scatter plot

Let `p(i)` be a point in a 4-dimensional space, `p(i) = (x(i), y(i), t(i), s(i))`, where:

- `x` represents `longitude`;
- `y` represents `latitude`;
- `t` (represented as the heat-scale color of the markers) represents `time`;
- `s` (represented as marker size) represents `speed`.

## Using a density-based clustering-like approach

Define a metric to apply a clustering algorithm, *e.g.*, `d(p(i), p(j)) = sqrt((x_j - x_i)^2 + (y_j - y_i)^2 + (t_j - t_i)^2 + (s_j - s_i)^2)`.

Using a density-based clustering-like approach (https://en.wikipedia.org/wiki/DBSCAN), start a clustering algorithm from each point identified as the starting point of a cycle for a certain vessel, *i.e.*, the first point in terms of time for the considered vessel (identified by an id). Then, add to the cluster the point that is closest to the last point added to the cluster (or to the initial point, if the cluster has size 1). Assuming that data are sampled regularly in time `t` and that `x`, `y`, `s` do not change abruptly between two consecutive points, this represents a possible approach to identify the different cycles.

As a stopping criterion for the algorithm we can consider either:

1) If we are very close (`< d_min`) to the initial point, then terminate. Here we assume that the vessel returned to the place where it started.
2) If the next candidate point to be added to the cluster is too far (`> d_max`) from the last added point, then terminate. Here we assume that the vessel ended the cycle in another harbour.

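
The greedy chaining and the two stopping criteria above can be sketched in plain Python as follows. This is only a minimal illustration, not a tuned implementation: the function names `distance` and `extract_cycle` are hypothetical, the sample points and thresholds are made up, and two assumptions are added that the text does not state. First, closeness to the initial point is measured on `(x, y)` only, since `t` keeps growing. Second, criterion 1 is only checked once the vessel has moved at least `d_min` away from the start, otherwise the first step would already terminate the cycle.

```python
import math


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def extract_cycle(points, start, d_min, d_max):
    """Greedily chain points (x, y, t, s) starting from index `start`.

    Returns the indices of the points in the order they joined the cycle.
    """
    cycle = [start]
    remaining = [i for i in range(len(points)) if i != start]
    left_start = False  # assumption: the vessel must first move away from the start
    while remaining:
        last = points[cycle[-1]]
        nxt = min(remaining, key=lambda i: distance(last, points[i]))
        # Stopping criterion 2: the closest candidate is farther than d_max.
        if distance(last, points[nxt]) > d_max:
            break
        cycle.append(nxt)
        remaining.remove(nxt)
        # Assumption: "close to the initial point" is measured on (x, y) only.
        to_start = distance(points[nxt][:2], points[start][:2])
        if to_start >= d_min:
            left_start = True
        elif left_start:
            # Stopping criterion 1: the vessel is back near where it started.
            break
    return cycle


# Made-up sample: a small loop around the start, plus one far-away point.
pts = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 1.0, 1.0),
    (1.0, 1.0, 2.0, 1.0),
    (0.0, 1.0, 3.0, 1.0),
    (0.0, 0.1, 4.0, 0.0),
    (10.0, 10.0, 5.0, 1.0),
]
print(extract_cycle(pts, start=0, d_min=0.5, d_max=3.0))
```

Because `x`, `y`, `t` and `s` have different units, the coordinates would probably need to be rescaled before a single Euclidean metric is meaningful; the sketch leaves that step out.
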

If we also wish to distinguish the different tags of the cycle, we split the identified cycles with respect to acceleration and deceleration intervals, *e.g.*, the vessel will start decelerating when it is approaching the location where the turbines have to be installed (transit to the site).

## Considerations

In particular, the proposed approach is able to manage the following cases:

- `x`, `y` are similar, but `s` discriminates.
- `x`, `y`, `s` are similar, but `t` discriminates.

# Sketch of the code

```python
import pandas as pd
from pandasql import sqldf


def pysqldf(q):
    """Run an SQL query against the dataframes defined at module level."""
    return sqldf(q, globals())


def pretty_print(df):
    """Print a dataframe as a Markdown table (requires the `tabulate` package)."""
    print(df.to_markdown())


# Read the first 100 rows of the Excel file
SHIPS = pd.read_excel('sea_impact_task.xlsx', index_col=None, header=0, nrows=100)

# Show general information about the dataframe
print(SHIPS.describe())

# Data visualization
q = pysqldf("SELECT latitude, longitude, time_diff_sec, speed FROM SHIPS")
print(q)

# ...
```
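
The split of a cycle into acceleration and deceleration intervals mentioned earlier could be sketched as below. This is a rough illustration under assumptions: `label_phases`, `split_phases` and the threshold `eps` are hypothetical names, the speed and time values are invented, and acceleration is simply estimated as the speed difference between consecutive samples divided by the elapsed time.

```python
from itertools import groupby


def label_phases(speeds, times, eps=0.0):
    """Label each sample as 'accelerating', 'decelerating' or 'steady'.

    `eps` is a hypothetical tolerance: accelerations with absolute value
    not greater than `eps` are treated as steady.
    """
    if not speeds:
        return []
    labels = ["steady"]  # no previous sample to compare the first one with
    for i in range(1, len(speeds)):
        dt = times[i] - times[i - 1]
        acc = (speeds[i] - speeds[i - 1]) / dt if dt > 0 else 0.0
        if acc > eps:
            labels.append("accelerating")
        elif acc < -eps:
            labels.append("decelerating")
        else:
            labels.append("steady")
    return labels


def split_phases(labels):
    """Group consecutive equal labels into (label, first_index, last_index)."""
    segments = []
    for label, group in groupby(enumerate(labels), key=lambda pair: pair[1]):
        indices = [i for i, _ in group]
        segments.append((label, indices[0], indices[-1]))
    return segments


# Invented sample: the vessel speeds up, cruises, then slows down.
speeds = [0.0, 2.0, 4.0, 4.0, 4.0, 2.0, 0.0]
times = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
for segment in split_phases(label_phases(speeds, times, eps=0.05)):
    print(segment)
```

Each resulting segment could then be mapped to a tag of the cycle, for instance a deceleration interval near the installation site to the transit to the site; that mapping is not part of the sketch.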