# Reproduction of "EV-Mask-RCNN: Instance Segmentation in Event-based Videos" *Written by Group 2:* Jorn Dijk: 5313090 Emiel Witting: 5309239 Timo van Hoorn: 5075408 ## Introduction and Motivation This blog post details our efforts to reproduce the findings of the paper "EV-Mask-RCNN: Instance Segmentation in Event-based Videos" by Baltărețu [1]. Reproducing machine and deep learning research is crucial to confirm whether new claims hold up and to get a clearer picture of a method’s true strengths and limitations, especially since results can often depend on specific implementation choices or experimental details that aren't always obvious [2]. Going through the process of replication helps highlight these sensitivities, allowing us to better distinguish true improvements from those that might only work under particular conditions [3]. Our work here aims to contribute to this validation process by thoroughly examining the EV-Mask-RCNN paper. The original paper by Baltărețu investigates whether deep networks can effectively be used for instance segmentation on event-based camera data. Their core approach involves converting asynchronous event streams into frame-like representations suitable for processing with established models like Mask R-CNN. <!-- introduces a novel approach for performing instance segmentation on data from event-based cameras (e.g., Dynamic Vision Sensors (DVS)). The paper addresses the gap of using this data to solve the task of instance segmentation. From this gap, they formulate a central research question, namely whether deep networks can be effectively trained to perform instance segmentation on event-based camera data. --> <!-- tackles instance segmentation using data from event-based cameras. Their core approach involves converting asynchronous event streams into frame-like representations suitable for processing with established models like Mask R-CNN. --> <!-- In this post, we'll first outline the core methodology and key findings from Baltărețu (2022). We then discuss our reproduction process, including any challenges encountered. Beyond reproduction, we also evaluate the model on a new dataset featuring overlapping numbers and explore an algorithm variant of their depth channel generation, along with an ablation study on the depth channel's impact. We conclude by summarizing our reproduction experience and insights. --> <!-- The paper "EV-Mask-RCNN: Instance Segmentation in Event-based Videos" by Baltărețu (2022) introduces a novel approach for performing instance segmentation on data from event-based cameras (e.g., Dynamic Vision Sensors (DVS)). The paper addresses the gap of using this data to solve the task of instance segmentation. Previous work was mainly focused on robotics related tasks and less on other computer vision problems, due to the novelty of DVS and its data. From this gap, they formulate a central research question, namely whether deep networks can be effectively trained to perform instance segmentation on event-based camera data. --> To achieve this, event-based data is first transformed into a frame-based RGB-D format to enable processing by standard deep learning models. This is done by aggregating events over a sliding time window. One RGB-D frame is generated per window. * **RGB Channels**: The polarity of events (positive or negative changes in brightness) is encoded into the color channels. 
"On-events" (pixels changing from black to white) are mapped to the Blue channel, while "off-events" (pixels changing from white to black) are mapped to the Red channel. The Green channel is unused. * **Depth (D) Channel**: The temporal information of the events is encoded into the Depth channel. More recent events are assigned higher values (appearing brighter), while older events are darker, providing the frame-based representation with a sense of time within a given window. The paper utilizes the N-MNIST [4] dataset, an event-based version of the classic MNIST [5] handwritten digits, transformed into the frame-based format. This transformed dataset is then fed into the popular Mask R-CNN [6] model for training. A key challenge was that instance segmentation requires precise ground truth masks, which are not included in the N-MNIST dataset. To address this, the paper introduces an automated pipeline for mask generation. This approach leverages the original MNIST dataset by taking advantage of the fact that the N-MNIST data preserves the original digit order. This allows for a direct mapping between an event-based sample and its corresponding static mask. To align the static mask with the event frame, a method is used to automatically shift the mask until it best overlaps with the detected event pixels. From the experiments, the author concludes that the EV-Mask-RCNN framework successfully demonstrates the feasibility of applying instance segmentation to event-based data. The paper also acknowledges the simplicity of the N-MNIST dataset as a limitation and suggests that future work could explore more complex, realistic datasets. ## Reproduction We reproduced the paper's results based on the published codebase and description of the experimental setup. The main goal being to verify that the results mentioned by the author can indeed be obtained with their given implementation. We also looked out for any discrepancies between paper and implementation. It is not a complete replication however, which would be to recreate the code using only the research paper. ### Software reproducibility Before analyzing the results, we first document the process of obtaining them. There were some practical difficulties regarding software environments, and ambiguities as for how to use the codebase. Direct reproduction was not possible without modifications to the provided codebase. Firstly, the environment and project dependencies were not set up correctly to allow installation according to the instructions in the supplied README file. For example, the dependency list included references to local files on the author's laptop. Removing those references lead to other version conflicts that could not be resolved automatically or easily by hand. Once installed, it was not clear how to configure the system to run the different experiments. The instructions did explain roughly how to switch between single and multi-digit mode. However, changing the time window required editing a default value of a function parameter nested deep in the source code. This was not mentioned anywhere. Because source code had to be modified and exact instructions were missing, it is difficult to confirm whether the modifications were correct and did not have unintended consequences. Most notably, we noticed a large discrepancy between the described experimental setup and the actual implementation. 
From the experiments, the author concludes that the EV-Mask-RCNN framework successfully demonstrates the feasibility of applying instance segmentation to event-based data. The paper also acknowledges the simplicity of the N-MNIST dataset as a limitation and suggests that future work could explore more complex, realistic datasets.

## Reproduction

We reproduced the paper's results based on the published codebase and the description of the experimental setup. The main goal is to verify that the results reported by the author can indeed be obtained with the given implementation; we also looked out for discrepancies between paper and implementation. This is not a complete replication, which would mean recreating the code using only the research paper.

### Software reproducibility

Before analyzing the results, we first document the process of obtaining them. There were practical difficulties with the software environment, and ambiguities about how to use the codebase. Direct reproduction was not possible without modifications to the provided code.

Firstly, the environment and project dependencies could not be installed according to the instructions in the supplied README file. For example, the dependency list included references to local files on the author's laptop. Removing those references led to other version conflicts that could not be resolved automatically or easily by hand.

Once installed, it was not clear how to configure the system to run the different experiments. The instructions did explain roughly how to switch between single- and multi-digit mode. However, changing the time window required editing a default value of a function parameter nested deep in the source code, which was not mentioned anywhere. Because source code had to be modified and exact instructions were missing, it is difficult to confirm whether our modifications were correct and free of unintended consequences.

Most notably, we noticed a large discrepancy between the described experimental setup and the actual implementation. The model was not reset between experiments, meaning the reported schedules of 2, 5 and 15 epochs were likely 2, 7 and 22 epochs in practice. Additionally, the 2- and 5-epoch configurations only fine-tuned the model, while the 15-epoch variant trained all layers; this distinction was not mentioned in the paper. Finally, the paper describes pre-training for 5 epochs and fine-tuning for 10 epochs with a lower learning rate, whereas in the code the same learning rate was used throughout all 15 epochs.

These differences between the experimental setup description and the code can be interpreted in two ways: either the results in the paper were obtained differently than described, or the final version of the codebase is not the one used to generate the results.

### Software modifications

To effectively reproduce the experiments in the paper, the code was modified in the following ways:

- The dependency list and environment were remade manually.
- The training schedule code was modified to align with the setup described in the paper.
- The code was refactored to allow configuration of parameters and large-scale experiments.

Beyond the training schedule corrections, our intention was to keep all code functionally equivalent to the original implementation and not modify the core algorithm or dataset. The author mentioned time constraints as a limiting factor, so we additionally configured GPU training support. Again, this was done in a way that should not functionally change the algorithm or its outcomes.

### Reproduction results

We conducted five independent runs for each experiment to measure the standard deviation of each metric. This adds insight into the stability of the model, and also makes it possible to better analyze any differences between our results and the original. The original data split was non-deterministic, so data variance needed to be accounted for in addition to model variance; therefore each of our runs used a different train-test split. The original results and ours are presented below.

**Table 1:** Paper results in terms of accuracy, mean intersection over union and mean average precision, using the single-digit dataset with different window sizes and training schedules. (Values are scaled to the range 0-100.)

| Metrics    | 2 epochs | 5 epochs | 5 epochs + 10 fine-tuning |
| ---------- | -------- | -------- | ------------------------- |
| Acc(10ms)  | 93.33    | 95.04    | 95.64                     |
| Acc(20ms)  | 94.23    | 95.63    | 96.29                     |
| Acc(50ms)  | 94.76    | 95.27    | 96.51                     |
| MIoU(10ms) | 14.19    | 41.05    | 55.47                     |
| MIoU(20ms) | 20.48    | 47.24    | 58.01                     |
| MIoU(50ms) | 27.82    | 41.73    | 60.29                     |
| mAP(10ms)  | 13.4     | 32.8     | 42.3                      |
| mAP(20ms)  | 18.7     | 37.1     | 43.2                      |
| mAP(50ms)  | 23.7     | 35.2     | 44.6                      |

**Table 2:** Our results obtained when reproducing Table 1 based on the experimental setup description in the paper.
| Metrics    | 2 epochs     | 5 epochs      | 5 epochs + 10 fine-tuning |
| ---------- | ------------ | ------------- | ------------------------- |
| Acc(10ms)  | 93.37 ± 0.71 | 94.93 ± 0.29  | 96.06 ± 0.11              |
| Acc(20ms)  | 93.16 ± 1.18 | 95.25 ± 0.18  | 96.18 ± 0.04              |
| Acc(50ms)  | 90.39 ± 3.63 | 93.10 ± 3.97  | 96.14 ± 0.05              |
| MIoU(10ms) | 21.08 ± 3.31 | 38.55 ± 3.30  | 55.21 ± 2.05              |
| MIoU(20ms) | 19.18 ± 4.86 | 42.46 ± 2.27  | 55.93 ± 1.22              |
| MIoU(50ms) | 17.01 ± 7.61 | 35.63 ± 14.81 | 55.23 ± 1.05              |
| mAP(10ms)  | 19.92 ± 2.34 | 31.79 ± 2.09  | 40.23 ± 1.05              |
| mAP(20ms)  | 18.65 ± 3.26 | 33.99 ± 1.40  | 41.53 ± 0.76              |
| mAP(50ms)  | 18.12 ± 6.83 | 29.22 ± 12.22 | 42.25 ± 0.76              |

We model each of our metrics as a Student's t-distribution estimated from the five runs, and compute the two-sided p-value of the paper's reported score under this distribution. Since we make 18 comparisons, the probability of falsely detecting at least one difference between our reproduction and the original results grows. We therefore apply the Holm-Bonferroni correction to the p-values [7], which keeps the probability of any false rejection across the entire experiment at most the significance level.

**Table 3:** The Holm-Bonferroni corrected two-sided p-values comparing the paper's results to the Student's t-distribution obtained from our runs, for each metric and configuration. (Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001)

| Metrics   | 2 epochs | 5 epochs | 5 epochs + 10 fine-tuning |
| --------- | -------- | -------- | ------------------------- |
| Acc(10ms) | 1.000    | 1.000    | 0.006**                   |
| Acc(20ms) | 0.908    | 0.068    | 0.030*                    |
| Acc(50ms) | 0.414    | 1.000    | 0.000***                  |
| IoU(10ms) | 0.068    | 1.000    | 1.000                     |
| IoU(20ms) | 1.000    | 0.068    | 0.130                     |
| IoU(50ms) | 0.248    | 1.000    | 0.002**                   |
| AP(10ms)  | 0.024*   | 1.000    | 0.075                     |
| AP(20ms)  | 1.000    | 0.060    | 0.061                     |
| AP(50ms)  | 1.000    | 1.000    | 0.015*                    |

The majority of the paper's results are not rejected based on our reproduction distribution, supporting the original claims. There is, however, an unexplained significant difference for the combination of 2 epochs and a 10 ms window: the original paper reports an mAP of 13.4, while our mean was 19.9. There are also several significant differences for the configuration with 5 epochs of pre-training and 10 epochs of fine-tuning, especially in combination with the largest time window. This pattern might be explained by these models having the most time to converge and therefore showing the smallest standard deviation. It could also be seen as evidence for the lack of resetting between model runs in the original code, whose effect would accumulate most towards the last schedule. The visualisation below mainly supports the former.

**Figure 1:** Performance metrics for a 50 ms window, comparing the original results against our reproduction with 95% confidence intervals.

![image](https://hackmd.io/_uploads/r1jynMlVll.png)
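To make the testing procedure behind Table 3 concrete, the snippet below sketches how one cell could be computed from five run scores. The run values and `paper_value` are placeholders, and the Holm-Bonferroni step is implemented directly; in the full analysis, the p-values of all comparisons are corrected jointly.

```python
import numpy as np
from scipy import stats

def holm_correction(pvals):
    """Holm-Bonferroni adjusted p-values (step-down procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# One-sample, two-sided t-test: is the paper's reported score plausible
# under the distribution estimated from our five runs? (placeholder numbers)
our_runs = [42.25, 41.80, 43.10, 41.60, 42.50]
paper_value = 44.6
result = stats.ttest_1samp(our_runs, popmean=paper_value)

# One p-value is collected per (metric, schedule) cell and then corrected
# jointly before filling Table 3.
corrected = holm_correction([result.pvalue])
```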
The author reports achieving 95.38% accuracy, 31.70% MIoU and 42.48% mAP when training for 15 epochs and evaluating on a dataset with multiple digits per image. We extended this evaluation to every configuration. Based on those results (Table 4), it seems more likely that the reported numbers were actually measured for the 2-epoch schedule, given their significantly lower performance.

**Table 4:** Our results for each metric and training schedule on the dataset containing multiple digits.

| Metrics      | 2 epochs     | 5 epochs     | 5 epochs + 10 fine-tuning |
| ------------ | ------------ | ------------ | ------------------------- |
| Acc (10 ms)  | 95.29 ± 0.20 | 95.48 ± 0.05 | 95.26 ± 0.12              |
| Acc (20 ms)  | 95.74 ± 0.12 | 95.67 ± 0.19 | 95.69 ± 0.11              |
| Acc (50 ms)  | 95.82 ± 0.37 | 95.97 ± 0.30 | 95.92 ± 0.04              |
| MIoU (10 ms) | 25.55 ± 3.14 | 35.39 ± 4.66 | 53.74 ± 1.78              |
| MIoU (20 ms) | 16.83 ± 1.99 | 35.03 ± 4.19 | 55.78 ± 1.56              |
| MIoU (50 ms) | 17.86 ± 7.47 | 31.53 ± 2.37 | 55.73 ± 1.61              |
| mAP (10 ms)  | 46.19 ± 4.87 | 54.85 ± 5.49 | 71.79 ± 2.63              |
| mAP (20 ms)  | 34.95 ± 2.49 | 53.50 ± 6.17 | 75.55 ± 2.83              |
| mAP (50 ms)  | 40.73 ± 5.28 | 53.90 ± 1.83 | 78.68 ± 2.27              |

## Testing on a New Dataset: Introducing Overlap

The original paper tested the EV-Mask-RCNN model on two datasets:

1. A **single-digit dataset** of 34x34 pixel images, each containing one transformed N-MNIST digit.
2. A **multi-digit dataset** of 64x64 pixel images, in which four N-MNIST digits were placed in fixed, non-overlapping positions (one in each corner of the image). This dataset was designed to test the model's capability to detect and segment multiple distinct instances within a single frame.

We wanted to extend this approach to assess the model's performance under more realistic conditions, so we introduced the possibility of overlapping digits in the multi-digit frames. In a real-world setting, objects can be in front of one another, leading to overlap in the final image. Our extension assesses whether the model still performs well in this more realistic setting.

### Generation of Overlapping Multi-Digit Datasets

To introduce overlap, we modified the placement strategy of the four digits within the 64x64 pixel canvas. Instead of fixed corner positions, each digit was shifted by a random offset from its respective corner's origin (the top-left of each 32x32 quadrant). The offset for each digit was sampled uniformly from the range `[0, max_offset]` for both the x and y directions, independently for each digit in each sample image; a placement sketch follows the list of overlap levels below. Instance masks were generated for the complete shape of each digit, and the order in which the digits were placed on the canvas determined which digit is visible where they overlap. This placement order was kept fixed so as not to confuse the model with varying overlap orders; the models are thus trained on segmenting the visible portions of each digit with minimal confusion from other sources (such as having to infer stacking order).

We generated five test datasets with varying degrees of overlap by controlling the `max_offset` parameter, increasing it in uniform steps (maximum offsets of 8, 14, 20, 26 and 32 pixels). This step size is small enough to track performance differences gradually, but large enough that the datasets differ enough to provide meaningful results:

* **Very low overlap:** Random offsets were sampled from a range of 0-8 pixels in each direction, allowing minor overlap of the digits.
* **Low overlap:** Random offsets were sampled from a range of 0-14 pixels in each direction.
* **Medium overlap:** Random offsets were sampled from a range of 0-20 pixels in each direction.
* **High overlap:** Random offsets were sampled from a range of 0-26 pixels in each direction.
* **Very high overlap:** Random offsets were sampled from a range of 0-32 pixels in each direction, allowing the full range of possible positions for each digit.
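The sketch below illustrates the placement strategy: each digit frame is pasted at its quadrant origin plus a random offset, with later digits drawn over earlier ones. Frame sizes, clipping at the canvas border, and the function name are our own simplifications rather than the exact generation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def place_digits(digit_frames, max_offset, canvas_size=64, digit_size=34):
    """Paste four digit frames onto a canvas, each shifted by a random
    offset from the top-left corner of its 32x32 quadrant (sketch only)."""
    channels = digit_frames[0].shape[-1]
    canvas = np.zeros((canvas_size, canvas_size, channels), dtype=digit_frames[0].dtype)
    masks = []
    quadrant_origins = [(0, 0), (0, 32), (32, 0), (32, 32)]  # fixed placement order

    for frame, (oy, ox) in zip(digit_frames, quadrant_origins):
        dy, dx = rng.integers(0, max_offset + 1, size=2)
        y0, x0 = oy + dy, ox + dx
        y1 = min(y0 + digit_size, canvas_size)
        x1 = min(x0 + digit_size, canvas_size)
        region = frame[: y1 - y0, : x1 - x0]
        occupied = region.any(axis=-1)
        canvas[y0:y1, x0:x1][occupied] = region[occupied]  # later digits occlude earlier ones

        mask = np.zeros((canvas_size, canvas_size), dtype=bool)
        mask[y0:y1, x0:x1] = occupied  # full digit shape; visibility follows placement order
        masks.append(mask)
    return canvas, masks
```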
Possible settings for each of these offset ranges can be seen in Table 5.

**Table 5:** Possible image configurations for each of the chosen offset ranges.

| Offset Example | Description |
| :------------: | :---------- |
| ![image](https://hackmd.io/_uploads/HJeCmM_Xeg.png) | **Baseline / No overlap:** An example image with no offset. |
| ![image](https://hackmd.io/_uploads/SyixVG_mgl.png) | **Very low overlap:** A possible configuration with `max_offset=8`. |
| ![image](https://hackmd.io/_uploads/H1NW4zOQee.png) | **Low overlap:** A possible configuration with `max_offset=14`. |
| ![image](https://hackmd.io/_uploads/SkRFV7O7ee.png) | **Medium overlap:** A possible configuration with `max_offset=20`. |
| ![image](https://hackmd.io/_uploads/SyszNz_7ge.png) | **High overlap:** A possible configuration with `max_offset=26`. |
| ![image](https://hackmd.io/_uploads/H1OmVzuXge.png) | **Very high overlap:** A possible configuration with `max_offset=32`. |

### Results of Overlapping Digits Datasets

To evaluate the model's robustness to occlusion, we present the results from our newly generated overlapping multi-digit datasets. Figures 2 through 4 show the model's performance as `max_offset` increases. To ensure the stability and reliability of our findings, each data point represents the average of 5 independent runs.

A clear trend emerges from the data: as the degree of overlap increases, the model's performance on both the MIoU and mAP metrics consistently degrades, and this decline is observed across all training schedules. This outcome is logical, as higher overlap creates more challenging scenes; the model struggles to distinguish instance boundaries correctly when large portions of a digit are occluded by another.

The accuracy plot in Figure 3 shows a different pattern compared to the mIoU and mAP results. Accuracy remains high, above 95%, and does not decrease as the digits overlap more; in some configurations it even increases slightly. This is not because the model performs better, but rather a consequence of how the accuracy metric works. First, accuracy is a pixel-wise metric that includes the entire background in its calculation. In our images, the background makes up the vast majority of the pixels. The model finds it very easy to classify these background pixels correctly, and this large number of correct classifications inflates the overall score, hiding the poor performance on the much smaller digit regions where the real challenge lies. For example, in a 64x64 frame (4096 pixels), misclassifying 200 foreground pixels still yields roughly 95% pixel accuracy. This effect is strengthened as the overlap increases: when digits overlap, the total number of visible foreground (digit) pixels becomes smaller, and since most of the model's errors occur on these foreground pixels, having fewer of them simply reduces the number of opportunities for mistakes. This explains why the score does not drop and can even appear to improve. For these reasons, accuracy is a misleading metric for this task, as it does not fully reflect the model's ability to correctly segment the actual instances.

![mAP_vs_offset](https://hackmd.io/_uploads/rJkJt5-4eg.png)
**Figure 2:** Average Precision (mAP) vs. Data Offset. Performance is shown for different training configurations (by color) and input window lengths (by line style).
![mIoU_vs_offset](https://hackmd.io/_uploads/BJxkK5bNxx.png)
**Figure 3:** Accuracy vs. Data Offset. Model accuracy is plotted for various training configurations (color) and window lengths (line style).

![Accuracy_vs_offset](https://hackmd.io/_uploads/HylyF9bVeg.png)
**Figure 4:** Mean Intersection over Union (mIoU) vs. Data Offset. mIoU is shown for different training configurations (color) and window lengths (line style).

## Evaluating the Depth Channel

To evaluate potential improvements and understand how temporal information from events is represented, we explore a new method for generating the "depth" channel of the RGB-D-like frames, and conduct an ablation study on its necessity.

### The Original Depth Encoding Method

To provide context, let's first briefly revisit how the temporal "depth" maps are generated in the original implementation for each event frame. The core idea is to visualize the recency of events based on their *order of arrival* within a frame's time window.

1. **Temporal encoding (event order):** The relative order of arrival (e.g., 1st, 2nd, 3rd) is recorded at the (x, y) location of each event within the time window. If multiple events strike the same pixel, only the order of the latest event is retained.
2. **Contrast normalization (histogram equalization):** The raw map of sequence values then undergoes histogram equalization. This non-linear step redistributes the values to span the full `0-255` intensity range, improving the visibility of small differences in event order.
3. **Visibility offset:** Finally, a brightness offset is applied to all active pixels: the normalized value is divided by two and a constant of `50` is added. This ensures that even the earliest-ordered events have a minimum brightness of `51`, preventing them from blending into the black background.

The result is a grayscale image where brightness corresponds to the relative sequence of events. While the original paper mentions that the brightness in the depth map indicates the order of events and that contrast stretching was used for normalization, the exact algorithm for depth generation is not fully detailed. The specific process described above, including the final scaling and offset, was largely inferred from an analysis of the author's provided source code.

A potential limitation of this original approach is that its reliance on event order and non-linear normalization can obscure the direct relationship with time, especially when event rates vary across windows.
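Based on our reading of the provided source, the order-based depth map can be reconstructed roughly as follows. Variable names are ours, and OpenCV's `equalizeHist` stands in for the equalization step; treat this as a sketch of the procedure above rather than the author's verbatim code.

```python
import numpy as np
import cv2

def order_based_depth(events_in_window, height=34, width=34):
    """Depth map encoding the arrival order of events (our reconstruction)."""
    order_map = np.zeros((height, width), dtype=np.float32)
    for i, (_, x, y, _) in enumerate(events_in_window, start=1):
        order_map[y, x] = i                      # latest event per pixel wins

    depth = np.zeros((height, width), dtype=np.uint8)
    active = order_map > 0
    if active.any():
        # Spread order values over 0-255, then equalize the histogram so
        # small differences in arrival order remain visible.
        scaled = (order_map / order_map.max() * 255).astype(np.uint8)
        equalized = cv2.equalizeHist(scaled)
        # Halve and offset so even the earliest events stay clearly brighter
        # than the empty (black) background.
        depth[active] = equalized[active] // 2 + 50
    return depth
```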
### New Algorithm Variant: Timestamp-Normalized Depth

Building on the original event-order-based approach, we propose a `timestamp_normalized` variant that encodes each event's actual timestamp, normalized relative to the start of its processing window, rather than its arrival order. This type of timestamp-based encoding is a common strategy in event-based vision and has been well described in prior work [8, 9]. It aligns with findings that timestamp-based encodings capture motion dynamics and recency more directly and therefore provide more informative temporal structure for downstream tasks, with promising results in fields such as optical flow and gait recognition [8, 10].

Here is a step-by-step look at how we compute the depth maps in this variant:

1. **Raw temporal encoding (time delta):**
    * For each frame, a processing window is defined by a `window_start_time` and a fixed `window_duration`.
    * A raw temporal data map, initially marking all pixels as "no event", is created.
    * If an event's `event_timestamp` falls within the window `[window_start_time, window_start_time + window_duration)`, we calculate its time delta: `time_delta = event_timestamp - window_start_time`.
    * This `time_delta` (how far into the window the event occurred) is stored at the event's (x, y) location. If multiple events hit the same pixel, the `time_delta` of the latest event (the largest value) is retained. Values of `time_delta` range from `0` (event at the window start) to `window_duration - 1` (event at the window end).
2. **Linear normalization to 0-255:**
    * After all events in the window are processed, the `time_delta` values stored at active pixels are linearly normalized to the `0-255` intensity range.
    * A `time_delta` of `0` maps to pixel value `0` (black), and a `time_delta` approaching `window_duration - 1` maps to pixel value `255` (white).
    * The formula is essentially `pixel_value = (time_delta / (window_duration - 1.0)) * 255.0`.
    * Pixels where no events occurred remain `0` (black).

This method provides a representation in which pixel intensity directly correlates with how far, in actual time, an event occurred within its processing window. If the `window_duration` is consistent across frames, this offers a more comparable measure of recency than the order-based method, especially when the number of events per window varies significantly.

We evaluated the `timestamp_normalized` variant using the same methodology and configurations as in our reproduction of the original paper, conducting 5 independent runs for each setup. Table 6 presents the resulting performance metrics for this variant.

**Table 6:** Performance metrics (mean ± standard deviation) for the `timestamp_normalized` variant across different evaluation window lengths (10 ms, 20 ms, 50 ms) and training schedules. All values are percentages.

| Metrics     | 2 epochs     | 5 epochs     | 5 epochs + 10 fine-tuning |
| :---------- | :----------- | :----------- | :------------------------ |
| Acc (10ms)  | 93.23 ± 0.56 | 95.13 ± 0.21 | 96.04 ± 0.05              |
| Acc (20ms)  | 94.12 ± 0.69 | 95.10 ± 0.22 | 96.16 ± 0.06              |
| Acc (50ms)  | 92.87 ± 1.07 | 95.26 ± 0.09 | 96.23 ± 0.03              |
| MIoU (10ms) | 19.82 ± 4.14 | 41.89 ± 3.23 | 55.31 ± 0.38              |
| MIoU (20ms) | 21.61 ± 6.14 | 38.99 ± 2.72 | 55.95 ± 0.97              |
| MIoU (50ms) | 22.09 ± 2.27 | 42.10 ± 2.55 | 54.63 ± 1.75              |
| mAP (10ms)  | 18.64 ± 2.73 | 33.68 ± 1.83 | 40.28 ± 0.27              |
| mAP (20ms)  | 19.64 ± 3.87 | 32.79 ± 1.51 | 41.53 ± 0.81              |
| mAP (50ms)  | 21.56 ± 2.57 | 35.18 ± 1.80 | 41.18 ± 1.04              |

To determine whether this new depth representation significantly altered performance compared to our reproduced baseline, we conducted statistical tests. We chose Welch's t-test because it does not assume that the two groups being compared have equal variances, making it a more reliable choice for experimental results [11]. Table 7 shows the Holm-Bonferroni corrected two-sided p-values from Welch's t-tests comparing the `timestamp_normalized` variant against our reproduction results (Table 2). Across nearly all metrics and configurations, the p-values were 1 (with one at 0.218), indicating no statistically significant difference between this variant and our baseline reproduction.
This suggests that, under these experimental conditions, normalizing depth by the actual time delta within the window did not lead to a detectable change in model performance. One possible reason is that the depth channel itself does not influence the model's results in either the original or the `timestamp_normalized` variant, which led us to a further experiment to test this.

**Table 7:** p-values (`timestamp_normalized` variant vs. reproduction results). The Holm-Bonferroni corrected two-sided p-values, derived from Welch's t-test, comparing the `timestamp_normalized` variant results to our reproduction results (Table 2) for each metric and configuration. (Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001)

| Metrics   | 2 epochs | 5 epochs | 5 epochs + 10 fine-tuning |
| --------- | -------- | -------- | ------------------------- |
| Acc(10ms) | 1.000    | 1.000    | 1.000                     |
| Acc(20ms) | 1.000    | 1.000    | 1.000                     |
| Acc(50ms) | 1.000    | 1.000    | 0.218                     |
| IoU(10ms) | 1.000    | 1.000    | 1.000                     |
| IoU(20ms) | 1.000    | 1.000    | 1.000                     |
| IoU(50ms) | 1.000    | 1.000    | 1.000                     |
| AP(10ms)  | 1.000    | 1.000    | 1.000                     |
| AP(20ms)  | 1.000    | 1.000    | 1.000                     |
| AP(50ms)  | 1.000    | 1.000    | 1.000                     |

### Ablation Study: No Depth Information

Beyond exploring alternative ways to *encode* temporal information into the depth channel, as done with the `timestamp_normalized` variant described above, a fundamental question arose regarding the overall necessity of this channel. Seeing no statistically significant difference with the `timestamp_normalized` variant, we decided to investigate whether the depth channel itself substantially influences performance. To address this, we conducted an ablation study using what we term the `zeros` variant, which removes all temporal information from the depth channel.

1. **RGB channel generation:** The R, G, and B channels are generated *identically* to the original implementation described earlier. This ensures that the visual event activity information presented to the model remains consistent across compared conditions.
2. **Depth channel override:**
    * For this `zeros` variant, the depth channel is explicitly set to a constant value of zero for all pixels, so the depth map becomes a blank, black image.
    * Consequently, no temporal information, neither order-based (as in the original method) nor timestamp-based, is provided to the model through this channel.

This `zeros` variant serves as a critical baseline by providing the model with standard event frames but an entirely uninformative depth channel. It allows us to quantify the contribution of the explicit depth encoding by measuring how performance changes when this temporal information is absent. Furthermore, it lets us assess the model's reliance on a dedicated depth signal by testing whether it can achieve comparable results by inferring temporal dynamics from the RGB channels alone.

This ablation is key to understanding whether the effort, complexity, and additional memory cost of generating and storing an informative depth channel provide a justifiable performance benefit. If the model performs similarly well with a zeroed depth channel, it suggests that the explicit temporal depth information is either redundant for this task or not effectively leveraged by the model's architecture.

We evaluated the `zeros` variant using the same methodology and configurations as in our reproduction of the original paper, conducting 5 independent runs for each setup.
Table 8 presents the resulting performance metrics for this variant.

**Table 8:** Performance metrics (mean ± standard deviation) for the `zeros` variant (depth channel set to zero) across different evaluation window lengths (10 ms, 20 ms, 50 ms) and training schedules. All values are percentages.

| Metrics     | 2 epochs     | 5 epochs     | 5 epochs + 10 fine-tuning |
| :---------- | :----------- | :----------- | :------------------------ |
| Acc (10ms)  | 93.79 ± 0.67 | 95.05 ± 0.38 | 96.06 ± 0.21              |
| Acc (20ms)  | 92.64 ± 1.03 | 95.22 ± 0.25 | 96.06 ± 0.10              |
| Acc (50ms)  | 93.78 ± 0.81 | 94.99 ± 0.63 | 96.17 ± 0.04              |
| MIoU (10ms) | 21.39 ± 6.84 | 39.83 ± 3.35 | 55.56 ± 3.33              |
| MIoU (20ms) | 16.17 ± 3.47 | 41.80 ± 1.79 | 54.92 ± 1.52              |
| MIoU (50ms) | 19.84 ± 5.63 | 40.39 ± 2.20 | 55.05 ± 1.21              |
| mAP (10ms)  | 19.30 ± 5.72 | 32.78 ± 1.28 | 40.46 ± 1.72              |
| mAP (20ms)  | 18.10 ± 2.61 | 34.03 ± 1.61 | 41.13 ± 1.10              |
| mAP (50ms)  | 19.54 ± 2.74 | 34.06 ± 1.31 | 41.41 ± 0.81              |

To determine whether completely removing depth information significantly altered performance compared to our reproduced baseline, we again conducted statistical tests. Table 9 shows the Holm-Bonferroni corrected two-sided p-values from Welch's t-tests comparing the `zeros` variant against our reproduction results (Table 2). As with the `timestamp_normalized` variant, the p-values across all metrics and configurations were uniformly equal to 1. This strongly indicates no statistically significant difference between using a zeroed-out depth channel and our baseline reproduction, and further suggests that, under our experimental conditions for this model and task, explicitly encoding temporal depth information does not significantly impact performance.

**Table 9:** p-values (`zeros` variant vs. reproduction results). The Holm-Bonferroni corrected two-sided p-values, derived from Welch's t-test, comparing the `zeros` variant results (depth channel set to all zeros) to our reproduction results (Table 2) for each metric and configuration. (Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001)

| Metrics   | 2 epochs | 5 epochs | 5 epochs + 10 fine-tuning |
| --------- | -------- | -------- | ------------------------- |
| Acc(10ms) | 1.000    | 1.000    | 1.000                     |
| Acc(20ms) | 1.000    | 1.000    | 1.000                     |
| Acc(50ms) | 1.000    | 1.000    | 1.000                     |
| IoU(10ms) | 1.000    | 1.000    | 1.000                     |
| IoU(20ms) | 1.000    | 1.000    | 1.000                     |
| IoU(50ms) | 1.000    | 1.000    | 1.000                     |
| AP(10ms)  | 1.000    | 1.000    | 1.000                     |
| AP(20ms)  | 1.000    | 1.000    | 1.000                     |
| AP(50ms)  | 1.000    | 1.000    | 1.000                     |

## Conclusion

In summary, our reproduction demonstrates that the core claims of the EV-Mask-RCNN paper are largely valid and reproducible. The conclusion that instance segmentation can be successful on event-based data holds, and statistical analysis shows that the majority of the results do not differ significantly from our reproduction. However, we found discrepancies between the described methodology and the implementation, mainly concerning the training configurations. This highlights a common challenge in machine learning research and suggests that future work should focus more on rigorous documentation of procedures.

Our analysis of the overlapping digit datasets showed that as the degree of possible overlap increases, both mIoU and mAP scores consistently decrease. This aligns with the expected behavior: greater overlap hurts the model's ability to accurately distinguish instance boundaries.
Furthermore, our investigation revealed that altering the temporal depth channel with a `timestamp_normalized` encoding produced no statistically significant performance difference compared to the baseline. An ablation study, in which the depth channel was removed entirely by setting it to zero, also failed to produce any significant change against the baseline. These findings strongly suggest that, for this model and task, the explicit temporal information provided by the depth channel is redundant and does not contribute meaningfully to the final results.

## Individual contributions

Work was divided roughly as follows:

- Jorn - Evaluating on the dataset with overlapping digits.
- Emiel - Refactoring, reproduction and statistical analysis.
- Timo - Evaluating different depth channel representations.

All other parts, such as the introduction, motivation and conclusion, were done together with an equal workload. The modified code and experiment results can be accessed on [GitHub](https://github.com/EWitting/dsait4205-ev-mask-reproduction/tree/main).

## References

[1] Baltărețu, A. (2022). *EV-Mask-RCNN: Instance Segmentation in Event-based Videos*. Bachelor's thesis, Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS), Delft University of Technology, Delft, The Netherlands.

[2] Gundersen, O. E., & Kjensmo, S. (2018). State of the art: Reproducibility in artificial intelligence.

[3] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep reinforcement learning that matters.

[4] Orchard, G., Meyer, C., Etienne-Cummings, R., Posch, C., Thakor, N., & Benosman, R. (2015). Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades. Frontiers in Neuroscience.

[5] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

[6] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV).

[7] Holm, S. (1979). A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6(2), 65–70. http://www.jstor.org/stable/4615733

[8] Zhu, A., Yuan, L., Chaney, K., & Daniilidis, K. (2018). EV-FlowNet: Self-supervised optical flow estimation for event-based cameras. Robotics: Science and Systems XIV. https://doi.org/10.15607/RSS.2018.XIV.062

[9] Huang, C. (2021). Event-based timestamp image encoding network for human action recognition and anticipation. arXiv preprint arXiv:2104.05145. https://arxiv.org/abs/2104.05145

[10] Wang, Y., Du, B., Shen, Y., Wu, K., Zhao, G., Sun, J., et al. (2019). EV-Gait: Event-based robust gait recognition using dynamic vision sensors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Welch, B. L. (1938). The Significance of the Difference Between Two Means When the Population Variances Are Unequal. Biometrika, 29(3-4). https://doi.org/10.1093/biomet/29.3-4.350