# Reproducing "Privacy-Preserving Action Recognition via Motion Difference Quantization"
## Introduction
As computer vision systems are increasingly deployed in both personal and public spaces, there is a growing need for privacy preservation in this field. The central challenge lies in the trade-off between how much private information is obfuscated and how much this loss of information degrades task performance. Previous methods have mostly relied on downsampling to achieve privacy preservation, but this is often overly restrictive, discarding too much fine detail.
The paper we explore, [*Privacy-Preserving Action Recognition via Motion Difference Quantization* [1]](https://arxiv.org/pdf/2208.02459), addresses privacy preservation specifically in the context of action recognition. It proposes an alternative technique with a lower loss of information, combining blurring, motion differencing, and quantization with adversarial training.
Concretely, we aim to achieve the following goals with our reproduction. First, we aim to reproduce the performance claims of the model on the SBU dataset, as reported in Figure 3 of the paper. Next, we apply the same process to a new dataset, [PA-HMDB51](https://github.com/VITA-Group/PA-HMDB51), which was not evaluated in the paper. Lastly, we test the pipeline with an alternative model architecture: we replace the inflated 3D CNN used for action recognition with a 2D CNN augmented with temporal adaptive modules (TAM) and compare their performance.
We believe there is great value in reproducing existing research. It is important to validate scientific claims while ensuring the proposed methods can be properly replicated. This is especially important for papers that lack documentation and detailed reproduction information, as was the case with this paper. By independently reproducing the results, we can verify the effectiveness of proposed methods or uncover their limitations. Additionally, by testing on a new dataset we can assess how well the method generalizes to different conditions. Reproductions also promote standards of rigor and responsible research within the community, since researchers are disincentivized from publishing unreliable results that may be uncovered by later reproductions.
Our reproduction can be found at the following [repo](https://github.com/AsthenicDog390/NoFunMDL).
## Background: BDQ Encoder Overview
### Privacy-Preserving Action Recognition: Specific Approaches
To help better understand the contributions of this paper, we first cover some of the techniques and advancements that preceded it. Early attempts at privacy-preserving action recognition often focused on directly reducing the visual detail of video streams and learning on the resulting low-resolution video [3, Ryoo et al., 2017]. Although such methods could obfuscate most privacy attributes to the naked eye, they remained vulnerable to strong adversaries such as deep networks trained to recover these privacy attributes.
More recent advancements in the field have focused on adversarial training frameworks. In this framework, an "adversary" network aims to predict privacy attributes from an input that has been passed through an encoder. The encoder learns to transform the input so that the adversary cannot predict the privacy attributes while the action recognition network maintains its performance [4, Wu et al., 2020].
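To make the framework concrete, the sketch below shows one possible alternating update for an encoder E, an action (target) network T, and a privacy adversary P. The negated cross-entropy term and the weighting factor `alpha` are our own simplifications; the paper's exact adversarial objective and update schedule may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_step(encoder, action_net, privacy_net,
                     opt_main, opt_privacy,
                     video, action_label, privacy_label, alpha=1.0):
    """One alternating update of the adversarial framework (simplified sketch).
    Step 1 trains the adversary P on the current (detached) encodings.
    Step 2 trains the encoder E and action network T to keep action accuracy
    high while making P's privacy prediction harder (negated CE term)."""
    # Step 1: update the privacy adversary on encodings it cannot influence
    with torch.no_grad():
        encoded = encoder(video)
    loss_privacy = F.cross_entropy(privacy_net(encoded), privacy_label)
    opt_privacy.zero_grad()
    loss_privacy.backward()
    opt_privacy.step()

    # Step 2: update encoder + action network against the frozen adversary
    encoded = encoder(video)
    loss_action = F.cross_entropy(action_net(encoded), action_label)
    loss_adv = F.cross_entropy(privacy_net(encoded), privacy_label)
    loss = loss_action - alpha * loss_adv   # maximize the adversary's loss
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
    return loss_action.item(), loss_privacy.item()
```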
### BDQ: Blur, Difference, Quantization for Privacy-Preserving Action Recognition
The paper builds upon this adversarial training framework by proposing a novel encoder design to process video frames. The encoder consists of three modules: Blur, Difference, and Quantization (the BDQ encoder), which aim to obscure sensitive information while retaining the finer detail relevant for action recognition. These modules are visualized and outlined below:

Figure 1: Architecture of the BDQ encoder and flow into action recognition, T, and adversary network, P.
- **Blur**: The blur module is a simple 2D convolution applied to each individual video frame. It uses a 5x5 Gaussian kernel with a learnable standard deviation.
- **Difference**: The difference module is arguably the most important stage of the pipeline. For every blurred frame B_i, it subtracts the preceding blurred frame B_(i-1) pixel-wise. This removes static detail from the video and preserves only the motion between frames, which can still be used effectively for action recognition. It also results in the loss of one frame per clip.
- **Quantization**: The last stage of the encoder is the quantization module, which applies a pixel-wise quantization to the motion-differenced frames. This removes some of the subtle, low-level details that the adversary might otherwise still pick up on. The extent of the quantization is also learnable, so the model can optimize for both privacy preservation and action recognition (a sketch of the full encoder is given after this list).
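To make the three stages concrete, below is a minimal, self-contained sketch of a BDQ-style encoder in PyTorch. The separable Gaussian parameterized by a learnable log-sigma, the fixed number of quantization levels, and the straight-through estimator are our own implementation choices; the authors' code differs in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BDQEncoder(nn.Module):
    """Minimal sketch of a BDQ-style encoder (not the authors' implementation).
    Input:  (B, C, T, H, W) video clip.  Output: (B, C, T-1, H, W) encoded clip."""

    def __init__(self, kernel_size=5, num_levels=16):
        super().__init__()
        self.kernel_size = kernel_size
        self.log_sigma = nn.Parameter(torch.zeros(1))                       # learnable blur strength
        self.levels = nn.Parameter(torch.linspace(-1.0, 1.0, num_levels))   # learnable quantization levels

    def gaussian_kernel(self):
        # Build a separable 2D Gaussian from the learnable standard deviation
        sigma = self.log_sigma.exp()
        half = self.kernel_size // 2
        coords = torch.arange(-half, half + 1, dtype=torch.float32, device=sigma.device)
        g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
        g = g / g.sum()
        return torch.outer(g, g)

    def forward(self, x):
        b, c, t, h, w = x.shape
        # Blur: depthwise 2D convolution of every frame with the learnable Gaussian
        kernel = self.gaussian_kernel().unsqueeze(0).unsqueeze(0).repeat(c, 1, 1, 1)
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        blurred = F.conv2d(frames, kernel, padding=self.kernel_size // 2, groups=c)
        blurred = blurred.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # Difference: subtract the preceding blurred frame (drops one frame)
        diff = blurred[:, :, 1:] - blurred[:, :, :-1]
        # Quantization: snap each pixel to its nearest learnable level,
        # using a straight-through estimator so gradients still flow
        dist = (diff.unsqueeze(-1) - self.levels).abs()
        hard = self.levels[dist.argmin(dim=-1)]
        return diff + (hard - diff).detach()
```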
## Methodology
### Datasets
#### SBU
The first dataset we test on is the SBU Kinect Interaction Dataset [5, Yun et al., 2012]. This is one of the datasets evaluated in the original paper. It is a two-person video action recognition dataset with 8 actions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. Following the procedure in the paper, the dataset is reduced from 21 actor sets to 13 distinct actor pairs (deduplication). For this dataset, the privacy attribute to be predicted by the adversary network is which actor pair performed the action, while the action recognition network must correctly predict the action. Some frames from the dataset are depicted below. The original dataset also includes depth maps and skeleton positions; however, these are not used by the networks.

Figure 2: Examples of actions from the SBU dataset
#### PA-HMDB51
The second dataset we test on is PA-HMDB51 (privacy-annotated HMDB51), published alongside the TPAMI paper [9]. The underlying HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. It is composed of 6,766 video clips from 51 action categories (such as “jump”, “kiss” and “laugh”), with each category containing at least 101 clips. PA-HMDB51 provides frame-wise privacy attribute annotations on the original HMDB51 videos, but only for 10-25 videos per action. We test the BDQ encoder on 8 of these actions: climb, drink, jump, kick, laugh, punch, push, and shake hands, and we use skin color as the privacy attribute. Its labels are: 0 (skin color of the person(s) is unidentifiable), 1 (white), 2 (brown/yellow), and 3 (black). We only use the videos that have these privacy annotations. Lastly, we used the code from the TPAMI paper to convert the videos to frame images, and the last image of each video was deleted, as it was an all-black frame with no relevant content (a small clean-up sketch is shown below).

Figure 3: Examples of actions from the PA-HMDB51 dataset
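The clean-up step mentioned above (removing the all-black last frame of each extracted video) can be scripted. A small sketch, assuming the extraction code writes one directory of `.jpg` frames per video; the root path and directory layout are assumptions on our side.

```python
from pathlib import Path

# Hypothetical layout: one sub-directory of .jpg frames per video
FRAMES_ROOT = Path("pa_hmdb51_frames")

for video_dir in sorted(p for p in FRAMES_ROOT.iterdir() if p.is_dir()):
    frames = sorted(video_dir.glob("*.jpg"))
    if frames:
        frames[-1].unlink()   # remove the trailing all-black frame
```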
### Network Architecture
Following the encoding step shown in Figure 1, the processed frames are independently run through the action recognition network and the adversarial privacy attribute prediction network. The action recognition network operates on inputs with three dimensions (height, width, time), while the privacy network takes inputs with two dimensions (height, width). For the action recognition network, we additionally experiment with a different architecture that the authors do not cover. The network architectures used are described in this section.
#### Privacy Network
For all experiments, our privacy network uses a standard 2D ResNet-50 model pretrained on ImageNet. The input size is (B, C, T, H, W), corresponding to batch size, RGB channels, number of frames, height, and width. The privacy attributes are predicted for each frame, and the results are averaged across frames to obtain a single prediction per video.
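A minimal sketch of this per-frame evaluation and averaging, using torchvision's ResNet-50; whether logits or softmax probabilities are averaged is an assumption on our side.

```python
import torch.nn as nn
from torchvision.models import resnet50

class FramewisePrivacyNet(nn.Module):
    """Sketch: an ImageNet-pretrained 2D ResNet-50 applied to every frame,
    with per-frame logits averaged into a single video-level prediction."""

    def __init__(self, num_privacy_classes):
        super().__init__()
        self.backbone = resnet50(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_privacy_classes)

    def forward(self, x):                                  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        logits = self.backbone(frames)                     # (B*T, num_classes)
        return logits.reshape(b, t, -1).mean(dim=1)        # average over the T frames
```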
#### Action Recognition Network: Original Architecture
The action recognition network uses an inflated 3D CNN (I3D) architecture. This is again a ResNet-50 model, with an important modification so it can operate on 3D inputs. It starts from a 2D ResNet-50 pretrained on ImageNet and "inflates" the learned 2D convolution kernels to 3D following the procedure in Figure 4. This improves convergence, since it gives the network a good initialization of kernel weights that would otherwise be hard to optimize from a cold start in three dimensions. Additionally, the I3D model is further pretrained on the Kinetics-400 action recognition dataset. The weights can be found in [7].

Figure 4: Process of 2D convolution kernel inflation to 3D. [6]
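The inflation rule itself is simple: repeat each 2D kernel along a new temporal axis and rescale so that a static (repeated-frame) input produces the same response as the original 2D filter. A small sketch of this standard rule:

```python
import torch

def inflate_conv2d_weight(weight_2d: torch.Tensor, time_dim: int) -> torch.Tensor:
    """Inflate a 2D conv kernel of shape (out, in, kH, kW) to a 3D kernel of shape
    (out, in, T, kH, kW) by repeating it along a new temporal axis and rescaling
    by 1/T, so a static (repeated-frame) input yields the same response."""
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return weight_3d / time_dim

# Example: inflate a ResNet-50 stem kernel (64, 3, 7, 7) to a 3x7x7 temporal kernel
w2d = torch.randn(64, 3, 7, 7)
w3d = inflate_conv2d_weight(w2d, time_dim=3)
print(w3d.shape)   # torch.Size([64, 3, 3, 7, 7])
```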
#### Action Recognition Network: New Architecture Variant
We additionally test the privacy-preserving action recognition pipeline with a different action recognition architecture. Instead of using the I3D architecture to pass information along the time dimension, we use a 2D CNN with Temporal Adaptive Modules (TAM). With this architecture, the residual blocks of a 2D ResNet-50 are augmented with a TAM module before each convolution block to allow reasoning over time. This avoids expensive 3D convolutions and instead aggregates temporal information with separate 1D temporal convolutions. The architecture can be viewed in Figure 5. To allow for a fairer comparison, this network was also pretrained on the Kinetics-400 dataset; these weights can also be found in [7].

Figure 5: ResNet + TAM module architecture [8]
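As a rough illustration of the idea (not the full TAM module from [8], which additionally learns location-adaptive temporal kernels), a depthwise 1D convolution over the time axis can mix temporal context without any 3D convolution:

```python
import torch.nn as nn

class DepthwiseTemporalConv(nn.Module):
    """Simplified stand-in for a TAM-style block: a depthwise 1D convolution over
    the time axis only, mixing temporal context without any 3D convolution.
    The actual TAM in [8] additionally learns location-adaptive temporal kernels."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def forward(self, x):                                  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        y = self.conv(y)                                   # 1D conv along time, per channel
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
```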
### Training and Validation
Training was conducted on Kaggle’s notebook platform with an NVIDIA Tesla P100 GPU, which required a few adjustments. Most importantly, we reduced the batch size to 4 (from the paper’s batch size of 16) due to the memory constraints of the P100. Despite the smaller batch, we adhered to the optimization hyperparameters reported in the paper: the model was trained for 50 epochs using stochastic gradient descent (SGD) with an initial learning rate of 0.001, together with a cosine annealing learning rate scheduler. We additionally trained for 100 epochs, as this was the configuration set in the reference code, and we hypothesized that this could be an inconsistency in the paper. These settings, including momentum and weight decay (as inherited from the reference code), were kept consistent to ensure a faithful reproduction. The 3D ResNet-50 action recognition backbone and the 2D ResNet-50 adversary were both initialized with pre-trained weights (Kinetics-400 and ImageNet, respectively), and the provided SBU training/validation splits were used to match the authors’ setup.
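For reference, a sketch of the optimizer and scheduler configuration described above. The momentum and weight decay values shown are placeholders inherited from the reference code rather than numbers reported in the paper, and the model here is a trivial stand-in.

```python
import torch
import torch.nn as nn

# Settings described above; momentum and weight decay are assumptions
# (inherited from the reference code rather than stated in the paper).
EPOCHS, BATCH_SIZE, LR = 50, 4, 1e-3

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 112 * 112, 8))  # stand-in for BDQ + I3D
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # ... one pass over the SBU training split goes here ...
    scheduler.step()   # cosine decay of the learning rate once per epoch
```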
## Results
The primary target of our reproduction was Figure 3 from the original BDQ paper, which plots the performance trade-off on the SBU dataset using the BDQ encoder and the I3D architecture. To closely mirror this result, we reproduced the model’s performance under the same conditions and then extended our experimentation in two meaningful directions:
1. Using an alternative action recognition architecture (ResNet + TAM)
2. Testing the BDQ pipeline on a new dataset (PA-HMDB51)
The figure below recreates Figure 3 from the paper, placing our reproduction, including the results from the new dataset and the new architecture variant, side by side with the paper’s reported numbers and other baseline methods:

Figure 6: Performance trade-off between action accuracy and privacy attribute prediction (actor-pair classification)
Alongside this plot, the table below lists the exact values we obtained from our experiments. It also includes the results of our extended evaluations, the new architecture (ResNet + TAM) and the second dataset (PA-HMDB51), neither of which was included in the original paper. These additions give a clearer view of how the BDQ method performs when its setup is varied.
| Setup | Dataset | Architecture | Epochs | TopT@1 (↑ Action Acc.) | TopB@1 (↓ Privacy Acc.) |
|-------------------|-------------|------------------|--------|------------------------|--------------------------|
| Original Paper | SBU | I3D | 50 | ~84% | ~37% |
| Our Reproduction | SBU | I3D | 100 | 55.91% | 44.08% |
| Our Reproduction | SBU | I3D | 50 | 59.14% | 50.25% |
| TAM (new architecture) | SBU | ResNet + TAM | 100 | **77.42%** | **38.57%** |
| TAM (new architecture) | SBU | ResNet + TAM | 50 | 75.27% | 48.03% |
| New Dataset Variant | PA-HMDB51 | I3D | 50 | 42.42% | 88.28% |
We also examined the actual quantization steps that our network learned and compared them to the ones shown in the original BDQ paper. The figure below shows both versions: on the left is the paper’s result, and on the right is what our model learned. While both started from the same initialization (blue line, shown only in the left figure), the learned quantization (orange line) turned out quite different. The original paper’s version shows some large, uneven jumps, while ours ended up with more regular, evenly spaced steps. This suggests our model may have learned a different way to group input values, possibly because of differences in the training setup or optimization behavior.

Figure 7: Learned quantization steps (orange) in BDQ. Left: from original BDQ paper. Right: from our reproduction.
### What the Results Show
- **Big Gap from the Original Paper**: Our I3D results on the SBU dataset were much lower than what the original paper reported (~59% vs ~84% in TopT@1). Privacy accuracy was also worse. This suggests the paper may have left out important training details, or used a setup we couldn’t fully match.
- **New Architecture Worked Better**: When we switched to the ResNet + TAM model, we saw much better results. It got very close to the original paper’s reported accuracy, even though it wasn’t the same model. Privacy accuracy also improved slightly.
- **Poor Results on New Dataset**: When we used the same setup on PA-HMDB51, both accuracy and privacy scores dropped. This shows the BDQ method might not work well outside of the SBU dataset without more changes or tuning.
## Discussion
### Reproduction Struggles
Reproducing the BDQ paper was filled with unexpected obstacles. From missing or outdated code and undocumented dataset formats to broken links and unreported dependencies, nearly every step of the reproduction required manual effort. The lack of key resources, such as pretrained models, README files, and code comments, made the process frustrating and error-prone. Additionally, the presence of many of these issues points to the published code repository not being in its final working state, which raised concerns about the validity of our reproduction. These struggles emphasize the importance of transparency and documentation for reproducibility in machine learning research. We provide a concise table of the issues encountered and any applicable fixes we implemented.
| Problem | Solution |
| - | - |
| IPN dataset missing data split and code was tailored for SBU | Switched to SBU dataset |
| Missing imports in BDQ repo (e.g., `sys` in `train.py`) | Used IBM repo with complete imports |
| Imports of missing files (e.g. `s3d, s3d_resnet` in `model_builder.py`) | Removed imports |
| BDQ repo appeared to be copied from IBM repo without attribution | Referred to IBM repo for consistency and completeness |
| No README file to explain codebase | Manually explored and mapped code structure |
| SBU split file lacked explanation for columns | Manually inferred column meanings |
| Incorrect file paths in SBU dataset | Wrote a script to fix paths |
| Original SBU dataset link was broken and not reported in paper | Found an alternative source on Kaggle |
| Code lacked comments | Understood functionality through manual inspection |
| Pretrained models (ResNet-50 on ImageNet and I3D ResNet on Kinetics-400) not included in the repository | Found and used models from external sources |
| Pretrained models used old PyTorch format | Corrected PyTorch load pipeline |
| Dataset had missing and incorrectly named files | Renamed and fixed files manually |
| Conv2D described in paper but code used Conv3D, causing shape mismatches | Added `threed_data` argument to handle 3D inputs |
| Files had un-commented debug code causing errors | Manually identified and commented out problematic code |
| Training output was undocumented | Deduced metrics were actor-pair and action accuracy |
| No explicit evaluation script used | Deduced that validation was directly performed in the `train.py` script |
| Memory issues when training with P100 | Lowered batch size to 4 |
| Inconsistent (or missing) hyperparameters between repo and paper | Tested with both the repo defaults and the paper's values |
| Initial low performance due to incorrect video frame sampling | Used `--dense-sampling` argument to ensure frame differencing assumptions hold |
| Shape mismatches when using 2D CNN with TAM | Used `--without_t_stride` argument |
| PA-HMDB51 dataset download link was broken | Located correct link through web search |
| No `requirements.txt` in PA-HMDB51 repo | Manually resolved dependencies and versioning issues |
| PA-HMDB51 had fewer privacy annotations per action class than SBU | Accepted reduced performance; no direct fix |
## Conclusion
Reproducing the *Privacy-Preserving Action Recognition via Motion Difference Quantization* paper turned out to be a much more involved and uncertain process than we initially anticipated. The paper provided only partial details, and the official codebase was incomplete and undocumented. Despite our best efforts to align with the stated setup, our reproduction failed to match the paper's reported results. The action recognition performance we achieved was significantly lower, and the privacy accuracy was notably worse, suggesting that essential implementation details were missing.
Yet, this project was far from a failure. By looking at each part of the original system step by step, and then trying two new approaches (implementing a different architecture and testing the original one on new data), we gained valuable insights. Our alternative architecture (ResNet + TAM) delivered superior performance and approached the paper's claimed results, demonstrating that the BDQ concept can be effective when paired with the right implementation. When applied to the PA-HMDB51 dataset, however, the original method showed limited generalization capabilities.
This work highlights the critical importance of reproducibility in research and how challenging it can be in practice. Reproductions like ours test not only the technical soundness of an idea but also whether it can be reliably implemented by others, which is what gives scientific claims their weight. The difficulties we encountered underscore the need for comprehensive documentation, complete codebases, and transparent reporting in research publications.
By publishing both our successes and struggles, we hope to help others navigate similar challenges and to advocate for better open-source and reproducibility practices in machine learning research. These reproduction efforts, though difficult, provide essential verification of published methods and often reveal both limitations and potential improvements that advance the field.
## Contribution
- **Alexandru Preda**: Reproduction of the Original Method on the SBU dataset
- **Sebastian Nechita**: Evaluation of the Original Method on a new dataset (PAHMDB51)
- **Sten Wolfswinkel**: Implementation of the new architecture (TAM)
## References & Links
Our repository with reproduction: https://github.com/AsthenicDog390/FunMDL
[1] https://arxiv.org/pdf/2208.02459
[2] https://github.com/VITA-Group/PA-HMDB51
[3] https://ojs.aaai.org/index.php/AAAI/article/view/11233
[4] https://arxiv.org/abs/1906.05675
[5] https://ieeexplore.ieee.org/document/6239234
[6] https://medium.com/data-science/deep-learning-on-video-part-three-diving-deeper-into-3d-cnns-cb3c0daa471e
[7] https://github.com/IBM/action-recognition-pytorch
[8] https://github.com/liu-zhy/temporal-adaptive-module
[9] https://htwang14.github.io/PA-HMDB51-website/index.html