# CoRL24 Rebuttal
We thank you for the critical assessment of our work and valuable feedback.
## To Meta Reviewer (Secret Note)
Thank you for coordinating this area. Although we have tried to minimize the additional workload on you, we are concerned that Reviewer WtUL's comments do not align with the scope and intent of our work. The reviewer's critique seems to misunderstand key aspects of our contribution:
1. The reviewer's feedback implies an expectation that our method should directly improve computer vision (CV) models for privacy preservation, which is not the focus of our contribution. Our method is designed to be complementary and adaptable to future advances in CV models: it provides sequential probabilistic guarantees in video streams using CV models.
2. We clearly addressed the reviewer's concerns about the computational complexity of our method and the potential use of cloud computing with additional experiments. The reviewer notes that the local YOLO model underperforms compared to the "large" cloud-based vision language models (VLMs). However, our experimental result from Figure 1 in `rebuttal.pdf` and Figure 5 in the paper shows that our method with YOLO outperforms the benchmark VLMs, Video-LLaVA and GPT-4V.
3. Lastly, privacy preservation with low-resolution video is outside the scope of our contribution.
## New Response (WtUL)
We appreciate your feedback and response to our initial comment. We are confident that we have addressed most of the feedback from all reviewers, and the paper will be properly revised.
**Point 1:** Our contribution does not involve improving the computer vision models such as object detectors. Our method is intentionally designed to serve as a complementary framework for advancements in CV models. As CV models inevitably improve, our performance will improve, since we can "plug and play" the best CV models in our framework.
**Point 2:** The privacy preservation success ratio is evaluated on a per-video basis, not per frame. To clarify, the method requires an entire video as input. However, our approach processes video frames individually, as it handles real-time video streams. The privacy preservation success ratio refers to the number of instances of each "privacy-sensitive object" that are safely preserved for the entire duration of the video. Please see the equation of Privacy Preservation Success Ratio in the paper (L234). For the evaluation in Figure 5a, we used five different lengths of video, with each video length containing 25 videos. The main focus of our method is to quantify the probability of successful privacy preservation, rather than achieving a state-of-the-art preservation success ratio.
**Point 3:** In Figure 5, YOLOv9 outperforms the benchmark video language models, Video-LLaVA and GPT-4V. Furthermore, the open-vocabulary object detection model, YOLOv8x-worldv2 in Figure 1 of the `rebuttal.pdf`, demonstrates better performance than YOLOv9. Hence, your concerns about a "large" cloud-based VLM versus an efficient object detector model should be alleviated. We emphasize that our proposed method aims to mitigate the risk of privacy leaks associated with relying solely on computer vision models for privacy preservation.
**Point 4:** First, privacy preservation on low-resolution video is not part of our contribution. We use HD images for demonstration because of the resolution of the robot's camera. The method is applicable to any resolution, provided the underlying model supports that resolution.
We agree that the mentioned papers can help real-time privacy preservation. However, it is unclear whether reducing image resolution would negatively impact vision-based decision-making (without adding new mechanisms). We have also shown that our method works in real time even with high-resolution images. Furthermore, none of the referenced papers provides sequential probabilistic guarantees in video streams.
In summary, using high-resolution images is our DESIGN CHOICE, because we also want to use vision for decision-making. Our method CAN HANDLE ANY VIDEO RESOLUTION.
## Meta Reviewer
We appreciate your efforts in coordinating the area.
**Reviewer's Point 1:** Address how mobile platforms would run the proposed method (due to limited compute).
**Response 1:** We tested our method with YOLOv9 on both a CPU (Intel Xeon Gold) and a GPU (NVIDIA A5000). The average runtimes for processing one image frame (1600x900 pixels) on the CPU and GPU are **0.1159** seconds and **0.0055** seconds, respectively. Therefore, mobile platforms with only a CPU can also preserve privacy in real time at roughly 8.6 frames per second (fps).
**Reviewer's Point 2:** Discuss privacy guarantees wrt other methods using DP.
**Response 2:** We added a related work section in `rebuttal.pdf` in the zip file we uploaded for rebuttal. We include the discussion about differential privacy and other methods in the related work section.
**Reviewer's Point 3:** Elaborate on chosen prompts.
**Response 3:** We present the prompts in the supplementary material (`[CoRL 2024] Supplementary_Material_Privacy_Constrained_Video_Streaming.pdf` Section B).
**Reviewer's Point 4:** Related work section and situate the novelty.
**Response 4:** Due to the page limit, we merged the related work section into the introduction in the manuscript. To include more details, we added a related work section in `rebuttal.pdf` and will add this section to the revised manuscript.
## Reviewer abVg
**Reviewer Q1:** The reviewer requests a wider coverage of literature highlighting differences with existing approaches.
**Ans 1:** We provide a revised literature review under "Related Works" in `rebuttal.pdf`. Existing methods are black-box, data-driven, and lack calibrated quantitative guarantees for privacy. Our approach offers a calibrated probabilistic guarantee, distinguishing it from others.
**Reviewer Q2:** The reviewer finds the approach of specifying and filtering privacy constraints straightforward and asks for clarification on novelty.
**Ans 2:** Specifying constraints per frame is straightforward, but our novelty lies in providing a probabilistic guarantee for sequences of frames using conformal prediction (CP) and a real-time verifiable video abstraction model. This enhances the expressiveness of privacy rules compared to existing methods.
**Reviewer Q3:** The reviewer suggests discussing failures of existing methods, noting that differential privacy (DP) can be used in real-time and questioning our approach's guarantee of complete concealment.
**Ans 3:** We agree that existing DP approaches can run in real-time; however, our approach provides calibrated probabilistic guarantees on the satisfaction of privacy constraints for a sequence of frames, which existing DP approaches do not provide. Existing DP approaches are incapable of incorporating such expressive and diverse privacy constraints.
**Reviewer Q4:** The reviewer notes that the threat model is not stated, and possible attacks are not specified.
**Ans 4:** The automated object detection model forms the threat model. Its resilience to input variations and concealment attempts represents plausible attacks. Our expressive privacy specifications, such as $\text{person} \rightarrow \neg \text{face}$, protect against these threats, for example by ensuring that no faces are exposed.
**Reviewer Q5:** The reviewer is concerned that CP confidence calibration may be incorrect under test data distribution.
**Ans 5:** CP provides theoretical probabilistic bounds with stricter tolerance as the number of samples in the calibration set increases. In theory, 100 samples are required for about 1% error tolerance, and our calibration set contains 28,600 samples. We assume that the test data is identically distributed with the calibration data. Our experimental results comply with the assumption.
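As a rough illustration of how the tolerance tightens with the calibration set size, the standard split conformal prediction result bounds marginal coverage between $1-\alpha$ and $1-\alpha+1/(n+1)$ for $n$ exchangeable calibration samples (the upper bound additionally assumes almost surely distinct scores). The sketch below only evaluates that slack term; the function name `coverage_slack` is ours for illustration and is not taken from the paper's code.

```python
def coverage_slack(n_calibration: int) -> float:
    """Width of the marginal coverage band [1 - alpha, 1 - alpha + 1/(n + 1)].

    Assumption: standard split conformal prediction with exchangeable
    calibration and test data and almost surely distinct scores.
    """
    if n_calibration < 1:
        raise ValueError("need at least one calibration sample")
    return 1.0 / (n_calibration + 1)


# Compare the textbook 100-sample case with the 28,600-sample calibration set.
for n in (100, 28_600):
    print(f"n = {n:>6}: slack = {coverage_slack(n):.6f}")
```

Under these assumptions the slack is about 1% for 100 samples and several hundred times smaller for 28,600 samples.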
**Reviewer Q6:** The reviewer questions the claim of benchmark methods degrading with longer videos (line 239) and notes that Figure 5a contradicts this.
**Ans 6:**
The claim of performance degradation with increased video length is based on a hypothesis study that was excluded due to the page limit. We compare our method to VLM baselines like ViCLIP and VideoLLaMA (See **Figure 5** in `rebuttal.pdf`). Although this is inconsistent with Figure 5a, there is a slight decline with GPT-4V. The difference arises from using video sequences in the hypothesis study versus individual frames in the experiment.
**Reviewer Q7:** The reviewer notes that the analysis of Figure 6 needs further clarity.
**Ans 7:** The different colors represent varying numbers of privacy-sensitive (PS) objects. For example, $\phi=1$ indicates one PS object, such as a face, while $\phi=3$ indicates three PS objects, like a face, cell phone, and credit card. The experiment evaluates non-privacy-sensitive (NPS) object preservation, such as ensuring a person remains detectable after concealing a face. This is critical for vision-based policies that rely on person visibility. The NPS object preservation ratio fluctuates slightly up to $\phi=4$ due to randomly selected PS objects, but it decreases significantly beyond $\phi=4$ as more of the frame is concealed.
**Reviewer Q8:** The reviewer suggests having a thorough description and justification of the experimental procedure.
**Ans 8:** We calibrate a YOLOv9 model using 28,600 ImageNet images. We then deploy PCVS with the calibrated model to a Franka robot arm (Intel i9 + RTX 3090) and a Clearpath Jackal robot (Intel i7 + Tesla T4). A privacy-constrained video stream (see Figure 3) is provided to a vision-based policy. We use the same experimental procedure for quantitative analyses, collecting results with the metrics defined in the paper.
**Reviewer Q9:** The reviewer questions why the evaluation datasets have images that are not from a continuous sequence.
**Ans 9:**
Evaluation dataset $\mathrm{I}$ includes a privacy-sensitive object, like a person, in videos of varying lengths to test privacy preservation. We need ground truth labels to verify if the $n^{\text{th}}$ frames with sensitive objects are concealed. Evaluation dataset $\mathrm{II}$ requires labels for both privacy-sensitive and non-privacy-sensitive objects. For example, to check the preservation of objects A, B, and C after concealing object D, we need labels for A, B, C, and D. Since no continuous video datasets with these labels exist, we created two evaluation datasets with randomly inserted images.
## Reviewer tq8t
**Reviewer's Point 1:** The reviewer is concerned that our method's privacy guarantee heavily depends on the performance of the segmentation model, which can only be evaluated with YOLOv9.
**Response 1:** First, our method uses an object detection model, not a segmentation model. We agree that our method depends on the computer vision (CV) model. However, our method is not limited to the YOLOv9 model (See **Figure 1 in rebuttal.pdf**). Our method is intentionally designed to serve as a complementary framework for advancements in CV models. As CV models inevitably improve, our performance will improve since we can "plug and play" the best CV models in our framework. In addition, our key benefit is "zero training". We can simply choose pre-trained CV models and layer on logical reasoning for better privacy preservation. Lastly, the performance of CV models can be significantly enhanced by fine-tuning the model for specific domains, and these models can be integrated into our method.
**Reviewer's Point 2:** The reviewer is concerned that our method's evaluation is limited to specific scenarios, making it challenging to assess the generality of the proposed approach.
**Response 2:** Temporal logic (TL) is highly expressive. Given an exhaustive set of atomic propositions, we can use TL specifications to express properties such as safety, liveness, and fairness. Among these properties, our method is designed to support safety properties, which ensure that "something bad will never happen." Privacy is just one example of a safety property. Hence, our method can be widely adapted and used in general safety scenarios beyond privacy. Additional demonstrations with complex privacy specifications such as $\Box ( (\text{bicycle} \rightarrow \neg \text{person}) \land (\text{car} \vee \text{bus} \rightarrow \neg \text{person}))$ were presented in the submitted video supplementary material. We have added one more scenario (`house_privacy.mp4` in the rebuttal's supplementary ZIP file), similar to iRobot's Roomba case. In the demonstration, we conceal private features defined as $\Box(\neg \text{laptop} \wedge \neg \text{television} \wedge \neg \text{person})$ and maintain the privacy guarantee above the threshold of 0.80.
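To make the safety reading of these specifications concrete, the following minimal sketch evaluates the two formulas above over a finite sequence of frames. It assumes each frame is represented as the set of class names still visible after concealment, and it reads $\text{car} \vee \text{bus} \rightarrow \neg \text{person}$ as $(\text{car} \vee \text{bus}) \rightarrow \neg \text{person}$. All function names are hypothetical, and the sketch omits the calibrated probabilities that our method attaches to each detection.

```python
from typing import Callable, Iterable, Set

Frame = Set[str]  # assumed representation: class names visible in one frame


def implies(a: bool, b: bool) -> bool:
    """Material implication a -> b."""
    return (not a) or b


def traffic_rule(frame: Frame) -> bool:
    """(bicycle -> not person) and ((car or bus) -> not person)."""
    person = "person" in frame
    return (implies("bicycle" in frame, not person)
            and implies("car" in frame or "bus" in frame, not person))


def house_rule(frame: Frame) -> bool:
    """not laptop and not television and not person."""
    return not ({"laptop", "television", "person"} & frame)


def always(rule: Callable[[Frame], bool], frames: Iterable[Frame]) -> bool:
    """Box operator on a finite trace: the rule must hold in every frame."""
    return all(rule(frame) for frame in frames)


stream = [{"car"}, {"bicycle", "tree"}, {"bus", "person"}]
print(always(traffic_rule, stream))              # False: third frame violates
print(always(traffic_rule, stream[:2]))          # True
print(always(house_rule, [{"sofa"}, {"lamp"}]))  # True
```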
**Reviewer's Point 3:** The reviewer is concerned about whether the privacy guarantees provided by our method are comparable to those in prior works. In other words, no discussion exists on which algorithm can most effectively prevent privacy issues.
**Response 3:** We discuss different algorithms for privacy guarantees in the introduction. However, those methods, including differential privacy (DP) and cryptography, cannot express diverse and complex privacy rules, such as $\Box(\neg \text{laptop} \wedge \neg \text{television} \wedge \neg \text{person})$. In addition, our approach provides calibrated probabilistic guarantees on the satisfaction of privacy constraints for a sequence of frames, which existing DP approaches do not provide.
**Reviewer's Point 4:** The reviewer would like a detailed discussion of how the prompt for VLM-based baselines was chosen.
**Response 4:** We provided the prompt for VLM-based baselines in the original supplementary material (See `[CoRL 2024] Supplementary_Material_Privacy_Constrained_Video_Streaming.pdf` Section B).
**Reviewer's Point 5:** The reviewer requests a wider coverage of literature and to clarify the relationship between the proposed method and existing approaches.
**Response 5:** We provide a literature review with wider coverage under the "Related Works" section in the file named `rebuttal.pdf`. Existing works provide a black-box solution for privacy preservation, primarily relying on purely data-driven approaches. Due to their black-box nature, these methods fail to provide a calibrated quantitative guarantee for privacy preservation. On the other hand, our approach provides a calibrated probabilistic guarantee for privacy preservation, which is the primary attribute that sets it apart from other works.
## Reviewer WtUL
**Reviewer's Point 1:** The reviewer is concerned about the computational complexity of our method and the potential use of cloud computing.
**Response 1:** Our method is designed to run locally on the mobile robot, thereby avoiding the risk of privacy leakage from cloud-based services. We tested our method with YOLOv9 on both a CPU (Intel Xeon Gold) and a GPU (NVIDIA A5000). The average runtimes for processing one image frame (1600x900 pixels) on the CPU and GPU are **0.1159** seconds and **0.0055** seconds, respectively. Therefore, a robot with only a CPU can preserve privacy in real time at roughly 8.6 frames per second (fps), and a robot with a GPU can comfortably handle 20 fps video. We will report these numbers in the paper. Given the low latency on a CPU, we do not need cloud computing for real-time privacy preservation.
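For transparency, the throughput implied by the measured per-frame runtimes can be reproduced with the short calculation below. It is only arithmetic on the two reported latencies; the helper name `frames_per_second` is ours, and the calculation assumes frames are processed strictly one after another.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Sustained throughput when frames are processed sequentially."""
    if seconds_per_frame <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / seconds_per_frame


# Average per-frame runtimes reported above (1600x900 frames, YOLOv9).
latencies = {"CPU (Intel Xeon Gold)": 0.1159, "GPU (A5000)": 0.0055}

for device, latency in latencies.items():
    print(f"{device}: {frames_per_second(latency):.1f} fps")
# CPU (Intel Xeon Gold): 8.6 fps
# GPU (A5000): 181.8 fps
```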
**Reviewer's Point 2:** The reviewer wants to see more experiments on complex privacy requirements and crowded environments.
**Response 2:** We presented additional demonstrations with complex privacy specifications and crowded environments (e.g., many objects or many people) in the original supplementary material (see the video file). It includes complex specifications such as $\Box ( (\text{bicycle} \rightarrow \neg \text{person}) \land (\text{car} \vee \text{bus} \rightarrow \neg \text{person}))$. Furthermore, we added one more scenario (`house_privacy.mp4` in the rebuttal's supplementary zip file), similar to iRobot's Roomba case. In the demonstration, we conceal private features defined as $\Box(\neg \text{laptop} \wedge \neg \text{television} \wedge \neg \text{person})$ and maintain the privacy guarantee above the threshold of 0.80.
**Reviewer's Point 3:** The reviewer wants to see other object detection models as benchmarks in addition to VLM.
**Response 3:** Since object detection models alone cannot interpret privacy requirements, we integrate our method with other object detection models, including YOLOv9, Yolo-World-v2, and FasterRCNN, and compare how our method performs under different detection models. We present the results in **Figure 1 in rebuttal.pdf** (the PDF is in the zip file for rebuttal).
**Reviewer's Question 1:** What is the required performance (in terms of FPS and GPU) of your platform, and how can it be used in real-time on such devices? If a cloud-based service is used for detection, does it introduce latency? And how are the images sent to the cloud service preserved for privacy?
**Answer 1:** Please see **Response 1** above.
**Reviewer's Question 2:** Blurred images can be easily retrieved by an attacker using an ad-hoc network; can the method be used with other techniques used to ensure privacy? Does this require additional computational capabilities?
**Answer 2:** We use blurred images for demonstrations, but our method is not limited to blurring images. Since we can detect the bounding boxes for privacy-sensitive objects, we can also add other masks instead of blurring. An example is in **Figure 2 in the rebuttal.pdf**.
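As a minimal sketch of this alternative, the snippet below overwrites each detected bounding box with a solid fill, which destroys the original pixels rather than smoothing them. The image representation (a grayscale image as a list of rows), the box convention (`x_max` and `y_max` exclusive), and the name `mask_boxes` are assumptions made for illustration and do not describe our actual implementation.

```python
from typing import List, Tuple

Image = List[List[int]]          # assumed: grayscale image as rows of pixels
Box = Tuple[int, int, int, int]  # assumed: (x_min, y_min, x_max, y_max)


def mask_boxes(image: Image, boxes: List[Box], fill: int = 0) -> Image:
    """Return a copy of the image with every box overwritten by a solid fill."""
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # leave the input frame untouched
    for x0, y0, x1, y1 in boxes:
        # Clip the box to the image so out-of-range detections are harmless.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        for y in range(y0, y1):
            for x in range(x0, x1):
                masked[y][x] = fill
    return masked


frame = [[255] * 6 for _ in range(4)]
concealed = mask_boxes(frame, [(1, 1, 4, 3)])
for row in concealed:
    print(row)
```

Because the masked pixels are replaced by a constant, nothing about the concealed region can be recovered from the output frame itself.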
**Reviewer's Question 3:** How is the set of m images used for calibration collected and labeled?
**Answer 3:** We use a subset of 28,600 images from ImageNet as the calibration set. They are already labeled.
## Reviewer fPtk
**Technical Question 1:** How is labelling function $L$ chosen and computed?
**Response:** The labelling function determines which propositions a state satisfies. We describe its construction and computation in Algorithm 1 (lines 1, 3 and 9).
**Technical Question 2:** How is a bad prefix defined in Definition 3?
**Response:** We apologize for the confusion. In Definition 3, we meant to say that any finite prefix $\hat{\psi}$ of $\psi$ such that $P_{\textrm{safe}}\cap \{\psi'\in(2^{AP})^\omega |\, \hat{\psi} \, \mathrm{is\, a\, prefix\, of\, }\psi'\} = \emptyset$ is called a bad prefix. We will revise our manuscript to reflect this.
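For intuition, consider the special case of an invariant $\Box\varphi$ with $\varphi$ propositional: a finite prefix is bad exactly when one of its states violates $\varphi$, since no continuation can undo that violation. The sketch below finds the shortest such prefix. It assumes states are given as sets of the atomic propositions that hold, and the names `shortest_bad_prefix` and `no_exposed_face` are hypothetical.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence

State = FrozenSet[str]  # assumed: the atomic propositions true in one frame


def shortest_bad_prefix(
    trace: Sequence[State], state_ok: Callable[[State], bool]
) -> Optional[List[State]]:
    """Shortest bad prefix of a finite trace for the invariant Box(state_ok).

    Returns None when every state satisfies state_ok, i.e. the finite trace
    contains no bad prefix for this invariant.
    """
    for i, state in enumerate(trace):
        if not state_ok(state):
            return list(trace[: i + 1])
    return None


def no_exposed_face(state: State) -> bool:
    """person -> not face."""
    return not ("person" in state and "face" in state)


trace = [frozenset({"person"}), frozenset({"person", "face"}), frozenset()]
prefix = shortest_bad_prefix(trace, no_exposed_face)
print(len(prefix))  # 2: the violation occurs in the second state
```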
**Technical Question 3:** Different descriptions for probabilistic guarantees for frame sequence.
**Response:** We apologize for any confusion. The first mention of probabilistic guarantee on a frame sequence in Definition 1 is the true description. The second mention of the same in Section 3.1 (around line 160) is in fact not a different description. Here, we define $P(\psi)$, which is the probability of a particular trace. $P(\psi)$ is needed to calculate the probabilistic guarantee, and Equation 2 relates the overall probabilistic guarantee for the frame sequence with the probability of each trace.
**Empirical Question 1:** Do ImageNet and MS-COCO cover all object types for privacy protection?
**Response:** In our experiments, the classes from these two datasets cover all the object types in our privacy specifications. For example, some of the classes we cover are cellphone, laptop, television and human face.
**Empirical Question 2:** The proposed method does not seem to be compared with any existing approaches.
**Response:** Please refer to the **Related Work** section in `rebuttal.pdf` from the zip file we uploaded for rebuttal. In addition, we show some empirical comparisons with existing benchmarks in **Figures 1 and 5** in `rebuttal.pdf`.
**Empirical Question 3:** How about the incurred delay compared to the non-private version?
**Response:** We show the delay of our method in **Figure 3** in `rebuttal.pdf`. The delay is less than 0.1 seconds for each frame; hence, it is negligible.
**Minor issue 1:** Is $\sigma$ undefined?
**Response:** $\sigma$ denotes an element in the set $2^{|AP|}$. Intuitively, if we have two propositions, the set $2^{|AP|}$ will take the form \{\{True, True\}, \{True, False\}, \{False, True\}, \{False, False\}\}, and $\sigma$ will be used to iterate over these 4 elements in the loop of Algorithm 1 (i.e., $\texttt{for}$ $\sigma$ in $2^{|AP|}$ $\texttt{do}$...). We define $\sigma$ in Algorithm 1 (line 6). We will revise our manuscript to make this clearer.
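The iteration over $\sigma$ can be pictured with the short sketch below, which enumerates all truth assignments for a two-proposition example. The proposition names and the helper `truth_assignments` are placeholders chosen for illustration; they are not taken from Algorithm 1.

```python
from itertools import product
from typing import List, Sequence, Tuple


def truth_assignments(atomic_propositions: Sequence[str]) -> List[Tuple[bool, ...]]:
    """All True/False assignments, one tuple entry per atomic proposition."""
    return list(product([True, False], repeat=len(atomic_propositions)))


atomic_propositions = ["person", "face"]  # hypothetical AP with two entries

for sigma in truth_assignments(atomic_propositions):
    # Pair each proposition with its truth value under this sigma.
    print(dict(zip(atomic_propositions, sigma)))
```

With two propositions the loop visits four assignments, in the same order as the example listed above.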
**Minor issue 2:** The expression on line 224 should be $\Phi=\square\left(\neg p_1 \wedge \neg p_2 \wedge \neg p_3\right)$, right?
**Response:** Yes, thank you for pointing out this error; we will fix it.
**Minor issue 3:** The notation $\pi$ is reused for policy and path.
**Response:** Yes, thank you for pointing out this notation overload; we will fix it.