# Responses to all

We thank the reviewers for their constructive and insightful feedback. We answer questions separately for each reviewer. Further feedback and comments are very welcome!

# Responses to Reviewer Y9ed

1. **Results quantity and quality**: Qualitative results are shown both in Fig.5 (main paper) and in the Appendix (Fig.8-12). Specifically:
    - Appendix Fig.8 shows that CMC (ours) generalizes to the real-world dataset GREYC without further fine-tuning. Note that the duck example has a complex and non-uniform texture.
    - Appendix Fig.9-11 show additional results on S3D-Tex data for different scenarios and unseen objects.
    - Appendix Fig.12 shows failure cases where we test CMC (ours) on extremely hard examples (e.g., a stack of buckets of different colors).

    In addition, we provide even more qualitative results via this anonymous link: https://anonymous.4open.science/r/neurips22_1377_rebuttal-E708, where CMC yields more consistent and realistic results than baseline methods.
2. **Evaluation**: Thank you for suggesting the stronger in-painting baseline. Following the advice, we project normals and mesh vertex features into the UV space. We then re-train the 2D in-painting network, which now operates on features containing projected 3D information, to predict the entire UV-map. For scenario 2 (partial completion), the results are shown in the table below, where we refer to this new baseline as Tex-mapping*. We observe that the new baseline improves slightly over the vanilla texture-mapping method, but still falls short of CMC (ours) by a margin.

    | Methods | RMSE ↓ | PSNR-3D ↑ | SSIM-3D ↑ |
    | ---------- | ---------- | ---------- | ---------- |
    | Tex-mapping | 48.56 | 14.65 | 0.716 |
    | Tex-mapping* | 47.13 | 14.93 | 0.729 |
    | CMC (ours) | **40.76** | **16.58** | **0.771** |
3. **Fig.5 bottom row**: The guitar in the middle column demonstrates CMC (ours) results given few observations (L311-312). Specifically, only the side of the guitar is visible as the input color prior, while the goal is to recover the color of the entire mesh. This is a very challenging test case, and CMC (ours) predicts consistent and reasonable colors, albeit incorrect ones.
4. **Evaluation metrics**: We believe 3D evaluation metrics are important: they provide an unbiased comparison, whereas projection to a 2D plane hides artifacts and is susceptible to scale. Since 3D evaluation metrics have not been used or studied in prior work, we extend the 2D photorealism scores to 3D and observe that these metrics suit mesh colorization well. We believe the newly established benchmarks in this paper will be useful and informative for future research. Following the reviewer's suggestion, we further consider a 2D evaluation metric where multiple novel-view 2D rendering results are scored. Note that this 2D metric is inevitably biased since novel views need to be sampled at test time. Specifically, we render the ground-truth and the predicted meshes into 3 random novel views and compare the photometric scores (PSNR, SSIM, LPIPS); a minimal sketch of this view-based evaluation is given after this response block. The results for scenario 2 (partial completion) are shown below:

    | Methods | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
    | ---------- | ---------- | ---------- | ---------- |
    | Tex-mapping | 17.90 | 0.561 | 0.213 |
    | IF-Net | 14.35 | 0.419 | 0.304 |
    | MLP | 20.95 | 0.671 | 0.169 |
    | CMC (ours) | **22.67** | **0.734** | **0.134** |
5. **Eq.5 elaboration**: Variables $i$ and $j$ refer to the indices of two different color schemes, as stated in L262. For our task, they represent the predicted colorization and the GT colorization.
6. **L243**: Incomplete means the mesh color information is only partially available, i.e., only a subset of the vertices have known color. We'll clarify.
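To make the 2D view-based protocol in point 4 concrete, below is a minimal sketch of how per-view PSNR/SSIM could be aggregated over rendered novel views. It is illustrative only: the renderer, view sampling, and exact metric settings used in the paper are not reproduced here, and LPIPS (which requires a learned network, e.g., the `lpips` package) is omitted.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def view_based_scores(gt_views, pred_views, data_range=255):
    """Average PSNR/SSIM over paired lists of rendered (H, W, 3) uint8 views.

    Illustrative sketch only; it does not reproduce the paper's exact setup.
    `channel_axis` requires scikit-image >= 0.19 (older versions use
    `multichannel=True`).
    """
    psnr, ssim = [], []
    for gt, pred in zip(gt_views, pred_views):
        psnr.append(peak_signal_noise_ratio(gt, pred, data_range=data_range))
        ssim.append(structural_similarity(gt, pred, data_range=data_range,
                                          channel_axis=-1))
    return float(np.mean(psnr)), float(np.mean(ssim))
```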
# Responses to Reviewer zikc

1. **Scale-up**: There are in fact multiple ways to scale our method (Eq.1). For example:
    - Batching: we can process the input set $\mathcal{V}_t$ in batches. Doing so makes the computation proportional to $B\times|\mathcal{V}_s|$, where the batch size $B$ can be adjusted according to the available hardware (a minimal sketch is given after the references at the end of this response block).
    - Linear attention: our method adopts the vanilla attention mechanism (quadratic complexity) for simplicity. However, many linear attention mechanisms have been proposed recently [a,b,c] and achieve competitive results. Using these modules would reduce computation.
    - Down-sampling: even though meshes often contain millions of vertices, these vertices are often redundant w.r.t. color and can be down-sampled. E.g., we can design sampling strategies to extract diverse yet representative vertices as seeds. This reduces the size of $\mathcal{V}_s$.
2. **High-fidelity mesh**: The collected dataset S3D-Tex contains meshes with high fidelity and complex textures, as shown in Fig.3. In our experiments, we focus on vertex colorization and use the vertex color as training data. To vividly recover the high-fidelity meshes shown in Fig.3, we re-meshed the 3D models and tested the current pre-trained model. Qualitative results are shown at this anonymous link: https://anonymous.4open.science/r/neurips22_1377_rebuttal-E708.
3. **3D interactive editing**: We believe use of the palette is actually an advantage for 3D interactive editing. If a user only brushes a single color, we think the most reasonable result is a single-color mesh. Note that the baseline methods cannot achieve this, as their output space is unconstrained and the results are often noisy. In contrast, by introducing the palette, our prediction is guaranteed to be aligned with the user input.
4. **Color palette size K**: In our experiments, we kept K constant across different datasets (S3D-Tex and GREYC) to show generality. We determine the "optimal K" via empirical experiments on S3D-Tex. It is true that K cannot be set too small, as doing so hurts the palette capacity.
5. **[Minor] Fig.5 color**: The collected S3D-Tex is rendered under different lighting (L216), which causes the obtained images to look darker than the real vertex color (due to lighting, reflectance, and shadow effects, as mentioned in L139-141). As shown in Fig.5, CMC (ours) is able to recover the real surface color despite these factors, which is valuable for many real-world applications.
6. **[Minor] Embedding size**: We found the optimal embedding size of 16 via an empirical study. Increasing the embedding size would increase the capacity of the framework but may also slow down training convergence.
7. **[Minor] Title**: We use the word "colorful" in the title because our method recovers more vivid mesh colors, while baselines often converge to a single color or yield noisy predictions.

References:
a. Reformer: The Efficient Transformer, Kitaev et al., 2020
b. Linformer: Self-Attention with Linear Complexity, Wang et al., 2020
c. Efficient Attention: Attention with Linear Complexities, Shen et al., 2018
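As referenced in point 1 above, here is a minimal sketch of the batching idea, assuming a generic cross-attention in which target vertices $\mathcal{V}_t$ (queries) attend to seed vertices $\mathcal{V}_s$ (keys/values); tensor names and shapes are illustrative and do not reproduce the exact CMC architecture.

```python
import torch
import torch.nn.functional as F

def chunked_cross_attention(q, k, v, chunk_size=4096):
    """Cross-attention from target vertices (queries) to seed vertices (keys/values),
    processed in query chunks so the attention matrix has at most
    chunk_size x |V_s| entries at a time instead of |V_t| x |V_s|.
    Shapes: q (Nt, d), k (Ns, d), v (Ns, d)."""
    scale = q.shape[-1] ** -0.5
    outputs = []
    for q_chunk in q.split(chunk_size, dim=0):
        attn = torch.softmax(q_chunk @ k.t() * scale, dim=-1)  # (chunk, Ns)
        outputs.append(attn @ v)                               # (chunk, d)
    return torch.cat(outputs, dim=0)                           # (Nt, d)
```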
# Responses to Reviewer f5i6

1. **Ablation on the influence of the number of vertices with known color**: The absolute number of vertices is not a good indicator of visibility, as the number of vertices varies greatly across meshes. Instead, we define visibility as the percentage of visible vertices in scenario 2, and we have provided a detailed study in **Appendix C (Tab.5-8)**. In this study, we partition results across three different visibility levels. We encourage the reviewer to take a second look.
2. **Category-specific prior**: We agree that a class-specific prior is useful for many 3D tasks. Nonetheless, in our formulation, we did not enforce priors such as symmetry and semantics for the following reason: while such priors help certain classes, they hurt others and make the method more complex. Going forward, introducing category-specific priors or generic systematic regularization is certainly very interesting future work.
3. **Multi-view visualization**: Following the reviewer's suggestion, we provide additional results via this anonymous link: https://anonymous.4open.science/r/neurips22_1377_rebuttal-E708. We observe that CMC yields more consistent and vivid results than the baselines.
4. **Ablation on input embedding**: We conducted ablation studies for the CMC model in scenario 2 (partial completion). Specifically, the input embedding in this scenario can be simplified to $\left[v_{c_0}; v_\text{pos}; v_\text{nml}\right]$, where $v_\text{vis}$ is set to zero and $v_\omega$ is turned off compared to Eq.4 (see the sketch after this response block). The ablation study results are shown below, where we find that 1) removing the positional encoding leads to the biggest performance drop; and 2) all three features are important for the final performance. We'll incorporate this study in the final version.

    | Methods | RMSE ↓ | PSNR-3D ↑ | SSIM-3D ↑ |
    | ---------- | ---------- | ---------- | ---------- |
    | CMC (ours) | **40.76** | **16.58** | **0.771** |
    | w/o $v_{c_0}$ | 44.89 | 15.91 | 0.726 |
    | w/o $v_\text{pos}$ | 46.93 | 14.07 | 0.711 |
    | w/o $v_\text{nml}$ | 45.07 | 15.55 | 0.724 |
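For clarity, below is a minimal sketch of the per-vertex feature construction used in this ablation, assuming the color prior, position, and normal features are already encoded as per-vertex vectors; the actual encoders of Eq.4 (e.g., the positional-encoding dimensionality) are not reproduced, and zeroing a component corresponds to the "w/o" rows above.

```python
import torch

def build_vertex_embedding(color_prior, position, normal,
                           use_color=True, use_pos=True, use_nml=True):
    """Concatenate per-vertex features [v_c0; v_pos; v_nml], each of shape (N, d_i).
    Setting a flag to False zeroes that component, mimicking the 'w/o' ablations."""
    parts = [
        color_prior if use_color else torch.zeros_like(color_prior),
        position if use_pos else torch.zeros_like(position),
        normal if use_nml else torch.zeros_like(normal),
    ]
    return torch.cat(parts, dim=-1)  # (N, d_c0 + d_pos + d_nml)
```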
# Responses to Reviewer Eazn

1. **Problem setting/training**:
    - As stated in L129ff, the rendered images are only used for scenario 3 (image-based colorization). They are not used in the other two scenarios.
    - In both scenario 1 (3D interactive colorization) and scenario 2 (partial completion), the input is merely a mesh where only a subset of the vertices is colorized. We train CMC (ours) in a class-agnostic setting, i.e., no category information is provided as input.
    - Regarding models, we train two models in total: one for scenario 3 (image-based colorization), which incorporates an image encoder (L183-186), and one for scenario 2 (partial completion), which is also tested in scenario 1 (3D interactive colorization) without further fine-tuning.
    - For rendering images, we use 5 different views randomly sampled on a hemisphere, with the object located at the center. For testing, all 5 views of the same mesh are used.
2. **Motivation of the attention-based palette**: We observed that direct prediction of the color values (MLP baseline) often leads to noisy and degenerate results. We hence introduce the palette, which constrains the predicted color to be a convex combination of K possible colors (the color palette). This yields smooth predictions without losing fidelity (a minimal sketch of this convex combination is given at the end of this response block).
3. **MLP baseline**: The MLP baseline is a direct regression network conditioned on the color prior. Essentially, it operates on the constructed features (Eq.4) and predicts the color value without a palette (L251-253). The color prior is encoded via Eq.4.
4. **IF-Net results**: Following the reviewer's suggestion, we trained an IF-Net on S3D-Tex data. Given the limited time during the rebuttal period, we train IF-Net for scenario 2 (partial completion) and will add scenario 3 in the final version. The results are shown below, where we observe that re-training improves results slightly, but the obtained IF-Net still falls significantly behind CMC in performance.

    | Methods | RMSE ↓ | PSNR-3D ↑ | SSIM-3D ↑ |
    | ---------- | ---------- | ---------- | ---------- |
    | IF-Net | 55.18 | 13.61 | 0.688 |
    | IF-Net (re-trained) | 52.73 | 14.24 | 0.712 |
    | CMC (ours) | **40.76** | **16.58** | **0.771** |
5. **More baselines (Texture Fields [46], PIFu [50], IF-Net [6])**: We already compared to IF-Net [6] in the paper and in this rebuttal. We exclude Texture Fields [46] and PIFu [50] because they are class-specific methods. In contrast, we focus on a more generic setting: CMC (ours) colorizes meshes in a class-agnostic manner.
6. **Visual comparison**: Besides the visual comparisons in Fig.5, additional qualitative results can be found in Appendix Fig.8-12. Specifically:
    - Appendix Fig.8 shows that CMC (ours) generalizes to the real-world dataset GREYC without further fine-tuning. Note that the duck example has a complex and non-uniform texture.
    - Appendix Fig.9-11 show additional results on S3D-Tex data for different scenarios and unseen objects.
    - Appendix Fig.12 shows failure cases where we test CMC (ours) on extremely hard examples (e.g., a stack of buckets of different colors).

    In addition, we provide more qualitative results via this anonymous link: https://anonymous.4open.science/r/neurips22_1377_rebuttal-E708, where we show additional results, multi-view visualizations, and visualizations of high-fidelity meshes (as requested by different reviewers).
7. **Why does CMC generalize to unseen classes**: We believe the generalization is mainly due to two reasons: 1) large-scale pre-training on the introduced dataset S3D-Tex greatly helps CMC generalize to unseen classes, as our dataset is diverse in both categories and colors, photorealistic, and large-scale; 2) the introduced palette module helps prevent the prediction from deviating, as it regularizes colors to closely align with the input prior. Additional geometric information could be further leveraged either as input embedding or as additional regularization, which we leave to future work.
8. **Effects of viewpoints**: Limited visibility is challenging for our framework, as is generally true for any conditional prediction task. In the bottom row of Fig.5, we illustrate such a case: the guitar is only observed from the side in the input. In this example, CMC predicts consistent colors on the back of the guitar, albeit incorrect ones. We also provide a detailed study of the influence of visibility on the final performance, where we partition results across three different visibility levels. We refer the reviewer to Appendix C (Tab.5-8) for more details.
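To illustrate the palette mechanism referenced in point 2 above, here is a minimal sketch, assuming per-vertex scores over K palette colors are available; the actual attention-based palette module in the paper may differ in its details.

```python
import torch

def palette_colorize(vertex_logits, palette):
    """Map per-vertex scores over K palette entries to vertex colors.

    vertex_logits: (N, K) scores; palette: (K, 3) RGB palette colors.
    Softmax weights are non-negative and sum to one, so every output color is
    a convex combination of the K palette colors, which keeps predictions
    aligned with the palette (and hence with the input color prior).
    """
    weights = torch.softmax(vertex_logits, dim=-1)  # (N, K), convex weights
    return weights @ palette                        # (N, 3) vertex colors
```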