Table R1: Ablation on learning spatiotemporally-varying appearance for novel view video synthesis on the ScalarFlow real dataset.
| | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
|-----------------|------|------|-------|
| Ours w/ SV color| 31.07 | 0.9316 | **0.0881** |
| Ours w/o SV color| **31.14** | **0.9330** | 0.0966 |
Table R2: Comparison to NeuroFluid on novel view video synthesis on the ScalarFlow real dataset.
| | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
|-----------------|------|------|-------|
| NeuroFluid | 22.41 | 0.8452 | 0.1560 |
| Ours | **31.14** | **0.9330** | **0.0966** |
Table R3-A: Comparison to GlobTrans on novel view video synthesis on the ScalarFlow real dataset.
| | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
|-----------------|------|------|-------|
| GlobTrans | 25.97 | 0.9312 | **0.0783** |
| Ours | **31.14** | **0.9330** | 0.0966 |
Table R3-B: Comparison to GlobTrans on novel view re-simulation on the ScalarFlow real dataset.
| | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
|-----------------|------|------|-------|
| GlobTrans | 24.55 | 0.8988 | **0.1017** |
| Ours | **28.37** | **0.9158** | 0.1171 |
Table R5-A: Evaluation on synthetic data with low viscosity. Density and warp errors are measured by scale-invariant RMSE (SiRMSE).
| | Density $\downarrow$ | Warp error$\downarrow$|PSNR$\uparrow$|SSIM$\uparrow$|LPIPS$\downarrow$|
|-----------------|------|------|-|-|-|
| PINF | 5.01 | 4.84 | 23.43 | 0.8555 | 0.2153 |
| Ours | **2.94** | **3.37** | | | |
Table R5-B: Evaluation on synthetic data with high viscosity. Density and warp errors are measured by scale-invariant RMSE (SiRMSE).
| | Density $\downarrow$ | Warp error$\downarrow$|PSNR$\uparrow$|SSIM$\uparrow$|LPIPS$\downarrow$|
|-----------------|------|------|-|-|-|
| PINF | 4.91 | 4.91 | 27.26 | **0.8728** | 0.1537 |
| Ours | **2.85** | **3.21** | | | |
## Global response (common questions and one-pager PDF for figures and tables)
We thank all reviewers for their time and feedback. We clarify that since our goal is to reconstruct plausible fluid fields from real videos, **all experiments in our main paper are on real captured data, as specified in L201**. Please find a summary of the major changes and responses to some common questions below. We will incorporate these changes into our revised paper.
**Summary of major changes per reviewers' suggestions**:
1. [*qUcR*, *3dGa*, *4cV8*, *MRuw*] We add evaluations on synthetic data that provides groundtruth 3D fields.
2. [*qUcR*, *3dGa*, *4cV8*, *MRuw*] We add evaluations on different viscosity levels using synthetic data.
3. [*5YF8*, *MRuw*] We add an ablation study on using spatiotemporally-varying appearance.
4. [*3dGa*, *4cV8*, *MRuw*] We add a discussion of and comparison to GlobTrans, a state-of-the-art non-learning-based fluid reconstruction method.
5. [*qUcR*, *3dGa*, *MRuw*] We add a discussion of and comparison to NeuroFluid, a recent fluid dynamics learning method.
6. [*4cV8*] We improve velocity visualization using slice rendering.
7. [*MRuw*] We add more references to and discussion of recent related work.
**[*qUcR*, *3dGa*, *4cV8*, *MRuw*] Evaluation on synthetic data**:
We include synthetic examples for evaluating the 3D density and velocity fields. We use the ScalarFlow synthetic dataset generation code [Eckert2019] to generate five examples with different inflow sources (randomized inflow area and density distribution) at higher viscosity, and another five examples at lower viscosity. Since numerical viscosity is unavoidable, we simply use different simulation domain resolutions to synthesize fluids with different viscosity levels: 100x178x100 for the low-viscosity group and 80x142x80 for the high-viscosity group.
We compare to the state-of-the-art neural fluid reconstruction method PINF [Chu2022], which has shown competitive results in 3D fluid field reconstruction. Since the reconstruction matches the simulation groundtruth only up to a scale factor, we use scale-invariant RMSE (SiRMSE) to measure performance, and we only compute metrics where the groundtruth density is greater than $0.1$ to rule out empty space (which is otherwise dominant). In particular, we report the volumetric density error (by querying the density network at the simulation grid points) to evaluate density prediction, and the warp error (i.e., using the predicted velocity to advect the density and comparing against the GT density) to evaluate both density and velocity prediction. We also report novel view re-simulation results. As shown in Table 1 below, ours outperforms PINF on all metrics, on both the 3D fields and the 2D renderings, consistent with our observations on real data in the main paper. We show qualitative examples in Figures R1 and R2 in the PDF.
Table 1: Evaluation on **synthetic** data.
|| Density error$\downarrow$| Warp error$\downarrow$ |PSNR$\uparrow$|SSIM$\uparrow$|LPIPS$\downarrow$|
|-|-|-|-|-|-|
| PINF | 4.95 | 4.88 | 25.34 | 0.8641 | 0.1845 |
| Ours | **2.89**| **3.29** |**27.93** |**0.8643**|**0.1259**|
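For reference, a minimal sketch of how these two 3D metrics could be computed on the simulation grid (the one-step semi-Lagrangian backtrace, the array names `pred_density_t`, `pred_velocity_t`, `gt_density_t`, `gt_density_t1`, and the velocity layout are illustrative assumptions rather than our exact implementation):
```python
import numpy as np
from scipy.ndimage import map_coordinates

def si_rmse(pred, gt, mask):
    """Scale-invariant RMSE over masked voxels: fit the scalar s that best
    aligns pred to gt in a least-squares sense, then take the RMSE."""
    p, g = pred[mask], gt[mask]
    s = np.dot(p, g) / (np.dot(p, p) + 1e-8)  # optimal scale factor
    return np.sqrt(np.mean((s * p - g) ** 2))

def advect(density, velocity, dt=1.0):
    """One-step semi-Lagrangian backtrace: density lives on a (D, H, W) grid;
    velocity has shape (3, D, H, W), in voxels per frame."""
    D, H, W = density.shape
    grid = np.stack(np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                                indexing="ij"), axis=0).astype(float)
    back = grid - dt * velocity                          # backtraced sample positions
    upper = np.array([D - 1, H - 1, W - 1], float).reshape(3, 1, 1, 1)
    back = np.clip(back, 0.0, upper)
    return map_coordinates(density, back, order=1)       # trilinear lookup

# Density error at frame t: query the density network at the simulation grid
# points (pred_density_t), masking out near-empty space using the GT density.
mask = gt_density_t > 0.1
density_err = si_rmse(pred_density_t, gt_density_t, mask)

# Warp error: advect the predicted density by the predicted velocity for one
# frame and compare against the GT density at frame t+1.
warped = advect(pred_density_t, pred_velocity_t)
warp_err = si_rmse(warped, gt_density_t1, gt_density_t1 > 0.1)
```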
**[*qUcR*, *3dGa*, *4cV8*, *MRuw*] Evaluation on different viscosity levels**:
In addition to the overall evaluation above, we also include separate evaluations at the two viscosity levels. We show the results in Tables R5-A and R5-B in the PDF. We see that ours outperforms the SOTA method PINF on all metrics except for high-viscosity SSIM, likely because the high-viscosity velocity fields are dominated by laminar flows, which PINF tends to recover.
**[*qUcR*, *3dGa*, *MRuw*] Comparison to NeuroFluid**:
NeuroFluid [Guan2022] is a recent method for learning fluid dynamics. We use the official NeuroFluid code and their released pretrained transition model, and train it on the ScalarFlow real dataset following their instructions. NeuroFluid assumes a known initial state, while real data provides no groundtruth initial state; thus, we use the SOTA fluid reconstruction method PINF [Chu2022] to reconstruct the first frame and use it as the initial state. We include a comparison to NeuroFluid in Table R2 and Figure R3 in the PDF.
From the results we observe that NeuroFluid does not produce meaningful novel view synthesis, since it does not target real fluid reconstruction.
**[*3dGa*, *4cV8*, *MRuw*] Comparison to GlobTrans**:
GlobTrans [Franz2021] is a global grid optimization-based method specifically designed for fluid re-simulation and reconstruction. We use the official code release and show a comparison in Tables R3-A and R3-B and Figure R4 in the PDF.
From the results we observe that ours achieves much better reconstruction fidelity, reflected by higher PSNR and SSIM, while GlobTrans yields a better LPIPS. We note that GlobTrans assumes known lighting and uses a more sophisticated shading model; in contrast, ours does not assume known lighting and is thus more general.
**[*5YF8*, *MRuw*] Spatiotemporally varying appearance**:
We add an ablation on predicting spatiotemporally-varying color to account for spatially-varying lighting. We show the novel view video synthesis results (which mainly evaluate appearance) in Table R1 and Figure R5 in the PDF.
We observe that this does not lead to significant differences on the ScalarFlow real dataset. This may be because the fluid material is homogeneous and the capture environment is controlled. For more complex scenes with complex lighting, using spatially-varying color may help further.
**Reference**:
[Chu2022] Physics informed neural fields for smoke reconstruction with sparse data. TOG2022
[Eckert2019] ScalarFlow: a large-scale volumetric data set of real-world scalar transport flows for computer animation and machine learning. TOG2019
[Guan2022] Neurofluid: Fluid dynamics grounding with particle-driven neural radiance fields. ICML2022
[Franz2021] Global transport for fluid reconstruction with learned self-supervision. CVPR2021
## R1 5YF8 (borderline accept)
Thank you for your time and comments! Please see our response below.
- **Physical intuition on laminar loss**:
Laminar flow does not manifest as local density change, and thus inferring it from purely visual observations is challenging. We introduce this regularization term to account for the fact that even in constant-density fluid regions, there can still be laminar flow. Since this also depends on prior knowledge of the fluid to reconstruct, the laminar loss takes a flexible form: the hyper-parameter $\gamma$ models the prior belief of having laminar flow, e.g., $\gamma=0$ allows zero velocity even in high-density regions.
- **Further exploration on visual appearance**:
Following your suggestion, we allow learning spatially and temporally varying appearance. Please see the results and discussion under "Spatiotemporally varying appearance" in the global response above.
- **Rendering parameter**:
We use volumetric rendering in our formulation (Eq. (3) in the main paper). The parameters include the near plane $t_n$, the far plane $t_f$, and the number of samples for numerical integration $N$, where $t_n$ and $t_f$ are determined by centering the sampling range at the scene center (the geometric center of the camera principal rays) and empirically scaling it. We find that as long as there are enough samples (in our case, more than 64), the numerical integration is stable and thus the results are not sensitive to these parameters.
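For completeness, a minimal sketch of the numerical integration we refer to (standard NeRF-style quadrature of Eq. (3); the function names and the uniform sampling scheme are illustrative assumptions, not necessarily our exact implementation):
```python
import torch

def render_rays(density_fn, color_fn, rays_o, rays_d, t_n, t_f, N=64):
    """Integrate Eq. (3) along each ray with N uniformly spaced samples
    between the near plane t_n and the far plane t_f."""
    t = torch.linspace(t_n, t_f, N, device=rays_o.device)               # (N,)
    pts = rays_o[:, None, :] + t[None, :, None] * rays_d[:, None, :]    # (R, N, 3)
    sigma = density_fn(pts)                                              # (R, N) densities
    rgb = color_fn(pts)                                                  # (R, N, 3) colors
    delta = (t_f - t_n) / N                                              # uniform step size
    alpha = 1.0 - torch.exp(-sigma * delta)                              # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]                                                   # transmittance
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)                         # (R, 3) pixel colors
```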
- **On MGPCG**:
We use MGPCG in every training step to compute the projection loss. We implement MGPCG in Taichi [Hu2019], which allows GPU acceleration. We use three levels at a grid resolution of 128^3. On average, this costs around 100ms per step. Thanks to our efficient implementation, the whole training takes only ~9 hours on an A100 GPU, as noted in our supplementary material.
**Reference**:
[Hu2019] Hu, Y., Li, T. M., Anderson, L., Ragan-Kelley, J., & Durand, F. (2019). Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics (TOG), 38(6), 1-16.
## R2 qUcR (Reject)
Thank you for your time and comments! Please see our response below.
- **Relation to NeuroFluid**:
We clarify that our setting is different from NeuroFluid's. NeuroFluid focuses on learning fluid dynamics from a large amount of data; it is therefore trained and evaluated on synthetic data, whose distribution can be too different from real data to generalize. Moreover, it assumes a known initial state and no inflow source, which does not hold in some real scenes such as the smoke plume scenes we use. In contrast, our goal is to reconstruct plausible fluid fields from real sparse multi-view videos without assuming additional training data, so as to facilitate applications on real fluid videos such as novel view synthesis, re-simulation, future prediction, and turbulence editing.
- **Comparison to NeuroFluid**:
We show a comparison to NeuroFluid in the "Comparison to NeuroFluid" in the global response above. Please note that since we use a real dataset for evaluation (as specified in L201 in our main paper), we do not have groundtruth initial states that NeuroFluid requires as input. Therefore, we use the previous state-of-the-art method PINF [Chu2022] to reconstruct the first frame for the initial state of NeuroFluid. We will add reference and discussion to our paper.
- **Physics-based losses for visual ambiguity and supervision from velocity**:
We clarify that we aim at reconstructing plausible fluid velocity fields from real videos to allow re-simulation and future prediction. We do not assume training data with groundtruth velocity, as such data is very scarce for real scenes. Therefore, we propose physics-based losses to regularize the recovery of the fluid velocity such that it is physically plausible for the downstream applications.
- **Initial state of fluid**:
Different from NeuroFluid which requires initial states and learns fluid dynamics, we aim at reconstructing the fluid fields. Thus, the "initial state" is our model output rather than input.
- **Evaluating 3D density**:
We clarify that all our experiments are done on **real videos** which do not have groundtruth 3D density and velocity fields. Therefore, we indirectly evaluate them by downstream applications including novel view synthesis, re-simulation, and future prediction. Following your suggestion, we additionally evaluate our method on synthetic examples which provide groundtruth 3D density. Please refer to the "Evaluation on synthetic data" in the global response above to see the results and discussion.
- **Evaluating rendering results**:
We clarify that we evaluate rendering results in a hold-out novel view that is unseen during training (L205-L206 in the main paper). In particular, each example in the ScalarFlow real dataset and synthetic dataset has 5 views. We take 4 views for training, and 1 view for testing. We use this training-testing split for all our experiments including novel view video synthesis, novel view re-simulation, and novel view future prediction.
- **Experiments on different examples**:
Please note that the additional synthetic data in "Evaluation on different viscosity levels" in the global response demonstrate different material properties (e.g., they are more viscous than real smoke), shapes, and inflows. We believe these additional examples provide more diverse evaluations.
**Reference**:
[Chu2022] Chu, M., Liu, L., Zheng, Q., Franz, E., Seidel, H. P., Theobalt, C., & Zayer, R. (2022). Physics informed neural fields for smoke reconstruction with sparse data. ACM Transactions on Graphics (TOG), 41(4), 1-14.
## R3 3dGa (Reject)
Thank you for your time and comments. Please see our response below.
- **Detailed presentation of experiment results**:
We clarify that we aim at reconstructing plausible fluid velocity fields from real videos to allow re-simulation and future prediction. For real videos, it is very challenging to collect groundtruth 3D density and velocity.
Therefore, we evaluate on downstream applications including novel view video synthesis, novel view re-simulation, and novel view future prediction. For each of them, we show detailed quantitative results in Table 1 (three metrics per task) and qualitative results in Figures 3, 4, and 7 in our main paper. In addition, we show qualitative results on turbulence editing and velocity recovery in Figures 5 and 6. These downstream applications reflect that our approach can reconstruct plausible real fluid fields.
- **Additional quantitative evaluation**:
In addition to the experiments on real fluid videos, we include new experiments on synthetic fluid scenes. These scenes provide 3D groundtruth density and velocity, allowing quantitative evaluations on them. Please refer to the "Evaluation on synthetic data" and "Evaluation on different viscosity levels" in the global response for results and discussion.
- **No comparison with other methods in addressing the visual ambiguity of fluid velocity**:
We respectfully disagree. We clarify that we have comparisons to PINF [Chu2022] and NeRFlow [Du2021], both of which aim to address the visual ambiguity of velocity/flow estimation from real videos, and both showcase plausible reconstructions of fluid scenes in their results. In particular, PINF [Chu2022] is the state-of-the-art fluid reconstruction method, which addresses visual ambiguity with physics-informed losses similar to Physics-informed Neural Networks (PINN) [Raissi2019]; PINF shows extensive results on synthetic scenes and a few real examples. NeRFlow approaches this with a set of temporal consistency losses and showcases fluid reconstruction in its "milk pouring" scene. Our comparison to them in Table 1 and Figures 3, 4, 5, and 7 clearly demonstrates that our approach achieves better results than these existing methods.
- **Additional comparison with other methods**:
In addition to PINF and NeRFlow, we compare to NeuroFluid [Guan2022] and GlobTrans [Franz2021]. NeuroFluid learns fluid dynamics to address velocity ambiguity. GlobTrans aims to reconstruct fluid fields using advection constraints and regularization terms to resolve the visual ambiguity of the fluid velocity. Please refer to "Comparison to NeuroFluid" and "Comparison to GlobTrans" in the global response for results and discussion.
- **Computation time and memory consumption**:
As noted in L215 of our main paper, we report the computational resource usage in our supplementary material. As stated in L29 of our supplementary material, we train our model on a single A100 GPU (using around 30GB of GPU memory) for around 9 hours in total.
**Reference**:
[Chu2022] Chu, M., Liu, L., Zheng, Q., Franz, E., Seidel, H. P., Theobalt, C., & Zayer, R. (2022). Physics informed neural fields for smoke reconstruction with sparse data. ACM Transactions on Graphics (TOG), 41(4), 1-14.
[Du2021] Du, Y., Zhang, Y., Yu, H. X., Tenenbaum, J. B., & Wu, J. (2021, October). Neural radiance flow for 4d view synthesis and video processing. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 14304-14314). IEEE Computer Society.
[Raissi2019] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378, 686-707.
[Guan2022] Guan, S., Deng, H., Wang, Y., & Yang, X. (2022, June). Neurofluid: Fluid dynamics grounding with particle-driven neural radiance fields. In International Conference on Machine Learning (pp. 7919-7929). PMLR.
[Franz2021] Franz, E., Solenthaler, B., & Thuerey, N. (2021). Global transport for fluid reconstruction with learned self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1632-1642).
## R4 4cV8 (Weak accept)
Thank you for your insightful comments and constructive suggestions! Please see our response below.
- **Comparison to ScalarFlow and GlobTrans**:
Thank you for your suggestion! Since both ScalarFlow and GlobTrans are optimization-based methods, we compare to GlobTrans, which is the later work and has shown better results than ScalarFlow. Please refer to "Comparison to GlobTrans" in the global response above for results and discussion.
- **Synthetic data for 3D evaluation**:
Following your suggestion, we additionally include synthetic examples using ScalarFlow synthetic dataset generation code. Please refer to the "Evaluation on synthetic data" and "Evaluation on different viscosity levels" in the global response above for results and discussion.
- **Self-advection of velocity**:
We do not include a physical loss for the self-advection of velocity, such as $\mathcal{L}=\|D\mathbf{u}/Dt-\mathbf{f}\|$ (where $\mathbf{f}$ denotes external force), as we empirically found that it often leads to oversmoothed velocity fields. This may be because the material derivative of velocity, $D\mathbf{u}/Dt=\partial \mathbf{u}/\partial t+(\mathbf{u}\cdot\nabla)\mathbf{u}$, admits local trivial solutions in which the velocity is spatiotemporally constant. In optimization-based methods such as GlobTrans, this local trivial solution is avoided by a global optimization. However, in neural continuous reconstruction this is not straightforward to address, and we leave it for future exploration. In fact, we suspect this is the reason that PINF (which uses a velocity advection loss) reconstructs only laminar flows for real videos.
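To make the trivial solution explicit (a short derivation for illustration):
$$
\mathbf{u}(\mathbf{x},t)\equiv\mathbf{c}
\;\;\Rightarrow\;\;
\frac{\partial \mathbf{u}}{\partial t}=\mathbf{0},\quad
(\mathbf{u}\cdot\nabla)\,\mathbf{u}=\mathbf{0}
\;\;\Rightarrow\;\;
\frac{D\mathbf{u}}{Dt}=\mathbf{0},
$$
so with negligible external force ($\mathbf{f}\approx\mathbf{0}$), any spatiotemporally constant velocity drives the self-advection residual to zero regardless of the observed density motion, which biases the optimization toward oversmoothed fields.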
- **Render velocity field as slice**:
We rendered the velocity field to 2D by projecting every 3D velocity vector onto the camera plane and then using volumetric rendering to integrate them. This indeed smooths the visualization. Following your suggestion, we additionally include slice renderings in Figure R6 in the global response PDF. From the comparison we can see that our velocity field recovers more vortical details than PINF's. We will include this figure in our revised paper.
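For reference, a minimal sketch of what such a slice rendering can look like (a mid-plane slice of the velocity grid colored by in-plane direction and magnitude; the grid layout and HSV mapping are illustrative assumptions, not necessarily the exact visualization used in Figure R6):
```python
import numpy as np
from matplotlib.colors import hsv_to_rgb
import matplotlib.pyplot as plt

def render_velocity_slice(velocity, axis=2, index=None):
    """velocity: (3, D, H, W) grid. Slice perpendicular to `axis` (0/1/2 for
    D/H/W) and color by in-plane flow direction (hue) and magnitude (value)."""
    if index is None:
        index = velocity.shape[axis + 1] // 2            # mid-plane by default
    vel = np.take(velocity, index, axis=axis + 1)        # (3, ...) 2D slice
    in_plane = np.delete(vel, axis, axis=0)              # drop out-of-plane component
    u, v = in_plane[0], in_plane[1]
    mag = np.sqrt(u ** 2 + v ** 2)
    hue = (np.arctan2(v, u) + np.pi) / (2.0 * np.pi)     # direction -> hue in [0, 1]
    val = mag / (mag.max() + 1e-8)                       # magnitude -> brightness
    hsv = np.stack([hue, np.ones_like(hue), val], axis=-1)
    return hsv_to_rgb(hsv)

# Example usage on a reconstructed velocity grid `velocity_grid`:
# plt.imshow(render_velocity_slice(velocity_grid)); plt.axis("off"); plt.show()
```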
## R5 MRuw (Borderline Reject)
Thank you for your insightful comments and constructive suggestions! Please see our response below.
For methodology:
1. **Technical novelty**:
[Chu et al.] is indeed the most relevant. Our approach incorporates novel losses, including the projection loss and the laminar loss, as well as a hybrid representation to capture turbulent velocity fields. This technical novelty leads to better velocity reconstruction compared to [Chu et al.], as can be seen in novel view re-simulation and velocity visualization (especially the videos in the supplementary material). [Baieri et al.] and [Li et al.] focus on general dynamic objects and do not consider complex fluid dynamics such as turbulence. We will add these references and discussion to our paper.
2. **Spatially-varying appearance modeling**:
Following your suggestion, we allow learning spatially and temporally varying appearances. Please see the results and discussion in the "Spatiotemporally varying appearance" in the global response above.
For reference:
3. **References**:
Thank you for the note. We will add these references to our paper. We will also clarify that our setting is different from NeuroFluid's: NeuroFluid focuses on learning fluid dynamics from a large amount of data and therefore trains and evaluates on synthetic data; moreover, it assumes a known initial state and no inflow source. In contrast, our goal is to reconstruct plausible fluid fields from real sparse multi-view videos without assuming any training data.
For experiments:
4. **Different viscosity level**:
Following your suggestion, we include additional synthetic data experiments which have different viscosity levels. Please refer to the "Evaluation on different viscosity levels" in the global response above for results and discussion.
5. **Spatially-varying appearance modeling**: Please see response to 2. above.
6. **Comparison to other existing methods**:
We include NeuroFluid and GlobTrans [Franz2021] as additional compared methods. Please refer to the "Comparison to GlobTrans" and "Comparison to NeuroFluid" in the global response above.
7. **Quantitative results regarding velocity field reconstruction**:
We clarify that our goal is to reconstruct plausible fluid fields from real videos, and thus in our experiments in the main paper **we only use real data** (as specified in L201) which does not provide groundtruth 3D fields but only multi-view videos. To evaluate 3D fields, we include experiments on synthetic examples in the "Evaluation on synthetic data" in the global response.
For questions:
1. **How to compute losses**:
For $\mathcal{L}_\text{density}$, in each training step we randomly select one timestamp (i.e., one frame) and sample continuous points in 3D space. For these points, we compute the first-order derivatives via PyTorch's automatic differentiation. For $\mathcal{L}_\text{project}$, we do not use random sampling, as our MGPCG solver requires a regular grid; we therefore solve the projection on a regular 128^3 grid. The projected velocity vectors at the regular grid points are then used to supervise the velocity network outputs at those exact grid points.
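A minimal sketch of the two computations described above (the `density_net`/`velocity_net` interfaces, sampling ranges, and batch size are placeholders, not our exact code):
```python
import torch

def density_derivatives(density_net, num_points=4096, t=0.5):
    """Sample continuous 3D points at one timestamp and compute first-order
    derivatives of the predicted density w.r.t. space and time via autograd."""
    xyz = torch.rand(num_points, 3, requires_grad=True)        # points in [0, 1]^3
    tt = torch.full((num_points, 1), t, requires_grad=True)    # one shared timestamp
    sigma = density_net(torch.cat([xyz, tt], dim=-1))          # (num_points, 1) densities
    d_xyz, d_t = torch.autograd.grad(sigma.sum(), [xyz, tt], create_graph=True)
    return sigma, d_xyz, d_t                                   # spatial and temporal derivatives

def projection_loss(velocity_net, grid_points, projected_velocity):
    """Supervise the velocity network at the regular 128^3 grid points with the
    MGPCG-projected (divergence-free) velocity computed at those same points."""
    return ((velocity_net(grid_points) - projected_velocity) ** 2).mean()
```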
2. **Differences compared to solely using high-frequency position embedding**:
Please note that the state-of-the-art neural fluid reconstruction method PINF [Chu2022] already uses an advanced high-frequency position embedding [Sitzmann2020] for its velocity field. Our comparison to PINF shows that our approach reconstructs better fluid fields with richer vortical details (in particular, please see the re-simulation video in the supplementary material).
3. **Comparison to GT velocity**:
Please see our response in 7. above.
4. **HyFluid on real scenes**:
As clarified above, all our experiments use real data which inevitably has spatially-varying lighting due to global illumination. Please also see our response in 2. above for modeling spatially-varying appearance.
5. **More results of other baselines**:
Please refer to our response in 6. above.
**Reference**:
[Chu2022] Chu, M., Liu, L., Zheng, Q., Franz, E., Seidel, H. P., Theobalt, C., & Zayer, R. (2022). Physics informed neural fields for smoke reconstruction with sparse data. ACM Transactions on Graphics (TOG), 41(4), 1-14.
[Sitzmann2020] Sitzmann, V., Martel, J., Bergman, A., Lindell, D., & Wetzstein, G. (2020). Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33, 7462-7473.