## Reviewer 3. Satq

We appreciate your insightful comments. We address your concerns below.

### (W1) Technique Novelty

**While TACO may appear similar to existing image generation works, TACO is specialized for image compression.**

* **Cross-attention (CA) in the text adapter.** As the reviewer mentioned, previous works directly use the output of CA (Chen et al. (2023), Rombach et al. (2022)). Our usage of CA in the text adapter differs from these approaches in two respects: (1) we concentrate on enriching the image features with the text's semantic information during the encoding process, rather than relying solely on the attention outputs to identify semantic regions in the image; (2) we apply CA to image features at multiple scales, which helps the downsampled image features compensate for and enhance the local and global information obtained from the text description (a simplified sketch is given after this list).
* **Usage of CLIP embeddings.** In generation work, CLIP embeddings are used to link images with their relevant textual descriptions, bridging the gap between visual content and language. In the context of compression, however, it is crucial to maintain a correlation with the original image. We use CLIP embeddings to associate the features of the original image with those of the compressed image, and to connect the features of the compressed image with the textual description of the original image.
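To make the multi-scale cross-attention more concrete, below is a minimal, simplified sketch of how a fixed-length CLIP caption embedding could be fused into encoder features at several scales. It is illustrative only and not the exact TACO implementation: the channel sizes, the use of `nn.MultiheadAttention`, and the residual fusion are assumptions made for this example.

```python
import torch
import torch.nn as nn

class TextAdapterSketch(nn.Module):
    """Illustrative sketch (not the exact TACO module): fuse CLIP caption
    embeddings into image features at several encoder scales via cross-attention."""

    def __init__(self, img_channels=(192, 192, 320), text_dim=512, num_heads=8):
        super().__init__()
        # one projection + cross-attention block per feature scale
        self.text_proj = nn.ModuleList(nn.Linear(text_dim, c) for c in img_channels)
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(c, num_heads, batch_first=True) for c in img_channels
        )

    def forward(self, feats, text_emb):
        # feats: list of [B, C_i, H_i, W_i] image features at multiple scales
        # text_emb: [B, 38, 512] CLIP token embeddings of the caption
        fused_feats = []
        for f, proj, attn in zip(feats, self.text_proj, self.cross_attn):
            b, c, h, w = f.shape
            queries = f.flatten(2).transpose(1, 2)   # [B, H*W, C]: image tokens as queries
            text_kv = proj(text_emb)                 # [B, 38, C]: text tokens as keys/values
            out, _ = attn(queries, text_kv, text_kv)
            fused_feats.append(f + out.transpose(1, 2).reshape(b, c, h, w))  # residual fusion
        return fused_feats
```

In this sketch the image tokens act as queries and the text sequence is fixed at 38 tokens, so the extra cross-attention cost scales with the number of image tokens rather than with its square.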
### (W2) SOTA performance in both PSNR and LPIPS

**We achieve SOTA performance in LPIPS and the best position between perception-oriented and pixel-oriented methods.**

LIC-TCM and MS-ILLM are the current SOTA methods for pixel fidelity and perceptual quality, respectively. However, both models exhibit limitations, underperforming on the metric they do not primarily target: LIC-TCM lags in LPIPS and FID, while MS-ILLM lags in PSNR. Recent efforts to address this trade-off (Multi-Realism (Agustsson et al. (2023)), Tongda et al. (2023)) aim to enhance both pixel-level and perceptual fidelity. We show that TACO achieves SOTA in LPIPS while delivering significantly improved PSNR scores.

### (W3) Improvement brought by the text adapter and the proposed loss

We validate the critical role of each component in attaining high PSNR and enhanced perceptual quality through an ablation study. We compare the original TACO (ELIC + proposed loss + text adapter), ELIC + proposed loss, ELIC + text adapter, and ELIC. The results are as follows:

* **Without the text adapter:** To assess the impact of the text adapter on performance, we train TACO without it (relying solely on the proposed loss function) and compare this configuration to the original TACO that includes the text adapter. The results are available at [link](https://drive.google.com/file/d/1FNcdFjIg3gDtbro3wsEm7sPQA7RCiyl3/view?usp=sharing). TACO with the text adapter shows a clear advantage in perceptual quality over the variant without it. We therefore conclude that the proposed text adapter effectively utilizes the semantic information contained in the text, leading to improved image compression performance.
* **Without the proposed loss:** To assess the impact of the proposed loss on performance, we train with the traditional losses only (rate, MSE, LPIPS). The [RD graphs](https://drive.google.com/file/d/1l_6AE71Fe_wAQdUoi6EZLwfMMCIcsUCp/view?usp=sharing) show that TACO without the proposed loss yields worse LPIPS and FID; in particular, the FID score degrades significantly relative to MS-ILLM. It is the proposed loss that allows TACO to surpass MS-ILLM in FID. We therefore conclude that the proposed loss is necessary to achieve the best performance in both pixel fidelity and perceptual quality.

We recognize that an ablation study is crucial to clarify the effectiveness of each component, so we will add this discussion and the RD graphs to the main paper.

[References]
* Rombach et al., "High-Resolution Image Synthesis With Latent Diffusion Models", CVPR, 2022.
* Chen et al., "Vision Transformer Adapter for Dense Predictions", ICLR, 2023.
* He et al., "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", CVPR, 2022.
* Liu et al., "Learned Image Compression with Mixed Transformer-CNN Architectures", CVPR, 2023.
* Agustsson et al., "Multi-Realism Image Compression with a Conditional Generator", CVPR, 2023.
* Tongda et al., "Conditional Perceptual Quality Preserving Image Compression", arXiv:2308.08154, 2023.

## Reviewer 4. JmbA

We appreciate the reviewer's constructive comments.

### (W1) Usefulness of text

**The volume of textual data is sufficient to carry semantic information during the encoding process.**

The caption is transformed into a token sequence of length 38, and these tokens are then converted into embedding vectors, producing a textual embedding sequence represented by a [38, 512] matrix of 32-bit floating-point numbers. This yields a total of 622,592 bits (= 38 x 512 x 32) of textual information per image (the arithmetic is spelled out below). Consequently, the volume of text available as a semantic guide is large enough to be valuable in the context of image compression.
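For clarity, the bit count follows directly from the shape of the embedding matrix; the byte conversion is added here only as an illustration:

$$
\underbrace{38}_{\text{tokens}} \times \underbrace{512}_{\text{embedding dim.}} \times \underbrace{32}_{\text{bits per float}}
= 622{,}592 \text{ bits} = 77{,}824 \text{ bytes} \approx 76\,\text{KiB per caption}.
$$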
### (W2, W3) Ablation study on the text adapter

**Our text adapter improves perceptual quality.**

To assess the impact of the text adapter on performance, we train TACO without it (relying solely on the proposed loss function) and compare this configuration to the original TACO that includes the text adapter. The results are available at [link](https://drive.google.com/file/d/1FNcdFjIg3gDtbro3wsEm7sPQA7RCiyl3/view?usp=sharing). TACO with the text adapter shows a clear advantage in perceptual quality over the variant without it. We therefore conclude that the proposed text adapter effectively utilizes the semantic information contained in the text, leading to improved image compression performance. We will add the RD graph with these results to the main paper.

### (W4) Ablation study on using the proposed components with another base model

**Our proposed components contribute performance gains even when used with other base models.**

To verify that the proposed text adapter and loss function each bring a performance improvement, we attach either the text adapter or the proposed loss function to MS-ILLM (Muckley et al. (2023)) and train on the MS-COCO 2014 training split, keeping all other training settings (number of training steps, scheduler, optimizer, learning rate, and loss function) unchanged. Since MS-ILLM already uses LPIPS as part of its loss function, we add only our Joint Image-Text loss when attaching our proposed loss to MS-ILLM. The results are available at [link]() (the analysis below will be completed once the results are finalized).

**MS-ILLM with text adapter.** The figure shows a performance gain when text information is used through the proposed text adapter.

**MS-ILLM with the proposed loss function.** The figure shows a performance gain when the proposed loss function is used.

[References]
* Muckley et al., "Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models", ICML, 2023.

## Reviewer 5. oekG

We thank the reviewer for their constructive suggestions.

### (W1, L1) Limited novelty in utilizing CLIP features

**Our approach to exploiting CLIP embeddings differs significantly from prior work.**

Although we use CLIP embeddings, which are commonly adapted to various vision tasks, our usage differs from other approaches. To serve the goal of image compression, which is to compress images so that they remain as close as possible to the originals, we exploit CLIP features in two ways. First, we use the CLIP embeddings as side information that guides the encoding of the image toward features better suited for reconstruction; with this semantic guidance, TACO achieves SOTA results in both pixel fidelity and perceptual quality. Second, we use a CLIP embedding-based loss function to reduce the semantic distance between the image compressed by TACO and the original image; as evidenced by the RD graphs presented in the main paper, this constraint enables TACO to compress the image in a manner semantically aligned with the original. Consequently, our novelty lies in using CLIP embeddings to link the features of the original image with those of the compressed image and to associate the compressed image's features with the textual information of the original image (a simplified sketch of such a loss term is given below).
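To illustrate the general idea only, the following is a simplified sketch, not the exact Joint Image-Text loss of the paper: the cosine-distance form, the weights `w_img`/`w_txt`, and the function name are our own assumptions for this example, assuming a CLIP model that exposes `encode_image`/`encode_text` as in the open-source CLIP implementation.

```python
import torch
import torch.nn.functional as F

def clip_semantic_loss(clip_model, x_orig, x_rec, text_tokens, w_img=1.0, w_txt=1.0):
    """Illustrative sketch of a CLIP-based semantic term (not the exact TACO loss):
    pull the reconstruction's CLIP embedding toward the original image's embedding
    and the caption's embedding, using cosine distance on normalized features."""
    # x_orig, x_rec: images already resized/normalized to the CLIP input format
    with torch.no_grad():                                   # targets are treated as fixed
        z_orig = F.normalize(clip_model.encode_image(x_orig).float(), dim=-1)
        z_text = F.normalize(clip_model.encode_text(text_tokens).float(), dim=-1)
    z_rec = F.normalize(clip_model.encode_image(x_rec).float(), dim=-1)  # gradients reach x_rec
    img_term = 1.0 - (z_rec * z_orig).sum(dim=-1).mean()    # image-to-image semantic distance
    txt_term = 1.0 - (z_rec * z_text).sum(dim=-1).mean()    # image-to-text semantic distance
    return w_img * img_term + w_txt * txt_term
```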
### (W2, Q1, L2) Ablation study on the text adapter

**Our text adapter improves perceptual quality.**

To assess whether text guidance enhances performance, we first train TACO without the text adapter (relying solely on the proposed loss function) and compare this version with the original TACO that includes the text adapter. The results are available at [link](https://drive.google.com/file/d/1FNcdFjIg3gDtbro3wsEm7sPQA7RCiyl3/view?usp=sharing). TACO with the text adapter shows a clear advantage in perceptual quality over the variant without it. We therefore conclude that the proposed text adapter effectively utilizes the semantic information contained in the text, leading to improved image compression performance. We will add the RD graph with these results to the main paper.

### (W3, Q2, L3) Comparison with PerCo

**TACO performs better in PSNR and LPIPS.**

We compare TACO with PerCo on the MS-COCO 30K validation set used in the paper. As PerCo operates at a compression ratio outside our bpp range, we extrapolate PerCo's results for comparison. The results are available at [link](https://drive.google.com/file/d/1FYbR5nQNbNtwiC-LlTqq7Y_LBb2qvbwC/view?usp=sharing). In our findings, TACO outperforms PerCo significantly in terms of PSNR and LPIPS. However, PerCo achieves a better FID score, suggesting that it may capture realism more effectively through its diffusion-based generative decoder. This discrepancy indicates that conditioning the decoding process on text enhances the realism of the images, albeit at the expense of metrics such as PSNR and LPIPS. We will include this discussion in the main paper.

[References]
* Rombach et al., "High-Resolution Image Synthesis With Latent Diffusion Models", CVPR, 2022.
* He et al., "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", CVPR, 2022.
* Careil et al., "Towards Image Compression with Perfect Realism at Ultra-Low Bitrates", ICLR, 2024.

## Reviewer 6. nuuF

We appreciate the reviewer's constructive feedback.

### (W1, W4, Q1) Comparison with text-guided decoding baselines

**Text-guided encoding yields higher PSNR than text-guided decoding.**

Previous text-guided image compression works such as TGIC (Jiang et al. (2023)) and Qin et al. (2023) use text as side information during decoding. These methods achieve competitive perceptual quality but low PSNR. We therefore hypothesize that text-guided decoding may not be an effective strategy for PSNR. The comparison with text-guided decoding baselines in Fig. 8 supports this hypothesis: TACO shows superior performance in all metrics, indicating that a text-guided encoding strategy is more effective than a text-guided decoding strategy. For more concrete verification, we train a text-guided-decoding variant of TACO that applies the text adapter during decoding and compare it with the original TACO. The results are available at [link](https://drive.google.com/file/d/1lCKZ7bc7bcq3ABME-GaOTBB7yQAmLbBz/view?usp=sharing). We observe that the text-guided encoding strategy outperforms text-guided decoding, with the largest gap in PSNR. Therefore, our hypothesis is plausible. We will add the discussion of this ablation study to the paper.

### (W2) Computational cost of cross-attention in the text adapter

**The computational cost of the attention mechanism is relatively small.**

To examine the computational burden of the attention mechanism in TACO in detail, we measure the processing time for a high-resolution image ([Details of Pearl Earring Girl with a Pearl Earring Renovation ©Koorosh Orooj](http://vphotobrush.com/images/Details-of-Pearl-Earring-Girl-with-a-Pearl-Earring-Renovation.jpg)) of size 1788 x 2641. The methods we compare are TACO, ELIC (He et al. (2022)), MS-ILLM (Muckley et al. (2023)), and LIC-TCM (Liu et al. (2023)). We compare only the encoding time, since TACO uses its attention-based text adapter during encoding only (a sketch of the timing procedure is given below).
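The timing procedure can be summarized by the following minimal sketch; it is illustrative only, and the model objects, input shapes, warm-up count, and the hypothetical `taco.encode_image` call are assumptions rather than our actual evaluation scripts.

```python
import time
import torch

@torch.no_grad()
def measure_encoding_time(encode_fn, inputs, warmup=5, repeats=20):
    """Illustrative timing sketch: average wall-clock time of an encoding call,
    with GPU synchronization so asynchronous CUDA kernels are fully counted."""
    for _ in range(warmup):                  # warm-up to exclude one-time costs
        encode_fn(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        encode_fn(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats * 1000.0  # milliseconds

# Hypothetical usage: image encoding and text encoding are timed separately.
# img_ms  = measure_encoding_time(taco.encode_image, (high_res_image,))
# text_ms = measure_encoding_time(clip_text_encoder, (caption_tokens,))
```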
The results are shown in Tables (1) and (2). In Table (1), TACO's encoding time is similar to that of the other methods, which use only the image for encoding: compared with ELIC, our base model, the inference time increases by 32.95 ms (4.15%). In addition, since the computational cost of an attention mechanism can grow quadratically as the sequence length grows, we also check how the processing time increases from a low-resolution image (256 x 256) to the high-resolution image (1788 x 2641); the result is reported in Table (2). Notably, the rate of increase for TACO is similar to that of the other methods. From these results, we conclude that the computational cost of the attention mechanism in the text adapter is not a significant burden across image resolutions, including high-resolution images.

<br>

**Table (1)**

| Methods | Image encoding time at high resolution (ms) | Text encoding time (ms) | Total encoding time (ms) |
| :-----: | :------------------: | :----------------: | :-------: |
| LIC-TCM | 960.09 | 0.0 | 960.09 |
| MS-ILLM | 800.45 | 0.0 | 800.45 |
| ELIC | 793.41 | 0.0 | 793.41 |
| TACO | 826.36 | 5.71 | 832.07 |

**Table (2)**

| Methods | Total encoding time at low resolution (ms) | Total encoding time at high resolution (ms) | Increase rate (%) |
| :-----: | :------------------: | :----------------: | :-------: |
| LIC-TCM | 112.07 | 960.09 | 857% |
| MS-ILLM | 70.39 | 800.45 | 1137% |
| ELIC | 71.35 | 793.41 | 1112% |
| TACO | 78.60 | 826.36 | 1051% |

<br>

### (W3) Report on the number of parameters

**TACO includes the capacity of a CLIP text encoder, which is sufficient for generating semantic textual embeddings, and therefore obtains better results even though it is not larger than other methods.**

We calculate the number of parameters in TACO and report it in Table (3). Since TACO's text adapter uses text embeddings from a pre-trained CLIP text encoder, we include the parameters of the CLIP text encoder in the count for the text adapter.

<br>

**Table (3)**

| Modules | Parameters (M) |
| :-------------------: | :-------------: |
| Encoder | 7.34 |
| Decoder | 7.34 |
| Hyper-prior Encoder | 2.40 |
| Hyper-prior Decoder | 8.14 |
| Other modules (e.g., entropy bottleneck) | 11.72 |
| Text adapter (including the text encoder of CLIP) | 64.82 |

<br>

In this table, the number of parameters in the text adapter is noticeably larger than that of the other modules. From this observation, we make two claims.

* **TACO is not a large-scale model.** One might be concerned that the text adapter increases TACO's parameter count by a large margin. However, we train only the parameters of the text adapter, excluding the CLIP text encoder, and these amount to just 1.64M. Consequently, a saved TACO checkpoint contains 36.94M parameters, which is fewer than LIC-TCM (44.96M parameters for the small version) and SwinT-ChARM (Zhu et al. (2022); 60.55M parameters). A sketch of how trainable parameters are counted is given after this list.
* **TACO leverages a pre-trained text encoder with sufficient capacity.** Table (3) and the figures above show that the pre-trained text encoder, trained on a web-scale dataset, accounts for a large share of the parameters. TACO leverages the context-rich textual information extracted by this encoder as semantic guidance, achieving better performance in both pixel fidelity and perceptual quality.

Therefore, TACO is a method that leverages meaningful textual embeddings from a high-capacity pre-trained text encoder, showing better performance across various aspects of image quality while requiring fewer trainable parameters than previous methods such as LIC-TCM and SwinT-ChARM. We will add a table with these results to the Appendix.
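The split between frozen and trainable parameters can be verified with a short counting routine such as the one below; this is a minimal sketch, and `taco_model` together with the assumption that the CLIP text encoder's parameters are frozen (`requires_grad=False`) are illustrative.

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Count total and trainable parameters (in millions) of a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total / 1e6, trainable / 1e6

# Hypothetical usage: with the CLIP text encoder frozen, the trainable count
# excludes its parameters, consistent with the checkpoint size discussed above.
# total_m, trainable_m = count_parameters(taco_model)
```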
### (W3, Q2) Ablation study on the Joint Image-Text loss

**The Joint Image-Text loss contributes to improved perceptual quality.**

To assess the impact of the proposed loss on performance, we train with the traditional losses only (rate, MSE, LPIPS), without our proposed loss. The results are available at [link](https://drive.google.com/file/d/1l_6AE71Fe_wAQdUoi6EZLwfMMCIcsUCp/view?usp=sharing). The RD graphs show that TACO without the proposed loss yields worse LPIPS and FID; in particular, the FID score degrades significantly relative to MS-ILLM. It is the proposed loss that allows TACO to surpass MS-ILLM in FID. We therefore conclude that the proposed loss is necessary to achieve the best performance in both pixel fidelity and perceptual quality.

### (Q3) Accounting for the CLIP text encoder when measuring encoding time

**The processing time of the CLIP text encoder is relatively small.**

Since extracting textual embeddings is part of TACO's processing pipeline, we include the text encoder's processing time in the encoding time and also report it separately in Table (1). The inference time for extracting text embeddings is only 5.71 ms, which is small relative to the overall computation time. This means TACO obtains textual embeddings for semantic guidance at little cost.

### (Q4) The reason for not using adversarial losses

**We chose not to use adversarial losses so that TACO performs well across multiple aspects of image quality.**

In the image compression field, adversarial losses are used to train models that produce images of high perceptual quality, such as HiFiC (Mentzer et al. (2020)), Qin et al. (2023), and TGIC (Jiang et al. (2023)). These methods achieve competitive perceptual quality but come with **low pixel fidelity**. If we added an adversarial loss, TACO would focus only on reconstructing perceptually pleasing images and would lose its unique contribution of achieving both high perceptual quality and high pixel fidelity. Therefore, to achieve better performance across various aspects of image quality, we decided not to use adversarial losses.

[References]
* He et al., "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", CVPR, 2022.
* Liu et al., "Learned Image Compression with Mixed Transformer-CNN Architectures", CVPR, 2023.
* Muckley et al., "Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models", ICML, 2023.
* Mentzer et al., "High-Fidelity Generative Image Compression", NeurIPS, 2020.
* Jiang et al., "Multi-Modality Deep Network for Extreme Learned Image Compression", AAAI, 2023.
* Qin et al., "Perceptual Image Compression with Cooperative Cross-Modal Side Information", arXiv:2311.13847, 2023.
* Careil et al., "Towards Image Compression with Perfect Realism at Ultra-Low Bitrates", ICLR, 2024.
* Zhu et al., "Transformer-based Transform Coding", ICLR, 2022.