Dear Area Chair,

We deeply appreciate your time and effort in organizing the reviews of our manuscript. As the reviewers highlight, our work proposes an **intuitive and well-motivated** algorithm (1mna, yNEC, JmbA) with **strong experimental results** on various datasets (1mna, yNEC, oekG) and a **clear presentation** (1mna, yNEC, Satq, JmbA, nuuF).

The reviewers' concerns were mostly requests for additional ablations and analyses or for further clarification. As summarized in the common response below, we have carefully addressed each concern one by one. The reviewers who participated in the discussion phase acknowledged that our response was satisfactory and raised or maintained a positive evaluation of our paper (oekG, yNEC, 1mna). However, despite our thorough answers, a few reviewers (Satq, JmbA, nuuF) did not respond before the end of the discussion phase. While this was unfortunate, we remain confident that our answers addressed their concerns thoroughly and effectively. We hope our responses and supplementary results successfully clarify the concerns raised by the reviewers.

* **Satq**: Asked for further clarification of our contributions. We answer by highlighting our key contribution and the strength of our empirical performance.
* **JmbA**: Some comments seem to stem from a misunderstanding of our work (e.g., the number of bits carried by the text). We provide a clear answer to the reviewer's questions with numerical results and additional ablations.
* **nuuF**: Requested more detailed analyses (e.g., comparisons of the number of parameters, encoding time for high-resolution images, and the reason for not using adversarial losses) and asked us to tone down our claims about the limitations of competitors. We answer these questions with additional ablations.

Thank you again for your consideration,
Authors

## Reviewer 1 (1mna)

We appreciate the reviewer's valuable comments and questions.

### (Q1) PSNR-perceptual curves when using text in both encoding and decoding

**Using text in both encoding and decoding improves perceptual quality at the cost of PSNR.** We conduct an experiment in which we apply our proposed text adapter and loss so that text conditions both the encoding and decoding processes ([result: PSNR-perceptual graph](https://drive.google.com/file/d/1I10kGkbv1VoA9qPGdCO7eeMOUwRRTye1/view?usp=sharing)). We observe that using text in both encoding and decoding improves perceptual quality to almost the same degree as using text in the encoder only. However, **using text in decoding brings a larger PSNR drop than using it in encoding only**. In other words, text-guided encoding alone gives a better PSNR-perceptual trade-off than text guidance on both sides. We will add a discussion of this ablation to the paper; it provides a more comprehensive understanding of why we condition on text only during encoding, which achieves the best PSNR-perceptual trade-off.

### (Q2) Ablation study on using our proposed loss only

<!-- ### (Q2) I wonder if not using text still gives competitive performance with the proposed losses? -->

**The proposed loss alone yields competitive results, but the text adapter brings further improvement.** To investigate the standalone impact of our proposed loss, we compare the original ELIC, our base model (He et al. (2022)), with ELIC trained with our proposed loss.
Observing the PSNR-perceptual curves at this [link](https://drive.google.com/file/d/1FNcdFjIg3gDtbro3wsEm7sPQA7RCiyl3/view?usp=sharing), we find **a perceptual-quality gain** when using our proposed loss compared with the original ELIC. This shows that our proposed loss helps the compressed image maintain high semantic similarity to the original image and its caption, *thereby improving LPIPS and FID.* However, TACO (ELIC + text adapter + proposed loss) still achieves the best perceptual performance among ELIC and ELIC with the proposed loss. We therefore conclude that the proposed loss alone improves perceptual performance, while the text adapter provides a further gain. We will expand the paper with a discussion of the effects of the proposed loss, which will give a more comprehensive understanding of how it contributes to the observed improvements.

[References]
* He et al., "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", CVPR, 2022.

## Reviewer 2 (yNEC)

We thank the reviewer for suggesting additional qualitative results to strengthen our paper.

### (W1) Present additional qualitative results

**We present more qualitative results that demonstrate the effectiveness and versatility of TACO.** TACO captures detailed attributes of the original source image. Specifically, [in this reconstructed image](cat_photo), the texture of the white wall and the cat's pupils are reconstructed better than by other methods (perceptual-oriented: MS-ILLM; PSNR-oriented: LIC-TCM). We will include these additional qualitative results in the main paper to give readers a more comprehensive understanding of TACO's performance across various scenarios.

### (W2) Analysis of qualitative results

* **White bars (Figure A3).** We analyze Figure A3 with a focus on the white bars below ([link](https://drive.google.com/file/d/1nxS3pP8ejKvNCV846XO4O-Fwh2z21wBk/view?usp=sharing)). In this figure, TACO reconstructs the white bars as well as the other methods (MS-ILLM, LIC-TCM). However, only TACO reconstructs the vertical lines within the white bars, a region that is hard to distinguish from the white background; the other methods do not.
* **Aliasing-pattern images.** From the qualitative results, we find that TACO reconstructs fine details such as textures and patterns well. Following your careful observation, we are also interested in how well TACO reconstructs geometrically repetitive images (e.g., aliasing patterns). [Here](https://drive.google.com/file/d/11x9DZps9KhTZ11hUxgqiBsm8zA18HWQw/view?usp=sharing) are additional results on the Urban dataset (Huang et al. (2015)), a collection of building images commonly used to evaluate super-resolution. In these reconstructions, TACO captures the fine details of the specific patterns in the original source image.

We will add these additional qualitative results to the main paper to show readers that TACO handles difficult images (such as white bars and aliasing patterns) well.

[References]
* Liu et al., "Learned Image Compression with Mixed Transformer-CNN Architectures", CVPR, 2023.
* Muckley et al., "Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models", ICML, 2023.
* Huang et al., "Single Image Super-Resolution From Transformed Self-Exemplars", CVPR, 2015.

***

## Reviewer 3 (Satq)

We appreciate your insightful comments.
We address your concerns below.

### (W1) Technical novelty

**While TACO may appear similar to existing image generation works, TACO is specialized for the image compression task.**

* **Cross-attention (CA) in the text adapter.** As the reviewer mentioned, previous works directly use the output of CA (Chen et al. (2023), Rombach et al. (2022)). Our usage of CA in the text adapter differs from previous approaches in two aspects: (1) we focus on enriching the encoder's output features with the text's semantic information during the encoding process, rather than relying on the attention outputs to identify semantic regions in the image (as in segmentation or object detection); (2) we apply CA to image features at multiple scales, which helps the downsampled image features compensate for and enhance the local and global information obtained from the text description (see the sketch after this list).
* **Usage of CLIP embeddings.** TACO uses the CLIP feature distance to pull the features of the compressed image toward both the features of the original image and the features of the text. In generation works, CLIP embeddings link images with their relevant textual descriptions to bridge the gap between visual content and language. In compression, however, it is crucial to maintain a correlation with the original image. We therefore use CLIP embeddings to associate the features of the original image with those of the compressed image, and to connect the features of the compressed image with the textual information of the original image.
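The cross-attention usage described above can be pictured with a minimal sketch like the following; the module structure, dimensions, and residual formulation are illustrative assumptions, not the exact TACO implementation.

```python
import torch
import torch.nn as nn

class TextAdapterBlock(nn.Module):
    """Illustrative sketch (not the exact TACO module): inject CLIP text
    embeddings into one scale of encoder image features via cross-attention."""
    def __init__(self, img_dim, text_dim=512, num_heads=4):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, img_dim)   # map CLIP text dim to this scale's channel width
        self.attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_feat, text_emb):
        # img_feat: (B, C, H, W) encoder features at one scale
        # text_emb: (B, L, 512) CLIP text-token embeddings (e.g., L = 38)
        b, c, h, w = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)         # (B, H*W, C): image tokens as queries
        kv = self.proj_text(text_emb)                   # (B, L, C): text tokens as keys/values
        out, _ = self.attn(self.norm(q), kv, kv)        # image features attend to text semantics
        out = q + out                                   # residual: enhance, not replace, image features
        return out.transpose(1, 2).reshape(b, c, h, w)

# In the multi-scale setting described above, one such block would be applied at
# each encoder scale, with img_dim matching that scale's channel width.
```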
### (W2) SOTA performance in both PSNR and LPIPS

**We achieve SOTA performance in LPIPS and the best position between perceptual-oriented and pixel-oriented methods.** LIC-TCM and MS-ILLM are the current SOTA methods in pixel fidelity and perceptual quality, respectively. However, both models underperform on the metrics they do not primarily target: LIC-TCM lags in LPIPS and FID, while MS-ILLM lags in PSNR. Recent approaches addressing this trade-off (Multi-Realism (Agustsson et al. (2023)), Tongda et al. (2023)) aim to improve both pixel-level and perceptual fidelity. We show that TACO achieves SOTA in LPIPS with significantly improved PSNR scores.

### (W3) Improvement brought by the text adapter and the proposed loss

We validate the critical role of each component in attaining high PSNR and enhanced perceptual quality through an ablation study comparing the full TACO (ELIC + proposed loss + text adapter), ELIC + proposed loss, ELIC + text adapter, and ELIC. The results are as follows:

* **Without the text adapter.** To assess the impact of the text adapter on performance, we train TACO without the text adapter (relying solely on the proposed loss function) and compare this configuration to the original TACO, which includes the text adapter. The results are available at this [link](https://drive.google.com/file/d/1FNcdFjIg3gDtbro3wsEm7sPQA7RCiyl3/view?usp=sharing). TACO and TACO without the text adapter show a large difference in perceptual quality. Thus, we conclude that our proposed text adapter effectively uses the text, leading to improved image compression performance.
* **Without the proposed loss.** To assess the impact of the proposed loss on performance, we train TACO without the proposed loss. The [RD graphs](https://drive.google.com/file/d/1l_6AE71Fe_wAQdUoi6EZLwfMMCIcsUCp/view?usp=sharing) show that TACO without the proposed loss yields worse LPIPS and FID; in particular, we observe a significant drop in FID relative to MS-ILLM. TACO surpasses MS-ILLM in FID only when the proposed loss is employed. Therefore, we conclude that our proposed loss is necessary for achieving the best performance in both pixel fidelity and perceptual quality.

We recognize that this ablation study is crucial for clarifying the effectiveness of each component, so we will add the discussion and the RD graphs to the main paper.

[References]
* Rombach et al., "High-Resolution Image Synthesis With Latent Diffusion Models", CVPR, 2022.
* Chen et al., "Vision Transformer Adapter for Dense Predictions", ICLR, 2023.
* He et al., "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", CVPR, 2022.
* Liu et al., "Learned Image Compression with Mixed Transformer-CNN Architectures", CVPR, 2023.
* Agustsson et al., "Multi-Realism Image Compression with a Conditional Generator", CVPR, 2023.
* Tongda et al., "Conditional Perceptual Quality Preserving Image Compression", arXiv:2308.08154, 2023.

## Reviewer 4 (JmbA)

Dear reviewer JmbA,

We appreciate your constructive comments. We answer them below.

---

> 1. It is difficult to conclude the usefulness of the text adapter module itself that augments the encoder architecture. Intuitively, it is difficult to imagine that a short token length of 58 would significantly add any information about the image. 58 tokens corresponds to about 6 bits total. It is hard to believe that the encoder can leverage such a small amount of information, when it is communicating a bpp of 0.2-0.8 (at least 13000 bits total).

**The volume of textual data is sufficient for incorporating semantic information during the encoding process.** By tokenizing the caption into a sequence of length 38 and passing it through the CLIP text encoder, we obtain a length-38 sequence of 512-dimensional text embeddings stored as 32-bit floating-point values. This means the encoder receives text data totaling 622,592 bits (= 38 x 512 x 32) per image, which corresponds to a bpp of 9.5 for a 256x256 image. Therefore, the amount of textual information available as a semantic guide is large enough to be valuable in the encoding process.
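Spelling out the arithmetic above (38 tokens, 512-dimensional CLIP embeddings, 32-bit floats, and a 256x256 image):

$$
38 \times 512 \times 32 = 622{,}592 \ \text{bits},
\qquad
\frac{622{,}592}{256 \times 256} = 9.5 \ \text{bpp}.
$$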
---

> 2. An ablation study would have helped to demonstrate whether this is true. For example, one combination would have ELIC (with no text adapter), but use the same objective function in the paper (which leverages textual information).

**Our text adapter improves perceptual quality.** To assess the impact of incorporating the text adapter, we train TACO without the text adapter (i.e., using only the proposed loss) and compare it with the original TACO, which includes the text adapter. The results are available at this [link](https://drive.google.com/file/d/1FNcdFjIg3gDtbro3wsEm7sPQA7RCiyl3/view?usp=sharing). TACO without the text adapter shows a perceptual-quality drop compared with the original TACO. Thus, we conclude that our proposed text adapter effectively exploits the semantic information from text, leading to improved image compression performance. We will add the RD graph with this analysis to the main paper.

---

> 3. In addition, for a further ablation study and comparison to HiFiC or MS-ILLM, I think those architectures should also have a text adapter, or training HiFiC/MS-ILLM with ELIC transforms. Essentially, my point is that further ablation studies are needed to show if the text adapter really helps for other architectures, and if the loss function proposed can also help other architectures.

**Our proposed components bring performance gains even with other off-the-shelf image compression models.** To verify whether the proposed text adapter and loss function each bring a performance improvement, we add either the text adapter or the proposed loss function to MS-ILLM (Muckley et al. (2023)) and train on the MS-COCO 2014 training split. All other training settings (number of training steps, scheduler, optimizer, learning rate, and loss function) are preserved. Since MS-ILLM already uses LPIPS as part of its loss function, we only add our joint image-text loss when adding our proposed loss to MS-ILLM. The results are available at [link](). From these results, we make two observations. **MS-ILLM with the text adapter:** the figure shows a performance improvement when using text information through the proposed text adapter. **MS-ILLM with the proposed loss function:** the figure shows a performance improvement when using the proposed loss function. Therefore, we can claim that our proposed components improve performance.

---

Please let us know if you have any additional concerns or questions.

Best regards,
Authors

[References]
* Muckley et al., "Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models", ICML, 2023.

---

## Reviewer 5 (oekG)

Dear reviewer oekG,

We appreciate your constructive comments. We answer them below.

---

> 1. Utilizing CLIP features from image captions for neural compression has limited novelty, as it is not a new idea, although the authors have proven that the proposed method outperforms the previous test-guided methods.

**Our approach to leveraging CLIP features differs significantly.** Although CLIP features are commonly adapted to various vision tasks, our usage is different from other approaches. In keeping with the goal of image compression, which is to keep compressed images as close as possible to the originals, we exploit CLIP features in two ways. First, we use CLIP features as side information that guides the encoding of the image into features more suitable for reconstruction; with this semantic guidance, TACO achieves SOTA in both pixel fidelity and perceptual quality. Second, we use a CLIP feature-based loss function to reduce the semantic distance between the compressed image and the original image. This constraint enables TACO to compress the image in a manner semantically aligned with the original, as evidenced by the RD graphs in the main paper. Consequently, our novelty lies in using CLIP features to simultaneously reduce the gap between the original image's features and the compressed image's features, and between the compressed image's features and the textual information of the original image, resulting in better reconstruction.
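As an illustration of the two CLIP-feature distances described above, here is a minimal sketch of a joint image-text feature loss; the cosine-distance form, the weights, and the function names are assumptions for illustration, not the exact loss (Eqn. (6)) in the paper.

```python
import torch
import torch.nn.functional as F

def joint_image_text_loss(clip_model, x_orig, x_hat, text_tokens,
                          w_img=1.0, w_text=1.0):
    """Illustrative sketch of a CLIP-feature loss: pull the compressed image's
    CLIP embedding toward (a) the original image's embedding and (b) the
    caption's embedding. Cosine distance and weights are assumptions; inputs
    are assumed to be resized/normalized to CLIP's expected format."""
    with torch.no_grad():
        f_orig = F.normalize(clip_model.encode_image(x_orig), dim=-1)
        f_text = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    f_hat = F.normalize(clip_model.encode_image(x_hat), dim=-1)  # keep gradients w.r.t. x_hat

    d_img = 1.0 - (f_hat * f_orig).sum(dim=-1)    # compressed vs. original image features
    d_text = 1.0 - (f_hat * f_text).sum(dim=-1)   # compressed image vs. caption features
    return (w_img * d_img + w_text * d_text).mean()
```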
---

> 2. The effectiveness of using image caption needs a more direct proof, e.g., an ablation study on the proposed method without using caption and trained with the proposed loss function Eqn. (6). Otherwise, it is hard to distinguish whether the test-guidance or the proposed loss result in the performance.

**Our text adapter improves perceptual quality.** To assess whether text guidance enhances performance, we train TACO without the text adapter (i.e., ELIC with the proposed loss function) and compare it with the original TACO. The results are available at this [link](https://drive.google.com/file/d/1FNcdFjIg3gDtbro3wsEm7sPQA7RCiyl3/view?usp=sharing). We observe a drop in perceptual quality when the text adapter is not used. Thus, it is reasonable to claim that our proposed text adapter effectively uses the textual information about the image, leading to improved image compression performance. We will add the RD graph and its analysis to the main paper.

---

> 3. A comparison with PerCo (Careil et al., 2024) should be conducted. Although PerCo works in lower bitrates than the proposed method, showing their rate-distortion/perception curves in a figure is also meaningful and inspiring.

**TACO performs better than PerCo in PSNR and LPIPS.** We compare TACO and PerCo on the MS-COCO 30K validation set, since PerCo was also evaluated on this dataset. As PerCo operates at a different compression ratio than our bpp range, we extrapolate PerCo's results for comparison. The results are available at this [link](https://drive.google.com/file/d/1FYbR5nQNbNtwiC-LlTqq7Y_LBb2qvbwC/view?usp=sharing). In these figures, TACO significantly outperforms PerCo in PSNR and LPIPS. However, PerCo achieves a better FID score, which suggests that using a generative model such as a latent diffusion model (Rombach et al. (2022)) in decoding may be more effective at capturing realism. This discrepancy indicates that conditioning the decoding process on text enhances the realism of the images, albeit at the expense of metrics such as PSNR and LPIPS. We will include this discussion in the main paper.

---

If you have any additional concerns or questions, please let us know.

Best regards,
Authors

[References]
* Rombach et al., "High-Resolution Image Synthesis With Latent Diffusion Models", CVPR, 2022.
* He et al., "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", CVPR, 2022.
* Careil et al., "Towards Image Compression with Perfect Realism at Ultra-Low Bitrates", ICLR, 2024.

---

## Reviewer 6 (nuuF)

Dear reviewer nuuF,

We deeply appreciate your constructive feedback. We answer it below.

---

> 1. Lack of convincing evidence or analysis to support "Our key hypothesis is that text-guided decoding may not be an effective strategy for PSNR." (line 094, page 2).

**Text-guided encoding yields higher PSNR than text-guided decoding.** Previous works on text-guided image compression, such as TGIC (Jiang et al. (2023)) and Qin et al. (2023), use text as side information during decoding. These methods achieve competitive perceptual quality by leveraging the semantic information in the text as guidance when decoding, but commonly suffer degraded pixel-wise fidelity. Based on this observation, we hypothesize that "text-guided decoding may not be an effective strategy for PSNR." The comparison against text-guided decoding baselines in Fig. 8 supports our hypothesis: TACO shows superior performance in all metrics, indicating that our text-guided encoding strategy is more effective than a text-guided decoding strategy for image compression. For more concrete verification, we also train a text-guided decoding variant of TACO that uses the text adapter during decoding and compare it with the original TACO. The results are available at this [link](https://drive.google.com/file/d/1lCKZ7bc7bcq3ABME-GaOTBB7yQAmLbBz/view?usp=sharing). The text-guided encoding strategy performs better than text-guided decoding, with the largest gap in PSNR. Therefore, we can claim that our hypothesis is plausible. We will add a discussion of this ablation to the paper.
---

> 2. The use of the Attention mechanism increases computational expense when processing high-resolution images.

**The computational cost of the attention mechanism is relatively small.** To check the computational burden of the attention mechanism in TACO in detail, we measure the processing time of a high-resolution image ([Details of Pearl Earring Girl with a Pearl Earring Renovation ©Koorosh Orooj](http://vphotobrush.com/images/Details-of-Pearl-Earring-Girl-with-a-Pearl-Earring-Renovation.jpg)) of size 1788 x 2641. The compared methods are TACO, ELIC (He et al. (2022)), MS-ILLM (Muckley et al. (2023)), and LIC-TCM (Liu et al. (2023)). Since TACO uses an attention-based text adapter only in encoding, we compare encoding time only. The results are shown in Tables (1) and (2); a sketch of the measurement procedure follows Table (2). In Table (1), TACO's encoding time is similar to that of the other methods, which use only the image for encoding; compared with ELIC, our base model, the inference time increases by 32.95 ms (4.15%). In addition, because the computational cost of attention grows quadratically as the sequence length grows linearly, we also check the growth rate of processing time from a low-resolution image (256 x 256) to the high-resolution image (1788 x 2641), reported in Table (2). Notably, TACO's growth rate is similar to that of the other methods. From these results, we conclude that the attention mechanism in the text adapter is not a significant burden across image resolutions, including high-resolution images.

<br>

**Table (1).** Comparison of inference time for processing the high-resolution image across methods. As TACO adds an attention-based text adapter to the encoder, we consider encoding time only. Since obtaining the textual embeddings is part of encoding, the text encoding time is included in TACO's total encoding time.

| Methods | Image encoding time at high resolution (ms) | Text encoding time (ms) | Total encoding time (ms) |
| :-----: | :------------------: | :----------------: | :-------: |
| LIC-TCM | 960.09 | 0.0 | 960.09 |
| MS-ILLM | 800.45 | 0.0 | 800.45 |
| ELIC | 793.41 | 0.0 | 793.41 |
| TACO | 826.36 | 5.71 | 832.07 |

**Table (2).** Growth rate of inference time from processing the low-resolution (256 x 256) image to the high-resolution (1788 x 2641) image.

| Methods | Total encoding time at low resolution (ms) | Total encoding time at high resolution (ms) | Growth rate (%) |
| :-----: | :------------------: | :----------------: | :-------: |
| LIC-TCM | 112.07 | 960.09 | 857% |
| MS-ILLM | 70.39 | 800.45 | 1137% |
| ELIC | 71.35 | 793.41 | 1112% |
| TACO | 78.60 | 826.36 | 1051% |
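For reference, encoding times of this kind can be gathered along the following lines; this is a generic timing sketch with placeholder names (`encode_fn`, `model.encode`), not the actual benchmarking script.

```python
import time
import torch

@torch.no_grad()
def time_encoding(encode_fn, x, warmup=3, repeats=10):
    """Illustrative timing sketch (placeholder API): average wall-clock time,
    in milliseconds, of an encoding function on a single image tensor."""
    for _ in range(warmup):                      # warm-up to exclude one-off initialization costs
        encode_fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # ensure queued GPU work is finished before timing
    start = time.perf_counter()
    for _ in range(repeats):
        encode_fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats * 1000.0  # ms per image

# Hypothetical usage: x is a single image tensor on the GPU and `model.encode`
# is the encoder entry point of the codec being measured.
# enc_ms = time_encoding(model.encode, x)
```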
---

> 3. The paper lacks a discussion on the number of parameters.

**TACO has the capacity of a CLIP text encoder, which is sufficient for generating semantic textual embeddings, and therefore achieves better results even though it is not larger than other methods.** We count the number of parameters in TACO and report it in Table (3). Since TACO's text adapter uses text embeddings from the pre-trained CLIP text encoder, we include the CLIP text encoder's parameters in the text adapter's count.

<br>

**Table (3).** Number of parameters in each module of TACO. The CLIP text encoder's parameters are included in the text adapter's count, since generating the textual embeddings is part of TACO's text adaptation.

| Modules | Parameters (M) |
| :-------------------: | :-------------: |
| Encoder | 7.34 |
| Decoder | 7.34 |
| Hyper-prior Encoder | 2.40 |
| Hyper-prior Decoder | 8.14 |
| Other modules (e.g., entropy bottleneck) | 11.72 |
| Text adapter (including the text encoder of CLIP) | 64.82 |

<br>

In this table, the text adapter has relatively more parameters than the other modules. From this observation, we make two claims.

* **TACO is not a large-scale model.** It may look concerning that the text adapter increases TACO's parameter count by a large margin. However, we only train the parameters of the text adapter, excluding the CLIP text encoder, and these amount to only 1.64M. As a result, the saved TACO checkpoint contains 36.94M parameters, fewer than LIC-TCM (44.96M parameters for the small version) and SwinT-ChARM (Zhu et al. (2022)) (60.55M parameters).
* **TACO leverages a pre-trained text encoder with ample capacity.** From Table (3) and the above, the pre-trained text encoder, trained on a web-scale dataset, has a large number of parameters. TACO leverages the context-rich textual information extracted by this encoder as semantic guidance, achieving better performance in both pixel fidelity and perceptual quality.

Therefore, TACO leverages meaningful textual embeddings from a high-capacity pre-trained text encoder and shows better performance across various aspects of image quality, while training fewer parameters than previous methods such as LIC-TCM and SwinT-ChARM. We will add a table with these results to the Appendix.
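Parameter counts such as those in Table (3) are typically computed as in the sketch below; the attribute names (`model.text_adapter`, `model.clip_text_encoder`) are placeholders used only to illustrate the trainable-versus-frozen distinction drawn above.

```python
import torch.nn as nn

def count_params_m(module: nn.Module, trainable_only: bool = False) -> float:
    """Generic sketch: parameter count of a module, in millions."""
    params = (p for p in module.parameters()
              if p.requires_grad or not trainable_only)
    return sum(p.numel() for p in params) / 1e6

# Hypothetical usage, assuming `model.text_adapter` holds the trained adapter
# layers and `model.clip_text_encoder` holds the frozen CLIP text encoder:
# count_params_m(model.text_adapter, trainable_only=True)   # trained adapter parameters
# count_params_m(model.clip_text_encoder)                   # frozen CLIP text encoder parameters
```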
---

> 4. How does the joint image-text loss affect the final performance of the model?

**The joint image-text loss contributes to increased perceptual quality.** To assess the impact of the proposed loss on performance, we train with only the traditional losses (rate, MSE, LPIPS), without our proposed loss. The results are available at this [link](https://drive.google.com/file/d/1l_6AE71Fe_wAQdUoi6EZLwfMMCIcsUCp/view?usp=sharing). The RD graphs show that TACO without the proposed loss yields worse LPIPS and FID; in particular, we observe a significant drop in FID relative to MS-ILLM. TACO surpasses MS-ILLM in FID only when the proposed loss is employed. Therefore, we conclude that our proposed loss is necessary for achieving the best performance in both pixel fidelity and perceptual quality.

---

> 5. Is the CLIP model considered when calculating the encoding time?

**The processing time of the CLIP text encoder is relatively small.** Since extracting the textual embeddings is part of TACO's processing, we include the text encoder's processing time in the encoding time and also report it separately in Table (1). In Table (1), the inference time for extracting the text embeddings is only 5.71 ms, which is small relative to the overall computation time. This means TACO obtains textual embeddings for semantic guidance at little cost.

---

> 6. What is the reason for not using adversarial losses in this method?

**We choose not to use adversarial losses so that TACO performs well across all aspects of image quality.** In image compression, adversarial losses are used to train models that generate images of high perceptual quality, such as HiFiC (Mentzer et al. (2020)), Qin et al. (2023), and TGIC (Jiang et al. (2023)). These methods achieve competitive perceptual quality but come with **low pixel fidelity**. If we added an adversarial loss, TACO would focus only on reconstructing images of high perceptual quality; in other words, TACO would lose its unique contribution of achieving both high perceptual quality and high pixel fidelity. Therefore, to perform well across various aspects of image quality, we decided not to use adversarial losses.

---

If you have any additional concerns or questions, please let us know.

Best regards,
Authors

[References]
* He et al., "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", CVPR, 2022.
* Liu et al., "Learned Image Compression with Mixed Transformer-CNN Architectures", CVPR, 2023.
* Muckley et al., "Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models", ICML, 2023.
* Mentzer et al., "High-Fidelity Generative Image Compression", NeurIPS, 2020.
* Jiang et al., "Multi-Modality Deep Network for Extreme Learned Image Compression", AAAI, 2023.
* Qin et al., "Perceptual Image Compression with Cooperative Cross-Modal Side Information", arXiv:2311.13847, 2023.
* Zhu et al., "Transformer-based Transform Coding", ICLR, 2022.

# Dear reviewer yNEC,

We deeply appreciate your thoughtful comments and feedback. We respond to your suggestions below.

---

> 1. The paper would be stronger if the authors can show more qualitative results in the supplementary material.

Following the reviewer's suggestion, we have doubled the number of qualitative results (3 -> 6); see [**LINK**](https://drive.google.com/file/d/1Dd5mlshKMF90vy59ApTxitHKCQMDD16U/view?usp=sharing) for the added figures. In the figures, we find that TACO consistently provides high-quality reconstructions, while:

- LIC-TCM tends to overly smooth out textures (see, e.g., the white wall under the cat)
- MS-ILLM hallucinates small details (see, e.g., the eyes of the cat)

We will add these examples to the supplementary materials of the revised manuscript.

---

> 2. In the supp, in the example of "A young man riding a skateboard in a parking lot", how does TACO work for the white bars below? Will TACO work well for such (possible) aliasing patterns?

Yes, TACO also works well on possible aliasing patterns. We zoom in on two different areas covered with white bars in [this link](https://drive.google.com/file/d/1nxS3pP8ejKvNCV846XO4O-Fwh2z21wBk/view?usp=sharing). We observe that TACO achieves quality better than or similar to existing codecs.
In particular, in the upper image, we observe that TACO better preserves the vertical patterns.

---

We hope you find our response reasonable. Please let us know if you have any remaining concerns or questions.

Best regards,
Authors
