## Research progress / Activity report Paulo Linares (d12922028)
-----------------------------------
<style>
.full-width-bg { background-color: lightgreen; color: #333; padding: 8px 12px; margin-top: 15px; margin-bottom: 15px; line-height: 1.2 }
</style>

### June 24, 2025
<h4 class=full-width-bg>Research Progress</h4>
<div style="text-align: justify;">

1. **Code cleaning and organization of the repository:** During the last week, the project's code was revised, and unused or deprecated scripts and functions were removed. Additionally, the [repository](https://github.com/Vegnics/HourglassNet_RGBD/tree/devel_2) containing the project was reorganized, separating its two main parts: in-bed pose estimation and in-bed action recognition.
2. **Paper revision:** From June 19 to date, the _Related Work_ and _Proposed Methodology_ sections of the manuscript have been revised to correct the grammar and to polish paragraphs containing ambiguous or imprecise ideas. The figures are currently being edited to match the current version of the model (the architecture of the Spatial Attention Mechanism has been changed to adopt a multi-head attention approach). The accuracy of the proposed model is $97.1\%$ PCKh@0.5, outperforming previous works on SLP. Finally, since this work is about developing an accurate, light-weight pose estimation model, I am considering adding a table with the parameter count of the proposed attention mechanisms, to compare them with other modules (ResNet block, convolutional layer, Transformer's Multi-Head Attention, Squeeze-and-Excitation, CBAM) and to justify the parameter overhead.
3. **Functionality test for the action recognition model:** The first version of the in-bed Action Recognition (BedAR) model was developed during the last two weeks of May. This model employs the heatmaps generated by the HuPE model over a sequence of images to determine the action being performed by a subject (9 actions are considered). The data collected with the Intel RealSense L515 camera was employed to train/test the model. A sample of the results obtained with BedAR is shown below.
</div>

<div style="text-align: center;">
<img src="https://i.ibb.co/SDYXNqpV/Bedar-Sitting-sample.gif" width="450">
</div>

<h4 class=full-width-bg>Additional Activities</h4>

<h4 class=full-width-bg>Plans and upcoming activities</h4>

- Finish the proofreading of my manuscript for the following sections:
    - Related Work
    - Proposed Methodology
    - Datasets and Performance Metrics
    - Implementation and Training Details
- Update the Overleaf project.
- Keynote speech (06/30): _From Data to Digital Twin: Enabling the Future of Manufacturing_.

----------------------------

### July 01, 2025
<div style="text-align: justify;">
<h4 class=full-width-bg>Research Progress</h4>

1. **Paper revision**: The _Related Work_ section has been revised entirely. The images included in the _Proposed Methodology_ section are still being edited. Subsections III.A and III.B of the _Proposed Methodology_ have been polished (several paragraphs have been reorganized and shortened). I am currently reviewing Subsection III.C (_Proposed Heatmap Generation Scheme_), as well as modifying the figures. The _Experimental Results_ section has been updated, but some paragraphs still have to be revised. All the mentioned modifications have been uploaded to the project in [Overleaf](https://www.overleaf.com/project/67ff2b6d8e175c36fb96df1d).
<h4 class=full-width-bg>Additional Activities</h4>

- UVA judge:
    - UVA 212: Use of Hospital Facilities.
    - UVA 573: The Snail.
    - UVA 12356: Army Buddies.

<h4 class=full-width-bg>Plans and upcoming activities</h4>

- Finish editing the figures for the _Proposed Methodology_ section.
- Revise the _Experimental Results_ section.
- Draft the subsection about _Qualitative Results_.
</div>

-----------------------

### July 10, 2025
<div style="text-align: justify;">
<h4 class=full-width-bg>Research Progress</h4>

1. **Paper revision**: The images included in the _Proposed Methodology_ section have been modified according to the new multi-head attention approach (**see the images below**). The _Datasets and Performance Metrics_ section has been updated. The _Proposed Methodology_ section is being revised to match the modifications made to the figures. The _Experimental Results_ section is being updated. All the mentioned modifications have been uploaded to the project in [Overleaf](https://www.overleaf.com/project/67ff2b6d8e175c36fb96df1d). The latest quantitative results are presented in the table below.

<div style="text-align: center;font-weight: bold"> HG Feature Attention Mechanism</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/HkxdHSaHeg.png" width="450">
</div>
<div style="text-align: center;font-weight: bold"> HG Spatial Attention Mechanism</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/BJPdHSprge.png" width="450">
</div>

### Overall HuPE (pose estimation) performance comparison on SLP according to PCKh@0.5 ($\uparrow$)

| Method | Modality | C0 | C1 | C2 | Total |
| :--------------------------- | :--------- | :---- | :---- | :---- | :---- |
| SHG+DAug+KD | LWIR | - | - | - | 76.13 |
| HRNet+iAFF | RGB+LWIR | 96.5 | 92.5 | | 94.3 |
| HRNet+Fusion | Depth+LWIR | - | - | - | **97.3** |
| SHG$^\dagger$ | Depth | 97.6 | 96.1 | 95.8 | 96.5 |
| --------- | --------- | --------- | --------- | --------- | --------- |
| HG-1 | Depth | 97.37 | 94.79 | 95.02 | 95.73 |
| HG-2 | Depth | 97.46 | 95.78 | 95.81 | 96.35 |
| HG-At1-1 | Depth | 97.77 | 96.25 | 96.25 | 96.75 |
| HG-At1-1 | RGB-D | 96.3 | 95.7 | 94.6 | 95.53 |
| HG-At4-1 | Depth | 97.70 | 95.96 | 96.31 | 96.65 |
| HG-At6-2 | Depth | **98.08** | **96.78** | **96.56** | 97.14 |

**Notes:** $^\dagger$ 2 residuals per block are employed. Our proposed model only uses 1 residual per block and is more efficient in terms of processing speed and storage.
<small> LWIR: Long-Wave Infrared </small>
**Covering scenarios:** <small> C0: No cover; C1: Thin cover; C2: Thick cover </small>

<h4 class=full-width-bg>Additional Activities</h4>

- UVA judge:
    - UVA 516: Prime Land.
    - UVA 11498: Division of Nlogonia.
    - UVA 10050: Hartals.

<h4 class=full-width-bg>Plans and upcoming activities</h4>

- Finish revising the _Experimental Results_ section.
- Draft the subsection about _Qualitative Results_.
</div>

-------------------------------

### July 17, 2025
<h4 class=full-width-bg>Research Progress</h4>
<div style="text-align: justify;">

1. **Project Development**: The data parsers for the **UTD-MHAD** and **MPII** datasets have been implemented. In order to compare the proposed method with SOTA works on in-bed and regular pose estimation, the experimental trials will include results on 6 datasets (_see the table below_). So far, results have been obtained on **SLP**, **MKV**, and **DCCV-BedPose**.
On the other hand, the parameter count is also being taken into account for comparison. The idea behind this comparison (_check Table XI below_) is to demonstrate that the proposed model (including the attention modules) is a light-weight one. The images for the _Qualitative Results_ section are being edited.

**Benchmarks used for the experimental trials**

| Dataset | Modality | Scenario | Controlled | # Images | Purpose |
|-------------------|------------|--------------|------------|---------|------------|
| [MPII](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset) | RGB | HuPE | No | 25,000 | Train/Test |
| [MKV](https://lmb.informatik.uni-freiburg.de/resources/datasets/KinectDatasets.en.html) | RGB-D | HuPE | Yes | 22,000 | Train/Test |
| [UTD-MHAD](https://personal.utdallas.edu/~kehtar/UTD-MHAD.html) | Depth | HuPE | Yes | - | Train/Test |
| [SLP](https://ostadabbas.sites.northeastern.edu/slp-dataset-for-multimodal-in-bed-pose-estimation-3/) | RGB-D | In-Bed HuPE | Yes | 13,700 | Train/Test |
| DCCV-BedPose | RGB-D | In-Bed HuPE | Yes | 240 | Test |

_HuPE: Human Pose Estimation_

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/HJieDY8Ulg.png" width="600">
</div>

_Occ.: Occlusion level_

<h4 class=full-width-bg>Plans and upcoming activities</h4>

- Draft the subsection about _Qualitative Results_.
- Start the experimental trials on **MPII** and **UTD-MHAD**.
</div>

-------------------------

### August 01, 2025
<h4 class=full-width-bg>Research Progress</h4>
<div style="text-align: justify;">

1. **Project Development**: The implemented data parsers for **UTD-MHAD** and **MPII** are being revised. After checking the pre-processed images (cropped and centered according to the bounding box) comprising the training/testing dataset, several incorrect samples were found: for instance, the subject does not appear in the image, or some limbs are missing after cropping the original images. Besides conducting experimental trials on public datasets, results on **DCCV-BedPose v0.0** are also included. Unlike SLP, DCCV-BedPose contains RGB-D sequences in addition to RGB-D images of subjects under different covering scenarios. The sequences and images were obtained with a RealSense L515 camera in a setting similar to SLP's. The purpose of this dataset is two-fold. First, the models trained on SLP are evaluated on DCCV-BedPose for in-bed pose estimation. Second, in-bed action recognition is conducted on the RGB-D sequences provided by DCCV-BedPose.
2. **Paper revision:** The section regarding _qualitative results_ has been drafted. The purpose of this section is to illustrate the benefits of including the proposed modules in the Baseline. These illustrations include:

- **Heatmap regression accuracy**: Comprises the differences between the heatmaps obtained with the model and the ground truth. The heatmaps obtained with the _Baseline+Proposed Modules_ are compared with the ones computed with the Baseline. This comparison aims to show that these modules yield more focused heatmaps. Additionally, pose estimation results (i.e., skeletons) are included to depict the cases (e.g., specific poses, self-occlusion) for which the model succeeds and fails.
- **Localization of hard limbs under heavy occlusion**: There are some limbs (body joints) whose detection is harder for any pose estimation model; these include the ankles, wrists, and knees (see the images below).
Thus, aside from showcasing the success and failure cases of the proposed model, results depicting the proper localization of these limbs are included. Even though the Right Knee and Right Ankle are self-occluded in the figure below, the proposed model is still able to spot these joints. Nevertheless, it cannot localize the Right Wrist due to a perceived "invisibility" in the depth image.

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/BJuwHk5vxe.jpg" width="500">
</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/rJ66rJqvxx.jpg" width="700">
</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/S1W0rJ9Dee.jpg" width="700">
</div>

- **Regular pose estimation**: As mentioned before, experimental results are also obtained on regular pose estimation datasets (MKV, UTD-MHAD, MPII). Success and failure examples (skeletons) are also provided for these datasets. The purpose of these illustrations is to show that the proposed modules can be extended to regular pose estimation.
- **Multi-head attention**: The working principle of the proposed attention mechanisms is closely related to [CBAM](https://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.html) (_Convolutional Block Attention Module_) and [SE](https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html) (_Squeeze-and-Excitation_). The main difference is the adoption of a multi-head approach (multi-head scaled-dot-product attention is not employed) instead of a single-head one. The advantage of this approach is that different inter-channel relationships can be modeled by the heads comprising HGFAM (feature/channel-wise attention), whereas the heads of HGSAM can focus on different and complementary spatial regions. Examples of these attention maps will be included in this section.

_HGSAM: Proposed Spatial Attention Mechanism_.
_HGFAM: Proposed Feature Attention Mechanism_.

<h4 class=full-width-bg>Plans and upcoming activities</h4>

- Continue writing the _Qualitative Results_ subsection.
- Plot the attention maps from HGSAM.
- Fix the parsers for **MPII** and **UTD-MHAD**.
</div>

----------------------------

### August 8, 2025
<h4 class=full-width-bg>Research Progress</h4>
<div style="text-align: justify;">

<!-- Training on MPII -->
1. **Project Development**: The implemented data parser for **MPII** has been fixed, whereas the one for **UTD-MHAD** is still under revision. The main problem with UTD-MHAD is that the dataset was aimed at action recognition: it includes 3D skeletons obtained with the Kinect tracking SDK, but 2D ground-truth annotations are not provided. Therefore, these annotations are being estimated from the 3D annotations and the calibration parameters of the camera. As shown in the GIF below, there is a mismatch between the RGB and depth streams, as well as the 2D annotations. On the other hand, the baseline model for **MPII** is being trained on Colab.

<div style="text-align: center;">
<img src="https://i.ibb.co/8DT79fWZ/blended.gif" width="1000">
</div>

_Left: RGB stream. Center: Depth stream. Right: Blended streams with annotations_

<!-- Qualitative results, attention maps -->
2. **Paper revision:** As mentioned previously, the attention maps obtained from the multi-head modules comprising the proposed spatial attention mechanism will be included in the qualitative results section. These attention maps are shown in the images below. To ease their visualization, 4 attention heads were employed, and the maps correspond to the last residual of an Hourglass module (64x64 px). The final attention map is obtained through a learned convex combination of the heads' attention maps, and the heads are regularized to have orthogonal outputs. As can be seen, different heads tend to focus on different regions of the subject's body (a minimal sketch of this combination scheme follows).
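For concreteness, the following is a minimal PyTorch sketch of this combination scheme. It is an illustration under my own assumptions rather than the project's implementation: the `MultiHeadSpatialAttention` name, the 1x1-convolution heads, and the softmax used to obtain the convex weights are all assumed. It merges the per-head spatial maps through a learned convex combination and adds an auxiliary penalty that pushes the (flattened) head outputs toward orthogonality.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSpatialAttention(nn.Module):
    """Illustrative sketch: K per-head spatial attention maps over a CxHxW feature
    tensor, merged through a learned convex combination (weights >= 0, summing to 1)."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # One 1x1 convolution per head squeezes the channels into a single spatial map.
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_heads)]
        )
        # Unconstrained logits; a softmax turns them into convex combination weights.
        self.mix_logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x):
        # Per-head maps in [0, 1], shape (B, K, H, W).
        maps = torch.cat([torch.sigmoid(head(x)) for head in self.heads], dim=1)
        weights = F.softmax(self.mix_logits, dim=0)                   # convex weights
        fused = torch.einsum("k,bkhw->bhw", weights, maps).unsqueeze(1)
        return x * fused, maps                                         # attended features, per-head maps

def orthogonality_penalty(maps: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss pushing the flattened head maps toward mutually orthogonal outputs."""
    b, k, h, w = maps.shape
    flat = F.normalize(maps.reshape(b, k, h * w), dim=-1)              # unit-norm per head
    gram = flat @ flat.transpose(1, 2)                                 # (B, K, K) cosine similarities
    deviation = gram - torch.eye(k, device=maps.device)                # diagonal is ~0 after normalization
    return (deviation ** 2).mean()

# Usage sketch on 64x64 features (e.g., the last residual of an Hourglass module).
x = torch.randn(2, 256, 64, 64)
attention = MultiHeadSpatialAttention(channels=256, num_heads=4)
y, head_maps = attention(x)
aux_loss = orthogonality_penalty(head_maps)
```

Only the convex-combination and orthogonality ideas are taken from the report; the actual HGSAM heads may compute their maps differently.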
**Example 1:** _Uncover scenario_

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/ByIi587ugg.jpg" width="600">
</div>

**Example 2:** _Thin cover scenario_

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/SJKocL7Ogg.jpg" width="600">
</div>
</div>

<h4 class=full-width-bg>Plans and upcoming activities</h4>

* Fix the parser for UTD-MHAD.
* Finish training the baseline on MPII, and start training the models used for ablations on this dataset.

-------------

### August 21, 2025
<h4 class=full-width-bg> Research Progress </h4>
<div style="text-align: justify">

1. **Project development:** The parser for **UTD-MHAD** has been fixed. Accordingly, all the data parsers for the datasets included in the manuscript have been implemented. The baseline for MPII has been trained, and the models used for ablations (_see the table below_) are currently being trained.
</div>

<span style="font-size: 12px;">

|Model name |S2F Layers|F2S Layers|Skip Layers|Use 1J heatmaps?|Use 2J heatmaps?|Description|
|-----------|----------|----------|-----------|-------|-------|-----------|
|**HG-1** | - | - | - | ✅ | ❌ | Baseline |
|**HG-2** | - | - | - | ✅ | ✅ | Baseline with 2J heatmaps|
|**HG-AttFNN-1**| HGFAM | - | - | ✅ | ❌ | HG-1 with HGFAM Attention|
|**HG-AttFNN-2**| HGFAM | - | - | ✅ | ✅ | HG-2 with HGFAM Attention|
|**HG-AttFFN-1**| HGFAM | HGFAM | - | ✅ | ❌ | HG-1 with HGFAM Attention|
|**HG-AttFSN-1**| HGFAM | HGSAM | - | ✅ | ❌ | HG-1 with HGFAM/HGSAM Attention|
|**HG-AttSFN-1**| HGSAM | HGFAM | - | ✅ | ❌ | HG-1 with HGFAM/HGSAM Attention|
|**HG-AttFSN-2**| HGSAM | HGFAM | - | ✅ | ✅ | HG-2 with HGFAM/HGSAM Attention|
|**HG-AttFSF-1**| HGFAM | HGSAM | HGFAM | ✅ | ❌ | HG-1 with HGFAM/HGSAM Attention|
</span>

<div style="text-align: center; font-size: 12px;"> **Hourglass module** </div>
<div style="display: flex; justify-content: center;">
<img src="https://hackmd.io/_uploads/r1HeA_4Kxg.png" width="400" title="Hourglass module with attention modules placed at different layer types.">
</div>

<div style="text-align: justify">

The table above details where the attention mechanisms (HGFAM, HGSAM) are incorporated within an Hourglass module, as well as whether 2-Joint heatmaps are included during training (a simplified sketch follows).
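To clarify how the configurations in the ablation table map onto an Hourglass module, below is a heavily simplified, single-level PyTorch sketch (my own illustration, not the project's code). I am assuming that the S2F and F2S columns refer to the downsampling and upsampling paths and that Skip refers to the skip branch; the `residual` helper, the `HourglassWithAttention` name, and its signature are invented for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def residual(channels: int) -> nn.Module:
    """Single residual unit (the report states that 1 residual per block is used)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )

class HourglassWithAttention(nn.Module):
    """Simplified single-level hourglass with optional attention modules at the three
    layer types of the ablation table: S2F (down path), F2S (up path), and Skip."""
    def __init__(self, channels, attn_s2f=None, attn_f2s=None, attn_skip=None):
        super().__init__()
        self.down = nn.Sequential(nn.MaxPool2d(2), residual(channels))   # S2F layers
        self.bottom = residual(channels)
        self.up = residual(channels)                                      # F2S layers
        self.skip = residual(channels)                                    # Skip layers
        self.attn_s2f, self.attn_f2s, self.attn_skip = attn_s2f, attn_f2s, attn_skip

    def forward(self, x):
        skip = self.skip(x)
        if self.attn_skip is not None:
            skip = self.attn_skip(skip)
        low = self.down(x)
        if self.attn_s2f is not None:
            low = self.attn_s2f(low)
        low = self.up(self.bottom(low))
        low = F.interpolate(low, scale_factor=2, mode="nearest")
        if self.attn_f2s is not None:
            low = self.attn_f2s(low)
        return skip + low

# e.g., HG-AttFSN-1 would correspond to
# HourglassWithAttention(ch, attn_s2f=HGFAM(ch), attn_f2s=HGSAM(ch)),
# where HGFAM / HGSAM are the proposed attention modules (not defined here).
```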
The training progress for the models used in the ablation studies is detailed in the table below.
</div>

<div style="display: flex; justify-content: center;">

|Model name |SLP|MKV|UTD-MHAD|MPII|
|-----------|---|---|--------|----|
|**HG-1** | Done✅ | Done✅ | Done✅ | Done✅ |
|**HG-2** | Done✅ | Done✅| In progress$^\ddagger$ ☕ | In progress$^\ddagger$ ☕ |
|**HG-AttFNN-1**| Done✅ | Done✅ | In progress$^\ddagger$ ☕ | In progress$^\ddagger$ ☕ |
|**HG-AttFNN-2**| Done✅ | Done✅ | Not Yet | Not Yet |
|**HG-AttFFN-1**| Done✅ | Done✅ | Not Yet | Not Yet |
|**HG-AttFSN-1**| Done✅ | Done✅ | Not Yet | Not Yet |
|**HG-AttSFN-1**| Done✅ | In progress$^\dagger$ | Not Yet | Not Yet |
|**HG-AttFSN-2**| Done✅ | In progress$^\dagger$ | Not Yet | Not Yet |
|**HG-AttFSF-1**| Done✅ | In progress$^\dagger$ | Not Yet | Not Yet |
</div>

$^\dagger$: Due to a change in the architecture, these models are being trained again.
$^\ddagger$: These models are being trained on Colab.

<div style="text-align: justify">

2. **Paper revision:** Experimental results, both quantitative and qualitative, have been obtained for **SLP**. As mentioned before, the proposed spatial attention module comprises 4 heads, and each head generates an attention map. The images depicting these attention maps are currently being edited (it is desirable to show special cases where attention is necessary to deal with heavy occlusion). On the other hand, the results for **MKV** are being recomputed due to a change in the model's architecture. The experimental trials are still ongoing for **MPII** and **UTD-MHAD**. It is worth mentioning that the latest version of the manuscript has not been uploaded to Overleaf yet; to ease the writing, I am using a local copy containing all the results already mentioned.
</div>

<h4 class=full-width-bg> Plans and upcoming activities </h4>

* Finish training the remaining models used for ablations.
* Gather the experimental results for the remaining benchmarks.
* Update the manuscript in Overleaf.

------------------------------------------

### December 03, 2025
<h4 class=full-width-bg> Research Progress </h4>
<div style="text-align: justify">

<u> **In-bed pose estimation and action recognition** </u>

According to the previous reports and updates, the first part of the research on in-bed pose estimation has come to an end. The paper about in-bed pose estimation has been submitted to [IEEE Transactions on Image Processing](https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=83) (IF = 13.7) and is still under review. In case it is rejected, the following journals would be considered as the next options:

- [IEEE Transactions on Cybernetics](https://ieeexplore.ieee.org/Xplore/home.jsp): IF = 10.5
- [IEEE Transactions on Multimedia](https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6046): IF = 9.7
- [International Journal of Computer Vision](https://link.springer.com/journal/11263?): IF = 9.3
- [Machine Intelligence Research](https://link.springer.com/journal/11633?): IF = 8.7
- [Pattern Recognition](https://www.sciencedirect.com/journal/pattern-recognition/about/insights): IF = 7.6
- [IEEE Journal on Biomedical and Health Informatics](https://www.embs.org/jbhi/statistics/): IF = 6.8

All the above-mentioned journals are _SCIE_-indexed, and their scopes are aligned with the manuscript's. Accordingly, the first option would be **IEEE Transactions on Cybernetics**. It is worth mentioning that top-tier IEEE journals have a long first-response time, whereas the peer-review process is shorter for Elsevier and Springer journals.
On the other hand, the next part of the project on in-bed pose estimation comprises the incorporation of the developed model into already deployed heatmap-based action recognition models. In fact, this had been done before obtaining the final experimental results of the first manuscript. According to these first findings on in-bed action recognition, **the accuracy of the pose estimation model (e.g., PCKh@0.5) does not have a critical impact on action recognition performance**. Additionally, the model employed to obtain these findings is a novel light-weight action recognition model (BedAR), which will be introduced in the next manuscript.

In addition to detailing the experiments and results in the next manuscript, the first version of the **DCCV-BedPose benchmark will be publicly released**. The benchmark utilization protocols and evaluation metrics will be described in detail for researchers interested in this topic. Currently, the benchmark contains very few samples (3 subjects performing 9 actions, with 3 repetitions per action). Thus, part of the experimental setup for the next manuscript will include **data collection and augmentation**. To do this, RGB-D sequences will need to be collected in cooperation with the lab members. Additionally, **purchasing a new RGB-D camera** (Intel RealSense D435 or L515) is necessary.

<h4 class=full-width-bg> Plans and upcoming activities </h4>

<u> **Research plan (2026)** </u>

Besides working on the in-bed action recognition projects, other topics are being proposed for upcoming manuscripts during 2026. The current ideas are presented as follows:

1. **Low-rank Feature Representation within ResNet Blocks**: For this project, I am extending the idea of [Tensor Low-Rank Reconstruction for Semantic Segmentation](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123620052.pdf) (ECCV 2020). In that work, a feature map $X \in \mathbb{R}^{C \times H \times W}$ is used to build a **low-rank attention tensor** $A$ via a CP (_CANDECOMP/PARAFAC_) decomposition: $$A = \sum_{i=1}^{r} \lambda_i \left( v^c_i \otimes v^h_i \otimes v^w_i \right),$$ where $r$ is the desired tensor rank, $v^c_i, v^h_i, v^w_i$ are obtained through global average pooling along each axis, and the final output is $Y = A \odot X$. My idea is to keep the low-rank direction but avoid the global-pooling bottleneck. Given $X \in \mathbb{R}^{B \times C \times H \times W}$, I define a low-rank mapping: $$ \mathrm{LowRankFeat}(X) = \mathrm{reshape}\!\left( W_{\mathrm{up}} W_{\mathrm{down}} \, \mathrm{vec}(X) \right), \qquad \mathrm{rank}(W_{\mathrm{up}} W_{\mathrm{down}}) \le r, $$ where $r$ is the specified rank, and $\mathrm{vec}(\cdot)$ is implemented via patch splitting and flattening, including 2D positional encodings, similar to ViT (Vision Transformer). As can be seen, fully-connected layers, rather than convolutional layers, are employed to generate the low-rank features from the input features $X$. This module is integrated inside a ResNet block as $$\mathrm{OUT}(X) = X + \mathrm{ResConvBlock}(X) + \mathrm{LN}\!\left( \mathrm{LowRankFeat}(X) \right).$$ In summary, the **key idea** is to complement the local features (obtained through convolutional layers) with these global low-rank representations inside a ResNet block; a minimal sketch is given right after this item. The **final outcome of this project** is a novel low-rank+local representation learning approach, which can be applied to both conventional CNN pipelines and hybrid CNN-transformer models.
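As an illustration only, the sketch below implements this idea in PyTorch under my own assumptions about the unspecified details: the patch size, a learned positional embedding over the patch grid, applying the rank-$r$ bottleneck $W_{\mathrm{up}} W_{\mathrm{down}}$ per patch token, and folding the tokens back to a $C \times H \times W$ map. It is not the project's implementation.

```python
import torch
import torch.nn as nn

class LowRankFeat(nn.Module):
    """Sketch of the low-rank global branch: ViT-style patch splitting/flattening,
    a rank-r bottleneck (W_up @ W_down), and folding back to a CxHxW feature map."""
    def __init__(self, channels: int, height: int, width: int, patch: int = 8, rank: int = 16):
        super().__init__()
        token_dim = channels * patch * patch
        n_tokens = (height // patch) * (width // patch)
        # Learned positional embedding over the 2D patch grid (simplified positional encoding).
        self.pos = nn.Parameter(torch.zeros(1, token_dim, n_tokens))
        self.w_down = nn.Linear(token_dim, rank, bias=False)   # W_down
        self.w_up = nn.Linear(rank, token_dim, bias=False)     # W_up, so rank(W_up W_down) <= r
        self.unfold = nn.Unfold(kernel_size=patch, stride=patch)
        self.fold = nn.Fold(output_size=(height, width), kernel_size=patch, stride=patch)

    def forward(self, x):                              # x: (B, C, H, W)
        tokens = self.unfold(x) + self.pos             # patch splitting + flattening: (B, C*p*p, N)
        tokens = tokens.transpose(1, 2)                # (B, N, C*p*p)
        low_rank = self.w_up(self.w_down(tokens))      # rank-r mapping applied per token
        return self.fold(low_rank.transpose(1, 2))     # reshape back to (B, C, H, W)

class LowRankResBlock(nn.Module):
    """OUT(X) = X + ResConvBlock(X) + LN(LowRankFeat(X))."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.conv_block = nn.Sequential(               # stand-in for the ResNet conv block
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.low_rank = LowRankFeat(channels, height, width)
        self.norm = nn.GroupNorm(1, channels)          # LayerNorm-like normalization over (C, H, W)

    def forward(self, x):
        return x + self.conv_block(x) + self.norm(self.low_rank(x))

out = LowRankResBlock(channels=64, height=64, width=64)(torch.randn(2, 64, 64, 64))  # (2, 64, 64, 64)
```

Because the mapping is the composition of two linear layers through an $r$-dimensional bottleneck, its rank is at most $r$, matching the constraint above.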
2. **Skeleton-scheme-agnostic action recognition model**: This is a continuation of the BedAR (Bed Action Recognition) model proposed before. This time, I am proposing to prompt the model with the skeleton scheme as well as the action labels. Additionally, the model will operate with a maximum heatmap context window. Thus, the **overall final goal** is to obtain a heatmap-based action recognition model that can be placed on top of any heatmap-based pose estimation model, regardless of the skeleton scheme and the action vocabulary. This would produce a **plug-and-play action recognition model** that could be used across different datasets, keypoint configurations, and action label spaces.

<u> **Upcoming conferences:** </u>

The calendar of upcoming conferences in 2026 is shown in the table below. According to the research plan, the work on **Low-rank Feature Representation within ResNet Blocks** is planned to be submitted to ECCV (main track) first. In case it is rejected or we miss the submission deadline, it would be submitted to an ECCV workshop. On the other hand, the work on the **skeleton-scheme-agnostic action recognition model** is meant to be submitted to **ACCV** or **VCIP**, depending on the relevance of the results (comparison with SOTA works). Given that this is the initial research plan and modifications are likely to occur, **TAAI** would be a good option for submitting simplified versions of these works. Furthermore, other incremental works (applications) could be submitted to this conference.
</div>

|Conference|Submission Deadline|Workshop?|Workshop paper submission deadline|Conference Date|
|----------|-------------------|---------|----------------------------|----|
|International Conference on Image Processing ([ICIP 2026](https://2026.ieeeicip.org))|01/21/2026|❌| --- |09/13/2026 -- 09/17/2026|
|European Conference on Computer Vision ([ECCV 2026](https://eccv.ecva.net/))|03/06/2026|✅|~July 15, 2026|09/08/2026 -- 09/13/2026|
|Asian Conference on Computer Vision ([ACCV 2026](https://accv2026.org/))|07/15/2026|✅|Unknown|12/14/2026 -- 12/18/2026|
|International Conference on Visual Communication and Image Processing ([VCIP 2026](https://vcip-2026.org/))|~07/20/2026|❌| --- |12/13/2026 -- 12/16/2026|
|International Conference on Technologies and Applications of Artificial Intelligence (TAAI 2026)|~09/10/2026|❌| --- | Unknown |

------------------------------------------

### January 22, 2026
<div style="text-align: justify">

Following the research schedule for 2026, I have been working on the paper that will be submitted to ECCV 2026 (main track and/or workshop). The following part of my report elaborates on the proposal for this project, which addresses the development of an **effective and efficient visual representation learning framework through low-rank feature generation**. Additionally, I include a brief description of recent related works, SOTA methods for comparison, benchmarks, and proposed ablations.

<h4 class=full-width-bg> Research Progress </h4>

<u> **Problem context and motivation** </u>

Since the inception of the Vision Transformer, plenty of work has been conducted on the development of scalable Large Vision Models (LVMs). The main problem with the **prototype ViT** was its **poorer performance on dense prediction tasks** (semantic segmentation, object detection) compared with CNN backbones. Up to 2025, a promising approach to overcome this pitfall has been the **hybridization of ViT-based backbones with convolutional blocks** (see the table below).
The literature shows that the main methodologies to accomplish this include the **complete replacement of FC (perceptron) layers** with convolutional ones, with the corresponding modifications to the token embedding block, self-attention module, and feedforward layers. Another approach is the **partial inclusion of convolutional layers**, where token/feature flattening and reshaping are employed to introduce convolutions while maintaining the core of ViT. Other works have focused on very specific setups aiming to compensate for the **low global representational power of CNNs** with transformer-like blocks: context similarity/local affinity, and **cross-attention between parallel CNN and Transformer backbones**. It is worth mentioning that another key aspect addressed in the reviewed works is the **computational efficiency** of these LVMs (maintaining a low parameter count and FLOPs). On the other hand, the **scalability of pure transformer LVMs** (on the order of 10B parameters) has been addressed (see Swin Transformer V2, DINOv3), aiming to reach the size of LLMs.

<u> **Hybrid CNN-Transformer backbones for visual tasks (classification, segmentation, detection)** </u>

|Method | Publication instance | Main contributions |Remarks | Links|
|---|----|-----|----|-----|
|**RepViT** (Revisiting MobileNet from a ViT perspective)| CVPR 2024| <ul> <li> Modernizes MobileNet by introducing a new block design (token mixer-channel mixer), macro design (placement of RepViT blocks), and micro design (fine details of the RepViT blocks). </li> <li> SAM-RepViT: Replacement of the SAM image encoder with RepViT. Comparable performance with heavy SAM variants. </li></ul>| <ul> <li> Supervised pre-training on ImageNet-1K. </li> <li> Integration of pre-trained RepViT into downstream tasks: object detection, semantic segmentation. </li> </ul>| [Repo](https://github.com/THU-MIG/RepViT) [Paper](https://arxiv.org/abs/2307.09283) |
|**MetaFormer** (IdentityFormer, RandFormer, ConvFormer, CAFormer)|TPAMI 2024| <ul> <li> Proposes different token mixers for convolutional transformers (identity, random, convolutional). </li> <li> The proposed Convolution-Attention transformer (CAFormer) sets a new record on ImageNet-1K. </li> <li> New activation function StarReLU, cheaper than GELU while showing better performance. </li> </ul> |Explorative backbones for CNN-based transformers| [Repo](https://github.com/sail-sg/metaformer) [Paper](https://arxiv.org/abs/2210.13452) |
|**UniFormer**|TPAMI 2023| <ul> <li> Dynamic positional encoding (robust to image resolution changes) in the form of a DW convolution. </li> <li> Multi-head Relation Aggregation: local affinity between neighboring _"sub-tokens"_, and context similarity among tokens (global). </li> </ul>| <ul> <li> Hybrid CNN-Transformer. </li> <li> Focuses on making the models lightweight. </li></ul>|[Repo](https://github.com/Sense-X/UniFormer) [Paper](https://arxiv.org/abs/2201.09450) |
|**Mobile-Former**|CVPR 2022| <ul> <li> 2-way bridge between MobileNet and Transformers through cross-attention between the 2 backbones. </li> <li> The Mobile-Former block has 4 parts: mobile, former, mobile-former, former-mobile. </li> <li> Adoption of [Dynamic ReLU](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123640341.pdf) in the mobile sub-block. </li> </ul>| <ul> <li> Hybrid CNN-Transformer. </li> <li> Focuses on effectively encoding both local processing and global interaction. </li> </ul>| [Repo](https://github.com/AAboys/MobileFormer) [Paper](https://arxiv.org/abs/2108.05895) |
|**Swin Transformer V2**| CVPR 2022 | <ul> <li> Evidences the activation issues that appear when scaling up ViTs; a residual post-norm configuration is proposed to address this problem. </li> <li> Highlights the necessity of adopting self-supervised pre-training for large vision models (LVMs). </li> <li> Adopts a hybrid CNN-transformer architecture compared to the first pure ViT version (Swin Transformer V1). </li> </ul> |<ul> <li> Hybrid CNN-Transformer. </li> <li> Includes implementations for cutting GPU memory requirements (zero-redundancy optimizer, sequential self-attention). </li> </ul>| [Repo](https://github.com/ChristophReich1996/Swin-Transformer-V2) [Paper](https://arxiv.org/pdf/2111.09883)|
|**BoTNet** (Bottleneck Transformers) | CVPR 2021| <ul> <li> Following bottleneck ResNets, introduces a convolutional MHSA within the bottleneck block. </li> <li> Applies global self-attention to features from CNNs. </li> <li> Relative position encodings show better performance than absolute encodings for vision tasks. </li> </ul>| <ul> <li> Hybrid CNN-Transformer. </li> <li> Presents an informative taxonomy of vision models: CNN-based, ViTs, hybrids. </li> </ul>| [Repo](https://github.com/leaderj1001/BottleneckTransformers) [Paper](https://arxiv.org/abs/2101.11605)|
|**CvT** (Convolutional Vision Transformer)| ICCV 2021| <ul> <li> Replaces the ViT token embedding with a convolutional embedding. </li> <li> Convolutional transformer block: convolutional projection, flattening, reshaping. </li> </ul> | <ul> <li> Hybrid CNN-transformer. </li> <li> One of the first works on using convolutions within the ViT backbone. </li> </ul> | [Repo](https://github.com/microsoft/CvT) [Paper](https://arxiv.org/abs/2103.15808)|
|**PVT** (Pyramid Vision Transformer)| ICCV 2021| <ul> <li> PVT V2 replaces the original pure transformer approach (PVT V1) with a hybrid CNN-transformer one. </li> <li> Linear spatial-reduction attention (pooling to reduce the input token sequence to a fixed spatial size). </li> <li> Overlapping patch embedding with convolutions, and a convolutional feedforward network (FFN). </li> </ul> | <ul> <li> Hybrid CNN-transformer. </li> <li> Points out the limitations of PVT V1: impractical computational complexity for high-resolution images; non-overlapping patches break local continuity. </li> </ul> |[Repo](https://github.com/whai362/PVT) [Paper](https://arxiv.org/pdf/2106.13797)|

<u> **Problem definition**</u>

According to the context described above, foundation vision models require efficient and effective visual representation learning architectures and training regimes. These representations should capture both local relationships and global dependencies within images. Different from previous works (2021-2025), I am proposing to generate local and global features in a single module, without using separate backbones and without feature flattening-reshaping techniques. To accomplish this, I keep regular convolutional modules (sequential convolutions, batch normalization, activation) for local feature extraction, and compute low-rank global features in parallel. Both groups of features are combined in the same module. A simplified version of the module's design is shown below:

![lrresnet](https://i.ibb.co/FL3K875r/lrresnet.jpg)

The idea is simple: I keep the original ResNet bottleneck convolutional block for local feature extraction. The input features have a shape of $N \times N$. Feature patching and tokenization are applied to the input features to obtain two low-rank components ($\mathbf{H}$, $\mathbf{V}$). These components (with shapes $N\times r$ and $r\times N$) are generated by two modules (depicted in red and blue) and then multiplied to produce low-rank $N \times N$ global features. The local and global features are concatenated and fused through a $1 \times 1$ convolutional layer. Finally, the output is added to the residual connection (a minimal sketch of this data flow is given below).
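The PyTorch sketch below is only meant to make the block's data flow concrete. How the $\mathbf{H}$ and $\mathbf{V}$ generators (the red and blue modules) are actually built is precisely open question 5 listed afterwards, so the row/column pooling, the linear projections, and the channel sizes used here are placeholder assumptions of mine.

```python
import torch
import torch.nn as nn

class LowRankGlobalBranch(nn.Module):
    """Global branch sketch: build H (N x r) and V (r x N) factors per global channel
    and multiply them to obtain rank-r, N x N global feature maps."""
    def __init__(self, in_ch: int, global_ch: int, rank: int):
        super().__init__()
        self.global_ch, self.rank = global_ch, rank
        # Assumption: the H / V generators ("red" and "blue" modules) are plain linear
        # projections of per-row / per-column descriptors of the input features.
        self.to_h = nn.Linear(in_ch, global_ch * rank)   # row tokens    -> H factors
        self.to_v = nn.Linear(in_ch, global_ch * rank)   # column tokens -> V factors

    def forward(self, x):                                # x: (B, C, N, N)
        b, c, n, _ = x.shape
        rows = x.mean(dim=3).transpose(1, 2)             # (B, N, C) per-row descriptors
        cols = x.mean(dim=2).transpose(1, 2)             # (B, N, C) per-column descriptors
        h = self.to_h(rows).view(b, n, self.global_ch, self.rank)   # H:   N x r per channel
        v = self.to_v(cols).view(b, n, self.global_ch, self.rank)   # V^T: N x r per channel
        return torch.einsum("bngr,bmgr->bgnm", h, v)     # H @ V -> (B, G, N, N), rank <= r

class LocalGlobalBlock(nn.Module):
    """Local bottleneck conv branch + low-rank global branch, concatenated,
    fused by a 1x1 convolution, and added to the residual connection."""
    def __init__(self, channels: int, local_ch: int = 64, global_ch: int = 16, rank: int = 8):
        super().__init__()
        self.local = nn.Sequential(                      # ResNet-style bottleneck (local features)
            nn.Conv2d(channels, local_ch, 1), nn.BatchNorm2d(local_ch), nn.ReLU(),
            nn.Conv2d(local_ch, local_ch, 3, padding=1), nn.BatchNorm2d(local_ch), nn.ReLU(),
        )
        self.global_branch = LowRankGlobalBranch(channels, global_ch, rank)
        self.fuse = nn.Conv2d(local_ch + global_ch, channels, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([self.local(x), self.global_branch(x)], dim=1)  # concat local + global
        return x + self.fuse(feats)                      # 1x1 fusion + residual connection

out = LocalGlobalBlock(channels=128)(torch.randn(2, 128, 56, 56))  # (2, 128, 56, 56)
```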
According to this proposal, there are some open questions that will be resolved during the project:

1. How can we develop a lightweight hybrid model with a performance comparable to SOTA?
2. What type of analysis (theoretical and empirical) can be applied to demonstrate the representativeness of the low-rank features?
3. Which backbones may be used for conducting the experiments on low-rank feature extraction?
4. What configurations of the proposed low-rank block (within a ResNet module) can be adopted for performance comparison (at least 3)?
5. **[MOST IMPORTANT]**: How will tokenization and low-rank global feature generation be conducted?

<u> **Other SOTA methods for comparison (hybrids and pure transformer-based ViTs)** </u>

In order to verify the performance of the proposed approach on visual tasks (image classification, semantic segmentation, object detection), the following SOTA works will be considered for comparison.

|Method|Tasks|Remarks|
|---|----|----|
|[Swin Transformer](https://openaccess.thecvf.com/content/CVPR2022/papers/Liu_Swin_Transformer_V2_Scaling_Up_Capacity_and_Resolution_CVPR_2022_paper.pdf)| <ul> <li> Image recognition </li> <li> Object detection </li> <li> Semantic segmentation </li> <li> Video action recognition </li></ul>| <ul> <li> Implemented at 6 scales: tiny, small, base, large, huge, giant. </li> <li> Adopts self-supervised pre-training. </li> </ul> |
|[A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545)| <ul> <li> Image recognition </li> <li> Object detection </li> <li> Semantic segmentation </li> </ul> |<ul> <li> Supervised pre-training on ImageNet-22K. </li> <li> Pre-training at $224 \times 224$; fine-tuning at $224 \times 224$ and $384 \times 384$. </li> </ul> |
|[Masked Autoencoders](https://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf)| <ul> <li> Image recognition </li> <li> Object detection </li> <li> Semantic segmentation </li> <li> Instance segmentation </li> </ul>|<ul> <li> Self-supervised pre-training on ImageNet-1K. </li> <li> Transfer learning to downstream tasks through partial fine-tuning, full fine-tuning, and linear probing. </li> </ul>|
|[ConvNeXt V2](https://arxiv.org/pdf/2301.00808)|<ul> <li> Image recognition </li> <li> Object detection </li> <li> Semantic segmentation </li> <li> Instance segmentation </li> </ul>| <ul> <li> Self-supervised pre-training on ImageNet-1K. </li> <li> Fully convolutional masked autoencoder. </li> </ul> |
|[SegFormer](https://arxiv.org/pdf/2105.15203)| Semantic segmentation | <ul> <li> Fully transformer-based backbone with overlapping patch embedding through convolutions, and a $3 \times 3$ convolution within the FFN of the transformer blocks. </li> <li> Uses a hierarchical encoder and merges multi-scale features in the decoder. </li> <li> Encoder pre-training on ImageNet-1K. </li></ul>|
|[DINO (V1-V3)](https://arxiv.org/pdf/2508.10104)| <ul> <li> Image recognition </li> <li> Object detection </li> <li> Semantic segmentation </li> <li> *Other more specific tasks that are not covered by other foundation vision models.* </li> </ul> | <ul> <li> Introduces a self-distillation (student and teacher networks trained together) self-supervised framework (DINO v1). </li> <li> Fully transformer-based foundation model; uses linear probing for downstream tasks (DINO v3). </li> <li> Focuses on obtaining emergent properties (as in LLMs) by developing very large vision models at scale, i.e., foundation models (DINO v3). </li> </ul>|

<u> **Benchmarks (image classification, semantic segmentation, object detection)** </u>

|Benchmark|Task|Remarks| Metrics |Downloadable? [size]|
|---|----|------| ---- |---|
|ImageNet| Image classification| Adopted in supervised and self-supervised vision foundation models | Top-K (e.g., Top-1, Top-5) accuracy | 167 GB, Torch|
|CUB-200-2011 Birds| Image classification | May be used for linear probing from a pre-trained model | Top-K (e.g., Top-1, Top-5) accuracy| 1.1 GB |
|Omniglot| Image classification | Potential use for few-shot learning: classification of new classes from very few examples | Top-1 accuracy | 10 MB, Torch|
|Places 205, 365| Image classification| Fine-tuning after pre-training on ImageNet | Top-1, Top-5, mean accuracies| Torch |
|ADE20K|Semantic segmentation| Adopted in most foundation models for obtaining experimental results | mean Intersection-over-Union (mIoU) | Torch|
|Pascal VOC 2012| Semantic segmentation, object detection| Adopted for experimental results in other foundation models | mIoU (segmentation), mean Average Precision (mAP) (detection) | Torch|
|MSCOCO | Object detection| Used in most works on object detection | mAP, and mAP at other IoU thresholds T (mAP@T) | Torch|

<u> **Ablations** </u>

The optimal setup of the proposed local-global combination module will be found through the following ablation studies:

|Ablation type|Purpose|How?|
|------|-------|----|
|Auxiliary losses|Show that using auxiliary losses at the low-rank modules may improve the overall performance|Enforce global feature diversity (i.e., prevent redundancy) by including auxiliary losses (a sketch is given at the end of this report)|
|Feature split percentage|Show that, under a fixed total feature count, the ratio between the numbers of global and local features has a significant impact on different visual tasks| Include a parameter within the implemented module to select how many global and local features will be employed|
|Tensor rank| Empirically validate the hypothesis that adjusting the rank of the global features enhances the overall performance |Include a parameter to specify the global-feature tensor rank within the module implementation|

<h4 class=full-width-bg> Plans and upcoming activities </h4>

- Implement the mentioned feature patching and tokenization procedure.
- Obtain the first experimental results on image classification with ResNet18 or ResNet50 as the backbone.
- Define the placement of the proposed ResNet module in both hybrid CNN-transformer and CNN backbones.
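As a reference for the auxiliary-loss ablation above, here is a minimal sketch of one possible diversity loss; it is my own assumption of how "enforcing global feature diversity" could be implemented, mirroring the head-orthogonality idea used earlier for the spatial attention maps. It penalizes the deviation of the Gram matrix of the (flattened, normalized) low-rank global feature maps from the identity.

```python
import torch
import torch.nn.functional as F

def global_feature_diversity_loss(global_feats: torch.Tensor) -> torch.Tensor:
    """Penalize redundancy among the low-rank global feature maps.

    global_feats: (B, G, N, N) output of the low-rank global branch. Each map is
    flattened and L2-normalized; the loss is the mean squared deviation of the
    G x G Gram (cosine-similarity) matrix from the identity, so decorrelated
    maps give a loss close to zero.
    """
    b, g, n, _ = global_feats.shape
    flat = F.normalize(global_feats.reshape(b, g, n * n), dim=-1)
    gram = flat @ flat.transpose(1, 2)                       # (B, G, G)
    eye = torch.eye(g, device=global_feats.device)
    return ((gram - eye) ** 2).mean()

# Usage sketch: add it to the task loss with a small weight (the 0.1 here is arbitrary).
# total_loss = task_loss + 0.1 * global_feature_diversity_loss(global_feats)
```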