# Headless Rendering

A personal study on how to leverage complex rendering on supercomputers.

## Progress 9/12/24

I am studying VirtualGL in more detail. I got the source code, and now I am trying to understand how it fakes the LD library path to make the OS think a display is attached to the hardware.

Currently, I am building the source code on my own. Some problems arose when trying this on a headless machine:

- No root permissions to install some libraries.
- Making from source presents the same problem: some special libraries need to be installed.

Maybe I will try Docker or some other virtual environment. After succeeding in building from source on headless machines, I will proceed to play around with the source code. The final question is how to couple this with Unreal Engine or Unity.

## Progress 20/01/2025

I am understanding the concept of remote rendering better:

- Render locally.
- Render remotely with local support: X forwarding.
- Render remotely with remote support: VirtualGL.

We should not confuse:

- Headless rendering: no display attached; rendering goes into GPU RAM. This is a very specific way to render, and not many applications support it.
- Remote rendering: looks like local rendering; the application renders to a fake display. This is the general way to render remotely, and all applications should work here.
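To make the remote-rendering path concrete, here is a minimal sketch of launching an OpenGL app through `vglrun` on a headless server. This is my assumption of the setup, not a tested recipe: it assumes VirtualGL 3.x is installed, an X proxy (e.g., Xvfb or TurboVNC) is already serving `:1`, and `glxinfo` is available; to my understanding, `-d egl` selects VirtualGL's EGL back end so no 3D X server is needed.

```
import os
import subprocess

# Sketch only: run an OpenGL app under VirtualGL on a headless server.
# vglrun preloads its interposer library (via LD_PRELOAD) so the app's
# GL calls go to the server GPU while 2D output goes to the proxy display.
env = dict(os.environ, DISPLAY=":1")  # the proxy (2D) display, e.g. from Xvfb

subprocess.run(
    ["vglrun", "-d", "egl", "glxinfo", "-B"],  # -d egl: EGL back end, no 3D X server
    env=env,
    check=True,
)
```

If this prints the server GPU as the OpenGL renderer, the interposer is working; the same wrapper should then apply to an Unreal Engine or Unity binary.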
### Progress on building from source

There are a lot of dependencies to build. One big problem now on ABCI is:

- TurboJPEG
- libXv-devel.i686 libXext-devel.i686 libXtst-devel.i686 libX11-devel.i686 libxcb-devel.i686 xcb-util-keysyms-devel.i686 mesa-libGLU-devel.i686 mesa-libGL-devel.i686 mesa-libEGL-devel.i686 glibc-devel.i686 libstdc++-devel.i686 libstdc++-static.i686

On my local machine it compiles (I still need to perform basic tests). On ABCI it does not compile: I need to install those dependencies, especially TurboJPEG. Possible solutions:

- Use Docker.
- Manually compile TurboJPEG.

## Progress 10/02/2025

First, I was working on my submission to the AIST paperwork. Second, I ran more experiments on Style Fractals and gave a presentation at HPCI last Tuesday.

### Demo on VGL

I was able to run VirtualGL even from a MacBook. Two tasks for this week:

- Run on Hinadori.
- Run on ABCI.

## Progress 2/02/2025

I am still having trouble with ABCI 3.0. Here is a summary of my progress so far:

- The NCCL tests are working after changing the module from 2.23 to 2.25.
  - With the new module the tests run fine, which means the communication library works.
- I followed the instructions for executing MPI code from the ABCI 3.0 guide:
  - ![image](https://hackmd.io/_uploads/BkazQ99qJg.png)
- I am still having trouble with the communication library (error on the console).
- From there, I am studying the MPI communication library.

![image](https://hackmd.io/_uploads/ry2cf9q91g.png)

## Progress 4/03/2025

I needed to go to city hall again to fix my pension. Finally, I got the final paper to submit to get my contract in the middle of March.

Final feedback from the ZERO [IO] CVPR2025 paper: **REJECT**

- Reviewer 1: one level up
  - ![image](https://hackmd.io/_uploads/BkLBTx4j1g.png)
- Reviewer 2: one level up
  - ![image](https://hackmd.io/_uploads/S1zs6gNo1l.png)
- Reviewer 3: one level up
  - ![image](https://hackmd.io/_uploads/SyA6TlEoke.png)

Now I am fixing the paper to submit to ICCV2025. I am changing several parts. First, I am including all the major and minor comments from the previous submission. Also, I got feedback from Kataoka san:

![image](https://hackmd.io/_uploads/BJ1zJZ4oyg.png)

Currently, I am checking the "Scaling Backwards" paper and implementation. However, ABCI is still not working.

## Progress 19/03/2025

Finally, ABCI 3.0 **is working**.

The original error when launching MPI was the following:

![image](https://hackmd.io/_uploads/S1SsOaw3Jx.png)

The debugging that got it running went as follows. The support team updated the **nvhpc/24.9** module, which includes:

- CUDA 12.6
- HPC-X 2.20
- NCCL 2.19.3

They also added a new **gdrcopy/2.4.1** module; I am not sure this is related to any DL functionality.

The last part was to set up the MPI communication protocol correctly:

- `-map-by ppr:8:node -mca pml ob1 -mca btl self,tcp` **`-mca btl_tcp_if_include bond0`**

Going forward, these settings need to be applied by the administrator under **/etc/openmpi-mca-params.conf**.
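For reference, a minimal launcher sketch that reconstructs the working invocation from the flags above; `train.py` is a hypothetical entry point, not part of this log:

```
import subprocess

# Sketch only: the working mpirun invocation rebuilt from the MCA flags above.
cmd = [
    "mpirun",
    "-map-by", "ppr:8:node",                 # place 8 ranks per node
    "-mca", "pml", "ob1",                    # point-to-point messaging layer
    "-mca", "btl", "self,tcp",               # byte-transfer layers: loopback + TCP
    "-mca", "btl_tcp_if_include", "bond0",   # restrict TCP traffic to the bonded NIC
    "python", "train.py",                    # hypothetical training entry point
]
subprocess.run(cmd, check=True)
```

Once the same keys live in `/etc/openmpi-mca-params.conf` (e.g. `btl_tcp_if_include = bond0`), the `-mca` flags can be dropped from the command line.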
The runs now work, but I am still getting this message:

![image](https://hackmd.io/_uploads/rJWzFTP2yg.png)

I benchmarked some of the fine-tuning datasets:

- CIFAR100 (storage: SSD, epochs: 1000)
  ![image](https://hackmd.io/_uploads/HyIef0whJe.png)
- IMNET-1k (storage: SSD, epochs: 300)
  ![image](https://hackmd.io/_uploads/B1kSM0P3kx.png)

PyTorch seems faster than DALI on this new machine; I still need to confirm this. Possible reasons for this behaviour:

- More and faster CPUs on ABCI: up to 192 per node.
- Performance of the pre-processing algorithms (PIL or OpenCV) under PyTorch 2.6.

Yet there are still some issues that I cannot fix on ABCI 3.0:

- I cannot launch more than 192 MPI processes.
- My renderer using EGL on the H200 crashes suddenly with a **segmentation fault**.

Things going on:

- Try to download LAION; at least I want to make a 21k version like ImageNet-21k.
- Compress ImageNet-21k to upload to SSD for faster fine-tuning.
- Profile LLM I/O carefully.
- Run NeRF to generate meshes from images on ABCI.

## Progress 26/03/2025

Checking on the MPI process limit (192 MPI processes):

- There is still no way to launch more than 192 processes; this might be due to the PBS Altair configuration.
- I double-checked the launch of MPI processes using CPU-based Python.

Time to generate RT-FDB-1k, CPU version (1 million images, 362x362):

![Screenshot from 2025-03-26 13-15-04](https://hackmd.io/_uploads/SymfxZZT1l.png)

| Node Type | MPI Processes | Imgs/sec | Total Time (hh:mm:ss) |
| -------- | :--------: | :-----: |:------:|
| V100 | 80 | __39.41__ | __00:05:02__ |
| H200 | 192 | 1.89 | 03:10:54 |

We will have a meeting next Friday to put all our questions to Tanimura san. Meanwhile, I am reading more about synthetic image datasets. One SOTA paper came out at CVPR24:

![image](https://hackmd.io/_uploads/SyTpI-bpyl.png)

## Progress 21/05/2025

Getting on with InfiniGen. ABCI seems to be working fine for InfiniGen, but we cannot launch the full 192 MPI processes for it: the memory runs out. I profiled it and got the following:

- One InfiniGen process takes up to 799 MB of GPU memory, so 192 × 799 MB = 153.408 GB.
- ABCI 3.0 GPU memory is only 143.771 GB, which is what breaks the MPI run.

I ran only 96 MPI processes, without binding the MPI processes to each socket (**--bind-to none**). Then we can run safely, since 96 × 799 MB ≈ 76.7 GB.

## Progress 4/06/2025

Continuing work on InfiniGen. Currently, I am focusing on two main aspects:

- Camera position and understanding.
- Video generation from many cameras.

There are many issues on the debugging side. For example, portions of the code generate errors just because they were not well generalized. The snippet below (from the InfiniGen source, with imports and comments added) assumes `CUDA_VISIBLE_DEVICES` holds comma-separated integer indices, so it breaks when the variable holds GPU UUID strings instead:

```
import os
import logging

logger = logging.getLogger(__name__)
CUDA_VARNAME = "CUDA_VISIBLE_DEVICES"  # env var the code assumes holds integer indices
gpus_uuids = set()  # populated elsewhere with the detected GPU ids

if CUDA_VARNAME in os.environ:
    print(os.environ[CUDA_VARNAME])
    # Fails with ValueError when the variable lists GPU UUID strings
    # (e.g. "GPU-8a2c...") instead of integer indices:
    visible = [int(s.strip()) for s in os.environ[CUDA_VARNAME].split(",")]
    gpus_uuids = gpus_uuids.intersection(visible)
    logger.debug(f"Restricting to {gpus_uuids=} due to toplevel {CUDA_VARNAME} setting")
```

I am also working to understand "gin", which is the configurable API used for scheduling:

```
manage_datagen_jobs.num_concurrent = 32
get_cmd.process_niceness = 0  # let UI processes etc. take precedence, to keep the UI smooth and usable
local_submit_cmd.use_scheduler = True
LocalScheduleHandler.jobs_per_gpu = 1
jobs_to_launch_next.max_queued_total = 1
jobs_to_launch_next.max_stuck_at_task = 16

# All will run locally; LocalScheduleHandler doesn't actually enforce cpu/ram constraints currently
queue_coarse.submit_cmd = @local_submit_cmd
queue_fine_terrain.submit_cmd = @local_submit_cmd
queue_populate.submit_cmd = @local_submit_cmd
queue_render.submit_cmd = @local_submit_cmd

queue_combined.mem_gb = 12
renderbackup/queue_combined.mem_gb = 24
queue_combined.cpus = 8
queue_combined.hours = 48
queue_combined.submit_cmd = @local_submit_cmd

# Export
queue_export.cpus = 32
queue_export.hours = 24
queue_export.submit_cmd = @local_submit_cmd

# Rendering
queue_render.cpus = 32
queue_render.submit_cmd = @local_submit_cmd
queue_render.hours = 24
queue_render.render_type = "full"
queue_render.gpus = 8

# Upload
queue_upload.submit_cmd = @local_submit_cmd
queue_upload.mem_gb = 6
queue_upload.cpus = 16
queue_upload.hours = 24
queue_upload.dir_prefix_len = 2

# Ground Truth
queue_mesh_save.submit_cmd = @local_submit_cmd
queue_opengl.submit_cmd = @local_submit_cmd
ground_truth/queue_render.render_type = "flat"
ground_truth/queue_render.gpus = 0
```

At the same time, I am looking at other papers from SIGGRAPH:

![image](https://hackmd.io/_uploads/H1YrgLTMxe.png)
![image](https://hackmd.io/_uploads/HyewxLaMgx.png)

## Progress 24/06/2025

I took some time to understand the whole source code (in general terms). Some remarks:

- The code implements its own "job scheduler".
- It uses "gin" to read configurations and runs several Python classes and functions to construct the pipeline for each job.
- I modified its configuration file to match the resources on ABCI 3.0.

Basically, I tackled the task of using different cameras in InfiniGen. Here we have some choices:

- Nature scenes
  - Yes/no objects
  - Yes/no fauna
  - Yes/no stereo cameras
  - Etc.
- Indoor scenes
  - Yes/no objects
  - Yes/no furniture
  - Yes/no stereo cameras
  - Etc.

I was able to run the following "pipes" (launched roughly as in the sketch after this section):

- Multicam to generate indoor rooms without furniture.
- Multicam to generate indoor rooms with furniture.
- Multicam to generate nature scenes.
- Single-cam video.

So far, the InfiniGen tool exposes two different challenges:

- It breaks at the initial phase, when the Coarse stage (the main algorithm that builds the object graphs) runs. For example, one single run tried to create, in **12 hours on a single ABCI 3.0 node**:
  - 60 scenes (nature or indoor).
  - Each scene includes 50 cameras.
  - Each camera shoots 50 images.
  - Successful scenes: 4 (only 2,500 images up to this point).
  - Failed scenes: 12.
  - Remaining scenes: 34.
- The scenes randomly produce "meaningless" pictures.

Still, the InfiniGen tool seems powerful, for the following reasons:

- It implements the full set of SOTA rendering techniques from Blender.
- It seems we can modify it to adapt to our own pipeline:
  - Fractals
  - Shapes
  - 3D models
- If we use it, we need to move away from their scheduler and write our own, or spend more time understanding the main flow.
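The "pipes" above are launched through InfiniGen's own job manager. A minimal sketch, based on my reading of the public InfiniGen README; the module path differs between versions (`manage_jobs` vs. the older `manage_datagen_jobs`), and the output folder, scene count, and `.gin` config names below are illustrative, not the exact ones I used:

```
import subprocess

# Sketch only: launching an InfiniGen generation run via its job manager.
# Flags follow the public README; concrete config names are illustrative.
cmd = [
    "python", "-m", "infinigen.datagen.manage_jobs",
    "--output_folder", "outputs/indoor_multicam",
    "--num_scenes", "60",
    "--pipeline_configs", "local_256GB.gin", "monocular",
    "--configs", "singleroom.gin",
]
subprocess.run(cmd, check=True)
```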
## Progress 1/07/2025

We participated in the SC25 paper review. Unfortunately, our paper was rejected from ICCV2025.

## Progress 8/07/2025

I should consider submitting the work to another venue:

- AAAI
- ICLR
- CVPR

Some advice from Kataoka san: "Storytelling is one improvement point. The accuracy is enough high-standard in synthetic training. Then we focus on research question for the next revision. At the same time, we should try to describe, for example, what is the difficulty the tradeoff zero io rendering and performance."

I am participating in the LIMIT.LAB meeting on AI safety:

![image](https://hackmd.io/_uploads/SyqdTScBgg.png)

I presented the report on InfiniGen, and some feedback came back:

- Run only indoor scenes; they are less prone to errors.
- Execute some other experiments using fewer random seeds.
- Identify exactly where the bottleneck is happening.
- Try Genesis: https://genesis-embodied-ai.github.io/

I am also looking into trying text-to-image generation on the fly. The first steps are (see the sketch after this list):

- Trying diffusion models.
- Choosing one light model to make it work alongside the pre-training model.
- Thinking about running this model on the CPU to keep the GPU free.
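As a starting point, here is a minimal sketch of on-the-fly text-to-image generation with a diffusion model kept on the CPU, using Hugging Face `diffusers` (my assumption for the library; the model id, prompt, and parameters are placeholders, not a tested choice):

```
import torch
from diffusers import StableDiffusionPipeline

# Sketch only: a light text-to-image model on CPU so the GPU stays free
# for pre-training. Model id and settings are placeholders.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    torch_dtype=torch.float32,  # CPU path: keep full precision
)
pipe = pipe.to("cpu")

image = pipe(
    "a procedural fractal texture, high contrast",
    num_inference_steps=10,  # few steps to keep CPU latency tolerable
    height=256, width=256,
).images[0]
image.save("sample.png")
```

Whether such a model is fast enough to feed a pre-training loop is exactly the open question here.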
Another topic is the generation of 3D fractal meshes, coming from this paper:

![image](https://hackmd.io/_uploads/BkMvJI5Sxl.png)

## Progress 16/07/2025

I am currently debugging and checking the InfiniGen code with the goal of creating a large number of images as fast as possible. Strategy to follow:

- Figure out the minimal steps needed to create images. The full pipeline includes these steps:
  - Coarse -> Create the node graph using simulated annealing.
  - Populate Scene -> Import real objects into the scene.
  - ~~Fine Terrain -> Simulates and creates realistic nature scenes.~~
  - Render -> Create the image from the camera shot using ray tracing.
  - ~~Post-Processing -> Computes ground truths.~~
- Customize the rendering to be minimal:
  - We can set the image quality by selecting fewer rays per scene (e.g., full global illumination is 8192, the best setting).
  - We only render a JPEG file.
- We can describe the rooms with two initial rules:
  - ~~Include high details -> textured walls, ceilings, doors, etc.~~
  - ~~Include objects -> chairs, tables, etc.~~
- Fix the seed to compute the same scene for testing.

We can decrease the time to only several minutes per scene:

- Full quality (objects, details, 8192 rays) and post-processing: total time 47:53:12.
- Full quality (objects, details, 8192 rays): total time 00:07:23.
- Low quality (no objects, no details, 1024 rays): total time 00:03:13.

At first, I was launching the pipeline using MPI; however, mpiexec was broken since one job is not

One big problem now is running large scenes and the crash ratio, over 80%. Currently, I am looking into this problem.

## Progress 13/08/2025

I ran more experiments on InfiniGen and corroborated that the failure ratio in the Coarse section is close to 80%. To check this, I ran the same configuration to create several scenes.

Configuration of the experiment on ABCI 3.0:

- Nodes: **1**
- Objects: **No**
- Details: **Yes**
- Seed: **fixed for each experiment (n+1)**
- Rays to render: **1024**

| Experiment `n` | Total Scenes | Succeeded | Failed | Crash Ratio | Total Time |
| -------- | :--------: | :-----: | :-----: |:------:|:------:|
| 0 | 1000 | 220 | 780 | 78% | 0d 9h 24m 40s |
| 1 | 1000 | 280 | 720 | 72% | 0d 10h 01m 17s |
| 2 | 1000 | 390 | 608 | 60% | 0d 11h 47m 13s |

Currently, I am checking how we can avoid the crashes by reading a new paper:

![image](https://hackmd.io/_uploads/rJ8q8u3Ulg.png)

## Progress 19/08/2025

- I am preparing the new submission on Zero I/O for PPoPP 2026.
- I am writing the proposed project for my research visit to Portugal.
- I am writing the Kakenhi application for Early-Career.

On another matter, I will help Kataoka san double-check an accuracy problem in the latest 1-PFractal paper. It seems the reported accuracy is not reached.

## Progress 16/09/2025

Currently, I am helping Kataoka's group with "Scaling Backwards". The main objective is to reproduce the accuracy of 1p_fractal. The following has been done:

- Tried to install the original packages and environment on ABCI 3.0:
  - The original Python 3.8 cannot be installed; using 3.10 instead.
  - Installed the correct timm, PyTorch, and WDS versions in the new environment.

Pre-training has been performed on two datasets:

- The originally rendered 1p_Fractal.
- The originally downloaded 1p_Fractal.

So far, there is some trouble with the pre-training: the loss drops significantly fast. We are currently investigating the reason for the drop.

Fine-tuning seems to be working, at least with the 1p_fractal code. I tried the following:

- Downloaded the original weights from the 1p_fractal paper.
- Used my own RT-FractalDB to double-check accuracy.
- Using the pre-trained models does not reproduce the original ECCV accuracy for 1p_fractal.

Currently, by moving from the original BS=256 to BS=512, the loss is more stable, yet still not "normal". We are closer to the performance of the original paper, but we are still searching for the reason.

![image](https://hackmd.io/_uploads/Sy1rh4Ljxl.png)

## Progress 7/10/2025

Regarding InfiniGen:

- ![image](https://hackmd.io/_uploads/BJOQjPfTlx.png)
- I am looking into a new paper released in May 2025 at a robotics conference.
- This new paper focuses on the ease of asset creation with kinematic variation (physical dimensions, joints, trajectories, etc.).
- The dataset generation is procedural and can generate an arbitrary number of objects with different properties.
- They use these synthetic datasets for reinforcement-learning generalization and sim-to-real experiments.
- I will confirm the policies of the procedural algorithm next.

Regarding 1p_Fractal:

- I got the original wandb project and I am looking into the experiments and curves.
- I will perform more comparison experiments to try to match the accuracy of the previous paper.
- If we succeed in matching the accuracy, I have a new idea to further reduce dataset loading time (see the sketch after this list):
  - The main point of the idea is not to use an image file as the source of the dataset: storing the 1p_fractal as PNG or JPEG costs time to decode.
  - The 1p_fractal will be created using `torch` directly, implementing the IFS routine just before `ToTensor()`.
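To make the idea concrete, below is a minimal chaos-game sketch of IFS-to-tensor generation. The affine maps here are random placeholders (not the actual 1p-frac parameters), and a real dataloader would vectorize the point loop; this only illustrates producing the sample as a tensor with no PNG/JPEG step:

```
import torch

# Sketch only: render an IFS fractal straight into a torch tensor,
# skipping PNG/JPEG encode/decode entirely. The affine parameters
# are random placeholders, not the actual 1p-frac system.
def ifs_to_tensor(size=362, n_points=100_000, n_maps=4, seed=0):
    g = torch.Generator().manual_seed(seed)
    # Each map: x' = A @ x + b, with A in R^{2x2} and b in R^2.
    A = torch.rand((n_maps, 2, 2), generator=g) * 1.6 - 0.8
    b = torch.rand((n_maps, 2), generator=g) - 0.5

    x = torch.zeros(2)
    img = torch.zeros(size, size)
    for _ in range(n_points):
        k = torch.randint(n_maps, (1,), generator=g).item()  # chaos game: pick a random map
        x = A[k] @ x + b[k]
        # Map the point from roughly [-1, 1]^2 into pixel coordinates.
        px = ((x + 1.0) * 0.5 * (size - 1)).long().clamp(0, size - 1)
        img[px[1], px[0]] = 1.0
    return img.unsqueeze(0)  # (1, H, W), ready for the transform pipeline

sample = ifs_to_tensor()
```

Plugged into a `Dataset.__getitem__`, the tensor can go straight into the transform pipeline, so the decode cost disappears.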
## Progress 14/10/2025

Regarding 1p_Fractal:

- Cleaning the original code (Nakamura) to add new logs and better output.
- Running more experiments on ABCI to match the accuracy.

I am planning to submit to the following venues:

- CENTRA-26 (January 2026), presentation submission: "High Performance Data Loaders for Large Training on Vision Transformers".
- HPC-ASIA (January 2026), poster submission: "Towards Pre-training ViT without Images through FDSL".

## Progress 21/10/2025

Regarding 1p_Fractal:

- Two more people from Kataoka's team are helping to match the accuracy on the dataset, probably with a different application.

Other collaboration:

- I helped with the ICCV25 committee award and read papers.

We submitted a paper to this venue, but it was rejected:

- Atmospheric and Climate Science (https://www.nature.com/collections/caefbeicae)

Our NEDO project was accepted, for the AI test bed on different accelerators.

Other presentations coming up:

- I submitted my presentation talk to CENTRA-26: "High Performance Data-loading for Large-Scale Vision Transformer Training".
- The HPC-ASIA (January 2026) poster is in progress.
- Presentation for the SC25 booth.

## Progress 28/10/2025

- We are re-submitting the paper from our collaboration with Jairo, "Data-driven modelling of ozone dynamics across Asian megacities using multi-platform datasets", to Springer Nature for Climate and Atmospheric Science.
- On the 1p_Fractal:
  - I am running some tests on pre-training. I am facing strange behaviour just by changing the BS.
  - I am studying and revisiting the dataloader from TorchVision, since I need access to the Torch API to compute the IFS directly.
  - I am preparing the benchmark to measure the time spent creating PNGs or JPEGs. This time will be saved, since in principle it is not needed when using synthetic data. This also strengthens our security appeal, since no images are used or stored.
  - This will be a candidate for the HPC-ASIA poster submission.

## Progress 11/11/2025

- There are no results yet from PPoPP.
- I was running experiments on Fractal_1P. I found some hyper-parameters that differ from the original experiments, such as the warm-up. However, the accuracy has not increased to the reported values:
  - The LR was divided by 512 like DeiT (i.e., scaled as lr = base_lr × BS / 512), but hard-coded. This is the problem I was having before with the BS.
  - Original from Tadokoro-san:
    - ![image](https://hackmd.io/_uploads/rkOwc_ee-x.png)
  - Dataset: recent trial vs. reported ECCV24:
    - CIFAR10: 96.40 vs 96.80
    - CIFAR100: 81.80 vs 84.20
    - Imnet-100: 88.10 vs 89.00
- Finally, I am currently re-writing Zero IO for CVPR2026.

## Progress 02/12/2025

- Our paper on air-quality research has been rejected again. This is the 3rd failed submission:
  - Submitted journal: Nature - Climate and Atmospheric Science (npj)
  - Title: "Data-driven modelling of ozone dynamics across Asian megacities using multi-platform datasets"
- I am currently working on my submission for HPC-Asia: pre-training without images using FDSL.
  - Basically, computing the IFS directly into tensors without any PNG or PIL conversion.
  - ![image](https://hackmd.io/_uploads/HyY_RijWZx.png)
- There are new transforms in newer PyTorch that are GPU-accelerated and support tensors directly (a sketch follows this list):
  - ![image](https://hackmd.io/_uploads/HJZT1hsZWx.png)
  - ![image](https://hackmd.io/_uploads/ryXCJhi-bl.png)
  - Timm supports v2 as well.
- I am still preparing the dataloader and benchmarking the execution time.
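A minimal sketch of the tensor-native path with `torchvision.transforms.v2` (assuming a torchvision version that ships v2, roughly ≥ 0.16; the normalization statistics and the stand-in input are placeholders):

```
import torch
from torchvision.transforms import v2

# Sketch only: a v2 transform pipeline that consumes tensors directly,
# so an IFS sample generated in torch never touches PIL or JPEG.
transforms = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(),
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0,255] -> float [0,1]
    v2.Normalize(mean=[0.5], std=[0.5]),    # placeholder statistics
])

sample = torch.randint(0, 256, (1, 362, 362), dtype=torch.uint8)  # stand-in for an IFS tensor
out = transforms(sample)
```

Because v2 transforms also accept batched GPU tensors, the same pipeline can run on the device right after generation.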
## Progress 10/12/2025

- Submission to HPC-Asia: "Towards High Performance Image-Free ViT Pre-training Using Tensor Based Fractal Generation".
- Registration for ADAC.
- Kick-off meeting for the NAPA project with RIKEN:
  - Test bed for new deep-learning architectures.
- Using the Nsight profiler for I/O on ABCI.
- Talking to Kataoka-san: he will provide points on ABCI-Q to continue the 1P_Fractal accuracy experiments.

## Progress 23/12/2025

- Report on using the Nsight profiler for I/O on ABCI:
  - Studying and executing samples to profile storage with Nsight.
  - Trying to use GPUDirect Storage.
- Waiting for new points on ABCI-Q to continue the 1P_Fractal accuracy work:
  - I provided information to Kataoka-san to proceed with the account creation/validation.

## Progress 16/01/2026

Presentation at CENTRA-2026 (Thailand):

- I presented on the differing performance of data loaders in a supercomputer environment.

Cooperation with Kataoka-san on "Minimal Dataset Requirement in Visual Representation Learning". According to Kataoka-san:

> If we summarize our experiments (so far) and related studies, we will need:
> - Basic classification performance on Fractal, ImageNet, and related datasets
> - Analysis of the minimum number of categories and instances (#category / #instance) on Fractal, OFDB, 1p-frac, DeadLeaves, ImageNet, etc.
> - Our reproduction challenge for 1p-frac
> - Additional real and synthetic images as supporting evidence...
> - Further discussion
>
> I am open to discussion toward publication of this paper. I believe that "After Scaling Backwards" and the minimum challenge for ViTs are sufficient to engage the research community, even in the form of a position paper or short survey.

Currently, I am in the process of registering on ABCI-Q.

I am cooperating with DDN on profiling I/O. Currently, I am using the Nsight profiler to extract information on Lustre, local SSDs, and GPUDirect.

I am starting to prepare my research stay at INESC-TEC (Portugal). The following topics were discussed during CENTRA-9:

- Profiler comparison with Nsight and DDN; they have their own profiler.
- A new I/O framework that leverages and pools SSDs for AI training.
- I raised the point that this is very similar to BeeGFS, so there needs to be a comparison.

## Progress 20/01/2026

- Cooperation on a new project on metrics visualization, as part of the NEDO-CEF Post-5G Information and Communication Systems Infrastructure Enhancement R&D Project (ポスト5G情報通信システム基盤強化研究開発事業).
  - We propose the usage of Dexpot, an intermediate computing element between the Edge and the Cloud capable of post-processing, compression, pre-training, etc.
  - Using InfluxDB and Grafana, the main objective of this work is to support real-time monitoring of the Dexpot computing element (a minimal sketch follows below).

![image](https://hackmd.io/_uploads/SyGUcH2rWg.png)
![image](https://hackmd.io/_uploads/HkFDcrnSbx.png)

- I got better access to the special ABCI node where I can reach the SSDs for the Nsight profiler.
- On another topic, I finished the poster I will present at HPC-ASIA.
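For the monitoring side, here is a minimal sketch of pushing one metric point from the computing element into InfluxDB, which Grafana then reads for its dashboards. This uses the `influxdb-client` Python package (my assumption for the client library); the URL, token, org, bucket, and measurement/field names are placeholders:

```
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Sketch only: one metric sample pushed to InfluxDB for Grafana to plot.
# URL, token, org, bucket, and measurement/field names are placeholders.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("dexpot_metrics")        # measurement
    .tag("node", "dexpot-0")       # which element the sample comes from
    .field("gpu_util", 0.87)       # example metric value
)
write_api.write(bucket="monitoring", record=point)
client.close()
```

A small agent running this in a loop on the Dexpot element would be enough for a first real-time dashboard.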