# glTF 2.0 Extensions in MPEG and 3GPP

## AI meeting summary

Ahmed, Director of Technical Standards at Qualcomm, gave an update on work in MPEG and 3GPP on glTF extensions and supported MPEG media for the Metaverse. The goal is to enable immersive experiences by adding new media types such as dynamic meshes and point clouds, together with compression technologies for video, audio, and 3D content. 3GPP is working on integrating these experiences into the 5G system by optimizing latency, bitrates, and power consumption, and by supporting codecs for speech and split rendering.

The architecture is built around a presentation engine that interfaces with the network, local devices (camera/microphone), user input, and scene graph updates. Media access functions handle the different codecs and formats, exchanging data through buffers so that the media can be rendered with Vulkan or another graphics engine. The Scene Description format integrates timed media through extensions: the MPEG media extension, a timed accessor extension for dynamic buffers with metadata updates, a video texture extension with alternative codecs selected according to device capabilities, and a spatial audio extension that integrates MPEG-I Audio. MPEG has a long history in audio codecs and has now developed MPEG-I Audio for immersive audio experiences; the spatial audio extension enables basic experiences with audio sources, listeners, and limited reverb effects. Media is accessed through buffers and can be compressed in various formats. These concepts have been implemented in Unity, with plans to open-source the implementation soon.

Phase two of the work targets more advanced use cases such as AR anchoring, interactivity, avatars, haptics support, and light textures. 3GPP has adopted the MPEG-I Scene Description as the baseline for XR experiences, in particular AR calls where participants share content in a common environment. The extensions are published as vendor extensions and as an ISO specification: the phase one extensions are already published on the glTF website, while the phase two extensions are still a work in progress and will be submitted as an amendment by the end of the year. Feedback on the extensions can be provided through the public GitHub repository where they are all available for review.

Ahmed explained how nodes in a common scene are updated, similar to how multiplayer games work. There are implementations of this concept in Unity and Blender, with plans to release them as open source. Integration of haptics with other interactivity models is being considered, with a focus on making the components usable independently. Media can be selectively displayed or timed through interactivity triggers. Security aspects, such as protecting media and integrating DRM, were also discussed. Overall, the discussion was engaging, with several questions from participants.
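To make the buffer-based architecture described above more concrete, here is a minimal conceptual sketch (not taken from the presentation) of a media pipeline filling a circular buffer that the presentation engine drains once per frame. The class and function names are illustrative and are not part of the MPEG-I Scene Description API.

```python
import collections
import threading
from typing import Callable, Optional


class CircularFrameBuffer:
    """Conceptual stand-in for the buffers that sit between the media
    access function (decoders) and the presentation engine (renderer)."""

    def __init__(self, capacity: int = 8):
        # Oldest frames fall off automatically when capacity is exceeded.
        self._frames = collections.deque(maxlen=capacity)
        self._lock = threading.Lock()

    def push(self, frame: bytes) -> None:
        """Called by the media pipeline each time a frame is decoded."""
        with self._lock:
            self._frames.append(frame)

    def latest(self) -> Optional[bytes]:
        """Called by the render loop; returns the newest decoded frame."""
        with self._lock:
            return self._frames[-1] if self._frames else None


def render_frame(buffer: CircularFrameBuffer, upload_texture: Callable[[bytes], None]) -> None:
    """Presentation-engine side: consume whatever the pipeline has produced.

    `upload_texture` stands in for the graphics-API call (for example a
    Vulkan texture upload) that a real engine would perform.
    """
    frame = buffer.latest()
    if frame is not None:
        upload_texture(frame)
```

The design point this illustrates is that decoders and the renderer never call each other directly; they only share buffers, which is what lets the same scene reference compressed video, audio, or point-cloud streams interchangeably.

As a second illustration, the sketch below assembles a minimal glTF 2.0 document that references a video through MPEG_media and binds it to a texture with MPEG_texture_video. The extension names match the published phase one vendor extensions, but the properties placed inside them here are a simplified assumption for illustration: the published extensions route video frames through timed accessors and circular buffers, which is omitted here, and the authoritative schemas are in the public repository mentioned in the talk.

```python
import json

# Minimal illustrative glTF 2.0 document. The MPEG_* extension names are the
# published phase one vendor extensions; the property layout inside them is a
# simplified assumption (the real extensions bind video to a timed accessor
# and circular buffer rather than pointing a texture straight at a media item).
gltf = {
    "asset": {"version": "2.0"},
    "extensionsUsed": ["MPEG_media", "MPEG_texture_video"],
    "extensions": {
        "MPEG_media": {
            "media": [
                {
                    "name": "clip",
                    # Alternatives let a player pick the codec/format that the
                    # target device can actually decode.
                    "alternatives": [
                        {"mimeType": "video/mp4", "uri": "https://example.com/clip_hevc.mp4"},
                        {"mimeType": "video/mp4", "uri": "https://example.com/clip_avc.mp4"},
                    ],
                }
            ]
        }
    },
    "textures": [
        {
            "extensions": {
                # Hypothetical, simplified binding of the texture to media item 0.
                "MPEG_texture_video": {"media": 0}
            }
        }
    ],
}

print(json.dumps(gltf, indent=2))
```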
## Action items

Follow-ups and action items identified from the transcript:

- **Follow-up: Engaging with the glTF community.** The glTF extensions developed by MPEG are already available as vendor extensions in the glTF repository, which implies ongoing engagement with the glTF community to incorporate them into the glTF specification.
- **Follow-up: Open-sourcing the Unity implementation.** The Unity implementation of the MPEG extensions will be open-sourced in the next month or two, giving the community access to the implementation and enabling further collaboration and development.
- **Follow-up: Making example files available.** Example files for the MPEG extensions are still being prepared for public release, but the speaker is willing to share them privately on request and agreed to add links to the files in the README of the Khronos repository.
- **Follow-up: Incorporating feedback and refining the extensions.** Feedback and review of the MPEG extensions are welcome. A public GitHub repository has been set up for submitting issues; the feedback will be processed and the specification updated before final publication.
- **Action item: Consider integration with other interactivity models.** The haptics extension should be able to integrate with other interactivity models, including the one being developed by Khronos. The components need to be usable independently and compatible with different interactivity models.
- **Action item: Explore security provisions for media playback.** During the Q&A, a question was raised about the security of media playback in shared scenes. Integrating security provisions such as DRM, to protect media and limit access to authorized users, should be considered when refining the MPEG extensions.

In short, the follow-ups involve engaging with the glTF community, open-sourcing the Unity implementation, making example files available, and incorporating feedback; the additional action items are to explore integration with other interactivity models and to consider security provisions for media playback.

## Outline

**Introduction**
- 00:07 - Introduction to the main event, the education session.
- 00:11 - Ahmed's expertise on glTF extensions and supported MPEG media.

**Scene Description Architecture**
- 02:45 - Focus on scene description and compressing objects in a scene with video and audio codecs.
- 03:07 - Importance of the scene description format for immersive experiences.
- 04:02 - Exploring codecs for speech and spatial audio, and defining the format for this type of experience.
- 04:47 - Starting with the scene description architecture.

**Entry Point and Presentation Engine**
- 05:35 - Scene graph as the entry point for immersive experiences.
- 07:46 - Presentation engine using graphics, audio, and haptics renderers to render the media.
- 08:31 - Integration of media pipelines into the scene graph format.

**Phase One: Timed Media**
- 09:36 - Defining a format for distribution similar to HTML.
- 11:17 - Adding support for timed audio, timed video, dynamic meshes, and point clouds.
- 12:09 - Initial phase focused on adding support for timed media.
- 15:39 - MPEG media extension for describing referenced media in the scene description.

**Core Extensions and Audio Experiences**
- 16:15 - Passing data and updates to fields as part of the data itself.
- 17:25 - Core extensions for time-based experiences.
- 18:59 - Definition of audio materials and support for advanced audio experiences.
- 19:43 - Basic spatial audio experience with audio sources placed in the scene.

**Phase Two: Advanced Use Cases**
- 20:48 - Video texture and spatial audio in a scene.
- 22:30 - Scene updates for shared spaces and participant interaction.
- 23:39 - More advanced use cases like AR anchoring, avatars, and haptics support.
- 24:46 - Interactivity extension for trigger-based actions.

**3GPP Adoption and Future Work**
- 26:30 - Addition of the time dimension to extensions.
- 27:35 - 3GPP adopting the MPEG scene description as the entry point for their services.
- 28:53 - Mission to evolve video telephony into shared AR experiences.
- 29:25 - Server-based distribution of the initial scene description in a shared environment.
- 30:35 - glTF 2.0 with MPEG extensions for shared experiences.

**Demonstration and Questions**
- 30:53 - Video demonstration of interactivity and AR anchoring.
- 32:12 - Questions from the audience about the state of the extensions and implementations.
- 35:00 - Public GitHub repository for feedback and updates.
- 38:19 - State of implementations and known platforms/software.
- 42:12 - Discussion on haptics integration and the interactivity model.
- 44:15 - Access to MPEG's repository of scenes.
- 45:01 - Controlling the timing and playback of media in a scene.
- 47:39 - Closing remarks and request for slide sharing.

Please note that the outline is based on the timestamps and content provided in the transcript.
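The shared-space flow in the outline (the 22:30 and 29:25 items) relies on distributing small updates to nodes in a common scene, similar to how multiplayer games replicate state, rather than resending the whole glTF document. Below is a plausible minimal sketch of such an update expressed as a single JSON-Patch-style "replace" on a node's transform; the actual update format used by the MPEG scene update mechanism is defined by the specification, so treat the shape of this patch as an assumption for illustration only.

```python
import copy

# A tiny shared scene: one node per participant-placed object.
scene = {
    "nodes": [
        {"name": "shared_screen", "translation": [0.0, 1.5, -2.0]},
        {"name": "participant_avatar", "translation": [1.0, 0.0, 0.0]},
    ]
}

# Hypothetical update message: move the shared screen. Only the changed
# property of the changed node is described.
update = {"op": "replace", "path": "/nodes/0/translation", "value": [0.0, 1.5, -1.0]}


def apply_update(document: dict, patch: dict) -> dict:
    """Apply a single 'replace' operation addressed by a JSON-Pointer-like path."""
    result = copy.deepcopy(document)
    parts = patch["path"].strip("/").split("/")
    target = result
    for key in parts[:-1]:
        target = target[int(key)] if key.isdigit() else target[key]
    last = parts[-1]
    if last.isdigit():
        target[int(last)] = patch["value"]
    else:
        target[last] = patch["value"]
    return result


updated_scene = apply_update(scene, update)
print(updated_scene["nodes"][0]["translation"])  # [0.0, 1.5, -1.0]
```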
## Notes

- Ahmed is well versed in the topic of the education session and will do a screen share.
- The focus is on defining the equivalent of HTML5 for these experiences.
- The session covers compressing objects in a scene and their attributes.
- Codecs for speech and spatial audio are discussed, and the format for this type of experience is defined.
- The entry point to this experience is the scene graph.
- The presentation engine renders the media using graphics, audio, and haptics.
- The Scene Description format handles format conversion and network access.
- Media pipelines are integrated into the Scene Description format.
- The goal is to define a format for distribution similar to HTML.
- The MPEG media extension describes the media used in the scene and also provides alternatives to the media.
- There is a focus on simplicity and flexibility in the format.
- Extensions such as AR anchoring and interactivity were added.
- 3GPP has adopted the MPEG-I Scene Description as the baseline and aims to evolve video telephony with shared environments.
- Participants join a shared space and can add content to the scene; updates to nodes in the scene are shared among participants.
- Interactivity allows triggering actions such as haptic feedback.
- 3GPP activities include AR calls with avatars and shared spaces.
- The initial scene description is distributed by a server.
- Feedback from countries is incorporated into the standard.
- Participants can update certain nodes in the common scene.
- Feedback and suggestions for improvement are welcomed; immediate access to the extensions can be requested.
- The interactivity model allows for automatic or triggered media playback.
- Integration of additional features such as virtual objects is possible.
- Slides will be shared with the working group.
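To illustrate the note that the interactivity model allows automatic or triggered media playback, here is a small conceptual dispatcher. It is not the MPEG interactivity extension's schema: the trigger and action types below are invented for illustration, while the general pattern of trigger-based actions driving media playback and haptic feedback is what the talk describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trigger:
    """Invented trigger: fires when the viewer is within `radius` of a node."""
    node: str
    radius: float


@dataclass
class Action:
    """Invented action: a callback such as starting media or firing haptics."""
    run: Callable[[], None]


@dataclass
class Behavior:
    trigger: Trigger
    actions: List[Action] = field(default_factory=list)


def evaluate(behaviors: List[Behavior], distances: Dict[str, float]) -> None:
    """Run every action whose trigger condition is met this frame."""
    for behavior in behaviors:
        if distances.get(behavior.trigger.node, float("inf")) <= behavior.trigger.radius:
            for action in behavior.actions:
                action.run()


# Example: walking up to a virtual screen starts its video and pulses haptics.
behaviors = [
    Behavior(
        trigger=Trigger(node="shared_screen", radius=2.0),
        actions=[
            Action(run=lambda: print("start video playback on shared_screen")),
            Action(run=lambda: print("send haptic pulse to controller")),
        ],
    )
]

evaluate(behaviors, distances={"shared_screen": 1.2})
```

The same pattern also covers media that starts automatically (a trigger that is always true) or only at a given time, matching the "selective display or timing" behavior mentioned in the summary.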