One of Microsoft’s key Mixed Reality partners reveals how it creates professional videos of the HoloLens experience, complete with holograms.
Mixed reality (MR) is a new platform that’s emerged from the fields of augmented and virtual reality. MR blends the physical and digital worlds into a single space, using a combination of cutting-edge optical hardware and AI software. This powerful combination makes it just as suited to enterprises as it is to consumers, and offers a new way for people to interact with an increasingly data-rich world. The device leading the current wave in the MR revolution is the Microsoft HoloLens.
Mixed reality is a compelling experience for those using it, but onlookers simply see people gazing into thin air. To counter this, the HoloLens has built-in functionality for creating video recordings from the user’s perspective, called mixed reality capture (MRC). However, the footage from the HoloLens’ forward-facing camera doesn’t capture the magic of a mixed reality experience, where holograms fill the room around the user. MRC also has the downside of being computationally expensive, stealing performance from the application itself. The result is an imperfect experience for both the active and the passive viewer.
You may then wonder how Microsoft produces its great demo videos. The answer is a custom rig built around a RED Dragon camera, costing tens of thousands of dollars. Microsoft also offers a more affordable “spectator view” add-on for producing high-quality videos, but this is limited to static shots only. Within weeks of getting our first few HoloLens units in early 2016, we identified this as a problem we’d like to solve, both to share our own dynamic MR content and to help our clients and early adopters of this ground-breaking new technology.

Refining our solution

At Fracture, our solution has been refined and improved over time, but the basic principle remains the same. At its core is a Unity framework developed for recording a HoloLens session on the device itself.
Think of it as a replay system. During a filming session, we record all the relevant information about what the user is doing: which objects have been placed where, which buttons have been tapped. This data is then saved to a file that can later be imported into the Unity editor. This allows us to play back everything the user did, render it out from the perspective of the real-world camera, and composite it on top of the footage.
The key is knowing what the user has done and when. Our early implementations streamed the data live over the network to the Unity editor and rendered it out in real time, but this required bespoke networking code for any app we wished to capture – even if it was only a single-user experience. The much simpler and more robust solution was to record the data locally on the HoloLens, which removed much complexity during development and throughout the shoot.
The process of preparing a scene for recording involves tagging up only the objects that respond to user input, so the system knows what to record. This lightweight approach keeps memory and performance overheads low and means that we can prepare a scene – depending on complexity – in less than a day. That may sound like a long time, but trust me, it’s a far quicker approach than fully networking a HoloLens app!
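The replay principle above can be sketched in a few lines. Note this is an illustrative Python sketch, not Fracture’s actual Unity framework; the class and event names are hypothetical:

```python
import json
import time

class ReplayRecorder:
    """Minimal sketch of a replay system: timestamped events from
    tagged objects are recorded during a take, saved to a file and
    played back later in the editor."""

    def __init__(self):
        self.events = []
        self.start_time = None

    def start(self):
        # A voice command would trigger this at the start of each take.
        self.start_time = time.monotonic()
        self.events = []

    def record(self, object_id, event_type, payload):
        # Only tagged objects call record(), keeping overhead low.
        self.events.append({
            "t": time.monotonic() - self.start_time,
            "object": object_id,
            "event": event_type,
            "data": payload,
        })

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.events, f)

    @staticmethod
    def load(path):
        with open(path) as f:
            return json.load(f)

# During a take: the user taps a button and moves a hologram.
rec = ReplayRecorder()
rec.start()
rec.record("menu_button", "tap", {})
rec.record("hologram_1", "move", {"pos": [0.4, 1.2, 2.0]})
rec.save("take_01.json")

# Later, in the editor: step through the events in recorded order.
for event in ReplayRecorder.load("take_01.json"):
    pass  # apply event["event"] to event["object"] at time event["t"]
```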
With the app prepared, we’re ready to start filming. Starting a HoloLens session recording on location is as simple as issuing a voice command at the start and end of each take. In addition to the replay data, we also save out the spatial mesh of the room that the HoloLens creates using its spatial mapping technology. This gives us a 3D representation of the real-world filming location in our video production package.
Filming a HoloLens session
The filming itself is straightforward, but there are a few things we do to ensure things run smoothly. We use a GoPro camera, since it has a nice wide-angle lens that allows us to keep both the holograms and the user in shot – without needing to be too far away. It also allows us to film in smaller locations when necessary.
We don’t want to do too much rotoscoping (a movie-production technique that involves manually masking out sections of video layers), so we try to avoid the user moving in front of the content. We like to use quite minimalist spaces as backdrops, so there’s less noise behind the holograms. This can make it difficult to track the camera later, so we’ll often add tracking markers to the space.
“MR is a compelling experience for those using it, but onlookers simply see people gazing into thin air”
The Fracture team uses camera tracking to extract the camera’s movement from feature points in the image, so that computer-generated content can be composited into it. The key part of this process is to ensure we have real-world measurements of an object in the shot, such as a table or a doorway. This ensures we produce a camera track at the same scale as the real world. Using 3DEqualizer, the motion of the camera is then exported as a computer animation file, along with a proxy of the known object from the shot, which can then be imported into Unity.
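The scale-fixing step boils down to simple arithmetic: one measured object pins the solver’s arbitrary units to metres. A hypothetical sketch (the function names are ours, not 3DEqualizer’s):

```python
def world_scale_factor(tracked_length, real_length_m):
    """Scale factor mapping the solver's arbitrary units to metres,
    given one object in the shot with a known real-world size."""
    if tracked_length <= 0:
        raise ValueError("tracked length must be positive")
    return real_length_m / tracked_length

def rescale_camera_path(positions, scale):
    """Apply the scale to every solved camera position (x, y, z)."""
    return [(x * scale, y * scale, z * scale) for x, y, z in positions]

# A table edge solved as 3.6 units is really 1.8 m, so scale = 0.5.
scale = world_scale_factor(3.6, 1.8)
path = rescale_camera_path([(2.0, 1.0, 4.0), (2.2, 1.0, 3.8)], scale)
```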
Rendering out from Unity
With our HoloLens session recording and our real-world camera track, we can start to render out the footage from Unity. There are three key aspects here worth highlighting.
First is the matching up of the HoloLens and imported camera coordinate systems. We use the known object proxy from 3DEqualizer on top of the spatial mesh saved during filming. This gives us a rock-solid match.
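Because both coordinate systems are gravity-aligned, matching them up needs only a rotation about the vertical axis plus a translation, solved from the known proxy’s pose in each space. A hypothetical Python sketch of that idea (Fracture’s pipeline does this inside Unity):

```python
import math

def align_yaw_translation(proxy_mesh_pos, proxy_mesh_yaw,
                          proxy_track_pos, proxy_track_yaw):
    """Solve the transform (yaw, translation) that maps camera-track
    space onto HoloLens spatial-mesh space, using the proxy object's
    pose in both spaces. Both spaces are assumed gravity-aligned."""
    yaw = proxy_mesh_yaw - proxy_track_yaw
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate the track-space proxy position, then solve for translation.
    x, y, z = proxy_track_pos
    rx, rz = c * x + s * z, -s * x + c * z
    t = (proxy_mesh_pos[0] - rx,
         proxy_mesh_pos[1] - y,
         proxy_mesh_pos[2] - rz)
    return yaw, t

def apply_alignment(point, yaw, t):
    """Map one camera-track-space point into spatial-mesh space."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = point
    return (c * x + s * z + t[0], y + t[1], -s * x + c * z + t[2])

# Solve once from the proxy, then apply to every frame of the track.
yaw, t = align_yaw_translation((1.0, 0.0, 2.0), 0.0, (0.0, 0.0, 0.0), 0.0)
```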
Second is the lens distortion. The GoPro has a distinctive look, and part of this look is created by lens distortion. The problem is that Unity doesn’t do distorted rendering out of the box, and even more of a problem is that the GoPro distortion isn’t your standard fish-eye or barrel distortion; it’s rather complex.
Fortunately, if you provide 3DEqualizer with an image sequence and choose the right lens profile, it can apply the lens distortion for you. However, a distorted lens captures more than just a rectangular-shaped window onto the world. The “undistortion” process unwraps the original, making the checkerboard lines straight. In doing so, the image is stretched outside the original image resolution. This is called “overscan”.
To do the reverse (distort our renders), we need to have Unity render out with overscan. This is achieved by increasing both the resolution and the field of view being rendered.
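The overscan arithmetic is straightforward: scaling the image plane by the overscan factor scales the tangent of the half field of view by the same factor. A hypothetical sketch of the calculation:

```python
import math

def overscan_settings(width, height, fov_deg, overscan):
    """Render resolution and field of view needed so that, after the
    lens distortion is re-applied, the full frame is still covered."""
    new_w = math.ceil(width * overscan)
    new_h = math.ceil(height * overscan)
    # Widening the image plane by `overscan` widens the half-FOV
    # tangent by the same factor.
    half = math.radians(fov_deg) / 2
    new_fov = math.degrees(2 * math.atan(overscan * math.tan(half)))
    return new_w, new_h, new_fov

# e.g. 1.25x overscan: 1920x1080 becomes 2400x1350, and the FOV widens.
w, h, fov = overscan_settings(1920, 1080, 90.0, 1.25)
```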
3DEqualizer does a good job of all this, but depending on shot length, it could take up to 30 minutes to process for each shot. To speed this up, we now apply the lens distortion at the time of rendering in Unity. Rather than implementing the complex formulas 3DEqualizer uses to create this distortion, we wrote an integration pipeline between 3DEqualizer and Unity that exports the lens distortion profile as a warped mesh that Unity can then use to render out the images with the correct distortion. The benefit of this is that if we use a different camera/lens, which requires a different formula to produce the distortion, all we need to do is export a new lens profile from 3DEqualizer.
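The warped-mesh idea amounts to a per-pixel lookup: each output pixel samples the undistorted render at a source coordinate baked out from the lens profile. A simplified CPU sketch of that lookup (the real pipeline renders it on the GPU in Unity, and the grid format here is an assumption):

```python
def remap_with_warp_grid(image, warp, width, height):
    """Re-apply lens distortion using a per-pixel warp grid.
    `image` is a row-major list of pixels (the undistorted render);
    `warp` is a same-sized list of (u, v) source coordinates in [0, 1]
    baked out from the lens profile. Nearest-neighbour sampling keeps
    the sketch short; a renderer would sample bilinearly."""
    out = []
    for (u, v) in warp:
        sx = min(width - 1, max(0, int(u * (width - 1) + 0.5)))
        sy = min(height - 1, max(0, int(v * (height - 1) + 0.5)))
        out.append(image[sy * width + sx])
    return out

# Identity warp on a 2x2 image: each pixel samples its own position,
# so the output equals the input.
w, h = 2, 2
identity = [(x / (w - 1), y / (h - 1)) for y in range(h) for x in range(w)]
img = [10, 20, 30, 40]
assert remap_with_warp_grid(img, identity, w, h) == img
```

Swapping in a different camera or lens then only means baking out a new grid, exactly as with the exported lens profile.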
The third challenge is render speed. In early versions of this system, we were rendering out 3,450 x 1,992 images at 60fps as PNG files. We used PNG so that we had the alpha channel and some compression for more manageable file sizes. But the Unity implementation of PNG encoding is slow. Render times were 30 minutes just for a 30-second shot, even on a high-spec PC.
We looked at a few solutions on the asset store, but only one ticked all the boxes: AVPro Movie Capture by RenderHeads. We use a number of RenderHeads plugins and they’re all superbly fast native implementations – and this one is no exception. It will render out 1080p at 60fps using a lossless codec supporting alpha in near-real-time. So by combining both the lens distortion and the rendering improvements, we’ve taken a processing time of 30 minutes per 30-second shot down to about 30 seconds. The saved time allows us to iterate more, which ultimately results in higher-quality output.
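The speed-up quoted above works out as follows, as a quick back-of-the-envelope sketch:

```python
def per_frame_seconds(shot_seconds, fps, total_render_seconds):
    """Average render cost per frame for a shot."""
    frames = shot_seconds * fps
    return total_render_seconds / frames

# PNG path: a 30 s shot (1,800 frames) took 30 minutes -> 1 s per frame.
slow = per_frame_seconds(30, 60, 30 * 60)
# Improved path: the same shot renders in ~30 s -> ~17 ms per frame.
fast = per_frame_seconds(30, 60, 30)
```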
Final thoughts
The uptake of MR has begun and we’re already creating applications for forward-looking enterprises. By gaining an understanding of the benefits, efficiencies and new, more powerful ways of collaborating that MR offers, we’re helping those early adopters to have a significant head start over competitors. Sharing these experiences in an accessible format such as video is a key piece of the puzzle in evangelising the value of both individual applications and the medium itself.