Guest Gordon Wetzstein is an expert in using math to improve photographs.
Such methods have exploded in recent years and have wide-ranging impacts from improving your family photos, to making self-driving cars safer, to building ever-more-powerful microscopes. Somewhere in between hardware and software, he says, is the field of computational imaging, which makes cameras do some pretty amazing things. Wetzstein and host Russ Altman bring it all into focus on this episode of Stanford Engineering’s The Future of Everything podcast.
[00:00:00] Gordon Wetzstein: The idea of non-line-of-sight imaging is that you take a photograph of a scene, but you can actually recover parts of the scene that are hidden away outside of the direct line of sight. So, some object standing around the corner. Basically, can we take a picture and then see what's around the next corner or inside a room that we can't actually see?
[00:00:27] Russ Altman: This is Stanford Engineering's The Future of Everything and I'm your host, Russ Altman. If you enjoy the podcast, please follow or subscribe to it on your favorite podcast app. It'll help us grow and you'll never miss an episode.
[00:00:38] Today, Gordon Wetzstein will tell us about some of the amazing things that a new generation of cameras can do. I call them cameras, but they are not the cameras of our childhood. For example, he'll tell us what HDR, high dynamic range, is on our cell phone and whether we should use it. He'll tell us how cameras can now look around corners and see things that aren't even in the field of view. This has huge implications for self-driving cars that may be able to see things before they are even in front of them and can be seen.
[00:01:13] So this is gonna be amazing. It's the future of computational imaging.
[00:01:17] Before we jump into this episode, a reminder to please rate and review the podcast. It'll help our audience grow, and it'll help us improve.
[00:01:31] Many of us have a camera. In fact, almost all of us have a camera in our iPhone. If you've been paying attention, you'll notice that the quality of the pictures in your iPhone has become absolutely stunning. The colors, the focus, the exposure. This didn't happen by accident. There is a field of computational imaging where people are creating incredibly clever ways to optimize photographs. But it goes way beyond that because the technologies used in cell phones are also being used to create amazing new technologies that can be used in self-driving cars and in scientific instrumentation.
[00:02:07] The study of hardware and software to create images has exploded in the last few years. For example, we can even see things that are around the corner based on how they scatter light that is hitting them and then hitting things that we can see. And by carefully and mathematically dissecting those signals, we can see [00:02:30] something that's around the corner. It's called non-line-of-sight imaging. So, we're going to hear about all of this from Professor Gordon Wetzstein of Stanford University. He's a professor of computer science and electrical engineering. He's the director of the Stanford Computational Imaging Lab, and he's the co-director of the Image Systems Engineering Lab.
[00:02:51] So Gordon, let's start out with the basics. What is computational imaging, and what are the big challenges to that field?
[00:02:58] Gordon Wetzstein: Well, first of all, let me say, it's great [00:03:00] to be here, Russ. I'm a big fan of the show.
[00:03:02] Um, computational imaging is a field that is very close to the hardware, but also sort of between software and hardware. What we like to do is design future camera and imaging systems in general by jointly optimizing both hardware components and software components. So let me give you an example. For example, uh, imaging systems on autonomous cars are very important, but they're sort of diverse. You have a lot of different sensors [00:03:30] on there. There's a bunch of radar sensors. We have a LIDAR that shoots light out and measures the distance around the car. We're going to have a bunch of cameras as well that do like traffic sign detection and things like that. But then there's a software piece behind it, the perception algorithm that makes sense of all this raw data coming in from all the different sensors, and it has to help the car make a decision of whether it's going to stop or drive.
[00:03:54] And so we like to think about these imaging systems as, you know, holistic systems that have hardware components and software components, and we want to get the best out of them by jointly optimizing software and hardware together. That's sort of the basic idea behind computational imaging. But it not only applies to autonomous driving, but also to microscopy, scientific imaging, the camera sitting on your cell phone, for example, and many other scenarios too.
[00:04:22] Russ Altman: So great. So, thank you very much. That makes perfect sense. So, it's about getting images and you talked about, you know, acquiring them and then analyzing them. And I'm wondering, I think also part of your interest is how do you then subsequently present them to humans?
[00:04:36] So the case of self-driving cars is great because you could say, well, the car is handling everything. The humans don't need to see anything, but I'm sure there are situations where you want to give at least a little feedback to the human about what the car is seeing and, um, is that part of the deal and does that have significant challenges as well?
[00:04:55] Gordon Wetzstein: Oh, absolutely. Like, I mean, just think about, you know, you pull out your phone and you want to take a picture of your kids or your family. You want to make sure that you capture that memory in the best possible way. So, you know, just capturing better photographs is sort of where the field actually started out.
[00:05:11] So back about 20 years ago, when the field came around, people were like, oh my God, I take a picture and if there's a bright object like the sun or something really bright in there, it's just totally blown out. I can adjust the exposure time to capture that bright object, but then all the darker parts of the scene will be too dark.
[00:05:27] Russ Altman: Right.
[00:05:28] Gordon Wetzstein: So, this idea of being able to expand the contrast or the dynamic range of a camera is actually one of the starting points of this field. And you know, back when I was doing my undergrad, that was the hottest research topic. And now this is in everybody's cell phone, basically. So, the field advances quickly.
[00:05:44] Russ Altman: I'm glad you brought this up because I knew you would say this and I, uh, and I wanted to ask you two things about iPhones. I think a lot of people besides me have this question. The first thing is, HDR. I know that you've looked at HDR. Can you just tell us what HDR is and whether we should use it? Because it's on all my cameras and I stare at it, and I don't know what to do.
[00:06:06] Gordon Wetzstein: Oh, absolutely. HDR stands for high dynamic range. And what that simply means is that there's a very high contrast in the scene. The contrast is the range of luminance or brightness between the darkest object and the brightest object.
[00:06:20] So if you have a person standing in the shade of a tree, for example, that person will be fairly dark. And if you have a very bright object like the sky in the background, the difference between them is enormous. And sensors typically cannot capture them with any one exposure setting. You can either get the background well exposed or the foreground.
[00:06:39] Russ Altman: Ya.
[00:06:40] Gordon Wetzstein: You can't get both well exposed in a single photo. So HDR imaging is this idea where you press the button only once, but the phone actually captures multiple different exposures at the same time. And it computationally merges them all together into something that wasn't really a single photograph.
[00:06:57] Russ Altman: Ah.
[00:06:57] Gordon Wetzstein: It sort of extracts the most useful information from multiple different images and calculates an image that looks good to you as a human observer.
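[Editor's sketch] The merging step described above can be illustrated with a small Python example. This is a minimal sketch, not how any particular phone does it: it assumes a linear sensor that clips at 1.0, known exposure times, and a simple triangular weighting that trusts mid-tones most. The names hat_weight, merge_exposures, and capture are hypothetical, and real pipelines also align frames, denoise, and tone-map the result.

```python
def hat_weight(value):
    """Trust mid-tones most; the weight falls to 0 at pure black (0.0) and pure white (1.0)."""
    return 1.0 - abs(2.0 * value - 1.0)


def merge_exposures(frames, exposure_times):
    """Merge bracketed frames (lists of pixel values in [0, 1]) into relative scene radiance."""
    # Index of the shortest exposure, used as a fallback when every frame is clipped.
    shortest = min(range(len(exposure_times)), key=lambda k: exposure_times[k])
    merged = []
    for pixel_values in zip(*frames):
        weighted_sum = 0.0
        weight_total = 0.0
        for value, t in zip(pixel_values, exposure_times):
            w = hat_weight(value)
            weighted_sum += w * (value / t)  # value / t estimates radiance on a linear sensor
            weight_total += w
        if weight_total > 0.0:
            merged.append(weighted_sum / weight_total)
        else:
            merged.append(pixel_values[shortest] / exposure_times[shortest])
    return merged


def capture(radiances, exposure_time):
    """Toy linear sensor (an assumption): output is radiance * time, clipped at 1.0."""
    return [min(1.0, r * exposure_time) for r in radiances]


# Two-pixel scene: a person in the shade (radiance 0.5) and a bright sky (radiance 40).
scene = [0.5, 40.0]
times = [0.01, 0.1, 1.0]
frames = [capture(scene, t) for t in times]
hdr = merge_exposures(frames, times)
```

In this toy scene the longest exposure clips the sky and the shortest leaves the shaded person almost black, yet the merged result recovers both radiances, which is the point Wetzstein makes about no single exposure being enough.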
[00:07:07] Russ Altman: Okay. Is there also a phenomenon of doing the same kind of thing, but not for the, uh, for the lighting, but for the focus where it takes many pictures and then kind of sews together the ones that are showing the objects in the distance and in the nearby in focus?
[00:07:22] Gordon Wetzstein: Absolutely. So, we call that a focal stack. So, the idea of you know, cycling through the different focus settings of the camera and capturing the image differently focused. I mean, how many times has it happened that, oh my God, there's this really cool thing that I want to take a picture of right now, and I'm going to hold up my phone and it's just the object I want to see is out of focus, right?
[00:07:41] Russ Altman: Yeah.
[00:07:41] Gordon Wetzstein: So, by being able to sweep through the focus settings, I could pick the best focused spot for each pixel in the image. But I can do more than that. I can do things that I cannot do with a regular camera. So, making an image look better is great, but you can also enable new, uh, abilities. For [00:08:00] example, by having the focal sweep, we can now run an algorithm that estimates the depth, or the distance, of each point in the scene from the camera.
[00:08:09] It gives us something that we call a depth map, basically.
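[Editor's sketch] One simple way to turn a focal stack into a depth map is to ask, for each pixel, which focus setting makes it look sharpest. The Python below is a minimal one-dimensional sketch under that assumption; the sharpness measure (an absolute second difference) and the names sharpness, depth_from_focus, and all_in_focus are illustrative choices, not the method used in any specific camera.

```python
def sharpness(row, i):
    """Local contrast at pixel i: absolute second difference, with clamped borders."""
    left = row[max(i - 1, 0)]
    right = row[min(i + 1, len(row) - 1)]
    return abs(left - 2.0 * row[i] + right)


def depth_from_focus(stack):
    """For each pixel, return the index of the focal slice in which it is sharpest."""
    width = len(stack[0])
    return [
        max(range(len(stack)), key=lambda k: sharpness(stack[k], i))
        for i in range(width)
    ]


def all_in_focus(stack, depth_index):
    """Pick each pixel from its sharpest slice to build one image that is sharp everywhere."""
    return [stack[k][i] for i, k in enumerate(depth_index)]


# Toy focal stack: slice 0 is focused on the near (left) half, slice 1 on the far (right) half.
near_focused = [0.0, 1.0, 0.0, 1.0, 0.5, 0.5, 0.5, 0.5]
far_focused = [0.5, 0.5, 0.5, 0.5, 0.0, 1.0, 0.0, 1.0]
stack = [near_focused, far_focused]

depth_map = depth_from_focus(stack)
sharp_image = all_in_focus(stack, depth_map)
```

Here depth_map holds slice indices rather than metric distances; converting an index to a distance would need the lens calibration, which this sketch does not model.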
[00:08:12] Russ Altman: Yep, yep.
[00:08:13] Gordon Wetzstein: And if we have a depth map of a scene, which we cannot capture directly with a single image, then we can do really cool things. For example, we can emulate, uh, what a portrait shot would look like had you taken it not with your iPhone camera, but with a professional DSLR camera...
[00:08:29] Russ Altman: Ah.
[00:08:29] Gordon Wetzstein: .... that has a [00:08:30] really large lens. Because the difference between a DSLR camera and a small camera is really... It's mainly the size of the lens, the physical size. You just can't get that size on a phone and with a larger lens, you can get this really nice depth of field or bokeh effect.
[00:08:45] Russ Altman: Yes, yes.
[00:08:45] Gordon Wetzstein: Where the person is perfectly in focus, but the background is sort of blurry and out of focus. That's what is like a signature way to capture a portrait and you just can't do it physically with a cell phone camera. But using computational photography, you can [00:09:00] estimate the distance of all the different objects in the scene and then create this relative blur in the background and synthesize it.
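[Editor's sketch] The synthesized background blur can be illustrated by blurring each pixel more the farther its depth is from the chosen focus depth. The Python below is a deliberately simplified one-dimensional sketch: the box blur, the blur_per_unit parameter, and the function name synthetic_bokeh are assumptions for illustration, and the sketch ignores occlusion handling at depth edges, which real portrait modes have to deal with.

```python
def synthetic_bokeh(image, depth, focus_depth, blur_per_unit=1.0):
    """Blur each pixel with a box filter whose radius grows with distance from focus_depth."""
    width = len(image)
    output = []
    for i in range(width):
        # Pixels at the focus depth get radius 0 and stay sharp.
        radius = int(round(blur_per_unit * abs(depth[i] - focus_depth)))
        lo = max(0, i - radius)
        hi = min(width - 1, i + radius)
        window = image[lo:hi + 1]
        output.append(sum(window) / len(window))
    return output


# Textured scene: the left half is a person at depth 1, the right half is background at depth 3.
image = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
depth = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]

portrait = synthetic_bokeh(image, depth, focus_depth=1.0)
```

The person's pixels come out unchanged while the background texture is averaged toward gray, which is the "make it worse to make it look better" effect described in the conversation.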
[00:09:07] Russ Altman: Okay.
[00:09:08] Gordon Wetzstein: It's sort of funny that you have to make an image seemingly worse by blurring it in parts to make the photograph look better. But, you know, that is one of the, one of the set of algorithms that we focus on.
[00:09:18] Russ Altman: Okay. And I know this is super basic stuff but thank you. I guess the first public service announcement is use the HDR on your camera. It's a good thing. And, uh, and Gordon recommends it.
[00:09:28] Okay. Going now a [00:09:30] little bit, and there's so much that I want to ask. Um, uh, I am a little bit of a photography, uh, enthusiast. And so, when I was looking at your CV and your papers, I just got very excited. I wanted to go right and then we're going to go to more general things.
[00:09:43] Um, but I want to go to non-line-of-sight photography, because I know you've worked on this, and it sounds like magic. So, could you first define it and then tell us what it is and how it works?
[00:09:56] Gordon Wetzstein: Absolutely. So, this was such a fun topic that we've been working on [00:10:00] for a couple of years. And the idea of non-line-of-sight imaging is that you take a photograph of a scene, but you can actually recover parts of the scene that are hidden away outside of the direct line of sight. So, some object standing around the corner. Basically, can we take a picture and then see what's around the next corner or inside a room that we can't actually see?
[00:10:23] Russ Altman: So that's making everybody's brain explode when you even say that, although the implications for safety, for example, in self-driving [00:10:30] cars, but also the coolness for spy applications. So yeah, please go on.
[00:10:35] Gordon Wetzstein: Uh, exactly. So, the application I like to think about is mainly autonomous driving: can we make it safer by allowing the car to see around the next curve, right? We don't want autonomous cars to be only as safe as human drivers are, because humans aren't actually that safe. At the end of the day, we want them to be safer, much safer.
[00:10:54] And part of the way we can get that is to enable them to just have superhuman vision to [00:11:00] some extent, to see things that we humans cannot see and base some of their decisions, at least, on some of that information. And non-line-of-sight imaging is one of these. It seems sort of like magic. When I first heard about this idea, I was like, how is that even possible?
[00:11:14] Like, I mean, photons come from the scene to the camera, and we digitize them using, uh, sensors. So, where's that information coming from? Do I just make this up? But there's a, as with anything in engineering and science, there's a trick that you [00:11:30] can use to make this sort of work and the trick is to use active illumination.
[00:11:35] And what do I mean by active illumination? So, I mentioned autonomous cars a few times. Typically these LiDAR sensors have a laser on them that shoots a short pulse of light into the scene. And it measures the time it takes for the light to bounce back from the object that it hit to the sensor. That's a LiDAR system, a light detection and ranging system, basically.
[00:11:57] Russ Altman: Yes.
[00:11:57] Gordon Wetzstein: It just measures the time of [00:12:00] flight. And we use that routinely in all these cars. And also, now on your iPhone by the way to get a sense of depth. You basically scan this point across the scene and you can probe how far away any one point is using this active illumination, right?
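[Editor's sketch] The time-of-flight idea reduces to one line of arithmetic: the pulse travels to the object and back, so the distance is the speed of light times the measured time, divided by two. A minimal Python sketch follows; the function name and the example timings are illustrative, not taken from any real sensor.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_from_round_trip(time_s):
    """Distance to the first surface hit, given the pulse's round-trip time in seconds."""
    # The light covers the distance twice (out and back), hence the division by 2.
    return SPEED_OF_LIGHT_M_PER_S * time_s / 2.0


# Scanning the laser across the scene gives one round-trip time per direction (toy values).
round_trip_times_s = [20e-9, 100e-9, 400e-9]
depths_m = [distance_from_round_trip(t) for t in round_trip_times_s]
```

A 100-nanosecond round trip corresponds to roughly 15 meters, which gives a feel for how fine the timing has to be to measure distances at car or phone scale.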
[00:12:16] When we think about what happens to this pulse of light that flies into the scene, it doesn't just bounce off an object and return directly to the sensor. Some of the light actually bounces from that object that it hit in all directions.
[00:12:29] Russ Altman: Yep.
[00:12:29] Gordon Wetzstein: And [00:12:30] part of this, a couple of photons, will bounce around the corner where there's somebody else. So, we call these indirect bounces or indirect light transport.
[00:12:38] Russ Altman: Okay, okay.
[00:12:39] Gordon Wetzstein: And so, think about the laser bouncing onto a visible surface, some of that light bouncing behind, you know, into hidden parts of the scene. And then some of the photons will be reflected back into the visible part and then come back to the sensor.
[00:12:53] Russ Altman: Okay.
[00:12:54] Gordon Wetzstein: So instead of just looking at the first bounce that goes there and back, we're looking at the third bounce. So, there to the hidden part of [00:13:00] the scene, back into the scene and then to the camera.
[00:13:03] Russ Altman: Yes.
[00:13:03] Gordon Wetzstein: And we get a lot of these mixed photons that were bouncing around a few times somewhere where we don't know where they went and we don't know what they saw, but, uh, we can use the time of flight of these indirectly scattered photons and solve a large scale inverse system. So, we solve a mathematical problem to try to recover the shape and the colors or appearance of these objects that are outside of the direct line of sight.
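[Editor's sketch] The inverse problem can be illustrated with a toy two-dimensional backprojection. The sketch assumes a simplified setup that the conversation does not specify: the laser and sensor look at the same spot on a visible wall (lying along the x-axis), the known legs between sensor and wall have already been subtracted, and the hidden scene is a single point. Each measured time then says the hidden point lies on a circle around that wall spot, and the grid cell consistent with the most circles is the estimate. All names and parameters here are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def simulate_round_trip_times(wall_xs, hidden_point):
    """Time for light to go from each wall spot to the hidden point and back to that spot."""
    hx, hy = hidden_point
    return [2.0 * math.hypot(hx - wx, hy) / SPEED_OF_LIGHT_M_PER_S for wx in wall_xs]


def backproject(wall_xs, times_s, grid_xs, grid_ys, tolerance_m=0.02):
    """Return the grid cell whose distances to the wall spots agree with the most measurements."""
    best_cell, best_votes = None, -1
    for gx in grid_xs:
        for gy in grid_ys:
            votes = 0
            for wx, t in zip(wall_xs, times_s):
                measured_distance = SPEED_OF_LIGHT_M_PER_S * t / 2.0
                if abs(math.hypot(gx - wx, gy) - measured_distance) < tolerance_m:
                    votes += 1
            if votes > best_votes:
                best_cell, best_votes = (gx, gy), votes
    return best_cell, best_votes


# Nine scanned spots along a 2 m stretch of visible wall; hidden object at (1.2 m, 0.8 m).
wall_xs = [i * 0.25 for i in range(9)]
grid_xs = [round(i * 0.05, 2) for i in range(41)]
grid_ys = [round(j * 0.05, 2) for j in range(1, 41)]
times_s = simulate_round_trip_times(wall_xs, (1.2, 0.8))

estimate, votes = backproject(wall_xs, times_s, grid_xs, grid_ys)
```

Real reconstructions have to cope with noise, many hidden surfaces at once, and far fewer returning photons, which is why the problem is posed as a large-scale inverse system rather than a simple vote.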
[00:13:27] Russ Altman: Yes, this actually is a [00:13:30] great, it's a great description because I'm thinking about a ball and if I throw a ball and it hits the wall and then it bounces and if it hits somebody behind a corner, it'll bounce back to the wall quicker than what I'm expecting and maybe I can figure out it just hits somebody who's coming around the corner or something like that.
[00:13:47] Gordon Wetzstein: Yeah, yeah.
[00:13:47] Russ Altman: Okay, that's fantastic. And how good are these systems? Like, so that theoretically sounds great. I know there's a lot of photons, so you might be able to get a lot of data. Are we able to do, like, what's the resolution of it? [00:14:00] Can you say there's some object or can you say it's my friend Bill? Where, what is our accuracy in those reconstructions?
[00:14:07] Gordon Wetzstein: Yeah. The accuracy is physically limited by a few factors, uh, including how short the pulse is. So, the shorter it gets, the better the resolution gets, but then also the temporal resolution of the sensor. The resolution is in the centimeter range.
[00:14:23] Russ Altman: Ah.
[00:14:23] Gordon Wetzstein: So that means I can get centimeter scale spatial resolution, you know, within a few meters [00:14:30] behind that wall. And that's enough to make out. Oh yeah, it's a human. And you know, maybe you can even make out who that human is. I mean, to some level, right?
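[Editor's sketch] A rough rule of thumb connects timing resolution to distance resolution for direct time-of-flight ranging: a timing uncertainty of Δt corresponds to a range uncertainty of about c·Δt/2. The Python below applies that rule; it is only a back-of-the-envelope estimate, and the achievable resolution in the non-line-of-sight case also depends on the scene geometry and other factors not modeled here. The function names are illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_resolution_m(timing_resolution_s):
    """Approximate range resolution for a given round-trip timing resolution."""
    return SPEED_OF_LIGHT_M_PER_S * timing_resolution_s / 2.0


def timing_needed_s(target_resolution_m):
    """Timing resolution needed to resolve a given range difference."""
    return 2.0 * target_resolution_m / SPEED_OF_LIGHT_M_PER_S


# Timing resolution needed for centimeter-scale range resolution.
needed_s = timing_needed_s(0.01)
```

Under this rule, resolving one centimeter takes timing on the order of tens of picoseconds, which is why both the pulse length and the sensor's temporal resolution matter.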
[00:14:38] Russ Altman: Ya, ya.
[00:14:38] Gordon Wetzstein: Like I think traditional feature detection on your faces may not directly.
[00:14:42] Russ Altman: Right, but people are known to have separate gaits. You'll know that the way they're walking might be enough.
[00:14:47] Gordon Wetzstein: Perhaps, exactly, uh, or you will know they have a nose and two eyes and two ears to some extent, but so I don't think you can really identify people, uh, right now, but based on their movements, perhaps, and, [00:15:00] but it may be enough to say, oh, there's a car around the corner, or there's a bike or a pedestrian or there's a ball rolling on the street.
[00:15:07] Russ Altman: Yes.
[00:15:07] Gordon Wetzstein: Uh, that may be chased by a small child. So, uh, be extra cautious, uh, and yeah, it's just, uh, such an enabling technology that's sort of like going down the superhuman vision route.
[00:15:19] Russ Altman: And so your comments lead me to believe that this is not just for static images, but you're able to do dynamic detection of a moving object by taking multiple frames, [00:15:30] uh, with the photons, etc., as you just described.
[00:15:32] Gordon Wetzstein: Yeah. Yeah. When we have moving objects.
[00:15:34] Russ Altman: Okay.
[00:15:34] Gordon Wetzstein: We even have more information that sometimes makes it even easier to recover certain aspects of this.
[00:15:39] Russ Altman: Right. Okay. Thank you so much because these were all the things, I was dying to ask you. Uh, and now I want to just step back and say, what are the things happening in your lab that you're most excited about these days?
[00:15:51] Gordon Wetzstein: Well, these days I would say there are two different things. So, I'm in electrical engineering, but I'm also in computer science. So, you know, my [00:16:00] electrical engineering self likes to work on hardware, and we like to use artificial intelligence and modern algorithms to not only design the software part, but also the hardware.
[00:16:10] And, uh, that's sort of, there's a couple of things going on. You know, I mentioned this depth sensing, you know,
[00:16:16] Russ Altman: Yep.
[00:16:16] Gordon Wetzstein: How do we build cameras that are better at sensing distances on a per-pixel basis? But we've also been looking at a couple of other things. Like we've been looking at optical computing. So, uh, this is an area [00:16:30] where you maybe think about the camera as a preprocessor. So now the human being is not the final consumer of the image, like the photograph we talked about earlier.
[00:16:38] Russ Altman: Ah.
[00:16:38] Gordon Wetzstein: But maybe there's a robot or a car or some other algorithm that actually looks at the data. So, no human being will ever look at the data. In that context, you know, most of the decision making is actually done by neural networks or artificial intelligence algorithms, and an AI algorithm is typically a big algorithm that takes a certain [00:17:00] amount of power and has a certain latency, so it takes some time to process the data. So, the question there is, can we make it faster? Can we make it more power efficient? Can we perhaps outsource some of the computation directly into the optics already and precompute some features like a feature extractor or some amount of computation directly in the analog domain before the data is even digitized by the sensor?
[00:17:24] And so that's the idea of optical computing is like, can we actually do the computing with photons rather than with [00:17:30] electrons?
[00:17:30] Russ Altman: Right. So as an electrical engineer, now, once you've said that, it strikes me that you get to decide how far the hardware goes before it goes into a software system. And, you know, for many of us, we are just, you know, we have a computer, we have a device, and we don't get to decide where they interface. You just have to take the data out of the device and put it into the computer. But since you have both of these skill sets, you can say, I'm going to extend what the hardware does and have the software come in at a later point, or I can have the software go [00:18:00] deeper into the process. Um, so that's fascinating.
[00:18:03] What are the applications that optical computing might be used for?
[00:18:07] Gordon Wetzstein: Well, the applications we're thinking about are mainly power-constrained systems. Think about, uh, drones perhaps where you have a power constraint and, you know, any milliwatt of power you can save will greatly extend the range.
[00:18:21] Russ Altman: Ah.
[00:18:22] Gordon Wetzstein: So this is so important for, let's say, aerial supervision of like forest fires, or, you [00:18:30] know, maybe you want to fly your drone to check out if that PG&E equipment isn't flying any sparks there, you know, like.
[00:18:35] Russ Altman: Right.
[00:18:35] Gordon Wetzstein: And just being able to get to that next transformer is super important and could prevent the next wildfire, right? So just power-constrained settings are good because there we add additional computational resources, additional degrees of freedom for doing the processing on the device.
[00:18:51] Russ Altman: Is that because the hardware may be actually more power efficient than the software?
[00:18:56] Gordon Wetzstein: The hardware is ideally power free. So, the way we [00:19:00] think about it is like,
[00:19:00] Russ Altman: Ah.
[00:19:01] Gordon Wetzstein: We want to build a lens or an optical system that we, you know, we have to train in software just like a neural network, but then we fabricate it in the clean room or in a lens manufacturing facility,
[00:19:13] Russ Altman: Ahh.
[00:19:13] Gordon Wetzstein: We assemble the camera, and the camera doesn't capture a regular picture. It captures something that, you know, where the data that you record is already preprocessed. It's sort of baked into the hardware.
[00:19:22] Russ Altman: Oh, that's fantastic. So, the light comes from the subject of interest. It goes through lenses that have been specially designed to kind of process it, not [00:19:30] for human consumption, but for input to a computer or software algorithm. But they can do a lot of the work so that the software has relatively less to do and therefore needs relatively less power.
[00:19:41] Gordon Wetzstein: Exactly. Less power, or it can just be done faster because.
[00:19:45] Russ Altman: Yes.
[00:19:45] Gordon Wetzstein: Processing at the speed of light is maybe a tiny little bit quicker than, uh, you know, running the electrons through a digital system. Uh, so for latency sensitive systems, like again, like I'm always going back to this autonomous driving application.
[00:19:59] It's [00:20:00] like if you can shave off, you know, a few milliseconds of latency from your autonomous car, it can just make decisions ever so much faster.
[00:20:10] Russ Altman: So, I don't want to get too greedy. But, um, we know that these AI systems are being run on huge power farms, where they're using gigawatts of energy. There's actually an environmental, a nontrivial environmental impact of some of these deep learning algorithms like ChatGPT. Um, is optical computing [00:20:30] going to be part of a solution there, or is that too far flung to think about?
[00:20:34] Gordon Wetzstein: Oh, there are many different flavors of optical computing, and you're absolutely right, there is a big environmental concern. When I think about AI, typically people separate that into the training stage, what does it take to train an algorithm, and then the inference stage, what does it take to run a pre-trained algorithm? So, with the optical computing systems that I work on, we mainly focus on the inference stage, once the system [00:21:00] is already trained and we want to make the inference more efficient and faster. There are also a lot of applications for enabling the training.
[00:21:08] So you mentioned like data centers. So typically, I mean, ChatGPT is trained on like hundreds of the highest end GPUs that are sitting in the data center. And this is not a desktop computer like you may know at home where you have one computer with a GPU in it. This is like racks and racks and racks full of GPUs, and they all have to communicate with one another. Hundreds of GPUs [00:21:30] train the same algorithm and they have to very efficiently communicate with one another.
[00:21:33] Russ Altman: Yes.
[00:21:34] Gordon Wetzstein: So, there, optical communication
[00:21:36] Russ Altman: Ah.
[00:21:36] Gordon Wetzstein: Between all these GPUs and the different nodes in the data center is absolutely key. So, you know, optical computing or interconnects are actually commonly used there also to enable very fast networking and
[00:21:50] Russ Altman: Yes.
[00:21:50] Gordon Wetzstein: Distributed training on GPUs. Personally, I don't work on this, but a lot of my colleagues do.
[00:21:55] Russ Altman: But now I'm thinking, because of what you've taught us, now I'm thinking I [00:22:00] shouldn't think of those fiber optics simply as connectors. It's possible that they'll be able to start doing some of the processing, uh, on the data so that when it reaches the next computer in the hop, that is data that's already been massaged in some way.
[00:22:15] And so I'm just making this up, but your examples from before are making me see that you can do work with the optics. You don't have to just have it.
[00:22:22] Gordon Wetzstein: And this is a great research question because, you know, in engineering, uh, conventional wisdom basically says [00:22:30] that optics are good for communication and, uh, you know, electronics are great for the computation. This is the standard paradigm. If you think about, you know, your internet connection at home, there's probably going to be an optical fiber cable coming all the way to your house. But that's exclusively used for communication because photons are so fast, and you can put so many different photons carrying different information through the cable all at the same time. So, the bandwidth becomes huge. But the fundamental research question that [00:23:00] we and others are trying to answer in this field is like, how much compute can the photons do?
[00:23:04] Russ Altman: Yes.
[00:23:05] Gordon Wetzstein: Because that can be very beneficial.
[00:23:07] Russ Altman: This is The Future of Everything with Russ Altman. More with Gordon Wetzstein next.
[00:23:23] Welcome back to The Future of Everything. I'm Russ Altman and I'm talking with Professor Gordon Wetzstein of Stanford University. In the last segment, [00:23:30] Gordon told us about these amazing things we're doing in imaging, sensing. But Gordon, I know you also generate images using generative AI and other technologies. How do those two connect?
[00:23:42] Gordon Wetzstein: Russ, that's a great question. So, you know, I was just telling you about all the work we do in EE, but I'm also partly in CS and in computer science, you know, my background is actually in computer graphics and 3D computer vision. And the idea of computer graphics is to build tools that help people create 3D content, [00:24:00] like the Pixar movies you may know, uh, and build the tools behind that. So, in that space, we actually apply very similar methodologies. This co-design of maybe not hardware and software, but the way I think about it is, we combine AI and physics to some extent. And the physics in computer graphics is, you know, the way we simulate how light interacts with a surface, with a person, like how do you compute an image based on the material properties and so on. So, we apply the same methodologies of combining the [00:24:30] best of physics and modeling with the best of AI, in this case generative AI, in a meaningful way.
[00:24:35] Russ Altman: Yeah.
[00:24:35] Gordon Wetzstein: And what does that mean? It means that we try to model the structure that we know is there. For example, uh, if I wanted to model a 3D human, for example, I know that the human lives in 3D, there's a camera that can look at the human from different perspectives and humans move in certain ways, so can I learn to generate certain aspects of this, like the appearance of the person, but can I still [00:25:00] use modeling tools to, you know, articulate their face and move their mouth and, uh, light them using a physical lighting environment and things like that.
[00:25:12] So, maybe that sounds a little bit abstract. But, you know, one of the things that we looked at is, for example, to take a large collection of images of people's faces, so portraits. And we don't have multiple images per person, it's just like a random collection of images. And can we train a generative AI [00:25:30] algorithm to now learn to generate 3D digital...
[00:25:34] Russ Altman: Ah.
[00:25:34] Gordon Wetzstein: ...humans from that unstructured collection of 2D images? That's a big project we worked on a while ago, and it's only possible by combining physics and generative AI in a meaningful way.
[00:25:47] Russ Altman: Yeah. So I just want to interrupt you because I see what you're saying because you could, it would be relatively easy if you had a bunch of movies of people taken from different angles, then it would be just the data itself would give [00:26:00] the AI some clue about how to do that modeling on a new face.
[00:26:04] But I love this example because now you just have that one image, and you kind of have to use a physical model of the world and depth and 3D in order to kind of help the AI understand what it should be generating. So, could this in principle lead to AI that requires less data? Because the other thing, we talked about power earlier, the other thing that everybody knows in addition to the power requirements of AI, [00:26:30] is the huge data requirements. And I'm wondering, in this case, you only give it a single image plus some physics. And it sounds like therefore you don't need as much data. Is that a fair statement?
[00:26:39] Gordon Wetzstein: That is a fair statement. So, I'll give you a specific example of this. You know, digitizing human identities is sort of commonly used in Hollywood production. So, any Hollywood movie you watch today, you know, most of the pixels aren't actually captured by a camera. They are rendered by an algorithm.
[00:26:58] Russ Altman: Ah.
[00:26:58] Gordon Wetzstein: And when a famous [00:27:00] actor goes on stage, they can't put that actor in any of the, you know, crazy action sequences that they're actually shooting. So, what they typically do is they have a big photo camera rig on stage with like hundreds of high-resolution cameras that take pictures of that person from all these different angles and then reconstruct a 3D model of the actor. And with that reconstruction, this digital double, they can make them do things that would be perhaps dangerous for the actual actor and maybe impossible, like put them on Mars or something like that.
[00:27:27] Russ Altman: And I've also seen they've starting to make actors [00:27:30] younger.
[00:27:31] Gordon Wetzstein: Yes.
[00:27:32] Russ Altman: Remove the wrinkles.
[00:27:33] Gordon Wetzstein: younger or older
[00:27:34] Russ Altman: or older. Yes.
[00:27:35] Gordon Wetzstein: So, you know, these visual effects are just based on a 3D representation of the actor. And this technology has been very expensive, difficult to maintain, and only really accessible to, uh, people with deep pockets like Hollywood production studios. So generative AI sort of changes that game a little bit in that we can now learn some of these priors, as we call them, [00:28:00] from, uh, data.
[00:28:01] So you have a lot of unstructured data on the internet, and we want to learn these priors from that data and then apply it to less data after. So instead of using a camera rig with hundreds of cameras taking pictures of me, can I just take a single image and still get a reasonable-looking 3D reconstruction?
[00:28:21] It's an ill-posed problem, like I don't have enough information. I don't know what I look like from here, uh, because the camera didn't see that, right. But I've [00:28:30] seen a hundred million other people from that perspective. So, can I leverage some of that previous knowledge to fill in the gaps and do this inpainting or, you know, extrapolate from the current view? And that's the basic idea behind using generative AI tools.
[00:28:44] Russ Altman: And how is it working? Are you happy with the levels of fidelity or maybe it would be fun to ask you, what are some of the failures at least early in the process where you just don't get it right? I mean, the side, like the ears don't look right or eyebrows don't match or what are the kinds of things that you [00:29:00] see?
[00:29:00] Gordon Wetzstein: I would say we're about 90 percent there. So, we've made great progress over just the last two, three years, and people are just blown away by what you can do already. But there's still some long way to go to make this photorealistic. So, we're not quite as far as Hollywood is. For example, teeth are hard to do.
[00:29:18] Russ Altman: Wow.
[00:29:19] Gordon Wetzstein: Like when you generate a digital human and you move the camera, sometimes the teeth are, you know, moving a little bit and not super consistent. So, it looks like the person really didn't have [00:29:30] a good dentist, you know?
[00:29:31] Russ Altman: Oh, that's interesting because yeah, so much of the face is malleable and will move over time. That we might tolerate movements of the face, but then if we start seeing teeth chatter around, that's weird, that's creepy.
[00:29:44] Gordon Wetzstein: It's a little bit creepy. Yes. And I would say maybe the hardest thing is really hair. Maybe not my hair cause they're, you know, but if you have a crazy haircut, long hair, individual strands of hair. There's so much detail in that, and even, [00:30:00] you know, this may be so much smaller than even a single pixel, the information captured by a single pixel, so getting that right is really, really hard, I would say.
[00:30:08] Russ Altman: Do you use physical models of the hair? Because I know, you know, curly hair versus straight hair, there, uh, you know, there is a, I'm a biologist, there is a molecular basis for the curly hair, and we know what it is. And so I'm wondering if that's an opportunity to bring in more physics and say, okay, once you tell me that they have a tight curl, I'll be able to, uh, extrapolate what their hair might look like much more [00:30:30] accurately than if you tell me they have wavy hair or straight hair or whatever.
[00:30:33] Gordon Wetzstein: Yeah, that's a great point. I mean, physics could certainly be helpful there. We haven't gone down to that level quite yet of the physical models. I mean, to some extent you want to apply physics in the way that it's efficient and it can help you model things, but you can't model everything really well. So, modeling hair is actually really difficult. So, this is where you want to rely on the data driven aspect and just look at lots of people's hair to get....
[00:30:56] Russ Altman: Right.
[00:30:56] Gordon Wetzstein: .... a sense for how does that look like? You don't want to explicitly constrain it too much. [00:31:00] Uh, you mainly want to rely on the data. So, excuse me. So that's maybe a part where the data aspect is more powerful or makes things easier. Yeah.
[00:31:12] Russ Altman: Uh, that's fantastic. Now, um, you mentioned before that we're increasingly getting cameras that are sensing depth. And so now I'm thinking that those pictures so and then you also said that you're using these single portraits, but I'm assuming that the portraits that you're using for the 3D faces don't have [00:31:30] that depth information. But I'm imagining that if you did have portraits with the depth information, your ability to do the 3D modeling would get both more accurate and a little bit easier. Is that true?
[00:31:39] Gordon Wetzstein: That's absolutely true. I think there it's a matter of scale. How much data do you have? The more information and the better the information is that you have, the less data you need to really train a good algorithm. If you have less structured data or less information in each image, like a lack of depth, for example, you just need a lot more data to [00:32:00] learn these aspects.
[00:32:02] Russ Altman: So, in the last couple of minutes, I do want to ask you about presenting images to the human eye because I know you've done work there. I mean, it's quite amazing. And I did look at a lot of your papers over the last few years, and there's definitely a thread about creating displays that like either get projected onto the eye or are part of, like, virtual reality goggles.
[00:32:23] Maybe in the last minute or two, tell me what some of the challenges are? Um, I have never worn VR goggles that don't make [00:32:30] me sick almost immediately. And so how are we doing at figuring out how to get those images onto the retina and not have people get sick and have them feel like they're really immersed?
[00:32:41] Gordon Wetzstein: Oh, wow, great question, Russ. So, our group works quite a bit on VR and AR, so virtual and augmented reality systems design is really a systems level challenge. You want to basically put a supercomputer in a form factor, uh, of your glasses, your eyeglasses and, you know, run that all day. I mean, it's a crazy idea, [00:33:00] it's the next frontier of computing really. And there, I don't know where to begin on mentioning challenges, but the industry has come quite far. I mean, it's incredible, uh, the level of fidelity that we can get right now. I mean, some of the challenges that we've been looking at is how to design displays that, you know, give you what we call perceptual realism. It's almost like a Turing test where, you know, can you show a digital image to a person and a physical object next to it....
[00:33:26] Russ Altman: Right.
[00:33:26] Gordon Wetzstein: .... that looks, if we want, the goal is [00:33:30] to be able to show this digital image to a person and have them not be able to tell which one is real and which one is digital, right? So that's what we call perceptual realism. And in order to get there, we need a lot of pixels. We need so many pixels to, you know, cover the entire visual field of human vision, the color fidelity, the contrast that we can see, uh, the depth aspects, uh, and so there's a lot of different nuanced, uh, aspects to that, and we've been working on a few of these challenges, [00:34:00] uh, but you also want it to be visually comfortable. So, you mentioned that, you know, people get sick and that's true, there's motion sickness or VR sickness.
[00:34:08] Russ Altman: But I saw that you have at least one paper where the word sickness prediction or the phrase occurs, and I thought that was great.
[00:34:14] Gordon Wetzstein: Yeah, so that often stems from latency, for example, if you move your head, but the image only changes a second later, there's this noticeable latency. You move your head and the image sort of like follows your gaze. That's what VR used to be like 10 years ago, 15 years ago when I started working on [00:34:30] this. I think industry players like Meta, Apple, and others have done an amazing job in, you know, largely eliminating this latency issue.
[00:34:38] But there's a few other aspects, like we talked about depth for cameras. The human visual system uses depth too. Our eyes have a built-in auto focus mechanism that sort of, we call this accommodation, like the eyes accommodate at different distances. Uh, that's how we see the real world, that's how we evolved as children. This is like how the, something looks natural to us. In [00:35:00] VR we don't have that, there's basically, it's basically a magnifying lens in front of a small screen. And you're looking at this magnified image, and it's floating at a fixed distance in front of you. And your eyes sort of are accommodated always at this fixed distance. That sometimes leads to eye strain and, you know, visual discomfort. So, we've been working on all sorts of display technology that helps users be more comfortable in VR, that creates more natural rendition of these digital objects and something that just is hopefully going to pass this visual Turing test at some stage.
[00:35:36] Russ Altman: Thanks to Gordon Wetzstein. That was the future of computational engineering. You've been listening to The Future of Everything with Russ Altman. If you enjoy the podcast, please follow it. Definitely rate and review it. And also remember that we have an archive with more than 200, pushing 250 old episodes. Go back through those and find out about the future of most things.
[00:35:56] You can connect with me on Twitter, or X, @rbaltman. You can find me on Threads @rusbaltman. And you can follow Stanford Engineering @stanfordeng.