I look forward to reading this in closer detail, but it looks like they solve an inverse problem to recover a ground truth set of voxels (from a large set of 2D images with known camera parameters), which is underconstrained. Neat to me that it works w/o using dense optical flow to recover the structure -- I wouldn't have thought that would converge.
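For my own intuition, here's a toy sketch (Python/PyTorch) of that inverse problem: treat a voxel grid as free parameters, render it differentiably from known cameras, and run gradient descent on the photometric error. Purely illustrative -- a dense grid with fake orthographic "cameras", not their adaptive sparse voxel rasterizer, and with only three views it is exactly as underconstrained as you'd expect.

    # Toy inverse rendering: recover a voxel grid from images with known cameras.
    # Illustrative only; the real method optimizes an adaptive sparse grid with a
    # differentiable rasterizer, not this dense orthographic toy.
    import torch

    def render(density, axis):
        # Alpha-composite the grid along one axis (a fake orthographic camera).
        d = torch.movedim(density, axis, 0)
        alpha = 1.0 - torch.exp(-torch.relu(d))                  # per-voxel opacity
        keep = torch.cat([torch.ones_like(d[:1]), 1.0 - alpha[:-1]], dim=0)
        trans = torch.cumprod(keep, dim=0)                       # transmittance before each voxel
        return (alpha * trans).sum(dim=0)                        # composited 2D "image"

    # Ground-truth scene: a solid block in the middle of a 16^3 grid.
    gt = torch.zeros(16, 16, 16)
    gt[5:11, 5:11, 5:11] = 3.0
    targets = [render(gt, ax) for ax in range(3)]                # "photos" with known cameras

    # Optimize a free voxel grid so its renders match the photos.
    est = torch.zeros(16, 16, 16, requires_grad=True)
    opt = torch.optim.Adam([est], lr=0.1)
    for step in range(500):
        loss = sum(((render(est, ax) - t) ** 2).mean() for ax, t in enumerate(targets))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print("final photometric loss:", loss.item())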
Love this a whole heck of a lot more than NeRF, or any other "lol lets just throw a huge network at it" approach.
>Love this a whole heck of a lot more than NeRF, or any other "lol lets just throw a huge network at it" approach.
Well yes, but that's what Gaussian splatting also was. The question is: are their claims of being so much better than gsplat accurate?
There's no neural net in Gaussian splatting; it's a fancy point cloud that's optimized with ML techniques.
I know, that's the point.
Mea culpa, I misunderstood.
Why is this called rendering, when it would be more accurate to call it reverse-rendering (unless "rendering" means any kind of transformation of visual-adjacent data)?
The reverse-rendering is not real-time; it takes several minutes. Only rendering new viewpoints from the resulting sparse voxel representation runs at real-time framerates.
This is basically Gaussian splatting using cubes instead of Gaussians. The cube centers and sizes are chosen from a discrete set and the cubes don't overlap, hence the name “sparse voxel”. The qualitative results and rendering speeds are similar to Gaussian splatting, sometimes better and sometimes worse depending on the scene.
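A rough way to see the difference in the primitives (the field names below are my own shorthand, not from either paper): a Gaussian has a continuous center and covariance and can overlap its neighbours, while a sparse voxel is pinned to an octree cell, so its position and size are discrete and cells never overlap.

    # Contrast between the two primitives; field names are illustrative shorthand.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GaussianPrimitive:
        mean: np.ndarray        # free-floating 3D center (continuous)
        covariance: np.ndarray  # 3x3 scale/rotation; primitives may overlap
        color_sh: np.ndarray    # view-dependent color (spherical harmonics)
        opacity: float

    @dataclass
    class SparseVoxel:
        level: int              # octree depth; cube size is fixed per level
        ijk: tuple              # integer cell coordinates at that level (no overlap possible)
        color_sh: np.ndarray    # view-dependent color (spherical harmonics)
        density: float

    def voxel_center_and_size(level: int, ijk: tuple):
        # A voxel's geometry is fully determined by (level, ijk) in a unit cube.
        size = 1.0 / (2 ** level)
        center = (np.array(ijk) + 0.5) * size
        return center, size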
Funny, it almost sounds like a straight efficiency improvement of Plenoxels (the direct predecessor of Gaussian splatting), which would mean Gaussian splatting was something of a red herring/sidetrack. Though I'm not sure atm where the great performance gain is. Definitely interesting.
How is plenoxels a direct predecessor of gaussian splatting?
They both emerged out of the pursuit of a more efficient way to address the inefficiencies of NeRF, which were mainly due to expensive ray marching and MLP calls. Before the emergence of Gaussian splatting, grid-based methods such as Plenoxels were all the rage. Of course, Gaussian splatting here refers to the paper “3D Gaussian Splatting for Real-Time Radiance Field Rendering”.
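To make the cost gap concrete, a back-of-the-envelope sketch (my own toy arithmetic, not measurements from either paper): volume rendering takes on the order of a hundred samples per ray, and the question is whether each sample costs an MLP forward pass or just a grid interpolation.

    # Why grid methods (Plenoxels) and splatting are so much faster than the original
    # NeRF: every ray sample in NeRF is an MLP forward pass, while a grid only does a
    # trilinear interpolation of stored values. Rough numbers for intuition only.
    samples_per_ray = 192                    # typical coarse + fine sample count
    pixels = 1920 * 1080

    # Original NeRF MLP: ~8 fully connected layers, 256 units wide, per sample.
    mlp_mults_per_sample = 8 * 256 * 256
    # Plenoxels-style grid: interpolate 8 corners, each storing a density plus
    # spherical-harmonic color coefficients (~27 numbers).
    grid_mults_per_sample = 8 * (1 + 27)

    print("multiplies per frame, MLP :", pixels * samples_per_ray * mlp_mults_per_sample)
    print("multiplies per frame, grid:", pixels * samples_per_ray * grid_mults_per_sample)
    print("ratio ~ %dx" % (mlp_mults_per_sample // grid_mults_per_sample))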
I think this paper is as important as the original Gaussian Splatting paper.
Why do you say so?
What is the use case for radiance fields?
Take a bunch of photos of an object or scene. Fly around the scene inside a computer.
https://news.ycombinator.com/item?id=43120582
Like photogrammetry, but it handles a much wider range of materials.
Can someone ELI5 what the input to these renders is?
I'm familiar with the premise of NeRF "grab a bunch of relatively low-resolution images by walking in a circle around a subject/moving through a space", and then rendering novel viewpoints,
but on the landing page here the videos are very impressive (though the volumetric fog in the classical building is entertaining as a corner case!),
but I have no idea what the input is.
I assume if you work in this domain it's understood,
"oh these are all standard comparitive output, source from <thing>, which if you must know are a series of N still images taken... " or "...excerpted image from consumer camera video while moving through the space" and N is understood to be 1, or more likely, 10, or 100...
...but what I want to know is,
are these video- or still-image input;
and how much/many?
They are photos, in this case from the Mip-NeRF 360 dataset. I believe there are on the order of hundreds per scene. They are not videos turned into photos. Some datasets include high-grade position and directional information -- I believe this dataset does not, so you need to do some work to orient the rendering training. But I'm a hobbyist, so all this could be very wrong.
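To add a bit of concreteness (same hobbyist caveat): after a structure-from-motion pass such as COLMAP, each photo ends up paired with camera intrinsics and a pose, roughly like the sketch below. The field names are my own illustration, not any particular dataset's schema.

    # Roughly what the training code consumes per photo after pose estimation.
    # Field names are illustrative, not a real dataset schema.
    import numpy as np

    frame = {
        "image_path": "photos/IMG_0001.jpg",
        "intrinsics": {                   # pinhole camera model
            "focal_px": 1234.5,
            "cx": 960.0, "cy": 540.0,     # principal point
            "width": 1920, "height": 1080,
        },
        "camera_to_world": np.eye(4),     # 4x4 pose recovered by structure from motion
    }
    dataset = [frame] * 200               # typically a few hundred posed photos per scene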
> We optimize adaptive sparse voxels radiance field from multi-view images…
Pretty sure the input is the same as for NeRFs, GS and photogrammetry: as many high-res photos from as many angles as you have the patience to collect.
I think the example scenes come from a standard collection of photos that is widely used as a common reference point.