I'm not sure how either of those situations would trip up the system they're using. For a system trained on the background image, what difference does it make if the subject is holding a cup? The cup is not the background image, and it would be obvious in the same way that it's obvious the subject isn't the background image.
He might be referring to transparent/translucent/refractive objects, like a glass cup. Supposedly this technique can manage the transparency, but not refraction (and maybe the refraction could trip the transparency into failing).
I'm not associated with the paper, but I don't think this will have the same kinds of effects. It's effectively using a photo without the user to discriminate the background from any subjects in the scene.
The neural network seems to be mostly for handling variations in lighting and dealing with the fuzzy effects you'll usually get around the subject. Depth of the subject doesn't appear to be relevant here.
(wouldn't it be nice if, every time a research topic pops up here, there were a small list of essential keywords for finding more background information?)
XSplit's VCam can do background removal without chroma-key. It's reasonably good, and emits a virtual webcam that VC clients or OBS can use as an input. Think it's 40 USD for a lifetime license. Has a bit of ghosting when you move quickly.
The Zoom guesswork-based imitation is pretty good, and it seems to be optimized for a single person's movement. It gets confused when there are more actors like a child or dog entering from stage left.
Very cool. Next step would be to emulate lighting in the target scene, but that probably requires pose detection and facial landmarks for accurate shading.
Very good idea. And change in real-time if the background is dynamic. And allow the user to set styles such as warming the FG subject, shimmering as if there is a fire or candlelight in the room, etc.
Anybody know of any work done to improve greenscreen keying? The current old-school techniques work quite poorly and require so much manual work. I would imagine with the new work coming out with neural nets etc. there would be possibilities for improvement. This is very cool work and good for certain applications, but it seems to produce problems similar to greenscreen on some edges.
Modern green screen plugins/filters are miles better than they used to be, to the point that if the keying is hard, the footage probably wasn't shot well. By that I mean an evenly lit background (no light fall-off producing gradients), proper lighting of the subject, and proper distance from the background (which helps reduce edge fringing and color tinting).
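To make the "evenly lit" point concrete, here's a rough numpy sketch of a naive chroma key (nothing like the commercial plugins, and all the thresholds are made up): alpha is just distance from the key color, so a well-lit backdrop stays inside the tolerance and keys out cleanly, while light fall-off pushes backdrop pixels toward opaque.

```python
import numpy as np

def chroma_key(frame, key_color=(0, 255, 0), tol=80.0, soft=40.0):
    """Naive chroma key: alpha grows with RGB distance from the key color.

    Pixels within `tol` of the key color are fully transparent, pixels
    beyond `tol + soft` are fully opaque, with a linear ramp in between.
    """
    frame = frame.astype(np.float32)
    dist = np.linalg.norm(frame - np.array(key_color, np.float32), axis=-1)
    return np.clip((dist - tol) / soft, 0.0, 1.0)

# A 2x2 toy image: well-lit green screen pixels, one slightly off-green
# backdrop pixel, and one red "subject" pixel.
img = np.array([[[0, 255, 0], [10, 250, 5]],
                [[255, 0, 0], [0, 255, 0]]], dtype=np.uint8)
alpha = chroma_key(img)
```

The slightly off-green pixel still keys out here, but a strong gradient on the backdrop would push its distance past `tol` and leave it opaque, which is exactly the garbage you then fix by hand.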
I work quite a bit with green screen keying. I see the same Keylights, Ultimattes and Primattes still being used, even in the big productions I have worked on. Fixing the key can take weeks. Maybe the industry is a bit conservative and I haven't seen the cool new stuff bubbling under, but I would love to have new tools in the toolset for approaching difficult shots.
If you have a background picture, you have all the info you need to identify your subject - just plain subtraction. I think this is what the Photo Booth app on my circa-2012 MacBook does, quite effectively.
This is a question we’ve gotten quite a bit (second author here).
A good intuition is that if it were easy to do it already with any background, professional studios wouldn’t be spending so much money on green screens. Background subtraction is pretty poor in general without very constrained setups. Our goal is really to provide professional quality without any of the equipment.
And can your solution do what studios want, namely process a 4K video artifact-free when played back on a cinema screen? It doesn't look like that, tbh, if I watch the second video ("Ours real" is your work?).
And yeah, it requires a constrained setup and a lot of additional work, because even before you "subtract" the background you have to think about lighting. Your demo video might have very nice background matting, but the lighting is off, so it's relatively useless except for toy applications (of which there are a lot).
Also: did you compare somewhere with the very basic fixed-exposure method? Because with fixed exposure, background and camera placement, I suppose this should work just as well... Still, I think this is a really cool project; I wasn't disappointed like with the last link of this sort, where someone tried the same thing with horrible artifacting.
Green screens are crap with hair: because it's translucent, the green/blue bleeds through, which means it has to be cleaned up by hand.
Then there are the situations where there isn't a green screen. Again, manual cleanup is required: each frame needs to be cut out by hand, 24 times a second.
The same with a difference matte. Cameras are noisy, so there is constant noise in the alpha channel. This makes the effect look wobbly and cheap.
What this method does is pull a key from a difference matte, and makes it look good.
The project page has a video comparison against the previous state of the art. You can't just subtract the background if it's not 100% static and stable. Further, the novelty seems to be fewer artifacts, especially around hair and eyeglasses.
> If you have a background picture, you have all the info you need to identify your subject - just plain subtraction
It's not really "just plain subtraction", it's keying. Which AIUI basically means setting the alpha according to the difference between the image and the reference.
Green screen works well for this because, excepting Zoe Saldana, people tend to hang out around the opposite side of the colour wheel, so there tends to be a good distance between foreground colour and background colour. If you're trying to do this against arbitrary backgrounds, you seemingly need to augment keying with additional techniques like image segmentation to get good results.
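A minimal numpy sketch of that colour-overlap failure mode (hypothetical helper, made-up thresholds): difference keying alone can't separate a grey jacket from a grey wall, which is exactly where segmentation or a learned matte has to step in.

```python
import numpy as np

def difference_key(frame, clean_plate, tol=30.0, soft=30.0):
    """Set alpha from per-pixel RGB distance to a clean background plate."""
    d = np.linalg.norm(frame.astype(np.float32)
                       - clean_plate.astype(np.float32), axis=-1)
    return np.clip((d - tol) / soft, 0.0, 1.0)

clean_plate = np.full((1, 2, 3), (80, 80, 80), dtype=np.uint8)  # grey wall
frame = np.array([[(200, 40, 40),    # red shirt: far from the wall colour
                   (90, 85, 80)]],   # grey jacket: nearly the wall colour
                 dtype=np.uint8)

alpha = difference_key(frame, clean_plate)
```

The red shirt keys perfectly, but the grey jacket ends up fully transparent, punching a hole in the subject.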
This new method works well for partially transparent regions (hair) and allows slightly larger background movement and color overlap between foreground and background.
I think they baited you with their "look, movement" replacement videos. As far as I can tell, their inputs have a fixed background and are of constant exposure and camera position.
No, the camera is indeed allowed to change a tiny bit. For example, you do not need a tripod. Taking photos with a handheld camera works fine (although a tripod works even better). They explain it in greater detail in their paper: https://arxiv.org/pdf/2004.00626.pdf
Background subtraction methods on the other hand usually fail if the camera moves even a tiny bit or the lighting changes slightly. More advanced methods can recover eventually, but you still get a few frames with improperly removed background.
In the first example (the one with the girl), you can see that there are small camera movements. You can also see the effect this has when applying straightforward background subtraction in the second video.
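You can see why with a tiny numpy experiment (a made-up gradient plate, not footage from the paper): shift a textured clean plate by a single pixel and naive subtraction flags essentially the whole frame as foreground.

```python
import numpy as np

# A textured "clean plate": a horizontal gradient, 64x64 pixels.
background = np.tile(np.arange(0, 256, 4, dtype=np.float32), (64, 1))

# The next frame is the same empty scene, but the camera drifted 1 px sideways.
drifted = np.roll(background, 1, axis=1)

# Naive subtraction with a small threshold now "detects" foreground everywhere,
# because every pixel lands on a slightly different gradient value.
false_foreground = (np.abs(drifted - background) > 2.0).mean()
```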
https://arxiv.org/pdf/2004.00626.pdf, which is inlined at the originally submitted URL. I'm not sure what's going on here, but on HN the convention is probably to link to the project home page first, and after that maybe the Github page and if neither of those exist, to the arxiv.org homepage (but not the pdf since those change with each revision). So I've changed to the project home page for now.
Hi, I'm one of the folks building CatalyzeX (https://www.catalyzex.com). It's intended primarily as a free resource for machine learning practitioners (research engineers, developers, students, and generally anyone interested in R&D) to discover interesting ML projects and papers, easily access the code and datasets, and communicate with the authors or other experts.
The link share here was likely with keeping the relevance of this project to HN in mind, and that easy access to the code and authors would be valuable for anyone here looking to take it further.
Thanks for clarifying the convention here on HN, being transparent, and for updating accordingly. Much appreciated.
Always open to feedback if you have any as well! :)
I wasn't suggesting anything, but having just looked at the submission history it seems clear that it's promotional. The HN community doesn't favor that. It's fine to submit your own site or work occasionally, but not to use HN primarily for promotion.
Also, the submitted title ('Zoom’s virtual background swap but better. DL+GANs for background replacement') was too promotey.
Peripheral: what is the benefit of having these artificial backgrounds? Apart from "it's fun", which wears off after about a minute? In my experience (Zoom meetings), there's blurring/artefacts around the edge of the head, and the image quality seems to suffer as well.
I had a meeting where one participant uses an actual green screen, and the difference was remarkable, with none of the issues above.
There are two parts to background matting. The first is removing the existing background and the second is replacing it with something else. Removing the background improves the focus on the foreground - people watching can see you better and they'll listen more closely because they're not distracted by what's behind you. The second part, replacing the background with something else, might be done because you don't want people to see where you are, or because you want to overlay your foreground video on a presentation. Being able to pretend you're on a holodeck or a desert island is a trivial use of the tech.
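Both steps hang off the standard compositing equation C = αF + (1 − α)B. A small numpy sketch with toy values (estimating α and F is the hard part, which is what the paper is about; this is just the "replace it with something else" half):

```python
import numpy as np

def composite(foreground, alpha, new_background):
    """Alpha compositing: C = alpha * F + (1 - alpha) * B, per pixel."""
    a = alpha[..., None]  # broadcast the matte over the colour channels
    return a * foreground + (1.0 - a) * new_background

fg = np.full((2, 2, 3), 200.0)     # the matted-out subject (toy values)
bg = np.zeros((2, 2, 3))           # the replacement background
alpha = np.array([[1.0, 0.5],      # an opaque pixel, a half-transparent
                  [0.0, 0.0]])     # hair pixel, and pure background
out = composite(fg, alpha, bg)
```

The half-transparent pixel comes out as a 50/50 blend, which is why a good matte (fractional alpha around hair) looks so much better than a hard cut-out.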
Some people feel the need to hide their shitty apartment.
I've seen someone advised that the background on their webcam makes them "look poor", where the concern was that looking poor is a (perverse) impediment to getting paid work, but they can't exactly move, especially under lockdown. It may be better to use a calm artificial background in that case.
See also people doing online-conference presentations and YouTube videos. I've seen quite a few of those using virtual backgrounds.
Perhaps for the same reason - thousands of people may see the video, and some people, having made the effort to put on a nice suit/makeup/etc., get a haircut, and look their best, don't want thousands of people to see their not-so-nice home behind them.
- people who occupy a wider z-axis (for example, leaning forward toward the camera, or with their arms in front of them)
- people holding objects like cups
How well does your method handle those kinds of situations?