One technology that, as far as I can tell, lags somewhat behind visual deepfakes at the moment is voice replication, and I think the audio side of a scene is just as important as the video.
Most research I've seen in that area is more along the lines of providing text to a model that then reads it out in a voice it has learned, but it feels to me that it should be significantly easier to take prerecorded audio and shift its tone and pitch to match the target voice. This approach would probably fall short on things like the speed and rhythm of a voice, so I'm not sure how convincing it would be, but I think it's worth exploring.
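As a rough illustration of that pitch-matching idea, here's a minimal sketch using librosa: estimate each speaker's median fundamental frequency, then shift the source recording by the semitone gap between the two voices. The file names are placeholders, and a global pitch shift like this deliberately ignores timbre, formants, and prosody, which is exactly where I'd expect it to fall short.

```python
# Naive pitch-matching sketch: shift a source recording so its median
# pitch lines up with a target speaker's. File names are placeholders.
import numpy as np
import librosa
import soundfile as sf

def median_f0(y, sr):
    # Per-frame fundamental frequency estimate; unvoiced frames come
    # back as NaN, so take the median over voiced frames only.
    f0, _, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    return np.nanmedian(f0)

# Load the prerecorded speech and a sample of the target voice,
# resampling the target to the source's sample rate.
source, sr = librosa.load("source_speech.wav", sr=None)
target, _ = librosa.load("target_voice_sample.wav", sr=sr)

# Semitone offset between the two voices' median pitches.
n_steps = 12 * np.log2(median_f0(target, sr) / median_f0(source, sr))

# Shift the whole source recording by that offset and save it.
shifted = librosa.effects.pitch_shift(source, sr=sr, n_steps=float(n_steps))
sf.write("shifted_speech.wav", shifted, sr)
```

Even in the best case this only moves the pitch; matching the target's rhythm and timing would need something like time-stretching aligned to the target's speech patterns, which is where this simple approach stops being simple.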
I think the idea of using contracted models for deepfakes is the right one both legally and morally, so it would be good to collect voice samples as well as high-quality images from angles optimised for the best deepfake results.