I will admit to not following AI at all for about 20 years, so perhaps this is old hat now, but having separate policy networks and value networks is quite ingenious. I wonder how successful this would be at natural language generation. It reminds me of Krashen's theories of language acquisition, in which a "monitor" gives you fuzzy matches on whether your sentences are correct or not. One of these days I'll have to read their paper.
For language generation, AFAIK there is no good model that follows this architecture. For image generation, Generative Adversarial Networks are strong contenders. See for instance: