encoder/decoder/discriminator composition
Unless there is some hidden reasoning, the current composition of the three components seems rather random. In particular:
- encoder: a narrow layer of 256 neurons followed by a wider one (256)
- decoder: an even wider layer first (1024), followed by two narrower ones (512)
- consequently, the encoder and decoder are asymmetric, which should not happen (the decoder is conventionally a mirror image of the encoder)
- discriminator: given its very narrow input, it's pointless to make it so huge (512-256-256)
- a seemingly random mix of activation functions
- specifically, is there a reason for using sigmoid? It saturates easily, and the resulting vanishing gradients are well known to slow down or block convergence
For all of these we either need strong reasons, or, if we simply don't know, we should replace them with something more regular (a rough sketch of what I mean follows below).
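To make the "more regular" point concrete, here is a minimal PyTorch sketch; all widths, `input_dim` and `latent_dim` are made-up placeholders, not values taken from our code. The idea is a mirrored encoder/decoder, one activation (ReLU) throughout, and a discriminator kept slim because its input is narrow:

```python
import torch.nn as nn

# Hypothetical "regular" layout, for illustration only:
# mirrored encoder/decoder, ReLU everywhere, slim discriminator.
latent_dim = 32
input_dim = 784   # assumed flattened input size, adjust to the real data

encoder = nn.Sequential(
    nn.Linear(input_dim, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, latent_dim),
)

decoder = nn.Sequential(              # exact mirror of the encoder
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, input_dim),
)

discriminator = nn.Sequential(            # small, since its input is only
    nn.Linear(latent_dim, 64), nn.ReLU(),  # latent_dim wide; it returns a
    nn.Linear(64, 1),                      # logit (use BCEWithLogitsLoss)
)
```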
Moreover, IMHO the layer sizes should not be hard-coded; they should somehow reflect the size of the input, e.g. be derived from it as in the second sketch below.
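A sketch of what "derived from the input size" could mean in practice (again hypothetical; the function name and the `shrink` factor are mine, not anything in the repo):

```python
import torch.nn as nn

def make_autoencoder(input_dim: int, latent_dim: int, shrink: int = 4):
    """Widths derived from input_dim instead of hard-coded constants.
    `shrink` (a made-up knob) controls how quickly the layers narrow."""
    h1 = max(latent_dim, input_dim // shrink)   # e.g. 784 -> 196
    h2 = max(latent_dim, h1 // shrink)          # e.g. 196 -> 49
    encoder = nn.Sequential(
        nn.Linear(input_dim, h1), nn.ReLU(),
        nn.Linear(h1, h2), nn.ReLU(),
        nn.Linear(h2, latent_dim),
    )
    decoder = nn.Sequential(                    # mirrored automatically
        nn.Linear(latent_dim, h2), nn.ReLU(),
        nn.Linear(h2, h1), nn.ReLU(),
        nn.Linear(h1, input_dim),
    )
    return encoder, decoder
```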