```python
self.cross_entropy_loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logprobs[:, -1, :], labels=self.states)
```
Why do you use `softmax_cross_entropy_with_logits` here? The first state is `[10.0, 128.0, 1.0, 1.0] * args.max_layers`, and the labels are the same values. The final output of the RNN corresponds to the action, so why apply a softmax cross-entropy between the logits and the raw state/action values?
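To make the concern concrete, here is a minimal NumPy sketch (my own illustration, not code from this repo) of what `tf.nn.softmax_cross_entropy_with_logits` computes: `-sum(labels * log_softmax(logits))` over the last axis. With one-hot labels this is the usual negative log-likelihood; with raw state values like `[10.0, 128.0, 1.0, 1.0]` the labels are not a probability distribution, which is exactly why this loss looks suspicious:

```python
import numpy as np

def softmax_cross_entropy_with_logits(logits, labels):
    """Mirrors tf.nn.softmax_cross_entropy_with_logits:
    loss = -sum(labels * log_softmax(logits)) along the last axis."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(labels * log_softmax).sum(axis=-1)

# With one-hot labels, the loss is the negative log-probability of the true class.
logits = np.array([[2.0, 1.0, 0.1]])
one_hot = np.array([[1.0, 0.0, 0.0]])
print(softmax_cross_entropy_with_logits(logits, one_hot))  # ~0.417

# With raw state values as labels (as in the quoted line), the labels do not
# sum to 1, so the result is a scaled quantity rather than a true cross-entropy.
state_labels = np.array([[10.0, 128.0, 1.0, 1.0]])
state_logits = np.array([[0.5, 0.2, 0.1, 0.3]])
print(softmax_cross_entropy_with_logits(state_logits, state_labels))
```

Note that because cross-entropy is linear in the labels, labels summing to `S` simply scale the loss by `S` relative to the normalized distribution, so the gradient direction is preserved but the magnitude is inflated by the label sum.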