Well, you could use an EA to take a stab at finding better minima :)
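To make that concrete, here is a minimal sketch of the kind of thing I mean: a bare-bones (1+1) evolution strategy that mutates a single parameter vector with Gaussian noise and keeps the child whenever it is no worse. Everything below (the function name, step size, iteration count, and the toy loss) is my own illustrative choice, not code from any particular library:

```python
import numpy as np

def one_plus_one_es(loss, w0, sigma=0.5, iters=2000, seed=0):
    """Minimal (1+1) evolution strategy: keep a single parent, mutate it
    with Gaussian noise, and accept the child only if it is no worse."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    best = loss(w)
    for _ in range(iters):
        child = w + sigma * rng.standard_normal(w.shape)
        e = loss(child)
        if e <= best:
            w, best = child, e
    return w, best

# Toy usage on a non-convex 1-D loss with several local minima.
loss = lambda w: float(np.sin(3 * w[0]) + 0.1 * w[0] ** 2)
print(one_plus_one_es(loss, w0=[4.0]))
```

Because the mutation can jump over barriers between basins, this kind of search has at least a chance of escaping a poor local minimum, which plain gradient descent does not.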
And correct me if I'm wrong, but isn't the cost function of a feed-forward neural network that uses a sigmoid activation function convex with respect to the parameters being trained, i.e. isn't gradient descent guaranteed to find the global minimum when a small enough step size is used?
Mostly, no. Hidden units introduce non-convexity into the cost. How about a simple counter-example?
Take a simple classifier network with one input, one hidden unit, one output, and no biases. To make things even simpler, tie the two weights, i.e. make the first weight equal to the second. Mathematically, the output of the network can then be written z = f(w * f(w * x)), where f() is the sigmoid.
Next, consider a dataset with two items: [(x_1, y_1), (x_2, y_2)], where x_i is the input and y_i is the class label, 0 or 1. Take as values: [(0.9, 1), (0.1, 0)]. The cost function (the negative log-likelihood in this case) is:

E(w) = -[ y_1 * log(z_1) + (1 - y_1) * log(1 - z_1) ] - [ y_2 * log(z_2) + (1 - y_2) * log(1 - z_2) ],   with z_i = f(w * f(w * x_i)).

Plot E(w) over a range of w and you will see that it is not convex: it has more than one local minimum, so a small enough step size only guarantees convergence to whichever local minimum you started near.
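If you want to check that numerically rather than by plotting, here is a quick NumPy sketch (the names and the grid range are my own choices) that evaluates E(w) on a grid of w values and reports every strict local minimum it finds; the scan should turn up more than one, which already rules out convexity:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(w, data=((0.9, 1), (0.1, 0))):
    """Negative log-likelihood of the tied-weight network z = f(w * f(w * x))."""
    e = 0.0
    for x, y in data:
        z = sigmoid(w * sigmoid(w * x))
        e -= y * np.log(z) + (1 - y) * np.log(1 - z)
    return e

# Scan the single weight over a grid and report the strict local minima.
ws = np.linspace(-20, 5, 2001)
es = np.array([cost(w) for w in ws])
is_local_min = (es[1:-1] < es[:-2]) & (es[1:-1] < es[2:])
print("local minima near w =", ws[1:-1][is_local_min])
print("costs at those points:", es[1:-1][is_local_min])
```

Gradient descent started in the basin of the shallower minimum will converge there no matter how small the step size, which is exactly why the guarantee asked about above does not hold.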