Q-learning

  1. Using your favorite search engine, find a simple tutorial on Q-learning. Read the tutorial.


  2. Implement Q-learning. Here is some pseudocode for one iteration of Q-learning. Your program will call this in a loop until it converges:
    // Pick an action
    Let Q be a table of q-values;
    Let i be the current state;
    Let ε = 0.05;
    if(rand.nextDouble() < ε)
    {
    	// Explore (pick a random action)
    	action = rand.nextInt(4);
    }
    else
    {
    	// Exploit (pick the best action)
    	action = 0;
    	for(int candidate = 0; candidate < 4; candidate++)
    		if(Q(i, candidate) > Q(i, action))
    			action = candidate;
    	if(Q(i, action) == 0.0)
    		action = rand.nextInt(4);
    }
    
    // Do the action
    do_action(action);
    Let j be the new state;
    
    // Learn from that experience
    Apply the equation below to update the Q-table.
    	a = action.
    	Q(i,a) refers to the Q-table entry for doing
    		action "a" in state "i".
    	Q(j,b) refers to the Q-table entry for doing
    		action "b" in state "j".
    	Use 0.1 for αk (the learning rate). Don't confuse "αk" with the action "a".
    	Use 0.97 for γ (gamma, the discount factor).
    	A(j) is the set of four possible actions, {<,>,^,v}.
    	r(i,a,j) is the reward you obtained when you landed in state j.
    
    // Reset
    If j is the goal state, teleport to the start state.
    
    Here is the equation that you will implement at the core of your Q-learner:

    	Q(i,a) = (1 - αk) * Q(i,a) + αk * ( r(i,a,j) + γ * max over b in A(j) of Q(j,b) )
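
    Putting the pseudocode and this update together, one iteration might look like the following Java sketch. It assumes the Q-table is stored as a double[200][4] array indexed by state, and that doAction(action, rand) and reward(j) are helpers supplied by your grid-world (one possible version of them is sketched under item 3 below).

    // One iteration of tabular Q-learning: epsilon-greedy action selection,
    // then Q(i,a) = (1 - alpha)*Q(i,a) + alpha*(r + gamma*max_b Q(j,b)).
    // Returns the new state j; the caller should teleport back to the start
    // whenever j is the goal state.
    static int qLearnStep(double[][] q, int i, Random rand) {
    	final double epsilon = 0.05, alpha = 0.1, gamma = 0.97;

    	// Pick an action
    	int action;
    	if(rand.nextDouble() < epsilon) {
    		action = rand.nextInt(4); // explore
    	} else {
    		action = 0; // exploit
    		for(int candidate = 1; candidate < 4; candidate++)
    			if(q[i][candidate] > q[i][action])
    				action = candidate;
    		if(q[i][action] == 0.0) // nothing learned yet, so pick randomly
    			action = rand.nextInt(4);
    	}

    	// Do the action
    	int j = doAction(action, rand); // hypothetical grid-world helper
    	double r = reward(j);           // reward for landing in state j

    	// Learn from that experience
    	double best = q[j][0];
    	for(int b = 1; b < 4; b++)
    		best = Math.max(best, q[j][b]);
    	q[i][action] = (1.0 - alpha) * q[i][action] + alpha * (r + gamma * best);
    	return j;
    }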


  3. Implement a simple grid-world. Use dimensions of 20 wide and 10 high. There are 4 possible actions: {left, right, up, down}. Start in one corner. Place a reward in the opposite corner. Between the start and goal states, place a wall of penalty states with a small hole through which the agent can pass. With some probability, make each action move the agent in a direction other than the one intended. Use Q-learning to learn a policy for solving this problem. (Your Q-table will require 20*10*4=800 elements. Initialize them with zeros, then iteratively apply Q-learning until your Q-table converges to represent a good policy for this game. An example of the dynamics is sketched below.)
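
    For example, the dynamics could be sketched as follows. The 10% slip probability, the wall along column 10 with a hole at row 6, and the reward values are arbitrary choices here, not requirements.

    // Grid-world dimensions and agent position (start in the bottom-left corner).
    static final int WIDTH = 20, HEIGHT = 10;
    static int x = 0, y = HEIGHT - 1;

    // Perform an action (0=left, 1=right, 2=up, 3=down) and return the new state index.
    static int doAction(int action, Random rand) {
    	if(rand.nextDouble() < 0.1)  // sometimes the action misfires
    		action = rand.nextInt(4);
    	if(action == 0) x = Math.max(0, x - 1);
    	else if(action == 1) x = Math.min(WIDTH - 1, x + 1);
    	else if(action == 2) y = Math.max(0, y - 1);
    	else y = Math.min(HEIGHT - 1, y + 1);
    	return y * WIDTH + x;
    }

    // Reward for landing in state j: the goal sits in the corner opposite the start,
    // and a wall of penalty states runs down column 10 with a hole at row 6.
    static double reward(int j) {
    	int jx = j % WIDTH, jy = j / WIDTH;
    	if(jx == WIDTH - 1 && jy == 0) return 100.0; // goal
    	if(jx == 10 && jy != 6) return -10.0;        // penalty wall
    	return -1.0;                                 // small cost per step
    }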


  4. Print your policy to the console at periodic intervals while it learns. Use "#" to indicate a penalty state, "G" to indicate the goal, "S" to indicate the start state, and the characters "<,>,^,v" to indicate left, right, up, and down for the policy in all other states. (A sketch of such a printing loop follows the example.) Example:
    	> > v v v v v v < < # > > > > > > > > G
    	> v > v v v v v v < # > > > > > > > ^ ^
    	> > > > > v v v v < # > > > > > ^ ^ ^ ^
    	> > > > > v v v v < # > > > > > ^ ^ ^ ^
    	> > > > > > v v v < # > > > > ^ ^ ^ ^ ^
    	> > > > > > > > v v # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > > > > > > > ^ ^ ^ ^ ^ ^ ^
    	> > > > > > > > > > > > > ^ ^ ^ ^ ^ ^ ^
    	> > > > > > > > ^ ^ # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > > ^ ^ # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > > ^ ^ # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > ^ ^ < # > > > ^ ^ ^ ^ ^ ^
    	> > > > > ^ ^ ^ ^ < # > > > ^ ^ ^ ^ ^ ^
    	> > > > ^ ^ ^ ^ ^ < # > > > ^ ^ ^ ^ ^ ^
    	S > > ^ ^ ^ ^ ^ ^ < # > > > ^ ^ ^ ^ ^ ^
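
    A minimal sketch of such a printing loop, assuming the Q-table layout and the WIDTH/HEIGHT, wall, start, and goal conventions from the sketches above:

    // Print the greedy policy, one character per grid cell.
    static void printPolicy(double[][] q) {
    	char[] glyph = { '<', '>', '^', 'v' }; // actions 0..3
    	for(int y = 0; y < HEIGHT; y++) {
    		StringBuilder row = new StringBuilder();
    		for(int x = 0; x < WIDTH; x++) {
    			if(x == WIDTH - 1 && y == 0) { row.append("G "); continue; }  // goal
    			if(x == 0 && y == HEIGHT - 1) { row.append("S "); continue; } // start
    			if(x == 10 && y != 6) { row.append("# "); continue; }         // penalty wall
    			int s = y * WIDTH + x;
    			int best = 0;
    			for(int a = 1; a < 4; a++)
    				if(q[s][a] > q[s][best])
    					best = a;
    			row.append(glyph[best]).append(' ');
    		}
    		System.out.println(row);
    	}
    	System.out.println();
    }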
    


  5. If you are taking the graduate section of this class: when your program is executed with the optional command-line parameter "nn", use a neural network to implement your Q-table instead of the table data structure used above.

    You can use the neural network from previous assignments. Some fiddling may be necessary with the number of layers and other meta-parameters to get it working well. The neural network may also require many more iterations to converge.

    I recommend using two inputs {x,y} and 4 outputs {q-left,q-right,q-up,q-down}. Here is an example that makes a neural network with the topology 2->16->4:
    Random rand = new Random(1234567);
    NeuralNet nn = new NeuralNet();
    nn.layers.add(new LayerTanh(2, 16));
    nn.layers.add(new LayerTanh(16, 4));
    nn.init(rand);
    
    To refine your neural network, call "trainIncremental(in, out, 0.03)", where "in" is your input vector, "out" is your output vector, and 0.03 is a reasonable learning rate. Remember to train it using values that have been normalized to fall approximately in the range [0,1]. To make a prediction with your neural network, call "forwardProp(in)", and it will return the vector of 4 Q-values for the state you specify.
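
    For example, a grid cell (x, y) might be encoded and queried like this (a sketch that assumes the forwardProp API described above and the WIDTH/HEIGHT constants from the grid-world):

    // Encode the state as two inputs normalized to roughly [0,1], then ask for its 4 q-values.
    double[] in = new double[2];
    in[0] = (double)x / (WIDTH - 1);
    in[1] = (double)y / (HEIGHT - 1);
    double[] qValues = nn.forwardProp(in); // {q-left, q-right, q-up, q-down}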

    If you only want to refine one of the 4 q-values, then use the predicted values for the other 3 q-values when you call "trainIncremental". That way, it will only attempt to change the one q-value for which it is making different predictions. So, let

    	t = r(i,a,j) + γ * max over b in A(j) of Q(j,b)

    Then, instead of doing Q(i,a) = (1-α)*Q(i,a) + α*t, do
    double[] prediction = nn.forwardProp(i); // "i" here is the state encoded as a normalized {x, y} input vector
    double[] copy = Vec.copy(prediction);
    copy[a] = t;                             // overwrite only the q-value for action "a"
    nn.trainIncremental(i, copy, 0.03);      // this line updates the neural net
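
    When using the neural network, the max over Q(j,b) in the target t also comes from the network. A sketch, where encode(x, y) is a hypothetical helper that builds the normalized input vector shown above:

    // Compute the target t for the new state j = (xj, yj), given reward r.
    double[] next = nn.forwardProp(encode(xj, yj));
    double best = Math.max(Math.max(next[0], next[1]), Math.max(next[2], next[3]));
    double t = r + 0.97 * best; // t = r(i,a,j) + gamma * max over b in A(j) of Q(j,b)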
    


  6. Click on the "Project submission" link on the main class page to submit your archive file. If you have any difficulties with the submission server, please e-mail your submission directly to the grader.