Q-learning

  1. Using your favorite search engine, find a simple tutorial on Q-learning. Read the tutorial.


  2. Implement Q-learning. Here is some pseudocode for one iteration of Q-learning. Your program will call this in a loop until it converges:
    // Pick an action
    Let Q be a table of q-values;
    Let i be the current state;
    Let ε = 0.05;
    if(rand.nextDouble() < ε)
    {
    	// Explore (pick a random action)
    	action = rand.nextInt(4);
    }
    else
    {
    	// Exploit (pick the best action)
    	action = 0;
    	for(int candidate = 0; candidate < 4; candidate++)
    		if(Q(i, candidate) > Q(i, action))
    			action = candidate;
    	if(Q(i, action) == 0.0)
    		action = rand.nextInt(4);
    }
    
    // Do the action
    do_action(action);
    Let j be the new state;
    
    // Learn from that experience
    Apply the equation below to update the Q-table.
    	a = action.
    	Q(i,a) refers to the Q-table entry for doing
    		action "a" in state "i".
    	Q(j,b) refers to the Q-table entry for doing
    		action "b" in state "j".
    	Use 0.1 for αk (the learning rate). Don't confuse "αk" with the action "a".
    	Use 0.97 for γ (gamma, the discount factor).
    	A(j) is the set of four possible actions, {<,>,^,v}.
    	r(i,a,j) is the reward you obtained when you landed in state j.
    
    // Reset
    If j is the goal state, teleport to the start state.
    
    Here is the equation that you will implement at the core of your Q-learner:

    	Q(i,a) = (1 - αk) * Q(i,a) + αk * ( r(i,a,j) + γ * max over b in A(j) of Q(j,b) )
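
    Putting the pseudocode and this update together, one iteration might look like the following Java sketch. It assumes the Q-table is stored as a double[200][4] array indexed by state, and that doAction(action, rand) and reward(j) are helpers supplied by your grid-world (one possible version of them is sketched under item 3 below).

    // One iteration of tabular Q-learning: epsilon-greedy action selection,
    // then Q(i,a) = (1 - alpha)*Q(i,a) + alpha*(r + gamma*max_b Q(j,b)).
    // Returns the new state j; the caller should teleport back to the start
    // whenever j is the goal state.
    static int qLearnStep(double[][] q, int i, Random rand) {
    	final double epsilon = 0.05, alpha = 0.1, gamma = 0.97;

    	// Pick an action
    	int action;
    	if(rand.nextDouble() < epsilon) {
    		action = rand.nextInt(4); // explore
    	} else {
    		action = 0; // exploit
    		for(int candidate = 1; candidate < 4; candidate++)
    			if(q[i][candidate] > q[i][action])
    				action = candidate;
    		if(q[i][action] == 0.0) // nothing learned yet, so pick randomly
    			action = rand.nextInt(4);
    	}

    	// Do the action
    	int j = doAction(action, rand); // hypothetical grid-world helper
    	double r = reward(j);           // reward for landing in state j

    	// Learn from that experience
    	double best = q[j][0];
    	for(int b = 1; b < 4; b++)
    		best = Math.max(best, q[j][b]);
    	q[i][action] = (1.0 - alpha) * q[i][action] + alpha * (r + gamma * best);
    	return j;
    }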


  3. Implement a simple grid-world. Use dimensions of 20 wide and 10 high. There are 4 possible actions: {left, right, up, down}. Start in one corner. Place a reward in the opposite corner. Between the start and goal states, place a wall of penalty states with a small hole through which the agent can pass. With some probability, make each action move the agent in a direction other than the one intended. Use Q-learning to learn a policy for solving this problem. (Your Q-table will require 20*10*4=800 elements. Initialize them with zeros, then iteratively apply Q-learning until your Q-table converges to represent a good policy for this game. An example of the dynamics is sketched below.)
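
    For example, the dynamics could be sketched as follows. The 10% slip probability, the wall along column 10 with a hole at row 6, and the reward values are arbitrary choices here, not requirements.

    // Grid-world dimensions and agent position (start in the bottom-left corner).
    static final int WIDTH = 20, HEIGHT = 10;
    static int x = 0, y = HEIGHT - 1;

    // Perform an action (0=left, 1=right, 2=up, 3=down) and return the new state index.
    static int doAction(int action, Random rand) {
    	if(rand.nextDouble() < 0.1)  // sometimes the action misfires
    		action = rand.nextInt(4);
    	if(action == 0) x = Math.max(0, x - 1);
    	else if(action == 1) x = Math.min(WIDTH - 1, x + 1);
    	else if(action == 2) y = Math.max(0, y - 1);
    	else y = Math.min(HEIGHT - 1, y + 1);
    	return y * WIDTH + x;
    }

    // Reward for landing in state j: the goal sits in the corner opposite the start,
    // and a wall of penalty states runs down column 10 with a hole at row 6.
    static double reward(int j) {
    	int jx = j % WIDTH, jy = j / WIDTH;
    	if(jx == WIDTH - 1 && jy == 0) return 100.0; // goal
    	if(jx == 10 && jy != 6) return -10.0;        // penalty wall
    	return -1.0;                                 // small cost per step
    }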


  4. Print your policy to the console at periodic intervals while it learns. Use "#" to indicate a penalty state, "G" to indicate the goal, "S" to indicate the start state, and the characters "<,>,^,v" to indicate left, right, up, and down for the policy in all other states. (A sketch of such a printing loop follows the example.) Example:
    	> > v v v v v v < < # > > > > > > > > G
    	> v > v v v v v v < # > > > > > > > ^ ^
    	> > > > > v v v v < # > > > > > ^ ^ ^ ^
    	> > > > > v v v v < # > > > > > ^ ^ ^ ^
    	> > > > > > v v v < # > > > > ^ ^ ^ ^ ^
    	> > > > > > > > v v # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > > > > > > > ^ ^ ^ ^ ^ ^ ^
    	> > > > > > > > > > > > > ^ ^ ^ ^ ^ ^ ^
    	> > > > > > > > ^ ^ # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > > ^ ^ # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > > ^ ^ # > > > ^ ^ ^ ^ ^ ^
    	> > > > > > > ^ ^ < # > > > ^ ^ ^ ^ ^ ^
    	> > > > > ^ ^ ^ ^ < # > > > ^ ^ ^ ^ ^ ^
    	> > > > ^ ^ ^ ^ ^ < # > > > ^ ^ ^ ^ ^ ^
    	S > > ^ ^ ^ ^ ^ ^ < # > > > ^ ^ ^ ^ ^ ^
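
    A minimal sketch of such a printing loop, assuming the Q-table layout and the WIDTH/HEIGHT, wall, start, and goal conventions from the sketches above:

    // Print the greedy policy, one character per grid cell.
    static void printPolicy(double[][] q) {
    	char[] glyph = { '<', '>', '^', 'v' }; // actions 0..3
    	for(int y = 0; y < HEIGHT; y++) {
    		StringBuilder row = new StringBuilder();
    		for(int x = 0; x < WIDTH; x++) {
    			if(x == WIDTH - 1 && y == 0) { row.append("G "); continue; }  // goal
    			if(x == 0 && y == HEIGHT - 1) { row.append("S "); continue; } // start
    			if(x == 10 && y != 6) { row.append("# "); continue; }         // penalty wall
    			int s = y * WIDTH + x;
    			int best = 0;
    			for(int a = 1; a < 4; a++)
    				if(q[s][a] > q[s][best])
    					best = a;
    			row.append(glyph[best]).append(' ');
    		}
    		System.out.println(row);
    	}
    	System.out.println();
    }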
    


  5. If you are taking the graduate section of this class: when your program is executed with the optional command-line parameter "nn", use a neural network to implement your Q-table instead of the table data structure used above.

    You can use the neural network from previous assignments. Some fiddling may be necessary with the number of layers and other meta-parameters to get it working well. The neural network may also require many more iterations to converge.

    I recommend using two inputs {x,y} and 4 outputs {q-left,q-right,q-up,q-down}. Here is an example that makes a neural network with the topology 2->16->4:
    Random rand = new Random(1234567);
    NeuralNet nn = new NeuralNet();
    nn.layers.add(new LayerTanh(2, 16));
    nn.layers.add(new LayerTanh(16, 4));
    nn.init(rand);
    
    To refine your neural network, call "trainIncremental(in, out, 0.03)", where "in" is your input vector, "out" is your output vector, and 0.03 is a reasonable learning rate. Remember to train it using values that have been normalized to fall approximately in the range [0,1]. To make a prediction with your neural network, call "forwardProp(in)", and it will return the vector of 4 Q-values for the state you specify.
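
    For example, a grid cell (x, y) might be encoded and queried like this (a sketch that assumes the forwardProp API described above and the WIDTH/HEIGHT constants from the grid-world):

    // Encode the state as two inputs normalized to roughly [0,1], then ask for its 4 q-values.
    double[] in = new double[2];
    in[0] = (double)x / (WIDTH - 1);
    in[1] = (double)y / (HEIGHT - 1);
    double[] qValues = nn.forwardProp(in); // {q-left, q-right, q-up, q-down}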

    If you only want to refine one of the 4 q-values, then use the predicted values for the other 3 q-values when you call "trainIncremental". That way, it will only attempt to change the one q-value for which it is making different predictions. So, let

    	t = r(i,a,j) + γ * max over b in A(j) of Q(j,b)

    Then, instead of doing Q(i,a) = (1-α)*Q(i,a) + α*t, do
    double[] prediction = nn.forwardProp(i); // "i" here is the state encoded as a normalized {x, y} input vector
    double[] copy = Vec.copy(prediction);
    copy[a] = t;                             // overwrite only the q-value for action "a"
    nn.trainIncremental(i, copy, 0.03);      // this line updates the neural net
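
    When using the neural network, the max over Q(j,b) in the target t also comes from the network. A sketch, where encode(x, y) is a hypothetical helper that builds the normalized input vector shown above:

    // Compute the target t for the new state j = (xj, yj), given reward r.
    double[] next = nn.forwardProp(encode(xj, yj));
    double best = Math.max(Math.max(next[0], next[1]), Math.max(next[2], next[3]));
    double t = r + 0.97 * best; // t = r(i,a,j) + gamma * max over b in A(j) of Q(j,b)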
    


  6. Click on the "Project submission" link on the main class page to submit your archive file. If you have any difficulties with the submission server, please e-mail your submission directly to the grader.