GClasses
GClasses::GIncrementalLearnerQAgent Class Reference

Detailed Description

This is an implementation of GQLearner that uses an incremental learner for its Q-table and a SoftMax strategy (usually pick the best action, but sometimes pick a random action) to balance exploration against exploitation. To use this class, you need to supply an incremental learner (see the comment for the constructor for more details) and to implement the rewardFromLastAction method.
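For orientation, a minimal sketch of the intended usage pattern follows: derive from this class, implement the pure virtual reward callback inherited from GQLearner, and construct it with the pieces described below. The MyQAgent name and the measureRewardFromWorld helper are hypothetical placeholders; only the members documented on this page are taken from the API.

    #include <GReinforcement.h>

    using namespace GClasses;

    double measureRewardFromWorld(); // hypothetical, defined by your application

    class MyQAgent : public GIncrementalLearnerQAgent
    {
    public:
        MyQAgent(const GRelation& rel, GIncrementalLearner* pQTable, int actionDims,
                 double* pInitialState, GRand* pRand, GAgentActionIterator* pIter,
                 double softMaxThresh)
        : GIncrementalLearnerQAgent(rel, pQTable, actionDims, pInitialState, pRand, pIter, softMaxThresh)
        {
        }

    protected:
        // Reward earned by the most recent action (see GQLearner::rewardFromLastAction).
        virtual double rewardFromLastAction()
        {
            return measureRewardFromWorld(); // hypothetical, application-specific
        }
    };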

#include <GReinforcement.h>

Inheritance diagram for GClasses::GIncrementalLearnerQAgent:
GClasses::GIncrementalLearnerQAgent → GClasses::GQLearner → GClasses::GPolicyLearner

Public Member Functions

 GIncrementalLearnerQAgent (const GRelation &obsControlRelation, GIncrementalLearner *pQTable, int actionDims, double *pInitialState, GRand *pRand, GAgentActionIterator *pActionIterator, double softMaxThresh)
 pQTable must be an incremental learner. If the relation for pQTable has n attributes, then the first (n-1) attributes refer to the sense (state) and action, and the last attribute refers to the Q-value (the current estimate of the utility of performing that action in that state). For actionDims, see the comment for GPolicyLearner::GPolicyLearner. pInitialState is the initial sense vector. If softMaxThresh is 0, it always picks a random action. If softMaxThresh is 1, it always picks the best action. For values in between, it does something in between. More...
 
virtual ~GIncrementalLearnerQAgent ()
 
virtual double getQValue (const double *pState, const double *pAction)
 See the comment for GQLearner::getQValue. More...
 
virtual void setQValue (const double *pState, const double *pAction, double qValue)
 See the comment for GQLearner::setQValue. More...
 
- Public Member Functions inherited from GClasses::GQLearner
 GQLearner (const GRelation &relation, int actionDims, double *pInitialState, GRand *pRand, GAgentActionIterator *pActionIterator)
 
virtual ~GQLearner ()
 
virtual void refinePolicyAndChooseNextAction (const double *pSenses, double *pOutActions)
 See GPolicyLearner::refinePolicyAndChooseNextAction. More...
 
void setActionCap (int n)
 This specifies a cap on how many actions to sample. (If actions are continuous, you obviously don't want to try them all.) More...
 
void setDiscountFactor (double d)
 Sets the factor for discounting future rewards (often called "gamma"). More...
 
void setLearningRate (double d)
 Sets the learning rate (often called "alpha"). If the state is deterministic and actions have deterministic consequences, this should be 1. If there is any non-determinism, there are three common approaches for picking the learning rate: (1) use a fairly small value (perhaps 0.1); (2) decay it over time (by calling this method before every iteration); (3) remember how many times 'n' each state has already been visited, and set the learning rate to 1/(n+1) before each iteration. The third technique is the best, but is awkward with continuous state spaces. (A decay sketch follows below.) More...
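For context, the learning rate and the discount factor both enter the usual Q-learning update, roughly Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a)). Below is a hedged sketch of approach (2), decaying the rate each iteration; it assumes agent is an already-constructed GQLearner subclass and that the sense and action buffers are sized to match your relation.

    double senses[2];   // sized for a 2-sense example; filled from your environment
    double actions[1];  // sized for a 1-action example
    for(int i = 0; i < 10000; i++)
    {
        agent.setLearningRate(1.0 / (1.0 + i)); // simple decay schedule (illustrative)
        agent.refinePolicyAndChooseNextAction(senses, actions);
        // ...apply 'actions' in the environment, then refresh 'senses'...
    }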
 
- Public Member Functions inherited from GClasses::GPolicyLearner
 GPolicyLearner (const GRelation &relation, int actionDims)
 actionDims specifies how many dimensions are in the action vector. (For example, if your agent has a discrete set of ten possible actions, then actionDims should be 1, because it only takes one value to represent one of ten discrete actions. If your agent has four legs that can move independently to continuous locations relative to the position of the agent in 3D space, then actionDims should be 12, because it takes three values to represent the force or offset vector for each leg.) The number of sense dimensions is the number of attributes in relation minus actionDims. relation specifies the type of each element in the sense and action vectors (whether they are continuous or discrete). The first attributes refer to the senses, and the last actionDims attributes refer to the actions. (A short dimension-bookkeeping sketch appears at the end of this list.) More...
 
 GPolicyLearner (GDomNode *pNode)
 
virtual ~GPolicyLearner ()
 
void onTeleport ()
 If an external force changes the state of pSenses, you should call this method to inform the agent that the change is not a consequence of its most recent action. The agent should refrain from "learning" when refinePolicyAndChooseNextAction is next called. More...
 
void setExplore (bool b)
 If b is false, then the agent will only exploit (and no learning will occur). If b is true (which is the default), then the agent will seek some balance between exploration and exploitation. More...
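To make the bookkeeping in the GPolicyLearner constructor comment concrete, a tiny sketch (the attribute counts are invented for illustration):

    // A relation with 7 attributes: the first 6 describe the senses and the
    // last 1 describes a single discrete action (one of, say, ten choices).
    int actionDims = 1;                         // one value encodes the chosen action
    int relationAttrs = 7;                      // senses first, then the action attribute(s)
    int senseDims = relationAttrs - actionDims; // = 6, as described above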
 

Protected Member Functions

virtual void chooseAction (const double *pSenses, double *pOutActions)
 This method picks the action during training. This method is called by refinePolicyAndChooseNextAction. (If it makes things easier, the agent may actually perform the action here, but it is better practice to wait until refinePolicyAndChooseNextAction returns, because that keeps the "thinking" and "acting" stages separated from each other.) One way to pick the next action is to call getQValue for all possible actions in the current state and pick the one with the highest Q-value. But if you always pick the best action, you will never discover things you don't already know about, so you need to find some balance between exploration and exploitation. One way to do this is to usually pick the best action, but sometimes pick a random action. More...
 
- Protected Member Functions inherited from GClasses::GQLearner
virtual double rewardFromLastAction ()=0
 A reward is obtained when the agent performs a particular action in a particular state. (A penalty is a negative reward. A reward of zero is no reward.) This method returns the reward that was obtained when the last action was performed. If you return UNKNOWN_REAL_VALUE, then the Q-table will not be updated for that action. (A sketch follows.) More...
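A small sketch of the skip behavior described for rewardFromLastAction; the validity check and the reward computation are hypothetical application code:

    virtual double rewardFromLastAction()
    {
        if(!lastSensorReadingIsValid())   // hypothetical condition
            return UNKNOWN_REAL_VALUE;    // no Q-table update for this step
        return computeReward();           // hypothetical application-specific reward
    }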
 
- Protected Member Functions inherited from GClasses::GPolicyLearner
GDomNode * baseDomNode (GDom *pDoc)
 If a child class has a serialize method, it should use this method to serialize the base-class stuff. More...
 

Protected Attributes

GVec m_buf
 
GIncrementalLearner * m_pQTable
 
double m_softMaxThresh
 
- Protected Attributes inherited from GClasses::GQLearner
int m_actionCap
 
double m_discountFactor
 
double m_learningRate
 
double * m_pAction
 
GAgentActionIterator * m_pActionIterator
 
GRand * m_pRand
 
double * m_pSenses
 
- Protected Attributes inherited from GClasses::GPolicyLearner
int m_actionDims
 
bool m_explore
 
GRelation * m_pRelation
 
int m_senseDims
 
bool m_teleported
 

Constructor & Destructor Documentation

GClasses::GIncrementalLearnerQAgent::GIncrementalLearnerQAgent ( const GRelation &  obsControlRelation,
GIncrementalLearner *  pQTable,
int  actionDims,
double *  pInitialState,
GRand *  pRand,
GAgentActionIterator *  pActionIterator,
double  softMaxThresh 
)

pQTable must be an incremental learner. If the relation for pQTable has n attributes, then the first (n-1) attributes refer to the sense (state) and action, and the last attribute refers to the Q-value (the current estimate of the utility of performing that action in that state). For actionDims, see the comment for GPolicyLearner::GPolicyLearner. pInitialState is the initial sense vector. If softMaxThresh is 0, it always picks a random action. If softMaxThresh is 1, it always picks the best action. For values in between, it does something in between.
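As a hedged construction sketch, assume two continuous sense attributes and one action attribute, so obsControlRelation has 3 attributes and the relation used by pQTable has n = 4 attributes, the last one being the Q-value. The makeObsControlRelation, makeQTableLearner, and makeActionIterator factories are hypothetical placeholders, and MyQAgent is the kind of derived class sketched in the detailed description; only the constructor parameters themselves come from this page.

    GRelation* pRel = makeObsControlRelation();          // hypothetical: 2 sense attrs + 1 action attr
    GIncrementalLearner* pQTable = makeQTableLearner();  // hypothetical: its relation has 4 attrs (3 + Q-value)
    GAgentActionIterator* pIter = makeActionIterator();  // hypothetical
    GRand rand(0);                                       // pseudo-random number generator, seeded
    double initialState[2] = { 0.0, 0.0 };               // initial sense vector
    MyQAgent agent(*pRel, pQTable, 1 /*actionDims*/, initialState,
                   &rand, pIter, 0.9 /*softMaxThresh: mostly pick the best action*/);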

virtual GClasses::GIncrementalLearnerQAgent::~GIncrementalLearnerQAgent ( )
virtual

Member Function Documentation

virtual void GClasses::GIncrementalLearnerQAgent::chooseAction ( const double *  pSenses,
double *  pOutActions 
)
protectedvirtual

This method picks the action during training. This method is called by refinePolicyAndChooseNextAction. (If it makes things easier, the agent may actually perform the action here, but it is better practice to wait until refinePolicyAndChooseNextAction returns, because that keeps the "thinking" and "acting" stages separated from each other.) One way to pick the next action is to call getQValue for all possible actions in the current state and pick the one with the highest Q-value. But if you always pick the best action, you will never discover things you don't already know about, so you need to find some balance between exploration and exploitation. One way to do this is to usually pick the best action, but sometimes pick a random action.

Implements GClasses::GQLearner.
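The balance described above can be sketched as follows. This is illustrative pseudologic for a discrete action set, not this class's actual implementation; randomUniform01 is a hypothetical uniform draw in [0,1), and m_softMaxThresh is the member documented below.

    // Illustrative pseudologic only; not this class's actual implementation.
    void sketchOfChooseAction(const double* pSenses, double* pOutActions)
    {
        if(randomUniform01() < m_softMaxThresh) // hypothetical uniform draw in [0,1)
        {
            // Exploit: iterate the candidate actions (via the action iterator), keep the
            // one whose getQValue(pSenses, candidate) is largest, and copy it to pOutActions.
        }
        else
        {
            // Explore: copy a randomly chosen candidate action to pOutActions.
        }
    }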

virtual double GClasses::GIncrementalLearnerQAgent::getQValue ( const double *  pState,
const double *  pAction 
)
virtual

See the comment for GQLearner::getQValue.

Implements GClasses::GQLearner.

virtual void GClasses::GIncrementalLearnerQAgent::setQValue ( const double *  pState,
const double *  pAction,
double  qValue 
)
virtual

See the comment for GQLearner::setQValue.

Implements GClasses::GQLearner.

Member Data Documentation

GVec GClasses::GIncrementalLearnerQAgent::m_buf
protected
GIncrementalLearner* GClasses::GIncrementalLearnerQAgent::m_pQTable
protected
double GClasses::GIncrementalLearnerQAgent::m_softMaxThresh
protected