Q-Learning in Python
November 18th, 2017

There is a nice tutorial that explains how Q-Learning works here. The following Python code implements the basic principles of Q-Learning:
Let’s assume we have a state matrix defining how we can transition between states, and a goal state (5):
import numpy as np

GOAL_STATE = 5
# rows are states, columns are actions (here an action is the state we move to)
STATE_MATRIX = np.array([[np.nan, np.nan, np.nan, np.nan, 0., np.nan],
                         [np.nan, np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, np.nan, np.nan, 0., np.nan, np.nan],
                         [np.nan, 0., 0., np.nan, 0., np.nan],
                         [0., np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, 0., np.nan, np.nan, 0., 100.]])
Q_MATRIX = np.zeros(STATE_MATRIX.shape)
Visually this can be represented as a graph: each state is a node and each non-NaN entry is a directed edge. All transitions here are bidirectional – 0 ↔ 4, 1 ↔ 3, 1 ↔ 5, 2 ↔ 3, 3 ↔ 4 and 4 ↔ 5, plus a self-loop 5 → 5 – and the edges leading into state 5 carry the reward 100.
For example, if we are in state 0 we can go to state 4, defined by the 0. entry. If we are in state 4, we can go directly to state 5, defined by the 100. entry (the reward for reaching the goal). np.nan marks impossible transitions. Finally, we initialize an empty Q-Matrix.
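The update implemented below is the simplified Q-Learning rule: Q(state, action) = R(state, action) + GAMMA * max(Q(next state, all actions)) – the immediate reward R taken from STATE_MATRIX, plus the discounted best value of the state we move to, where GAMMA is the discount factor.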
Now the Q-Learning algorithm itself is simple. The comments in the following code segment will guide you through the steps:
import random

MAX_EPISODES = 1000  # number of training episodes (value assumed)
EPSILON = 0.1        # exploration probability (value assumed)
GAMMA = 0.8          # discount factor

i = 0
while i < MAX_EPISODES:
    # pick a random start state
    state = random.randint(0, 5)
    while state != GOAL_STATE:
        # find possible actions for this state.
        candidate_actions = _find_next(STATE_MATRIX[state])
        # randomly pick one action (an action is the state we move to).
        action = random.choice(candidate_actions)
        # determine what the next states could be after this action...
        next_actions = _find_next(STATE_MATRIX[action])
        values = []
        for item in next_actions:
            values.append(Q_MATRIX[action][item])
        # add some exploration randomness...
        if random.random() < EPSILON:
            # so we do not always select the best...
            max_val = random.choice(values)
        else:
            max_val = max(values)
        # update the Q value matrix: immediate reward plus discounted future value.
        Q_MATRIX[state][action] = STATE_MATRIX[state][action] + \
            GAMMA * max_val
        # next step.
        state = action
    i += 1
We need one little helper routine for this – it returns the indices of the possible next steps for a given state, i.e. all entries that are not np.nan:
def _find_next(items):
    # collect the indices of all valid (non-NaN) transitions in this row.
    res = []
    for i, item in enumerate(items):
        if item >= 0:  # np.nan comparisons are always False
            res.append(i)
    return res
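As a quick check, assuming the STATE_MATRIX defined above:

print(_find_next(STATE_MATRIX[2]))  # -> [3]: from state 2 we can only go to state 3
print(_find_next(STATE_MATRIX[3]))  # -> [1, 2, 4]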
Finally we can normalize the Q-Matrix and print it:
Q_MATRIX = Q_MATRIX / Q_MATRIX.max()
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(Q_MATRIX)
This will output the following Q-Matrix:
[[ 0.000  0.000  0.000  0.000  0.800  0.000]
 [ 0.000  0.000  0.000  0.168  0.000  1.000]
 [ 0.000  0.000  0.000  0.107  0.000  0.000]
 [ 0.000  0.800  0.134  0.000  0.134  0.000]
 [ 0.044  0.000  0.000  0.107  0.000  1.000]
 [ 0.000  0.000  0.000  0.000  0.000  0.000]]
Reading off the matrix, the best path from state 2 to state 5 is: 2 -> 3 (0.107), 3 -> 1 (0.800), 1 -> 5 (1.000).
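This lookup can also be automated by greedily following the highest Q value from any start state – a minimal sketch, assuming the trained Q_MATRIX and GOAL_STATE from above (best_path is a hypothetical helper, not part of the original code):

def best_path(start_state):
    # hypothetical helper: greedily follow the highest Q value to the goal.
    path = [start_state]
    state = start_state
    while state != GOAL_STATE:
        state = int(np.argmax(Q_MATRIX[state]))
        path.append(state)
    return path

print(best_path(2))  # -> [2, 3, 1, 5]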
