# No Huddle Offense

"Individual commitment to a group effort-that is what makes a team work, a company work, a society work, a civilization work."

## Q-Learning in python

November 18th, 2017

There is a nice tutorial that explains how Q-Learning works here. The following Python code implements the basic principles of Q-Learning:

Let’s assume we have a state matrix defining how we can transition between states, and a goal state (5):

```python
import numpy as np

GOAL_STATE = 5
# rows are states, columns are actions
STATE_MATRIX = np.array([[np.nan, np.nan, np.nan, np.nan, 0., np.nan],
                         [np.nan, np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, np.nan, np.nan, 0., np.nan, np.nan],
                         [np.nan, 0., 0., np.nan, 0., np.nan],
                         [0., np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, 0., np.nan, np.nan, 0., 100.]])
Q_MATRIX = np.zeros(STATE_MATRIX.shape)
```

Visually, this can be represented as a graph in which each state is a node and each non-NaN entry of the matrix is an edge.

For example, if we are in state 0, we can go to state 4, defined by the 0. entry. If we are in state 4, we can go directly to the goal state 5, defined by the 100. entry (the reward). The np.nan entries define impossible transitions. Finally, we initialize an empty Q-Matrix.
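To make this encoding concrete, the allowed transitions for a state can be read directly off the matrix. A minimal sketch (the allowed_transitions helper is illustrative and not part of the original code):

```python
import numpy as np

STATE_MATRIX = np.array([[np.nan, np.nan, np.nan, np.nan, 0., np.nan],
                         [np.nan, np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, np.nan, np.nan, 0., np.nan, np.nan],
                         [np.nan, 0., 0., np.nan, 0., np.nan],
                         [0., np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, 0., np.nan, np.nan, 0., 100.]])

def allowed_transitions(state):
    # all columns of the row which are not np.nan are reachable states.
    return np.where(~np.isnan(STATE_MATRIX[state]))[0].tolist()

print(allowed_transitions(0))  # [4]
print(allowed_transitions(4))  # [0, 3, 5]
```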

Now the Q-Learning algorithm is simple. The comments in the following code segment will guide you through the steps:

```python
import random

MAX_EPISODES = 1000  # number of training episodes (value assumed)
EPSILON = 0.8        # exploration rate (value assumed)
GAMMA = 0.8          # discount factor (value assumed)

i = 0
while i < MAX_EPISODES:
    # pick a random starting state.
    state = random.randint(0, 5)
    while state != GOAL_STATE:
        # find possible actions for this state.
        candidate_actions = _find_next(STATE_MATRIX[state])

        # randomly pick one action.
        action = random.choice(candidate_actions)

        # determine what the next states could be for this action...
        next_actions = _find_next(STATE_MATRIX[action])
        values = []
        for item in next_actions:
            values.append(Q_MATRIX[action][item])

        if random.random() < EPSILON:
            # so we do not always select the best...
            max_val = random.choice(values)
        else:
            max_val = max(values)

        # update the Q value matrix...
        Q_MATRIX[state][action] = STATE_MATRIX[state][action] + \
            GAMMA * max_val

        # next step.
        state = action
    i += 1
```
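The update in the inner loop is a simplified form of the standard Q-learning rule (with a learning rate of 1), where the discount factor is the constant multiplying max_val in the code:

```latex
Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a')
```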

We need one little helper routine for this – it determines the possible next steps from a given state:

```python
def _find_next(items):
    # return the indices of all entries which are not np.nan.
    res = []
    i = 0
    for item in items:
        if item >= 0:
            res.append(i)
        i += 1
    return res
```

Finally we can output the results:

```python
# normalize the Q-Matrix so the values are easier to compare.
Q_MATRIX = Q_MATRIX / Q_MATRIX.max()

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(Q_MATRIX)
```

This will output the following Q-Matrix:

```
[[ 0.000  0.000  0.000  0.000  0.800  0.000]
 [ 0.000  0.000  0.000  0.168  0.000  1.000]
 [ 0.000  0.000  0.000  0.107  0.000  0.000]
 [ 0.000  0.800  0.134  0.000  0.134  0.000]
 [ 0.044  0.000  0.000  0.107  0.000  1.000]
 [ 0.000  0.000  0.000  0.000  0.000  0.000]]
```

From this matrix we can read off, for example, the best path from state 2 to state 5: 2 -> 3 (0.107), 3 -> 1 (0.800), 1 -> 5 (1.000).
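Reading the path off by hand generalizes to a small greedy walk over the Q-Matrix. A minimal sketch (the extract_path helper is illustrative, with the matrix values hard-coded from the output above):

```python
import numpy as np

# the normalized Q-Matrix from the output above.
Q_MATRIX = np.array([[0.000, 0.000, 0.000, 0.000, 0.800, 0.000],
                     [0.000, 0.000, 0.000, 0.168, 0.000, 1.000],
                     [0.000, 0.000, 0.000, 0.107, 0.000, 0.000],
                     [0.000, 0.800, 0.134, 0.000, 0.134, 0.000],
                     [0.044, 0.000, 0.000, 0.107, 0.000, 1.000],
                     [0.000, 0.000, 0.000, 0.000, 0.000, 0.000]])

def extract_path(state, goal_state):
    # greedily follow the highest Q value until the goal is reached.
    path = [state]
    while state != goal_state:
        state = int(np.argmax(Q_MATRIX[state]))
        path.append(state)
    return path

print(extract_path(2, 5))  # [2, 3, 1, 5]
```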