Q-Learning in Python

November 18th, 2017

There is a nice tutorial that explains how Q-Learning works here. The following Python code implements the basic principles of Q-Learning:

Let’s assume we have a state matrix defining how we can transition between states, and a goal state (5):

import random

import numpy as np

GOAL_STATE = 5
# rows are states, columns are actions (the possible next states);
# the entries are rewards, np.nan marks an impossible transition.
STATE_MATRIX = np.array([[np.nan, np.nan, np.nan, np.nan, 0., np.nan],
                         [np.nan, np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, np.nan, np.nan, 0., np.nan, np.nan],
                         [np.nan, 0., 0., np.nan, 0., np.nan],
                         [0., np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, 0., np.nan, np.nan, 0., 100.]])
Q_MATRIX = np.zeros(STATE_MATRIX.shape)

Visually this can be represented as follows:

(Figure: state transition graph of the six states, with reward 100 on the edges leading into the goal state 5.)

For example, if we are in state 0 we can go to state 4, as indicated by the 0. entry. From state 4 we can go directly to the goal state 5, as indicated by the 100. entry. np.nan marks impossible transitions. Finally, we initialize an empty Q-Matrix.
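To double-check how the matrix is read, we can print the allowed transitions per state (a small illustrative snippet, not part of the original post, assuming STATE_MATRIX from above):

# list the transitions encoded in STATE_MATRIX
# (rows are states, columns are the possible next states).
for state, row in enumerate(STATE_MATRIX):
    targets = np.where(~np.isnan(row))[0]
    print('state %d -> states %s' % (state, targets.tolist()))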

Now the Q-Learning algorithm itself is simple. The comments in the following code segment will guide you through the steps:

MAX_EPISODES = 1000  # number of training episodes (assumed value, not given in the post)
EPSILON = 0.8        # exploration factor (assumed value); also reused as discount below

i = 0
while i < MAX_EPISODES:
    # pick a random start state.
    state = random.randint(0, 5)
    while state != GOAL_STATE:
        # find the possible actions for this state.
        candidate_actions = _find_next(STATE_MATRIX[state])

        # randomly pick one action.
        action = random.choice(candidate_actions)

        # the chosen action is also the next state - collect the Q values
        # of the actions possible from there...
        next_actions = _find_next(STATE_MATRIX[action])
        values = []
        for item in next_actions:
            values.append(Q_MATRIX[action][item])

        # add some exploration randomness...
        if random.random() < EPSILON:
            # ...so we do not always select the best value.
            max_val = random.choice(values)
        else:
            max_val = max(values)

        # update the Q value matrix; note that EPSILON also plays the
        # role of the discount factor here.
        Q_MATRIX[state][action] = STATE_MATRIX[state][action] + \
                                  EPSILON * max_val

        # move on to the next state.
        state = action
    i += 1
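The assignment near the end of the loop is the simplified form of the Q-Learning update, Q(state, action) = R(state, action) + discount * max(Q(next state, all actions)), where R is the reward stored in STATE_MATRIX; as noted in the comment, EPSILON doubles as the discount factor in this snippet.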

We need one little helper routine for this – it determines the possible next steps we can take from a given state:

def _find_next(items):
    # return the indices of all entries that are not np.nan,
    # i.e. the possible transitions encoded in one row of the matrix.
    res = []
    for i, item in enumerate(items):
        if item >= 0:
            res.append(i)
    return res
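For example, calling it on the first row of the state matrix returns the only state reachable from state 0:

print(_find_next(STATE_MATRIX[0]))  # -> [4]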

Finally we can output the results:

# normalize the Q-Matrix so the largest entry becomes 1.
Q_MATRIX = Q_MATRIX / Q_MATRIX.max()

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(Q_MATRIX)

This will output the following Q-Matrix:

[[ 0.000  0.000  0.000  0.000  0.800  0.000]
 [ 0.000  0.000  0.000  0.168  0.000  1.000]
 [ 0.000  0.000  0.000  0.107  0.000  0.000]
 [ 0.000  0.800  0.134  0.000  0.134  0.000]
 [ 0.044  0.000  0.000  0.107  0.000  1.000]
 [ 0.000  0.000  0.000  0.000  0.000  0.000]]

From this we can read, for example, that the best path from state 2 to state 5 is: 2 -> 3 (0.107), 3 -> 1 (0.800), 1 -> 5 (1.000).
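To read such a path off the Q-Matrix programmatically, we can greedily follow the highest Q value from each state until the goal is reached (a small sketch, not from the original post, assuming the trained Q_MATRIX from above has converged enough that this does not loop):

def best_path(start):
    # follow the action with the highest Q value until the goal state is reached.
    path = [start]
    state = start
    while state != GOAL_STATE:
        state = int(np.argmax(Q_MATRIX[state]))
        path.append(state)
    return path

print(best_path(2))  # -> [2, 3, 1, 5]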
