# Reinforcement learning from the ground up | part 1: tic-tac-toe.

posted on 2019-11-29T12:12:07Z · last modified on 2020-06-01T11:36:56Z · view page on GitHubAs a first example to reinforcement learning, we'll make our computer learn by itself how to play tic-tac-toe. As one of the most simple 2 player games, tic-tac-toe is ideal to get started with reinforcement learning, while still being more interesting that learning to play a single player game.

**The AI created in this series can now be challenged here!**

## The Game¶

First we'll need to create an environment to play te game. We'll represent this environment by the `TicTacToe`

class.

A `TicTacToe`

object stores the *2D state* of the game, which is represented as a 2D numpy array with 9 elements, such that the positions of the playing board are labeled as follows. Each of those labels represents a possible *action* an player can take.

```
0 1 2
3 4 5
6 7 8
```

Furthermore, each position on the board can be in 3 differents states. First there is the empty state (a position in the board where nobody has placed a mark yet) denoted by a `0`

. Then there is the `O`

-state: the state where `player1`

has placed a mark denoted by a `1`

. Finally there is the `X`

-state, denoted by a `2`

: the `state`

where `player2`

has placed a mark.

An example state of the board at a certain time in the game can for example look like this:

```
. X O 0 2 1
. O . --> 0 1 0
X O X 2 1 2
```

In addition to these 9 numbers, the current player's *turn* will also be kept track of. The turn index can thus be either `1`

or `2`

, or `0`

when the game is over. The *turn* together with the *2D state* make up the full *state* of the game.

Apart from the state of the game, the `TicTacToe`

object will also keep a reference to the two players playing the game, so it can ask for an action from either player whenever it's the specific `player`

's turn. A `player`

is asked to choose an `action`

(move) by its own `get_action`

method.

After the `action`

is asked from a `player`

, the `action`

is validated and executed by the `play_turn`

method of the `TicTacToe`

instance.
If the action is invalid (for example when choosing a square that is already taken), the player loses. If the `action`

results in a line of three equal marks, the current `player`

wins. This would result in a `+1`

reward if `player1`

wins and a `-1`

reward if `player2`

wins (they have opposite goals, so they receive opposite rewards; more on the reward system later). In all other cases no rewards will be given as no `player`

has won (yet).

Finally, the full `state`

, `action`

and resulting `next_state`

and `reward`

are given to the player's `learn`

method, such that the player can update it's policy according to the reward received, which will make it learn from its mistakes.

Finally, it is useful to have a way to visualize the game each turn. The game `state`

can be visualized by setting the `visualize`

flag to `True`

.

```
class TicTacToe:
""" Tic-Tac-Toe Game """
def __init__(self, player1, player2):
""" The Tic-Tac-Toe game takes two players and pitches them against each other. """
# pitch players against each other
self.players = {1: player1, 2: player2}
# reward for each outcome of the game (tie, player1 wins, player2 wins)
self._reward = {0: 0, 1: 1, 2: -1}
def play(self, visualize=False):
""" play a full game """
turn = 1
state2d = np.zeros((3, 3), dtype=np.int64)
state = (state2d, turn) # full state of the game
for i in range(9):
current_player = self.players[turn]
action = current_player.get_action(state)
next_state, reward = self.play_turn(state, action)
current_player.learn(state, action, next_state, reward)
if visualize:
self.visualize_state(next_state, turn)
(state2d, turn) = state = next_state
if turn == 0:
break
def play_turn(self, state, action):
""" execute a specific move chosen by the current player and
check if it's a winning/losing move. """
# retrieve states
state2d, turn = state
next_state2d = state2d.copy()
next_turn = turn % 2 + 1
# transform action in two indices
ax, ay = action // 3, action % 3
# check if board is already occupied at location
if state2d[ax, ay] != 0: # invalid move
next_state2d.fill(0)
next_state = (next_state2d, 0) # next_turn == 0 -> game over
return next_state, self._reward[next_turn] # next player wins
# apply action
next_state2d[ax, ay] = turn
# check if the action resulted in a winner
mask = next_state2d == turn
if (
(mask[0, 0] and mask[1, 1] and mask[2, 2])
or (mask[0, 2] and mask[1, 1] and mask[2, 0])
or (mask[0, 0] and mask[0, 1] and mask[0, 2])
or (mask[1, 0] and mask[1, 1] and mask[1, 2])
or (mask[2, 0] and mask[2, 1] and mask[2, 2])
or (mask[0, 0] and mask[1, 0] and mask[2, 0])
or (mask[0, 1] and mask[1, 1] and mask[2, 1])
or (mask[0, 2] and mask[1, 2] and mask[2, 2])
):
next_state = (next_state2d, 0) # next_turn == 0 -> game over
return next_state, self._reward[turn] # current player wins
# if the playing board is full, but no winner found: tie
if (next_state2d != 0).all(): # final tie.
next_state = (next_state2d, 0) # next_turn == 0 -> game over
return next_state, self._reward[0] # no winner
# if no move has resulted in a winner: next player's turn.
next_state = (next_state2d, next_turn)
return next_state, self._reward[0] # no winner yet
@staticmethod
def visualize_state(next_state, turn):
""" show the resulting game state after a player's turn """
next_state2d, next_turn = next_state
print(f"player {turn}'s turn:")
if (next_state2d == 0).all() and turn == 0:
print("[invalid state]\n\n")
else:
print(
str(next_state2d)
.replace("[[", "")
.replace(" [", "")
.replace("]]", "")
.replace("]", "")
.replace("0", ".")
.replace("1", "O")
.replace("2", "X")
+ "\n\n"
)
```

## Random Player¶

Let's first define the most stupid player of them all: a random player. The following player will always return a random index corresponding to one of the empty positions in the board:

```
class RandomPlayer:
""" The random player will return a random action from the set of allowed actions"""
def get_action(self, state):
state2d, turn = state
possible_actions = np.where(state2d.ravel() == 0)[0]
action = np.random.choice(possible_actions)
return action
def learn(self, state, action, next_state, reward):
pass # a random player does not learn from its mistakes
```

## Let's play a game!¶

Let's pitch two of those random players against each other!

```
player1 = RandomPlayer()
player2 = RandomPlayer()
game = TicTacToe(player1, player2)
game.play(visualize=True)
```

obviously, this is not very interesting. We need a smarter kind of player:

## A smarter player¶

But how do we define such a smarter `player`

? We could probably just code a perfect AI ourselves by coding out some rules that the `player`

has to take in each specific position of the board. And for tic-tac-toe, this would probably not be very difficult. However, we'd like a more general approach: enter `Q`

learning.

In general, `Q`

learning works af follows: An `Agent`

plays the game according to a certain policy `π`

. This policy `π`

is such that it chooses the `action`

with the maximal `Q`

value associated with it for a specific `state`

of the game. This `Q`

value can be found in a lookup table: the `Q`

table, where a different `Q`

value is associated with each `state`

of the game and the possible `action`

s one can take for this specific `state`

. Mathematically, one can express this as follows:
\begin{align*}
\texttt{action} &= \pi(\texttt{state}) = \text{argmax}_{action} Q(\texttt{state},~\texttt{action})
\end{align*}
Intuitively, each `Q`

value can thus vaguely be linked to the probability of a certain `player`

winning the game after playing a certain `action`

when the game is in a certain `state`

.

However, we will pitch two Agents against each other. If we assume a higher `Q`

value to be associated to a higher chance of `player1`

winning, the policy for `player2`

will have to be the opposite of the policy of `player1`

. `player2`

's policy will thus try to choose the `action`

with the minimal `Q`

value associated to it. This way of pitching two players against each other while one tries to optimize the objective function and the other tries to minimize it is also known as the `minimax`

algorithm.

However, the above does not fully describe how the `Q`

table is formed. We need to define a way to update the `Q`

values after a certain decision has been made. For this, the Bellman equation can be used, which basically describes that the Q-value at the current state of the game should be equal to the reward received at this stage of the game plus a discounted Q-value at the next state of the game.
\begin{align*}
Q(\texttt{state},~\texttt{action}) = \texttt{reward} + \texttt{discount_factor} \cdot Q(\texttt{next-state},~\texttt{next-action})
\end{align*}
Which states that for a *perfect* `Q`

table, the expected `Q`

value of the current `state`

will be equal the the `Q`

value of the next `state`

and next `action`

discounted by a `discount_factor`

plus a possible reward obtained by performing chosen the `action`

. However, we start with an imperfect `Q`

table, which means that at each timestep, there will be an error `δ`

between the left hand side and the right hand side of the Bellman equation:
\begin{align*}
\delta &= Q(\texttt{state},~\texttt{action}) - (\texttt{reward} + \texttt{discount_factor} \cdot Q(\texttt{next-state},~\texttt{next-action}))
\end{align*}
`player1`

has the policy to *maximize* `Q`

, while `player2`

will have the policy to *minimize* `Q`

and vice versa, which yields for the error on the `Q`

values for each of the players:
\begin{align*}
\texttt{player1} &: \delta = Q(\texttt{state},~\texttt{action}) - (\texttt{reward} + \texttt{discount_factor} \cdot \text{min}_{\texttt{action}} Q(\texttt{next-state},~\texttt{action})) \\
\texttt{player2} &: \delta = Q(\texttt{state},~\texttt{action}) - (\texttt{reward} + \texttt{discount_factor} \cdot \text{max}_{\texttt{action}} Q(\texttt{next-state},~\texttt{action}))
\end{align*}
Note that the error on the Q-value for player 1 uses the policy of player 2 [`min`

]. This is to be expected, as this policy appears in the discounted Q-value of the next state, which is determined by the other player.

This error can then be used in conjunction with a `learning_rate`

to update the `Q`

values iteratively at every step of the game.

Let's have a look how this can be implemented:

### Q table¶

As discussed above, the Q table is just a table (numpy array) that returns a Q value for each state of the game and for each action taken. A tic-tac-toe game consists of 9 squares, each of which can be in 3 different states, totalling 3^9 different states (in reality, there are less allowed states possible; 3^9 is an upper bound). We can convert each state to a unique integer to index the qtable. Indexing (calling in this case) the QTable with a state results in a row of the QTable containing 9 qvalues corresponding to the 9 possible actions one can take for each state. Yes, there are 9 possible actions possible at **each** state, even if there are already a few squares taken: the game is programmed in such a way that an invalid move will make you lose, obviously this design choice will make training more difficult but also more general. Note that we only use the *2D state* of the game as key for the Q-table as this is enough to create a unique key to index the underlying table.

```
class QTable:
def __init__(self):
self._qtable = np.random.randn(3 ** 9, 9)
self._state_key_vector = 3 ** np.arange(9)
def __call__(self, state2d):
return self._qtable[int(np.sum(state2d.ravel() * self._state_key_vector))]
def save(self, filename):
if not filename.endswith(".csv"):
filename = filename + ".csv"
np.savetxt(filename, self._qtable, delimiter=",")
def load(self, filename):
if not filename.endswith(".csv"):
filename = filename + ".csv"
self._qtable = np.loadtxt(filename, delimiter=",")
return self
```

### Agent¶

A reinforcement agent plays the game by playing a random action with with probability Ɛ. Alternatively, the Agent will consult its Q-table with probability (1-Ɛ) to play the action with the optimal Q-value associated to it. This is the *explore*-*exploit* principle. During training of the Agent the Ɛ will be quite high, as one want the agent to learn each corner case of the game. During evaluation, the Ɛ will be set to zero, forcing the agent to play its best game.

```
class Agent:
""" The Agent plays the game by playing a move corresponding to the optimal Q-value """
def __init__(
self, qtable=None, epsilon=0.2, learning_rate=0.3, discount_factor=0.9
):
""" A new Agent can be given some optional parameters to tune how fast it learns
Args:
qtable: QTable=None: the initial Q-table to start with.
epsilon: float=0.2: the chance the Agent will explore a random move
(in stead of choosing the optimal choice according to the Q table)
learning_rate: float=0.3: the rate at which the Agent learns from its games
discount_factor: float=0.9: the rate at which the final reward gets discounted
for when rating previous moves.
"""
self.qtable = QTable() if qtable is None else qtable
# the speed at which the Qvalues get updated
self.learning_rate = learning_rate
# the discount factor of future rewards
self.discount_factor = discount_factor
# the chance of executing a random action
self.epsilon = epsilon
def random_action(self):
""" get a random action """
return int(np.random.randint(0, 9, 1))
def best_action(self, state):
""" get the best values according to the current Q table """
state2d, turn = state
argminmax = {1: np.argmax, 2: np.argmin}[turn]
qvalues = self.qtable(state2d)
return argminmax(qvalues)
def get_action(self, state):
""" perform an action according to the state on the game board """
if np.random.rand() < self.epsilon:
# Choose action (random with chance of epsilon; best action otherwise.)
action = self.random_action()
else:
# get qvalues for current state of the game
action = self.best_action(state)
return action
def learn(self, state, action, next_state, reward):
""" learn from the current state and action taken. """
state2d, turn = state
next_state2d, next_turn = next_state
if next_turn == 0: # game finished
expected_qvalue_for_action = reward
else:
next_qvalues = self.qtable(next_state2d)
minmax = {1: max, 2: min}[next_turn]
expected_qvalue_for_action = reward + (
self.discount_factor * minmax(next_qvalues)
)
# update qvalues:
qvalues = self.qtable(state2d)
qvalues[action] += self.learning_rate * (
expected_qvalue_for_action - qvalues[action]
)
```

## Train¶

Now the `Agent`

is defined, it's time to train it! Training is as simple as just playing a 'bunch' of games (about 10,000,000). The `Agent`

will learn automatically from its `learn`

method.

```
# initialize
N = 10_000_000 # Number of training episodes
qtable = QTable() # share Q for faster training
player = Agent(qtable=qtable, learning_rate=0.1, epsilon=0.6)
game = TicTacToe(player, player) # let player play against itself
for _ in tqdm.trange(N):
game.play()
qtable.save("qtable.csv")
```

## Let's play another game!¶

Let the trained agent play against itself one final time.

```
player = Agent(epsilon=0.0) # epsilon=0 -> no random guesses
player.qtable.load("qtable.csv")
game = TicTacToe(player, player) # let player play against itself
# play
game.play(visualize=True)
```

We see that the game ends in a tie. This is to be expected, as two perfect tic-tac-toe players playing against each other will always draw.

The Q-table for the trained Agent can be downloaded in csv format here (availability not guarenteed).

If you like this post, consider leaving a comment or star it on GitHub.