
Reinforcement learning from the ground up | part 1: tic-tac-toe.

posted on 2019-11-29T12:12:07Z · last modified on 2020-06-01T11:36:56Z · view page on GitHub
tags: machine learning · python · rl · reinforcement learning

As a first example of reinforcement learning, we'll make our computer learn by itself how to play tic-tac-toe. As one of the simplest two-player games, tic-tac-toe is ideal for getting started with reinforcement learning, while still being more interesting than learning to play a single-player game.

The AI created in this series can now be challenged here!

Index

  • part 1 (this post): We create the game environment and a simple unbeatable AI based on traditional Q-learning 🤖.
  • part 2: We modify our AI to utilize a neural network: deep Q-learning 👾.
  • part 3: Have some fun and play against the Q-agent 🤓.

The Game

First we'll need to create an environment to play the game. We'll represent this environment by the TicTacToe class.

A TicTacToe object stores the 2D state of the game, which is represented as a 3×3 numpy array (9 elements), with the positions of the playing board labeled as follows. Each of those labels represents a possible action a player can take.

0   1   2
3   4   5
6   7   8

Furthermore, each position on the board can be in 3 different states. First there is the empty state (a position on the board where nobody has placed a mark yet), denoted by a 0. Then there is the O-state, denoted by a 1: the state where player1 has placed a mark. Finally there is the X-state, denoted by a 2: the state where player2 has placed a mark.

An example state of the board at a certain point in the game could look like this:

.   X   O       0   2   1
.   O   .  -->  0   1   0 
X   O   X       2   1   2

In addition to these 9 numbers, we also keep track of whose turn it is. The turn index can thus be either 1 or 2, or 0 when the game is over. The turn together with the 2D state makes up the full state of the game.
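
For example, the board shown above together with the turn index would make up the following full state; this is the same (board, turn) representation the TicTacToe class below passes to the players:

import numpy as np

state2d = np.array([[0, 2, 1],
                    [0, 1, 0],
                    [2, 1, 2]])   # the example board from above
turn = 1                          # player1 (O) is next to move in this position
state = (state2d, turn)           # the full game state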

Apart from the state of the game, the TicTacToe object also keeps a reference to the two players playing the game, so it can ask for an action from either player whenever it is that player's turn. A player is asked to choose an action (move) via its get_action method.

After an action is obtained from a player, it is validated and executed by the play_turn method of the TicTacToe instance. If the action is invalid (for example, choosing a square that is already taken), the player loses. If the action results in a line of three equal marks, the current player wins. A win results in a +1 reward if player1 wins and a -1 reward if player2 wins (they have opposite goals, so they receive opposite rewards; more on the reward system later). In all other cases no reward is given, as no player has won (yet).

After each turn, the full state, action, and resulting next_state and reward are given to the player's learn method, so that the player can update its policy according to the reward received and learn from its mistakes.

Finally, it is useful to have a way to visualize the game each turn. The game state can be visualized by setting the visualize flag to True.

In [2]:
import numpy as np  # the board state is stored as a numpy array

class TicTacToe:
    """ Tic-Tac-Toe Game """

    def __init__(self, player1, player2):
        """ The Tic-Tac-Toe game takes two players and pitches them against each other. """
        # pitch players against each other
        self.players = {1: player1, 2: player2}

        # reward for each outcome of the game (tie, player1 wins, player2 wins)
        self._reward = {0: 0, 1: 1, 2: -1}

    def play(self, visualize=False):
        """ play a full game """
        turn = 1
        state2d = np.zeros((3, 3), dtype=np.int64)
        state = (state2d, turn)  # full state of the game
        for i in range(9):
            current_player = self.players[turn]
            action = current_player.get_action(state)
            next_state, reward = self.play_turn(state, action)
            current_player.learn(state, action, next_state, reward)

            if visualize:
                self.visualize_state(next_state, turn)

            (state2d, turn) = state = next_state

            if turn == 0:
                break

    def play_turn(self, state, action):
        """ execute a specific move chosen by the current player and 
        check if it's a winning/losing move. """
        # retrieve states
        state2d, turn = state
        next_state2d = state2d.copy()
        next_turn = turn % 2 + 1

        # transform action in two indices
        ax, ay = action // 3, action % 3

        # check if board is already occupied at location
        if state2d[ax, ay] != 0:  # invalid move
            next_state2d.fill(0)
            next_state = (next_state2d, 0)  # next_turn == 0 -> game over
            return next_state, self._reward[next_turn]  # next player wins

        # apply action
        next_state2d[ax, ay] = turn

        # check if the action resulted in a winner
        mask = next_state2d == turn
        if (
            (mask[0, 0] and mask[1, 1] and mask[2, 2])
            or (mask[0, 2] and mask[1, 1] and mask[2, 0])
            or (mask[0, 0] and mask[0, 1] and mask[0, 2])
            or (mask[1, 0] and mask[1, 1] and mask[1, 2])
            or (mask[2, 0] and mask[2, 1] and mask[2, 2])
            or (mask[0, 0] and mask[1, 0] and mask[2, 0])
            or (mask[0, 1] and mask[1, 1] and mask[2, 1])
            or (mask[0, 2] and mask[1, 2] and mask[2, 2])
        ):
            next_state = (next_state2d, 0)  # next_turn == 0 -> game over
            return next_state, self._reward[turn]  # current player wins

        # if the playing board is full, but no winner found: tie
        if (next_state2d != 0).all():  # final tie.
            next_state = (next_state2d, 0)  # next_turn == 0 -> game over
            return next_state, self._reward[0]  # no winner

        # if no move has resulted in a winner: next player's turn.
        next_state = (next_state2d, next_turn)
        return next_state, self._reward[0]  # no winner yet

    @staticmethod
    def visualize_state(next_state, turn):
        """ show the resulting game state after a player's turn """
        next_state2d, next_turn = next_state
        print(f"player {turn}'s turn:")
        if (next_state2d == 0).all() and next_turn == 0:  # board was reset after an invalid move
            print("[invalid state]\n\n")
        else:
            print(
                str(next_state2d)
                .replace("[[", "")
                .replace(" [", "")
                .replace("]]", "")
                .replace("]", "")
                .replace("0", ".")
                .replace("1", "O")
                .replace("2", "X")
                + "\n\n"
            )

Random Player

Let's first define the most stupid player of them all: a random player. The following player will always return a random index corresponding to one of the empty positions on the board:

In [3]:
class RandomPlayer:
    """ The random player will return a random action from the set of allowed actions"""

    def get_action(self, state):
        state2d, turn = state
        possible_actions = np.where(state2d.ravel() == 0)[0]
        action = np.random.choice(possible_actions)
        return action

    def learn(self, state, action, next_state, reward):
        pass  # a random player does not learn from its mistakes

Let's play a game!

Let's pitch two of those random players against each other!

In [4]:
player1 = RandomPlayer()
player2 = RandomPlayer()
game = TicTacToe(player1, player2)
game.play(visualize=True)
player 1's turn:
. . .
. . O
. . .


player 2's turn:
. . .
X . O
. . .


player 1's turn:
. . .
X . O
O . .


player 2's turn:
X . .
X . O
O . .


player 1's turn:
X . O
X . O
O . .


player 2's turn:
X . O
X . O
O . X


player 1's turn:
X . O
X O O
O . X


Obviously, this is not very interesting. We need a smarter kind of player:

A smarter player

But how do we define such a smarter player? We could probably code a perfect AI ourselves by writing out a set of rules the player has to follow in each specific position of the board. And for tic-tac-toe, this would probably not be very difficult. However, we'd like a more general approach: enter Q-learning.

In general, Q-learning works as follows: an Agent plays the game according to a certain policy π. This policy π chooses the action with the maximal Q value associated with it for a specific state of the game. This Q value can be found in a lookup table, the Q table, where a Q value is associated with each state of the game and each possible action one can take in that state. Mathematically, one can express this as follows: \begin{align*} \texttt{action} &= \pi(\texttt{state}) = \text{argmax}_{action} Q(\texttt{state},~\texttt{action}) \end{align*} Intuitively, each Q value can thus vaguely be linked to the probability of a certain player winning the game after playing a certain action when the game is in a certain state.

However, we will pitch two Agents against each other. If we assume a higher Q value to be associated with a higher chance of player1 winning, the policy of player2 will have to be the opposite of the policy of player1. player2's policy will thus choose the action with the minimal Q value associated with it. This way of pitching two players against each other, where one tries to maximize the objective function while the other tries to minimize it, is also known as the minimax algorithm.

\begin{align*} \texttt{player1} &: \texttt{action} = \pi(\texttt{state}) = \text{argmax}_{action} Q(\texttt{state},~\texttt{action})\\ \texttt{player2} &: \texttt{action} = \pi(\texttt{state}) = \text{argmin}_{action} Q(\texttt{state},~\texttt{action}) \end{align*}

However, the above does not fully describe how the Q table is formed. We need to define a way to update the Q values after a certain decision has been made. For this, the Bellman equation can be used, which states that the Q value at the current state of the game should be equal to the reward received at this stage of the game plus a discounted Q value at the next state of the game:

\begin{align*} Q(\texttt{state},~\texttt{action}) = \texttt{reward} + \texttt{discount\_factor} \cdot Q(\texttt{next-state},~\texttt{next-action}) \end{align*}

That is, for a perfect Q table, the Q value of the current state and action equals the reward obtained by performing the chosen action plus the Q value of the next state and next action, discounted by a discount_factor. However, we start with an imperfect Q table, which means that at each timestep there will be an error δ between the left hand side and the right hand side of the Bellman equation:

\begin{align*} \delta &= Q(\texttt{state},~\texttt{action}) - (\texttt{reward} + \texttt{discount\_factor} \cdot Q(\texttt{next-state},~\texttt{next-action})) \end{align*}

player1 has the policy to maximize Q, while player2 has the policy to minimize Q, which yields the following error on the Q values for each of the players:

\begin{align*} \texttt{player1} &: \delta = Q(\texttt{state},~\texttt{action}) - (\texttt{reward} + \texttt{discount\_factor} \cdot \text{min}_{\texttt{action}} Q(\texttt{next-state},~\texttt{action})) \\ \texttt{player2} &: \delta = Q(\texttt{state},~\texttt{action}) - (\texttt{reward} + \texttt{discount\_factor} \cdot \text{max}_{\texttt{action}} Q(\texttt{next-state},~\texttt{action})) \end{align*}

Note that the error on the Q value for player1 uses the policy of player2 [min]. This is to be expected, as this policy appears in the discounted Q value of the next state, which is determined by the other player.

This error can then be used in conjunction with a learning_rate to update the Q values iteratively at every step of the game.
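
Concretely, the Q value is then updated by moving it against this error δ with a step size set by the learning_rate; this is exactly the update performed in the Agent's learn method below:

\begin{align*} Q(\texttt{state},~\texttt{action}) \leftarrow Q(\texttt{state},~\texttt{action}) - \texttt{learning\_rate} \cdot \delta \end{align*}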

Let's have a look at how this can be implemented:

Q table

As discussed above, the Q table is just a table (numpy array) that returns a Q value for each state of the game and for each action taken. A tic-tac-toe game consists of 9 squares, each of which can be in 3 different states, totalling 3^9 different states (in reality, fewer states are actually reachable; 3^9 is an upper bound). We can convert each state to a unique integer to index the qtable. Indexing (calling, in this case) the QTable with a state returns a row of the QTable containing 9 qvalues, corresponding to the 9 possible actions one can take in that state. Yes, there are 9 possible actions at each state, even if a few squares are already taken: the game is programmed in such a way that an invalid move makes you lose. Obviously this design choice makes training more difficult, but it also keeps it more general. Note that we only use the 2D state of the game as the key for the Q-table, as this is enough to create a unique index into the underlying table.
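
As a quick illustration, the example board from earlier maps to its row index by reading the 9 cell values as the digits of a base-3 number; this is the same computation the QTable below performs with its _state_key_vector:

state2d = np.array([[0, 2, 1],
                    [0, 1, 0],
                    [2, 1, 2]])   # the example board from earlier
key = int(np.sum(state2d.ravel() * 3 ** np.arange(9)))  # base-3 encoding of the 9 cells
print(key)  # 16863, one of the 3**9 = 19683 possible row indices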

In [5]:
class QTable:
    def __init__(self):
        self._qtable = np.random.randn(3 ** 9, 9)
        self._state_key_vector = 3 ** np.arange(9)

    def __call__(self, state2d):
        return self._qtable[int(np.sum(state2d.ravel() * self._state_key_vector))]

    def save(self, filename):
        if not filename.endswith(".csv"):
            filename = filename + ".csv"
        np.savetxt(filename, self._qtable, delimiter=",")

    def load(self, filename):
        if not filename.endswith(".csv"):
            filename = filename + ".csv"
        self._qtable = np.loadtxt(filename, delimiter=",")
        return self

Agent

A reinforcement learning agent plays the game by playing a random action with probability Ɛ. Alternatively, with probability (1-Ɛ), the Agent will consult its Q-table and play the action with the optimal Q-value associated with it. This is the explore-exploit principle. During training of the Agent, Ɛ will be quite high, as one wants the agent to explore each corner case of the game. During evaluation, Ɛ will be set to zero, forcing the agent to play its best game.

In [6]:
class Agent:
    """ The Agent plays the game by playing a move corresponding to the optimal Q-value """

    def __init__(
        self, qtable=None, epsilon=0.2, learning_rate=0.3, discount_factor=0.9
    ):
        """ A new Agent can be given some optional parameters to tune how fast it learns
        
        Args:
            qtable: QTable=None: the initial Q-table to start with. 
            epsilon: float=0.2: the chance the Agent will explore a random move
                               (instead of choosing the optimal action according to the Q table)
            learning_rate: float=0.3: the rate at which the Agent learns from its games
            discount_factor: float=0.9: the rate at which the final reward gets discounted
                                        when rating previous moves.
        """
        self.qtable = QTable() if qtable is None else qtable

        # the speed at which the Qvalues get updated
        self.learning_rate = learning_rate

        # the discount factor of future rewards
        self.discount_factor = discount_factor

        # the chance of executing a random action
        self.epsilon = epsilon

    def random_action(self):
        """ get a random action """
        return int(np.random.randint(0, 9))

    def best_action(self, state):
        """ get the best values according to the current Q table """
        state2d, turn = state
        argminmax = {1: np.argmax, 2: np.argmin}[turn]
        qvalues = self.qtable(state2d)
        return argminmax(qvalues)

    def get_action(self, state):
        """ perform an action according to the state on the game board """

        if np.random.rand() < self.epsilon:
            # explore: play a random action (happens with probability epsilon)
            action = self.random_action()
        else:
            # exploit: play the best action according to the current Q table
            action = self.best_action(state)

        return action

    def learn(self, state, action, next_state, reward):
        """ learn from the current state and action taken. """
        state2d, turn = state
        next_state2d, next_turn = next_state
        if next_turn == 0:  # game finished
            expected_qvalue_for_action = reward
        else:
            next_qvalues = self.qtable(next_state2d)
            minmax = {1: max, 2: min}[next_turn]
            expected_qvalue_for_action = reward + (
                self.discount_factor * minmax(next_qvalues)
            )

        # update qvalues:
        qvalues = self.qtable(state2d)
        qvalues[action] += self.learning_rate * (
            expected_qvalue_for_action - qvalues[action]
        )

Train

Now that the Agent is defined, it's time to train it! Training is as simple as playing a 'bunch' of games (about 10,000,000). The Agent learns automatically through its learn method.

In [7]:
import tqdm  # tqdm provides the progress bar used below

# initialize
N = 10_000_000  # Number of training episodes
qtable = QTable()  # share Q for faster training
player = Agent(qtable=qtable, learning_rate=0.1, epsilon=0.6)
game = TicTacToe(player, player)  # let player play against itself

for _ in tqdm.trange(N):
    game.play()

qtable.save("qtable.csv")
100%|██████████| 10000000/10000000 [26:55<00:00, 6189.38it/s]

Let's play another game!

Let the trained agent play against itself one final time.

In [8]:
player = Agent(epsilon=0.0)  # epsilon=0 -> no random guesses
player.qtable.load("qtable.csv")
game = TicTacToe(player, player)  # let player play against itself

# play
game.play(visualize=True)
player 1's turn:
. . .
. O .
. . .


player 2's turn:
X . .
. O .
. . .


player 1's turn:
X . .
O O .
. . .


player 2's turn:
X . .
O O X
. . .


player 1's turn:
X . O
O O X
. . .


player 2's turn:
X . O
O O X
X . .


player 1's turn:
X . O
O O X
X O .


player 2's turn:
X X O
O O X
X O .


player 1's turn:
X X O
O O X
X O O


We see that the game ends in a tie. This is to be expected, as two perfect tic-tac-toe players playing against each other will always draw.
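
To double-check this, one could pit the trained Agent against the RandomPlayer for a number of games and count the outcomes. The RecordingPlayer below is a hypothetical little wrapper (not one of the classes defined above) that logs the final reward of each game; with a well-trained Q-table, the agent playing as player1 should win or tie every game:

class RecordingPlayer:
    """ hypothetical wrapper: forwards all calls to an inner player and records game outcomes """

    def __init__(self, player, results):
        self.player = player
        self.results = results  # shared list of final rewards (+1, 0 or -1)

    def get_action(self, state):
        return self.player.get_action(state)

    def learn(self, state, action, next_state, reward):
        self.player.learn(state, action, next_state, reward)
        if next_state[1] == 0:  # turn index 0 -> game over; this reward is the outcome
            self.results.append(reward)


results = []
agent = Agent(epsilon=0.0, learning_rate=0.0)  # no exploration, no further learning
agent.qtable.load("qtable.csv")
game = TicTacToe(
    RecordingPlayer(agent, results),           # trained agent plays as player1
    RecordingPlayer(RandomPlayer(), results),  # random opponent plays as player2
)
for _ in range(1000):
    game.play()
print("wins:", results.count(1), "ties:", results.count(0), "losses:", results.count(-1))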

The Q-table for the trained Agent can be downloaded in csv format here (availability not guaranteed).

Go to part 2 of the series



If you like this post, consider leaving a comment or starring it on GitHub.