
Paint Your Partner Layered Learning

In my last post, I trained a Paint Your Partner agent using reinforcement learning. Specifically, I provided the agent with all game information as a single, large chunk of observation data. With sufficient training time, the agent successfully learned to play the game.

However, in today’s post, I’ll explore an alternative approach: layered learning. In this paradigm, the agent learns to handle individual subtasks separately before being exposed to the complete environment. For the Paint Your Partner (PYP) environment, the key subtasks are:

1. Wandering
2. Painting
3. Moving to the goal

In this post, I’ll discuss how the agent is trained to perform each of these tasks. However, a crucial aspect of layered learning remains: integrating these layers to form a cohesive and complex behavior.
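To make the overall workflow concrete, here is a minimal sketch of how the three layers can each be trained in isolation. Everything below is illustrative: the pyp_envs module, the WanderEnv, PaintEnv, and GoToGoalEnv class names, the grid size, and the choice of PPO from Stable-Baselines3 are assumptions, not the exact code from this project.

    from stable_baselines3 import PPO

    # Hypothetical per-layer environments; the module and class names are placeholders
    from pyp_envs import WanderEnv, PaintEnv, GoToGoalEnv

    layers = {
        "wander": WanderEnv(grid_size=5),
        "paint": PaintEnv(),
        "goto": GoToGoalEnv(grid_size=5),
    }

    for name, env in layers.items():
        # One small policy per subtask; PPO stands in for whatever algorithm you prefer
        model = PPO("MlpPolicy", env, verbose=0)
        model.learn(total_timesteps=100_000)
        model.save(f"layers/{name}")  # reloaded later when the layers are combined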


Wandering

At first, I thought training a wandering behavior would be easy. However, I soon realized that training a purposeless behavior, like wandering, was even more conceptually challenging than training a complex behavior such as playing Paint Your Partner from scratch.

The difficulty stemmed from two key issues. First, because wandering lacks a clear goal, I had to carefully design an appropriate reward system. Second, I needed to balance exploitation and exploration effectively. Wandering is fundamentally an exploratory behavior, but since the agent is playing the game, its wandering actions still need to be effective in some way.

Below is the code that produces the wandering behavior. In short, the agent earns a small reward each time it steps onto a tile it has not visited before, and a large bonus once it has visited every tile within a step budget that is deliberately larger than the number of tiles in the grid.

    def step(self, action):

        new_position = self.agent_pos.copy()
        if action == 0: new_position[1] += 1  # Up
        elif action == 1: new_position[1] -= 1  # Down
        elif action == 2: new_position[0] -= 1  # Left
        elif action == 3: new_position[0] += 1  # Right

        reward = 0

        if 0 <= new_position[0] < self.grid_size and 0 <= new_position[1] < self.grid_size:
            self.agent_pos = new_position
            if tuple(self.agent_pos) not in self.visited_positions:
                reward = 1  # Reward for new tile
                self.visited_positions.add(tuple(self.agent_pos))

        # Terminate once every tile has been visited; truncate on a generous step budget
        terminated = len(self.visited_positions) == self.grid_size**2
        truncated = self.steps_taken >= self.grid_size**3

        if terminated:
            self.info["is_success"] = True
            reward += 100
        elif truncated:
            self.info["TimeLimit.truncated"] = True

        self.steps_taken += 1

        return self._get_obs(), reward, terminated, truncated, self.info
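For context, the step function above leans on some state that is not shown: self.agent_pos, self.visited_positions, and self.steps_taken. Here is a minimal sketch of reset and the observation, assuming a Gymnasium-style environment with numpy imported as np; the flat observation layout is my own choice, not the project's actual code.

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Start at a random cell and remember every cell visited so far
        self.agent_pos = list(self.np_random.integers(0, self.grid_size, size=2))
        self.visited_positions = {tuple(self.agent_pos)}
        self.steps_taken = 0
        self.info = {"is_success": False, "TimeLimit.truncated": False}
        return self._get_obs(), self.info

    def _get_obs(self):
        # Agent position plus a flattened binary map of already-visited cells
        visited_map = np.zeros((self.grid_size, self.grid_size), dtype=np.float32)
        for x, y in self.visited_positions:
            visited_map[x, y] = 1.0
        return np.concatenate([np.asarray(self.agent_pos, dtype=np.float32),
                               visited_map.ravel()])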

Wandering Behavior

Tensorboard Report


Painting

Interestingly, painting is more closely related to planning than to simply taking actions. For the agent to match its body color to the goal color, the first priority is to devise an effective strategy: a sequence of coloring steps. Only after that does the act of painting take place, which means guiding the agent toward the right color chip, an interim goal on the way to the final one.

Therefore, this layer functions as the agent’s mental planning ability for painting.

    # In __init__: the action space covers the three CMY chips, transparent, and a no-op
    self.action_space = spaces.Discrete(5)  # 0: cyan, 1: magenta, 2: yellow, 3: transparent, 4: do nothing

    def step(self, action):
        info = {
            "is_success": False,
            "TimeLimit.truncated": False,
        }
        self.steps_taken += 1  # Increment step counter
        max_steps = self.grid_size ** 2  # Step budget scales with the grid size

        # Handle the painting action
        if action in (0, 1, 2):  # Combine with cyan, magenta, or yellow
            self.agent_color = self._combine_colors(self.agent_color, self.cmytx[action])
        elif action == 3:  # Reset to transparent
            self.agent_color = Pallete.TRANSPARENT
        elif action == 4:  # Do nothing
            pass

        reward = 0

        # Terminate once the agent's color matches the goal color
        terminated = self.agent_color == self.goal_color
        if terminated:
            reward += 10  # Large reward for completing the objective
            info["is_success"] = True

        # Truncate when the step budget is exhausted
        truncated = self.steps_taken >= max_steps
        if truncated:
            info["TimeLimit.truncated"] = True

        reward -= 0.1  # Small per-step penalty to encourage short plans

        return self._get_obs(), reward, terminated, truncated, info
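The _combine_colors helper is not shown above. Conceptually it performs subtractive CMY mixing: combining two colors yields the color that contains the union of their primaries. Below is one possible sketch; the set-based lookup table is my own assumption about the implementation, and only the Pallete member names are taken from the actual code.

    # Which primaries each Pallete member contains (assumed encoding, not the real one)
    PRIMARIES = {
        Pallete.TRANSPARENT: frozenset(),
        Pallete.CYAN: frozenset("C"),
        Pallete.MAGENTA: frozenset("M"),
        Pallete.YELLOW: frozenset("Y"),
        Pallete.CYAN_MAGENTA: frozenset("CM"),
        Pallete.CYAN_YELLOW: frozenset("CY"),
        Pallete.MAGENTA_YELLOW: frozenset("MY"),
        Pallete.CYAN_MAGENTA_YELLOW: frozenset("CMY"),
    }
    COLOR_BY_PRIMARIES = {v: k for k, v in PRIMARIES.items()}

    def _combine_colors(self, current, added):
        # Subtractive mixing: the result holds the union of both colors' primaries
        return COLOR_BY_PRIMARIES[PRIMARIES[current] | PRIMARIES[added]]

Under a scheme like this, painting over with transparent is handled separately in step (it resets the color rather than mixing), which is consistent with the plans in the evaluation log below that start by resetting to transparent before repainting.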

Painting Logs

Below is the evaluation log of the painting layer. As you can see, the agent learned to plan its color changes efficiently, reaching the goal color in at most a few steps.

    Episode 4: Reward = 9.9, Steps = 1
    [<Pallete.CYAN: 0>]
    Goal Color: Pallete.CYAN_MAGENTA_YELLOW, Initial Agent Color: Pallete.CYAN_YELLOW
    Episode 5: Reward = 9.9, Steps = 1
    [<Pallete.MAGENTA: 1>]
    Goal Color: Pallete.CYAN_YELLOW, Initial Agent Color: Pallete.CYAN
    Episode 6: Reward = 9.9, Steps = 1
    [<Pallete.YELLOW: 2>]
    Goal Color: Pallete.CYAN_MAGENTA, Initial Agent Color: Pallete.MAGENTA_YELLOW
    Episode 7: Reward = 9.700000000000001, Steps = 3
    [<Pallete.TRANSPARENT: 7>, <Pallete.CYAN: 0>, <Pallete.MAGENTA: 1>]
    Goal Color: Pallete.CYAN_YELLOW, Initial Agent Color: Pallete.YELLOW
    Episode 8: Reward = 9.9, Steps = 1
    [<Pallete.CYAN: 0>]
    Goal Color: Pallete.CYAN_MAGENTA, Initial Agent Color: Pallete.MAGENTA_YELLOW
    Episode 9: Reward = 9.700000000000001, Steps = 3
    [<Pallete.TRANSPARENT: 7>, <Pallete.CYAN: 0>, <Pallete.MAGENTA: 1>]
    Goal Color: Pallete.CYAN_YELLOW, Initial Agent Color: Pallete.MAGENTA
    Episode 10: Reward = 9.700000000000001, Steps = 3
    [<Pallete.TRANSPARENT: 7>, <Pallete.YELLOW: 2>, <Pallete.CYAN: 0>]
    Evaluation over 10 episodes:
    Mean Reward: 9.81
    Success Rate: 100.00%
    Truncation Rate: 0.00%
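For reference, a log in that shape can be produced with a plain evaluation loop like the one below. This is only a sketch: the saved-model path, the PaintEnv class name, and the bookkeeping are mine, and the per-episode decoding of actions into Pallete members is omitted.

    from stable_baselines3 import PPO

    model = PPO.load("layers/paint")  # hypothetical path from the training sketch above
    env = PaintEnv()

    n_episodes, successes, rewards = 10, 0, []
    for episode in range(1, n_episodes + 1):
        obs, info = env.reset()
        done, total_reward, steps = False, 0.0, 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(int(action))
            total_reward += reward
            steps += 1
            done = terminated or truncated
        successes += int(info.get("is_success", False))
        rewards.append(total_reward)
        print(f"Episode {episode}: Reward = {total_reward}, Steps = {steps}")

    print(f"Mean Reward: {sum(rewards) / n_episodes:.2f}")
    print(f"Success Rate: {successes / n_episodes:.2%}")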

Moving to the goal

This layer was the easiest sublayer to train. The only observation the agent needs is its own position and the row and column of the goal.
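Here is a rough sketch of what that observation could look like, assuming gymnasium's spaces module and numpy; the flat row/column layout is my own choice rather than the project's actual code.

    import numpy as np
    from gymnasium import spaces

    # In __init__ (sketch): the agent sees its own cell and the goal cell
    self.observation_space = spaces.MultiDiscrete(
        [self.grid_size, self.grid_size,  # agent row, agent column
         self.grid_size, self.grid_size]  # goal row, goal column
    )
    self.action_space = spaces.Discrete(4)  # 0: up, 1: down, 2: left, 3: right

    def _get_obs(self):
        return np.concatenate([self.agent_pos, self.goal_pos])

The step function then only has to move the agent within the grid bounds and reward reaching the goal: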

    def step(self, action):

        if action == 0 and self.agent_pos[0] > 0:  # Move up
            self.agent_pos[0] -= 1
        elif action == 1 and self.agent_pos[0] < self.grid_size - 1:  # Move down
            self.agent_pos[0] += 1
        elif action == 2 and self.agent_pos[1] > 0:  # Move left
            self.agent_pos[1] -= 1
        elif action == 3 and self.agent_pos[1] < self.grid_size - 1:  # Move right
            self.agent_pos[1] += 1

        reward = 0

        # Terminate when the agent reaches the goal; truncate on a generous step budget
        terminated = np.array_equal(self.agent_pos, self.goal_pos)
        truncated = self.steps_taken >= self.grid_size**3

        if terminated:
            self.info["is_success"] = True
            reward += 100
        elif truncated:
            self.info["TimeLimit.truncated"] = True

        self.steps_taken += 1

        return self._get_obs(), reward, terminated, truncated, self.info

Moving to the goal behavior

Tensorboard Report


Thoughts

Now the ingredients for layered learning are ready. The next post will cover how to combine these layers so the agent can perform the full, complex playing behavior.