# Maze PLR/ACCEL
This example implements several curation-based UED methods. In particular, it supports Domain Randomisation (DR), Prioritized Level Replay (PLR), Robust PLR, and ACCEL.
See the DR example for details about what this outputs and how to run it.
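Which of the four methods you get is determined by how the flags below are combined. The mapping here is a rough sketch inferred from the argument descriptions in the table that follows, not copied from the script itself; consult the example's source for the authoritative behaviour.

```python
# Illustrative flag combinations per method (an assumption based on the
# argument table below, not verbatim from the script).
METHOD_FLAGS = {
    # DR: never replay, so training always happens on fresh random levels.
    "DR": {"replay_prob": 0.0},
    # PLR: replay prioritised levels, but also take gradient steps on new ones.
    "PLR": {"replay_prob": 0.8, "exploratory_grad_updates": True},
    # Robust PLR: gradient updates only on replayed levels.
    "Robust PLR": {"replay_prob": 0.8, "exploratory_grad_updates": False},
    # ACCEL: Robust PLR plus mutation of replayed levels.
    "ACCEL": {
        "replay_prob": 0.8,
        "exploratory_grad_updates": False,
        "use_accel": True,
        "num_edits": 5,
    },
}
```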
## Arguments
Name | Description | Default |
---|---|---|
--score_function | The score function to use: `pvl` (positive value loss) or `MaxMC` (max Monte Carlo). | `MaxMC` |
--exploratory_grad_updates | If `True`, gradient updates are also performed on newly generated random levels (standard PLR); Robust PLR sets this to `False`, so only replayed levels are trained on. | `False` |
--level_buffer_capacity | The maximum number of levels in the buffer. | 4000 |
--replay_prob | The probability of performing a replay step. | 0.8 |
--staleness_coeff | The coefficient used to combine staleness and score weights (see the sketch after this table). | 0.3 |
--temperature | The temperature for rank prioritization. Only valid if `prioritization=rank`. | 0.3 |
--topk_k | The number of levels sampled when using top-k prioritization. Only valid if `prioritization=topk`. | 4 |
--minimum_fill_ratio | The minimum fraction of the level buffer that must be filled before replay can be triggered. | 0.5 |
--prioritization | The prioritization scheme, `rank` or `topk`. | `rank` |
--buffer_duplicate_check | If `True`, duplicate levels cannot be added to the buffer. | `True` |
--use_accel | If `True`, runs ACCEL. | `False` |
--num_edits | Only valid if `--use_accel=True`; the number of mutations (edits) performed on replayed levels. | 5 |
--project | The wandb project to log to. | JAXUED_TEST |
--run_name | The wandb group name to use. | None |
--seed | Random seed. | 0 |
--mode | `train` or `eval`. | train |
--checkpoint_directory | Only valid if `mode==eval`; the directory to load the checkpoint from. | None |
--checkpoint_to_eval | Only valid if `mode==eval`; the timestep to load from the above checkpoint directory. | -1 |
--checkpoint_save_interval | How often to save checkpoints. | 0 |
--max_number_of_checkpoints | How many checkpoints to save in total. | 60 |
--eval_freq | How often to evaluate the agent and log results. | 250 |
--eval_num_attempts | How many attempts (episodes) per level to run for evaluation. | 10 |
--eval_levels | The eval levels to use. | "SixteenRooms", "SixteenRooms2", "Labyrinth", "LabyrinthFlipped", "Labyrinth2", "StandardMaze", "StandardMaze2", "StandardMaze3" |
--lr | The agent's learning rate. | 1e-4 |
--max_grad_norm | The maximum gradient norm for PPO updates. | 0.5 |
--num_updates | Number of updates. Mutually exclusive with `num_env_steps`. Generally, `num_env_steps = num_updates * num_steps * num_train_envs` (see the example after this table). | 30000 |
--num_env_steps | Number of environment steps. Mutually exclusive with `num_updates`. | None |
--num_steps | Number of PPO rollout steps. | 256 |
--num_train_envs | Number of parallel training environments. | 32 |
--num_minibatches | Number of PPO minibatches. | 1 |
--gamma | Discount factor. | 0.995 |
--epoch_ppo | Number of PPO epochs per update. | 5 |
--clip_eps | PPO clipping epsilon. | 0.2 |
--gae_lambda | The GAE lambda. | 0.98 |
--entropy_coeff | PPO entropy coefficient. | 1e-3 |
--critic_coeff | Critic (value) loss coefficient. | 0.5 |
--agent_view_size | The number of tiles the agent can see in front of it. | 5 |
--n_walls | Number of walls to generate in each random level. | 25 |
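For intuition about how `--temperature`, `--staleness_coeff`, and rank prioritization interact, here is a minimal sketch of the standard PLR replay distribution from Jiang et al.; it illustrates the idea and is not necessarily this example's exact code.

```python
import jax.numpy as jnp

def replay_weights(scores, staleness, temperature=0.3, staleness_coeff=0.3):
    """Rank-prioritised PLR replay distribution (illustrative sketch)."""
    # Rank levels by score: the highest-scoring level gets rank 1.
    ranks = jnp.argsort(jnp.argsort(-scores)) + 1
    # Lower temperature -> sharper concentration on the top-ranked levels.
    score_w = (1.0 / ranks) ** (1.0 / temperature)
    score_w = score_w / score_w.sum()
    # Staleness weights favour levels that have not been sampled recently.
    stale_w = staleness / staleness.sum()
    # Mix the two distributions using the staleness coefficient.
    return (1.0 - staleness_coeff) * score_w + staleness_coeff * stale_w

# Example: three buffered levels with scores and staleness counters.
print(replay_weights(jnp.array([0.5, 0.1, 0.9]), jnp.array([3.0, 10.0, 1.0])))
```

With `prioritization=topk`, the score weights would instead be spread uniformly over the `topk_k` highest-scoring levels.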
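To make the `--num_updates`/`--num_env_steps` relationship concrete, here is the conversion with the default values above (a back-of-the-envelope check, not code from the example):

```python
# The two flags are mutually exclusive; with the defaults they relate as:
num_steps, num_train_envs = 256, 32  # PPO rollout length and parallel envs
num_updates = 30_000
num_env_steps = num_updates * num_steps * num_train_envs  # 245_760_000
```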