Maze PAIRED
This implements PAIRED (Protagonist Antagonist Induced Regret Environment Design).
See the DR example for details on what this outputs and how to run it.
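In PAIRED, the adversary is rewarded with an estimate of regret. On each generated level, this is the antagonist's best return minus the protagonist's average return, as defined in the original PAIRED paper. Below is a minimal JAX sketch of that estimate. It assumes each agent's returns have already been collected into a `(num_levels, num_episodes)` array. The function name `paired_regret` is hypothetical, and this is not the code this example runs.

```python
import jax.numpy as jnp


def paired_regret(antagonist_returns: jnp.ndarray,
                  protagonist_returns: jnp.ndarray) -> jnp.ndarray:
    """Hypothetical helper: per-level regret estimate from the PAIRED paper.

    Both arrays have shape (num_levels, num_episodes) and hold the returns
    each agent obtained on the levels the adversary generated.
    """
    # The antagonist's best episode approximates what is achievable on the level.
    best_antagonist = jnp.max(antagonist_returns, axis=1)
    # The protagonist's mean episode approximates its expected return.
    mean_protagonist = jnp.mean(protagonist_returns, axis=1)
    # The adversary and antagonist try to maximise this; the protagonist to minimise it.
    return best_antagonist - mean_protagonist


antagonist = jnp.array([[0.0, 1.0], [0.5, 0.5]])
protagonist = jnp.array([[0.0, 0.5], [0.5, 0.5]])
# Regret is 0.75 on the first level (1.0 - 0.25) and 0.0 on the second.
regret = paired_regret(antagonist, protagonist)
```

Using the antagonist's maximum return rather than its mean follows the paper. A level only earns the adversary a high reward if some agent can actually solve it and the protagonist does not.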
Arguments
Argument Name | Description | Default |
---|---|---|
`--project` | Wandb project | JAXUED_TEST |
`--run_name` | This controls where the checkpoints are stored | None |
`--seed` | Random seed | 0 |
`--mode` | "train" or "eval" | train |
`--checkpoint_directory` | Only valid if mode==eval. Directory to load the checkpoint from | None |
`--checkpoint_to_eval` | Only valid if mode==eval. The timestep of the checkpoint to load from `--checkpoint_directory` | -1 |
`--checkpoint_save_interval` | How often to save checkpoints | 0 |
`--max_number_of_checkpoints` | How many checkpoints to save in total | 60 |
`--eval_freq` | How often to evaluate the agent and log | 250 |
`--eval_num_attempts` | How many attempts (episodes) per level to run for evaluation | 10 |
`--eval_levels` | The eval levels to use | "SixteenRooms", "SixteenRooms2", "Labyrinth", "LabyrinthFlipped", "Labyrinth2", "StandardMaze", "StandardMaze2", "StandardMaze3" |
`--num_updates` | Number of updates. Mutually exclusive with `num_env_steps`. Generally, `num_env_steps = num_updates * num_steps * num_train_envs` | 30000 |
`--num_env_steps` | Number of env steps. Mutually exclusive with `num_updates` | None |
`--num_train_envs` | Number of training environments | 32 |
`--student_lr` | Student's learning rate | 1e-4 |
`--student_max_grad_norm` | Student's maximum gradient norm for PPO updates | 0.5 |
`--student_num_steps` | Student's number of PPO rollout steps | 256 |
`--student_num_minibatches` | Student's number of PPO minibatches | 1 |
`--student_gamma` | Student's discount factor | 0.995 |
`--student_epoch_ppo` | Student's number of PPO epochs | 5 |
`--student_clip_eps` | Student's PPO clipping epsilon | 0.2 |
`--student_gae_lambda` | Student's GAE lambda | 0.98 |
`--student_entropy_coeff` | Student's PPO entropy coefficient | 1e-3 |
`--student_critic_coeff` | Student's critic coefficient | |
`--adv_lr` | Adversary's learning rate | 1e-4 |
`--adv_max_grad_norm` | Adversary's maximum gradient norm for PPO updates | 0.5 |
`--adv_num_steps` | Adversary's number of PPO rollout steps | 50 |
`--adv_num_minibatches` | Adversary's number of PPO minibatches | 1 |
`--adv_gamma` | Adversary's discount factor | 0.995 |
`--adv_epoch_ppo` | Adversary's number of PPO epochs | 5 |
`--adv_clip_eps` | Adversary's PPO clipping epsilon | 0.2 |
`--adv_gae_lambda` | Adversary's GAE lambda | 0.98 |
`--adv_entropy_coeff` | Adversary's PPO entropy coefficient | 1e-3 |
`--adv_critic_coeff` | Adversary's critic coefficient | 0.5 |
`--agent_view_size` | The number of tiles the agent can see in front of it | 5 |
`--n_walls` | Number of walls to generate | 25 |
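The relationship between `num_updates` and `num_env_steps` in the table is easy to check by hand. With the defaults, 30000 updates × 256 steps × 32 environments = 245,760,000 env steps. The helper below is hypothetical and is not this script's code. It assumes `num_steps` refers to `--student_num_steps` and that an env-step budget is converted to updates by rounding down. It only follows the formula given in the table.

```python
def resolve_num_updates(num_updates=None, num_env_steps=None,
                        num_steps=256, num_train_envs=32):
    """Hypothetical helper: return the number of updates to run.

    Exactly one of num_updates / num_env_steps must be given, mirroring the
    "mutually exclusive" note in the table. The defaults for num_steps and
    num_train_envs are the table's --student_num_steps and --num_train_envs.
    """
    if (num_updates is None) == (num_env_steps is None):
        raise ValueError("Set exactly one of num_updates and num_env_steps")
    # Env steps consumed by one update: a rollout of num_steps in every env.
    steps_per_update = num_steps * num_train_envs
    if num_env_steps is not None:
        # Assumption: round down so the run never exceeds the step budget.
        return num_env_steps // steps_per_update
    return num_updates


# Default setting: 30000 updates, i.e. 245,760,000 env steps.
default_steps = resolve_num_updates(num_updates=30000) * 256 * 32
# A 100M env-step budget becomes 100_000_000 // 8192 = 12207 updates.
budget_updates = resolve_num_updates(num_env_steps=100_000_000)
```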