Evaluation
The evaluation is conducted on the Huawei Cloud. The cloud server will regularly check the docker server and look for update. If an image update is detected, the evaluation will automatically start. You can only evaluate your agent once per day.
The cloud computer currently has 12 vCPUs and 24 GiB RAM. Each agent will be assigned with 4 cores. Unfortunately, we don’t support GPU acceleration this year.
What is evaluated
For the warmup phase we will evaluate the agent by running
python run.py --n_episodes 1 --steps_per_game 45000 -e tournament -g phase-3
That means that we load the default config file air_hockey_agent/agent_config.yml and overwrite the parameters
related for evaluation. You can append your agent related parameters in the config file.
Note
Please keep in mind that we will restore all all the air_hockey_challenge/* files as well as run.py to their
original state for evaluation on the server.
Get the Dataset
We provide a script to download the dataset once the evaluation is done. You can download them by running
python scripts/download_dataset.py
Dataset
The dataset is organized as follows:
dataset_path
├── qualifying
│ ├── eval-{datetime}
│ │ ├── out.log
│ │ ├── result_{team}.json
│ │ ├── Game_0
│ │ │ ├── computation_time.npy
│ │ │ ├── ee_constr.npy
│ │ │ ├── joint_pos_constr.npy
│ │ │ ├── joint_vel_constr.npy
│ │ │ ├── dataset.pkl
│ │ │ └── violations.json
└── tournament
You can load the files as
dataset = np.load(path_to_the_file, allow_pickle=True)
The dataset in dataset.pkl stores a list of tuples at each step. The length of the list is the total number of
steps of the evaluation. Each item in the list contains a tuple:
(state[array], action[array], reward[float], next_state[array], absorbing[bool], last[bool])
The other .npy files contain the specific info of the evaluation at each step.
To replay the dataset, you can call the function in the air_hockey_challenge/utils/replay_dataset.py
Metric
The agent is evaluated by playing against the baseline agent for 45000 steps. We will compute two metrics in the evaluation:
Success Rate: A success criterion is defined for each task. We will check if the task succeed when the episode terminates. Each episode can terminate because of two reasons:
Maximum number of steps reached
No further interaction can be in the episodes.
Deployability: The deployability score will assess your agent in multiple aspects. Each metric will be assigned one or more penalty points depending on the level of risk. When a constraint is violated, the deployability penalty points are added up by the associated penalty points. Same violations will count once per episode. The following are the constraints considered in the challenge:
- Violations of the End-Effector’s Position Constraints (3):
The x-y-position of the end-effector should remain within the boundaries of the table. The z-position of the end-effector should remain within a range. The end-effector’s position can be computed \(p = \mathrm{ForwardKinematics} (q)_{x}\). The constraints can be represented as:
\(l_x < p_x,\)
\(l_y < p_y < u_y,\)
\(\mathrm{table\,height - tolerance} < p_z < \mathrm{table\, height + tolerance}\)
This constraint is very strict. As we use a high-gain tracking controller to improve the tracking performance. An infeasible command could potentially damage the table, end-effector , or robot actuator.
- Violations of the Elbow and Wrist Link Height Constraints (3):
The elbow and wrist link should remain high enough to avoid collision with the table. The height of the elbow and wrist link can be computed as:
\(z_\mathrm{elbow} > 0.25, \quad z_\mathrm{wrist} > 0.25\)
- Violations of the Joint Position Limit Constraints (2):
The joint position should not exceed the position limits. In the real-robot, violations of the position limit will trigger the emergency break.
\(q_l < q_{cmd} < q_u\)
- Violations of the Joint Velocity Limit Constraints (1):
The velocity should not exceed the velocity limits. The controller in the real-robot are set not to exceed the velocity limits.
\(\dot{q}_l < \dot{q}_{cmd} < \dot{q}_u\)
- Computation Time (0.5 - 2):
The computation time at each step should be smaller than 0.02s.
- Penalty Points 2:
maximum computation time > 0.2s or average computation time > 0.02s
- Penalty Points 1:
0.2s >= maximum computation time > 0.1s
- Penalty Points 0.5:
0.1s >= maximum computation time > 0.02s
Leaderboard
We will categorize your agent into three categories based on the deployability penalty:
Deployable
Improvable
Non-deployable
The leaderboard is divided into three categories by Deployability. Each
category will be ranked separately according to Success Rate. At each stage, we provide
an overall leaderboard and a task-specific leaderboard. In the overall leaderboard,
deployability is categorized by the maximum penalty score for all tasks; the score for
the ranking is a weighted average of the success rates of all tasks.