ICME 2020 CORSMAL Challenge

Multi-modal fusion and learning for robotics

Organised in conjunction with the IEEE International Conference on Multimedia and Expo 2020, London, UK

Challenge Description

A major challenge to address for human-robot cooperation in household chores is enabling robots to predict the properties of previously unseen containers with different fillings. Examples of containers are cups, glasses, mugs, bottles and food boxes, whose varying physical properties such as material, stiffness, texture, transparency, and shape must be inferred on-the-fly prior to a pick-up or a handover.

The CORSMAL Challenge focuses on the estimation of the pose, dimensions and mass of containers. Participants will determine the physical properties of a container while it is manipulated by a human. Containers vary in their physical properties (shape, material, texture, transparency and deformability). Containers and fillings are not known to the robot, and the only prior information available is a set of object categories (glasses, cups, food boxes) and a set of filling types (water, pasta, rice).

The CORSMAL Challenge distributes a multi-modal dataset with visual-audio-inertial recordings of people interacting with containers, for example while pouring a liquid into a glass or shaking a food box. The dataset is recorded with four cameras (one on a robotic arm, one on the human chest and two third-person views) and a microphone array. Each camera provides RGB, depth and stereo infrared images, which are temporally synchronized and spatially aligned. The body-worn camera is equipped with an inertial measurement unit, whose data are also provided.

The CORSMAL Challenge includes three scenarios:

  • Scenario 1. A container is on the table, in front of the robot. A person pours filling into the container or shakes an already filled food box, and then hands it over to the robot.
  • Scenario 2. A container is held by a person, in front of the robot. The person pours the filling into the container or shakes an already filled food box, and then hands it over to the robot.
  • Scenario 3. A container is held by a person, outside the field of view of the camera mounted on the robot. The person pours the filling into the container or shakes an already filled food box, walks towards a position in front of the robot and hands the container over to the robot.
Each scenario is recorded with two different backgrounds and under two different lighting conditions.

See the call for participation.


The dataset consists of 15 container types: five drinking cups, five drinking glasses and five food boxes. These containers are made of different materials, such as plastic, glass and paper. Each container can be empty or filled with water, rice or pasta at two different levels of fullness: 50% and 90% with respect to the capacity of the container. The combination of containers and fillings results in a total of 95 configurations executed by a different subject for each scenario and for each background/illumination condition. The total number of configurations is 1140.
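
The counts above can be reproduced with a short calculation. The sketch below rests on one assumption that is not stated explicitly in the text: cups and glasses take all three fillings, while food boxes take only the two dry fillings (rice and pasta). Under that assumption the arithmetic yields the 95 and 1140 quoted above.

```python
# Assumption (not stated in the text): food boxes are never filled with water,
# so they only take the two dry fillings (rice and pasta).
FULLNESS_LEVELS = 2          # 50% and 90%
N_CUPS_AND_GLASSES = 10      # five cups + five glasses
N_FOOD_BOXES = 5


def configs_per_container(n_fillings: int) -> int:
    """One empty configuration plus every (filling, fullness) pair."""
    return 1 + n_fillings * FULLNESS_LEVELS


per_condition = (
    N_CUPS_AND_GLASSES * configs_per_container(3)   # water, rice, pasta
    + N_FOOD_BOXES * configs_per_container(2)       # rice, pasta
)

N_SCENARIOS = 3
N_BACKGROUNDS = 2
N_LIGHTINGS = 2
total = per_condition * N_SCENARIOS * N_BACKGROUNDS * N_LIGHTINGS

print(per_condition, total)
```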

The dataset is split into a training set (9 containers), a public testing set (3 unseen containers), and a private testing set (3 unseen containers). The containers in each set are evenly distributed among the three container types. For the training set, we provide participants with annotations of the container capacity, the filling type, the fullness, and the masses of the container and of the filling.

The cameras are Intel RealSense D435i devices providing RGB, narrow-baseline stereo infrared and depth images at 30 Hz with a resolution of 1280x720 pixels. RGB, infrared and depth images are spatially aligned. The calibration information (intrinsic and extrinsic parameters) and the inertial measurements of the Intel RealSense D435i used as the body-worn camera are also provided. The microphone array is placed on the table and consists of 8 Boya BY-M1 omnidirectional lavalier microphones arranged in a circle of radius 15 cm. Audio signals are sampled synchronously at 44.1 kHz with a multi-channel audio recorder. All signals are synchronized.
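
With video at 30 Hz and audio at 44.1 kHz, each video frame spans 44100 / 30 = 1470 audio samples. The sketch below maps a frame index to the matching range of audio samples. It assumes that frame 0 and audio sample 0 share the same timestamp and that there is no drift between the streams, which should be checked against the actual recordings. The function name is hypothetical.

```python
from typing import Tuple

VIDEO_FPS = 30          # camera frame rate stated above
AUDIO_RATE = 44_100     # audio sampling rate stated above, in Hz


def audio_span_for_frame(frame_idx: int) -> Tuple[int, int]:
    """Return the [start, end) audio sample indices covered by one video frame.

    Assumption: frame 0 and audio sample 0 share the same timestamp.
    """
    if frame_idx < 0:
        raise ValueError("frame_idx must be non-negative")
    start = frame_idx * AUDIO_RATE // VIDEO_FPS
    end = (frame_idx + 1) * AUDIO_RATE // VIDEO_FPS
    return start, end


print(audio_span_for_frame(0))
print(audio_span_for_frame(10))
```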

We encourage participants to complement the provided dataset with their own acquisitions. Please contact the organisers if you want your data to be included in a future release of the dataset.

Sample recordings (animated previews from Camera 1, Camera 2, Camera 3 and Camera 4, plus audio):

  • Scenario 1: Red cup
  • Scenario 2: Wine glass (50% rice)
  • Scenario 3: Tea box (90% pasta)

- RGB = RGB videos from the 4 cameras
- Infrared = stereo infrared videos from the 4 cameras
- Depth = depth images from the 4 cameras
- Others = audio files from the microphone array, IMU data and calibration information

Split         ID  Container               RGB         Infrared  Depth  Others
Train          1  Red cup                 1_rgb.zip
Train          2  Small white cup         2_rgb.zip
Train          3  Small transparent cup   3_rgb.zip
Train          4  Green glass             4_rgb.zip
Train          5  Wine glass              5_rgb.zip
Train          6  Champagne flute glass   6_rgb.zip
Train          7  Cereal box              7_rgb.zip
Train          8  Biscuit box             8_rgb.zip
Train          9  Tea box                 9_rgb.zip
Public test*  10  Beer cup                10_rgb.zip
Public test*  11  Cocktail glass          11_rgb.zip
Public test*  12  Fusilli pasta box       12_rgb.zip

List of containers and their physical properties
Microphone locations

Download the dataset:
Script to download training set (Unix, Windows) (~300GB)
Script to download public testing set (Unix, Windows) (~95GB).
*The public testing set is password-protected; please email us to obtain the password.

Evaluation criteria and methodology

For the public testing set, participants will provide, for each configuration, estimates of the capacity of the container, the filling type, the fullness, the mass of the filling and the mass of the empty container. For the private testing set, participants will submit the source code, which will be run by the organisers to compute the estimates for each configuration. The source code should be properly commented and easy to run. The source code will be deleted after the release of the results. The evaluation will be based on the results on both testing sets, and participants will be ranked across all scenarios. Additionally, the organisers will provide the ranks for each scenario independently.

The estimated capacity is scored with the mean absolute error between the estimated and the true capacity of the container. The accuracy of the filling type (e.g. rice classified as rice) and of the fullness (e.g. 50% classified as 50%) is the ratio between the number of true positives and the total number of configurations. The mean absolute error is also used to score the estimated masses of the empty container and of the filling.
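
The two kinds of score described above can be sketched as follows. This is a minimal illustration, not the organisers' evaluation code; the function names and the toy values are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(estimates: Sequence[float], truths: Sequence[float]) -> float:
    """Average of |estimate - truth|; applies to the capacity and to both masses."""
    if len(estimates) != len(truths) or not truths:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


def classification_accuracy(predictions: Sequence, truths: Sequence) -> float:
    """Correctly classified configurations divided by all configurations."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


# Toy values, for illustration only (not taken from the dataset).
capacity_mae = mean_absolute_error([500.0, 300.0, 1200.0], [520.0, 310.0, 1150.0])
filling_acc = classification_accuracy(
    ["rice", "water", "pasta", "rice"], ["rice", "water", "rice", "rice"]
)
fullness_acc = classification_accuracy([50, 90, 0, 50], [50, 90, 0, 90])

print(capacity_mae, filling_acc, fullness_acc)
```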

Important dates

Release of the public testing set: 30 January 2020
Results on the public testing set and source code of methods due: 21 February 2020
Release of the evaluation results on the public and private testing sets: 28 February 2020
Paper due: 13 March 2020
Final acceptance notification: 15 April 2020
Final manuscript due: 29 April 2020

Submission guidelines

Participants must submit the estimates for each configuration of the public testing set in a csv file to corsmal-challenge@qmul.ac.uk. Each row corresponds to one configuration, with the following columns:

  • Column 1: object id
  • Column 2: configuration id
  • Column 3: estimated capacity of the container, in milliliters
  • Column 4: estimated mass of the empty container, in grams
  • Column 5: filling type (water, pasta or rice)
  • Column 6: estimated fullness (0, 50 or 90)
  • Column 7: estimated mass of the filling, in grams
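
A submission file in this column order could be produced along the following lines. This is a sketch under stated assumptions: the text does not say whether a header row is expected (none is written here), how configuration ids are numbered, or how the filling type of an empty container should be labelled. The function names and the example values are hypothetical.

```python
import csv
import io

FILLING_TYPES = ("water", "pasta", "rice")
FULLNESS_LEVELS = (0, 50, 90)


def make_row(object_id, config_id, capacity_ml, container_mass_g,
             filling_type, fullness, filling_mass_g):
    """Validate one estimate and return it in the column order given above."""
    if fullness not in FULLNESS_LEVELS:
        raise ValueError(f"fullness must be one of {FULLNESS_LEVELS}")
    # The text does not specify a label for the filling of an empty container,
    # so the filling type is only checked for non-empty configurations.
    if fullness != 0 and filling_type not in FILLING_TYPES:
        raise ValueError(f"filling_type must be one of {FILLING_TYPES}")
    return [object_id, config_id, capacity_ml, container_mass_g,
            filling_type, fullness, filling_mass_g]


def write_submission(rows, stream):
    """Write rows as comma-separated values, without a header (assumption)."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


# Made-up estimates for object 10; the configuration ids are hypothetical.
buffer = io.StringIO()
write_submission(
    [
        make_row(10, 0, 520.0, 31.5, "rice", 50, 210.0),
        make_row(10, 1, 515.0, 30.0, "water", 90, 455.5),
    ],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `stream`.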

Participants must submit the source code of their models, which will be run by the organisers on the private testing set. The source code should be properly commented and easy to run. The source code will be deleted by the organisers after the release of the results.

A technical paper describing the proposed method, and following the ICME guidelines, must be submitted through the ICME 2020 website.


Apostolos Modas, École polytechnique fédérale de Lausanne (EPFL)
Pascal Frossard, École polytechnique fédérale de Lausanne (EPFL)
Andrea Cavallaro, Queen Mary University of London (UK)
Ricardo Sanchez-Matilla, Queen Mary University of London (UK)
Alessio Xompero, Queen Mary University of London (UK)


Riccardo Mazzon, Queen Mary University of London (UK)

Info and queries: corsmal-challenge@qmul.ac.uk

Sponsor of the ICME 2020 CORSMAL Challenge

IET