ICME 2020 CORSMAL Challenge

Multi-modal fusion and learning for robotics

Organised in conjunction with the IEEE International Conference on Multimedia and Expo 2020, London, United Kingdom

Challenge Description

A major challenge to address for human-robot cooperation in household chores is enabling robots to predict the properties of previously unseen containers with different fillings. Examples of containers are cups, glasses, mugs, bottles and food boxes, whose varying physical properties such as material, stiffness, texture, transparency, and shape must be inferred on-the-fly prior to a pick-up or a handover.

The CORSMAL challenge focuses on the estimation of the pose, dimensions and mass of containers. Participants will determine the physical properties of a container while it is manipulated by a human. Containers vary in their physical properties (shape, material, texture, transparency and deformability). Containers and fillings are not known to the robot, and the only prior information available is a set of container categories (glasses, cups, food boxes) and a set of filling types (water, pasta, rice).

The CORSMAL challenge distributes a multi-modal dataset with visual-audio-inertial recordings of people interacting with containers, for example while pouring a liquid into a glass or moving a food box. The dataset is created with four cameras (one on a robotic arm, one on the human and two third-person views) and a microphone array. Each camera provides RGB, depth and stereo infrared images, which are temporally synchronized and spatially aligned. The body-worn camera is equipped with an inertial measurement unit, whose measurements we also provide.

The CORSMAL Challenge includes three scenarios:

  • Scenario 1. A container is on the table, in front of the robot. A person pours filling into the container or shakes an already filled food box, and then hands it over to the robot.
  • Scenario 2. A container is held by a person, in front of the robot. The person pours the filling into the container or shakes an already filled food box, and then hands it over to the robot.
  • Scenario 3. A container is held by a person, in front of the robot. The person pours filling into the container or shakes an already filled food box, and then walks around for a few seconds holding the container. Finally, the person hands the container over to the robot.
Each scenario is recorded with two different backgrounds and under two different lighting conditions.

The call for participants can be found here.


The dataset consists of 15 container types: five drinking cups, five drinking glasses and five food boxes. These containers are made of different materials, such as plastic, glass and paper. Each container can be empty or filled with water, rice or pasta at two different levels of fullness: 50% and 90% with respect to the capacity of the container.

We provide annotations for the capacity of the container, the filling type, the fullness, and the mass of the container and of the filling. Annotations are available to the participants for the training set but remain hidden for the testing set. The training dataset has nine containers and the testing dataset has six unseen containers, evenly distributed among the three categories. We also release a preliminary training dataset, a sample of the training set recorded in a different setup.

The cameras are Intel RealSense D435i, providing RGB, narrow-baseline stereo infrared and depth images at 30 Hz with a resolution of 1280x720 pixels. RGB, infrared and depth images are spatially aligned, and we also provide the calibration information (intrinsic and extrinsic parameters) and the inertial measurements of the Intel RealSense D435i used as body-worn camera. The microphone array is placed on the table and consists of 8 Boya BY-M1 omnidirectional lavalier microphones arranged in a circle of 15 cm diameter. Audio signals are sampled synchronously at 44.1 kHz with a multi-channel audio recorder. All signals are synchronized.

We encourage participants to complement the provided dataset with their own acquisitions. Please contact the organisers if you want your data to be included in a future release of the dataset.

Preliminary dataset

The preliminary dataset is a small, representative sample of the whole dataset; however, some changes will be made to the whole dataset to account for different backgrounds and setups. The provided preliminary training set covers the three scenarios captured under two different backgrounds (uniform tabletop and tabletop covered by a textured tablecloth), using two drinking cups, two drinking glasses and two food boxes, each either empty, filled to 90% with pasta, or filled to 50% with rice. The combination of containers and fillings results in 18 configurations, each executed by a different subject for each scenario and each background condition, for a total of 108 configurations in this preliminary training set.
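The configuration count above can be verified with a short enumeration. This is purely illustrative; the container names are the six preliminary-set containers listed further below, and the counts come from the description above.

```python
# Enumerate the preliminary training set configurations described above.
containers = ["red cup", "small white cup", "wine glass",
              "green glass", "biscuit box", "tea box"]       # 2 cups + 2 glasses + 2 boxes
fillings = ["empty", "90% pasta", "50% rice"]                # three filling states
scenarios = [1, 2, 3]
backgrounds = ["uniform tabletop", "textured tablecloth"]

# Container-filling pairs: 6 x 3 = 18 configurations
configs = [(c, f) for c in containers for f in fillings]
print(len(configs))                                          # 18

# Each pair is recorded per scenario and per background: 18 x 3 x 2 = 108
total = len(configs) * len(scenarios) * len(backgrounds)
print(total)                                                 # 108
```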

Example recordings (the original page embeds animated previews of the four camera views and the audio track):

  • Scenario 1: red cup
  • Scenario 2: wine glass (50% rice)
  • Scenario 3: tea box (90% pasta)

- RGB = RGB videos from the 4 cameras
- Infrared = stereo infrared videos
- Depth = depth images
- Others = audio files from the microphone array, IMU data and calibration information

ID  Container        RGB
1   Red cup          1_rgb.zip
2   Small white cup  2_rgb.zip
3   Wine glass       3_rgb.zip
4   Green glass      4_rgb.zip
5   Biscuit box      5_rgb.zip
6   Tea box          6_rgb.zip

List of containers and their physical properties here
Annotation here
Microphone locations here
Readme here
Linux script to download all files (~40GB) here

Evaluation criteria and methodology

For each configuration, participants will provide their estimates of the capacity of the container, the filling type, the fullness, the mass of the filling and the mass of the empty container. The evaluation will be based on the results on the testing dataset, and participants will be ranked across all scenarios; we will also provide rankings for each scenario independently. Over all configurations in the testing set, the accuracy of the estimated capacity is measured as the mean absolute error, whereas the accuracy of the filling-type estimation (e.g. rice classified as rice) and of the fullness estimation (e.g. 50% classified as 50%) is the ratio between the number of correct classifications and the total number of configurations. The mean absolute error is also used to assess the estimated masses of the empty container and of the filling.
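The two metric types described above can be sketched as follows. This is a minimal illustration, not an official evaluation script; the function names and the toy values are our own.

```python
def mean_absolute_error(estimated, true):
    """MAE, used for the capacity (in ml) and for the two masses (in g)."""
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)

def classification_accuracy(estimated, true):
    """Fraction of configurations whose label (filling type or fullness)
    is classified correctly, i.e. correct classifications over all configurations."""
    return sum(e == t for e, t in zip(estimated, true)) / len(true)

# Toy example over three configurations
print(mean_absolute_error([500.0, 300.0, 250.0], [520.0, 300.0, 240.0]))  # 10.0
print(classification_accuracy(["rice", "water", "pasta"],
                              ["rice", "water", "rice"]))                 # ~0.667
```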

Important dates

Start of the Challenge. Release of the preliminary training dataset: 5 December 2019 (postponed from 30 November 2019)
Release of the training dataset: 20 December 2019
Release of the testing dataset: 30 January 2020
Results on the testing dataset due: 21 February 2020
Release of the evaluation results: 28 February 2020
Paper due: 13 March 2020
Final acceptance notification: 15 April 2020
Final manuscript due: 29 April 2020

Submission guidelines

Participants must submit their estimations for each configuration of the testing set in a text file, with one row per configuration and the following five columns:

  • estimated capacity of the container (in millilitres);
  • filling type (water, pasta or rice);
  • estimated fullness (0, 50 or 90);
  • mass of the filling (in grams);
  • mass of the empty container (in grams).

A technical paper describing the proposed method and following the ICME guidelines must be submitted through the ICME 2020 website. Further details will be provided after the release of the testing dataset.
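As an illustration, a submission file in the row-per-configuration format described above could be written as follows. The file name, the space delimiter and the example values are assumptions for the sketch, not part of the official specification.

```python
import csv

estimations = [
    # (capacity ml, filling type, fullness %, filling mass g, container mass g)
    (500.0, "water", 50, 250.0, 30.0),
    (300.0, "rice",  90, 210.0, 15.0),
    (400.0, "pasta",  0,   0.0, 80.0),
]

# One row per testing configuration, in the same order as the configurations.
with open("submission.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter=" ")
    for row in estimations:
        writer.writerow(row)
```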


Organisers

Apostolos Modas, Ecole polytechnique fédérale de Lausanne (EPFL), apostolos.modas@epfl.ch
Pascal Frossard, Ecole polytechnique fédérale de Lausanne (EPFL), pascal.frossard@epfl.ch
Andrea Cavallaro, Queen Mary University of London (UK), a.cavallaro@qmul.ac.uk
Ricardo Sanchez-Matilla, Queen Mary University of London (UK), ricardo.sanchezmatilla@qmul.ac.uk
Alessio Xompero, Queen Mary University of London (UK), a.xompero@qmul.ac.uk


Riccardo Mazzon, Queen Mary University of London (UK), r.mazzon@qmul.ac.uk