The 2020 CORSMAL Challenge

Multi-modal fusion and learning for robotics


Organised in conjunction with the 25th International Conference on Pattern Recognition 2020, Milan, Italy



Call for participation

If you are interested in participating, please register with this form.
Info and queries:


A major challenge for human-robot cooperation in household chores is enabling robots to predict the properties of previously unseen containers with different fillings. Examples of containers are cups, glasses, mugs, bottles and food boxes, whose varying physical properties such as material, stiffness, texture, transparency, and shape must be inferred on-the-fly prior to a pick-up or a handover.

The challenge focuses on the estimation of the capacity and mass of containers, as well as the type, mass and percentage of the content, if any. Participants will determine the physical properties of a container while it is manipulated by a human. Containers vary in their physical properties (shape, material, texture, transparency, and deformability). Containers and fillings are not known to the robot: the only prior information available is a set of object categories (glasses, cups, food boxes) and a set of filling types (water, pasta, rice). Previous and related challenges/benchmarks, such as the Amazon Picking Challenge or the Yale-CMU-Berkeley (YCB) benchmark, focus on tasks where robots interact with objects on a table and without the presence of a human, for example grasping objects, table setting, stacking cups, or assembling/disassembling objects.

Advancements in this research field will help the integration of smart robots into people's lives to perform daily activities involving objects and handovers. For example, this is a step towards supporting people with disabilities in performing everyday activities.

CORSMAL distributes an audio-visual-inertial dataset of people interacting with containers, for example while pouring a filling into a glass or shaking a food box. The dataset is collected with four multi-sensor devices (one on a robotic arm, one on the human chest and two third-person views) and a circular microphone array. Each device is equipped with an RGB camera, a stereo infrared camera and an inertial measurement unit. In addition to RGB and infrared images, each device provides synchronised depth images that are spatially aligned with the RGB images. All signals are synchronised, and the calibration information for all devices, as well as the inertial measurements of the body-worn device, is also provided.

The CORSMAL Challenge includes three scenarios with an increasing level of difficulty, caused by occlusions or subject motions:

  • Scenario 1. The subject sits in front of the robot, while a container is on a table. The subject pours the filling into the container, while trying not to touch the container, or shakes an already filled food box, and then initiates the handover of the container to the robot.

  • Scenario 2. The subject sits in front of the robot, while holding a container. The subject pours the filling from a jar into a glass/cup or shakes an already filled food box, and then initiates the handover of the container to the robot.

  • Scenario 3. A container is held by the subject while standing to the side of the robot, potentially visible from one third-person view camera only. The subject pours the filling from a jar into a glass/cup or shakes an already filled food box, takes a few steps to reach the front of the robot and then initiates the handover of the container to the robot.

Each scenario is recorded with two different backgrounds and under two different lighting conditions.



During a human-to-robot handover, the robot should be able to determine whether a container-like object is filled with content and, therefore, estimate how heavy the object is based on the amount and type of content (filling). This estimation will enable the robot to apply the right amount of force when taking the object, avoiding a handover failure or spilling content from the object.

We define the weight of the object handled by the human as the sum of the mass of an unseen container and the mass of the unknown filling within the container. However, we focus the challenge only on the estimation of the mass of the filling (the overall task). To estimate the mass of the filling, we expect a perception system to reason about the capacity of the container and to determine the type and amount of filling present in the container.
Therefore, we identify three tasks for the participants:

  • Task 1 (T1) Fullness classification. Containers can be either empty or filled with an unknown content at 50% or 90% of the whole capacity of the container (fullness). For each recording related to a container, the goal is to classify the fullness. There are three fullness classes: 0 (empty), 50 (half full), and 90 (full).
  • Task 2 (T2) Filling type classification. Containers can be either empty or filled with an unknown content. For each recording related to a container, the goal is to classify the type of filling, if any. There are four filling type classes: 0 (empty), 1 (pasta), 2 (rice), 3 (water).
  • Task 3 (T3) Container capacity estimation. Containers vary in shape and size. For each recording related to a container, the goal is to estimate the capacity of the container.

For each recording related to a container, we then compute the filling mass estimate using the estimations of fullness from T1, filling type from T2, and container capacity from T3, together with the prior density of each filling type per container. The densities of pasta and rice are computed from the annotations of filling mass, container capacity, and fullness for each container; the density of water is 1 g/mL. The formula selects the annotated density for a container based on the estimated filling type.
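The combination of the three task outputs can be sketched as follows. This is an illustrative reading of the rules, not the official evaluation code: the function names are our own, and the pasta and rice densities below are placeholder values (in the challenge, they are derived from the training-set annotations per container).

```python
# Density prior in g/mL, indexed by the estimated filling type class.
FILLING_DENSITY = {
    0: 0.0,   # empty: no filling, no mass
    1: 0.41,  # pasta (placeholder; annotated per container in practice)
    2: 0.85,  # rice  (placeholder; annotated per container in practice)
    3: 1.0,   # water
}

def filling_mass(fullness_pct: int, filling_type: int, capacity_ml: float) -> float:
    """Filling mass (g) = fullness fraction x container capacity x filling density."""
    return (fullness_pct / 100.0) * capacity_ml * FILLING_DENSITY[filling_type]
```

For example, a 500 mL container estimated as half full of water yields an estimated filling mass of 250 g.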

The CORSMAL Containers Manipulation dataset

[Example recordings from Camera 1, Camera 2, Camera 3, Camera 4, and the microphone array]

The dataset consists of 1140 audio-visual-inertial recordings of people interacting with 15 containers, acquired with 4 cameras (RGB, depth, and infrared) and an 8-element circular microphone array. Containers are either empty or filled at 2 different levels (50%, 90%) with 3 different types of content (water, pasta, rice). For example, people can pour a liquid into a glass/cup or shake a food box.

The dataset is split into a training set (9 containers), a public testing set (3 containers), and a private testing set (3 containers). We provide participants with annotations of the container capacity, filling type, fullness, container mass, and filling mass for the training set. No annotations will be provided for the public testing set, and the private testing set will not be released to the participants. The containers of each set are evenly distributed among the three container types.

Webpage to access and download the dataset.


Any individual or research group can download the dataset and participate in the challenge. The only prior knowledge available to the models is the set of filling types (water, rice, and pasta), the fullness levels (empty, 50%, and 90%), and the high-level categories of the containers (cup, glass, food box). The organisers do not allow the use of prior 3D object models. Estimations must be generated automatically by a model that uses as input any of the provided data modalities or a combination of them (for example, video, audio, or audio-visual fusion); non-automatic manipulation of the testing data (e.g., manual selection of frames) is not allowed. The use of additional training data is allowed, but the provided testing sets cannot be used for training. Models must perform the estimations for each testing sequence using only data from that sequence and the training set, not from other sequences; learning (e.g., model fine-tuning) across testing sequences is not allowed.

Starting kit and documentation

- R. Sanchez-Matilla, K. Chatzilygeroudis, A. Modas, N. Ferreira Duarte, A. Xompero, P. Frossard, A. Billard, and A. Cavallaro, Benchmark for Human-to-Robot Handovers of Unseen Containers with Unknown Filling, IEEE Robotics and Automation Letters (RA-L), vol. 5, no. 2, Apr. 2020
- Baseline: a vision-based algorithm, part of a larger system, proposed for localising, tracking and estimating the dimensions of a container with a stereo camera. See more details at the benchmark page
- A. Xompero, R. Sanchez-Matilla, A. Modas, P. Frossard, and A. Cavallaro, Multi-view shape estimation of transparent containers, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4-8 May 2020
- LoDE: a method that jointly localises container-like objects and estimates their dimensions with a generative 3D sampling model and a multi-view 3D-2D iterative shape fitting, using two wide-baseline, calibrated RGB cameras. See more details at the LoDE page
- Baseline for fullness classification: coming soon
- Baseline for filling type classification: coming soon
- Script to pre-process the dataset: coming soon
- Evaluation script: coming soon


Performance scores

For fullness classification (Task 1) and filling type classification (Task 2), the organisers compute the Weighted Average F1-score (WAFS) across the classes, each class weighted by its number of recordings. For container capacity estimation (Task 3), the organisers compute the relative absolute error between the estimated and the annotated capacity for each configuration; the score for a configuration is the exponential of the negative relative absolute error, and the Average Capacity Score (ACS) is the average score across all configurations. For filling mass estimation, the organisers compute the relative absolute error between the estimated and the annotated filling mass for each configuration; when the annotated mass is zero (empty container), the error is zero if the estimation is also zero, and otherwise equal to the estimation. The Average filling Mass Score (AMS) is then the average score across all configurations, computed as for the container capacity. If the capacity of the container or the filling mass in a configuration is not estimated (flagged with the value -1), the score for that configuration is set to zero.

See the document for technical details on the performance measures.
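The per-configuration scoring for capacity and filling mass can be sketched as below. This is our reading of the rules above, not the official evaluation script; the function name is our own.

```python
import math

def average_score(estimates, annotations):
    """Mean over configurations of exp(-relative absolute error).

    Missing estimates are flagged as -1 and score 0. When the annotated
    value is 0 (empty container, in the filling mass case), the error is
    0 if the estimate is also 0, and otherwise the estimate itself.
    """
    scores = []
    for est, true in zip(estimates, annotations):
        if est == -1:
            scores.append(0.0)  # not estimated: score is zero
            continue
        err = abs(est - true) / true if true > 0 else abs(est)
        scores.append(math.exp(-err))
    return sum(scores) / len(scores)
```

A perfect estimate scores 1, the score decays towards 0 as the relative error grows, and a missing estimate contributes 0 to the average.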

Submission guidelines

Participants must submit the estimations for each configuration of the public testing set in a CSV file by email. Each row corresponds to one configuration: the first two columns are the object id and the sequence id, the third column is the estimated fullness class (0, 50 or 90), the fourth column is the estimated filling type class (0: empty, 1: pasta, 2: rice, 3: water), and the fifth column is the estimated capacity of the container in millilitres. Participants can compete in any one of the three tasks or any combination of them. Columns related to tasks not addressed should be filled with -1 values; method failures and configurations not addressed should also be filled with -1 values.
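A short sketch of the expected row layout follows. The object and sequence ids below are made up for illustration only; -1 marks a task not addressed or a configuration the method failed on.

```python
import csv
import io

# Hypothetical submission rows: (object id, sequence id, fullness class,
# filling type class, capacity in mL).
rows = [
    (1, 0, 50, 3, 320),   # half full of water, capacity estimated as 320 mL
    (1, 1, 90, 1, -1),    # full of pasta, capacity not estimated
    (2, 0, -1, -1, 500),  # T1 and T2 not addressed, capacity only
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```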

Participants should also include in the body of the email:
- Team name
- Modalities employed (RGB, IR, Depth, Audio, IMU)
- Number of views employed (1, 2, 3 or 4) and which ones (demonstrator, manipulator, left side-view, right side-view)
- Completed task(s) (T1, T2, T3)
- Execution time in seconds for the complete execution of the public test.

Participants will submit the source code and executable files, which the organisers will run on the private testing set to generate the estimations of the same properties for each configuration. The source code should be properly commented and easy to run. The source code will be deleted by the organisers after the release of the results. Participants who do not submit their software for the evaluation on the private testing set will not be included in the leaderboard.

If one (or two) of the tasks is not submitted by a participant, the organisers will use baseline results for the tasks not submitted when calculating the filling mass estimate. This will show the performance of the participant's method relative to the baseline methods.


To be updated soon with baseline results.

- V1: view from the moving camera worn by the demonstrator (human)
- V2: view from the fixed camera mounted on the manipulator (robot)
- V3: view from the fixed camera on the left side of the manipulator
- V4: view from the fixed camera on the right side of the manipulator

The evaluation is based on the results of both testing sets and participants will be ranked across all configurations. The organisers will declare the winner of the challenge based on the score on the overall task. The best-performing entries will be presented at the conference venue. Selected participants will be invited to co-author the writing of a paper to discuss and analyse the research outcomes of the challenge.

Important dates (To be confirmed)

Release of the public testing set: 25 September 2020
Results on the public testing set and source code of models due: 10 October 2020
Release of the results via an on-line leader board and selected submissions: 30 November 2020
Challenge presentation at ICPR: 10-15 January 2021


Alessio Xompero, Queen Mary University of London (UK)
Andrea Cavallaro, Queen Mary University of London (UK)
Apostolos Modas, École polytechnique fédérale de Lausanne (Switzerland)
Aude Billard, École polytechnique fédérale de Lausanne (Switzerland)
Dounia Kitouni, Sorbonne Université (France)
Kaspar Althoefer, Queen Mary University of London (UK)
Konstantinos Chatzilygeroudis, École polytechnique fédérale de Lausanne (Switzerland)
Nuno Ferreira Duarte, École polytechnique fédérale de Lausanne (Switzerland)
Pascal Frossard, École polytechnique fédérale de Lausanne (Switzerland)
Ricardo Sanchez-Matilla, Queen Mary University of London (UK)
Riccardo Mazzon, Queen Mary University of London (UK)
Véronique Perdereau, Sorbonne Université (France)