CORSMAL: Collaborative object recognition, shared manipulation and learning

The 2020 CORSMAL Challenge

Multi-modal fusion and learning for robotics

Program (15 Jan 2021, starts at 3pm CET)

3:00 pm CET
Welcome and opening

3:05 pm CET
Collaborative Object Recognition, Shared Manipulation and Learning
Andrea Cavallaro
Queen Mary University of London & Alan Turing Institute

3:20 pm CET
The CORSMAL Challenge
Alessio Xompero
Queen Mary University of London

3:35 pm CET
Top-1 CORSMAL Challenge 2020 submission: Filling mass estimation using multi-modal observations of human-robot handovers
[paper] [video] [slides] [arxiv] [code]

Because It's Tactile team (Tampere University, Finland; Queen Mary University of London, U.K.)


Vladimir Iashin	Francesca Palermo	Gokhan Solak	Claudio Coppola

3:45 pm CET
Audio-Visual Hybrid Approach for Filling Mass Estimation
[paper] [video] [slides] [code]

HVRL team (Keio University, Japan)


Reina Ishikawa	Yuichi Nagao	Ryo Hachiuma	Hideo Saito

3:55 pm CET
VA2Mass: Towards the Fluid Filling Mass Estimation via Integration of Vision & Audio
[paper] [video] [slides]

Concatenation team (City University of Hong Kong)


Qi Liu	Chuanlin Lan	Fan Feng	Rosa Chan

4:05 pm CET
Round-table

4:35 pm CET
Challenge leaderboard and next steps

Leaderboard

Overall task: Filling mass estimation

Task 1: Filling level classification

Task 2: Filling type classification

Task 3: Container capacity estimation

Legend:
- View 1: view from the fixed camera on the left side of the manipulator
- View 2: view from the fixed camera on the right side of the manipulator
- View 3: view from the fixed camera mounted on the manipulator (robot)
- View 4: view from the moving camera worn by the demonstrator (human)
- A: audio modality
- RGB: colour data
- D: depth data
- IR: infrared data from narrow-baseline stereo camera

Submissions

Team name

Because It's Tactile
HVRL
Concatenation
NTNU-ERC
Challengers

Organisers

Alessio Xompero, Queen Mary University of London (UK)
Andrea Cavallaro, Queen Mary University of London (UK)
Apostolos Modas, École polytechnique fédérale de Lausanne (Switzerland)
Aude Billard, École polytechnique fédérale de Lausanne (Switzerland)
Dounia Kitouni, Sorbonne Université (France)
Kaspar Althoefer, Queen Mary University of London (UK)
Konstantinos Chatzilygeroudis, University of Patras (Greece)
Nuno Ferreira Duarte, École polytechnique fédérale de Lausanne (Switzerland)
Pascal Frossard, École polytechnique fédérale de Lausanne (Switzerland)
Ricardo Sanchez-Matilla, Queen Mary University of London (UK)
Riccardo Mazzon, Queen Mary University of London (UK)
Véronique Perdereau, Sorbonne Université (France)

Team name:	Because It's Tactile
Task 1:	Audio + RGB from all views. GRU(VGGish) + GRU(R(2+1)d [RGB-only]) for each view, and RandomForest(classical audio features)
Task 2:	Audio. GRU(VGGish) and RandomForest
Task 3:	RGB + IR + Depth (left-side view). LoDE on the detector's predictions; if no object is detected, the training-set average is used
Summary:	Logits from the GRUs on top of R(2+1)d features are summed across all four views to form one prediction per event. This prediction is then averaged with the GRU output on top of VGGish features and with RandomForest predictions on top of 30+ classical audio features (e.g. MFCCs, chromagram, energy, spread). For container capacity, LoDE is applied with Mask R-CNN object detection (glass, bottle, or book for boxes) on frames 1 and 20 of the videos (view 1, RGB-D-IR). The training-set average is used if nothing is detected.
Paper:	Top-1 CORSMAL Challenge 2020 submission: Filling mass estimation using multi-modal observations of human-robot handovers
ArXiv:	https://arxiv.org/pdf/2012.01311.pdf
Code:	https://github.com/v-iashin/CORSMAL
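
The fusion described in the summary above can be sketched roughly as follows. This is a minimal illustration and not the team's code: the function names, the softmax step before averaging, and the equal weighting of the three sources are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(view_logits, audio_gru_probs, forest_probs):
    """Late fusion of visual and audio predictions for one event.

    view_logits: one list of class logits per camera view (hypothetical input).
    audio_gru_probs, forest_probs: class probabilities from the two audio models.
    """
    # Sum the logits class by class over all camera views.
    summed = [sum(per_class) for per_class in zip(*view_logits)]
    # Assumption: the summed logits are turned into probabilities before averaging.
    visual_probs = softmax(summed)
    # Assumption: the three sources are averaged with equal weights.
    return [
        (v + a + f) / 3.0
        for v, a, f in zip(visual_probs, audio_gru_probs, forest_probs)
    ]


def predict_class(fused_probs):
    """Return the index of the most probable class."""
    return max(range(len(fused_probs)), key=fused_probs.__getitem__)
```

For example, `predict_class(fuse_predictions(views, audio, forest))` returns the class index after combining four per-view logit lists with two audio probability vectors.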

Team name:	HVRL
Task 1:	Audio. Intermediate features are extracted from the Task 2 prediction model and passed through LSTM models.
Task 2:	Audio. The raw audio waveform is converted into a log-Mel spectrogram, which is cropped to a fixed size and provided as input to a convolutional neural network with a VGG backbone.
Task 3:	RGB + Depth from view 1: fixed camera on the left side of the manipulator (robot). Mask R-CNN detects the target object (silhouette) and a point cloud is obtained from a selected frame in the video. The volume of the container is then computed by approximating the object shape as a cuboid from the point cloud.
Paper:	Audio-Visual Hybrid Approach for Filling Mass Estimation
Code:	https://github.com/YuichiNAGAO/ICPRchallenge2020
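
The cuboid approximation in Task 3 above could look roughly like the sketch below. It assumes an axis-aligned bounding box and point coordinates in millimetres; the function name and both assumptions are illustrative, and the team's actual fitting procedure may differ.

```python
def cuboid_volume_ml(points_mm):
    """Approximate a container volume from a point cloud as a bounding cuboid.

    points_mm: iterable of (x, y, z) tuples in millimetres (assumed unit).
    Returns the volume in millilitres.
    """
    points = list(points_mm)
    if not points:
        raise ValueError("point cloud is empty")
    xs, ys, zs = zip(*points)
    # Assumption: the cuboid is axis-aligned, so its sides are the
    # coordinate ranges of the point cloud.
    width = max(xs) - min(xs)
    depth = max(ys) - min(ys)
    height = max(zs) - min(zs)
    # 1 mL equals 1000 cubic millimetres.
    return width * depth * height / 1000.0
```

A cloud spanning 50 mm by 60 mm by 100 mm gives 300 mL under these assumptions.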

Team name:	Concatenation
Task 1:	Audio and RGB from all views. Audio feature learning is integrated with knowledge of the container category obtained from a pre-trained object detection model.
Task 2:	Audio and RGB from all views. Audio feature learning is integrated with knowledge of the container category obtained from a pre-trained object detection model.
Task 3:	RGB from all views. Sample from the shape distribution based on the prior of container categories.
Summary:	The solution is divided into three stages that help the agent build a rich understanding of the pouring procedure. First, the agent obtains a prior over container categories (cup, glass, box) from the object detection framework. Second, audio features are integrated with this prior so that the agent learns a multi-modal feature space. Finally, the agent infers the distribution of both the container capacity and the fluid properties.
Paper:	VA2Mass: Towards the Fluid Filling Mass Estimation via Integration of Vision & Audio

Team name:	NTNU-ERC
Task 1:	N/A
Task 2:	Audio
Task 3:	Depth from view 3: the fixed camera mounted on the manipulator (robot)
Summary:	40 normalized MFCC features are extracted with a window size of 20 ms at 22 kHz, with a maximum length of 30 s and zero-padding so that all audio data have the same duration. Filling type is classified with a neural network consisting of 2 convolutional layers and 1 linear layer. Container capacity is regressed by extracting a region of interest (ROI) around the object localised in the depth data (view 3) and providing the ROI and its size to a neural network (4 convolutional-batchnorm layers followed by 3 linear layers). The size of the ROI is concatenated to the features between the 2nd and 3rd linear layers. The video is processed backwards and only detections/ROIs within 700 mm of the camera are considered (prior knowledge that the person will extend the arm towards the robot). If a frame contains multiple detections, the closest contour is selected.
Report:	NTNU-ERC Report
Code:	https://github.com/guichristmann/CORSMAL-Challenge-2020-Submission
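
The ROI selection rule in the summary above might be sketched as follows. The data layout (a list of per-frame detections, each a `(distance_mm, roi)` tuple) and the choice to stop at the first qualifying frame are assumptions made for illustration, not details confirmed by the team.

```python
def select_roi(frames, max_distance_mm=700.0):
    """Pick a detection by scanning the video from the last frame to the first.

    frames: chronological list; each item is a list of (distance_mm, roi)
            tuples for that frame (hypothetical layout).
    Returns (frame_index, (distance_mm, roi)) or None if nothing qualifies.
    """
    for index in range(len(frames) - 1, -1, -1):
        # Keep only detections within the distance threshold.
        nearby = [det for det in frames[index] if det[0] <= max_distance_mm]
        if nearby:
            # With several detections in one frame, take the closest one.
            return index, min(nearby, key=lambda det: det[0])
    return None
```

Scanning backwards reflects the stated prior that the person extends the arm towards the robot near the end of the recording.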