# The CORSMAL Containers Manipulation dataset

The dataset consists of 1,140 audio-visual recordings of 12 human subjects manipulating 15 containers, split into 5 cups, 5 drinking glasses, and 5 food boxes. The containers are made of different materials, such as plastic, glass, and paper. Each container can be empty or filled with water, rice, or pasta at two levels of fullness: 50% or 90% of the capacity of the container. The combination of containers and fillings results in a total of 95 configurations, acquired for three scenarios with an increasing level of difficulty caused by occlusions or subject motion:

* Scenario 1. The subject sits in front of the robot, while a container is on a table. The subject pours the filling into the container, trying not to touch the container, or shakes an already filled food box, and then initiates the handover of the container to the robot.
* Scenario 2. The subject sits in front of the robot while holding a container. The subject pours the filling from a jar into the glass/cup, or shakes an already filled food box, and then initiates the handover of the container to the robot.
* Scenario 3. The subject holds a container while standing to the side of the robot, potentially visible from only one third-person-view camera. The subject pours the filling from a jar into the glass/cup, or shakes an already filled food box, takes a few steps to reach the front of the robot, and then initiates the handover of the container to the robot.

Each scenario is recorded with two different backgrounds and under two different lighting conditions. In the first background condition the tabletop is plain and the subject wears a texture-less t-shirt; in the second, the table is covered with a graphics-printed tablecloth and the subject wears a patterned shirt. The lighting conditions are ceiling room lights and controlled lights. The 95 configurations are executed by a different subject for each scenario and for each background/illumination condition.

The dataset is split into a train set (9 containers, 684 configurations), a public test set (3 unseen containers, 228 configurations), and a private test set (3 unseen containers, 228 configurations). The containers of each set are evenly distributed among the three categories. For the train set, we provide the participants with annotations of the capacity of the container, the filling type, the filling level, the mass of the container, the mass of the filling, and the maximum width and height (and depth for boxes) of each container.

[CORSMAL Containers Manipulation webpage](http://corsmal.eecs.qmul.ac.uk/containers_manip.html)

## Download data

The data is provided as *.zip* files. Due to the size of some modalities, e.g. *depth*, their archives are provided as a sequence of zip parts (e.g., filename.z01, filename.z02) for each view. To unzip such an archive, you must first download all of its parts. To facilitate the download of the training and public test sets, scripts for Linux (*.sh*) and Windows (*.bat*) are provided. Given the large storage footprint of the dataset, we recommend downloading only the data you need: comment out the rows of the scripts corresponding to the files and modalities you do not need.
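If you prefer to fetch files programmatically instead of running the provided scripts, a minimal Python sketch is below. The URLs are placeholders (hypothetical paths): copy the actual links from the *.sh*/*.bat* scripts.

```
import urllib.request

# Placeholder URLs: the real links are listed in the provided .sh/.bat scripts.
urls = [
    "http://corsmal.eecs.qmul.ac.uk/path/to/filename.zip",
    "http://corsmal.eecs.qmul.ac.uk/path/to/filename.z01",
]

for url in urls:
    filename = url.rsplit("/", 1)[-1]
    print(f"Downloading {filename} ...")
    urllib.request.urlretrieve(url, filename)
```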
## Data organisation

The data is structured by camera view (1-4) and by modality: RGB, infrared and depth video, calibration, IMU, and audio. The IMU (Inertial Measurement Unit) data consists of accelerometer and gyroscope readings over time. Note that IMU data is provided only for *view4*.

For each set (train, public test, private test), the data is already shuffled and filenames are provided as incremental numbers, e.g.:

```
CCM
|----train / test_pub / test_priv
     |----audio
     |    |----000000.wav
     |    |----000001.wav
     |    |----...
     |----view1
     |    |----rgb
     |    |    |----000000.mp4
     |    |    |----000001.mp4
     |    |    |----...
     |    |----calib
     |    |    |----000000.pickle
     |    |    |----000001.pickle
     |    |    |----...
     |    |----infrared1
     |    |    |----000000.mp4
     |    |    |----...
     |    |----infrared2
     |    |    |----000000.mp4
     |    |    |----...
     |    |----depth
     |         |----000000
     |         |    |----000000.png
     |         |    |----000001.png
     |         |    |----...
     |         |----000001
     |         |    |----...
     |         |----...
     |----...
     |----view4
          |----accel
          |    |----000000.csv
          |    |----000001.csv
          |    |----...
          |----gyro
          |    |----000000.csv
          |    |----000001.csv
          |    |----...
          |----rgb
          |    |----...
          |----calib
          |    |----...
          |----infrared1
          |    |----...
          |----infrared2
          |    |----...
          |----depth
               |----000000
               |    |----...
               |----...
```

### RGB

Each file contains the RGB video in *.mp4* format. Videos are encoded with the x264 codec and visually lossless parameters.

### Infrared

Each file contains an infrared video in *.mp4* format: */infrared1* contains the videos of the left infrared camera, and */infrared2* the videos of the right infrared camera. Videos are encoded with the x264 codec and visually lossless parameters.

### Depth

Each directory contains the depth frames of a recording, one *.png* image per frame. Images are stored with 16-bit encoding and provide the estimated distance at each pixel. To obtain the distance in metres, divide the value by 1000.

### Audio

Each file contains the audio recording of a configuration in *.wav* format. Audio is recorded at 44.1 kHz with an 8-element circular microphone array of radius 15 cm. Microphone locations were manually measured with an uncertainty of up to 2 cm in each direction. Recordings were acquired in a university room at different times of the day over a period of 8 days; audio signals are therefore affected by background noise, such as office noise and outside noise (e.g., a busy street and wind).

### IMU

The accelerometer and gyroscope data of each configuration in *view4* are provided as separate *.csv* files. Both accelerometer and gyroscope files contain 4 columns: the timestamp, followed by the X, Y, and Z components.
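As an illustration of the formats described above, here is a minimal Python sketch that loads one depth frame, one audio recording, and one accelerometer stream. It assumes OpenCV, SciPy, and NumPy are installed, that paths are relative to a set directory (e.g. *CCM/train*), and that the IMU CSVs have no header row (adjust `skiprows` otherwise).

```
import cv2
import numpy as np
from scipy.io import wavfile

# Depth: 16-bit PNG; raw values assumed to be in millimetres,
# so dividing by 1000 gives the distance in metres.
raw = cv2.imread("view1/depth/000000/000000.png", cv2.IMREAD_UNCHANGED)
depth_m = raw.astype(np.float32) / 1000.0

# Audio: 8-channel recording at 44.1 kHz.
rate, audio = wavfile.read("audio/000000.wav")
print(rate, audio.shape)  # e.g. 44100, (num_samples, 8)

# IMU (view4 only): each row is timestamp, X, Y, Z.
accel = np.loadtxt("view4/accel/000000.csv", delimiter=",")
timestamps, xyz = accel[:, 0], accel[:, 1:]
```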
### Calibration

Calibration files are provided for each view, configuration, and set of the dataset. Note that a calibration file can be the same for multiple configurations (the position and orientation of the cameras did not change during acquisition). For the fixed views (left side of the robot, right side of the robot, and robot view, as shown in the image of the recording system), each calibration file provides

* the intrinsic parameters (focal lengths fx and fy; principal point cx and cy),
* the extrinsic parameters (3x3 rotation matrix and 3x1 translation vector) for each camera (RGB, IR1, IR2), and
* the annotated pixel locations of the centres of the 2 markers (2x2 checkerboards) placed on the wall.

For the body-worn view, only the intrinsic parameters are provided. Note that images are provided already undistorted by the RealSense D435i sensor, hence calibration files do not provide any distortion coefficients. The markers are present in the scene only for verification of the calibration.

Calibration files are provided in Pickle format; the sample Python code below shows how to read them.

```
import pickle

# Example file from the tree above, e.g. train/view1/calib/000000.pickle
with open("000000.pickle", "rb") as f:
    calibration = pickle.load(f)

intrinsic = calibration[0]['rgb']                      # 3x3 intrinsic matrix
extrinsic_rotation = calibration[1]['rgb']['rvec']     # 3x3 rotation matrix
extrinsic_translation = calibration[1]['rgb']['tvec']  # 3x1 translation vector
```

Intrinsic parameters are structured as:

```
fx  0 cx
 0 fy cy
 0  0  1
```

Rotation extrinsic parameters are structured as:

```
R11 R12 R13
R21 R22 R23
R31 R32 R33
```

Translation extrinsic parameters are structured as:

```
T1 T2 T3
```

### Annotation

The annotations are stored using JSON (file _ccm_train_annotations.json_). All annotations share the same basic data structure below:

```
{
    "info"           : info,
    "licenses"       : [license],
    "annotations"    : [annotation],
    "containers"     : [container],
    "filling_type"   : [filling_type],
    "filling_level"  : [filling_level],
    "views"          : [view],
    "transparencies" : [transparency],
    "scenarios"      : [scenario],
    "handover hand"  : [handover hand],
}

"info": {
    "description"  : str,
    "version"      : str,
    "year"         : int,
    "contributor"  : str,
    "date_created" : str,
    "url"          : str,
    "doi"          : str,
    "set"          : str
},

"licenses": {
    "url"  : str,
    "id"   : int,
    "name" : str
},

"annotations": [
    {
        "id"                       : int,
        "container id"             : int,
        "scenario"                 : int,
        "background"               : int,
        "illumination"             : int,
        "width at the top"         : float,
        "width at the bottom"      : float,
        "height"                   : float,
        "depth"                    : float,
        "container capacity"       : float,
        "container mass"           : float,
        "filling type"             : int,
        "filling level"            : int,
        "filling density"          : float,
        "filling mass"             : float,
        "object mass"              : float,
        "handover starting frame"  : int,
        "handover start timestamp" : float,
        "handover hand"            : int
    },
    ...
],

"containers": [
    {
        "id"           : int,
        "name"         : str,
        "material"     : str,
        "transparency" : int,
        "type"         : str,
    },
    ...
],

"filling_type": [
    {
        "id"   : int,
        "name" : str,
    },
    ...
],

"filling_level": [
    {
        "id"         : int,
        "name"       : str,
        "percentage" : str,
    },
    ...
],

"views": [
    {
        "id"   : int,
        "name" : str,
        "type" : str,
    },
    ...
],

"transparencies": [
    {
        "id"   : int,
        "name" : str,
    },
    ...
],

"scenarios": [
    {
        "id"   : int,
        "name" : str,
    },
    ...
],

"handover hand": [
    {
        "id"    : int,
        "name"  : str,
        "label" : str,
    },
    ...
]
```

## Enquiries, Questions and Comments

If you have any further enquiries, questions, or comments, please contact a.xompero@qmul.ac.uk or corsmal-challenge@qmul.ac.uk.

## Licence

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.