# The CORSMAL Containers Manipulation dataset

The dataset consists of 1,140 audio-visual recordings of 12 human subjects manipulating 15 containers, split into 5 cups, 5 drinking glasses, and 5 food boxes. The containers are made of different materials, such as plastic, glass, and paper. Each container can be empty or filled with water, rice, or pasta at two levels of fullness: 50% or 90% of the capacity of the container. The combination of containers and fillings results in a total of 95 configurations, acquired for three scenarios with an increasing level of difficulty caused by occlusions or subject motion:

* Scenario 1. The subject sits in front of the robot, while a container is on a table. The subject pours the filling into the container, trying not to touch the container, or shakes an already filled food box, and then initiates the handover of the container to the robot.
* Scenario 2. The subject sits in front of the robot while holding a container. The subject pours the filling from a jar into the glass/cup, or shakes an already filled food box, and then initiates the handover of the container to the robot.
* Scenario 3. The subject holds a container while standing to the side of the robot, potentially visible from only one third-person-view camera. The subject pours the filling from a jar into the glass/cup, or shakes an already filled food box, takes a few steps to reach the front of the robot, and then initiates the handover of the container to the robot.

Each scenario is recorded with two different backgrounds and under two different lighting conditions. In the first background condition the tabletop is plain and the subject wears a texture-less t-shirt; in the second, the table is covered with a graphics-printed tablecloth and the subject wears a patterned shirt. The lighting conditions are ceiling room lights and controlled lights. The 95 configurations are executed by a different subject for each scenario and for each background/illumination condition.

The dataset is split into a train set (9 containers, 684 configurations), a public test set (3 unseen containers, 228 configurations), and a private test set (3 unseen containers, 228 configurations). The containers of each set are evenly distributed among the three categories. For the train set, we provide the participants with annotations of the capacity of the container, the filling type, the filling level, the mass of the container, the mass of the filling, and the maximum width and height (and depth for boxes) of each container.

[CORSMAL Containers Manipulation webpage](http://corsmal.eecs.qmul.ac.uk/containers_manip.html)

## Download data

The data is provided as *.zip* files. Due to the size of some modalities, e.g. *depth*, their archives are provided as a sequence of zip parts (e.g., filename.z01, filename.z02) for each view. To unzip such an archive, you must first download all of its parts. To facilitate the download of the training and public test sets, scripts for Linux (*.sh*) and Windows (*.bat*) are provided. Given the large storage footprint of the dataset, we recommend downloading only the data you need: comment out the rows of the scripts corresponding to the files and modalities you do not need.
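If you prefer to fetch files programmatically instead of running the provided scripts, a minimal Python sketch is below. The URLs are placeholders (hypothetical paths): copy the actual links from the *.sh*/*.bat* scripts.

```
import urllib.request

# Placeholder URLs: the real links are listed in the provided .sh/.bat scripts.
urls = [
    "http://corsmal.eecs.qmul.ac.uk/path/to/filename.zip",
    "http://corsmal.eecs.qmul.ac.uk/path/to/filename.z01",
]

for url in urls:
    filename = url.rsplit("/", 1)[-1]
    print(f"Downloading {filename} ...")
    urllib.request.urlretrieve(url, filename)
```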
## Data organisation

The data is structured by camera view (1-4) and by modality: RGB, infrared and depth video, calibration, IMU, and audio. The IMU (Inertial Measurement Unit) data consists of accelerometer and gyroscope readings over time. Note that IMU data is provided only for *view4*.

For each set (train, public test, private test), the data is already shuffled and filenames are provided as incremental numbers, e.g.:

```
CCM
|----train / test_pub / test_priv
     |----audio
     |    |----000000.wav
     |    |----000001.wav
     |    |----...
     |----view1
     |    |----rgb
     |    |    |----000000.mp4
     |    |    |----000001.mp4
     |    |    |----...
     |    |----calib
     |    |    |----000000.pickle
     |    |    |----000001.pickle
     |    |    |----...
     |    |----infrared1
     |    |    |----000000.mp4
     |    |    |----...
     |    |----infrared2
     |    |    |----000000.mp4
     |    |    |----...
     |    |----depth
     |         |----000000
     |         |    |----000000.png
     |         |    |----000001.png
     |         |    |----...
     |         |----000001
     |         |    |----...
     |         |----...
     |----...
     |----view4
          |----accel
          |    |----000000.csv
          |    |----000001.csv
          |    |----...
          |----gyro
          |    |----000000.csv
          |    |----000001.csv
          |    |----...
          |----rgb
          |    |----...
          |----calib
          |    |----...
          |----infrared1
          |    |----...
          |----infrared2
          |    |----...
          |----depth
               |----000000
               |    |----...
               |----...
```

### RGB

Each file contains the RGB video in *.mp4* format. Videos are encoded with the x264 codec and visually lossless parameters.

### Infrared

Each file contains an infrared video in *.mp4* format: */infrared1* contains the videos of the left infrared camera, and */infrared2* the videos of the right infrared camera. Videos are encoded with the x264 codec and visually lossless parameters.

### Depth

Each directory contains the depth frames of a recording, one *.png* image per frame. Images are stored with 16-bit encoding and provide the estimated distance at each pixel. To obtain the distance in metres, divide the value by 1000.

### Audio

Each file contains the audio recording of a configuration in *.wav* format. Audio is recorded at 44.1 kHz with an 8-element circular microphone array of radius 15 cm. Microphone locations were manually measured with an uncertainty of up to 2 cm in each direction. Recordings were acquired in a university room at different times of the day over a period of 8 days; audio signals are therefore affected by background noise, such as office noise and outside noise (e.g., a busy street and wind).

### IMU

The accelerometer and gyroscope data of each configuration in *view4* are provided as separate *.csv* files. Both accelerometer and gyroscope files contain 4 columns: the timestamp, followed by the X, Y, and Z components.
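As an illustration of the formats described above, here is a minimal Python sketch that loads one depth frame, one audio recording, and one accelerometer stream. It assumes OpenCV, SciPy, and NumPy are installed, that paths are relative to a set directory (e.g. *CCM/train*), and that the IMU CSVs have no header row (adjust `skiprows` otherwise).

```
import cv2
import numpy as np
from scipy.io import wavfile

# Depth: 16-bit PNG; raw values assumed to be in millimetres,
# so dividing by 1000 gives the distance in metres.
raw = cv2.imread("view1/depth/000000/000000.png", cv2.IMREAD_UNCHANGED)
depth_m = raw.astype(np.float32) / 1000.0

# Audio: 8-channel recording at 44.1 kHz.
rate, audio = wavfile.read("audio/000000.wav")
print(rate, audio.shape)  # e.g. 44100, (num_samples, 8)

# IMU (view4 only): each row is timestamp, X, Y, Z.
accel = np.loadtxt("view4/accel/000000.csv", delimiter=",")
timestamps, xyz = accel[:, 0], accel[:, 1:]
```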
### Calibration

Calibration files are provided for each view, configuration, and set of the dataset. Note that a calibration file can be the same for multiple configurations (the position and orientation of the cameras did not change during acquisition). For the fixed views (left side of the robot, right side of the robot, and robot view, as shown in the image of the recording system), each calibration file provides

* the intrinsic parameters (focal lengths fx and fy; principal point cx and cy),
* the extrinsic parameters (3x3 rotation matrix and 3x1 translation vector) for each camera (RGB, IR1, IR2), and
* the annotated pixel locations of the centres of the 2 markers (2x2 checkerboards) placed on the wall.

For the body-worn view, only the intrinsic parameters are provided. Note that images are provided already undistorted by the RealSense D435i sensor, hence calibration files do not provide any distortion coefficients. The markers are present in the scene only for verification of the calibration.

Calibration files are provided in Pickle format; the sample Python code below shows how to read them.

```
import pickle

# Example file from the tree above, e.g. train/view1/calib/000000.pickle
with open("000000.pickle", "rb") as f:
    calibration = pickle.load(f)

intrinsic = calibration[0]['rgb']                      # 3x3 intrinsic matrix
extrinsic_rotation = calibration[1]['rgb']['rvec']     # 3x3 rotation matrix
extrinsic_translation = calibration[1]['rgb']['tvec']  # 3x1 translation vector
```

Intrinsic parameters are structured as:

```
fx  0 cx
 0 fy cy
 0  0  1
```

Rotation extrinsic parameters are structured as:

```
R11 R12 R13
R21 R22 R23
R31 R32 R33
```

Translation extrinsic parameters are structured as:

```
T1 T2 T3
```

### Annotation

The annotations are stored using JSON (file _ccm_train_annotations.json_). All annotations share the same basic data structure below:

```
{
    "info"           : info,
    "licenses"       : [license],
    "annotations"    : [annotation],
    "containers"     : [container],
    "filling_type"   : [filling_type],
    "filling_level"  : [filling_level],
    "views"          : [view],
    "transparencies" : [transparency],
    "scenarios"      : [scenario],
    "handover hand"  : [handover hand],
}

"info": {
    "description"  : str,
    "version"      : str,
    "year"         : int,
    "contributor"  : str,
    "date_created" : str,
    "url"          : str,
    "doi"          : str,
    "set"          : str
},

"licenses": {
    "url"  : str,
    "id"   : int,
    "name" : str
},

"annotations": [
    {
        "id"                       : int,
        "container id"             : int,
        "scenario"                 : int,
        "background"               : int,
        "illumination"             : int,
        "width at the top"         : float,
        "width at the bottom"      : float,
        "height"                   : float,
        "depth"                    : float,
        "container capacity"       : float,
        "container mass"           : float,
        "filling type"             : int,
        "filling level"            : int,
        "filling density"          : float,
        "filling mass"             : float,
        "object mass"              : float,
        "handover starting frame"  : int,
        "handover start timestamp" : float,
        "handover hand"            : int
    },
    ...
],

"containers": [
    {
        "id"           : int,
        "name"         : str,
        "material"     : str,
        "transparency" : int,
        "type"         : str,
    },
    ...
],

"filling_type": [
    {
        "id"   : int,
        "name" : str,
    },
    ...
],

"filling_level": [
    {
        "id"         : int,
        "name"       : str,
        "percentage" : str,
    },
    ...
],

"views": [
    {
        "id"   : int,
        "name" : str,
        "type" : str,
    },
    ...
],

"transparencies": [
    {
        "id"   : int,
        "name" : str,
    },
    ...
],

"scenarios": [
    {
        "id"   : int,
        "name" : str,
    },
    ...
],

"handover hand": [
    {
        "id"    : int,
        "name"  : str,
        "label" : str,
    },
    ...
]
```

## Enquiries, Questions and Comments

If you have any further enquiries, questions, or comments, please contact a.xompero@qmul.ac.uk or corsmal-challenge@qmul.ac.uk.

## Licence

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.