Human-to-Robot Handovers
(Essential Skills Sub-Track 4)

Solution design and development
  • Inferences must be generated automatically by a model that uses images or videos as input
  • No prior knowledge of the specific objects is available: the only prior available to the models is the high-level set of container categories (cup, drinking glass, food box).
  • The use of prior 3D object models (e.g., reconstructed 3D shapes of the containers) is not allowed.
  • Each subject is instructed to use only one hand, and always the same one.
  • The location where the robot must deliver the container must be inferred via perception from the initial location, not hard-coded (to be confirmed).
  • Learning across executions of configurations is not allowed: participants must not update or fine-tune the vision/robotic algorithms using data or measurements captured while the subjects execute the benchmark configurations (i.e., at test time).
  • Online solutions - i.e., solutions that can run on a continuous stream, as in a human-to-robot handover - are preferred (see the sketch after this list). To encourage this type of solution, the organisers will refer to the CORSMAL real-to-simulation framework, which allows participants to observe how their models would perform in a human-to-robot handover.
  • Calls to existing Large Language Models are allowed.
  • Tactile sensors are allowed as part of the solution.
  • Both RGB and RGB-D inputs can be used for the solution.
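
As a concrete illustration of the preferred online setting, the following is a minimal sketch of a per-frame inference loop over an RGB(-D) stream with a fixed, pre-trained model and no test-time updates. All names here (frame_source, estimate_container_properties, Estimate) are hypothetical placeholders, not part of the benchmark or the CORSMAL framework API.

    # Minimal sketch of an online, per-frame inference loop.
    # All function and class names are illustrative assumptions.
    from dataclasses import dataclass
    from typing import Iterator, Optional, Tuple

    import numpy as np


    @dataclass
    class Estimate:
        category: str                             # one of the allowed priors: cup / drinking glass / food box
        grasp_point: Tuple[float, float, float]   # 3D point in the camera frame
        confidence: float


    def frame_source() -> Iterator[Tuple[np.ndarray, Optional[np.ndarray]]]:
        """Yield (rgb, depth) pairs; depth is None for RGB-only setups."""
        while True:
            rgb = np.zeros((480, 640, 3), dtype=np.uint8)    # placeholder frame
            depth = np.zeros((480, 640), dtype=np.float32)   # placeholder depth
            yield rgb, depth


    def estimate_container_properties(rgb: np.ndarray,
                                      depth: Optional[np.ndarray]) -> Estimate:
        """Hypothetical perception model: fixed weights, no test-time learning."""
        return Estimate(category="cup", grasp_point=(0.0, 0.0, 0.5), confidence=0.9)


    def run_online(max_frames: int = 5) -> None:
        for i, (rgb, depth) in enumerate(frame_source()):
            if i >= max_frames:
                break
            est = estimate_container_properties(rgb, depth)  # per-frame inference only
            # hand `est` to the robot controller; the model is never updated here


    if __name__ == "__main__":
        run_online()

The loop consumes one frame at a time and never writes back to the model, which is consistent with the no-learning-at-test-time rule above and with either RGB or RGB-D input.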