This dataset features more than 7 hours of footage across 360 videos, capturing the natural gestures people use when collaborating to complete a shared task.

We recorded 20 pairs of people who were put in separate rooms and connected via video chat. One person (the actor) was given blocks, and the other (the signaler) was given a picture showing an arrangement of blocks. The signaler’s goal was to get the actor to replicate the arrangement of blocks.

On some trials, the participants could not hear each other and had to communicate through gestures alone. Other trials included sound, so gestures were used to supplement spoken language.

For more information about the dataset including information about downloading the dataset, please visit

Naturally Occurring Gestures

The participants were given no instructions about gesturing, and consequently the gestures used are those that occur naturally. These gestures are distinct from the stylized gestures common in video gaming, or the highly structured gestures found in sign languages.

Continuous, Labeled Data

Each video contains the entirety of a task, capturing all gestures and instructions between two people. Gestures have been hand-labeled in two ways: as individual body-part motions with orientation (e.g. “RA: move, up; RH: into two, front;”), and by perceived intent (e.g. “two new blocks”).
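
As a rough illustration of how a body-part label string like the one above might be read programmatically, here is a minimal Python sketch. It assumes each entry has the form "PART: motion, orientation" and that entries are separated by semicolons, which is inferred only from the single example shown here. The names `PartMotion` and `parse_gesture_label` are hypothetical and not part of the dataset's tooling, and the meaning of abbreviations such as "RA" and "RH" should be checked against the dataset's own documentation.

```python
from typing import List, NamedTuple


class PartMotion(NamedTuple):
    """One labeled motion of a single body part (hypothetical structure)."""
    body_part: str    # abbreviation as written in the label, e.g. "RA"
    motion: str       # e.g. "move"
    orientation: str  # e.g. "up"; empty string if no orientation is given


def parse_gesture_label(label: str) -> List[PartMotion]:
    """Split a label string into one PartMotion per ';'-separated entry.

    Assumes the "PART: motion, orientation;" layout seen in the example label.
    """
    entries = []
    for chunk in label.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece left by a trailing ';'
        body_part, _, description = chunk.partition(":")
        motion, _, orientation = description.partition(",")
        entries.append(
            PartMotion(body_part.strip(), motion.strip(), orientation.strip())
        )
    return entries


parsed = parse_gesture_label("RA: move, up; RH: into two, front;")
for entry in parsed:
    print(entry.body_part, "|", entry.motion, "|", entry.orientation)
```

Keeping each entry as a small named tuple makes it easy to filter labels by body part or motion later. The actual label files may use a different layout, so the splitting rules would need adjusting to match.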

High-Quality Multi-Modal Data

This dataset includes registered RGB video, depth, and skeleton joint position data from the Microsoft Kinect v2 sensor, which is significantly more accurate than the first-generation Kinect.