Launched in November 2010, the Xbox 360 Kinect Sensor set a world record for the fastest-selling consumer electronics device, selling ten million units in its first four months alone. Natural user interface (NUI) experiences using Kinect have proved to be a compelling and inspiring feature of this console generation. Xbox One (Durango) provides the opportunity to push innovative NUI further. Every Xbox One (Durango) console will come with a next-generation Kinect sensor, so developers can integrate Kinect functionality with confidence that it provides a great experience for all Durango users.

Durango Sensor

The next-generation sensor improves on the current sensor in many areas:

  • Improved field of view results in a much larger play space.
  • RGB stream is higher quality and higher resolution.
  • Depth stream is much higher resolution and able to resolve much smaller objects.
  • Higher depth stream accuracy enables separating objects in close depth proximity.
  • Higher depth stream accuracy captures depth curvature around edges better.
  • Active infrared (IR) stream permits lighting-independent processing and feature recognition.
  • End-to-end pipeline latency is improved by about 33 ms.

Sensor Characteristics Summary vs Xbox 360 Kinect Sensor

Feature | Xbox 360 Kinect Sensor | Durango Sensor
Field of View (FOV) | 57.5° horizontal by 43.5° vertical | 70° horizontal by 60° vertical
Resolvable Depth | 0.8 m to 4.0 m | 0.8 m to 4.0 m
Color Stream | 640 x 480 x 24 bpp 4:3 RGB @ 30 fps; 640 x 480 x 16 bpp 4:3 YUV @ 15 fps | 1920 x 1080 x 16 bpp 16:9 YUY2 @ 30 fps
Depth Stream | 320 x 240 x 16 bpp, 13-bit depth | 512 x 424 x 16 bpp, 13-bit depth
Infrared (IR) Stream | No IR stream | 512 x 424, 11-bit dynamic range
Registration | Color <-> depth | Color <-> depth and active IR
Audio Capture | 4-mic array returning 48 kHz audio | 4-mic array returning 48 kHz audio
Data Path | USB 2.0 | USB 3.0
Latency | ~90 ms with processing | ~60 ms with processing
Tilt Motor | Vertical only | No tilt motor
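
As a rough illustration of why the data path matters, the raw data rates implied by the stream formats in the table can be worked out directly. This is a back-of-the-envelope sketch only: it assumes uncompressed frames and no protocol overhead, the function name is hypothetical, and the IR stream is left out because the table does not give its storage size per pixel. The 480 Mbit/s figure is the nominal USB 2.0 high-speed signaling rate.

```python
def stream_rate_bytes_per_s(width, height, bits_per_pixel, fps):
    """Raw, uncompressed data rate of a video-like stream in bytes per second."""
    return width * height * bits_per_pixel // 8 * fps

# Stream formats taken from the Durango column of the table.
color = stream_rate_bytes_per_s(1920, 1080, 16, 30)
depth = stream_rate_bytes_per_s(512, 424, 16, 30)

USB2_HIGH_SPEED_BITS_PER_S = 480_000_000  # nominal signaling rate

total_bits_per_s = (color + depth) * 8
print(f"color: {color / 1e6:.1f} MB/s")
print(f"depth: {depth / 1e6:.1f} MB/s")
print(f"total: {total_bits_per_s / 1e6:.0f} Mbit/s")
print("exceeds USB 2.0 nominal rate:", total_bits_per_s > USB2_HIGH_SPEED_BITS_PER_S)
```

Under these assumptions the color stream alone is roughly twice the nominal USB 2.0 rate, which is consistent with the move to USB 3.0.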

Play Space and Field of View

With a 70-degree horizontal and 60-degree vertical FOV, and a depth range of 0.8 m to 4.0 m, the sensor captures a much larger area than the Xbox 360 Kinect Sensor. At 0.8 m, this area is 1.12 m wide by 0.92 m high; at 4 m, the area is 5.6 m wide by 4.6 m high. This much larger play space fits multiple players. Four players should fit comfortably, and the space can accommodate up to six.
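
The play-area figures follow from basic trigonometry: at distance d, a field of view of angle θ spans 2 · d · tan(θ / 2). The following minimal sketch reproduces them, assuming an idealized sensor whose FOV is symmetric about its optical axis; the function name is hypothetical.

```python
import math

def play_area(distance_m, h_fov_deg=70.0, v_fov_deg=60.0):
    """Return (width, height) in metres of the area visible at distance_m."""
    width = 2.0 * distance_m * math.tan(math.radians(h_fov_deg / 2.0))
    height = 2.0 * distance_m * math.tan(math.radians(v_fov_deg / 2.0))
    return width, height

# Near and far limits of the resolvable depth range.
for d in (0.8, 4.0):
    w, h = play_area(d)
    print(f"{d:.1f} m: {w:.2f} m wide by {h:.2f} m high")
```

The results agree with the quoted dimensions to within rounding.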

The form factor for the next-generation sensor will be similar to the current sensor, which is a wired unit, separate from the console. However, this sensor will not have a tilt motor. The wider vertical FOV should permit the sensor to be placed and oriented to capture a large enough area in a typical room, without adjustment, to accommodate the vast majority of users, who fall in the height range of 1 m to 1.83 m. The sensor is able to cover that range from just 1.58 m away.
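
One way to read the 1.58 m figure is as the distance at which the 60-degree vertical FOV spans 1.83 m, the top of the stated height range. The sketch below checks that reading. It is an interpretation, not a statement of how the figure was derived: it assumes the vertical FOV is aligned to cover a standing player from floor to head, and the function name is hypothetical.

```python
import math

def min_distance_for_span(span_m, v_fov_deg=60.0):
    """Distance at which a vertical FOV of v_fov_deg covers span_m metres."""
    return span_m / (2.0 * math.tan(math.radians(v_fov_deg / 2.0)))

# Tallest player in the stated range.
print(f"{min_distance_for_span(1.83):.2f} m")
```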

User studies from the Kinect program, coupled with the requirement to gather well-separated depth information, suggest that the best position for the sensor will be above the display, looking downward toward the players. This position maximizes the available depth information and minimizes joint occlusion for seated skeleton tracking (ST) scenarios.

The improved FOV means:

  • Titles can be played in a much larger selection of homes, usually without moving furniture.
  • Complexities of dynamic play space setup and tilt motor handling are removed.
  • Gameplay with players of different heights is much easier.
  • Fitting two or more players in the play area is much more practical.

Sensor Data Streams – Color

The sensor can return a full HD resolution (that is, 1920 x 1080) color stream at 30 frames per second, returned in YUY2 format. YUY2 format packs two pixels as four 8-bit components: Y1, U, Y2, V where Y1 and Y2 are individual pixel luminance values, and U and V are shared chrominance values for the two pixels. Quality and resolution are considerably improved over the current generation sensor, especially in low-light situations.
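
The packing described above can be illustrated with a short sketch that expands a YUY2 buffer into one (Y, U, V) triple per pixel. It works on a plain byte buffer and does not represent any Durango API; the function and constant names are hypothetical.

```python
def unpack_yuy2(data):
    """Expand packed YUY2 bytes into a list of (Y, U, V) tuples, one per pixel."""
    if len(data) % 4 != 0:
        raise ValueError("YUY2 data length must be a multiple of 4 bytes")
    pixels = []
    for i in range(0, len(data), 4):
        # Each 4-byte group carries two pixels that share U and V.
        y1, u, y2, v = data[i:i + 4]
        pixels.append((y1, u, v))
        pixels.append((y2, u, v))
    return pixels

# Two pixels in four bytes is an average of 16 bits per pixel.
FRAME_BYTES = 1920 * 1080 * 2

print(unpack_yuy2(bytes([10, 20, 30, 40])))
print(FRAME_BYTES)
```

Because chrominance is shared between each horizontal pair of pixels, a full 1920 x 1080 frame occupies about 4.1 MB rather than the 6.2 MB a 24-bit format would need.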

Sensor Data Streams – Depth

The sensor returns a 512 x 424 16-bit depth stream at 30 frames per second. The bit layout is the same as on the current Kinect Sensor: 13 bits of depth information and a 3-bit segmentation mask. In addition to higher resolution, the depth sensor is more precise. For example, at 3.5 m it can resolve objects two to three times smaller than the current sensor.
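
A 16-bit depth pixel of this kind can be split with a shift and a mask. The sketch below is illustrative only. It assumes the segmentation mask occupies the three least significant bits and the depth value the upper 13 bits, which is an assumption rather than something stated here; the helper names are hypothetical, and the depth unit is not specified.

```python
SEGMENTATION_BITS = 3
SEGMENTATION_MASK = (1 << SEGMENTATION_BITS) - 1  # 0b111

def unpack_depth_pixel(pixel):
    """Split a 16-bit pixel into (depth, segmentation), assuming low-bit mask."""
    if not 0 <= pixel <= 0xFFFF:
        raise ValueError("pixel must fit in 16 bits")
    depth = pixel >> SEGMENTATION_BITS        # upper 13 bits (assumed)
    segmentation = pixel & SEGMENTATION_MASK  # lower 3 bits (assumed)
    return depth, segmentation

def pack_depth_pixel(depth, segmentation):
    """Inverse of unpack_depth_pixel, under the same layout assumption."""
    if not 0 <= depth < (1 << 13) or not 0 <= segmentation <= SEGMENTATION_MASK:
        raise ValueError("depth must fit in 13 bits and segmentation in 3 bits")
    return (depth << SEGMENTATION_BITS) | segmentation

pixel = pack_depth_pixel(2000, 5)
print(unpack_depth_pixel(pixel))
```

Thirteen bits allow depth values from 0 to 8191, and three bits allow eight distinct segmentation values.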

Sensor Data Streams – Active IR

As part of the process of producing the depth stream, the sensor uses an active IR stream. This stream is 512 x 424 at 30 frames per second. The active IR stream is stable across variable lighting conditions. For example, shadows, pixel intensities, and noise characteristics are the same in a well-lit room as in a room with no light at all. As a result, this stream could be used for feature detection in situations where a color stream would be useless.

Sensor Data Streams – Registration

The depth stream is derived from the active IR stream. That means both IR and depth streams will have precisely the same point of view (POV), pixel for pixel; there is no transformation mechanism to introduce artefacts. The color sensor, however, is not in the same position as the depth or IR sensor. That means the color stream will appear to be from a slightly different POV. A registration mechanism will be provided that transforms the color stream to the view space of the depth and IR stream, or the other way around. Registration inevitably adds some minor artefacts to the stream being transformed.

Sensor Data Streams – Audio

The audio hardware is a four-microphone array, with each microphone capturing a raw 48 kHz 24-bit stream. Multi-channel echo cancellation (MEC) is carried out on these streams by dedicated MEC hardware, not by the CPU, so it comes at no cost to a title. The title is presented with a noise-reduced 16 kHz 24-bit stream of voice data. The audio output from the console itself is part of the cancellation process.

Skeleton Tracking

The skeleton tracking system on Durango will be enhanced over the Xbox 360 system with the following new or improved features.

New features:

  • Tracking of players as short as one meter in height.
  • One mode for both seated and standing players.
  • Detection of hand states, for example, open or closed hands.
  • Detection of extra joints, and rotations for some joints.

Improved features:

  • Tracking of six, rather than two, active players.
  • Tracking of occluded joints, for example, an elbow occluded by a hand.
  • Detection of joint positions.
  • Detection of sideways poses.

Identity System

The NUI Identity system on Xbox 360 uses a combination of sensor inputs to recognize players. On Durango, identity will work the same way, except that the active IR stream provides an additional input that is independent of visible light, which will make identity recognition much more robust.

Durango’s identity system will run continuously; its allocation is part of the system reservation. For this reason, developers can think of identity as another input stream from the NUI Identity system. The resulting, significantly reduced API set makes integrating identity with titles smooth and easy.

System Allocations

On Durango, from the POV of allocations, the NUI architecture is split into two parts.

Core Kinect functionality that is frequently used by titles and by the system itself is part of the system allocation. This includes color, depth, active IR, skeleton tracking, identity, and speech. Whether or not a game title uses these features, it pays the same cost in memory, CPU time, and GPU time. These features also provide advantages. For example, the identity system will run across application switches because it is handled by the system, not by individual applications, so players do not have to re-engage and sign in repeatedly.

Functionality that is used less often has its allocation managed in a pay-per-play model. For example, registering color to depth and active IR (or the other way around) is an infrequently used operation, so it costs the title a small amount of CPU time only when the title uses it.

System Latency

End-to-end system latency for Kinect is measured as the time from light hitting the sensor through to the display outputting an update based on that input. Improvements across the whole system on Durango are expected to remove around 33 ms from the end-to-end time. Kinect’s CPU, GPU and memory usage on Durango are part of the system reservation.