Ground Truth

Target ground truth defines the ‘ideal’ output of a perfect detection/tracking or other computer vision algorithm applied to the video stream, and provides the basis of a common framework for evaluating the performance of real algorithms. For conventional video, generating the ground truth requires very labour-intensive and error-prone manual inspection of every frame. Virtual Video alleviates the problem by generating detailed and exact target ground truth for each frame automatically. In Virtual Video, foreground targets consist of people and vehicles (both stationary and moving), and the ground truth generated per frame consists of:

  1. 3D target centroids in the world frame
  2. Per target bounding box around visible target pixels
  3. Per target bounding box around all target pixels
  4. Pixel-wise foreground segmentation map, which assigns a label to each pixel according to the target (or background) appearing at that location

The bounding boxes define the upper, lower, left and right extent of each target in each frame. Note that the first bounding box defines the extent of the target pixels that are not occluded, while the second bounding box defines the extent the entire target would have if it were completely unoccluded. Used together, these boxes allow the performance of occlusion reasoning algorithms to be evaluated. The figure below (left) shows typical bounding boxes generated by Virtual Video, with visible bounding boxes drawn as solid rectangles and full bounding boxes as dotted rectangles. The figure below (right) shows the associated label map in pseudo-colour, where each colour indicates a different target label.

  
Figure (left): Ground truth target bounding boxes overlaid on frame. Figure (right): Pixel-wise foreground label map.

Virtual Video also provides the camera ground truth, which defines the exact intrinsic and extrinsic parameters of the projective camera model used to render the frame. The camera parameters consist of:

  1. 3D location of camera centre in world frame
  2. Camera orientation as Euler angles in world frame
  3. Horizontal field of view
  4. Frame dimensions
  5. Elapsed time since the game started when the frame was rendered

Ground truth generation is computationally expensive and is therefore disabled by default. To enable it, you must first run the groundtruth console command. Ground truth can then be retrieved along with each frame using the CamGetLatestFrame C library function or the Get Latest Frame socket command. Ground truth cannot be accessed directly through the DirectShow filter; in this case, however, it can be saved to disk using the savedir URI option (see also Saving Ground Truth to Disk). The bounding boxes and foreground label map can also be displayed through the DirectShow filter using the showbboxes and showlabelmap options.

Saving Ground Truth to Disk

Rendered frames and ground truth data can be saved to disk on the Virtual Video server using the savedir DirectShow filter option, the CamSetType C library function, or the Set Camera Type socket command. Frames are stored as JPEG files in the specified directory with the file name format clientCCC_frmNNNNNN.jpg, where CCC is the client number (which increases for each new connection) and NNNNNN is the frame number. A binary ground truth data file corresponding to each frame is stored with the file name format clientCCC_frmNNNNNN.gtd. Note that the ground truth files do not contain target ground truth when target ground truth has not been enabled using the groundtruth console command, or when the camera is an omnicam. Even in these cases, the ground truth files still contain camera ground truth.
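The file names for a given client and frame can be built with a single format string. The sketch below assumes that CCC and NNNNNN are zero-padded decimal fields of width 3 and 6, which the format description suggests but does not state explicitly. gt_file_name is a hypothetical helper, not part of the Virtual Video C library:

```c
#include <stdio.h>

/* Writes "clientCCC_frmNNNNNN.<ext>" into buf and returns the length that
   snprintf reports.
   ASSUMPTION: CCC and NNNNNN are zero-padded to 3 and 6 decimal digits. */
int gt_file_name(char *buf, size_t size, int client, int frame, const char *ext)
{
    return snprintf(buf, size, "client%03d_frm%06d.%s", client, frame, ext);
}
```

Calling gt_file_name(name, sizeof name, 2, 157, "gtd") under these assumptions yields client002_frm000157.gtd, and passing "jpg" gives the name of the matching frame image.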

The binary *.gtd format consists of a fixed-length header followed by two variable-length data buffers, in the following order:

Byte Length   Description
52            Header and camera ground truth
T x 56        Target ground truth, T = number of visible targets (from header)
L x 4         RLE label map, L = number of RLE data elements (from header)
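The three lengths in the table add up to the expected size of a file, which gives a cheap sanity check before parsing. A minimal sketch, assuming that nothing follows the RLE buffer in the file (the layout above implies this but does not say so); gtd_expected_size is a hypothetical name:

```c
/* Expected size in bytes of a .gtd file with T targets and L RLE elements:
   52-byte header + T 56-byte target chunks + L 32-bit RLE elements.
   ASSUMPTION: the file contains no trailing data after the RLE buffer. */
long gtd_expected_size(int num_targets, int rle_length)
{
    return 52L + 56L * num_targets + 4L * rle_length;
}
```

A file whose actual size differs from this value for the T and L read from its header is probably truncated or was parsed with the wrong byte order.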

The header has the following fixed format:

Byte Offset   Data Type   Description
0-3           int         Frame number
4-7           float       Elapsed time in seconds since game started
8-11          int         Frame width
12-15         int         Frame height
16-19         float       Horizontal field of view in degrees
20-23         float       Camera x-coord in world frame in inches
24-27         float       Camera y-coord in world frame in inches
28-31         float       Camera z-coord in world frame in inches
32-35         float       Camera rotation about world x-axis in degrees
36-39         float       Camera rotation about world y-axis in degrees
40-43         float       Camera rotation about world z-axis in degrees
44-47         int         T = Number of ground truth targets
48-51         int         L = Length of RLE label map data in 32-bit elements
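The header layout can be decoded field by field from a byte buffer. The sketch below is illustrative only: the byte order is not stated in this document, so little-endian storage and 32-bit IEEE 754 floats are assumptions, and GtdHeader, gtd_parse_header and the read_* helpers are hypothetical names rather than part of the Virtual Video C library:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of the 52-byte header. */
typedef struct {
    int32_t frame_number;
    float   elapsed_seconds;
    int32_t width, height;
    float   hfov_deg;
    float   cam_pos_in[3];    /* camera x, y, z in inches */
    float   cam_rot_deg[3];   /* rotation about world x, y, z axes */
    int32_t num_targets;      /* T */
    int32_t rle_length;       /* L, in 32-bit elements */
} GtdHeader;

/* ASSUMPTION: all values are stored little-endian. */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static int32_t read_i32_le(const unsigned char *p)
{
    uint32_t u = read_u32_le(p);
    int32_t v;
    memcpy(&v, &u, sizeof v);
    return v;
}

/* ASSUMPTION: floats are 32-bit IEEE 754, as is the host float type. */
static float read_f32_le(const unsigned char *p)
{
    uint32_t u = read_u32_le(p);
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Decodes the header from the first 52 bytes of a .gtd file. */
void gtd_parse_header(const unsigned char *buf, GtdHeader *h)
{
    int i;
    h->frame_number    = read_i32_le(buf + 0);
    h->elapsed_seconds = read_f32_le(buf + 4);
    h->width           = read_i32_le(buf + 8);
    h->height          = read_i32_le(buf + 12);
    h->hfov_deg        = read_f32_le(buf + 16);
    for (i = 0; i < 3; i++) {
        h->cam_pos_in[i]  = read_f32_le(buf + 20 + 4 * i);
        h->cam_rot_deg[i] = read_f32_le(buf + 32 + 4 * i);
    }
    h->num_targets = read_i32_le(buf + 44);
    h->rle_length  = read_i32_le(buf + 48);
}
```

Decoding from bytes rather than casting the buffer to a struct avoids any dependence on the host's alignment or byte order. If the decoded frame dimensions look implausible, the byte-order assumption is the first thing to check.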

The ground truth centroids and bounding boxes are stored contiguously in T 56-byte chunks (one for each visible foreground target) following the header. Each chunk has the following format:

Byte Offset   Data Type   Description
0-3           int         Target label (identifies target pixels in label map)
4-7           float       World x-coord of 3D target centroid in inches
8-11          float       World y-coord of 3D target centroid in inches
12-15         float       World z-coord of 3D target centroid in inches
16-19         int         Top y-coord of bounding box around visible target pixels
20-23         int         Bottom y-coord of bounding box around visible target pixels
24-27         int         Left x-coord of bounding box around visible target pixels
28-31         int         Right x-coord of bounding box around visible target pixels
32-35         int         Number of visible foreground pixels in bounding box
36-39         int         Top y-coord of bounding box around all target pixels
40-43         int         Bottom y-coord of bounding box around all target pixels
44-47         int         Left x-coord of bounding box around all target pixels
48-51         int         Right x-coord of bounding box around all target pixels
52-55         int         Number of foreground pixels in bounding box
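Because each chunk carries a pixel count for both boxes, a rough measure of how much of a target is visible can be derived from a single record. The sketch below mirrors the chunk layout in a struct and computes that ratio. It assumes that the second count covers all pixels of the unoccluded target, as the field order suggests; GtBox, GtTarget and gt_visible_fraction are hypothetical names:

```c
#include <stdint.h>

/* One bounding box plus its pixel count, in the order used by the chunk. */
typedef struct {
    int32_t top, bottom, left, right;
    int32_t pixel_count;
} GtBox;

/* Hypothetical in-memory form of one 56-byte target chunk. */
typedef struct {
    int32_t label;
    float   centroid_in[3];   /* world x, y, z in inches */
    GtBox   visible;          /* box around visible target pixels */
    GtBox   full;             /* box around all target pixels */
} GtTarget;

/* Fraction of the target's pixels that are visible.
   Returns 0 when the full pixel count is not positive.
   ASSUMPTION: full.pixel_count counts every pixel of the unoccluded target. */
double gt_visible_fraction(const GtTarget *t)
{
    if (t->full.pixel_count <= 0)
        return 0.0;
    return (double)t->visible.pixel_count / (double)t->full.pixel_count;
}
```

A value near 1 indicates an essentially unoccluded target, while small values flag heavily occluded targets that an evaluation may want to score separately.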

Finally, the label map is stored in run-length encoded (RLE) format in a buffer of L contiguous elements following the target data. Each element is a 32-bit integer (hence the total RLE buffer length is L x 4 bytes). The elements are stored in pairs, where each pair encodes a label followed by a run length in pixels. The following C code extract shows how to decode the label map from RLE data stored in rlebuf, where width and height are the frame dimensions and L is the RLE length from the header:

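/* Requires <stdlib.h> for malloc. */
int rleIndex, pixOffset, label, run, endOffset;
int numPixels = width * height;
int *labelMap;

labelMap = malloc(numPixels * sizeof(int));   /* one label per pixel; check for NULL in real code */
rleIndex = 0;
pixOffset = 0;
/* Each iteration consumes one (label, run length) pair. */
while ((rleIndex + 1 < L) && (pixOffset < numPixels))
{
   label = rlebuf[rleIndex++];
   run = rlebuf[rleIndex++];
   endOffset = pixOffset + run;
   if (endOffset > numPixels)
      endOffset = numPixels;   /* guard against a run overshooting the frame */
   for (; pixOffset < endOffset; pixOffset++)
   {
      labelMap[pixOffset] = label;
   }
}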
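
When only the number of pixels carrying a given label is needed, the run lengths can be summed without decoding the full map. A minimal sketch, assuming the same (label, run length) pair order as the decoder above; rle_count_label is a hypothetical helper:

```c
/* Total number of pixels assigned to `label` in an RLE buffer of L elements.
   ASSUMPTION: elements are (label, run length) pairs, in that order. */
long rle_count_label(const int *rlebuf, int L, int label)
{
    long total = 0;
    int i;
    for (i = 0; i + 1 < L; i += 2) {
        if (rlebuf[i] == label)
            total += rlebuf[i + 1];
    }
    return total;
}
```

One plausible use is to compare the result with the visible pixel count stored in the corresponding target chunk, although this document does not state that the two values always agree.
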
How Target Ground Truth is Generated

This section gives a high-level description of how the ground truth target bounding boxes and foreground label map are computed. It is not required reading, but is intended to give the user a better understanding of the nature of the ground truth data.

The target bounding boxes and foreground label map are generated in two rendering passes in addition to the usual frame rendering pass. The first pass extracts the full target bounding boxes in a process reminiscent of chroma keying. For each target, a frame buffer is initially filled with a distinctive background colour and the target is rendered in isolation (this both identifies target pixels and eliminates occlusions). The bounding box is then computed by finding the boundaries of all non-background coloured pixels. The frame buffer is again cleared to the background colour before rendering the next target.
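The boundary search in the first pass can be illustrated in software. The sketch below is not the Virtual Video implementation: it assumes a row-major frame buffer of packed 32-bit pixels and inclusive box extents, and KeyedBox and keyed_bounding_box are hypothetical names:

```c
#include <stdint.h>

typedef struct {
    int top, bottom, left, right;  /* inclusive pixel extents */
    int count;                     /* number of non-background pixels */
} KeyedBox;

/* Scans a frame buffer in which a single target was rendered over a uniform
   background colour and records the extent of all non-background pixels.
   Returns 1 if the target covers at least one pixel, 0 otherwise. */
int keyed_bounding_box(const uint32_t *pixels, int width, int height,
                       uint32_t background, KeyedBox *box)
{
    int x, y;
    box->top = height;  box->bottom = -1;
    box->left = width;  box->right = -1;
    box->count = 0;
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            if (pixels[y * width + x] == background)
                continue;          /* not part of the target */
            if (y < box->top)    box->top = y;
            if (y > box->bottom) box->bottom = y;
            if (x < box->left)   box->left = x;
            if (x > box->right)  box->right = x;
            box->count++;
        }
    }
    return box->count > 0;
}
```

The background colour must be one that no target can produce, which is why the text describes it as distinctive. An exact comparison like the one above would misclassify any target pixel that happened to match it.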

The second pass is similar to the first but uses a z-buffer to enforce occlusions. The z-buffer stores the scene depth at each rendered pixel and enables correct occlusion handling by testing whether new pixels are closer than previously rendered pixels. To enforce environmental occlusions, the world (buildings, roads, etc.) is first rendered without foreground targets, and the frame buffer is then cleared while preserving the z-buffer. The foreground targets are then rendered one at a time, and the z-buffer ensures that only unoccluded pixels appear. After each target is rendered, its non-background pixels are transferred to the label map and the frame buffer is cleared again. If radial distortion or anti-aliasing is enabled, the final label map is warped accordingly. Finally, visible target bounding boxes are computed by finding the extent of each uniquely labelled region in the label map.
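The depth test at the heart of the second pass can be sketched per pixel in software. This is an analogy to what graphics hardware does, not the Virtual Video code. It assumes that smaller depth values are closer to the camera and that a negative depth marks pixels the target does not cover; label_visible_pixels is a hypothetical name:

```c
/* Depth-tests one target against the z-buffer and writes its label into the
   label map wherever the target is the closest surface rendered so far.
   The z-buffer is updated as well, so the target's depth is taken into
   account when later targets are tested.
   Returns the number of pixels labelled.
   ASSUMPTIONS: smaller depth = closer; depth < 0 means "not covered". */
int label_visible_pixels(const float *target_depth, float *zbuf,
                         int *label_map, int num_pixels, int label)
{
    int i, visible = 0;
    for (i = 0; i < num_pixels; i++) {
        float d = target_depth[i];
        if (d < 0.0f)
            continue;              /* target does not cover this pixel */
        if (d < zbuf[i]) {         /* closer than anything rendered so far */
            zbuf[i] = d;
            label_map[i] = label;
            visible++;
        }
    }
    return visible;
}
```

Initialising zbuf with the depths of the world geometry corresponds to rendering the environment first and then clearing only the frame buffer, as described above.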

© Copyright ObjectVideo, 2006. All rights reserved.