|
|
 |
| Ground Truth |
|
Target ground truth defines the ‘ideal’ output of a perfect detection/tracking
or other computer vision algorithm applied to the video stream, and provides
the basis of a common framework for evaluating the performance of real algorithms.
For conventional video, generating the ground truth requires very labour-intensive
and error-prone manual inspection of every frame. Virtual Video alleviates the
problem by generating detailed and exact target ground truth for each frame
automatically. In Virtual Video, foreground targets consist of people and
vehicles (both stationary and moving), and the ground truth generated per frame
consists of:
- 3D target centroids in the world frame
- Per-target bounding box around visible target pixels
- Per-target bounding box around all target pixels
- Pixel-wise foreground segmentation map, which assigns a label to each pixel according to the target (or background) appearing at that location
The bounding boxes define the upper, lower, left and right extent of each target
in each frame. Note that the first bounding box defines the extent of target pixels
that are not occluded, while the second bounding box defines the extent of the entire
target if it were completely unoccluded. These boxes can be used together to
evaluate the performance of occlusion reasoning algorithms. The figure below left
shows typical bounding boxes generated by Virtual Video, with the visible bounding
box as solid rectangles and the full bounding box as dotted rectangles. Below right
shows the associated label map in pseudo-colour, where each colour indicates a
different target label.
| Ground truth target bounding boxes overlaid on frame | Pixel-wise foreground label map |
Virtual Video also provides the camera ground truth, which defines the exact
intrinsic and extrinsic parameters of the projective camera model used to
render the frame. The camera parameters consist of:
- 3D location of camera centre in world frame
- Camera orientation as Euler angles in world frame
- Horizontal field of view
- Frame dimensions
- Elapsed time since the game started when the frame was rendered
Ground truth generation is computationally expensive and is therefore disabled by
default. To enable it, you must first run the
groundtruth console command. Ground truth
can then be retrieved along with each frame using the
CamGetLatestFrame C library function or
the Get Latest Frame socket command.
Ground truth cannot be accessed directly through the DirectShow filter. In
this case it can instead be saved to disk using the
savedir URI option (see also
Saving Ground Truth To Disk). The bounding boxes
and foreground label map can also be displayed through the DirectShow filter using the
showbboxes and
showlabelmap options.
|
 |
| Saving Ground Truth to Disk |
|
Rendered frames and ground truth data can be saved to disk on the Virtual Video server
using the savedir DirectShow
filter option, or the CamSetType C library
function or Set Camera Type socket command.
Frames are stored as JPEG files in the specified directory with the file name format
clientCCC_frmNNNNNN.jpg, where CCC
is the client number (which increases with each new connection) and NNNNNN is the frame
number. A binary ground truth data file corresponding to each frame is stored with
the file name format clientCCC_frmNNNNNN.gtd.
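When post-processing a saved sequence it is convenient to build the frame and ground truth file names for a given client and frame number. The sketch below assumes that CCC and NNNNNN are zero-padded decimal fields of width 3 and 6, which the placeholders suggest but the text does not state explicitly. The function name is hypothetical.

```c
#include <stdio.h>

/* Writes "clientCCC_frmNNNNNN.<ext>" into out.
 * ASSUMPTION: both numeric fields are zero-padded decimal.
 * Returns 0 on success, -1 if the name does not fit in out. */
int gt_file_name(char *out, size_t out_size,
                 int client, int frame, const char *ext)
{
    int n = snprintf(out, out_size, "client%03d_frm%06d.%s",
                     client, frame, ext);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Calling the function with ext set to "jpg" and then "gtd" yields the matching frame and ground truth file names for the same frame.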
Note that the ground truth files do not contain target ground truth when target ground truth has not
been enabled using the groundtruth console
command, or when the camera is an omnicam. In either case, the ground truth files always contain
camera ground truth.
The binary *.gtd format consists of a header followed
by two variable length data buffers in the following order:
| Byte Length | Description |
| 52 | Header and camera ground truth |
| T x 56 | Target ground truth, T = number of visible targets (from header) |
| L x 4 | RLE label map, L = number of RLE data elements (from header) |
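Because all three section sizes follow from the header, the expected size of a *.gtd file can be computed and compared with the size on disk as a basic sanity check before parsing. A minimal sketch, with illustrative names that are not part of the Virtual Video library:

```c
enum {
    GTD_HEADER_BYTES      = 52, /* header and camera ground truth */
    GTD_TARGET_BYTES      = 56, /* one target ground truth chunk  */
    GTD_RLE_ELEMENT_BYTES = 4   /* one 32-bit RLE element         */
};

/* Expected total file size in bytes for T targets and L RLE elements,
 * or -1 if either count is negative (which indicates a corrupt header). */
long long gtd_expected_size(int num_targets, int rle_length)
{
    if (num_targets < 0 || rle_length < 0)
        return -1;
    return GTD_HEADER_BYTES
         + (long long)num_targets * GTD_TARGET_BYTES
         + (long long)rle_length * GTD_RLE_ELEMENT_BYTES;
}
```

A file with no target ground truth (T = 0 and L = 0) would then be exactly 52 bytes long.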
The header has the following fixed format:
| Byte Offset | Data Type | Description |
| 0-3 | int | Frame number |
| 4-7 | float | Elapsed time in seconds since game started |
| 8-11 | int | Frame width |
| 12-15 | int | Frame height |
| 16-19 | float | Horizontal field of view in degrees |
| 20-23 | float | Camera x-coord in world frame in inches |
| 24-27 | float | Camera y-coord in world frame in inches |
| 28-31 | float | Camera z-coord in world frame in inches |
| 32-35 | float | Camera rotation about world x-axis in degrees |
| 36-39 | float | Camera rotation about world y-axis in degrees |
| 40-43 | float | Camera rotation about world z-axis in degrees |
| 44-47 | int | T = Number of ground truth targets |
| 48-51 | int | L = Length of RLE label map data in 32-bit elements |
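The header layout above maps naturally onto a C structure. The sketch below parses the 52 header bytes field by field rather than casting the buffer, so that it does not depend on structure padding. The byte order is not specified in this document: the code assumes little-endian integers and IEEE-754 single-precision floats, and that assumption should be checked against real files. All names are illustrative.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    int32_t frame_number;
    float   elapsed_seconds;
    int32_t width, height;
    float   hfov_deg;
    float   cam_pos[3];   /* x, y, z in inches   */
    float   cam_rot[3];   /* about x, y, z, deg  */
    int32_t num_targets;  /* T */
    int32_t rle_length;   /* L */
} GtdHeader;

/* ASSUMPTION: values are stored little-endian. */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static int32_t read_i32_le(const unsigned char *p)
{
    uint32_t u = read_u32_le(p);
    int32_t v;
    memcpy(&v, &u, sizeof v);
    return v;
}

static float read_f32_le(const unsigned char *p)
{
    uint32_t u = read_u32_le(p);
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Fills *h from the first 52 bytes of a .gtd file. */
void gtd_parse_header(const unsigned char buf[52], GtdHeader *h)
{
    int i;
    h->frame_number    = read_i32_le(buf + 0);
    h->elapsed_seconds = read_f32_le(buf + 4);
    h->width           = read_i32_le(buf + 8);
    h->height          = read_i32_le(buf + 12);
    h->hfov_deg        = read_f32_le(buf + 16);
    for (i = 0; i < 3; i++) {
        h->cam_pos[i] = read_f32_le(buf + 20 + 4 * i);
        h->cam_rot[i] = read_f32_le(buf + 32 + 4 * i);
    }
    h->num_targets     = read_i32_le(buf + 44);
    h->rle_length      = read_i32_le(buf + 48);
}
```

After parsing, num_targets and rle_length give the sizes of the two variable-length sections that follow the header.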
The ground truth centroids and bounding boxes are stored contiguously in T 56-byte chunks
(one for each visible foreground target) following the header. Each chunk has the following format:
| Byte Offset | Data Type | Description |
| 0-3 | int | Target label (identifies target pixels in label map) |
| 4-7 | float | World x-coord of 3D target centroid in inches |
| 8-11 | float | World y-coord of 3D target centroid in inches |
| 12-15 | float | World z-coord of 3D target centroid in inches |
| 16-19 | int | Top y-coord of bounding box around visible target pixels |
| 20-23 | int | Bottom y-coord of bounding box around visible target pixels |
| 24-27 | int | Left x-coord of bounding box around visible target pixels |
| 28-31 | int | Right x-coord of bounding box around visible target pixels |
| 32-35 | int | Number of visible foreground pixels in bounding box |
| 36-39 | int | Top y-coord of bounding box around all target pixels |
| 40-43 | int | Bottom y-coord of bounding box around all target pixels |
| 44-47 | int | Left x-coord of bounding box around all target pixels |
| 48-51 | int | Right x-coord of bounding box around all target pixels |
| 52-55 | int | Number of foreground pixels in bounding box |
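The two pixel counts in each chunk give a simple per-target occlusion measure. The sketch below mirrors the chunk layout in a structure and computes the fraction of the target that is visible. It assumes that the count at offset 52 includes occluded pixels, which is how the "all target pixels" bounding box reads but is not stated explicitly. The type and function names are illustrative.

```c
#include <stdint.h>

/* One 56-byte target chunk: 14 fields of 4 bytes each. */
typedef struct {
    int32_t label;
    float   centroid[3];                      /* world x, y, z in inches */
    int32_t vis_top, vis_bottom, vis_left, vis_right;
    int32_t vis_pixels;                       /* visible target pixels   */
    int32_t full_top, full_bottom, full_left, full_right;
    int32_t full_pixels;                      /* all target pixels       */
} GtdTarget;

/* Fraction of the target that is unoccluded, in the range 0 to 1.
 * Returns 0 when the target has no pixels at all. */
double gtd_visible_fraction(const GtdTarget *t)
{
    if (t->full_pixels <= 0)
        return 0.0;
    return (double)t->vis_pixels / (double)t->full_pixels;
}
```

A value near 1 indicates an essentially unoccluded target, while small values mark the heavily occluded cases that are most useful for testing occlusion reasoning.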
Finally, the label map is stored in run-length encoded (RLE) format in a buffer of L contiguous
elements following the target data. Each element is a 32-bit integer (hence the total RLE buffer
length is L x 4 bytes), and the elements are stored in pairs, where each pair encodes a Label
followed by a Run Length in pixels. The following C code extract shows how to decode the label map
from RLE data stored in rlebuf,
where width and height are the frame
dimensions and L is the RLE length, all taken from the header:
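/* Requires <stdlib.h> for malloc. rlebuf holds the L RLE elements. */
int rleIndex = 0, pixOffset = 0;
int label, run, endOffset;
int numPixels = width * height;
int *labelMap = malloc(numPixels * sizeof(int));

while ((rleIndex + 1 < L) && (pixOffset < numPixels))
{
    label = rlebuf[rleIndex++];   /* first element of the pair  */
    run = rlebuf[rleIndex++];     /* second element of the pair */
    endOffset = pixOffset + run;
    if (endOffset > numPixels)
        endOffset = numPixels;    /* guard against a corrupt run length */
    for (; pixOffset < endOffset; pixOffset++)
    {
        labelMap[pixOffset] = label;
    }
}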
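Some quantities can be read from the RLE data without decoding the full label map. As a sketch, the function below sums the run lengths belonging to one label, which gives the number of pixels carrying that label in the frame. The function name is illustrative, and the value used for background pixels is not specified here, so the caller must supply whichever label is of interest.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts the pixels assigned to `label` by summing matching runs.
 * rlebuf holds L 32-bit elements arranged as (label, run length) pairs. */
long rle_count_label(const int32_t *rlebuf, size_t L, int32_t label)
{
    long count = 0;
    size_t i;
    for (i = 0; i + 1 < L; i += 2) {
        if (rlebuf[i] == label && rlebuf[i + 1] > 0)
            count += rlebuf[i + 1];
    }
    return count;
}
```

For a given target label, the result would be expected to match the visible pixel count stored in that target's chunk, although that equivalence is an assumption rather than something this document guarantees.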
|
 |
| How Target Ground Truth is Generated |
|
This section provides a high level description of how the ground truth target bounding
boxes and foreground label map are computed. This section is not required reading, but
is provided to give the user a greater understanding of the nature of the ground truth data.
The target bounding boxes and foreground label map are generated in two rendering passes
in addition to the usual frame rendering pass. The first pass extracts the full target
bounding boxes in a process reminiscent of chroma keying. For each target, a frame
buffer is initially filled with a distinctive background colour and the target is
rendered in isolation (this both identifies target pixels and eliminates occlusions).
The bounding box is then computed by finding the boundaries of all non-background
coloured pixels. The frame buffer is again cleared to the background colour before
rendering the next target.
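The boundary search in the first pass can be illustrated on the CPU as follows. This is a sketch of the idea only and not the renderer's actual code: it assumes a packed 32-bit pixel buffer in row-major order and inclusive box edges, and all names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int top, bottom, left, right; /* inclusive pixel coordinates */
    int count;                    /* number of target pixels     */
} Extent;

/* Finds the extent of all pixels that differ from the background key
 * colour. Returns 1 if at least one such pixel exists, 0 otherwise. */
int extent_of_non_background(const uint32_t *pixels, int width, int height,
                             uint32_t background, Extent *out)
{
    int x, y;
    out->top = height;  out->bottom = -1;
    out->left = width;  out->right = -1;
    out->count = 0;
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            if (pixels[y * width + x] == background)
                continue;
            if (y < out->top)    out->top = y;
            if (y > out->bottom) out->bottom = y;
            if (x < out->left)   out->left = x;
            if (x > out->right)  out->right = x;
            out->count++;
        }
    }
    return out->count > 0;
}
```

Because the target is rendered alone, every non-background pixel belongs to it, so the resulting extent is unaffected by occluders.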
The second pass is similar to the first but uses a z-buffer to enforce occlusions.
The z-buffer stores the scene depth at each rendered pixel and enables correct occlusion
handling by testing whether new pixels are closer than previously rendered pixels. To
enforce environmental occlusions, the world (buildings, roads, etc) is first rendered
without foreground targets and the frame buffer is then cleared while preserving the
z-buffer. The foreground targets are then rendered separately, and the z-buffer ensures
that only unoccluded pixels appear. Between rendering each target, non-background pixels
are transferred to the label map and the frame buffer is again cleared. If radial
distortion or anti-aliasing are enabled, the final label map is warped accordingly.
Finally, visible target bounding boxes are computed by finding the extent of each
uniquely labelled region in the label map.
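The role of the z-buffer in the second pass can be sketched in software. The example below is an illustration under stated assumptions, not the renderer's implementation: zbuf is taken to hold the depths left by the world-only render, each target supplies a per-pixel depth (negative where the target does not cover the pixel), and smaller depths are nearer the camera. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Depth-tests one target against zbuf and writes `label` into label_map
 * wherever the target is nearer than anything rendered so far.
 * Returns the number of pixels written for this target. */
size_t composite_target(const float *target_depth, float *zbuf,
                        int32_t *label_map, size_t num_pixels,
                        int32_t label)
{
    size_t i, written = 0;
    for (i = 0; i < num_pixels; i++) {
        float d = target_depth[i];
        if (d < 0.0f)
            continue;            /* target does not cover this pixel */
        if (d < zbuf[i]) {       /* nearer than world and earlier targets */
            zbuf[i] = d;
            label_map[i] = label;
            written++;
        }
    }
    return written;
}
```

Calling the function once per target leaves only unoccluded pixels labelled, since a nearer target rendered later overwrites the label of a farther one at the pixels where they overlap.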