# Object detection and encode with Neural Processing SDK

These use cases run a `yolox.dlc` object detection model with the Qualcomm Neural Processing SDK to identify objects in a camera stream. Each use case overlays or composes bounding boxes over the detected objects, and then encodes the stream as an H.264 bitstream.

Download the [YOLOX](https://aihub.qualcomm.com/iot/models/yolox?searchTerm=yolox) Qualcomm AI runtime w8a8 precision model from AI Hub. The YOLOX model uses the YOLOv8 postprocessing module.
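
Before running the pipelines, it can help to confirm that the downloaded model and the labels file are where the commands expect them. The following is a minimal sketch: the two paths come from the pipelines in this topic, while `check_files` is a hypothetical helper name and not part of any SDK.

```shell
#!/bin/sh
# Paths as used by the pipelines in this topic; adjust them if your files live elsewhere.
MODEL=/etc/models/yolox-yolo-x-w8a8.dlc
LABELS=/etc/labels/yolox.json

# check_files (hypothetical helper): prints each missing file to stderr
# and returns non-zero if at least one file is missing.
check_files() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_files "$MODEL" "$LABELS" && echo "inputs found"` prints `inputs found` only when both files exist.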

**Note:** For Ubuntu Server, `sudo` access is necessary to write the encoded stream to the `/etc/media` folder.
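
One way to decide whether `sudo` is needed is to test whether the current user can write to the output folder. The following is a minimal sketch under that assumption; `sudo_prefix_for` is a hypothetical helper name.

```shell
#!/bin/sh
# sudo_prefix_for (hypothetical helper): prints "sudo" when the given directory
# is missing or not writable by the current user, and prints nothing otherwise.
sudo_prefix_for() {
    if [ -d "${1:-}" ] && [ -w "${1:-}" ]; then
        return 0
    fi
    printf 'sudo\n'
}
```

The result can be used as a command prefix, for example `$(sudo_prefix_for /etc/media) gst-launch-1.0 -e ...`, so that `sudo` is added only when the folder is not writable.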

## Use qtivoverlay plugin to apply detection overlay

Run the use case on the target device:

    gst-launch-1.0 -e \
        qtiqmmfsrc name=camsrc ! video/x-raw,format=NV12_Q08C,width=1280,height=720,framerate=30/1 ! queue ! tee name=split \
        split. ! queue ! qtimetamux name=metamux ! queue ! qtivoverlay ! queue ! v4l2h264enc capture-io-mode=4 output-io-mode=5 ! h264parse ! queue ! mp4mux ! queue ! filesink location=/etc/media/video.mp4 \
        split. ! queue ! qtimlvconverter ! queue ! qtimlsnpe delegate=dsp model=/etc/models/yolox-yolo-x-w8a8.dlc tensors="<boxes,scores,class_idx>" ! queue ! \
        qtimlpostprocess settings="{\"confidence\": 70.0}" results=5 module=yolov8 labels=/etc/labels/yolox.json ! text/x-raw ! queue ! metamux.

To stop the use case, use **CTRL + C**.
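
The command above hard-codes the confidence threshold, the `results` value, and the output file. When experimenting with these values, a small wrapper can save retyping the whole pipeline. The following is a sketch only: `run_overlay` and the `GST_LAUNCH` override are hypothetical names introduced here, and the pipeline elements and properties are copied unchanged from the command above.

```shell
#!/bin/sh
# run_overlay (hypothetical wrapper): runs the overlay pipeline from this topic.
#   $1  confidence value passed in the qtimlpostprocess settings (default: 70.0)
#   $2  value passed to the qtimlpostprocess "results" property (default: 5)
#   $3  output file for filesink (default: /etc/media/video.mp4)
# GST_LAUNCH is an assumed override for the launcher binary,
# for example GST_LAUNCH=echo to print the arguments instead of running them.
run_overlay() {
    confidence="${1:-70.0}"
    results="${2:-5}"
    output="${3:-/etc/media/video.mp4}"
    "${GST_LAUNCH:-gst-launch-1.0}" -e \
        qtiqmmfsrc name=camsrc ! video/x-raw,format=NV12_Q08C,width=1280,height=720,framerate=30/1 ! queue ! tee name=split \
        split. ! queue ! qtimetamux name=metamux ! queue ! qtivoverlay ! queue ! v4l2h264enc capture-io-mode=4 output-io-mode=5 ! h264parse ! queue ! mp4mux ! queue ! filesink location="$output" \
        split. ! queue ! qtimlvconverter ! queue ! qtimlsnpe delegate=dsp model=/etc/models/yolox-yolo-x-w8a8.dlc tensors="<boxes,scores,class_idx>" ! queue ! \
        qtimlpostprocess settings="{\"confidence\": ${confidence}}" results="$results" module=yolov8 labels=/etc/labels/yolox.json ! text/x-raw ! queue ! metamux.
}
```

For example, `run_overlay 50.0 3 /etc/media/test.mp4` runs the same pipeline with a confidence of 50.0 and `results=3`, and `GST_LAUNCH=echo run_overlay` performs a dry run with the defaults.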

The following figure shows the flow of the use case execution:

1. Identify objects in a video stream that comes from a camera source.
2. Overlay bounding boxes over the detected objects using overlaylib.
3. Encode this stream as an H.264 bitstream.
4. Multiplex the stream into an MP4 container and store it as an MP4 file.

The following table provides the sequential processing stages of the pipeline execution:

Table: Pipeline processing stages for bounding box overlay and encode

| Process | Description |
| --- | --- |
| qtiqmmfsrc | <ol class="arabic simple"><li><p>Captures the video stream from the camera source. The tee element that follows creates two copies of the stream:</p><ul class="simple"><li><p>One stream is sent to the qtimetamux plugin to retain the video stream.</p></li><li><p>The other stream is sent to the ML inferencing pipeline.</p></li></ul></li></ol> |
| **Preprocessing** |
| qtimlvconverter | <ol class="arabic"><li><p>Receives the video stream on its sink pad.</p></li><li><p>Performs preprocessing:</p><ul class="simple"><li><p>Color conversion</p></li><li><p>Scaling up or down</p></li><li><p>Normalization of the stream data when the model expects floating-point values as input</p></li></ul></li><li><p>Converts the video stream to a tensor stream on its source pad.</p><p>The object detection model uses this tensor stream for inferencing.</p></li></ol> |
| **Inferencing** |
| qtimlsnpe | <ol class="arabic simple"><li><p>Loads the object detection model.</p></li><li><p>Modifies the graph for the chosen delegate.</p></li><li><p>Receives the tensor stream on its sink pad.</p></li><li><p>Runs the inference and produces a tensor stream with the object detection results on its source pad.</p></li></ol> |
| **Postprocessing** |
| qtimlpostprocess | <ol class="arabic"><li><p>Receives the inference tensors from the object detection model.</p></li><li><p>Converts the inference tensors on its sink pad into formats, such as video or text, that the multimedia plugins can process later.</p></li><li><p>Applies the confidence threshold and the chosen number of results.</p></li><li><p>Loads the corresponding modules for detection models.</p><p>In this use case, qtimlpostprocess does the following:</p><ol class="loweralpha simple"><li><p>Loads the YOLOv8 submodule.</p></li><li><p>Produces the results as structured text.</p></li><li><p>Sends them to the sink pad of qtimetamux.</p></li></ol></li></ol> |
| qtimetamux | <ol class="arabic simple"><li><p>Receives the video stream and the text stream with the corresponding bounding box results on its sink pads.</p></li><li><p>Produces GST buffers with the contents of the video stream from its sink pad.</p></li><li><p>Adds the bounding boxes from the data sink pad to the GST buffer metadata as <code class="docutils literal notranslate"><span class="pre">GstVideoRegionOfInterest</span></code> (meta muxing) on its source pad.</p></li></ol> |
| qtivoverlay | <ol class="arabic simple"><li><p>Receives the multiplexed stream.</p></li><li><p>Overlays the bounding boxes on the video frame using CL.</p></li><li><p>Produces GST buffers with overlays on its source pad.</p></li></ol> |
| v4l2h264enc | <ol class="arabic simple"><li><p>Applies parameters to each frame of the video stream that it receives on its sink pad.</p></li><li><p>Encodes the frames into a bitstream and sends it over its source pad.</p></li></ol> |
| h264parse | Adds more information about the bitstream to the GStreamer buffer metadata. |
| mp4mux | Receives these buffers and packs them into an MP4 container according to the format specification. |
| **Output** |
| filesink | Stores the resulting stream in the */etc/media/video.mp4* file. |
| Playback | From the host computer, pull *video.mp4* from the device and play it in a media player: `scp root@<ip>:/etc/media/video.mp4 <destination>` |
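
The playback step can be wrapped in a small function on the host. This is a minimal sketch: `pull_video` and the `SCP` override are hypothetical names introduced here, while the `root` user and the `/etc/media/video.mp4` path come from the table above.

```shell
#!/bin/sh
# pull_video (hypothetical helper): copies the recording from the device to the host.
#   $1  device IP address or hostname
#   $2  destination on the host (default: current directory)
# SCP is an assumed override for the copy command, for example SCP=echo for a dry run.
pull_video() {
    if [ -z "${1:-}" ]; then
        echo "usage: pull_video <ip> [destination]" >&2
        return 2
    fi
    "${SCP:-scp}" "root@$1:/etc/media/video.mp4" "${2:-.}"
}
```

After the copy, the file can be opened in any media player. On a host with the GStreamer tools installed, `gst-play-1.0 video.mp4` is one option.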

## Use qtivcomposer to mix original frame with detection mask

Run the use case on the target device:

    gst-launch-1.0 -e \
        qtiqmmfsrc name=camsrc ! video/x-raw,format=NV12_Q08C,width=1280,height=720,framerate=30/1 ! queue ! tee name=split \
        split. ! queue ! qtivcomposer name=mixer ! queue ! video/x-raw,format=NV12,width=1920,height=1080,interlace-mode=progressive,colorimetry=bt601 ! \
        v4l2h264enc capture-io-mode=4 output-io-mode=5 ! h264parse ! queue ! mp4mux ! queue ! filesink location=/etc/media/video.mp4 \
        split. ! queue ! qtimlvconverter ! queue ! qtimlsnpe delegate=dsp model=/etc/models/yolox-yolo-x-w8a8.dlc tensors="<boxes,scores,class_idx>" ! queue ! \
        qtimlpostprocess settings="{\"confidence\": 70.0}" results=5 module=yolov8 labels=/etc/labels/yolox.json ! video/x-raw,width=640,height=360 ! queue ! mixer.

To stop the use case, use **CTRL + C**.
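
The composer pipeline works with three frame sizes: 1280x720 from the camera, 640x360 for the detection mask, and 1920x1080 for the composed output. All three are 16:9. If you change any of them, and assuming you want the mask to line up with the frame without stretching, a quick check like the following sketch can catch a mismatched aspect ratio. `same_aspect` is a hypothetical helper name.

```shell
#!/bin/sh
# same_aspect (hypothetical helper): succeeds when w1:h1 and w2:h2 describe
# the same aspect ratio. Arguments: w1 h1 w2 h2.
# Cross-multiplication (w1*h2 == w2*h1) avoids floating-point arithmetic.
same_aspect() {
    [ $(( $1 * $4 )) -eq $(( $3 * $2 )) ]
}
```

For example, `same_aspect 640 360 1920 1080` succeeds, while `same_aspect 1280 720 640 480` fails because 640x480 is 4:3.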

The following figure shows the flow of the use case execution:

1. Identify objects in a video stream that comes from a camera source.
2. Compose the bounding boxes of the detected objects with the original video stream using qtivcomposer.
3. Encode this stream as an H.264 bitstream.
4. Multiplex the stream into an MP4 container and store it as an MP4 file.

<svg xmlns="http://www.w3.org/2000/svg" width="1053.934104919439051" height="357.535463333129883" viewbox="0 0 1053.934104919439051 357.535463333129883" aria-label="../../_images/pipeline_bounding_mask_encode_qtivcomposer.svg">
  <g id="Layer_1" data-name="Layer 1">
    <g>
      <rect x=".500267028808594" y=".499940872192383" width="1052.93359375" height="356.53515625" rx="7.499999999999957" ry="7.499999999999957" style="fill: #fafafa;"></rect>
      <path d="M1045.934104919439051,1c3.85986328125,0,7,3.140132904052734,7,7v341.535463333129883c0,3.85986328125-3.14013671875,7-7,7H8c-3.859870910644531,0-7-3.14013671875-7-7V8c0-3.859867095947266,3.140129089355469-7,7-7h1037.934104919439051M1045.934104919439051,0H8C3.581733703613281,0,0,3.581731796264648,0,8v341.535463333129883c0,4.41827392578125,3.581733703613281,8,8,8h1037.934104919439051c4.418212890619543,0,8-3.58172607421875,8-8V8c0-4.418268203735352-3.581787109380457-8-8-8h0Z" style="fill: #d2d7e1;"></path>
    </g>
    <g>
      <g>
        <text transform="translate(856.862510681152344 333.627168655395508)" style="font-family: Roboto-Regular, Roboto; font-size: 14px;"><tspan x="0" y="0">Qualcomm </tspan></text>
        <rect x="836.611544611447243" y="321.535699844360352" width="16" height="16" rx="2" ry="2" style="fill: #2a2aea;"></rect>
      </g>
      <g>
        <text transform="translate(955.444297790527344 333.627168655395508)" style="font-family: Roboto-Regular, Roboto; font-size: 14px;"><tspan x="0" y="0">Open source</tspan></text>
        <rect x="935.193326024300404" y="321.535699844360352" width="16" height="16" rx="2" ry="2" style="fill: #007884;"></rect>
      </g>
    </g>
  </g>
  <g id="Layer_2" data-name="Layer 2">
    <g>
      <g>
        <rect x="20.000055258294196" y="19.999946995800201" width="120" height="50" rx="4" ry="4" style="fill: #007884;"></rect>
        <text transform="translate(53.429786682128906 48.506645202636719)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">camsrc</tspan></text>
      </g>
      <g>
        <rect x="166.068116903908958" y="19.999946995800201" width="120" height="50" rx="4" ry="4" style="fill: #007884;"></rect>
        <text transform="translate(214.97833251953125 48.506645202636719)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">tee</tspan></text>
      </g>
      <g>
        <rect x="146.068116903908958" y="96.841006478947747" width="160" height="50" rx="4" ry="4" style="fill: #2a2aea;"></rect>
        <text transform="translate(170.595218658447266 125.347871780395508)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">qtimlvconverter</tspan></text>
      </g>
      <g>
        <line x1="140.534080505372003" y1="44.999948501586914" x2="160.510719299316406" y2="44.999948501586914" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="159.489433288574219 48.490381240844727 165.534080505372003 44.999948501586914 159.489433288574219 41.509515762329102 159.489433288574219 48.490381240844727"></polygon>
      </g>
      <g>
        <rect x="312.064964669495566" y="19.999946995800201" width="140" height="50" rx="4" ry="4" style="fill: #2a2aea;"></rect>
        <text transform="translate(333.205726623535156 49.675605297088623)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">qtivcomposer</tspan></text>
      </g>
      <g>
        <line x1="286.068107604980469" y1="44.999948501586914" x2="306.044761657714844" y2="44.999948501586914" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="305.023460388183594 48.490381240844727 311.068107604980469 44.999948501586914 305.023460388183594 41.509515762329102 305.023460388183594 48.490381240844727"></polygon>
      </g>
      <g>
        <line x1="452.064964294433594" y1="44.999948501586914" x2="472.041587829589844" y2="44.999948501586914" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="471.020286560058594 48.490381240844727 477.064964294433594 44.999948501586914 471.020286560058594 41.509515762329102 471.020286560058594 48.490381240844727"></polygon>
      </g>
      <g>
        <rect x="477.331980080897665" y="19.999946995800201" width="119.999999999999091" height="50" rx="4" ry="4" style="fill: #007884;"></rect>
        <text transform="translate(491.808616638183594 48.506645202636719)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">v4l2h264enc</tspan></text>
      </g>
      <g>
        <rect x="622.866010903704591" y="19.999946995800201" width="120" height="50" rx="4" ry="4" style="fill: #007884;"></rect>
        <text transform="translate(645.073081970214844 48.506645202636719)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">h264parse</tspan></text>
      </g>
      <g>
        <line x1="597.331993103027344" y1="44.999948501586914" x2="617.308616638183594" y2="44.999948501586914" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="616.287315368652344 48.490381240844727 622.331993103027344 44.999948501586914 616.287315368652344 41.509515762329102 616.287315368652344 48.490381240844727"></polygon>
      </g>
      <g>
        <rect x="768.400041726512427" y="19.999946995800201" width="120" height="50" rx="3.999999999999991" ry="3.999999999999991" style="fill: #007884;"></rect>
        <text transform="translate(797.021202087402344 48.506645202636719)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">mp4mux</tspan></text>
      </g>
      <g>
        <line x1="742.865989685058594" y1="44.999948501586914" x2="762.842674255371094" y2="44.999948501586914" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="761.821372985839844 48.490381240844727 767.865989685058594 44.999948501586914 761.821372985839844 41.509515762329102 761.821372985839844 48.490381240844727"></polygon>
      </g>
      <g>
        <rect x="913.934072549323901" y="19.999946995800201" width="120.000000000005457" height="50" rx="4" ry="4" style="fill: #007884;"></rect>
        <text transform="translate(948.500633239746094 48.506645202636719)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">filesink</tspan></text>
      </g>
      <g>
        <line x1="888.400047302246094" y1="44.999948501586914" x2="908.376670837402344" y2="44.999948501586914" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="907.355369567871094 48.490381240844727 913.400047302246094 44.999948501586914 907.355369567871094 41.509515762329102 907.355369567871094 48.490381240844727"></polygon>
      </g>
      <g>
        <line x1="226.068107604980469" y1="70.461526870727539" x2="226.068107604980469" y2="90.438165664671942" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="222.577690124511719 89.416872024536133 226.068107604980469 95.461526870727539 229.558555603027344 89.416872024536133 222.577690124511719 89.416872024536133"></polygon>
      </g>
      <g>
        <rect x="146.068116903908958" y="173.776563924028778" width="160" height="50" rx="4" ry="4" style="fill: #2a2aea;"></rect>
        <text transform="translate(190.747562408447266 202.283449172973633)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">qtimlsnpe</tspan></text>
      </g>
      <g>
        <line x1="226.068107604980469" y1="147.39708137512207" x2="226.068107604980469" y2="167.373720169067383" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="222.577690124511719 166.352434158325195 226.068107604980469 172.39708137512207 229.558555603027344 166.352434158325195 222.577690124511719 166.352434158325195"></polygon>
      </g>
      <g>
        <rect x="146.068116903908958" y="251.535518244678315" width="160" height="50" rx="4" ry="4" style="fill: #2a2aea;"></rect>
        <text transform="translate(170.817874908447266 280.042390823364258)" style="fill: #fff; font-family: Roboto-Regular, Roboto; font-size: 16px;"><tspan x="0" y="0">qtimlpostprocess</tspan></text>
      </g>
      <g>
        <line x1="226.068107604980469" y1="225.156038284301758" x2="226.068107604980469" y2="245.132692337036133" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></line>
        <polygon points="222.577690124511719 244.111391067505792 226.068107604980469 250.156038284301758 229.558555603027344 244.111391067505792 222.577690124511719 244.111391067505792"></polygon>
      </g>
      <g>
        <polyline points="306.068107604980469 276.535524368286133 382.064964294433594 276.535524368286133 382.064964294433594 75.484888076782227" style="fill: none; stroke: #000; stroke-miterlimit: 10;"></polyline>
        <polygon points="385.555381774902344 76.506181716918945 382.064964294433594 70.461526870727539 378.574546813964844 76.506181716918945 385.555381774902344 76.506181716918945"></polygon>
      </g>
    </g>
  </g>
</svg>

**Figure: Pipeline for bounding box mask and encode with qtivcomposer**

The following table provides the sequential processing stages of the pipeline execution:

Table: Pipeline processing stages for bounding box mask and encode with qtivcomposer

| Process | Description |
| --- | --- |
| qtiqmmfsrc | <ol class="arabic simple"><li><p>Captures the video stream from the camera source. The tee element that follows creates two copies of the stream:</p><ul class="simple"><li><p>One stream is sent to the qtivcomposer plugin to retain the video stream.</p></li><li><p>The other stream is sent to the ML inferencing pipeline.</p></li></ul></li></ol> |
| **Preprocessing** |
| qtimlvconverter | <ol class="arabic"><li><p>Receives the video stream on its sink pad.</p></li><li><p>Performs preprocessing:</p><ul class="simple"><li><p>Color conversion</p></li><li><p>Scaling up or down</p></li><li><p>Normalization of the stream data when the model expects floating-point values as input</p></li></ul></li><li><p>Converts the video stream to a tensor stream on its source pad.</p><p>The object detection model uses this tensor stream for inferencing.</p></li></ol> |
| **Inferencing** |
| qtimlsnpe | <ol class="arabic simple"><li><p>Loads the object detection model.</p></li><li><p>Modifies the graph for the chosen delegate.</p></li><li><p>Receives the tensor stream on its sink pad.</p></li><li><p>Runs the inference and produces a tensor stream with the object detection results on its source pad.</p></li></ol> |
| **Postprocessing** |
| qtimlpostprocess | <ol class="arabic"><li><p>Receives the inference tensors from the object detection model.</p></li><li><p>Converts the inference tensors on its sink pad into formats, such as video or text, that the multimedia plugins can process later.</p></li><li><p>Applies the confidence threshold and the chosen number of results.</p></li><li><p>Loads the corresponding modules for detection models.</p><p>In this use case, qtimlpostprocess does the following:</p><ol class="loweralpha simple"><li><p>Loads the YOLOv8 submodule.</p></li><li><p>Produces video frames that contain only the bounding boxes, which can be overlaid on the objects.</p></li><li><p>Sends them to the sink pad of qtivcomposer.</p></li></ol></li></ol> |
| qtivcomposer | <ol class="arabic simple"><li><p>Receives the original video stream and the video stream with bounding boxes on its sink pads.</p></li><li><p>On its source pad, produces frames composed from the video streams received on its sink pads.</p></li></ol> |
| v4l2h264enc | <ol class="arabic simple"><li><p>Applies parameters to each frame of the video stream that it receives on its sink pad.</p></li><li><p>Encodes the frames into a bitstream and sends it over its source pad.</p></li></ol> |
| h264parse | Adds more information about the bitstream to the GStreamer buffer metadata. |
| mp4mux | Receives these buffers and packs them into an MP4 container according to the format specification. |
| **Output** |
| filesink | Stores the resulting stream in the */etc/media/video.mp4* file. |
| Playback | From the host computer, pull *video.mp4* from the device and play it in a media player: `scp root@<ip>:/etc/media/video.mp4 <destination>` |

Last Published: Apr 02, 2026
