# Object detection and encode with Neural Processing SDK

This use case runs a `yolox.dlc` object detection model with the Qualcomm Neural Processing SDK to identify objects in a camera stream. It then overlays or composes bounding boxes over the detected objects and encodes the stream as an H.264 bitstream.

Download the [YOLOX](https://aihub.qualcomm.com/iot/models/yolox?searchTerm=yolox) Qualcomm AI runtime w8a8 precision model from AI Hub. The YOLOX model uses the YOLOv8 postprocessing module.

## Use qtivoverlay plugin to apply detection overlay

Run the use case on the target device:

gst-launch-1.0 -e filesrc location=/opt/Animals_000_1080p_180s_30FPS.mp4 ! qtdemux ! h264parse ! v4l2h264dec capture-io-mode=4 output-io-mode=4 ! tee name=split \
    split. ! queue ! qtimetamux name=metamux ! queue ! qtivoverlay ! queue ! \
    v4l2h264enc capture-io-mode=4 output-io-mode=4 ! h264parse ! mp4mux ! filesink location=/opt/video.mp4 \
    split. ! queue ! qtimlvconverter ! queue ! qtimlsnpe delegate=dsp tensors="<heatmap,bbox,landmark,landmark_visibility>" model=/etc/models/foot_track_net-person-foot-detection-w8a8.dlc ! queue ! qtimlpostprocess name=stage_01_postproc results=10 module=qpd labels=/etc/labels/foot_track_net.json settings=/etc/labels/foot_track_net_settings.json ! text/x-raw ! queue ! metamux.

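To stop the use case, press **CTRL + C**. Because `gst-launch-1.0` runs with the `-e` option, it sends an end-of-stream event on shutdown so that the muxer can finalize the MP4 file.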
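
Before launching the pipeline, it can help to confirm that the files and elements it references are present. The following is a minimal sketch, not part of the SDK: the helper names `missing_files` and `missing_elements` are hypothetical, and the check relies on the `--exists` option of `gst-inspect-1.0`. If `gst-inspect-1.0` itself is not installed, every element is reported as missing.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helpers, not part of the SDK).

# Print each argument that is not a readable regular file, one per line.
# Returns 0 when nothing is missing, 1 otherwise.
missing_files() {
    rc=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            printf '%s\n' "$f"
            rc=1
        fi
    done
    return "$rc"
}

# Print each GStreamer element that gst-inspect-1.0 cannot find.
# Returns 0 when all elements exist, 1 otherwise.
missing_elements() {
    rc=0
    for element in "$@"; do
        if ! gst-inspect-1.0 --exists "$element" >/dev/null 2>&1; then
            printf '%s\n' "$element"
            rc=1
        fi
    done
    return "$rc"
}

# Example calls, using the paths and elements from the command above:
#   missing_files /opt/Animals_000_1080p_180s_30FPS.mp4 \
#       /etc/models/foot_track_net-person-foot-detection-w8a8.dlc \
#       /etc/labels/foot_track_net.json \
#       /etc/labels/foot_track_net_settings.json
#   missing_elements qtimetamux qtivoverlay qtimlvconverter qtimlsnpe \
#       qtimlpostprocess v4l2h264dec v4l2h264enc
```

Both helpers only print what they could not find, so an empty output means that the pipeline inputs are in place.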

The following figure shows the flow of the use case execution:

1. Identify objects in a video stream that comes from a file source.
2. Overlay bounding boxes over the detected objects using overlaylib.
3. Encode this stream as an H.264 bitstream.
4. Multiplex the stream into an MP4 container and store it as an MP4 file.

**Figure: Pipeline for bounding box overlay and encode**

The following table provides the sequential processing stages of the pipeline execution:

| Process | Description |
| --- | --- |
| File source: filesrc | <ol class="arabic simple"><br><li><p>Captures the video stream using filesrc, followed by qtdemux, which demultiplexes the stream.</p></li><br><li><p>Uses tee to split the stream for inferencing.</p></li><br></ol> |
| h264parse | Parses the H.264 video. |
| [v4l2h264dec](https://docs.qualcomm.com/doc/80-80022-50/topic/v4l2h264dec.html) | Decodes the video. |
| **Preprocessing** | **Preprocessing** |
| [qtimlvconverter](https://docs.qualcomm.com/doc/80-80022-50/topic/qtimlvconverter.html) | <ol class="arabic"><br><li><p>Receives the video stream on its sink pad.</p></li><br><li><p>Performs preprocessing:</p><ul class="simple"><br><li><p>Color conversion</p></li><br><li><p>Scaling down/up</p></li><br><li><p>Normalization on the stream data when the model expects the floating point values as input</p></li><br></ul><br></li><br><li><p>Converts the video stream to a tensor stream on its source pad.</p><br><p>The object detection model uses this tensor stream for inferencing.</p><br></li><br></ol> |
| **Inferencing** | **Inferencing** |
| [qtimlsnpe](https://docs.qualcomm.com/doc/80-80022-50/topic/qtimlsnpe.html) | <ol class="arabic simple"><br><li><p>Loads the object detection model.</p></li><br><li><p>Modifies the graph for the chosen delegate.</p></li><br><li><p>Receives the tensor stream on its sinkpad.</p></li><br><li><p>Runs the inference and produces a tensor stream with the object detection results on its source pad.</p></li><br></ol> |
| **Postprocessing** | **Postprocessing** |
| [qtimlpostprocess](https://docs.qualcomm.com/doc/80-80022-50/topic/qtimlpostprocess.html) | <ol class="arabic"><br><li><p>Receives the inference tensors from the object detection model.</p></li><br><li><p>Converts the inference tensors on its sinkpad into formats like video or text that the multimedia plugins can process later.</p></li><br><li><p>Applies the threshold to the chosen number of results.</p></li><br><li><p>Loads the corresponding modules for detection models.</p><br><p>In this use case, qtimlpostprocess does the following:</p><ol class="loweralpha simple"><br><li><p>Loads the YOLOv8 submodule.</p></li><br><li><p>Produces results as structures of text.</p></li><br><li><p>Sends them to the sinkpad of qtimetamux.</p></li><br></ol><br></li><br></ol> |
| [qtimetamux](https://docs.qualcomm.com/doc/80-80022-50/topic/qtimetamux.html) | <ol class="arabic simple"><br><li><p>Receives video stream and text stream with the bounding box results corresponding to the video stream on its sinkpads.</p></li><br><li><p>Produces GST buffers with the contents of the video stream from its sink pad.</p></li><br><li><p>Adds the bounding boxes as <code class="docutils literal notranslate"><span class="pre">GstVideoRegionOfInterest</span></code> from data sinkpad to GST buffers meta (meta muxing) on its source pad.</p></li><br></ol> |
| [qtivoverlay](https://docs.qualcomm.com/doc/80-80022-50/topic/qtioverlay.html) | <ol class="arabic simple"><br><li><p>Receives the multiplexed stream.</p></li><br><li><p>Overlays the bounding boxes on the VideoFrame using CL.</p></li><br><li><p>Produces GST buffers with overlays in its source pad.</p></li><br></ol> |
| [v4l2h264enc](https://docs.qualcomm.com/doc/80-80022-50/topic/v4l2h264enc.html) | <ol class="arabic simple"><br><li><p>Applies parameters to each frame of the video stream it's receiving on its sinkpad.</p></li><br><li><p>Encodes it into bitstream and sends it over its sourcepad.</p></li><br></ol> |
| h264parse | Adds more information about the bitstream to the GStreamer buffer meta. |
| mp4mux | Receives these buffers and creates containers with format specification buffers. |
| **Output** | **Output** |
| filesink | Stores the resulting stream in the */opt/video.mp4* file, which is the location set in the command above. |
| Playback | Pull *video.mp4* from the device to the host computer and play it in a media player: `scp root@<ip>:/opt/video.mp4 <destination>` |
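
After copying the file, a quick sanity check can catch an empty or truncated output before opening a media player. The sketch below is only a heuristic and an assumption about what is useful here: it tests whether the file starts with an `ftyp` box, which is how an ISO base media (MP4) file normally begins. It doesn't prove that the file is playable. The function name `looks_like_mp4` is hypothetical.

```shell
#!/bin/sh
# Heuristic sketch: an MP4 file normally starts with a box whose 4-byte
# type field, at byte offsets 4 to 7, is "ftyp".
looks_like_mp4() {
    # Reject missing or empty files first.
    [ -s "$1" ] || return 1
    # Read bytes 4..7 of the file.
    box_type=$(dd if="$1" bs=1 skip=4 count=4 2>/dev/null)
    [ "$box_type" = "ftyp" ]
}

# Example call on the host, after the scp step:
#   looks_like_mp4 ./video.mp4 && echo "header looks like MP4"
```

If GStreamer tools are installed on the host, `gst-discoverer-1.0 video.mp4` gives a more thorough report of the container and its streams.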

## Use qtivcomposer to mix original frame with detection mask

Run the use case on the target device:

gst-launch-1.0 -e --gst-debug=2 \
    filesrc location=/opt/Animals_000_1080p_180s_30FPS.mp4 ! qtdemux ! h264parse ! v4l2h264dec capture-io-mode=4 output-io-mode=4 ! queue ! tee name=split \
    split. ! queue ! qtivcomposer name=mixer sink_1::position="<30, 30>" sink_1::dimensions="<320, 180>" ! queue ! \
    v4l2h264enc capture-io-mode=4 output-io-mode=5 ! h264parse ! queue ! mp4mux ! queue ! filesink location=/opt/video.mp4 \
    split. ! queue ! qtimlvconverter ! queue ! qtimlsnpe delegate=dsp tensors="<heatmap,bbox,landmark,landmark_visibility>" model=/etc/models/foot_track_net-person-foot-detection-w8a8.dlc ! queue ! qtimlpostprocess name=stage_01_postproc results=10 module=qpd labels=/etc/labels/foot_track_net.json settings=/etc/labels/foot_track_net_settings.json ! video/x-raw,format=BGRA,width=960,height=540 ! queue ! mixer.

To stop the use case, press **CTRL + C**.
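
The command sets the placement of the second composer input through `sink_1::position` and `sink_1::dimensions`. When you try other values, a small helper can confirm that the window stays inside the frame and print the properties in the same syntax. This is a sketch under assumptions: the helper names are hypothetical, and the 1920x1080 frame size in the example is inferred from the "1080p" in the sample file name. How qtivcomposer treats a window that extends past the frame isn't described here, so the check is only a guard.

```shell
#!/bin/sh
# Succeeds when a window of size W x H placed at (X, Y) lies fully inside
# a frame of size FRAME_W x FRAME_H. All arguments are integers.
window_fits() {
    x=$1; y=$2; w=$3; h=$4; frame_w=$5; frame_h=$6
    [ "$x" -ge 0 ] && [ "$y" -ge 0 ] &&
    [ "$w" -gt 0 ] && [ "$h" -gt 0 ] &&
    [ $((x + w)) -le "$frame_w" ] && [ $((y + h)) -le "$frame_h" ]
}

# Print the sink_1 pad properties as text to paste into the command line,
# in the same syntax that the command above uses.
composer_sink1_props() {
    printf 'sink_1::position="<%s, %s>" sink_1::dimensions="<%s, %s>"\n' \
        "$1" "$2" "$3" "$4"
}

# Example, with the values from the command above and an assumed
# 1920x1080 frame:
#   window_fits 30 30 320 180 1920 1080 && composer_sink1_props 30 30 320 180
```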

The following figure shows the flow of the use case execution:

1. Identify objects in a video stream that comes from a file source.
2. Compose the bounding boxes for the detected objects with the original video stream using qtivcomposer.
3. Encode this stream as an H.264 bitstream.
4. Multiplex the stream into an MP4 container and store it as an MP4 file.

<svg id="Layer_2" data-name="Layer 2" xmlns="http://www.w3.org/2000/svg" width="947.865921020507812" height="448.549873352050781" viewbox="0 0 947.865921020507812 448.549873352050781" aria-label="../../_images/pipeline_bounding_mask_encode_qtivcomposer.svg">
  <defs>
    <style>.svg-1 .cls-1 { fill: none; stroke: #000; stroke-miterlimit: 10 }
.svg-1 .cls-2 { fill: #fff; font-size: 16px }
.svg-1 .cls-2,.svg-1 .cls-3 { font-family: Roboto-Regular, Roboto }
.svg-1 .cls-4 { fill: #007884 }
.svg-1 .cls-5 { fill: #d2d7e1 }
.svg-1 .cls-6 { fill: #2a2aea }
.svg-1 .cls-3 { font-size: 14px }
.svg-1 .cls-7 { fill: #fafafa }</style>
  </defs>
  <g>
    <rect class="cls-7" x=".500198364257812" y=".50006103515625" width="946.8662109375" height="447.5498046875" rx="7.500000000000007" ry="7.500000000000007"></rect>
    <path class="cls-5" d="M939.865921020507812,1c3.8597412109375,0,7,3.140228271484375,7,7v432.549873352050781c0,3.859766006469727-3.1402587890625,7-7,7H8c-3.859771728515625,0-7-3.140233993530273-7-7V8c0-3.859771728515625,3.140228271484375-7,7-7h931.865921020507812M939.865921020507812,0H8C3.581764221191406,0,0,3.581756591796875,0,8v432.549873352050781c0,4.418235778808594,3.581764221191406,8,8,8h931.865921020507812c4.4183349609375,0,8-3.581764221191406,8-8V8c0-4.418243408203125-3.5816650390625-8-8-8h0Z"></path>
  </g>
  <g>
    <g>
      <text class="cls-3" transform="translate(757.439041137695312 424.64129638671875)"><tspan x="0" y="0">Qualcomm </tspan></text>
      <rect class="cls-6" x="737.188051817740416" y="412.549824622436063" width="16" height="16" rx="2" ry="2"></rect>
    </g>
    <g>
      <text class="cls-3" transform="translate(856.020858764648438 424.64129638671875)"><tspan x="0" y="0">Open source</tspan></text>
      <rect class="cls-4" x="835.769833230593576" y="412.549824622436063" width="16" height="16" rx="1.999999999999986" ry="1.999999999999986"></rect>
    </g>
  </g>
  <g>
    <rect class="cls-4" x="19.999986593745234" y="111.014308285106381" width="160" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(88.910202026367188 139.52099609375)"><tspan x="0" y="0">tee</tspan></text>
  </g>
  <g>
    <rect class="cls-6" x="19.999986593745234" y="187.855367768253927" width="160" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(44.527084350585938 216.36224365234375)"><tspan x="0" y="0">qtimlvconverter</tspan></text>
  </g>
  <g>
    <rect class="cls-6" x="205.996834359330933" y="111.014308285106381" width="140" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(227.137601852416992 140.689971923828125)"><tspan x="0" y="0">qtivcomposer</tspan></text>
  </g>
  <g>
    <line class="cls-1" x1="179.999984741210938" y1="136.014312744140625" x2="199.258987426757812" y2="136.014312744140625"></line>
    <polygon points="198.091812133789062 140.003387451171875 204.999984741210938 136.014312744140625 198.091812133789062 132.025238037109375 198.091812133789062 140.003387451171875"></polygon>
  </g>
  <g>
    <line class="cls-1" x1="345.996841430664062" y1="136.014312744140625" x2="365.255844116210938" y2="136.014312744140625"></line>
    <polygon points="364.088668823242188 140.003387451171875 370.996841430664062 136.014312744140625 364.088668823242188 132.025238037109375 364.088668823242188 140.003387451171875"></polygon>
  </g>
  <g>
    <rect class="cls-4" x="371.263849770733032" y="111.014308285106381" width="119.999999999999091" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(385.740478515625 139.52099609375)"><tspan x="0" y="0">v4l2h264enc</tspan></text>
  </g>
  <g>
    <rect class="cls-4" x="516.797880593539958" y="111.014308285106381" width="120" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(539.004959106445312 139.52099609375)"><tspan x="0" y="0">h264parse</tspan></text>
  </g>
  <g>
    <line class="cls-1" x1="491.263870239257812" y1="136.014312744140625" x2="510.522842407226562" y2="136.014312744140625"></line>
    <polygon points="509.355667114257812 140.003387451171875 516.263870239257812 136.014312744140625 509.355667114257812 132.025238037109375 509.355667114257812 140.003387451171875"></polygon>
  </g>
  <g>
    <rect class="cls-4" x="662.331911416347793" y="111.014308285106381" width="120" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(690.953079223632812 139.52099609375)"><tspan x="0" y="0">mp4mux</tspan></text>
  </g>
  <g>
    <line class="cls-1" x1="636.797866821289062" y1="136.014312744140625" x2="656.056900024414062" y2="136.014312744140625"></line>
    <polygon points="654.889724731445312 140.003387451171875 661.797866821289062 136.014312744140625 654.889724731445312 132.025238037109375 654.889724731445312 140.003387451171875"></polygon>
  </g>
  <g>
    <rect class="cls-4" x="807.865942239159267" y="111.014308285106381" width="120.000000000009095" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(842.432510375976562 139.52099609375)"><tspan x="0" y="0">filesink</tspan></text>
  </g>
  <g>
    <line class="cls-1" x1="782.331924438476562" y1="136.014312744140625" x2="801.590957641601562" y2="136.014312744140625"></line>
    <polygon points="800.423721313476562 140.003387451171875 807.331924438476562 136.014312744140625 800.423721313476562 132.025238037109375 800.423721313476562 140.003387451171875"></polygon>
  </g>
  <g>
    <line class="cls-1" x1="99.999984741210938" y1="161.47589111328125" x2="99.999984741210938" y2="180.734893798828125"></line>
    <polygon points="96.01092529296875 179.567718505859375 99.999984741210938 186.47589111328125 103.989044189453125 179.567718505859375 96.01092529296875 179.567718505859375"></polygon>
  </g>
  <g>
    <rect class="cls-6" x="19.999986593745234" y="264.790925213334958" width="160" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(64.679428100585938 293.297805786132812)"><tspan x="0" y="0">qtimlsnpe</tspan></text>
  </g>
  <g>
    <line class="cls-1" x1="99.999984741210938" y1="238.411453247070312" x2="99.999984741210938" y2="257.67047119140625"></line>
    <polygon points="96.01092529296875 256.503265380859375 99.999984741210938 263.411453247070312 103.989044189453125 256.503265380859375 96.01092529296875 256.503265380859375"></polygon>
  </g>
  <g>
    <rect class="cls-6" x="19.999986593745234" y="342.549879533984495" width="160" height="50" rx="4" ry="4"></rect>
    <text class="cls-2" transform="translate(37.722396850585938 371.056747436523438)"><tspan x="0" y="0">qtimlpostprocess</tspan></text>
  </g>
  <g>
    <line class="cls-1" x1="99.999984741210938" y1="316.170394897460938" x2="99.999984741210938" y2="335.429412841796875"></line>
    <polygon points="96.01092529296875 334.262222290039062 99.999984741210938 341.170402526855469 103.989044189453125 334.262222290039062 96.01092529296875 334.262222290039062"></polygon>
  </g>
  <g>
    <polyline class="cls-1" points="179.999984741210938 367.549873352050781 275.996963500976562 367.549880981445312 275.996963500976562 167.216888427734375"></polyline>
    <polygon points="279.986038208007812 168.384063720703125 275.996963500976562 161.47589111328125 272.007919311523438 168.384063720703125 279.986038208007812 168.384063720703125"></polygon>
  </g>
  <rect class="cls-4" x="20.000223287972403" y="20.045578671308249" width="160" height="50" rx="4" ry="4"></rect>
  <text class="cls-2" transform="translate(78.082275390625 48.552276611328125)"><tspan x="0" y="0">filesrc</tspan></text>
  <g>
    <line class="cls-1" x1="180.000228881835938" y1="45.04559326171875" x2="199.259231567382812" y2="45.04559326171875"></line>
    <polygon points="198.092056274414062 49.034637451171875 205.000228881835938 45.04559326171875 198.092056274414062 41.0565185546875 198.092056274414062 49.034637451171875"></polygon>
  </g>
  <rect class="cls-4" x="205.000223287972403" y="20.045578671308249" width="160" height="50" rx="4" ry="4"></rect>
  <text class="cls-2" transform="translate(253.70338249206543 48.552276611328125)"><tspan x="0" y="0">qtdemux</tspan></text>
  <g>
    <line class="cls-1" x1="365.000228881835938" y1="45.04559326171875" x2="384.259231567382812" y2="45.04559326171875"></line>
    <polygon points="383.092056274414062 49.034637451171875 390.000228881835938 45.04559326171875 383.092056274414062 41.0565185546875 383.092056274414062 49.034637451171875"></polygon>
  </g>
  <rect class="cls-4" x="390.000223287972403" y="20.045578671308249" width="160" height="50" rx="4" ry="4"></rect>
  <text class="cls-2" transform="translate(432.207244873046875 48.552276611328125)"><tspan x="0" y="0">h264parse</tspan></text>
  <g>
    <line class="cls-1" x1="550.000198364257812" y1="45.04559326171875" x2="569.259231567382812" y2="45.04559326171875"></line>
    <polygon points="568.092056274414062 49.034637451171875 575.000198364257812 45.04559326171875 568.092056274414062 41.0565185546875 568.092056274414062 49.034637451171875"></polygon>
  </g>
  <g>
    <polyline class="cls-1" points="99.999984741210938 105.2733154296875 99.999984741210938 90.552734375 655.000198364257812 90.552734375 655.000198364257812 70.091156005859375"></polyline>
    <polygon points="103.989044189453125 104.10614013671875 99.999984741210938 111.014312744140625 96.01092529296875 104.10614013671875 103.989044189453125 104.10614013671875"></polygon>
  </g>
  <rect class="cls-4" x="575.000223287972403" y="20.045578671308249" width="160" height="50" rx="4" ry="4"></rect>
  <text class="cls-2" transform="translate(609.379165649414062 48.552276611328125)"><tspan x="0" y="0">v4l2h264dec</tspan></text>
</svg>

**Figure: Pipeline for bounding box mask and encode with qtivcomposer**

The following table provides the sequential processing stages of the pipeline execution:

| Process | Description |
| --- | --- |
| File source: filesrc | <ol class="arabic simple"><br><li><p>Captures the video stream using filesrc, followed by qtdemux, which demultiplexes the stream.</p></li><br><li><p>Uses tee to split the stream for inferencing.</p></li><br></ol> |
| h264parse | Parses the H.264 video. |
| [v4l2h264dec](https://docs.qualcomm.com/doc/80-80022-50/topic/v4l2h264dec.html) | Decodes the video. |
| **Preprocessing** | **Preprocessing** |
| [qtimlvconverter](https://docs.qualcomm.com/doc/80-80022-50/topic/qtimlvconverter.html) | <ol class="arabic"><br><li><p>Receives the video stream on its sink pad.</p></li><br><li><p>Performs preprocessing:</p><ul class="simple"><br><li><p>Color conversion</p></li><br><li><p>Scaling down/up</p></li><br><li><p>Normalization on the stream data when the model expects the floating point values as input</p></li><br></ul><br></li><br><li><p>Converts the video stream to a tensor stream on its source pad.</p><br><p>The object detection model uses this tensor stream for inferencing.</p><br></li><br></ol> |
| **Inferencing** | **Inferencing** |
| [qtimlsnpe](https://docs.qualcomm.com/doc/80-80022-50/topic/qtimlsnpe.html) | <ol class="arabic simple"><br><li><p>Loads the object detection model.</p></li><br><li><p>Modifies the graph for the chosen delegate.</p></li><br><li><p>Receives the tensor stream on its sinkpad.</p></li><br><li><p>Runs the inference and produces a tensor stream with the object detection results on its source pad.</p></li><br></ol> |
| **Postprocessing** | **Postprocessing** |
| [qtimlpostprocess](https://docs.qualcomm.com/doc/80-80022-50/topic/qtimlpostprocess.html) | <ol class="arabic"><br><li><p>Receives the inference tensors from the object detection model.</p></li><br><li><p>Converts the inference tensors on its sinkpad into formats like video or text that the multimedia plugins can process later.</p></li><br><li><p>Applies the threshold to the chosen number of results.</p></li><br><li><p>Loads the corresponding modules for detection models.</p><br><p>In this use case, qtimlpostprocess does the following:</p><ol class="loweralpha simple"><br><li><p>Loads the YOLOv8 submodule.</p></li><br><li><p>Produces video frames with only bounding boxes that can be overlaid on objects.</p></li><br><li><p>Sends them to sinkpad of qtivcomposer.</p></li><br></ol><br></li><br></ol> |
| [qtivcomposer](https://docs.qualcomm.com/doc/80-80022-50/topic/qtivcomposer.html) | <ol class="arabic simple"><br><li><p>Receives the original video stream and the video stream with bounding boxes on its sinkpads.</p></li><br><li><p>On its source pad, produces content that's composed of the video streams received on its sinkpads.</p></li><br></ol> |
| [v4l2h264enc](https://docs.qualcomm.com/doc/80-80022-50/topic/v4l2h264enc.html) | <ol class="arabic simple"><br><li><p>Applies parameters to each frame of the video stream it's receiving on its sinkpad.</p></li><br><li><p>Encodes it into bitstream and sends it over its sourcepad.</p></li><br></ol> |
| h264parse | Adds more information about the bitstream to the GStreamer buffer meta. |
| mp4mux | Receives these buffers and creates containers with format specification buffers. |
| **Output** | **Output** |
| filesink | Stores the resulting stream in the */opt/video.mp4* file, which is the location set in the command above. |
| Playback | Pull *video.mp4* from the device to the host computer and play it in a media player: `scp root@<ip>:/opt/video.mp4 <destination>` |

Last Published: May 14, 2026


Source: [https://docs.qualcomm.com/doc/80-80022-50/topic/single-camera-stream-with-object-detection-and-encode-with-mobilenet-v2-ssd.html](https://docs.qualcomm.com/doc/80-80022-50/topic/single-camera-stream-with-object-detection-and-encode-with-mobilenet-v2-ssd.html)