Reproduce the inference speed of YOLOv4 using acceleration frameworks

6 minute read

Summary

Detailed steps to reproduce the inference speed of YOLOv4 reported at https://github.com/AlexeyAB/darknet . The following acceleration frameworks are tested.

  • tkDNN
  • OpenCV
  • TVM : not available yet
# Ubuntu : 18.04
# Device : Gtx1060
# Cuda   : 10.1 
# OpenCV : master, 4.4-pre
# tkDNN  : 0.5
# TVM    : 0.7

Before going further, please follow the steps at tensorflow install gpu to install CUDA, cuDNN, and TensorRT. All commands use apt-get, so it will save a lot of time and effort.
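As a quick sanity check that the GPU stack is visible after the install, something like the following should work (a minimal sketch; it assumes nvidia-smi and nvcc are on PATH):

import subprocess

# Both commands should succeed and print the driver/GPU status and the
# CUDA compiler version if the apt-get installs above went through.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)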

The workspace for this post is at https://github.com/gachiemchiep/source_code/tree/master/yolov4

tkDNN

Prepare

/datadrive/workspace/tkDNN
├── darknet                 : customized darknet fork used by tkDNN
├── data                    : where the yolov4 weights and config files are stored
    ├── yolov4
        ├── debug
        ├── layers
        ├── yolov4.cfg
        └── yolov4.weights
├── tkDNN                   : tkDNN source code
└── tkDNN.build             : build directory of tkDNN

Go to the tkDNN GitHub repository and follow the compiling instructions to build tkDNN.


# Download tkDNN and fulfill the requirements

# Point CMake at the custom OpenCV build location
# Edit CMakeLists.txt
set(OpenCV_DIR ~/opt/opencv/opencv.build/)
find_package(OpenCV REQUIRED PATHS "~/opt/opencv/opencv.build/")

# then compile 
mkdir tkDNN.build; cd tkDNN.build
cmake ../tkDNN
make

Testing with Yolov4 (608x608)

The model YOLOv4 (608x608) is used for testing. Download yolov4.weights, yolov4.cfg, and coco.names.

(Note: tkDNN uses yolov4 416x416 by default.)

├── data                    
    ├── yolov4
        ├── yolov4.cfg
        └── yolov4.weights

Using tkDNN to accelerate YOLOv4 inference requires the following three steps:

  • Export weights from darknet
## a. Download and compile the customized darknet used to export the weights
cd darknet; make
## b. export weights
mkdir layers debug
./darknet export ../data/yolov4/yolov4.cfg ../data/yolov4/yolov4.weights layers
## c. then move the layers and debug directories into data/yolov4 for storage
├── data          
    ├── yolov4
        ├── debug
        ├── layers 

  • Combine into an RT file. Replace the content of yolo4.cpp's main method with the following (it just changes some paths, no big deal). The modified file is at yolo4.cpp.
int main() {
    std::string bin_path  = "../data/yolov4";
    std::vector<std::string> input_bins = { 
        bin_path + "/layers/input.bin"
    };
    std::vector<std::string> output_bins = {
        bin_path + "/debug/layer139_out.bin",
        bin_path + "/debug/layer150_out.bin",
        bin_path + "/debug/layer161_out.bin"
    };
    std::string wgs_path  = bin_path + "/layers";
    std::string cfg_path  = bin_path + "/yolov4.cfg";
    std::string name_path = bin_path + "/coco.names";
    std::cout << cfg_path << std::endl;
    std::cout << name_path << std::endl;

    // parse darknet network
    tk::dnn::Network *net = tk::dnn::darknetParser(cfg_path, wgs_path, name_path);
    net->print();

    //convert network to tensorRT
    tk::dnn::NetworkRT *netRT = new tk::dnn::NetworkRT(net, net->getNetworkRTName("yolov4"));
    
    int ret = testInference(input_bins, output_bins, net, netRT);
    net->releaseLayers();
    delete net;
    delete netRT;
    return ret;
}

## b. create the model in TensorRT format (.rt file)
cd tkDNN.build
make

# Export fp32 model
rm yolo4_fp32.rt; 
export TKDNN_MODE=FP32
./test_yolo4
# Export fp16 model
rm yolo4_fp16.rt; 
export TKDNN_MODE=FP16
./test_yolo4
  • Then test the inference:
# fp32
./demo yolo4_fp32.rt ../tkDNN/demo/yolo_test.mp4 y
# fp16
./demo yolo4_fp16.rt ../tkDNN/demo/yolo_test.mp4 y

Result

# yolov4 608x608 fp32
...
RtBuffer 0   dim: Data dim: 1 3 608 608 1
RtBuffer 1   dim: Data dim: 1 255 76 76 1
RtBuffer 2   dim: Data dim: 1 255 38 38 1
RtBuffer 3   dim: Data dim: 1 255 19 19 1
camera started
....

Time stats:
Min: 46.8783 ms
Max: 88.2218 ms
Avg: 49.6727 ms	20.1318 FPS

# yolov4 608x608 fp16
RtBuffer 0   dim: Data dim: 1 3 608 608 1
RtBuffer 1   dim: Data dim: 1 255 76 76 1
RtBuffer 2   dim: Data dim: 1 255 38 38 1
RtBuffer 3   dim: Data dim: 1 255 19 19 1


Time stats:
Min: 46.6338 ms
Max: 66.4354 ms
Avg: 51.1639 ms	19.545 FPS

OpenCV

Prepare

We need to compile OpenCV using the following command. If you're using a newer card, please add your architecture to the CUDA_ARCH_BIN parameter. See Matching SM architectures (CUDA arch and CUDA gencode) for various NVIDIA cards for a list.

# Download the lastest version of opencv and opencv_contrib and put at following location
~/opt/opencv
├── opencv                  : opencv master
├── opencv_contrib          : opencv contrib master branch
├── opencv.build            : build directory

# Install miniconda at ~/miniconda3 and create the py2 and py3 environments
conda create --name py2 python=2.7
conda create --name py3 python=3.7

# activate those environments and install numpy
conda activate py2; pip install numpy
conda activate py3; pip install numpy

# Build opencv
cd opencv.build
# Note: check whether your card can support CUDA_ARCH_BIN
cmake -D CMAKE_BUILD_TYPE=RELEASE \
    -D CMAKE_INSTALL_PREFIX=/usr/local \
    -D OPENCV_EXTRA_MODULES_PATH=~/opt/opencv/opencv_contrib/modules \
    -D OPENCV_ENABLE_NONFREE=OFF \
    -D BUILD_EXAMPLES=OFF \
    -D BUILD_DOCS=OFF \
    -D BUILD_PERF_TESTS=OFF \
    -D BUILD_TESTS=OFF \
    -D BUILD_NEW_PYTHON_SUPPORT=ON \
    -D BUILD_opencv_python3=ON \
    -D BUILD_opencv_python2=ON \
    -D HAVE_opencv_python3=ON \
    -D INSTALL_PYTHON_EXAMPLES=OFF \
    -D OPENCV_FORCE_PYTHON_LIBS=ON \
    -D PYTHON2_EXECUTABLE=~/miniconda3/envs/py2/bin/python \
    -D PYTHON2_LIBRARY=~/miniconda3/envs/py2/lib/libpython2.7.so \
    -D PYTHON2_INCLUDE_DIRS=~/miniconda3/envs/py2/include \
    -D PYTHON2_NUMPY_INCLUDE_DIRS=~/miniconda3/envs/py2/lib/python2.7/site-packages/numpy \
    -D PYTHON3_EXECUTABLE=~/miniconda3/envs/py3/bin/python \
    -D PYTHON3_LIBRARY=~/miniconda3/envs/py3/lib/libpython3.7m.so \
    -D PYTHON3_INCLUDE_DIRS=~/miniconda3/envs/py3/include \
    -D PYTHON3_NUMPY_INCLUDE_DIRS=~/miniconda3/envs/py3/lib/python3.7/site-packages/numpy \
    -D ENABLE_FAST_MATH=ON \
    -D CUDA_FAST_MATH=ON \
    -D WITH_CUBLAS=ON \
    -D WITH_LIBV4L=ON \
    -D WITH_GSTREAMER=ON \
    -D WITH_GSTREAMER_0_10=OFF \
    -D WITH_TBB=ON \
    -D WITH_CUDA=ON -D WITH_CUDNN=ON -D WITH_NVCUVID=ON \
    -D OPENCV_DNN_CUDA=ON \
    -D WITH_FFMPEG=ON \
    -D WITH_TIFF=ON \
    -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.1 \
    -D CUDA_ARCH_BIN="6.0 6.2 7.0 7.5" -D CUDA_ARCH_PTX="" ../opencv

make -j8

# Copy into corresponding environments
# py2
cp lib/cv2.so ~/miniconda3/envs/py2/lib/python2.7/site-packages
# py3
cp lib/python3/cv2.cpython-37m-x86_64-linux-gnu.so ~/miniconda3/envs/py3/lib/python3.7/site-packages/
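To verify that the CUDA-enabled build landed in the right environment, a quick check (assuming the copy above succeeded):

import cv2

# The version should report the master build (4.4-pre) and the device
# count should be >= 1 if OpenCV was compiled with CUDA support.
print(cv2.__version__)
print(cv2.cuda.getCudaEnabledDeviceCount())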

Note: remember to compile with OPENCV_DNN_CUDA=ON, or else we won't be able to use DNN_TARGET_CUDA_FP16.
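For reference, this is roughly what demo_wc.py does to route inference through CUDA (a minimal sketch, not the full script; the cfg/weights/video paths are placeholders for the ones used below):

import cv2

# Load the darknet model and select the CUDA backend of the DNN module.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
# DNN_TARGET_CUDA runs fp32; DNN_TARGET_CUDA_FP16 runs half precision and
# is only available when OpenCV was compiled with OPENCV_DNN_CUDA=ON.
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

# One inference pass on a frame from the demo video.
cap = cv2.VideoCapture("yolo_test.mp4")
ok, frame = cap.read()
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (608, 608), swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())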

Testing with Yolov4 (608x608)

The model YOLOv4 (608x608) is used for testing. Download yolov4.weights, yolov4.cfg, coco.names, demo_wc.py, and yolo_test.mp4.

# Then use the following script to run YOLOv4 on the tkDNN demo video.
python demo_wc.py --cfg /datadrive/workspace/tkDNN/data/yolov4/yolov4.cfg --weight /datadrive/workspace/tkDNN/data/yolov4/yolov4.weights --shape 608 --batch_size 1 --classes /datadrive/workspace/tkDNN/data/yolov4/coco.names --file_path /datadrive/workspace/tkDNN/tkDNN/demo/yolo_test.mp4

Result

# fp32
Namespace(batch_size=1, cfg='/datadrive/workspace/tkDNN/data/yolov4/yolov4.cfg', classes='/datadrive/workspace/tkDNN/data/yolov4/coco.names', conf=0.5, file_path='/datadrive/workspace/tkDNN/tkDNN/demo/yolo_test.mp4', mode='fp32', nms=0.6, shape=608, weight='/datadrive/workspace/tkDNN/data/yolov4/yolov4.weights')
Init network in: 689.98 ms
Warmup network in: 482.0 ms
Min: 59.8621 ms
Max: 74.6419 ms
Avg: 61.6272 ms 16.2266 FPS

# fp16 : fp16 doesn't seem to work on my GTX 1060 (Pascal consumer cards have very low native fp16 throughput)

Namespace(batch_size=1, cfg='/datadrive/workspace/tkDNN/data/yolov4/yolov4.cfg', classes='/datadrive/workspace/tkDNN/data/yolov4/coco.names', conf=0.5, file_path='/datadrive/workspace/tkDNN/tkDNN/demo/yolo_test.mp4', mode='fp16', nms=0.6, shape=608, weight='/datadrive/workspace/tkDNN/data/yolov4/yolov4.weights')
Init network in: 707.43 ms
Warmup network in: 3075.9 ms
^CMin: 2595.4211 ms
Max: 2636.2209 ms
Avg: 2614.7480 ms 0.3824 FPS

Note

The post-processing used at https://github.com/opencv/opencv/blob/master/samples/dnn/object_detection.py runs very slowly. In some cases post-processing costs more time than the inference itself, so we rewrote it using numpy and matrix manipulation. The results are as follows (each timing covers 100 runs; a sketch of the vectorized version comes after the timings).

# Ver 0 : use opencv version
1 loop, best of 3: 27.4 s per loop
-> postprocess = 27.4 / 100 = 0.274 sec = 274 ms

# Ver 1 : use numpy argmax (remove the opencv for loop)
1 loop, best of 3: 1.46 s per loop
-> postprocess = 1.46 / 100 = 0.0146 = 14.6 ms

# Ver 2 : merge features and do postprocess for each image
1 loop, best of 3: 1.54 s per loop
-> postprocess  = 1.54 / 100 = 0.0154 = 15.4 ms

# Ver 3: merge all features, use confidence threshold to remove invalid bbox, then do nms for each image
# note : each image has different valid bboxes. so in the last step we must use the for loop
1 loop, best of 3: 1.45 s per loop
-> post process = 1.45 / 100 = 0.0145 = 14.5 ms
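A minimal sketch of the vectorized idea (Ver 1 + Ver 3), assuming outs is the list of (N, 85) arrays net.forward() returns for one image (cx, cy, w, h normalized, then objectness and 80 class scores); the function name and thresholds are illustrative:

import numpy as np
import cv2

def postprocess(outs, img_w, img_h, conf_thr=0.5, nms_thr=0.6):
    preds = np.concatenate(outs, axis=0)           # Ver 3: merge all feature scales
    class_ids = np.argmax(preds[:, 5:], axis=1)    # Ver 1: numpy argmax instead of a python for-loop
    scores = preds[np.arange(len(preds)), 5 + class_ids]
    keep = scores > conf_thr                       # drop invalid bboxes before NMS
    preds, class_ids, scores = preds[keep], class_ids[keep], scores[keep]
    # convert normalized center-format boxes to pixel top-left format for NMSBoxes
    cx, cy = preds[:, 0] * img_w, preds[:, 1] * img_h
    w, h = preds[:, 2] * img_w, preds[:, 3] * img_h
    boxes = np.stack([cx - w / 2, cy - h / 2, w, h], axis=1).astype(int).tolist()
    idxs = cv2.dnn.NMSBoxes(boxes, scores.tolist(), conf_thr, nms_thr)
    return [(class_ids[i], scores[i], boxes[i]) for i in np.array(idxs).flatten()]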

TVM

DOESN'T WORK YET

Preparation

Follow tvm install from source to compile and install the TVM Python wrapper.

# install llvm : https://apt.llvm.org/
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
# use the config.cmake at :
# or edit it following https://tvm.apache.org/docs/install/from_source.html

cd build
cmake ..
make -j4

# use the py3 environment
conda activate py3
cd python; python setup.py install; cd ..
cd topi/python; python setup.py install

Testing with Yolov4 (608x608)

The model YOLOv4 (608x608) is used for testing. Download yolov4.weights, yolov4.cfg, coco.names, demo_wc.py, and yolo_test.mp4.

Result

The Compile YOLO-V2 and YOLO-V3 in DarkNet Models tutorial doesn't work for yolov4 yet; a rough sketch of the call path it takes is shown below.
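This is roughly what that tutorial does (a hedged sketch against the TVM 0.7 API; the library and model paths are illustrative):

from tvm import relay
from tvm.relay.testing.darknet import __darknetffi__

# Load darknet as a shared library and parse the network, as the
# YOLO-V2/V3 tutorial does.
lib = __darknetffi__.dlopen("./libdarknet.so")
net = lib.load_network("yolov4.cfg".encode("utf-8"),
                       "yolov4.weights".encode("utf-8"), 0)
data_shape = (1, net.c, net.h, net.w)
# Convert the darknet graph to a Relay module; this is the step that
# fails for yolov4 as of this writing.
mod, params = relay.frontend.from_darknet(net, dtype="float32", shape=data_shape)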

Result comparison

Device = GTX1060, input size = 608x608

FP     tkDNN (ms)   OpenCV (ms)
FP32   49.6727      61.6272
FP16   51.1639      XXX
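The FPS figures quoted throughout are just 1000 divided by the average latency in milliseconds:

# Reproduce the FPS numbers from the average latencies above.
for name, avg_ms in [("tkDNN fp32", 49.6727), ("tkDNN fp16", 51.1639),
                     ("OpenCV fp32", 61.6272)]:
    print("%s: %.4f FPS" % (name, 1000.0 / avg_ms))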

Personal thoughts

  • tkDNN : very fast, but the code isn't mature and you need to write C++ code
  • OpenCV : fast, easy to use, doesn't need many steps, and can be used from both C++ and Python
  • TVM : not working yet, but based on the official documentation it can be used from Python

Reference

  1. tkDNN
  2. tensorflow install gpu
  3. Matching SM architectures (CUDA arch and CUDA gencode) for various NVIDIA cards
  4. tkDNN install opencv4
  5. opencv dnn
  6. tvm
  7. tvm: yolov2 + yolov3
  8. tvm install from source
