Training automl efficientdet-lite4

1 idea was running efficientlion-lite4.onnx in the 32 bit tensorflow backend, extracting the computed value of K & using graphsurgeon to insert it back in. If there was a way to precompute K, polygraphy should have already done it.

1 idea was using the pretrained efficientdet-lite4 checkpoint from

https://github.com/google/automl/tree/master/efficientdet

https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco/efficientdet-lite4.tgz

with cropping. This was the only one which made it to tensorrt. The problem is efficientdet-lite was already shown to not do the job unless it was trained specifically on lion/human hybrids.

Checking the ONNX dump, automl was a radically different model with no topK operator.

Another idea was creating another model quantized to INT8 so https://github.com/zhenhuaw-me/tflite2onnx could get to the next step, but they might all use the same topK operator.

Another hit introduced the concept of making tensorrt plugins for the offending operators. Source code for tensorrt would be nice, but it's an nvidia-only program.

Another go at training the automl model seemed like the easiest idea. There's not much on training it besides a whitepaper. There's an example command in a ponderously long tutorial.ipynb:

python3 main.py \
--mode=train_and_eval \
--train_file_pattern='../../train_lion/pascal-00000-of-00010.tfrecord' \
--val_file_pattern='../../val_lion/pascal-00000-of-00010.tfrecord' \
--model_name=efficientdet-lite4  \
--model_dir=../../efficientlion-lite4/ \
--backbone_ckpt=efficientnet-b4  \
--train_batch_size=1  \
--eval_batch_size=1 \
--eval_samples=100 \
--num_examples_per_epoch=1000 \
--hparams="num_classes=1,moving_average_decay=0,mixed_precision=true" \
--num_epochs=300

model_name: efficientdet-lite0-4
num_examples_per_epoch: is the number of training images
eval_samples: is the number of validation images
train_batch_size, eval_batch_size: are the batch sizes, limited by RAM
model_dir: is the destination directory
num_classes: is the number of object types
backbone_ckpt: directory with the starting checkpoint.
train_file_pattern, val_file_pattern: shortpaw notation for a range of files in a data set directory

num_epochs: the README shows all the efficientdets using 300
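Since num_examples_per_epoch & eval_samples are just image counts, they can be read out of the annotation JSON instead of being counted by paw. A minimal sketch, assuming the JSON files follow the usual COCO layout with a top-level "images" list. count_images is a made-up helper, not part of automl.

```python
import json

def count_images(coco_json_path):
    """Return the number of entries in the top-level "images" list."""
    with open(coco_json_path) as f:
        info = json.load(f)
    return len(info["images"])

# Example use (paths taken from the commands in this log):
#   count_images("../../train_lion/instances_train.json")  -> num_examples_per_epoch
#   count_images("../../val_lion/instances_val.json")      -> eval_samples
```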

The tutorial downloads the starting checkpoint from

https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b4.tar.gz

There's an efficientnet-b* file for each efficientdet model.

The tutorial downloads the training & validation images from

http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar

Then runs a program to create the tfrecord metadata

mkdir tfrecord
PYTHONPATH=. python3 dataset/create_pascal_tfrecord.py --data_dir=VOCdevkit --year=VOC2012 --output_path=tfrecord/pascal

It's important to specify the PYTHONPATH.

The VOC dataset has a really complicated structure. The tfrecords are binary files containing metadata + JPEGs.
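To check that a tfrecord file actually contains something without loading tensorflow, the record framing can be walked directly. Each record is an 8 byte little-endian length, a 4 byte CRC of the length, the payload, and a 4 byte CRC of the payload. A minimal sketch under that layout; count_tfrecord_records is a hypothetical helper and it skips the CRCs rather than verifying them.

```python
import struct

def count_tfrecord_records(path):
    """Count records by walking the TFRecord framing. CRCs are skipped, not verified."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)            # uint64 little-endian payload length
            if not header:
                break                     # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length field")
            (length,) = struct.unpack("<Q", header)
            f.seek(4, 1)                  # skip the CRC of the length
            payload = f.read(length)      # serialized example: metadata + JPEG bytes
            if len(payload) < length:
                raise ValueError("truncated record")
            f.seek(4, 1)                  # skip the CRC of the payload
            count += 1
    return count
```

Summed over every shard, the count should equal the number of images that went into the data set.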

There's also a create_coco_tfrecord.py which takes a JSON file. Run it twice to make the train & val data sets.

PYTHONPATH=. python3 dataset/create_coco_tfrecord.py --image_dir=../../train_lion --image_info_file=../../train_lion/instances_train.json --output_file_prefix=../../train_lion/pascal --num_shards=10

PYTHONPATH=. python3 dataset/create_coco_tfrecord.py --image_dir=../../val_lion --image_info_file=../../val_lion/instances_val.json --output_file_prefix=../../val_lion/pascal --num_shards=10
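With --num_shards=10, each prefix presumably produces 10 files named like the pascal-00000-of-00010.tfrecord seen in the training command. A sketch under that naming assumption, showing which shards a given file pattern would select. fnmatch stands in for whatever wildcard expansion main.py really does, so treat the matching rules as an assumption too.

```python
import fnmatch

def shard_names(prefix, num_shards):
    """File names assumed to be written for a prefix: prefix-00000-of-00010.tfrecord, ..."""
    return ["%s-%05d-of-%05d.tfrecord" % (prefix, i, num_shards)
            for i in range(num_shards)]

def matching(pattern, names):
    """Names selected by a shell-style wildcard pattern."""
    return fnmatch.filter(names, pattern)

names = shard_names("pascal", 10)
print(len(matching("pascal-00000-of-00010.tfrecord", names)))  # 1
print(len(matching("pascal-*-of-00010.tfrecord", names)))      # 10
```

If the trainer takes the pattern literally, a file name with no wildcard would feed it only the 1st shard, so a pattern with a * might be worth trying.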

With the training function ingesting this data, there's not much verbosity. It saves every epoch in model_dir & loads the last saved epoch from model_dir when it starts. It burns 8 minutes per epoch for efficientdet-lite4.
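For a sense of scale, the 8 minutes per epoch & the 300 epochs from the command can be multiplied out. A trivial sketch, assuming the epoch time stays constant for the whole run:

```python
def total_training_hours(minutes_per_epoch, num_epochs):
    """Wall clock estimate, assuming every epoch takes the same time."""
    return minutes_per_epoch * num_epochs / 60

print(total_training_hours(8, 300))  # 40.0
```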

Then it's about repeating the successful tensorrt conversion on the same computer which did the training.

https://hackaday.io/project/190480-robot-mounted-tracking-cam/log/220398-converting-tflite-to-tensorrt

Create the efficientlion-lite4.yaml file in ../../efficientlion-lite4/

---
image_size: 640x640
num_classes: 1 
moving_average_decay: 0
nms_configs: 
     method: hard
     iou_thresh: 0.35
     score_thresh: 0.
     sigma: 0.0
     pyfunc: False
     max_nms_inputs: 0
     max_output_size: 100

num_classes worked around a mismatched layer size

moving_average_decay worked around a missing ExponentialMovingAverage operator
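Both workarounds come down to the export config agreeing with the --hparams string used for training. A sketch which parses that string & reports any keys that disagree with the yaml. parse_hparams & mismatches are hypothetical helpers, not automl's own parser, and values are compared as plain strings.

```python
def parse_hparams(text):
    """Split 'a=1,b=2' into {'a': '1', 'b': '2'}; values stay as strings."""
    pairs = (item.split("=", 1) for item in text.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}

def mismatches(train, export):
    """Keys present in both dicts whose values differ."""
    return {k: (train[k], export[k])
            for k in train if k in export and train[k] != export[k]}

# The --hparams string from the training command
train = parse_hparams("num_classes=1,moving_average_decay=0,mixed_precision=true")
# The matching keys copied by hand from the yaml above
export = {"num_classes": "1", "moving_average_decay": "0"}
print(mismatches(train, export))  # {}
```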

The warn= bug fixes still applied, but it didn't fail on eager execution.

Inside automl/efficientdet/tf2/ run

PYTHONPATH=.:.. python3 inspector.py --mode=export --model_name=efficientdet-lite4 --model_dir=../../../efficientlion-lite4/ --saved_model_dir=../../../efficientlion-lite4.out --hparams=../../../efficientlion-lite4/efficientlion-lite4.yaml

Then copy the efficientlion-lite4.out/ directory to the jetson nano. Enable the swap space. Inside TensorRT/samples/python/efficientdet run

OPENBLAS_CORETYPE=CORTEXA57 python3 create_onnx.py --input_size="640,640" --saved_model=/root/efficientlion-lite4.out --onnx=/root/efficientlion-lite4.out/efficientlion-lite4.onnx

Finally comes the trtexec command

/usr/src/tensorrt/bin/trtexec --fp16 --workspace=1024 --onnx=/root/efficientlion-lite4.out/efficientlion-lite4.onnx --saveEngine=/root/efficientlion-lite4.out/efficientlion-lite4.engine

That worked. It was the lion kingdom's 1st end to end trained object detector in tensorrt format. Very important to specify --fp16.
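For repeating the conversion on other models, the trtexec arguments can be assembled in a script so --fp16 doesn't get forgotten. A minimal sketch using only the flags from the command above. trtexec_command is a made-up helper, and the subprocess call is left commented out since it only makes sense on the jetson.

```python
import shlex

def trtexec_command(onnx_path, engine_path, workspace_mb=1024, fp16=True):
    """Build the argument list for the trtexec invocation used in this log."""
    cmd = ["/usr/src/tensorrt/bin/trtexec"]
    if fp16:
        cmd.append("--fp16")
    cmd += ["--workspace=%d" % workspace_mb,
            "--onnx=" + onnx_path,
            "--saveEngine=" + engine_path]
    return cmd

base = "/root/efficientlion-lite4.out/efficientlion-lite4"
cmd = trtexec_command(base + ".onnx", base + ".engine")
print(" ".join(shlex.quote(arg) for arg in cmd))
# import subprocess; subprocess.run(cmd, check=True)  # run this on the jetson
```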
