1 idea was running efficientlion-lite4.onnx in the 32 bit tensorflow backend, extracting the computed value of K & using graphsurgeon to insert it back in. If there was a way to precompute K, polygraphy should have already done it.
1 idea was using the pretrained efficientdet-lite4 checkpoint from
https://github.com/google/automl/tree/master/efficientdet
https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco/efficientdet-lite4.tgz
with cropping. This was the only one which made it to tensorrt. The problem is efficientdet-lite was already shown to not do the job unless it was trained specifically on lion/human hybrids.
Checking the ONNX dump, automl was a radically different model with no topK operator.
Another idea was creating another model quantized to INT8 so https://github.com/zhenhuaw-me/tflite2onnx could get to the next step, but they might all use the same topK operator.
Another hit introduced the concept of making tensorrt plugins for the offending operators. Source code for tensorrt would be nice, but it's an nvidia-only program.
Another go at training the automl model seemed like the easiest idea. There's not much on training it besides a whitepaper. There's an example command in a ponderously long tutorial.ipynb
python3 main.py \
    --mode=train_and_eval \
    --train_file_pattern='../../train_lion/pascal-00000-of-00010.tfrecord' \
    --val_file_pattern='../../val_lion/pascal-00000-of-00010.tfrecord' \
    --model_name=efficientdet-lite4 \
    --model_dir=../../efficientlion-lite4/ \
    --backbone_ckpt=efficientnet-b4 \
    --train_batch_size=1 \
    --eval_batch_size=1 \
    --eval_samples=100 \
    --num_examples_per_epoch=1000 \
    --hparams="num_classes=1,moving_average_decay=0,mixed_precision=true" \
    --num_epochs=300
model_name: one of efficientdet-lite0 through efficientdet-lite4
num_examples_per_epoch: is the number of training images
eval_samples: is the number of validation images
train_batch_size, eval_batch_size: are the batch sizes, limited by RAM
model_dir: is the destination directory
num_classes: is the number of object types
backbone_ckpt: directory with the starting checkpoint.
train_file_pattern, val_file_pattern: shortpaw notation for a range of files in a data set directory
num_epochs: the README shows all the efficientdets using 300
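The file patterns behave like shell wildcards. A minimal sketch, assuming the shards are named in the prefix-NNNNN-of-NNNNN.tfrecord form seen in the command above & assuming the trainer expands the pattern as a glob (not checked against the automl source). The helper shard_names is made up for illustration.

```python
import fnmatch

def shard_names(prefix, num_shards):
    """Build names like pascal-00000-of-00010.tfrecord (assumed naming)."""
    return ["%s-%05d-of-%05d.tfrecord" % (prefix, i, num_shards)
            for i in range(num_shards)]

shards = shard_names("pascal", 10)

# The pattern from the example command names exactly 1 shard.
single = fnmatch.filter(shards, "pascal-00000-of-00010.tfrecord")

# A wildcard in the shard number covers all of them.
everything = fnmatch.filter(shards, "pascal-*-of-00010.tfrecord")

print(len(single), len(everything))
```

If the wildcard assumption holds, passing 'pascal-*-of-00010.tfrecord' would feed every shard to the trainer instead of just the 1st one.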
He downloads the starting checkpoint from
https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b4.tar.gz
There's an efficientnet-b* file for each efficientdet model.
He downloads the training & validation images from
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
Then runs a program to create the tfrecord metadata
mkdir tfrecord
PYTHONPATH=. python3 dataset/create_pascal_tfrecord.py --data_dir=VOCdevkit --year=VOC2012 --output_path=tfrecord/pascal
It's important to specify the PYTHONPATH.
The VOC dataset has a really complicated structure. The tfrecords are binary files containing metadata + JPEGs.
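A quick way to sanity check a tfrecord without loading tensorflow is to walk its framing. This is a sketch based on the documented TFRecord layout: an 8 byte little endian length, a 4 byte CRC of the length, the payload, then a 4 byte CRC of the payload. It skips the CRCs instead of verifying them. The function names are made up, and the demo data is fake.

```python
import io
import struct

def count_tfrecords(stream):
    """Count records by walking the length headers. CRCs are skipped, not verified."""
    count = 0
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break  # end of file (or a truncated header)
        (length,) = struct.unpack("<Q", header)
        # Skip the length CRC, the payload & the payload CRC.
        stream.seek(4 + length + 4, io.SEEK_CUR)
        count += 1
    return count

def fake_record(payload):
    """Frame a payload like a TFRecord with zeroed CRC fields (demo data only)."""
    return struct.pack("<Q", len(payload)) + b"\0" * 4 + payload + b"\0" * 4

demo = io.BytesIO(fake_record(b"first") + fake_record(b"second image"))
num_records = count_tfrecords(demo)
print(num_records)
```

To use it on real data, open 1 of the generated .tfrecord files in binary mode & pass the file object in. The count should match the number of images that landed in that shard, assuming the converter writes 1 record per image.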
There's also a create_coco_tfrecord.py which takes a JSON file. Run it twice to make the train & val data sets.
PYTHONPATH=. python3 dataset/create_coco_tfrecord.py --image_dir=../../train_lion --image_info_file=../../train_lion/instances_train.json --output_file_prefix=../../train_lion/pascal --num_shards=10
PYTHONPATH=. python3 dataset/create_coco_tfrecord.py --image_dir=../../val_lion --image_info_file=../../val_lion/instances_val.json --output_file_prefix=../../val_lion/pascal --num_shards=10
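The JSON file is presumably in the standard COCO detection layout: an images list, an annotations list & a categories list, with boxes as [x, y, width, height] in pixels. A minimal sketch of generating such a file follows. The function name, file name & box values are made up for illustration, & which fields the converter actually requires was not checked.

```python
import json

def coco_skeleton(images, boxes, class_name="lion"):
    """images: list of (file_name, width, height).
    boxes: list of (image_index, x, y, w, h), top-left origin, pixels."""
    return {
        "images": [
            {"id": i + 1, "file_name": name, "width": w, "height": h}
            for i, (name, w, h) in enumerate(images)
        ],
        "annotations": [
            {
                "id": j + 1,
                "image_id": img + 1,      # matches the ids assigned above
                "category_id": 1,         # single class
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
            }
            for j, (img, x, y, w, h) in enumerate(boxes)
        ],
        "categories": [{"id": 1, "name": class_name}],
    }

doc = coco_skeleton(
    images=[("lion0001.jpg", 640, 640)],   # hypothetical file name
    boxes=[(0, 100, 120, 200, 300)],       # hypothetical box
)
text = json.dumps(doc, indent=2)
print(text)
```

Writing text to instances_train.json & instances_val.json would give the 2 files the commands above point at.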
With the training function ingesting this data, there's not much verbosity. It saves every epoch in model_dir & loads the last saved epoch from model_dir when it starts. It burns 8 minutes per epoch for efficientdet-lite4.
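A back of the envelope estimate of the full run, assuming every epoch takes the same 8 minutes & ignoring evaluation passes or restarts:

```python
minutes_per_epoch = 8    # measured for efficientdet-lite4
num_epochs = 300         # from the example command

total_minutes = minutes_per_epoch * num_epochs
total_hours = total_minutes / 60
days, leftover_hours = divmod(total_hours, 24)

print(total_hours, "hours, or", int(days), "day &", int(leftover_hours), "hours")
```

Since it resumes from the last saved epoch in model_dir, the run doesn't have to happen in 1 sitting.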
Then it's about repeating the successful tensorrt conversion on the same computer which did the training.
Create the efficientlion-lite4.yaml file in
../../efficientlion-lite4/
---
image_size: 640x640
num_classes: 1
moving_average_decay: 0
nms_configs:
  method: hard
  iou_thresh: 0.35
  score_thresh: 0.
  sigma: 0.0
  pyfunc: False
  max_nms_inputs: 0
  max_output_size: 100
num_classes worked around a mismatched layer size
moving_average_decay worked around a missing ExponentialMovingAverage operator
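For repeating this with a different class count or image size, the file can be generated instead of hand edited. A minimal sketch, assuming the keys after nms_configs are nested under it. The function name is made up & the values are kept as plain text so they come out exactly as written above.

```python
def render_hparams(num_classes=1, image_size="640x640"):
    """Return the text of the export YAML, with nms_configs as a nested block."""
    top = [
        ("image_size", image_size),
        ("num_classes", str(num_classes)),
        ("moving_average_decay", "0"),
    ]
    nms = [
        ("method", "hard"),
        ("iou_thresh", "0.35"),
        ("score_thresh", "0."),
        ("sigma", "0.0"),
        ("pyfunc", "False"),
        ("max_nms_inputs", "0"),
        ("max_output_size", "100"),
    ]
    lines = ["---"]
    lines += ["%s: %s" % kv for kv in top]
    lines.append("nms_configs:")
    lines += ["  %s: %s" % kv for kv in nms]   # indented under nms_configs
    return "\n".join(lines) + "\n"

yaml_text = render_hparams()
print(yaml_text, end="")
```

Save the result as efficientlion-lite4.yaml in the model directory. num_classes has to match what the model was trained with.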
The warn= bug fixes still applied, but it didn't fail on eager execution.
Inside automl/efficientdet/tf2/ run
PYTHONPATH=.:.. python3 inspector.py --mode=export --model_name=efficientdet-lite4 --model_dir=../../../efficientlion-lite4/ --saved_model_dir=../../../efficientlion-lite4.out --hparams=../../../efficientlion-lite4/efficientlion-lite4.yaml
Then copy the efficientlion-lite4.out/ directory to the jetson nano. Enable the swap space. Inside TensorRT/samples/python/efficientdet run
OPENBLAS_CORETYPE=CORTEXA57 python3 create_onnx.py --input_size="640,640" --saved_model=/root/efficientlion-lite4.out --onnx=/root/efficientlion-lite4.out/efficientlion-lite4.onnx
Finally comes the trtexec command
/usr/src/tensorrt/bin/trtexec --fp16 --workspace=1024 --onnx=/root/efficientlion-lite4.out/efficientlion-lite4.onnx --saveEngine=/root/efficientlion-lite4.out/efficientlion-lite4.engine
That worked. It was the lion kingdom's 1st end to end trained object detector in tensorrt format. Very important to specify --fp16.
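The 2 jetson steps can be collected in a script so they're repeatable after every retraining. This is only a sketch which rebuilds the same command lines shown above. The function name & its parameters are made up, & nothing here has been run on a jetson.

```python
import shlex

OUT = "/root/efficientlion-lite4.out"   # directory copied to the jetson

def conversion_steps(out_dir=OUT, name="efficientlion-lite4", size=640, fp16=True):
    """Return (extra_env, argv) pairs for the ONNX & engine steps, in order."""
    onnx_path = "%s/%s.onnx" % (out_dir, name)
    engine_path = "%s/%s.engine" % (out_dir, name)
    # Run from TensorRT/samples/python/efficientdet
    to_onnx = [
        "python3", "create_onnx.py",
        "--input_size=%d,%d" % (size, size),
        "--saved_model=" + out_dir,
        "--onnx=" + onnx_path,
    ]
    to_engine = [
        "/usr/src/tensorrt/bin/trtexec",
        "--workspace=1024",
        "--onnx=" + onnx_path,
        "--saveEngine=" + engine_path,
    ]
    if fp16:
        to_engine.insert(1, "--fp16")
    return [({"OPENBLAS_CORETYPE": "CORTEXA57"}, to_onnx), ({}, to_engine)]

steps = conversion_steps()
for env, argv in steps:
    prefix = " ".join("%s=%s" % kv for kv in env.items())
    print((prefix + " " if prefix else "") + shlex.join(argv))
```

To actually execute a step, pass argv to subprocess.run with check=True & the extra variables merged into a copy of os.environ, from the directory noted in the comment.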