
Retro Data Science: Testing the First Versions of YOLO | by Dmitrii Eliuseev | Jun, 2023


Let’s journey 8 years back in time

Object detection with YOLO, Image by author

The world of data science is constantly changing. Often, we cannot see these changes easily because they happen slowly, but after a while, it is easy to look back and see that the landscape has become drastically different. Tools and libraries that were on the cutting edge of progress only 10 years ago can be completely forgotten today.

YOLO (You Only Look Once) is a popular object detection library. Its first version was released quite a long time ago, in 2015. YOLO was fast, it provided good results, and the pre-trained models were publicly available. The model quickly became popular, and the project is still being actively improved nowadays. This gives us the opportunity to see how data science tools and libraries have evolved over time. In this article, I will test different YOLO versions, from the very first V1 up to the latest V8.

For further testing, I will use an image from the OpenCV YOLO tutorial:

Test image, Source © https://opencv-tutorial.readthedocs.io

Readers who want to reproduce the results on their own can open that link and download the original image.

Let’s get began.

YOLO V1..V3

The very first paper about YOLO, “You Only Look Once: Unified, Real-Time Object Detection,” was released in 2015. And surprisingly, YOLO v1 is still available for download. As Mr. Redmon, one of the authors of the original paper, wrote, he is keeping this version “for historical purposes”, and that is very nice indeed. But can we run it today? The model is distributed in the form of two files. The configuration file “yolo.cfg” contains details about the neural network model:

[net]
batch=1
height=448
width=448
channels=3
momentum=0.9
decay=0.0005
...

[convolutional]
batch_normalize=1
filters=64
size=7
stride=2
pad=1
activation=leaky

And the second file, “yolov1.weights“, as the name suggests, contains the weights of the pre-trained model.

This kind of format is not from PyTorch or Keras. It turned out that the model was created using Darknet, an open-source neural network framework written in C. This project is still available on GitHub, but it looks abandoned. At the moment of writing this article, there are 164 pull requests and 1794 open issues; the last commits were made in 2018, and later only README.md was changed (well, this is probably what the death of a project looks like in the modern digital world).

The original Darknet project is abandoned; that is the bad news. The good news is that the readNetFromDarknet method is still available in OpenCV, and it is present even in the latest OpenCV versions. So, we can easily try to load the original YOLO v1 model using a modern Python environment:

import cv2

model = cv2.dnn.readNetFromDarknet("yolo.cfg", "yolov1.weights")

Alas, it did not work; I only got an error:

darknet_io.cpp:902: error: 
(-212:Parsing error) Unknown layer type: local in function 'ReadDarknetFromCfgStream'

It turned out that “yolo.cfg” has a layer type named “local”, which is not supported by OpenCV, and I don’t know if there is a workaround for that. Anyway, the YOLO v2 config does not have this layer anymore, and that model can be successfully loaded in OpenCV:

import cv2

model = cv2.dnn.readNetFromDarknet("yolov2.cfg", "yolov2.weights")

Using the model is not as easy as we might expect. First, we need to find the output layers of the model:

ln = model.getLayerNames()
output_layers = [ln[i - 1] for i in model.getUnconnectedOutLayers()]

Then we need to load the image and convert it into a binary format that the model can understand:

img = cv2.imread('horse.jpg')
H, W = img.shape[:2]

blob = cv2.dnn.blobFromImage(img, 1/255.0, (608, 608), swapRB=True, crop=False)

Finally, we can run forward propagation. The “forward” method will run the calculations and return the requested layer outputs:

model.setInput(blob)
outputs = model.forward(output_layers)
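
As a quick sanity check (a small sketch; the exact number of rows depends on the network configuration and input size), we can print the shape of each output array before parsing it:

for output in outputs:
    print(output.shape)  # e.g. (N, 85): N candidate boxes, each with 4 box coordinates, 1 objectness score and 80 class scores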

Making the forward propagation is easy, but parsing the output can be a bit tricky. The model produces 85-dimensional feature vectors as output, where the first 4 numbers represent the object rectangle, the 5th number is the probability of the presence of an object, and the last 80 numbers contain the probabilities for the 80 categories the model was trained on. Having this information, we can draw the labels over the original image:

import numpy as np

threshold = 0.5
boxes, confidences, class_ids = [], [], []

# Get all boxes and labels
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > threshold:
            center_x, center_y = int(detection[0] * W), int(detection[1] * H)
            width, height = int(detection[2] * W), int(detection[3] * H)
            left = center_x - width//2
            top = center_y - height//2
            boxes.append([left, top, width, height])
            class_ids.append(class_id)
            confidences.append(float(confidence))

# Combine boxes together using non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

# All COCO classes
classes = ("person;bicycle;car;motorbike;aeroplane;bus;train;truck;boat;traffic light;fire hydrant;stop sign;parking meter;bench;bird;"
           "cat;dog;horse;sheep;cow;elephant;bear;zebra;giraffe;backpack;umbrella;handbag;tie;suitcase;frisbee;skis;snowboard;sports ball;kite;"
           "baseball bat;baseball glove;skateboard;surfboard;tennis racket;bottle;wine glass;cup;fork;knife;spoon;bowl;banana;apple;sandwich;"
           "orange;broccoli;carrot;hot dog;pizza;donut;cake;chair;sofa;pottedplant;bed;diningtable;toilet;tvmonitor;laptop;mouse;remote;keyboard;"
           "cell phone;microwave;oven;toaster;sink;refrigerator;book;clock;vase;scissors;teddy bear;hair drier;toothbrush").split(";")

# Draw rectangles on image
colors = np.random.randint(0, 255, size=(len(classes), 3), dtype='uint8')
for i in indices.flatten():
    x, y, w, h = boxes[i]
    color = [int(c) for c in colors[class_ids[i]]]
    cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
    text = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
    cv2.putText(img, text, (x + 2, y - 6), cv2.FONT_HERSHEY_COMPLEX, 0.5, color, 1)

# Show
cv2.imshow('window', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Here I use np.argmax to find the class ID with the maximum probability. The YOLO model was trained on the COCO (Common Objects in Context, Creative Commons Attribution 4.0 License) dataset, and for simplicity, I placed all 80 label names directly in the code. I also used the OpenCV NMSBoxes method to combine overlapping rectangles together.
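
To illustrate what NMSBoxes does, here is a tiny standalone sketch (the boxes and scores are made up for the example): two heavily overlapping boxes compete for the same object, and only the one with the higher score survives.

import cv2

# Boxes are given as [x, y, width, height]; the first two overlap almost completely
boxes = [[100, 100, 50, 80], [102, 98, 52, 82], [300, 200, 40, 40]]
scores = [0.9, 0.8, 0.7]

# Keep the best box among those overlapping by more than the 0.4 IoU threshold
indices = cv2.dnn.NMSBoxes(boxes, scores, 0.5, 0.4)
print(indices)  # boxes 0 and 2 are kept, the duplicate box 1 is suppressed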

The final result looks like this:

YOLO v2 results, Image by author

We successfully ran a model released in 2016 in a modern environment!

The next version, YOLO v3, was released two years later, in 2018, and we can also run it using the same code (the weights and config files are available online). As the authors wrote in the paper, the new model is more accurate, and we can easily verify this:

YOLO v3 results, Image by author

Indeed, the V3 model was able to find more objects in the same image. Readers who are interested in the technical details can read this TDS article written in 2018.

YOLO V5..V7

As we can see, the model loaded with the readNetFromDarknet method works, but the required code is pretty “low-level” and cumbersome. OpenCV developers decided to make life easier, and in 2019, a new DetectionModel class was added to version 4.1.2. We can load the YOLO model this way; the general logic remains the same, but the required amount of code is much smaller. The model directly returns class IDs, confidence values, and rectangles in a single method call:

import cv2
import numpy as np

img = cv2.imread('horse.jpg')

model = cv2.dnn_DetectionModel("yolov7.cfg", "yolov7.weights")
model.setInputParams(size=(640, 640), scale=1/255, mean=(127.5, 127.5, 127.5), swapRB=True)

class_ids, confidences, boxes = model.detect(img, confThreshold=0.5)

# Combine boxes together using non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

# All COCO classes
classes = ("person;bicycle;car;motorbike;aeroplane;bus;train;truck;boat;traffic light;fire hydrant;stop sign;parking meter;bench;bird;"
           "cat;dog;horse;sheep;cow;elephant;bear;zebra;giraffe;backpack;umbrella;handbag;tie;suitcase;frisbee;skis;snowboard;sports ball;kite;"
           "baseball bat;baseball glove;skateboard;surfboard;tennis racket;bottle;wine glass;cup;fork;knife;spoon;bowl;banana;apple;sandwich;"
           "orange;broccoli;carrot;hot dog;pizza;donut;cake;chair;sofa;pottedplant;bed;diningtable;toilet;tvmonitor;laptop;mouse;remote;keyboard;"
           "cell phone;microwave;oven;toaster;sink;refrigerator;book;clock;vase;scissors;teddy bear;hair drier;toothbrush").split(";")

# Draw rectangles on image
colors = np.random.randint(0, 255, size=(len(classes), 3), dtype='uint8')
for i in indices.flatten():
    x, y, w, h = boxes[i]
    color = [int(c) for c in colors[class_ids[i]]]
    cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
    text = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
    cv2.putText(img, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)

# Show
cv2.imshow('window', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

As we can see, all the low-level code needed for extracting boxes and confidence values from the model output is not needed anymore.

The result of running YOLO v7 is, in general, the same, but the rectangle around the horse looks more accurate:

YOLO v7 results, Image by author

YOLO V8

The 8th version was released in 2023, so I cannot consider it “retro”, at least at the moment of writing this text. But just to compare the results, let’s see the code required nowadays to run YOLO:

import cv2
from ultralytics import YOLO
import supervision as sv

img = cv2.imread('horse.jpg')

model = YOLO('yolov8m.pt')
results = model.predict(source=img, save=False, save_txt=False, verbose=False)
detections = sv.Detections.from_yolov8(results[0])

# Create list of labels
labels = []
for ind, class_id in enumerate(detections.class_id):
    labels.append(f"{model.model.names[class_id]}: {detections.confidence[ind]:.2f}")

# Draw rectangles on image
box_annotator = sv.BoxAnnotator(thickness=2, text_thickness=1, text_scale=0.4)
box_annotator.annotate(scene=img, detections=detections, labels=labels)

# Show
cv2.imshow('window', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

As we can see, the code became much more compact. We don’t need to deal with dataset label names (the model provides a “names” property) or figure out how to draw rectangles and labels on the image (there is a special BoxAnnotator class for that). We don’t even need to download the model weights anymore; the library will do it automatically for us. Compared to 2016, the program from 2023 “shrank” from about 50 to about 5 lines of code! It is clearly a nice improvement, and modern developers don’t need to know about forward propagation or the output layer format anymore. The model just works as a black box with some “magic” inside. Is it good or bad? I don’t know :)
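
As a small illustration of the “names” property mentioned above (a sketch; the index shown in the comment assumes the standard COCO ordering), the label dictionary can be read directly from the loaded model, so there is no need to hard-code the 80 class names anymore:

from ultralytics import YOLO

model = YOLO('yolov8m.pt')

# The model carries its own label dictionary: {0: 'person', 1: 'bicycle', ...}
print(len(model.model.names))  # 80 classes
print(model.model.names[17])   # 'horse' in the standard COCO ordering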

As for the result itself, it is roughly similar:

YOLO v8 results, Image by author

The model works well, and at least on my computer, the calculation speed improved compared to v7, maybe because of better use of the GPU.
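
This is only an informal observation; readers who want to check it on their own hardware can use a minimal timing sketch like the one below (the model and image names are assumed from the examples above, and the averaging is very rough, not a rigorous benchmark):

import time
import cv2
from ultralytics import YOLO

img = cv2.imread('horse.jpg')
model = YOLO('yolov8m.pt')

model.predict(source=img, verbose=False)  # warm-up run to exclude initialization time
start = time.perf_counter()
for _ in range(10):
    model.predict(source=img, verbose=False)
print(f"Average inference time: {(time.perf_counter() - start) / 10:.3f} s")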

Conclusion

In this article, we were able to test almost all YOLO models created from 2016 up to 2023. At first glance, attempts to run a model released almost 10 years ago may look like a waste of time. But as for me, I learned a lot while doing these tests:

  • It was interesting to see how popular data science tools and libraries have evolved over time. The trend of moving from low-level code to high-level methods that do everything and even download the pre-trained model before execution (at least for now, without asking for a subscription key yet, but who knows what will come in the next 10 years?) looks clear. Is it good or bad? This is an interesting and open question.
  • It was important to learn that OpenCV is “natively” capable of running deep learning models. This allows using neural network models not only with large frameworks like PyTorch or Keras but also in pure Python and even C++ applications. Not every application runs in a cloud with virtually unlimited resources. The IoT market is growing, and this is especially important for running neural networks on low-power devices like robots, surveillance cameras, or smart doorbells.

In the next article, I will test this in more detail and show how YOLO v8 runs on a low-power board like the Raspberry Pi; we will test both the Python and C++ versions. Stay tuned.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.

