Extracting 3D Coordinates of Objects Detected by Vision Framework in ARSCNView Video Feeds
This article is the entry for day 21 in the KINTO Technologies Advent Calendar 2024 🎅🎄
Introduction
Hey there! I’m Viacheslav, an iOS engineer, and today’s article is part of the Advent Calendar 2024 event at KINTO Technologies.
This year, I had the opportunity to work on a new feature in our Unlimited app called "これなにガイド" (Kore Nani Gaido, meaning "What's This Guide"). Kore Nani Gaido is an augmented reality (AR) manual that allows users to "scan" their car's dashboard and displays virtual markers over a car's buttons, switches, and other elements. By selecting a marker corresponding to a specific button, users can access a detailed manual page for that control.
Today, I’d like to share a short and simple solution to one of the challenges I encountered while working on this feature: accurately capturing the on-screen coordinates of a physical object recognized by the Vision framework and converting them into 3D coordinates within an AR scene.
What initially seemed like a trivial task turned out to be more complex than expected. After exploring several approaches and performing a lot of manual calculations, I finally arrived at a solution that is both straightforward and surprisingly concise. I wish I’d known about it when I started, as there is relatively little information available about ARKit and CoreML integration. So, let’s add to that knowledge base!
A Couple of Preconditions
Before we get to the actual code, let's clearly define the environment we’ll be working with:
- `ARSCNView`: A view that displays the video feed from the device's camera, capturing the real-world environment and allowing 3D objects to "blend in" for an AR experience. `ARSCNView` is part of Apple's ARKit and is built on top of SceneKit, which handles rendering 3D objects in the AR scene.
- Core ML Object Detection Model: Before we can determine the coordinates of an object, we first need to recognize it within the video feed frames provided by the device's camera. The Vision framework uses Core ML object detection models for that purpose. For this article, I'll assume you already have a model ready to use. If not, there are many pre-trained models available for download, such as YOLOv3-Tiny, which you can find here.
And that’s all you need for a bare-bones solution! We’ll capture video frames from `ARSCNView`, use the Core ML model to detect the object’s position within the `ARSCNView` viewport, and apply a technique called "hit-testing" to determine the object’s coordinates in 3D AR space.
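For context, here is a minimal sketch of the kind of setup the rest of the article assumes: an `ARSCNView` running a world-tracking session whose delegate receives camera frames. The `ARGuideViewController` name, the stored `vnRequest` property, and the plane-detection options are illustrative choices for this sketch, not requirements of the approach.

```swift
import ARKit
import SceneKit
import Vision

final class ARGuideViewController: UIViewController, ARSessionDelegate {
    // Assumed to be created and laid out elsewhere (storyboard, view hierarchy setup, etc.)
    let sceneView = ARSCNView()
    // The Vision request we will build in the next section
    var vnRequest: VNCoreMLRequest?

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        // Receive new frames through ARSessionDelegate's session(_:didUpdate:)
        sceneView.session.delegate = self
        // A basic world-tracking configuration; detecting planes makes the later raycasts more reliable
        let configuration = ARWorldTrackingConfiguration()
        configuration.planeDetection = [.horizontal, .vertical]
        sceneView.session.run(configuration)
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)
        sceneView.session.pause()
    }
}
```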
Capturing the Coordinates of a Recognized Object in ARSCNView
When performing recognition requests with Vision, a typical setup looks like the one below: you initialize a Core ML model and a `VNCoreMLRequest` to handle recognition using that model.
```swift
let vnModel = try! VNCoreMLModel(for: myCoreMLModel)
let vnRequest = VNCoreMLRequest(model: vnModel) { [weak self] request, _ in
    guard let observations = request.results else { return }
    // Observations handling
}
vnRequest.imageCropAndScaleOption = .centerCrop
```
You then keep a reference to `vnRequest` in a suitable place, ready to be performed with the next set of arguments. The argument types depend on where you’re capturing video feed frames from.
In our scenario, we are passing frames from an `ARSCNView`, which should be captured in the `session(_:didUpdate:)` method of `ARSessionDelegate`. This delegate method is called whenever a new frame is available for `ARSCNView` to display.
```swift
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    guard let vnRequest else { return } // 1
    let options: [VNImageOption: Any] = [.cameraIntrinsics: frame.camera.intrinsics] // 2
    let requestHandler = VNImageRequestHandler(
        cvPixelBuffer: frame.capturedImage, // 3
        orientation: .downMirrored, // 4
        options: options
    )
    try? requestHandler.perform([vnRequest]) // 5
}
```
Breaking Down the Code:
1. Reference to `VNCoreMLRequest`: Once we receive a new frame, we are ready to perform the request we initialized earlier.
2. Camera Intrinsics: `frame.camera.intrinsics` provides camera calibration data to help Vision interpret the geometric properties of the scene.
3. Image Input: `VNImageRequestHandler` accepts raw image data as a `CVPixelBuffer`, obtained from the AR frame.
4. Image Orientation: The `.downMirrored` orientation accounts for the inverted coordinate system of the camera feed compared to Vision's default orientation.
5. Performing the Request: The prepared request is executed using the request handler.
Once you start passing frames to Vision, object detection results are returned as arrays of `VNRecognizedObjectObservation` objects in the `VNCoreMLRequest` completion handler. While you might filter these results by confidence level or perform other processing, today we’ll focus on extracting the coordinates of a specific recognized object.
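For instance, the completion handler from earlier could narrow the results down to the strongest detection along these lines. This is only a sketch: the 0.8 confidence threshold and the `handleDetection(_:)` helper are illustrative, not part of the original setup.

```swift
let vnRequest = VNCoreMLRequest(model: vnModel) { [weak self] request, _ in
    // Object detection results arrive as VNRecognizedObjectObservation values
    guard let observations = request.results as? [VNRecognizedObjectObservation] else { return }
    // Keep only reasonably confident detections and pick the strongest one
    guard let best = observations
        .filter({ $0.confidence > 0.8 })
        .max(by: { $0.confidence < $1.confidence }) else { return }
    DispatchQueue.main.async {
        // A hypothetical helper that converts best.boundingBox as shown in the next section
        self?.handleDetection(best)
    }
}
```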
Extracting Bounding Box Coordinates
At first, this might seem straightforward since `VNRecognizedObjectObservation` has a `boundingBox` property, a `CGRect` enclosing the recognized object. However, there are a few complications:
- The `boundingBox` is in a normalized coordinate system relative to the input image of the object recognition model (meaning that the coordinate values are between 0 and 1), which also has an inverted Y-axis.
- The sizes and aspect ratios of the camera feed, the Core ML model input, and the `ARSCNView` viewport all differ from each other.
This means that a series of coordinate conversions and rescaling steps is required to transform the `boundingBox` into the `ARSCNView` viewport's coordinate system. Doing those conversions manually can be cumbersome and error-prone. Fortunately, there’s an embarrassingly simple way to handle this using `CGAffineTransform`. Here’s how:
```swift
let sceneView: ARSCNView

func getDetectionCenter(from observation: VNRecognizedObjectObservation) -> CGRect? {
    guard let currentFrame = sceneView.session.currentFrame else { return nil }
    let viewportSize = sceneView.frame.size
    // 1
    let fromCameraImageToViewTransform = currentFrame.displayTransform(for: .portrait, viewportSize: viewportSize)
    let viewNormalizedBoundingBox = observation.boundingBox.applying(fromCameraImageToViewTransform)
    // 2
    let scaleTransform = CGAffineTransform(scaleX: viewportSize.width, y: viewportSize.height)
    let viewBoundingBox = viewNormalizedBoundingBox.applying(scaleTransform)
    return viewBoundingBox
}
```
Explanation:
1. Transforming to View Coordinates: Using `displayTransform(for:viewportSize:)`, the detected bounding box is converted from the normalized coordinate system of the input image to the normalized coordinate system of `ARSCNView`.
2. Scaling to Pixel Dimensions: The normalized bounding box is scaled to match the size of the `ARSCNView` viewport, resulting in a bounding box in screen pixel dimensions.
And that’s it! You now have the bounding box in the coordinate system of the `ARSCNView` viewport.
Getting the Third Coordinate
I promised that we would get the coordinates of the recognized object within the 3D coordinate space of the AR scene.
To do that, we are going to utilize a technique called "hit-testing." It allows us to measure the distance to the closest physical object at an arbitrary point in the viewport. You can imagine this technique as casting a straight ray from your device to the first intersection with a physical object at the selected point of the viewport and then measuring the length of that ray. This functionality is a part of ARKit and is really easy to use.
Here is how we can find the 3D coordinates of the perceived center of the object we detected earlier.
```swift
func performHitTestInCenter(of boundingBox: CGRect) -> SCNVector3? {
    let center = CGPoint(x: boundingBox.midX, y: boundingBox.midY) // 1
    return performHitTest(at: center)
}

func performHitTest(at location: CGPoint) -> SCNVector3? {
    guard let query = sceneView.raycastQuery( // 2
        from: location,
        allowing: .estimatedPlane, // 3
        alignment: .any // 4
    ) else {
        return nil
    }
    guard let result = sceneView.session.raycast(query).first else { return nil } // 5
    let translation = result.worldTransform.columns.3 // 6
    return .init(x: translation.x, y: translation.y, z: translation.z)
}
```
Explanation:
1. Here, we calculate the center of the bounding box from the previous step, as we need a single point to perform the hit-testing.
2. Creates a raycast query starting from the given 2D point.
3. Allows hit-testing to consider non-planar surfaces and planes that ARKit can only estimate.
4. Enables hit-testing for both horizontal and vertical surfaces (the default is horizontal only).
5. Executes the raycast query using the AR session. Returns `nil` if there’s no intersection.
6. Each `ARRaycastResult` contains a `worldTransform`, a 4x4 matrix representing the 3D transformation of the detected point in world space. Its `columns.3` holds the translation vector, which specifies the 3D position of the intersection. This translation is returned as an `SCNVector3`, the type ARKit/SceneKit uses to represent 3D positions.
Done! You now have the 3D coordinates of an object detected by Vision. Feel free to use them for your purposes. :)
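For example, here is one way the pieces above could be combined to drop a simple placeholder marker at the detected position. This is just an illustrative sketch, not how the markers in the Unlimited app are actually built; the `placeMarker(for:)` name and the sphere geometry are made up for this example.

```swift
func placeMarker(for observation: VNRecognizedObjectObservation) {
    // Assumes this runs on the main thread, since it touches the SceneKit scene graph
    guard
        let boundingBox = getDetectionCenter(from: observation),
        let position = performHitTestInCenter(of: boundingBox)
    else { return }

    // A small semi-transparent sphere as a placeholder marker
    let sphere = SCNSphere(radius: 0.01)
    sphere.firstMaterial?.diffuse.contents = UIColor.systemBlue.withAlphaComponent(0.8)

    let markerNode = SCNNode(geometry: sphere)
    markerNode.position = position
    sceneView.scene.rootNode.addChildNode(markerNode)
}
```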
Conclusion
In the Unlimited app, we use these 3D coordinates to display AR manual markers in a car. Of course, there are many additional techniques we employ to make the user experience smoother and the marker positions more stable, but this is one of the core techniques.
That said, this same method can be used for any other purpose you can think of. I hope you find it helpful.
Finally, here’s a tiny sneak peek into our testing process and how our AR manual markers are displayed after object detection.
That's it for today—thanks for reading!
Wishing you the Merriest Christmas and the Happiest New Year!