Synthetic dataset generation for object-to-model deep learning in industrial applications

Matthew Z. Wong; Kiyohito Kunii; Max Baylis; Wai Hong Ong; Pavel Kroupa; Swen Koller

doi:10.7717/peerj-cs.222

Department of Computing, Imperial College London, London, UK

Published: 2019-10-14
Accepted: 2019-08-29
Received: 2019-04-17

Academic Editor: Feng Xia

Subject Areas: Computer Vision, Data Mining and Machine Learning
Keywords: Industrial computer vision, Photogrammetry, Convolutional neural network, Computer science applications, 3D Modelling, Synthetic data, Deep learning with limited data

Copyright: © 2019 Wong et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Wong MZ, Kunii K, Baylis M, Ong WH, Kroupa P, Koller S. 2019. Synthetic dataset generation for object-to-model deep learning in industrial applications. PeerJ Computer Science 5:e222 https://doi.org/10.7717/peerj-cs.222

The authors have chosen to make the review history of this article public.

Abstract

The availability of large image data sets has been a crucial factor in the success of deep learning-based classification and detection methods. Yet, while data sets for everyday objects are widely available, data for specific industrial use-cases (e.g., identifying packaged products in a warehouse) remains scarce. In such cases, the data sets have to be created from scratch, placing a crucial bottleneck on the deployment of deep learning techniques in industrial applications. We present work carried out in collaboration with a leading UK online supermarket, with the aim of creating a computer vision system capable of detecting and identifying unique supermarket products in a warehouse setting. To this end, we demonstrate a framework for using data synthesis to create an end-to-end deep learning pipeline, beginning with real-world objects and culminating in a trained model. Our method is based on the generation of a synthetic dataset from 3D models obtained by applying photogrammetry techniques to real-world objects. Using 100K synthetic images for 10 classes, an InceptionV3 convolutional neural network was trained, which achieved accuracy of 96% on a separately acquired test set of real supermarket product images. The image generation process supports automatic pixel annotation. This eliminates the prohibitively expensive manual annotation typically required for detection tasks. Based on this readily available data, a one-stage RetinaNet detector was trained on the synthetic, annotated images to produce a detector that can accurately localize and classify the specimen products in real-time.

Introduction

In this paper, we present a framework for using photogrammetry-based synthetic data generation to create an end-to-end deep learning pipeline for use in industrial applications.

While deep learning techniques have achieved well-documented success in many areas of computer vision, a key barrier to large-scale industry adoption remains the availability of data for model training. While large, high-quality datasets are readily available for common object categories such as animals and household items, this is not the case for many potential applications (e.g., the products in an industrial warehouse). For such applications, costly and labor-intensive data acquisition and labelling must be carried out before deep learning can be applied to the task at hand. As deep learning moves out of academia and into industry, this limitation poses a serious problem for potential users: how can one cheaply and efficiently acquire training data when a large dataset does not already exist?

Working in collaboration with a leading UK online supermarket, we address the problem of dataset generation for fixed-appearance objects: object classes whose appearance varies little or not at all across instances within a class. Such objects are ubiquitous in many potential deployment settings and include items such as consumer products, industrial goods, and machine parts, among others.

To that end, we propose combining 3D modelling and render-based image synthesis to generate a synthetic dataset which can be used to train a deep convolutional neural network (CNN), a type of neural network which is frequently used for computer vision tasks. The inspiration comes from the following insight: for object classes with fixed (or low-variation) appearance, texture and geometry information can be thought of as unchanging and can therefore be captured from a small number of physical samples without the risk of overfitting to individual specimens. Our approach builds on the existing literature by generating 3D scans of physical objects using photogrammetry. By rendering scenes from these realistic 3D models, we were able to generate a diverse array of synthetic images that were used as training data.

While previous work has made use of computer-generated 3D models to train CNNs, our study demonstrates that this approach also extends to 3D models acquired using photogrammetry.

In order to demonstrate our approach, we sought to train a deep learning model capable of performing the task of recognizing groceries. A total of 10 products were chosen for a classification task. Our results were promising: using our approach, we were able to successfully train and optimize a CNN that achieves a maximum classification accuracy of 95.8% on a general environment test set.

Furthermore, our use of synthetic training data generation has also enabled the automatic annotation and segmentation of training data. This has allowed us to train an object-detector for the same set of objects.

Methods

Figure 1 shows the overall design of our object-to-model training pipeline, which comprises four main stages. The first two stages, 3D modelling and image rendering, are used to generate a synthetic dataset from physical samples of the target objects, while the next two stages are used to train and validate a deep learning model on the synthetic dataset.

Figure 1: Overall system design.

DOI: 10.7717/peerj-cs.222/fig-1

3D Modelling: This stage involves scanning physical products to produce 3D models. These include texture and color representations of the product and are of high enough quality to produce realistic images in the next stage.
Image rendering: This stage produces a specified number of synthetic training images for each object, varying object pose, lighting, background, and occlusions.
Network training: For the purposes of this study, several deep CNN architectures were trained to classify grocery items using the images generated in the image rendering step.
Testing and evaluation: The trained classifier is then tested on a relatively small number of collected test images (100 images per class).

These four stages fully define our end-to-end pipeline for generating and evaluating an image recognition network, for a number of specified fixed-appearance objects. In essence, the full pipeline takes in a set of physical objects and outputs a trained model. Each of the four stages will be described in detail in the sections below.

3D Modelling

The goal of the 3D modelling stage was to demonstrate a means by which 3D models of target objects can be feasibly created to be used as input for the image synthesis stage. The method employed needed to produce models of high enough quality to generate photo-realistic and consistent representations, while remaining economical to employ in a research setting.

Photogrammetry was the tool of choice. This technique takes 2D images of a 3D surface (photographs captured using a conventional camera) as input, and attempts to reconstruct the surface, primarily using texture cues, by following the steps below:

Camera calibration: This is done automatically using matching features in the images, and estimating the most probable arrangement of cameras and features. A sparse point cloud of features on the modelled surface is calculated.
Mesh generation: The point cloud is then triangulated and used to create a structural mesh of the surface.
Texture generation: Texture information on the mesh surface is recovered by combining information from the original images.

Figure 2 shows a visualization of the steps described above as applied to an example object. This approach has allowed us to capture the geometry and texture information of all modelled objects. Minimal human effort is required: typically no more than 40 images were needed per model, and these can be captured manually in approximately 5 min per product. In an industrial setting, this can be carried out even faster using commercial photogrammetry-based photo-capture systems that automate the 3D modelling process.

Figure 2: Visualization of the steps in photogrammetry: (A) camera calibration and point cloud generation; (B) after mesh generation; (C) after texture generation.

DOI: 10.7717/peerj-cs.222/fig-2

Image rendering

The goal of the image rendering stage is to produce an effectively unlimited supply of high-quality training data. This stage involves using a rendering engine to render a pose image of the 3D model. The appearance of this image depends on a user-defined distribution of rendering parameters θ. The pose image is then combined with a background to generate the final training image. These steps are repeated to generate a potentially unlimited number of training images. Figure 3 provides an illustration of this process.

Scene appearance was fully defined by the following parameters: camera position with respect to the object (defined via an azimuth θ and elevation φ), camera distance to the object, lighting intensity (equivalently, distance), and number of lights.

As a simple heuristic, camera locations were distributed evenly around rings in a spherical coordinate system, as illustrated in Fig. 4. This distribution was chosen because it corresponds to common viewpoints of handheld grocery items.

Figure 4: (A) The rings on the shell (red rings) around which random normalized camera locations are sampled (blue points); (B) The uniform distribution of lamp locations on the sphere.

DOI: 10.7717/peerj-cs.222/fig-4

Lamp locations were distributed evenly on a sphere. Additionally, camera-subject distance and lamp energies were distributed according to truncated normal distributions to ensure no negative energy lamps were generated. Point lamps were used primarily as a simple lighting model.

Location distributions

We generated location distributions according to the method below. Cartesian coordinates (x, y, z) are defined in terms of spherical coordinates (radius ρ, azimuth θ, and elevation φ) as:
(1) $x = ρ \cos θ \sin ϕ, \quad y = ρ \sin θ \sin ϕ, \quad z = ρ \cos ϕ$

The distribution of camera locations was defined by:
(2) $ρ \sim T (μ_{ρ}, σ_{ρ}, a_{ρ}, b_{ρ})$
(3) $ϕ \sim T (0, σ_{ϕ}, - π / 2, π / 2)$
(4) $θ \sim U (0, 2 π)$

Here $T (μ, σ, a, b)$ is the truncated normal distribution derived from a normal distribution with mean μ and standard deviation σ, where a and b define the limits of the distribution, outside of which the probability density function is zero. In other words, if $X \sim T (μ, σ, a, b)$, then X follows $N (μ, σ)$ conditioned on a ≤ X ≤ b. This set of variables defines a distribution around a ring in the X–Y plane (with normal Z), and the width of the ring can be controlled by specifying σ_φ. If x = (x, y, z) is drawn from the distribution, one can “flip” the ring so that its normal is aligned with the X axis by applying:
(5) $x^{'} = R_{y} (π / 2) x$

Here $R_{y} (π / 2)$ is the rotation matrix that rotates a point by π/2 radians about the Y axis. In this way, points can be distributed around multiple rings (specified by their normals). For the purposes of this project, camera locations were distributed about the rings with the Y and Z axes as normals.

It is also worth noting that setting σ_φ to roughly π/3 radians generates an approximately uniform distribution of points over a sphere. This approach was used to generate the lamp location distributions.
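The sampling scheme in Eqs. 2 to 5 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the use of rejection sampling for the truncated normal, and the example parameter values are all assumptions. The sketch also assumes that ϕ is an elevation measured from the X–Y plane (cos ϕ in x and y, sin ϕ in z), so that ϕ = 0 places a point on the ring; under a polar-angle convention the sine and cosine of ϕ swap roles.

```python
import math
import random


def truncated_normal(mu, sigma, a, b, rng=random):
    """Draw from N(mu, sigma) restricted to [a, b] using rejection sampling.

    Simple but potentially slow when [a, b] lies far from mu.
    """
    while True:
        x = rng.gauss(mu, sigma)
        if a <= x <= b:
            return x


def spherical_to_cartesian(rho, theta, phi):
    """Convert (radius, azimuth, elevation) to (x, y, z).

    Assumption: phi is the elevation above the X-Y plane.
    """
    x = rho * math.cos(theta) * math.cos(phi)
    y = rho * math.sin(theta) * math.cos(phi)
    z = rho * math.sin(phi)
    return (x, y, z)


def rotate_about_y(point, angle):
    """Apply the rotation matrix R_y(angle) to a 3D point (Eq. 5)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def sample_ring_point(mu_rho, sigma_rho, a_rho, b_rho, sigma_phi,
                      flip_to_x=False, rng=random):
    """Sample one location around a ring whose normal is Z (or X if flipped)."""
    rho = truncated_normal(mu_rho, sigma_rho, a_rho, b_rho, rng)           # Eq. 2
    phi = truncated_normal(0.0, sigma_phi, -math.pi / 2, math.pi / 2, rng)  # Eq. 3
    theta = rng.uniform(0.0, 2.0 * math.pi)                                 # Eq. 4
    point = spherical_to_cartesian(rho, theta, phi)
    if flip_to_x:
        point = rotate_about_y(point, math.pi / 2)
    return point


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical parameters: a narrow ring for cameras, a wide one for lamps.
    camera = sample_ring_point(1.0, 0.1, 0.5, 1.5, sigma_phi=0.1)
    lamp = sample_ring_point(2.0, 0.2, 1.0, 3.0, sigma_phi=math.pi / 3)
    print(camera, lamp)
```

A small `sigma_phi` keeps samples close to the ring, while a value near π/3 spreads them over most of the sphere, which is how the same routine can serve for both camera and lamp locations.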

Lighting conditions

The number of point light sources n_L was also sampled from a discrete uniform distribution over non-negative integers (the minimum and maximum can be user-defined). The lamp energies E were sampled from a truncated normal distribution:
(6) $n_{L} \sim U \{n_{min}, n_{max}\}$
(7) $E \sim T (μ_{E}, σ_{E}, 0, + \infty)$
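As a rough illustration of Eqs. 6 and 7, the lighting parameters for one scene could be drawn as follows. The function name and the example values are hypothetical, and rejection sampling is only one possible way of realizing the truncated normal.

```python
import random


def sample_lighting(n_min, n_max, mu_e, sigma_e, rng=random):
    """Sample the number of point lamps and one energy per lamp.

    n_L is drawn uniformly from the integers n_min..n_max (inclusive), and
    each energy from a normal distribution truncated to [0, +inf).
    """
    if not 0 <= n_min <= n_max:
        raise ValueError("expected 0 <= n_min <= n_max")
    n_lights = rng.randint(n_min, n_max)  # Eq. 6 (both ends inclusive)
    energies = []
    while len(energies) < n_lights:
        energy = rng.gauss(mu_e, sigma_e)
        if energy >= 0.0:  # Eq. 7: reject negative energies
            energies.append(energy)
    return energies


if __name__ == "__main__":
    random.seed(1)
    # Hypothetical settings: between 1 and 4 lamps, mean energy 10.
    print(sample_lighting(1, 4, mu_e=10.0, sigma_e=5.0))
```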

Background scene generation

The rendered pose image is generated with a transparent background. Alpha compositing was used to combine the rendered pose image with a background image. The resulting generated images are highly varied in appearance. While a significant proportion of images look “unrealistic” (e.g., a yogurt pot in the International Space Station), the aim was not to achieve a simulation with perfect realism. Instead, the focus was on achieving enough background variation within the training data so that the final trained network would be as robust as possible, while ensuring that the objects themselves were rendered as realistically as possible.
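The compositing step can be illustrated with a minimal pure-Python sketch. It assumes straight (non-premultiplied) alpha, an opaque background, and channel values in [0, 1]; the nested-list image representation and the function name are chosen for illustration only and are not taken from the original pipeline.

```python
def alpha_composite(foreground, background):
    """Blend an RGBA foreground over an opaque RGB background.

    foreground: rows of (r, g, b, a) tuples, all values in [0, 1]
    background: rows of (r, g, b) tuples with the same dimensions
    Per channel: out = a * foreground + (1 - a) * background
    """
    if len(foreground) != len(background):
        raise ValueError("images must have the same height")
    result = []
    for fg_row, bg_row in zip(foreground, background):
        if len(fg_row) != len(bg_row):
            raise ValueError("images must have the same width")
        out_row = []
        for (r, g, b, a), (br, bg, bb) in zip(fg_row, bg_row):
            out_row.append((
                a * r + (1.0 - a) * br,
                a * g + (1.0 - a) * bg,
                a * b + (1.0 - a) * bb,
            ))
        result.append(out_row)
    return result


if __name__ == "__main__":
    # One row, three pixels: opaque red, fully transparent, half-transparent white.
    pose = [[(1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 0.5)]]
    scene = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]]
    print(alpha_composite(pose, scene))
```

Pixels where the rendered object is opaque keep the object's color, fully transparent pixels show the background unchanged, and partially transparent edge pixels are blended.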

Figure 5 shows examples of synthetic training data generated using the method described above. Note the variety of poses and lighting conditions represented, as well as the multitude of different backgrounds.

Figure 5: Examples of generated synthetic images (A–T).

DOI: 10.7717/peerj-cs.222/fig-5

Network training

The aim of the network training stage was to take a generated dataset as an input and produce a trained neural network that could be used to classify products from the dataset. The tool of choice for this task was a CNN, a class of neural networks that are frequently used for image classification tasks. CNNs are specialized neural networks that perform transformation functions (called convolutions) on image data. Deep CNNs contain hundreds of convolutions in series, arranged in various different architectures. It is common practice to take the output of the CNN and input it into a regular neural network (referred to as the fully-connected (FC) layer) in order to perform more specialized functions (in our context, a classification task). We refer readers to Rawat & Wang (2017) for a comprehensive review of the use of Deep CNNs in Image Classification.
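To make the structure described above concrete, the following toy sketch chains convolutions and feeds the result into a fully-connected layer that outputs class probabilities. It is a simplified illustration under several assumptions: a single-channel image, hand-set rather than trained weights, and no pooling or padding. It is not one of the architectures trained in this study.

```python
import math


def conv2d(image, kernel):
    """'Valid' 2D cross-correlation, the operation CNN libraries call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Element-wise non-linearity applied after each convolution."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def fully_connected(features, weights, biases):
    """One dense layer: a weighted sum of all features per output class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Convert class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image, kernels, weights, biases):
    """Apply the convolutions in series, flatten, then classify with the FC layer."""
    x = image
    for kernel in kernels:
        x = relu(conv2d(x, kernel))
    features = [v for row in x for v in row]
    return softmax(fully_connected(features, weights, biases))


if __name__ == "__main__":
    image = [[1.0] * 4 for _ in range(4)]       # 4x4 single-channel image
    kernels = [[[1.0, 1.0], [1.0, 1.0]]] * 2    # two 2x2 convolutions in series
    weights = [[0.1] * 4, [0.0] * 4]            # FC layer: 4 features -> 2 classes
    biases = [0.0, 0.0]
    print(classify(image, kernels, weights, biases))
```

In a real network the kernel and FC weights are learned from the training images, and each layer operates on many channels at once.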