By Dr. Jonathan Howe, NVIDIA; Dr. Aaron Reite, NGA; Dr. Kyle Pula, CACI; and Dr. Jonathan Von Stroh, CACI

The majority of modern image interpretation and analysis relies on deep neural networks (DNNs) due to their unparalleled accuracy, large capacity to differentiate between many object classes, generalizability, and the relative ease of developing new applications compared to traditional computer vision methods. In recent years, DNN research has produced off-the-shelf classification, detection, and semantic segmentation algorithms that, when properly trained, approach or exceed human-level performance in many imagery domains. However, obtaining these benefits requires large amounts of task-specific labeled training data, and these data must exhibit the extent and variability of the target domain. Like other statistical models, DNNs do not extrapolate well to out-of-domain environments without special care. For example, training a model to segment roads using images of North American cities, then deploying it on images of European cities, will produce a less than ideal outcome. A basic obstacle to generalization is that variations humans regard as obviously irrelevant (e.g., different lane markings or agricultural practices) can appear completely alien to a DNN, leading to unpredictable results. Data augmentation during training (e.g., random mirroring, rotation, contrast and brightness changes, color balance, scaling) can partially alleviate these issues; however, more advanced methods are required for DNNs to generalize well to new environments.
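As a concrete illustration, the standard augmentations listed above can be sketched in a few lines of NumPy. The function name and the jitter ranges here are illustrative choices, not taken from any particular library:

```python
import numpy as np

def augment(image, rng):
    """Apply simple random augmentations to an HxWxC image with values in [0, 1]."""
    # Random horizontal / vertical mirroring.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    if rng.random() < 0.5:
        image = image[::-1, :]
    # Random rotation by a multiple of 90 degrees (safe for square chips).
    image = np.rot90(image, k=rng.integers(0, 4))
    # Random brightness and contrast jitter, clipped back to [0, 1].
    brightness = rng.uniform(-0.1, 0.1)
    contrast = rng.uniform(0.8, 1.2)
    return np.clip((image - 0.5) * contrast + 0.5 + brightness, 0.0, 1.0)

rng = np.random.default_rng(0)
chip = rng.random((256, 256, 3))   # stand-in for a real image chip
augmented = augment(chip, rng)
```

Each call produces a different plausible variant of the same chip, which is exactly the kind of within-domain variation these augmentations supply.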

To counter poor generalization, a number of methods exist to rapidly create labeled datasets for training, but doing so efficiently, at scale, and with extensibility in mind requires careful thought. A system built on active learning methodologies and deployed in a collaborative environment can help annotators rapidly label data and create an operational capability, beginning from only a small amount of labeled data.[1] Many of these insights come from other fields, particularly autonomous driving and health care, which must also address additional concerns such as safety and interoperability.[2,3]

In addition to a robust labeling, training, validation, and deployment environment, more advanced techniques can maximize model accuracy on short time scales with limited training data. For example, semi-supervised and unsupervised modeling can aid labeling tasks, while simulated environments can replace or supplement training and validation datasets. This article focuses on the latter approach: creating synthetic data workflows to increase model accuracy when labeled data are scarce.

Methods of Simulation

Geoffrey Hinton’s 2007 paper “To Recognize Shapes, First Learn to Generate Images,”[4] greatly impacted the neural network and statistics research community. The paper lays out the steps to develop a multilayered neural network, methods to define loss functions, and the calculus to update model parameters to maximize model accuracy (known as back-propagation). In addition to this highly popular model training recipe, Hinton’s work discusses modeling image generation to increase detection or classification accuracy further. In essence, understanding how to create images greatly benefits image interpretation and analysis (and vice versa).

There are two main approaches to simulating data, each with benefits and drawbacks: traditional computer graphics and data-driven generative models. Computer graphics methods use ray tracing and rasterization to render simulated scenes. This works particularly well in remote sensing and autonomous vehicle use cases where the basic primitives (buildings, roads, vehicles) and spectral conditions (viewing geometries, lighting angles, spectral content, atmospheric attenuation) are relatively simple and easy to model. For example, the Digital Imaging and Remote Sensing Image Generation (DIRSIG) modeling tool, developed at the Rochester Institute of Technology, provides methods to create physics-driven synthetic images for sensor development and to aid DNN model training.[5,6] Similar methods have been studied to render maritime vessels, placing them into real imagery to vastly improve object detection metrics.[7] The autonomous vehicle and healthcare industries use rendering methods to generate simulated datasets that improve model accuracy, particularly when labeled datasets are scarce.[8,9] However, composing the scenes to be rendered can take time, particularly if the target domain is diverse. Compared to the naive approach of gathering and labeling additional data, this approach trades human annotator work for illustrator work. In some cases, it may not be possible to perform this exercise without significant investment.

Alternatively, the generative approach to synthetic data views an existing corpus of real data as a collection of samples from the true distribution of real data, and tries to build a model that draws additional samples from this distribution. The generated samples (or imagery) resemble the data corpus and, if the model is trained correctly, can have very high levels of visual fidelity. This reduces the need to use the computer graphics approach of constructing and rendering objects of interest within scenes with realistic spectral conditions. However, if these parameters are known and available at training time, they can also be used to condition the model to control the generated output. Prime examples of generative modeling, specifically using generative adversarial networks (GANs), include the works of Karras et al. for generating extremely high-fidelity imagery at high-resolution (in many cases fooling human examiners),[10] and Wang et al. for conditioning the GAN output at the pixel level using semantic labels.[11]

Illustration of the GAN framework: D (the discriminator) is alternately presented with images from G (the generator) and from the dataset, and is asked to distinguish between the two sources. The problem is formulated as a minimax game: D tries to minimize the number of errors it makes, while G tries to maximize the number of errors D makes on generated samples. The curvy arrows represent the back-propagation of gradients into the target set of parameters.

Generative Modeling via GANs

A GAN consists of a pair of networks that, as the name suggests, compete against one another during the training phase. The generator network G takes as input a random vector called the latent vector. If other metadata are available (lighting angle, etc.), then these values may be concatenated with the latent vector to condition the output. When generating new data, the network may be controlled via the metadata to create images with specific parameters. This latent vector is fed into a series of reshaping and deconvolutional layers to reconstruct and transform the vector into a generated image. The second network, the discriminator D, takes images from the real dataset (the data we are attempting to model) and the generated dataset and passes them through a series of convolutional and reshaping layers in a near mirror image of the generator network. It attempts to correctly predict which images were generated by G, and which are real. These networks compete in a minimax two-player game: D’s objective is to correctly guess the generated versus real images, while G’s objective is to fool D. In the ideal outcome, G generates convincing synthetic images and D cannot determine if G’s images are real or not. During deployment, G is passed random latent vectors with conditioning metadata (if available) to create new plausible images. The discriminator is typically discarded.[12]
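The minimax objective described above can be made concrete with a short NumPy sketch of the two loss terms. This is a toy illustration with hand-picked discriminator outputs, not a training loop:

```python
import numpy as np

def bce(probs, labels):
    """Binary cross-entropy averaged over the batch."""
    eps = 1e-7
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

# d_real: D's outputs (probability "real") on a batch of real images.
# d_fake: D's outputs on a batch of images produced by G.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.2, 0.1, 0.3])

# D minimizes its classification error on both sources...
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

# ...while G tries to fool D, i.e., it minimizes the loss of D
# labeling generated samples as real.
g_loss = bce(d_fake, np.ones_like(d_fake))
```

With these example outputs the discriminator is winning (low `d_loss`, high `g_loss`); during training, gradient updates to G push `g_loss` down, ideally until D's outputs hover near 0.5 on both sources.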

GANs have been used successfully in the healthcare sector, which has a large imbalance between healthy medical images and those containing unhealthy tissue or tumors. GANs can help reduce this imbalance through the modeling and creation of additional data.[13] In addition, when privacy concerns are an issue, GANs have been used to preserve anonymity, creating synthetic data that lack personal information while still exhibiting the scan details of real patients.[14]

Using GANs for Remote Sensing Applications

To train remote sensing DNNs using generative models for data augmentation, one must model both the imagery and the associated labels to a high degree of accuracy and fidelity. Researchers have made progress toward this by transferring image statistics from one domain, where there is an abundance of data, to the target domain that is similar in appearance and content, but with far fewer examples.

For example, Yun et al. used cycle-consistent generative adversarial networks (CycleGANs) to convert visible band data to infrared data.[15] Similarly, Benjdira et al. used the output of CycleGANs between visible band and infrared data to significantly increase the segmentation accuracy of remote sensing models.[16] Seo et al. transferred image statistics from real images into synthetically rendered imagery containing military vehicles to increase the overall image fidelity.[17] In each of these works, real data are used to augment synthetic data for object detection or segmentation model training.

In our recent work (Howe et al.), both the imagery and the labels are modeled together to create completely new labeled images, which we use to train an object detector.[18] To our knowledge, this is the first time such joint modeling has been attempted using GAN methods in any application area. Here, we use the International Society for Photogrammetry and Remote Sensing (ISPRS) 2D Semantic Labeling Contest—Potsdam dataset. This dataset consists of 24 images of 6,000×6,000 pixels, collected at a 5-cm ground sample distance and segmentation-labeled with six categories of land use: impervious surfaces (white), buildings (blue), low vegetation (cyan), trees (green), vehicles (yellow), and clutter (red). We use the methods of Karras et al. (ProgressiveGAN) and Wang et al. (Pix2PixHD) to model the spaces of segmentation masks and of imagery conditioned on such masks, respectively.[19,20] Figure 1 presents examples of real and synthetic image and label pairs.

Figure 1. Real ISPRS Potsdam image and label pairs (left), and synthetically generated image label pairs (right). The synthetic segmentation masks were generated via ProgressiveGAN, and the synthetic imagery was generated via Pix2PixHD conditioned on the generated mask.
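The two-stage pipeline described above can be sketched as follows. The stubs here are purely illustrative stand-ins: in the actual work, a trained ProgressiveGAN maps a latent vector to a segmentation mask, and a trained Pix2PixHD model renders imagery conditioned on that mask.

```python
import numpy as np

NUM_CLASSES = 6  # ISPRS Potsdam land-use categories

def sample_mask(latent, size=512):
    # Stub "mask GAN": one class index per pixel, derived from the latent.
    rng = np.random.default_rng(int(np.sum(latent ** 2) * 1e6) % (2 ** 32))
    return rng.integers(0, NUM_CLASSES, size=(size, size))

def render_image(mask):
    # Stub "conditional renderer": RGB imagery aligned pixel-for-pixel
    # with the mask, so the pair comes pre-labeled.
    palette = np.linspace(0.0, 1.0, NUM_CLASSES)
    return np.stack([palette[mask]] * 3, axis=-1)

def synthesize_pair(latent_dim=512, seed=0):
    # Sample a mask first, then imagery conditioned on it; the result is
    # a ready-made labeled example for detector or segmenter training.
    latent = np.random.default_rng(seed).standard_normal(latent_dim)
    mask = sample_mask(latent)
    return render_image(mask), mask

image, mask = synthesize_pair()
```

The key design point is the ordering: because the imagery is generated conditioned on the mask, every synthetic image arrives with pixel-accurate labels for free, which is what makes the output directly usable as training data.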

From a qualitative perspective, it is difficult to differentiate the real and synthetic datasets. The Fréchet Inception Distance (FID) metric is commonly used to quantitatively measure how well generated data match the real data's distribution. Informally, FID measures how different generated images are from real images when both are processed through a particular DNN trained on the ImageNet dataset; lower is better. We observed that increasing the quantity of training data for the GANs resulted in an increased FID score, meaning the generated images became less similar to real images. This makes sense, as GANs learn to interpolate between training images, which becomes more difficult as the number and diversity of training images increase.
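The FID computation itself reduces to the Fréchet distance between two Gaussians fit to DNN activations. A minimal NumPy sketch of that distance follows; in practice the means and covariances are computed from Inception network features of the real and generated image sets, whereas the 2-D values below are toy stand-ins:

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    # Symmetric PSD square root of sigma2 via eigendecomposition.
    vals, vecs = np.linalg.eigh(sigma2)
    sqrt_sigma2 = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    # Tr((sigma1 sigma2)^(1/2)) equals Tr((s2^(1/2) sigma1 s2^(1/2))^(1/2)),
    # and the inner matrix is symmetric PSD, so eigh applies again.
    inner = sqrt_sigma2 @ sigma1 @ sqrt_sigma2
    inner_vals = np.linalg.eigvalsh(inner)
    trace_sqrt = np.sum(np.sqrt(np.clip(inner_vals, 0, None)))
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * trace_sqrt

# Toy 2-D statistics: a unit shift in the mean with identical covariance.
mu_real, cov_real = np.zeros(2), np.eye(2)
mu_fake, cov_fake = np.array([1.0, 0.0]), np.eye(2)
fid = frechet_distance(mu_real, cov_real, mu_fake, cov_fake)  # → 1.0
```

Identical distributions give a distance of zero, and any mismatch in either the means or the covariances increases the score, which is why a rising FID signals generated imagery drifting away from the real distribution.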

When using GAN-generated data to augment real training datasets, a similar trend holds. If only a small amount of data is available to train the GANs and an object detector (in this case, RetinaNet[21]), mean average precision (mAP) can increase by more than 10 percent relative to training with real data alone using standard data augmentation methodologies. For a practical comparison, this improvement is about 40 percent of the benefit realized by exhaustively labeling an additional 6,000×6,000 pixel image. As the number of training images increases, the relative improvement in mAP decreases, until eventually the GAN augmentation method becomes detrimental. The pipeline is therefore effective only when very little labeled data is available; if labeled data are abundant, it may offer no benefit and could even hurt performance. For small amounts of training data, however, these methods provide an additional boost in performance beyond traditional augmentation techniques.

Summary and Future Work

In some imagery domains, the computer vision tasks of classification, detection, and segmentation can be viewed as solved problems in the sense that, given plentiful, diverse, and well-labeled data, off-the-shelf techniques can now approach or even exceed human-level performance. Unfortunately, in practice, these data requirements often far exceed the volume, diversity, and fidelity of most labeled datasets. In addition, these off-the-shelf techniques typically don’t hold up well to the highly imbalanced datasets that are the norm in many applications. These problems are compounded by the fact that techniques for transferring information from labeled data in one domain to another (often called domain adaptation) do not remotely approach human-level performance but are an active area of research.

Aside from labeling more data, which can be costly or even impossible in some scenarios, the two main approaches to augmenting scarce data are synthesizing data via computer graphics and via generative models. Both techniques have shown promise on remotely sensed imagery but share a common shortfall: They optimize for photorealism instead of usefulness as training data. Other than humans adjusting hyperparameters, neither approach uses prediction accuracy as a training signal to improve the simulation. The situation is akin to students preparing for an exam while the teacher completely ignores exam performance in further curriculum development. In proper instruction, the curriculum is dynamically adjusted based on student performance. In machine learning (ML), this feedback loop, wired via gradient descent, is referred to as meta-learning. We anticipate that future advances in data synthesis for ML will come from unifying graphics and generative approaches in a meta-learning construct that directly optimizes for the desired computer vision task rather than for photorealism.


This article is approved for public release by the National Geospatial-Intelligence Agency #20-084.

  1. Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. “Learning Active Learning from Data.” In Advances in Neural Information Processing Systems, pp. 4225-4235. 2017.
  2. https://blogs.nvidia.com/blog/2018/09/13/how-maglev-speeds-autonomous-vehicles-to-superhuman-levels-of-safety/
  3. https://developer.nvidia.com/clara
  4. Geoffrey Hinton. “To Recognize Shapes, First Learn to Generate Images.” Progress in Brain Research. 2007. no. 165: 535-547. https://doi.org/10.1016/S0079-6123(06)65034-6
  5. http://www.dirsig.org
  6. Sanghui Han, Alex Fafard, John Kerekes, Michael Gartley, Emmett Ientilucci, Andreas Savakis, Charles Law et al. “Efficient Generation of Image Chips for Training Deep Learning Algorithms.” In Automatic Target Recognition XXVII. 2017. vol. 10202, p. 1020203. International Society for Optics and Photonics.
  7. Yiming Yan, Zhichao Tan, and Nan Su. “A Data Augmentation Strategy Based on Simulated Samples for Ship Detection in RGB Remote Sensing Images.” ISPRS International Journal of Geo-Information 2019:8(6):276.
  8. Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.” In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 23-30. IEEE. 2017.
  9. A.F. Frangi, S.A. Tsaftaris, and J.L. Prince. “Simulation and Synthesis in Medical Imaging.” IEEE Transactions on Medical Imaging. 2018:37(3):673-679. doi:10.1109/TMI.2018.2800298.
  10. Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive Growing of GANs for Improved Quality, Stability, and Variation.” 2017. arXiv preprint arXiv:1710.10196.
  11. Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. pp. 8798-8807.
  12. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems. 2014. pp. 2672-2680.
  13. Felix Lau, Tom Hendriks, Jesse Lieman-Sifry, Sean Sall, and Dan Golden. “Scargan: Chained Generative Adversarial Networks to Simulate Pathological Tissue on Cardiovascular MR Scans.” In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. 2018. Springer, Cham. pp. 343-350.
  14. Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, and Jimeng Sun. “Generating Multi-label Discrete Patient Records Using Generative Adversarial Networks.” In Proceedings of the 2nd Machine Learning for Healthcare Conference. 2017. PMLR 68:286-305.
  15. Kyongsik Yun, Kevin Yu, Joseph Osborne, Sarah Eldin, Luan Nguyen, Alexander Huyen, and Thomas Lu. “Improved Visible to IR Image Transformation Using Synthetic Data Augmentation with Cycle-Consistent Adversarial Networks.” In Pattern Recognition and Tracking XXX. 2019. vol. 10995, p. 1099502. International Society for Optics and Photonics.
  16. Bilel Benjdira, Yakoub Bazi, Anis Koubaa, and Kais Ouni. “Unsupervised Domain Adaptation Using Generative Adversarial Networks for Semantic Segmentation of Aerial Images.” Remote Sensing. 2019: 11(11):1369.
  17. Junghoon Seo, Seunghyun Jeon, and Taegyun Jeon. “Domain Adaptive Generation of Aircraft on Satellite Imagery via Simulated and Unsupervised Learning.” 2018. arXiv preprint arXiv:1806.03002.
  18. Jonathan Howe, Kyle Pula, and Aaron A. Reite. “Conditional Generative Adversarial Networks for Data Augmentation and Adaptation in Remotely Sensed Imagery.” 2019. arXiv preprint arXiv:1908.03809.
  19. Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive Growing of GANs for Improved Quality, Stability, and Variation.” 2017. arXiv preprint arXiv:1710.10196.
  20. Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. pp. 8798-8807.
  21. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. “Focal Loss for Dense Object Detection.” In Proceedings of the IEEE International Conference on Computer Vision. 2017. pp. 2980-2988.

Posted by USGIF