The roles of data collection, storage, management, analysis, and consumption as factors that drive designs
By Chuck Herring, AllSource Analysis; Dave Gauthier, NGA; Ed Kunz, Midwest Aerial; Jenny Yu, Kimetrica; Steven Truitt, Descartes Labs
The ability to collect, analyze, and use data about our environment is increasing every day. This abundance of capability and promise of continued growth are key elements in the resurgence of digital transformation across all sectors of industry, government, and private life. Engineering solutions for this world, especially when working with heavy and complicated remote sensing data, can pose unique challenges that call for new and innovative solutions. In this article, we inspect the roles of data collection, storage, management, analysis, and consumption as factors that drive designs.
Data collection has matured greatly from the era of black-and-white photography. Modern sensors, whether on the ground, in the air, or in space, cover the full electromagnetic spectrum with a variety of sensor types, and can often collect multiple modalities simultaneously, ranging from wide-area, low-resolution coverage to small-area, high-resolution coverage. These diverse options, combined with choices in the time domain among space-based periodic revisit, air-based on-demand collection, and ground-based constant collection, create an incredibly rich set of data collection options to choose from.
Adding to the range of options, data can be collected in multiple operating modes. In the aerial domain alone, for example, the options include digitizing historical film imagery, selecting the imaging sensor and telescope before a flight, and selecting a camera’s operating mode on the fly. This data, once collected, must then be coupled with location information (e.g., from GPS or inertial sensors). Selecting a level of processing and which algorithms to apply adds another decision layer. Finally, all of this collection optimization compounds with data transfer schemes that can include post-collection transfers, streaming collection, or targeted sub-collect delivery. Space- and ground-based collectors have a similar range of choices, although they are somewhat more constrained by the environments they operate in.
As if the technical complications were not sufficient, the business model of a data collection campaign matters as well. At least in the aerial domain, post hoc sales of individual datasets from a data library represent a value of approximately 1 percent of new collection. While this number differs by collector, sensor type, and regime, the notion that data collection is highly time-dependent remains constant. This poses both a challenge and an opportunity: If additional value can be extracted from entire historical collection libraries, data can be made more affordable and data collectors can expand into new business models. If it cannot, then the costs of storage and management can be overly burdensome, and the data collection world will remain bifurcated into targeted high-resolution and wide-area low-resolution.
Storage, Management, and Analysis Enablement
The eruption of cloud computing (and the renaissance of edge computing) has solidified one truth for the foreseeable future: High-performance IT design for complex problems will be dominated by the ability to centralize data. Additionally, the rate of growth of data collection capabilities is continuing to outstrip the capacity of global communication networks. For small data volumes, this does not pose a substantial issue; however, for large-volume collection and use, there are two domains where this gap between collection and communication becomes evident.
When collecting heavy data at long standoff ranges, the data must often pass through a narrow communications channel to move from the sensor to a data center. Space systems are often constrained in this way, where the sensor is able to capture much more information than a communications link is able to handle. Alternatively, when analysis spans many types of collection or takes place over a long time period, moving data to a common location for co-processing can be exceedingly taxing on even high-performance network systems. Put very simply, the root of both problems is a cost concern about how often expensive communications links are used.
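The gap between what a sensor captures and what a link can carry can be illustrated with simple arithmetic. The sketch below is a minimal illustration only: the function name and all of the rates and durations are hypothetical values chosen for the example, not figures from any real system.

```python
def downlinkable_fraction(sensor_rate_mbps, link_rate_mbps, collect_s, contact_s):
    """Fraction of collected data that fits through the link in the contact time.

    All inputs are illustrative assumptions: rates in megabits per second,
    durations in seconds.
    """
    collected_megabits = sensor_rate_mbps * collect_s
    link_capacity_megabits = link_rate_mbps * contact_s
    if collected_megabits == 0:
        return 1.0  # nothing was collected, so nothing is left behind
    return min(1.0, link_capacity_megabits / collected_megabits)


# Hypothetical pass: the sensor collects for 10 minutes at 2,000 Mbps,
# but the link offers only 8 minutes of contact at 500 Mbps.
fraction = downlinkable_fraction(
    sensor_rate_mbps=2000, link_rate_mbps=500, collect_s=600, contact_s=480
)
print(f"Share of collected data that can be moved: {fraction:.0%}")
```

Under these assumed numbers, only a fifth of the collected data could leave the sensor, which is the pressure that motivates processing near the sensor or being selective about what is sent.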
From a data storage and management perspective, there are three potential solutions:
- Heavy processing at or near the sensor to minimize comms.
- Integration across data centers of convenience (inter-cloud).
- Centralization into common data stores.
The first option, processing at or near the sensor, is always a good idea and is nearly always implemented to some degree. The challenges here consist primarily of size, weight, and power concerns when balancing preservation of signal (larger data size) with information extraction (larger power and size, potential to lose valuable data). At the end of the day, the ongoing proliferation of increasingly powerful computing units will lead to a state of equilibrium where sensors are producing compressed or extracted data of high value with minimal data loss.
The second option, integration across data centers, simplifies integration for sensor providers, as a local data center can be used rather than identifying a global communications route. For data sources that are accessed locally or analyzed in isolation, this serves as a good choice. It strikes a balance between cloud computing power and communications network difficulty. This approach is particularly good for analysis of extremely large datasets: because storage and compute are colocated, network performance within a data center is orders of magnitude higher than when crossing an internet backbone between data centers.
The third option, data centralization, is the only real choice when performing multisource large computations. While the upfront cost and difficulty to move data to a central location is higher than either edge or regional storage, the nature of multisource analysis means that data will be accessed multiple times to be computed in a single location. This means that if a regional data center is used, the data will be pulled multiple times over the internet rather than one single time for the initial centralization. Furthermore, this type of design benefits from the same intra-network latency reduction as the second option, but can cover multiple datasets.
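The trade between regional and central storage can be sketched as a comparison of how much data crosses the wide-area network. This is a simplified model under stated assumptions: the regional case pulls every dataset over the internet on every analysis run with no caching, and the extra upfront difficulty of centralization is represented by a hypothetical `upfront_overhead` multiplier. The dataset names and sizes are invented for illustration.

```python
def wan_volume_tb(dataset_sizes_tb, analysis_runs, centralized, upfront_overhead=0.0):
    """Terabytes moved across the wide-area network for a multisource analysis.

    Simplifying assumptions (not from any real deployment):
    - regional: every run pulls every dataset over the internet, no caching
    - centralized: every dataset is moved once, inflated by an overhead factor
    """
    total_tb = sum(dataset_sizes_tb.values())
    if centralized:
        return total_tb * (1.0 + upfront_overhead)
    return total_tb * analysis_runs


# Hypothetical multisource project with three datasets and six analysis runs.
datasets = {"optical": 40.0, "sar": 25.0, "ground_truth": 5.0}
regional_tb = wan_volume_tb(datasets, analysis_runs=6, centralized=False)
central_tb = wan_volume_tb(datasets, analysis_runs=6, centralized=True,
                           upfront_overhead=0.5)
print(f"regional: {regional_tb} TB, centralized: {central_tb} TB")
```

In this model centralization pays off as soon as the number of runs exceeds `1 + upfront_overhead`, which matches the intuition that repeated multisource access favors a single initial move.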
For the foreseeable future, data collection rates will exceed communication bandwidth from the sensor to a high-bandwidth fiber connection, and analysis is continuing to grow in complexity and scale. Ultimately, it is a cost and performance question that will dictate whether regional or central management is the best architecture, with centralization proving more effective if analysis is multisource and large-scale.
Analysis, Exploitation, and Data Science
Analysis, exploitation, and use of data science to solve problems can take myriad forms. For illustrative purposes, we look at two models that are near the extremes of what is possible:
- Multiple-model assemblies for fully automated environment characterization.
- Human interpretation of problems with high cultural and historical context requirements.
These two extremes share some similarities regarding the speed and scale of automation support; however, they also differ dramatically in the demands placed on underlying systems.
Developing multiple-model assemblies
In developing multiple automated models to predict complex environmental and human processes, various datasets from different sources must be collated for preprocessing and normalization. For this kind of workflow to be scalable, storage and inventory must be centralized for the satellite imagery and underlying truth data, along with an interface platform to query the data collection in a consistent way. Federating out to multiple environments carries a price-and-performance cost that interferes with model iteration cycles. The centralized approach solves the most common issue in working with remote sensing data: how to efficiently manage the data in a pipeline and develop a system that is deployable outside of a local environment and its dependencies.
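As a rough illustration of what querying a centralized inventory "in a consistent way" might look like, the sketch below defines a tiny in-memory catalog. The `Scene` and `Catalog` classes, their fields, and the sample records are all hypothetical; a production system would sit on a real database or catalog service rather than a Python list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Scene:
    scene_id: str
    source: str      # e.g. "satellite", "aerial", "ground_truth" (illustrative)
    acquired: date
    bbox: tuple      # (west, south, east, north) in degrees


def intersects(a, b):
    """True when two (west, south, east, north) boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


class Catalog:
    """One inventory and one query interface for every data source."""

    def __init__(self):
        self._scenes = []

    def add(self, scene):
        self._scenes.append(scene)

    def search(self, bbox, start, end, sources=None):
        return [
            s for s in self._scenes
            if intersects(s.bbox, bbox)
            and start <= s.acquired <= end
            and (sources is None or s.source in sources)
        ]


catalog = Catalog()
catalog.add(Scene("s1", "satellite", date(2020, 1, 5), (0, 0, 10, 10)))
catalog.add(Scene("s2", "aerial", date(2020, 3, 1), (5, 5, 6, 6)))
catalog.add(Scene("s3", "ground_truth", date(2019, 6, 1), (50, 50, 51, 51)))

# The same call shape works regardless of which sensor produced the data.
hits = catalog.search(bbox=(4, 4, 7, 7), start=date(2020, 1, 1), end=date(2020, 12, 31))
print([s.scene_id for s in hits])
```

The design point is that modelers filter by area, time, and source through one call, instead of learning a separate access pattern for each provider.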
From a modeling feature perspective, some key attributes for large-scale modeling are:
- Worldwide spatial coverage, in particular coverage for rural and remote areas of interest.
- Consistent temporal coverage.
- Distributed processing to reduce the time to analyze vast areas.
When these three attributes are present, training and deploying automated models is quite feasible. For example, computer vision models that extract human geography can be quickly iterated and deployed over continental scales when there is the combination of GPUs, consistent and multisource data, and scalable access. Almost as important as the computing environment in this scenario is the ability to respond quickly on iteration; systems that can reduce the cycle time from training to testing dramatically improve modeling efficiency. Multiple-model assemblies are in essence the combination of multiple people’s understanding of the world, and conveying these perspectives to each other is essential. A platform that enables this human-to-human interaction in addition to performing large-scale computation is therefore critical.
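The distributed-processing attribute can be sketched as splitting an area of interest into tiles and analyzing the tiles concurrently. This is a toy example under stated assumptions: `analyze_tile` is a hypothetical stand-in for model inference that simply returns the tile's area in square degrees, and a local thread pool stands in for what would in practice be a cluster of workers.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(min_lon, min_lat, max_lon, max_lat, tile_deg):
    """Cut a bounding box into (west, south, east, north) tiles of tile_deg degrees."""
    n_cols = math.ceil((max_lon - min_lon) / tile_deg)
    n_rows = math.ceil((max_lat - min_lat) / tile_deg)
    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            west = min_lon + col * tile_deg
            south = min_lat + row * tile_deg
            # Clip the last row and column to the edge of the area of interest.
            tiles.append((west, south,
                          min(west + tile_deg, max_lon),
                          min(south + tile_deg, max_lat)))
    return tiles


def analyze_tile(tile):
    """Hypothetical placeholder for per-tile model inference."""
    west, south, east, north = tile
    return (east - west) * (north - south)


def analyze_area(tiles, max_workers=4):
    # Each tile is independent, so the work can be spread across workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_tile, tiles))


tiles = split_into_tiles(-10, 0, 10, 10, tile_deg=5)
results = analyze_area(tiles)
print(len(tiles), "tiles processed")
```

Because the tiles do not depend on one another, the time to cover a vast area in this pattern is governed mainly by the number of workers available and by how quickly each worker can reach the data.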
Interpreting activity using deep contextual and cultural knowledge
The far opposite of automated environmental modeling is the application of deep human knowledge of context and history as broadly as possible to understand particular goals and activities. This approach has always relied on a many-source approach to providing insights, with steps commonly known to this community such as finished geospatial intelligence. As human as this process is, it can still benefit greatly from large-scale data processing, as artificial intelligence (AI) has been demonstrated to greatly reduce the search space and focus human analysis on true areas of interest.
Traditionally, analysts have had to make inferences when viewing imagery and disparate datasets, oftentimes requiring separate analysis techniques for each. Each person would be naturally limited by the pace at which these datasets could be joined as well as the pace at which potential targets could be evaluated. When data and systems are separate, this leads to a compounding challenge, as people are asked both to context-switch between systems and datasets and to transit multiple locations, multiple times. The time involved just in mechanically walking through this process can be substantial, although the real challenge lies in maintaining attention and context through multiple sets of information. As the range of sensors and types of data increases, this problem can compound to the point where otherwise valuable data cannot be gainfully integrated in an analysis while maintaining reasonable timeliness.
Automation and integration of these sources is proving to be a fertile ground for improvement, however. Of special interest is the ability to place dynamic AI capabilities in the hands of analysts such that multiple data sources and techniques can be deployed on demand or pre-processed across a wide area for search-space reduction. As every analytic challenge is unique, the ability to adapt capabilities on the fly and have a near-real-time response from the system holds enormous potential for site identification and activity characterization. This human-in-the-loop approach to rapid system design has long been considered a valuable target, and recent capabilities benefiting from data centralization and configurable AI are making it reality.
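Search-space reduction of the kind described above can be sketched as a triage step: an automated model scores every location, and only the highest-scoring candidates are queued for a human analyst. The function name, tile identifiers, scores, and threshold below are all hypothetical and chosen only for illustration.

```python
def triage(scores, threshold, max_sites=None):
    """Return (site, score) pairs worth an analyst's attention, best first."""
    candidates = [(site, score) for site, score in scores.items()
                  if score >= threshold]
    candidates.sort(key=lambda item: item[1], reverse=True)
    if max_sites is not None:
        candidates = candidates[:max_sites]
    return candidates


# Hypothetical model outputs: likelihood of activity of interest per tile.
scores = {
    "tile_001": 0.12,
    "tile_002": 0.91,
    "tile_003": 0.47,
    "tile_004": 0.78,
    "tile_005": 0.05,
}

queue = triage(scores, threshold=0.4)
reduction = 1 - len(queue) / len(scores)
print([site for site, _ in queue], f"search space reduced by {reduction:.0%}")
```

The threshold is the human-in-the-loop control in this sketch: an analyst can loosen it to avoid missing sites or tighten it to focus limited attention, and the final judgment on each queued site remains theirs.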
The final consideration from an analysis perspective is that as AI and automation improve in the ability to perform inference of increasing complexity, the role of the human analyst becomes more important for both final product creation and information integrity validation. Even in an ideal world where there is a perfectly accurate automated model, translating this refined information into a product that can be consumed and acted upon will depend on a deep understanding of both the human problem and the supporting observation technology.
The digital transformation of organizations is a dominant trend for the next decade. The technologies and designs adopted for the collection, storage, management, and use of multisource and large-scale data will directly impact the pace of transformation for organizations that depend on this information. Trade-offs exist between operational efficiencies designed for the user, the data storage solution, the data transport layer, and on-demand analytic computing power. Organizations need to determine which design trades are best for their own mission outcomes.
One factor that often gets overlooked is that organizations also have different user types: managers, analysts, and scientists, to name a few. One initial insight from integrating in large organizations is to optimize data environments and management systems more for the scientific requirements because that is where an organization can mine the most potential for creating new value. This translation from scientific discovery creates a competitive edge.
There has been an incredible push to move legacy capabilities to cloud-native versions. However, in the organizational push to migrate onto cloud environments and provide shareable data, a key capability has been left behind: multisource analytics at scale. This requirement of people producing useful intelligence was formerly served by niche capabilities but must now itself be translated to a high-performance, scalable solution. This raises the ultimate question for geospatial system design: Going forward, will we evolve into hybrid-cloud architectures, duplicate data across multiple custom architectures, or see the pendulum swing back to more on-premises solutions?
Time will tell; there are compelling cases for each.