The community must go beyond statistically analyzing data to truly gain understanding
Many experts predict that by 2020 the global universe of accessible data will be on the order of 44 zettabytes (44 trillion gigabytes), with no sign of the exponential growth slowing. As a result, data science has quickly been thrust to the forefront of the international job market, and people cannot seem to get enough of it. Salaries for data scientists have increased significantly in the past few years, as demand for a workforce fluent in scripting, machine learning (ML), and data analytics inundates job sites around the globe. Websites such as glassdoor.com list the top three jobs of 2016 as data scientist, DevOps engineer, and data engineer. The market is responding with a slew of data science degree programs, with indicators that more students are opting for bachelor’s and master’s degrees over Ph.D.s, likely in an attempt to enter the competitive job market.
Supporting this observation is the fact that salaries for data scientists started to level off as an influx of new data scientists entered the job market. Still, the demand for data scientists remains high; and the race to score employment in one of the century’s “sexiest jobs” is arguably at its peak with little sign of slowing down.
The United States Intelligence Community (IC) is just now starting to demand these skills in earnest as it strives to maintain its leading edge to support national policy-makers and military forces, as well as to protect the nation’s borders and interests abroad. While the delayed diffusion of private sector practice to the public sector is not new, the speed of technology growth has exacerbated the time lag and placed the IC behind the power curve. The rapid growth in sensor diversity and volume in the unmanned aerial systems (UAS) market alone, compounded by the resulting flood of derived products, structured observations, and increasing volumes of publicly available information, is simply overwhelming analysts.
Data scientists’ ability to navigate petabytes of raw and unstructured data, then clean, analyze, and visualize it, has routinely proven their value to the decision cycles of their often non-technical leaders. It is no wonder, then, that the demand signal to meet the IC’s big data problem has created a buzz around data science, with many senior executives wanting more of “it.”
However, there is little strategic assessment of what actual skills will be needed in the future or how these emerging technologies and data science tools should reshape the IC’s organizational dynamics. But this is not solely the fault of senior executives. We feel there is a definition problem with data science; it is too general, too broad, and continually expanding. We also believe that while data science undoubtedly has a future in the National Geospatial-Intelligence Agency’s (NGA) vision to “Know the Earth, Show the Way, Understand the World,” the community must go beyond statistically analyzing data collected on the world around us to truly gain an understanding of the people who inhabit the world.
Future policy-makers and military leaders will be faced with a complex environment that is increasingly urban and unstable, where the observed complexity of people’s behavior over time is actually a reflection of the complexity of the system in which they are immersed. The information technology revolution we are witnessing with data science is allowing policy-makers and military leaders to see the complexity of the world they are trying to influence, but grappling with this complexity will require new mental and organizational paradigms. Geospatial intelligence (GEOINT) is critical to the characterization and understanding of this complex world, providing the context and visualization necessary to support the decision-making process at all echelons. Perhaps for this reason, the most demanding area for advancement in computational tradecraft should be in the realm of GEOINT. The overwhelming volume, size, diversity, and complexity of geospatial data, and the speed at which it is generated, require significant improvements to the processes fielded by today’s GEOINT practitioners.
Future GEOINT practitioners will also need to apply these data to support requirements for near real-time human interpretation and synthesis into intelligence in order to describe and visualize the operating environment and provide objective predictions of physical and human actions. The National System for Geospatial Intelligence (NSG) must transition away from a discipline doctrinally constrained by multiple single-source stovepipes and embrace a multidisciplinary, dynamic, and computational analytic approach dedicated to addressing complex geographic and social issues.
During USGIF’s GEOINT 2017 Symposium, NGA signaled its intent to shift its workforce planning heavily toward data science, even suggesting it will no longer hire analysts without computer programming skills. Even the director of NGA is taking a Python course. Naturally, NSG members are following NGA’s lead, initiating pilots to build out data science capabilities within their current structures.
In all the demand for data scientists, something is lost—the fact that data science will not be enough for the future of GEOINT.
- This article is part of USGIF’s 2018 State & Future of GEOINT Report.
Data Science Undefined
Today’s NSG leaders are united in their recognition of the need to respond to the increasingly massive amounts of generated data, growing in veracity and volume, and want employees capable of searching, wrangling, and analyzing it. These leaders seem to agree that data science is the profession appropriate to perform these duties, regardless of the fact that only a de facto definition for data science exists. A recent U.S. Air Force (USAF) white paper on Intelligence, Surveillance, and Reconnaissance (ISR) offers a definition in which data scientists “[extract] knowledge from datasets … find, interpret, and merge rich data sources; ensure consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate data insights/findings to a variety of audiences.”
Data science is not a new phenomenon. In fact, as early as 1977, John Tukey—the scientist who coined the term “bit”—was developing statistical methods for digital data. In a 2013 blog “rant,” one author posits data science is simply the “application of statistics to specific activities.” Following that, “we name sciences according to what is being studied. … If what is being studied is business activity … then it is not ‘data science,’ it is business science.”
This is an extremely important counterpoint to the USAF ISR white paper that concludes, “adding ‘data science’ to an intelligence analyst’s job description would both diminish the focus on his or her core competency (intelligence analysis) and also result in sub-optimal data science.”
We feel the danger in the current NSG narrative lies in the expected degree of data science integration and focus. Integrating new technologies into one’s career field is critical, and intelligence analysis should not be any different. It is not that data science is a powerful breakthrough in and of itself, but rather that the application of computational analytic tools to enhance domain knowledge demonstrates exponential gains.
Exacerbating the definition problem is the tendency of leaders to excitedly fold developing technologies into data science, most likely on the assumption that because the technologies rely on big data and computers, they are data science. Artificial intelligence, traditionally captured under the umbrella of “computer science,” is suddenly being lumped under “data science” as well, possibly because it requires massive training data.
This brings to light the emerging problem of senior leaders throughout NSG and industry blurring the lines of an already loose definition and searching for the rumored “unicorn”: a geospatial or imagery analyst who can map-reduce multiple near real-time data feeds from the cloud, develop an ML neural network and deploy it into the cloud, and improve computer vision to automatically extract targets, ultimately providing policy-makers with advanced visualizations of predictive assessments on socio-political activities. This turns data science into a cure-all black box instead of an integral tool that should be present in each career field.
While this may seem like semantics, it is important for the community to realize the implications of the narrative that data science can provide all the answers without knowledge of geospatial and social sciences. The NSG must establish a common understanding of what data science can and, perhaps more importantly, cannot do in order to develop concrete strategies to move forward.
Data science is focused on how to access, store, mine, structure, analyze, and visualize data. This requires deep expertise in computational statistics and plays an important part in getting pertinent data from the information technology and computer science sphere into the hands of the geospatial and social scientists focusing on their subject matter expertise. However, this deep expertise comes at a price: few data scientists will be experts in critical geospatial principles, and they will more likely focus on their ability to write processing scripts. For instance, while the application of pure data science has discovered new species through statistical signatures, it offers no information on “what they look like, how they live, or much of anything else about their morphology.” Data scientists are also not software developers, which means they are unlikely to implement the algorithms and develop the tools to auto-extract objects from imagery or deploy progressive neural networks. Moreover, as tools implement more successful ML algorithms, analysts will likely be expected to move up to higher-level tasks. For these reasons, it is highly unlikely that data science is the answer to the problems geospatial and social scientists are trying to solve; rather, it serves as a tool to be leveraged when and where appropriate.
The reality of the deluge of spatiotemporally enabled data is that it is both a data science problem and a geospatial domain problem. A modern weapon system offers an analogy. Soldiers spend countless hours developing the expertise to effectively employ the weapon system, while still performing some basic maintenance and operations. In direct support of this system, however, mechanics and system specialists who are part of the team complete most of the major maintenance and modernization. Consequently, when the weapon system is operating subpar, discussions between the crews and maintainers are critical. Similarly, the ability to script and an understanding of statistical algorithms will make GEOINT analysts more effective at operating their tools to address more complex issues, but their primary concentration should be on fundamental intelligence tradecraft and domain knowledge. It follows that as GEOINT analysts focus on the challenges of synthesizing data on the complex geospatial and social environment into intelligence, computer and data scientists should focus less on basic data formatting and process simplification for GEOINT analysts and concentrate more on the challenges of researching, processing, visualizing, and developing new data streams and tools that are critical to maintaining the IC’s competitive edge.
The benefits of this symbiotic, multidisciplinary approach go beyond data science. GEOINT analysts who are versed in the foundations of computer and data science, and able to communicate with data and computer scientists, will be able to overcome the hurdle of data wrangling and advance toward geospatial computational social science. This position is in line with an earlier published RAND Corporation paper for the Defense Intelligence Agency, in which data science is termed “a team sport.”
The Future GEOINT Analyst
Geospatial computational social science (CSS) is an emerging area of diverse study that explores geographic and social science through the application of computing power, which includes data science. With origins closely tied to those of advanced computing and GIS, the geospatial CSS field is in relative infancy when compared to the traditional schools of sociology, political science, anthropology, economics, and geography. It is important to note CSS does not replace these traditional social sciences, but rather advances them through applications of computational methods. By leveraging high-performance computing, advanced geostatistical analytics, and agent-based modeling, geospatial CSS empowers a multidisciplinary approach to the development of methodologies and algorithms to gather, analyze, and explore complex geospatial and social phenomena.
Geospatial CSS presents a nexus of geographical information science, social network analysis, and agent-based modeling. It will require a solid foundation in geographic principles and the ability to apply computational thinking to complex social problems. Already, programs are being written that simulate a 1:1 ratio of humans to computer agents. Imagine a catastrophic scenario in a megacity such as New York, and being able to simulate what every human in the city may do in reaction to the event, layered over high-precision terrain, physical models of buildings, super- and subterranean features, dynamic traffic patterns, and reactive infrastructures.
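To make the idea of agent-based modeling concrete, the following is a deliberately tiny sketch, not one of the simulation programs mentioned above. It assumes a square grid standing in for a city, agents that wander until they notice an event (directly or from an adjacent agent who already knows), and a simple "move away from the event" reaction. Every name and parameter here (`Agent`, `step`, `simulate`, `alert_radius`, the grid size) is hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    x: int
    y: int
    aware: bool = False  # has this agent noticed the event yet?


def step(agents, event, grid_size, alert_radius, rng):
    """Advance every agent by one time step (toy rules, assumed for illustration)."""
    ex, ey = event
    # Snapshot positions of already-aware agents so awareness spreads one hop per step.
    aware_positions = [(a.x, a.y) for a in agents if a.aware]
    for a in agents:
        if not a.aware:
            # Become aware when near the event or next to an agent who already knows.
            near_event = abs(a.x - ex) + abs(a.y - ey) <= alert_radius
            near_aware = any(
                abs(a.x - px) + abs(a.y - py) <= 1 for px, py in aware_positions
            )
            a.aware = near_event or near_aware
        if a.aware:
            # Flee: move one cell away from the event along each axis.
            dx = 1 if a.x >= ex else -1
            dy = 1 if a.y >= ey else -1
        else:
            # Unaware agents wander randomly (or stay put).
            dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)])
        # Keep agents inside the grid.
        a.x = min(max(a.x + dx, 0), grid_size - 1)
        a.y = min(max(a.y + dy, 0), grid_size - 1)


def simulate(n_agents=200, grid_size=50, steps=30, alert_radius=5, seed=0):
    """Run the toy model and return the agents plus the aware-count per step."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    agents = [
        Agent(rng.randrange(grid_size), rng.randrange(grid_size))
        for _ in range(n_agents)
    ]
    event = (grid_size // 2, grid_size // 2)  # event at the grid center
    history = []
    for _ in range(steps):
        step(agents, event, grid_size, alert_radius, rng)
        history.append(sum(a.aware for a in agents))
    return agents, history
```

Calling `agents, history = simulate()` returns the final agent positions and a per-step count of agents who have become aware of the event. A realistic model of the kind described would replace the grid with terrain, building, and traffic data, and replace these hand-written rules with behavior grounded in social and geographic theory.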
Data science methodologies will undoubtedly play a key part in future multidisciplinary teams, helping to find the proverbial “needle in the stack of needles.” Geospatial CSS, however, is not only about making statistical inferences based on zettabytes of spatial-temporal observations; it is concerned with the exploration of the theories and processes that result from interactions caused by the observables. In a complex world, the aggregation of these interactions provides more unique pathways to understanding the reasoning behind the behavior of our adversaries than would a holistic analysis of the whole system. To extend the needle-in-a-stack-of-needles analogy, geospatial CSS aims to provide insight into why a needle would fall a particular way, and into a particular position in space and time, in that stack. Geospatial CSS will be key in advancing GEOINT to “Understand the World.”
The NSG should work closely with academia to shape future geospatial computational social scientists who will be able to apply advanced computational methods, such as agent-based modeling, social network analysis, geographic information science, and deep learning algorithms, toward analyzing and understanding physical and human geographic behaviors. Whereas high-performance computing, ML, and visualization fall mostly within computer science and image science, geospatial CSS presents the nexus of geographical information science, social science, and data science. Future GEOINT analysts will require enhanced skills, applying computational power to explore and test hypotheses based on social and geographic theory to truly achieve an understanding of human interactions.
We recommend the NSG focus analytical modernization initiatives on forming multidisciplinary teams to attack key intelligence questions using geospatial CSS now, to refocus the narrative on the future. We understand data science techniques are still widely needed now, but feel the NSG community must come together to decompose data science for the future, focusing on key skills that rely not only on data, but on advances in information technology (IT) architectures, computation, and the application of geospatial computation to the social sciences. This will also help to delineate and define tasks to establish government workforce structure and career development, especially in the Armed Services, where traditional career series such as Intelligence Specialist (0132), Physical Scientist (1301), or Operations Research Systems Analyst (1515) strictly define work roles.
History has repeatedly shown that new technology alone does not change the conduct of war; rather, it is how new technologies are integrated that creates advantage. This work will also guide industry initiatives and shape academia for the future geospatial scientist, rather than risk investing in a skill set that may be superseded in the future. In other words, generalizing data science as a black box catchall will risk creating generalists. This would result in the NSG losing sight of what truly matters: analysts using pertinent spatial-temporal data to provide timely, accurate, and objective assessments that not only monitor and analyze observed activity but also provide understanding of the geographic and human processes. This understanding requires the application of advanced computational methods to support the intelligence needs of policy-makers and warfighters.
Headline Image: Tracey Birch, Michael Gleason, and Ian Blenke, members of the SOFWERX data science team, develop software during the ThunderDrone Rodeo at their newest facility in Tampa, Fla., October, 2017. ThunderDrone is a U.S. Special Operations Command initiative dedicated to drone prototyping, which focuses on exploring drone technologies through idea formation, testing, and demonstration. Photo Credit: Master Sgt. Barry Loo