Posted on June 18, 2003
The Compleat Poindexter
The Pentagon released its congressionally mandated report on DARPA's Total (err, now Terrorist) Information Awareness initiative a couple of weeks ago. TIA is, you'll recall, Dr. John Poindexter's baby.
The report, in a paragraph of Rumsfeldian rhetoric, sets out the reasons why TIA was initiated after 9/11:
The major problems that TIA research and development aim to address include: the difficulties of sharing data across agency boundaries; mistaking absence of evidence for evidence of absence; confusing unfamiliar with improbable; having too many unknown unknowns, generating a single hypothesis versus competing hypotheses; and better exploitation of all permitted and open source information.
The least controversial of these is the first. DARPA’s effort to develop a secure collaborative environment that:
allows analysts on the edges of the organizations to quickly form ad hoc groups in virtual spaces (somewhat like chat rooms on the Internet) across organizational boundaries and, at the same time, use the center-based systems of their parent organizations
should be, as the report points out,
a big step forward in punching holes in the existing organizational ‘stove pipes’.
But TIA is mostly about new tools and techniques for identifying and tracking terrorists and their cells early, then generating options for action from that work. The initiatives are, in total, extraordinarily ambitious. And, in geek-ish, highly cool.
There is an interesting diagram (the TIA Reference Model) in the report which sets out all the components of TIA and their relationship to one another, using a signal processing analogy. One of the main issues in all this will be the signal-to-noise ratio in the system: the accuracy of all the assumptions built into the many models that process the massive volumes of structured and unstructured data fed into the system, as well as the accuracy and quality of the data itself. The utility, and the risks, of the options ultimately generated will depend entirely on this ratio. As the report indicates,
If the concepts and algorithms in IAO programs cannot extract terrorist signatures from a world of noise, even for simulated data, there is no reason to proceed.
As an example, one project called EARS is focused on getting the error rate in automated speech-to-text transcription down to 5 or 10%. Which means that 1 in 20 words, at best, will be mis-transcribed. And this would be a five-fold improvement over the current best rates for transcribing conversations. This technology is meant to be applied in English, Arabic and Chinese. One imagines that improvements in non-English languages will naturally lag. (Interestingly, the report notes that there are about 228 countries in the world, and our people speak about 6,700 languages…
DoD is currently interested in about 200 languages, and the list constantly changes.)
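To put those error rates in perspective, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that word errors are independent of one another and that a typical sentence runs 20 words; neither assumption comes from the report, and the function names are made up for this example.

```python
def clean_sentence_probability(word_error_rate: float, words: int = 20) -> float:
    """Chance that every word in a sentence is transcribed correctly,
    assuming (hypothetically) that word errors are independent."""
    return (1.0 - word_error_rate) ** words


def expected_errors(word_error_rate: float, total_words: int) -> float:
    """Expected number of wrong words in a transcript of total_words words."""
    return word_error_rate * total_words


# The two target rates mentioned above: 5% and 10%.
for rate in (0.05, 0.10):
    p = clean_sentence_probability(rate)
    errs = expected_errors(rate, 1000)
    print(f"error rate {rate:.0%}: {p:.1%} of 20-word sentences clean, "
          f"about {errs:.0f} wrong words per 1,000")
```

Under those assumptions, even at the 5% target only a bit more than a third of 20-word sentences would come through without a single wrong word.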
Although TIA has been much-blasted by legislators and others concerned about its implications for civil liberty and privacy, DoD is moving the program along rapidly. The 2004 Bush Budget includes $74 million in FY 2004 alone just for the subset of programs listed below which are most alarming in terms of privacy. TIA alone (only one of DARPA's initiatives in the overall Information Awareness Office) will receive $55 million from 2003 through 2005. That's quite a bit of prototyping money, these days, and it is leveraged by amounts spent by other departments which find their efforts seconded (without compensation) into TIA's prototype-and-test processes.
DARPA develops and buys (all via contract) whatever technologies it thinks might help predictive mass data mining, then runs those technologies over a prototype network at the Army's INSCOM IOC. (Just as TIA was "Total" and is now "Terrorist", so the blander "Information Operations Center" is the new name for the old "Information Dominance Center" ...). The Pentagon assures us that;
these agencies and commands are using data that is available to them under existing laws and procedures for these tests.
The military is, of course, acronym Babel, so it isn't surprising that TIA's R&D projects all have jarring names. EARS. TIDES. RAW. GALE. Babylon. Symphony. ARM. Bio-ALIRT. NGFR. MInDet. The initiatives fall into three areas:
1. Simultaneous translation. The Pentagon apparently hopes to make up for its lack of obscure-language human resources, and dependence on human processes, by automating translation. EARS, TIDES, GALE, Babylon, and Symphony all relate to this.
2. Data collection and mining. This is the effort to comb through vast quantities of data to 'mine' the nuggets critical to find terrorists. TIA, Genisys, EELD, SSNA, MInDet, Bio-ALIRT, HumanID, ARM, and NGFR are here.
3. Collaborative decision support. These initiatives are meant to allow decision-takers, particularly in a war setting, to take decisions based on widely disparate and fractal input. Genoa II, WAE, RAW, and FutureMAP live here.
The programs which,
if applied to data on U.S. persons, would raise serious issues about privacy
are all in category #2 above, from TIA through NGFR.
Here's the brief on each program in that category:
1. Genisys. "The program aims to develop a federated database architecture and algorithms that would allow analysts and investigators to more easily obtain answers to complex questions by eliminating their need to know where information resides or how it is structured in multiple databases." So, Sally, the FBI Terrorist Agent in Wichita, types in 'Muslim extremist railroad engineers' and, Presto!, the system spits back a report compiled from FBI and DOT and wherever else about anyone who fits that filter. The aim is high, here; by FY '05 the program will 'develop virtually centralized databases with no practical size limit'.
Genisys will guarantee privacy by "controlling access to unauthorized information, enforcing laws and policies through software mechanisms, and ensuring that any misuse of data ... is quickly detected."
AlphaTech is the prime contractor for this work, with Oracle its sub-prime.
It's important to emphasize how dependent privacy protection will be on technology and software under this scheme;
DARPA aims to develop algorithms that prevent unauthorized access to sensitive identity data based on statistical and logical inference control, and create roles-based rules to distinguish between authorized and unauthorized uses of data and to automate access control.
In other words, the protection will be only as good as the algorithms, inference statistics, and software rules are accurate.
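A minimal sketch of what "roles-based rules" plus an automated audit trail might look like in software. Everything here is hypothetical: the role names, the data categories, and the AccessController class are invented for illustration and are not drawn from the report or from any actual Genisys design.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of roles to the data categories they may see.
ROLE_PERMISSIONS = {
    "analyst": {"transaction_pattern"},
    "investigator": {"transaction_pattern", "identity"},
}


@dataclass
class AccessController:
    """Checks each request against the role table and records it."""
    audit_log: list = field(default_factory=list)

    def request(self, user: str, role: str, category: str) -> bool:
        # An unknown role gets an empty permission set, so it is denied.
        allowed = category in ROLE_PERMISSIONS.get(role, set())
        # Every request is logged, whether it was granted or not.
        self.audit_log.append((user, role, category, allowed))
        return allowed


controller = AccessController()
print(controller.request("sally", "analyst", "identity"))
print(controller.request("sam", "investigator", "identity"))
print(controller.audit_log)
```

The sketch makes the dependency plain: the check can only enforce whatever the permission table says, so a wrong entry in that table is a privacy failure that the software will carry out faithfully and log as authorized.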
PARC in Palo Alto is the prime contractor for the privacy technology part of Genisys.
2. EELD, Evidence Extraction and Link Discovery. This program will develop ...
a suite of technologies that will automatically extract evidence about relationships among people, organizations, places, and things from unstructured textual data, such as intelligence messages or news reports, which are the starting points for further analysis.
So in the above example, Sally would EELD (rather than, say, Google) Ahmet and Ali and Saudi Arabian Charities and, Presto!, a report is returned with rankings that provides a variety of source material pointing to connections among these 'things'.
The ultimate test of both the program's utility and its propensity for 'false positives', i.e. nailing someone who is actually innocent, will come in a year or so, when the program decides
which dots to connect -- starting with suspect people, places or organizations known or suspected to be suspicious based on intelligence reports; recognizing patterns of connections and activity corresponding to scenarios of concern between these people, places, and organizations; and learning patterns to discriminate as accurately as possible between real concerns and apparently similar but actually legitimate activities.
So, as long as Agent Sam has accurately specified "suspected to be suspicious" people or places or 'things', while Analyst Stephen's algorithms for 'recognizing patterns' and AI-expert Slade's rules for 'learning' are all accurate ... Sally won't get a system 'ping' from EELD that incorrectly identifies Akhmar as a terrorist...
3. SSNA, Scalable Social Network Analysis.
The purpose of SSNA algorithms program is to extend techniques of social network analysis to assist with distinguishing potential terrorist cells from legitimate groups of people, based on their patterns of interactions, and to identify when a terrorist group plans to execute an attack...... Based on publicly available information about the September 11 hijackers, contractors working under the EELD Program and Small Business Innovation Research (SBIR) contracts have demonstrated the feasibility of using these techniques to identify the transition a terrorist cell activity from dormant to active state by observing which social network metrics changed significantly and simultaneously.
You've probably seen social network diagrams, those clusters of connected nodes that show who has what relative authority in a network, and how interconnected, via communication frequency for example, the members of the network are. So, assuming that we monitor social interactions, financial transactions, and telephone calls among terrorist cell members, this system would predict when they're active. This assumes that the system has already successfully established that the connections between these people are in fact terrorist, rather than innocent, which presumably will be based on patterning algorithms of some sort (for example, if Achmed and Albert have at least one financial transaction over $1000 between them, are also members of the same mosque with a radical Imam, and also talk on the phone together at least once a week, then the probability that they are terrorists is high).
DARPA will develop a library of models of social network features that represent potential terrorist groups, then will develop algorithms that allow for the discovery of instances of these models in large databases.
Agent Sally will call up her SSNA Star on Achmed to see whom the system says he's connected to and whether they are terrorists, and the system will alert her when they all 'wake up'.
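To make "social network metrics changed significantly and simultaneously" concrete, here is a rough sketch. The report does not say which metrics or thresholds were used; the two metrics (density and highest degree), the 25% threshold, and the made-up contact lists below are all assumptions chosen for illustration.

```python
from collections import defaultdict


def network_metrics(edges):
    """Return (density, max_degree) for an undirected contact list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-contacts
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return 0.0, 0
    unique_edges = sum(len(v) for v in neighbors.values()) // 2
    density = unique_edges / (n * (n - 1) / 2)
    max_degree = max(len(v) for v in neighbors.values())
    return density, max_degree


def changed_significantly(before, after, threshold=0.25):
    """True only if every metric rose by more than `threshold` (relative)."""
    return all(b > 0 and (a - b) / b > threshold
               for b, a in zip(before, after))


# Invented contact data for four people, labeled A through D.
dormant = [("A", "B"), ("B", "C"), ("C", "D")]
active = [("A", "B"), ("A", "C"), ("A", "D"),
          ("B", "C"), ("B", "D"), ("C", "D")]

print(network_metrics(dormant), network_metrics(active))
print(changed_significantly(network_metrics(dormant), network_metrics(active)))
```

Note that nothing in these numbers distinguishes a cell going active from four cousins planning a wedding; the metrics only say that contact patterns changed, which is exactly why the prior labeling of the group matters so much.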
4. MisInformation Detection (MInDet)
This is Spielberg (from Phil Dick)'s AI, in part;
In FY 2002, researchers under SBIR contracts demonstrated the ability to detect public corporations that might be potential targets of Securities and Exchange Commission investigations, based on their SEC filings, well in advance of actual SEC investigations. They also demonstrated the ability to distinguish between news reports of deaths in a particular country as suicides or murders, depending on whether the sources were the official news agency or independent reports.
5. HumanID (Human Identification at a Distance)
The program will ... develop automated, multimodal biometric technologies with the capability to detect, recognize, and identify humans at a distance.... with a focus on body parts identification, face identification, and human kinematics. Biometric signatures will be acquired from various collection sensors including video, infrared, and multispectral sensors ... networked to allow for complete coverage of large facilities.
Agent Sally will periodically check the HumanID feed from the OU/Texas football game to see if the system has flagged anyone in the crowd as having the look or walk or aura of a terrorist.
6. Activity, Recognition and Monitoring (ARM)
From human activity models, the ARM Program will develop scenario-specific models that will enable operatives to differentiate among normal activities in a given area or situation and activities that should be considered suspicious. The program aims to develop technologies to analyze, model, and understand human movements, individual behaviors in a scene, and crowd behavior, using ... multisensor data... including video, agile sensors, low power radar, infrared, and radio frequency tags.
Well it's new, and they're only going to spend $5 million on this this year. Sounds like Agent Sally can turn on her DiscoFeed and the ARM program will highlight that arms-akimbo Funk dancer, identified by agile sensors his partner has surreptitiously donned, as a terrorist.
7. Next Generation Face Recognition (NGFR)
This wins the award for worst acronym. The program is about the ...
systematic development and evaluation of new approaches to face recognition; maturing of prototype systems at operational sites; experimentation on databases of at least one million individuals; and collection of a large database of facial imagery, which includes the variations in facial imagery found in unstructured environments.
Face Recognition Vendor Tests (FRVTs) were completed in 2000 and 2002, and showed that face recognition technologies sucked in, among other venues, most outdoor environments.
-------------------
The problem that motivated the creation of TIA is real, and acute;
More than ever before, attempts to "connect the dots" quickly overwhelm unassisted human abilities. The potentially important data sets are massive. The patterns sought are sparse, yet they may be anywhere in huge temporal and spatial regions. Frequently, analysts do not know what they are looking for.
But the Pentagon envisions TIA as doing much more than creating tools which can help an agent with future investigations. The Pentagon imagines separate, dedicated teams building entirely new scenarios that would in turn generate the rules and algorithms that all these tools would rely upon to separate out terrorists and their activities from the rest of human kind;
Teams of very experienced analysts and other experts (a red team) would imagine the types of terrorist attack .... They would develop scenarios for these attacks .... which scenarios (models) would be based on historical examples, estimated capabilities, and imagination about how these tactics might be adapted ...
These scenarios, once fully formed, would specify the patterns that all the technologies above would look for across the myriad data sets about people, from videos on street corners to tickets and medical records, that will be regularly collected and combed in the coming years.
The report addresses privacy concerns with statements throughout such as;
Procedures and techniques would be in place to protect the security of sensitive intelligence sources and, where applicable, the anonymity of U.S. persons if access to these types of databases were ever contemplated.
(Note to smart-mobians; the report is high on 'virtual space' technologies;
The analysts would work together using computer tools that allow them to remain with their parent organizations, yet meet in virtual spaces (something like an Internet chat room) to reason about a particular problem and share ideas and information related to the problem
Perhaps they'll blog, too.)
Although the document does not focus on the downsides of all this, beyond concerns about privacy, the authors do acknowledge, in a stunning-to-parse sentence, that;
The major problem in measuring added value in a system/network such as TIA is we seldom know the actual truth of the situation.
The first TIA steps have already been taken with tests of experimental software (including some commercial software) run on a VPN set up on one of the classified DoD operational networks. Guantanamo is providing guinea pigs; tests have already been run against data from detainees from Afghanistan. In what will surely not stand as a prime recommendation of the initiative to-date, the tests have also been;
Assessing various intelligence aspects including weapons of mass destruction in the Iraqi situation.
(Perhaps 'situation' is the term that will replace 'war' for historians of the Iraqi ... situation...)
DoD is most excited about the results of highly-classified tests against 'foreign intelligence data'. One wonders, of course, about how all that simultaneous translation technology is working in those instances...
The problem of false positives, or 'false alarms' as the Pentagon puts it, is acknowledged;
DARPA is faced with a very difficult problem and only through research will DARPA be able to determine whether it is possible to find these sparse pieces of evidence in the vast amount of information about transactions with an accuracy that can be managed successfully in later stages of the analysis.
The phrasing does not focus on the false-positive-ee. Just as the initial flurry to pass the Patriot Act after 9/11 did not focus on how exactly the hundreds of immigrant detainees would be vetted and, while being investigated, treated.
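Why the false-alarm problem is so hard when the evidence is "sparse" comes down to base rates, and a little arithmetic shows it. Every number below is invented for illustration (a round 300 million people, 1,000 real targets, a 99% detection rate, a 0.1% false-alarm rate); none of them appears in the report, and the function name is hypothetical.

```python
def flagged_breakdown(population, true_targets, detection_rate, false_alarm_rate):
    """Return (true_hits, false_alarms, precision) for a screening system.

    precision is the share of flagged people who are real targets.
    """
    true_hits = true_targets * detection_rate
    false_alarms = (population - true_targets) * false_alarm_rate
    flagged = true_hits + false_alarms
    precision = true_hits / flagged if flagged else 0.0
    return true_hits, false_alarms, precision


# Invented figures, chosen to be generous to the system.
hits, alarms, precision = flagged_breakdown(
    population=300_000_000,
    true_targets=1_000,
    detection_rate=0.99,
    false_alarm_rate=0.001,
)
print(f"real targets flagged: {hits:,.0f}")
print(f"innocent people flagged: {alarms:,.0f}")
print(f"share of flags that are real: {precision:.2%}")
```

With these made-up but generous figures, the system flags roughly 300 innocent people for every real target, simply because the innocent population is so much larger than the guilty one.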
A basic tenet of TIA's defense about privacy concerns is that
TIA does not, in and of itself, raise any particular concerns about the accuracy of individually identifiable information. On the contrary, TIA is conceived of as simply a tool for more efficiently inquiring about data in the hands of others, and in theory these inquiries currently could be made by more labor-intensive human efforts.
More specifically, the report argues that privacy will be protected as TIA and the various component technologies that will build it as a system are subject to:
1. Existing privacy laws
2. Demonstration of their efficacy and accuracy. DARPA argues that TIA's search tools won't be deployed until an internal oversight board (composed of senior DoD and Intelligence Community officials) has set policies and procedures for testing TIA's tools and determined that they comply with existing laws and policies and are sufficiently accurate (the 'false-positive' problem, presumably).
3. Built-in operational safeguards. It appears that the primary safeguards will be built into the software architecture itself. These could include automated audit trails; anonymization of sources of data and persons mentioned in the data; selective revelation of data based on permissions; and rigorous access controls and permissioning.
4. Substantial security measures to protect against unauthorized access. These would be at the architecture and at the access policy levels.
5. Pre-deployment legal review per agency contemplating the use of TIA's tools. This review would be codified in a memorandum of agreement between TIA and its user.
6. In-place policies on effective oversight at each agency contemplating use before it is deployed.
7. Pre-deployment review to ensure compliance with the strictures of the Fourth Amendment.
DoD has expressed its commitment to the rule of law in this endeavor and views the protection of privacy and civil liberties as an integral and paramount goal in the development of counterterrorism technologies.
The Report exists because Congress insisted on its publication prior to funding the TIA initiative. While it is better that the Defense Department has taken the time to explicitly and publicly think through the privacy implications of this work, the project is, nonetheless, a classic governmental R&D effort to rapidly and heavily fund a wide range of unproven technologies. While it is compelling to imagine Agent Sally in Wichita touch-screening her way, in a live-time collaborative futures-market war-game with a virtual network of other agents, through beautiful 3D blobs of fresh and completely relevant data to discover that sleeper cell X has just woken up, the probabilities of that happening by virtue of these initiatives don't seem higher than the probability that, for example, some of those identified as sleeper cell members ... aren't.
Perhaps the greatest probability is that Sally or her field office's general counsel will never trust all the data magic sufficiently to use it much in the first place. To add to the acronym soup, GIGO: garbage in, garbage out. End-users tend to discover quickly and viscerally whether a new program really works.
However, big money spent quickly on cutting-edge technologies that will comb through and simultaneously translate gigabytes of personal data in new ways is, to paraphrase DARPA's overall rationale for TIA, riskiest precisely because of the unknown unknowns it will spawn. It is the innocent who can too quickly become confused with the accused, and the automated analysis of corrupt data that will cause absent evidence to become an excuse for incarceration, and worse.