The Biggest Database in the World

James Bamford has a superb review in the New York Review of Books this month of Matthew Aid’s new book about the US National Security Agency (NSA). What seems to be causing a stir around the intelligence research (and computing) community is the reference to a report by the MITRE Corporation into the information needs of the NSA in relation to the new central NSA data repository being constructed in the deserts of Utah. The report, which is rather speculative, says that IF the trend for increasing numbers of sensors collecting all kinds of information continues, then the kind of storage capacity required would be in the range of yottabytes by 2015 – as the CrunchGear blog points out, there are “a thousand gigabytes in a terabyte, a thousand terabytes in a petabyte, a thousand petabytes in an exabyte, a thousand exabytes in a zettabyte, and a thousand zettabytes in a yottabyte. In other words, a yottabyte is 1,000,000,000,000,000GB.” However, CrunchGear misses the ‘ifs’ in the report, as some of the comments on the story point out.

There is no doubt, however, that the NSA has some technical capabilities that are way beyond what the ordinary commercial market currently provides, and it is probably useless to speculate just how far beyond. Perhaps more important in any case are the technologies and techniques required to sort such a huge amount of information into usable data and to create meaningful categories and profiles from it – that is where the cutting edge is. The size of storage units is not really even that interesting…

The other interesting thing here is the hint of competition within US intelligence that never seems to stop: just a few months back, the FBI was revealed to have its own Investigative Data Warehouse (IDW) plan. Data warehouses or repositories seem to be the current fashion in intelligence: whilst the whole rest of the world moves more towards ‘cloud computing’ and more open systems, they collect it all and lock it down.
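For readers who want to sanity-check the CrunchGear arithmetic, here is a minimal sketch of the ladder of decimal byte units it quotes (each unit is 1,000 of the one below, so a yottabyte works out to 10^15 gigabytes):

```python
# Illustrative arithmetic only: the ladder of SI byte units from the
# CrunchGear quote, each step multiplying by 1,000.
UNITS = ["gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]

def gigabytes_in(unit: str) -> int:
    """Number of gigabytes in the given unit, using decimal steps of 1,000."""
    steps = UNITS.index(unit)  # how many rungs above 'gigabyte'
    return 1000 ** steps

print(gigabytes_in("yottabyte"))  # 1000000000000000, i.e. 10^15 GB
```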

FBI data warehouse revealed by EFF

Tenacious FoI and ‘institutional discovery’ work both in and out of the US courts by the Electronic Frontier Foundation has resulted in the FBI releasing lots of information about its enormous dataveillance program, based around the Investigative Data Warehouse (IDW). 

The clear and comprehensible report is available from EFF here, but the basic messages are that:

  • the FBI now has a data warehouse with over a billion unique documents – seven times as many as are contained in the Library of Congress;
  • it is using content management and datamining software to connect, cross-reference and analyse data from over fifty previously separate datasets included in the warehouse. These include, by the way, the entire US-VISIT database, the No-Fly list and other controversial post-9/11 systems;
  • the IDW will be used for both link and pattern analysis using technology connected to the Foreign Terrorist Tracking Task Force (FTTTF) program – in other words, Knowledge Discovery in Databases (KDD) software, which, by connecting people, groups and places, will generate entirely ‘new’ data and project links forward in time as predictions.
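
The core mechanism behind that last point can be sketched very simply. This is not the FBI’s actual software – all names and attributes below are hypothetical – but it shows how link analysis generates ‘new’ data: two people who share an attribute (a phone number, an address) become a person-to-person link that no analyst ever entered into the system:

```python
# A toy sketch of attribute-based link analysis (all data hypothetical).
from collections import defaultdict
from itertools import combinations

# (person, attribute) records, as might be extracted from separate datasets.
records = [
    ("person_a", "phone_1"), ("person_b", "phone_1"),  # shared phone number
    ("person_b", "addr_9"),  ("person_c", "addr_9"),   # shared address
    ("person_d", "phone_7"),                           # no overlap with anyone
]

# Index people by the attributes they share.
by_attribute = defaultdict(set)
for person, attribute in records:
    by_attribute[attribute].add(person)

# Any two people sharing an attribute become an inferred 'link' -
# entirely 'new' data that exists in none of the source records.
inferred_links = set()
for people in by_attribute.values():
    for pair in combinations(sorted(people), 2):
        inferred_links.add(pair)

print(sorted(inferred_links))
# [('person_a', 'person_b'), ('person_b', 'person_c')]
```

Note that person_a and person_c are now only one hop apart through person_b – chaining such links is exactly how the predictive, ‘pre-crime’ projections get made.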

EFF conclude that datamining is the future for the IDW. This is true, but I would also say that it was the past and is the present too. Datamining is not new for the US intelligence services; indeed, many of the techniques we now call datamining were developed by the National Security Agency (NSA). There would be no point in the FBI just warehousing vast numbers of documents without techniques for analysing and connecting them. KDD may well be more recent for the FBI, and this Phildickian ‘pre-crime’ is most certainly the future in more ways than one…

There is a lot that interests me here (and indeed, I am currently trying to write a piece about the socio-technical history of these massive intelligence data analysis systems), but one issue is whether this complex operation will ‘work’ or whether it will throw up so many random and worthless ‘connections’ (the ‘six degrees of Kevin Bacon’ syndrome) that it will actually slow down or damage actual investigations into real criminal activities. That all depends on the architecture of the system, and that is something we know little about, although there are a few hints in the EFF report…
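The ‘six degrees’ worry has a simple combinatorial basis, which a back-of-the-envelope sketch makes clear: one shared attribute held by n people generates n(n-1)/2 pairwise links, so common attributes (everyone who flew on the same airline, say – the figures here are purely illustrative) swamp the system with noise:

```python
# Hedged illustration: how many pairwise 'links' one shared attribute creates.
def links_from_shared_attribute(n_people: int) -> int:
    """Pairwise links among n people who share one attribute: n choose 2."""
    return n_people * (n_people - 1) // 2

# A rare attribute vs a common one (numbers purely hypothetical):
print(links_from_shared_attribute(3))       # 3 links - maybe meaningful
print(links_from_shared_attribute(10_000))  # 49995000 links - almost all worthless
```

Quadratic growth like this is why architecture matters: without aggressive weighting or filtering of common attributes, the ‘connections’ multiply far faster than any team of analysts could ever investigate.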

(thanks to Rosamunde van Brakel for the link)