9 May 2004. Thanks to J.
Translation by Babelfish.
Le Monde, May 7, 2004
A redacted document recently released by the White House has been deciphered. The method could be applied to a good number of declassified documents.
"I was 'sitting bored' in front of the television over the Easter weekend when the CIA memo to George Bush was published," recalls David Naccache, a specialist in data coding for the French company Gemplus.
"I immediately telephoned Claire Whelan, a student at Dublin City University whose thesis I am supervising, to propose an attack on the redacted passages."
Mission accomplished, or almost. The "memo" in question, sent by the CIA to President Bush on August 6, 2001, and entitled "Bin Ladin Determined to Strike in US," was declassified by the White House. The aim was to show that the intelligence services' warnings were not precise enough to allow the president to prevent the attacks of September 11.
But five passages specifying the sources of the collected information had been redacted. For the cryptologist David Naccache, these illegible fragments were like a red rag to a bull.
The result of his efforts, "conducted in private" to keep his employer out of it, was presented on Tuesday, May 4, at the Eurocrypt 2004 conference, which ran through May 6 in Interlaken, Switzerland.
"The demonstration made a strong impression," according to Jean-Jacques Quisquater (University of Louvain-la-Neuve), a specialist in the field, who applauded this example of "reverse engineering of a censored document."
David Naccache and his student did indeed succeed in uncovering censored words. For one of them, the term "Egyptian" appears to be the only possibility.
They want to refine their method before giving a verdict on a longer passage, so as not to discredit it. And they readily admit that the technique has worked only on an isolated word, which is not adequate proof.
At first sight, the technique is nothing revolutionary. The two researchers first measured the tilt of the text, which had been skewed when the document was digitized: an angle of 0.52°.
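To give a sense of scale for a tilt of 0.52°, the following minimal Python sketch computes how far a line of text drifts vertically across a scanned page, then rotates a point back to the horizontal. The page width of 1,700 pixels and the function names are assumptions made for illustration; the article does not describe the researchers' actual code.

```python
import math

def vertical_drift(width_px, angle_deg):
    """Vertical offset accumulated across width_px by a line tilted angle_deg."""
    return width_px * math.tan(math.radians(angle_deg))

def rotate(x, y, angle_deg):
    """Rotate the point (x, y) about the origin by angle_deg (counterclockwise)."""
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

SKEW_DEG = 0.52        # tilt reported in the article
PAGE_WIDTH_PX = 1700   # assumed scan width (8.5 inches at 200 dpi)

drift = vertical_drift(PAGE_WIDTH_PX, SKEW_DEG)
print(f"drift across the page: {drift:.1f} px")

# A point at the far end of the tilted baseline returns to y = 0
# once the page is rotated by the opposite angle.
x, y = rotate(PAGE_WIDTH_PX, drift, -SKEW_DEG)
print(f"deskewed baseline endpoint: y = {abs(y):.6f}")
```

Under these assumptions the drift is about 15 pixels from one edge of the page to the other, enough to throw off any measurement of word widths made along a horizontal line, which is presumably why the tilt had to be corrected first.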
They then used character recognition software to determine the width of the text, set in the Arial font, which gives the number of letters per unit of length. A simple search of an English dictionary then produced a list of possible words.
"1,530 words matched," David Naccache said. But the article "an" preceding the mystery word implied that it necessarily started with a vowel, which made it possible to reduce the list to 346 words. In French, the clue provided by articles such as "un" or "une" would have narrowed the search in the same way.
The selection was also helped by the fact that the typeface is "proportional," meaning that the width of the letters varies. The space occupied by an "I" differs from that taken by a "W," which gives additional clues compared with so-called "monospace" type, like that often used in e-mail, where all the letters have the same width.
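The difference between proportional and monospace type can be illustrated with a short Python sketch. The width table below is an assumption: illustrative advance widths, in thousandths of an em, for a Helvetica-like sans serif, not values measured from the memo or from Arial itself. The monospace advance of 600 is likewise assumed, and capitals are treated as lowercase to keep the table short.

```python
# Illustrative advance widths (1/1000 em) for a Helvetica-like sans serif.
# Assumed values for this sketch only, not measurements of Arial.
PROPORTIONAL = {
    "a": 556, "b": 556, "c": 500, "d": 556, "e": 556, "f": 278, "g": 556,
    "h": 556, "i": 222, "j": 222, "k": 500, "l": 222, "m": 833, "n": 556,
    "o": 556, "p": 556, "q": 556, "r": 333, "s": 500, "t": 278, "u": 556,
    "v": 500, "w": 722, "x": 500, "y": 500, "z": 500,
}
MONOSPACE_ADVANCE = 600  # assumed: every glyph occupies the same width

def proportional_width(word):
    """Sum of per-letter widths; capitals are treated as lowercase here."""
    return sum(PROPORTIONAL[ch] for ch in word.lower())

def monospace_width(word):
    """In monospace type the width depends only on the number of letters."""
    return MONOSPACE_ADVANCE * len(word)

for word in ("illicit", "mammoth"):
    print(word, proportional_width(word), monospace_width(word))
```

Both example words have seven letters, so in monospace type they occupy the same width and a blacked-out box reveals only the letter count. With the assumed proportional table they come out at 1888 and 4445 units, so the width of the box alone tells them apart.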
"Among the remaining words, five or six made sense, but only Egyptian corresponded to the context," the cryptologist said.
This last stage depends more on human intelligence than on the geometry of the text. To choose among Ukrainian, uninvited, unofficial, incursive, Egyptian, indebted and Ugandan, the two researchers relied on their own judgment. Uganda and Ukraine seemed too far from the theater of operations to be retained, for example.
Admittedly, the analysis of the CIA "memo" does not reveal the true "secret source," David Naccache acknowledges. But the method makes the search systematic. In another "memo," it revealed that civilian helicopters converted to military use by the Iraqis had been bought in South Korea. And nothing prevents automated application of this technique to all declassified documents, in which it could reveal "isolated words, even groups of two or three words," in the opinion of the critics.
The New York Times, May 10, 2004
By JOHN MARKOFF
European researchers at a security conference in Switzerland last week demonstrated computer-based techniques that can identify blacked-out words and phrases in confidential documents.
The researchers showed their software at the conference, the Eurocrypt, by analyzing a presidential briefing memorandum released in April to the commission investigating the Sept. 11 attacks. After analyzing the document, they said they had high confidence the word "Egyptian" had been blacked out in a passage describing the source of an intelligence report stating that Osama Bin Ladin was planning an attack in the United States.
The researchers, David Naccache, the director of an information security lab for Gemplus S.A., a Luxembourg-based maker of banking and security cards, and Claire Whelan, a computer science graduate student at Dublin City University in Ireland, also applied the technique to a confidential Defense Department memorandum on Iraqi military use of Hughes helicopters.
They said that although the name of a country had been blacked out in that memorandum, their software showed that it was highly likely the document named South Korea as having helped the Iraqis.
The challenge of identifying blacked-out words came to Mr. Naccache as he watched television news on Easter weekend, he said in a telephone interview last Friday.
"The pictures of the blacked-out words appeared on my screen, and it piqued my interest as a cryptographer," he said. He then discussed possible solutions to the problem with Ms. Whelan, whom he is supervising as a graduate adviser, and she quickly designed a series of software programs to use in analyzing the documents.
Although Mr. Naccache directs an information security laboratory at Gemplus, he said that the research was done independently of his work there.
The technique he and Ms. Whelan developed involves first using a program to realign the document, which had been placed on a copying machine at a slight angle. They determined that the document had been tilted by about half a degree.
By realigning the document it was possible to use another program Ms. Whelan had written to determine that it had been formatted in the Arial font. Next, they found the number of pixels that had been blacked out in the sentence: "An Egyptian Islamic Jihad (EIJ) operative told an xxxxxxxx service at the same time that Bin Ladin was planning to exploit the operative's access to the US to mount a terrorist strike." They then used a computer to determine the pixel length of words in the dictionary when written in the Arial font.
The program rejected all of the words that were not within three pixels of the length of the word that was probably under the blacked-out area in the document.
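The rejection step can be sketched in a few lines of Python. The pixel widths and the measured gap of 62 pixels below are invented for illustration and are not measurements from the memorandum; the function name `within_tolerance` is likewise hypothetical. Only the three-pixel tolerance comes from the article.

```python
def within_tolerance(widths_px, gap_px, tolerance_px=3):
    """Keep words whose rendered width is within tolerance_px of the gap."""
    return sorted(
        word for word, width in widths_px.items()
        if abs(width - gap_px) <= tolerance_px
    )

# Hypothetical rendered widths in pixels (assumed, not measured).
widths_px = {
    "Egyptian": 62,
    "Ugandan": 60,
    "Israeli": 44,
    "Pakistani": 66,
    "unofficial": 64,
}

GAP_PX = 62  # assumed width of the blacked-out area
print(within_tolerance(widths_px, GAP_PX))
```

With these made-up numbers, "Israeli" and "Pakistani" fall outside the three-pixel window and are dropped, while the other three words survive to the next stage.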
The software then reduced the number of possible words to just 7 from 1,530 by using semantic guidelines, including the grammatical context. The researchers selected the word "Egyptian" from the seven possible words, rejecting "Ukrainian" and "Ugandan," because those countries would be less likely to have such information.
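One grammatical cue visible in the quoted sentence is the article "an" before the blacked-out word. A minimal Python sketch of such a filter follows; it is an assumption about how the cue might be applied, not a description of the researchers' software, and the candidate list is made up for illustration.

```python
VOWELS = set("aeiou")

def fits_article(word, article):
    """Spelling-based check: 'an' expects a vowel letter, 'a' a consonant letter."""
    starts_with_vowel = word[0].lower() in VOWELS
    return starts_with_vowel if article == "an" else not starts_with_vowel

# Made-up candidate list for illustration, not the researchers' 1,530 words.
candidates = ["Egyptian", "Pakistani", "Ukrainian", "British", "indebted"]
kept = [w for w in candidates if fits_article(w, "an")]
print(kept)
```

The check looks at spelling rather than pronunciation, so a word such as "Ukrainian" passes it even though it begins with a consonant sound.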
After the presentation at Eurocrypt, the researchers discussed possible measures that government agencies could take to make identifying blacked-out words more difficult, Mr. Naccache said in the phone interview. One possibility, he said, would be for agencies to use optical character recognition technology to rescan documents and alter fonts.
In January, the State Department required that its documents use a more modern font, Times New Roman, instead of Courier, Mr. Naccache said. Because Courier is a monospace font, in which all letters are of the same width, it is harder to decipher with the computer technique. There is no indication that the State Department knew that.
Experts on the Freedom of Information Act said they feared the computer technique might be used as an excuse by government agencies to release even more restricted versions of documents.
"They have exposed a technique that may now become less and less useful as a result," said Steven Aftergood, a senior research analyst at the Federation of American Scientists, of the research project. "We care because there are all kinds of things withheld by government agencies improperly."