11 September 2001: Add messages from Brandt and Google.
10 September 2001
Date: Mon, 10 Sep 2001 13:24:20 EDT
Subject: Fraudulent Google deletions
CC: firstname.lastname@example.org (Cryptome), Ralphwmcgehee@aol.com
I am e-mailing you on behalf of Ralph McGehee. Mr. McGehee, who lives in Florida, is a retired CIA officer who has written and lectured extensively about CIA misdeeds since 1981.
In recent years, Mr. McGehee says he has been harassed by unknown persons, presumably associated with the U.S. intelligence community.
Recently, all of Mr. McGehee's posts on the Google Groups Usenet archive since May 1998 were deleted. The headers remain, but the body has been replaced with "--".
I called Mr. McGehee this morning to ask him if he requested or authorized Google to do this. He said no, and was not even aware that Google had acquired the Deja archive. Mr. McGehee trusts me because I worked with him to develop software for his database from 1985-1993. I believe him when he says he had nothing to do with this.
I advised Mr. McGehee to contact email@example.com and find out why they were deleted. But I am also following up myself with this e-mail, because I am more Google-aware than he is, and can represent the situation in a manner that is clear to Google employees.
A description of the vanishing McGehee Google posts has been posted on the Cryptome site (http://cryptome.org/mc-gehee.htm), should you wish further information on this.
I'd appreciate a report on what happened regarding Mr. McGehee's posts. Since Mr. McGehee did not request this, we apparently have a clear-cut case of criminal fraud on the part of whoever requested the deletion. The alternative -- that Google made the deletions at the request of "higher authority" -- is something that I don't believe Google would ever do.
PIR founder and president
Public Information Research, PO Box 680635, San Antonio TX
Tel:210-509-3160 Fax:210-509-3161 Nonprofit publisher of NameBase
Date: Mon, 10 Sep 2001 20:26:39 EDT
Subject: Improved Google analysis
OKAY TO POST
I haven't heard from Google yet, but I am now convinced that the stubbed posts from Ralph McGehee are due to a Google anti-spam algorithm.
First of all, there is a difference between posts after May 1998 and those from 1995 to May 1998. That's most likely because Google integrated the oldest portion of the archive several months after the more current portion was made available. The two blocks of data are treated differently. The stubbed posts show up in the more recent block, while the earlier block seems immune.
More interestingly, there is a huge difference when the comparisons are made specific to the sort of group in which the post appears.
An advanced search across all groups for author: ralph mcgehee shows 2,190 hits.
I didn't count them, but it looks like about a third of them are stubbed out with "--".
But if you break it down by specifying which group should be searched, the ratios of stubbed posts to total posts are as follows:
misc.activism.progressive 1 stub out of 191 posts
bit.listserv.cloaks-daggers 20 stubs out of 247 posts
alt.conspiracy* (wildcard accepts any trailing characters in the group name) 99 stubs out of 310 posts
More significantly, all but one of the posts since May 1998 in the alt.conspiracy* count are stubbed. It's almost a clean sweep.
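The per-group figures above can be put on a common scale as percentages. This is a small Python sketch using only the counts quoted in this message; the names `counts` and `stub_rate` are made up for illustration.

```python
# Counts as reported in the message above: (stubbed, total) per group.
counts = {
    "misc.activism.progressive": (1, 191),
    "bit.listserv.cloaks-daggers": (20, 247),
    "alt.conspiracy*": (99, 310),
}

def stub_rate(stubbed, total):
    """Share of stubbed posts as a percentage, rounded to one decimal place."""
    return round(100.0 * stubbed / total, 1)

rates = {group: stub_rate(s, t) for group, (s, t) in counts.items()}

for group, rate in rates.items():
    print(f"{group}: {rate}%")
# misc.activism.progressive: 0.5%
# bit.listserv.cloaks-daggers: 8.1%
# alt.conspiracy*: 31.9%
```

The alt.conspiracy* rate is roughly four times that of bit.listserv.cloaks-daggers and about sixty times that of misc.activism.progressive.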
What does this mean? Here's my speculation:
The misc.activism.progressive and bit.listserv.cloaks-daggers groups are both moderated. The spam is already gone. But as any newsgroup surfer knows, the alt.conspiracy* groups are full of spam, silly comments, advertisements, and flames.
Mr. McGehee has been in the habit of extensively cross-posting his messages. Most appear in three different newsgroups.
I believe that Google is experimenting with anti-spam algorithms. They use vector mathematics to identify posts that are identical or nearly identical. It's all done by software; the 200 employees only look at something manually after they are alerted to a possible problem.
I suspect that Google has identified those specific newsgroups where spam is a particular problem, and for these groups the threshold for flagging spam is lowered.
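To make the speculation concrete, here is a minimal Python sketch of one common form of "vector mathematics" for spotting near-duplicate posts: cosine similarity over word counts, with a lower flagging threshold for a spam-heavy group. Everything in it is an assumption for illustration (the function names, the threshold values, the sample sentences); nothing here is known to be Google's method.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for a post body (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical thresholds: a lower value means a post is flagged more easily.
THRESHOLDS = {"alt.conspiracy": 0.8}
DEFAULT_THRESHOLD = 0.95

def is_flagged(post, earlier_posts, group):
    """Flag a post if it is close enough to any earlier post, given the group's threshold."""
    threshold = THRESHOLDS.get(group, DEFAULT_THRESHOLD)
    vec = term_vector(post)
    return any(cosine_similarity(vec, term_vector(p)) >= threshold
               for p in earlier_posts)

first = "the cia ran a covert operation in 1965"
second = "the cia ran a covert operation in 1966"
print(is_flagged(second, [first], "alt.conspiracy"))             # True
print(is_flagged(second, [first], "misc.activism.progressive"))  # False
```

The two sample sentences share seven of eight words, giving a similarity of 0.875: above the lowered threshold, below the default one.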
Basically, Google is only guilty of an imperfect algorithm, in my opinion. I think that Mr. McGehee was inadvertently snared by this algorithm. Google needs to improve it, because Mr. McGehee isn't the only person with something to say who used cross-posting. It's a tricky programming problem.
While you can find nearly every message Mr. McGehee has ever posted on at least one newsgroup, this isn't good enough. If you're following a thread in alt.conspiracy*, you should have access to what he posted, the same as if you're following it on misc.activism.progressive, without doing any fancy searching.
Google is experimenting with some very difficult filtering problems. Certainly we all benefit from the fact that they keep as much spam out of the archive as they (and Deja before them) have managed to do.
I'll let you know if and when I hear from Google. They have only said that they'll look into it, and they may not be eager to admit that their filtering has unintended consequences.
The bottom line is that I don't believe anything sinister is going on. I have a fair amount of experience with Google behavior on the Web, and they use many of the same algorithms on Google Groups. It's not unusual to see your results change due to Google's latest experimental tweak of this or that algorithm. It was good that this issue came up, because it's something that Google should address. But that's a lot different from suggesting a conspiracy is at work.
If they fix this and restore Mr. McGehee's posts, I'll be satisfied -- even if they decide not to reply to my e-mail.
-- Daniel Brandt
Date: Mon, 10 Sep 2001 21:18:55 EDT
Subject: Reply from Google
To: firstname.lastname@example.org, Ralphwmcgehee@aol.com
I received a reply from Google. I don't know enough about MIME headers to elaborate, or even to challenge their analysis. Perhaps someone out there can help. Here is their reply:
In a message dated 9/10/01 7:52:20 PM Central Daylight Time, email@example.com writes:
> Thank you for your note. It appears that Mr. McGehee is posting mime
> multipart documents. We only index the text/plain part of multiparts.
> In Mr. McGehee's case, depending on his settings when he posted,
> the text/plain part is often blank.
> The Google Team
Date: Mon, 10 Sep 2001 22:39:32 EDT
Subject: More info on Google
To: firstname.lastname@example.org, Ralphwmcgehee@aol.com
I asked someone who knows more than I do. She said:
Google's explanation makes sense, but why they don't strip off the crap and just post 7-bit text is a mystery. I guess their algorithm is to strip off the MIME and leave whatever is left. In the cases in question, that is nothing but the tearline delimiter "--".
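The behavior Google describes (keep only the text/plain part of a multipart post) can be sketched with Python's standard `email` package. The sample message, its address, the boundary string, and the function name `indexed_text` are all made up for illustration, and the plain part holding only "--" is the guess stated above, not something taken from a real post. This is not Google's code.

```python
from email import message_from_string

# Made-up multipart post: the text/plain part holds only "--",
# while the real article text sits in the text/html part.
RAW = """\
From: poster@example.invalid
Newsgroups: alt.conspiracy
Subject: Example
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="us-ascii"

--
--BOUNDARY
Content-Type: text/html; charset="us-ascii"

<html><body><p>The actual article text.</p></body></html>
--BOUNDARY--
"""

def indexed_text(raw):
    """Return only the text/plain content of a message, ignoring every other part."""
    msg = message_from_string(raw)
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes for a leaf part
            charset = part.get_content_charset() or "us-ascii"
            chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks).strip()

print(repr(indexed_text(RAW)))  # '--'
```

An indexer built this way never sees the HTML part, so the archived body is just the stub.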
Rich Winkel does that in misc.activism.progressive -- in fact, he only sends out 7-bit text, which is why it's a pain to post anything with accent marks to his list. (He uses the old listserv program.)
That's why only 1 post in his group got chopped.
How does McGehee do it? He is probably using Netscape mail, or some other mailer like NS, which lets you choose one of three options:
1. "send HTML mail ONLY" (McGehee's choice)
2. "send HTML and plain text" -- they caution this makes it bigger
3. "send plain text only"
For newsgroups and mailing lists, it makes sense to choose option 3. Most people choose option 2 for anything they send. Looks like McGehee chose Option 1.
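The three sending options correspond to three different MIME structures. The sketch below builds one message per option with Python's `email.message.EmailMessage` and lists the text/plain parts a plain-text-only indexer would find. The option numbering follows the paragraph above; the function names and sample bodies are assumptions for illustration, and actual Netscape output may differ in detail.

```python
from email.message import EmailMessage

TEXT = "Plain-text version of the article.\n"
HTML = "<html><body><p>HTML version of the article.</p></body></html>\n"

def build(option):
    """Build a sample message for sending option 1, 2, or 3 (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "Example"
    if option == 1:      # HTML only
        msg.set_content(HTML, subtype="html")
    elif option == 2:    # HTML and plain text, as multipart/alternative
        msg.set_content(TEXT)
        msg.add_alternative(HTML, subtype="html")
    elif option == 3:    # plain text only
        msg.set_content(TEXT)
    else:
        raise ValueError("option must be 1, 2, or 3")
    return msg

def plain_text_parts(msg):
    """Collect the content of every text/plain part in the message."""
    return [part.get_content() for part in msg.walk()
            if part.get_content_type() == "text/plain"]

for option in (1, 2, 3):
    msg = build(option)
    print(option, msg.get_content_type(), len(plain_text_parts(msg)))
```

In this sketch only option 1 leaves a text/plain-only indexer with nothing to store.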