26 May 2005

The New York Times, May 24, 2005

I.B.M. Software Aims to Provide Security Without Sacrificing Privacy


International Business Machines is introducing software today that is intended to let companies share and compare information with other companies or government agencies without identifying the people connected to it.

Security specialists familiar with the technology say that, if truly effective, it could help tackle many security and privacy problems in handling personal information in fields like health care, financial services and national security.

"There is real promise here," said Fred H. Cate, director of the Center for Applied Cybersecurity Research at Indiana University. "But we'll have to see how well it works in all kinds of settings."

The technology for anonymous data-matching has been under development by S.R.D. (Systems Research and Development), a start-up company that I.B.M. acquired this year.

Much of the company's early financial backing came from In-Q-Tel, a venture capital firm financed by the Central Intelligence Agency that invests in companies whose technologies have government security uses.

S.R.D., now I.B.M.'s Entity Analytics unit, has worked for years on specialized software for quickly detecting relationships within vast storehouses of data. Its early market was in Las Vegas, where casinos used the company's technology to help prevent fraud or employee theft. The matching software might sift through databases of known felons, for example, to find any links to casino employees.

By the late 1990's, United States intelligence agencies had discovered S.R.D. and the potential to use its technology for winnowing leads in pursuing terrorists or spies. After 9/11, the government's interest increased, and today most of the company's business comes from government contracts.

The new product goes beyond finding relationships in different sets of data. The software, which I.B.M. calls DB2 Anonymous Resolution, enables companies or government agencies to share personal information on customers or citizens without identifying them.

For example, say the government is looking for suspected terrorists on cruise ships. It has a "watch list," but does not want to give that list to a cruise line for fear it might leak out. Similarly, the cruise lines do not want to hand over their entire customer lists to the government, out of privacy concerns.

The I.B.M. software would convert data on a person into a string of seemingly random characters, using a technique known as a one-way hash function. No names, addresses or Social Security numbers, for example, would be embedded within the character string.

The strings would be fed through a program to detect a matching pattern of characters. In the case of the cruise line and the government, an alert would be sent to both sides that a match had been detected.

"But what you get is a message that there is a match on record Number 678 or whatever, and then the government can ask the cruise line for that specific record, not a whole passenger list," explained Jeff Jonas, the founder of S.R.D. and now chief scientist of I.B.M.'s Entity Analytics unit. "What you get is discovery without disclosure."
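The matching scheme the article describes can be sketched in a few lines. This is an illustrative toy, not IBM's actual DB2 Anonymous Resolution code: each party normalizes its identity records and reduces them to one-way hashes, and only the opaque hash strings are compared, so neither side sees the other's raw data. The record contents and IDs below are hypothetical.

```python
import hashlib

def anonymize(record):
    """Normalize a record's fields and reduce it to a one-way hash string."""
    normalized = "|".join(field.strip().lower() for field in record)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Government "watch list", keyed by hash, valued by an internal record ID
watch_list = {anonymize(("john doe", "1970-01-01")): "watch-001"}

# Cruise line passenger manifest, hashed the same way on the other side
passengers = {anonymize(("John Doe ", "1970-01-01")): "record-678"}

# The matching step sees only opaque hashes, never names or birth dates;
# a match reveals record IDs, which the parties can then follow up on.
matches = [(watch_list[h], passengers[h]) for h in watch_list if h in passengers]
print(matches)  # -> [('watch-001', 'record-678')]
```

Note that a plain hash of low-entropy data (names, birth dates) is vulnerable to dictionary attacks by anyone who can enumerate candidate identities; production systems of this kind typically add secret keys or salts before hashing.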

To date, the software for anonymously sharing and matching data has been tested in a few projects, but I.B.M. is aiming for day-to-day use in several industries.

In health care, for example, more secure and anonymous handling of patient information could alleviate privacy concerns in the shift to electronic health records, potentially increasing efficiency and reducing costs, analysts said.

The technology, specialists noted, could also reduce the risk of identity theft, especially if personal data held by companies were made anonymous.

From: Linda Casals <lindac@dimacs.rutgers.edu>
Subject: [Sy-nextgen-global] DIMACS Short Course: Statistical De-identification of Confidential Health Data with Application to the HIPAA Privacy Regulations
Date: Wed, 25 May 2005 09:47:43 -0400 (EDT)


DIMACS Short Course: Statistical De-identification of Confidential
Health Data with Application to the HIPAA Privacy Regulations

October 18 - 20, 2005
DIMACS Center, CoRE Building, Rutgers University


Larry Cox, ljtcox at aol.com 
Daniel Barth-Jones, Wayne State University, dbjones@med.wayne.edu

Presented under the auspices of the Special Focus on Communication
Security and Information Privacy and Special Focus on Computational 
and Mathematical Epidemiology.

Workshop Announcement:

This DIMACS short course will provide researchers, analysts and
managers with an overview of the federal HIPAA Privacy regulations and
an introduction to the principles and methods of statistical
disclosure limitation that can be used to statistically de-identify
healthcare data to meet privacy regulations.  


The Health Insurance Portability and Accountability Act of 1996
(HIPAA) established the Standards for the Privacy of Individually
Identifiable Health Information (i.e., HIPAA Privacy Rule), which
provides privacy protections for the protected health information (PHI)
of individuals. These federal regulations became effective April 14,
2003 and have far-reaching implications for many important uses of
healthcare information.

Prior to the implementation of the privacy rule, epidemiologic,
healthcare systems and other types of biomedical research had been
routinely conducted with administrative healthcare data, with such
analyses demonstrating considerable utility and value. The recent
implementation of the HIPAA privacy standards, however, has
necessitated dramatic changes in the process of conducting many
analyses with administrative data. The privacy rule "safe-harbor"
provision requires the removal of 18 types of identifying information
before the resulting "de-identified" data can be used without
restriction. This safe-harbor approach necessitates the removal of
specific dates of patient care and lower level geographic information
(such as 5-digit ZIP codes), which can greatly diminish the utility of
such data for many analytic purposes. An alternative approach
permitted under the privacy rule is the "statistical
de-identification" of PHI certified by an expert
statistician. Conducting analyses with statistically de-identified
healthcare data is an attractive option because such data can be used
without privacy rule restrictions.
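The safe-harbor generalizations described above (removing direct identifiers, coarsening dates to year, truncating geography) can be illustrated with a minimal sketch. The field names and record below are hypothetical, not from any specific system, and this omits most of the 18 identifier types the rule lists:

```python
def safe_harbor(record):
    """Apply safe-harbor-style generalization to one patient record."""
    out = dict(record)
    out.pop("name", None)                         # direct identifiers removed outright
    out["admit_date"] = record["admit_date"][:4]  # dates coarsened to year only
    out["zip"] = record["zip"][:3] + "00"         # only a 3-digit ZIP prefix retained
    return out

patient = {"name": "Jane Roe", "admit_date": "2004-07-15", "zip": "08901"}
print(safe_harbor(patient))  # -> {'admit_date': '2004', 'zip': '08900'}
```

The example makes the utility cost concrete: after generalization, analyses that needed admission dates or local geography can no longer be run on the de-identified data, which is what motivates the statistical alternative.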

In order for data to be considered statistically de-identified,
"statistical disclosure" analyses must be conducted and documented
which determine that the re-identification risks for the data are
"very small". The principles and methods of statistical disclosure
analysis and disclosure limitation address the risk that persons might
be identifiable from information about them in data sets and provide a
variety of methods by which risks of disclosure can be measured and
reduced to acceptably low levels.  
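One simple instance of the disclosure-risk measurement described above is a uniqueness heuristic: count the fraction of records that are unique on their quasi-identifying fields, since such records are the easiest to re-identify. This is only a sketch of the idea; actual expert certifications use much richer risk models, and the data here is invented.

```python
from collections import Counter

def uniqueness_risk(records, quasi_ids):
    """Fraction of records unique on the given quasi-identifier fields."""
    keys = [tuple(r[q] for q in quasi_ids) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

data = [
    {"age": 34, "zip": "089", "sex": "F"},
    {"age": 34, "zip": "089", "sex": "F"},
    {"age": 71, "zip": "089", "sex": "M"},  # unique combination -> at risk
]
print(uniqueness_risk(data, ["age", "zip", "sex"]))  # -> 0.333...
```

Disclosure limitation methods (generalization, suppression, perturbation) then aim to drive this kind of risk measure down while preserving as much analytic utility as possible.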

Course Objectives

This two-and-a-half day short course will provide participants with a
detailed overview of the HIPAA privacy regulations, theory and methods
for statistical disclosure limitation, and applied experience with
disclosure limitation methods. Participants completing the course
should be able to: 1) understand the permissible uses of healthcare
data for various purposes under the HIPAA regulations; 2)
conceptualize and document data intrusion scenarios; 3) conduct and
document statistical disclosure analyses measuring disclosure risks;
4) select and use appropriate disclosure limitation methods; 5)
evaluate the associated trade-offs between disclosure risks and
statistical information quality. Development of these skills should
enable participants to supervise and work successfully with an expert
certifying statistician.

Participants will learn about statistical disclosure for both tabular
data sets and microdata files, but the primary focus will be on
statistical disclosure for microdata in healthcare databases. While
statistical disclosure theory will be covered in some detail, the
course orientation will be practical and applied, focusing primarily
on providing participants with the knowledge and experience needed to
statistically de-identify healthcare datasets in accordance with the
HIPAA privacy rule and to identify confidentiality problems of
potential concern. Upon completion of the course, it is expected that
participants would be able to implement or supervise the
implementation of basic disclosure limitation analyses and methods on
their own and would be prepared to undertake further learning in
statistical disclosure on their own.

Participants will be provided with lecture slides, classroom notes,
and simulated example datasets. The course will include hands-on
computer-based instruction in conducting disclosure analyses and
implementing disclosure control methods.

Who Should Attend

Researchers (epidemiologists, biostatisticians, medical informatics
and health systems scientists, etc.), analytic professionals (from
business, marketing, pharmaceutical industry, etc.) and the managers
who supervise staff in these fields will benefit from this short
course. Technical and management personnel in the pharmaceutical and
healthcare information industries will find the course particularly
useful. Participants should have some prior background in mathematics,
statistics, and data/information management. Knowledge of SAS
statistical software will be desirable for the in-class computer
instruction, but participants with experience in other statistical
packages (SPSS, etc.) should also be able to complete the computer
instruction portions of the class.


Seating is limited to the first 40 participants. This course must be
prepaid in advance by check or credit card in order to hold your
place. We cannot guarantee a place for you unless we have received
your payment.

Note: Our usual policies for fee waivers and reductions do not apply
to this course. However, limited financial support might be
available. Please see the website for additional registration
information.

