15 November 1999. Thanks to JA.
Source:
http://www.patents.ibm.com/details?&pn=US05937422__
See 11 TIF images of patent with text and diagrams:
http://cryptome.org/nsa-vox-pat.zip
(319K)
See related news report:
http://www.independent.co.uk/news/Digital/Features/spies151199.shtml
Nelson; Douglas J. , Columbia, MD
Schone; Patrick John , Elkridge, MD
Bates; Richard Michael , Greenbelt, MD
Applicant(s):
The United States of America as represented by the National Security Agency,
Washington, DC
Issued/Filed Dates:
Aug. 10, 1999 / April 15, 1997
Application Number:
US1997000834263
IPC Class:
G06F 017/30;
Class:
707/531; 707/004; 707/532; 707/535; 707/512;
Field of Search:
704/010 707/512,532,535,531,3-5,7
Abstract:
A method of automatically generating a topical description of text by receiving
the text containing input words; stemming each input word to its root form;
assigning a user-definable part-of-speech score to each input word; assigning
a language salience score to each input word; assigning an input-word score
to each input word; creating a tree structure under each input word, where
each tree structure contains the definition of the corresponding input word;
assigning a definition-word score to each definition word; collapsing each
tree structure to a corresponding tree-word list; assigning a tree-word-list
score to each entry in each tree-word list; combining the tree-word lists
into a final word list; assigning each word in the final word list a
final-word-list score; and choosing the top N scoring words in the final
word list as the topic description of the input text. Document searching
and sorting may be accomplished by performing the method described above
on each document in a database and then comparing the similarity of the resulting
topical descriptions.
Attorney, Agent, or Firm:
Morelli; Robert D.;
Primary/Assistant Examiners:
Amsbury; Wayne; Channavajjala; Srirama
U.S. References:
(No patents reference this one)
Patent | Issued | Inventor(s) | Title |
---|---|---|---|
US4965763 | 10 /1990 | Computer method for automatic extraction of commonly specified information from business correspondence | |
US5371673 | 12 /1994 | Information processing analysis system for sorting and scoring text | |
US5384703 | 1 /1995 | Method and apparatus for summarizing documents according to theme | |
US5434962 | 7 /1995 | Method and system for automatically generating logical structures of electronic documents | |
US5619410 | 4 /1997 | Keyword extraction apparatus for Japanese texts | |
US5845278 | 12 /1998 | Method for automatically selecting collections to search in full text searches | |
US5873660 | 2 /1999 | Morphological search and replace |
CLAIMS:
What is claimed is:
1. A method of automatically generating
a topical description of text, comprising the steps of:
2. The method of claim
1, wherein said step of receiving the text, is comprised of the step
of receiving text wherein said text is selected from the group consisting
of speech-based text, optical-character-read text, stop-word-filtered text,
stutter-phrase-filtered text, and lexical-collocation-filtered text.
17. The method of claim
1, further comprising the steps of:
18. The method of claim
2, wherein said step of assigning a language salience score Si
to each input word is comprised of the step of determining the language salience
score for each input word from the frequency count fi of each
word in a large corpus of text as follows:
31. The method of claim
30, further comprising the steps of:
none
(No patents reference this one)
See 11 TIF images of patent with text and diagrams:
3. The method of claim
1, wherein said step of assigning a language salience score Si
to each input word is comprised of the step of determining the language salience
score for each input word from the frequency count fi of each
word in a large corpus of text as follows:
Si =0, if fi
>fmax ;
Si =log (fmax /(fi
-T2 +T)), if T2
Si =log (fmax /T), if
T
Si =.epsilon.+((fi
/T)(log(fmax /T)-.epsilon.)), if fi<=T,
where .epsilon. and T are user-definable values, and where fmax
represents a point where the sum of frequencies of occurrence above the point
equals the sum of frequencies of occurrence below the point.
4. The method of claim
3, wherein said step of assigning a language salience score Si
to each input word further comprises the step of allowing the user to over-ride
the language salience score for a particular word with a user-definable language
salience score.
5. The method of claim
1, wherein said step of assigning an input-word score to each input word
is comprised of the step of assigning an input-word score where said input-word
score is selecting from the group consisting of mSi
ßi and (Si m)ßi, where m is
the number of times the corresponding input word occurs in the text.
6. The method of claim
1, wherein said step of creating a tree structure under each input word
is comprised of creating a tree structure under each input word using a
recursively closed dictionary.
7. The method of claim
1, wherein said step of creating a tree structure under each input word
is comprised of creating a tree structure under each input word using a database
selected from a group consisting of a thesaurus, an encyclopedia, and a
word-based relational database.
8. The method of claim
1, wherein said step of creating a tree structure under each input word
is comprised of creating a tree structure under each input word using a
recursively closed dictionary that is in a different language than the text.
9. The method of claim
1, wherein said step of assigning a definition-word score to each definition
word in each tree structure is comprised of assigning a definition-word score
to each definition word as follows: Ai,t
[j]=W(ßj,t).SIGMA.Ai,t-1 [k]Rk,j,
where Ri,j =Dj /.SIGMA.Dk), where
.SIGMA.Dk represents the sum of the dictionary saliences of the
words in the definition of word wi, where Dj
=ßj (Sj log(dmax /dj))
0.5, where dt is the number of dictionary terms that use the
corresponding word in its definition, and where dmax is the number
of times the most frequently used word in the dictionary is used.
10. The method of claim
1, wherein said step of assigning a definition-word score to each definition
word in each tree structure is comprised of assigning a definition-word score
to each definition word as follows: Ai,t
[j]=W(ßj,t).SIGMA.Ai,t-1 [k]Rk,j,
where Ri,j =Dj /.SIGMA.Dk), where
.SIGMA.Dk represents the sum of the dictionary saliences of the
words in the definition of word wi, where Dj
=ßj (Sj log(dm /.DELTA.j))
0.5, where .DELTA.j =max(dj, .epsilon.), and
dm is chosen such that a fixed percentage of the observed values
of the dj 's are larger than dm.
11. The method of claim
1, wherein said step of assigning a definition-word score is comprised
of the step of assigning a score to each definition word that is user-definable.
12. The method of claim
1, wherein said step of collapsing each tree structure is comprised of
collapsing each tree structure to a corresponding tree-word list, where each
tree-word list contains only salient input words and definition words in
a particular tree structure having the highest score while ignoring lower
scoring definition words in that tree structure even if the lower scoring
definition words score higher than definition words contained in other tree
structures.
13. The method of claim
1, wherein said step of assigning a tree-word-list score to each word
in each tree-word list is comprised of assigning a tree-word-list score that
is the sum of the scores associated with the word in its corresponding tree
structure.
14. The method of claim
1, wherein said step of assigning a final word list score is comprised
of the step of assigning a final word list score according to the following
equation
Afi [j]=((Dj (f(Ai
[j]))).SIGMA.Ai [j]).
15. The method of claim
1, further comprising the step of translating the topic description into
a language different from the input text and the language of the dictionary.
16. The method of claim
1, further comprising the steps of:
Si =0, if fi
>fmax ;
Si =log (fmax /(fi
-T2 +T)), if T2
Si =log (fmax /T), if
T
Si =.epsilon.+((fi
/T)(log(fmax /T)-.epsilon.)), if fi <=T,
where .epsilon. and T are user-definable values, and where fmax
represents a point where the sum of frequencies of occurrence above the point
equals the sum of frequencies of occurrence below the point.
19. The method of claim
18, wherein said step of assigning a language salience score
Si to each input word further comprises the step of allowing the
user to over-ride the language salience score for a particular word with
a user-definable language salience score.
20. The method of claim
19, wherein said step of assigning an input-word score to each input
word is comprised of the step of assigning an input-word score where said
input-word score is selecting from the group consisting of mSi
ßi and (Si m)ßi, where m is
the number of times the corresponding input word occurs in the text.
21. The method of claim
20, wherein said step of creating a tree structure under each input word
is comprised of creating a tree structure under each input word using a
recursively closed dictionary.
22. The method of claim
21, wherein said step of creating a tree structure under each input word
is comprised of creating a tree structure under each input word using a
recursively closed dictionary that is in a different language than the text.
23. The method of claim
22, wherein said step of assigning a definition-word score to each definition
word in each tree structure is comprised of assigning a definition-word score
to each definition word as follows: Ai,t
[j]=W(ßj, t).SIGMA.Ai,t-1 [k]Rk,j,
where Ri,j =Dj /.SIGMA.Dk), where
.SIGMA.Dk represents the sum of the dictionary saliences of the
words in the definition of word wi, where Dj
=ßj (Sj log(dmax /dj))
0.5, where dt is the number of dictionary terms that use the
corresponding word in its definition, and where dmax is the number
of times the most frequently used word in the dictionary is used.
24. The method of claim
23, wherein said step of assigning a definition-word score to each definition
word in each tree structure is comprised of assigning a definition-word score
to each definition word as follows: Ai,t
[j]=W(ßj,t).SIGMA.Ai,t-1 [k]Rk, j,
where Ri, j =Dj /.SIGMA.Dk), where
.SIGMA.Dk represents the sum of the dictionary saliences of the
words in the definition of word wi, where Dj
=ßj (Sj log (dm /.DELTA.j))
0.5, where .DELTA.j =max(dj, .epsilon.), and
dm is chosen such that a fixed percentage of the observed values
of the dj 's are larger than dm.
25. The method of claim
24, wherein said step of assigning a definition-word score is comprised
of the step of assigning a score to each definition word that is user-definable.
26. The method of claim
25, wherein said step of collapsing each tree structure is comprised
of collapsing each tree structure to a corresponding tree-word list, where
each tree-word list contains only salient input words and definition words
in a particular tree structure having the highest score while ignoring lower
scoring definition words in that tree structure even if the lower scoring
definition words score higher than definition words contained in other tree
structures.
27. The method of claim
26, wherein said step of assigning a tree-word-list score to each word
in each tree-word list is comprised of assigning a tree-word-list score that
is the sum of the scores associated with the word in its corresponding tree
structure.
28. The method of claim
27, wherein said step of assigning a final word list score is comprised
of the step of assigning a final word list score according to the following
equation
Afi [j]=((Dj (f(Ai
[j]))).SIGMA.Ai [j]).
29. The method of claim
28, further comprising the step of translating the topic description
into a language different from the input text and the language of the dictionary.
30. The method of claim
29, further comprising the steps of:
Foreign References:
http://cryptome.org/nsa-vox-pat.zip
(319K)