15 November 1999. Thanks to JA.
Source: http://www.patents.ibm.com/details?&pn=US05937422__

See 11 TIF images of patent with text and diagrams:
     http://cryptome.org/nsa-vox-pat.zip  (319K)

See related news report:
     http://www.independent.co.uk/news/Digital/Features/spies151199.shtml


US5937422: Automatically generating a topic description for text and searching and sorting text by topic using the same


Nelson; Douglas J. , Columbia, MD
Schone; Patrick John , Elkridge, MD
Bates; Richard Michael , Greenbelt, MD

Applicant(s):

The United States of America as represented by the National Security Agency, Washington, DC

Issued/Filed Dates:

Aug. 10, 1999 / April 15, 1997

Application Number:

US1997000834263

IPC Class:

G06F 017/30;

Class:

707/531; 707/004; 707/532; 707/535; 707/512;

Field of Search:

704/010 707/512,532,535,531,3-5,7

Abstract:

A method of automatically generating a topical description of text by receiving the text containing input words; stemming each input word to its root form; assigning a user-definable part-of-speech score to each input word; assigning a language salience score to each input word; assigning an input-word score to each input word; creating a tree structure under each input word, where each tree structure contains the definition of the corresponding input word; assigning a definition-word score to each definition word; collapsing each tree structure to a corresponding tree-word list; assigning a tree-word-list score to each entry in each tree-word list; combining the tree-word lists into a final word list; assigning each word in the final word list a final-word-list score; and choosing the top N scoring words in the final word list as the topic description of the input text. Document searching and sorting may be accomplished by performing the method described above on each document in a database and then comparing the similarity of the resulting topical descriptions.

Attorney, Agent, or Firm:

Morelli; Robert D.;

Primary/Assistant Examiners:

Amsbury; Wayne; Channavajjala; Srirama

U.S. References:

(No patents reference this one)

Patent   Issued   Inventor(s) Title
US4965763 10 /1990 Zamora Computer method for automatic extraction of commonly specified information from business correspondence
US5371673 12 /1994 Fan Information processing analysis system for sorting and scoring text
US5384703 1 /1995 Withgott et al. Method and apparatus for summarizing documents according to theme
US5434962 7 /1995 Kyojima et al. Method and system for automatically generating logical structures of electronic documents
US5619410 4 /1997 Emori et al. Keyword extraction apparatus for Japanese texts
US5845278 12 /1998 Kirsch et al. Method for automatically selecting collections to search in full text searches
US5873660 2 /1999 Walsh et al. Morphological search and replace

CLAIMS:

What is claimed is:
    1. A method of automatically generating a topical description of text, comprising the steps of:

    2. The method of claim 1, wherein said step of receiving the text, is comprised of the step of receiving text wherein said text is selected from the group consisting of speech-based text, optical-character-read text, stop-word-filtered text, stutter-phrase-filtered text, and lexical-collocation-filtered text.
    3. The method of claim 1, wherein said step of assigning a language salience score Si to each input word is comprised of the step of determining the language salience score for each input word from the frequency count fi of each word in a large corpus of text as follows:
    Si =0, if fi >fmax ;

    Si =log (fmax /(fi -T2 +T)), if T2 i max ;

    Si =log (fmax /T), if Ti2 ; and

    Si =.epsilon.+((fi /T)(log(fmax /T)-.epsilon.)), if fi<=T,
where .epsilon. and T are user-definable values, and where fmax represents a point where the sum of frequencies of occurrence above the point equals the sum of frequencies of occurrence below the point.
    4. The method of claim 3, wherein said step of assigning a language salience score Si to each input word further comprises the step of allowing the user to over-ride the language salience score for a particular word with a user-definable language salience score.
    5. The method of claim 1, wherein said step of assigning an input-word score to each input word is comprised of the step of assigning an input-word score where said input-word score is selecting from the group consisting of mSi ßi and (Si m)ßi, where m is the number of times the corresponding input word occurs in the text.
    6. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary.
    7. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a database selected from a group consisting of a thesaurus, an encyclopedia, and a word-based relational database.
    8. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary that is in a different language than the text.
    9. The method of claim 1, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: Ai,t [j]=W(ßj,t).SIGMA.Ai,t-1 [k]Rk,j, where Ri,j =Dj /.SIGMA.Dk), where .SIGMA.Dk represents the sum of the dictionary saliences of the words in the definition of word wi, where Djj (Sj log(dmax /dj)) 0.5, where dt is the number of dictionary terms that use the corresponding word in its definition, and where dmax is the number of times the most frequently used word in the dictionary is used.
    10. The method of claim 1, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: Ai,t [j]=W(ßj,t).SIGMA.Ai,t-1 [k]Rk,j, where Ri,j =Dj /.SIGMA.Dk), where .SIGMA.Dk represents the sum of the dictionary saliences of the words in the definition of word wi, where Djj (Sj log(dm /.DELTA.j)) 0.5, where .DELTA.j =max(dj, .epsilon.), and dm is chosen such that a fixed percentage of the observed values of the dj 's are larger than dm.
    11. The method of claim 1, wherein said step of assigning a definition-word score is comprised of the step of assigning a score to each definition word that is user-definable.
    12. The method of claim 1, wherein said step of collapsing each tree structure is comprised of collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains only salient input words and definition words in a particular tree structure having the highest score while ignoring lower scoring definition words in that tree structure even if the lower scoring definition words score higher than definition words contained in other tree structures.
    13. The method of claim 1, wherein said step of assigning a tree-word-list score to each word in each tree-word list is comprised of assigning a tree-word-list score that is the sum of the scores associated with the word in its corresponding tree structure.
    14. The method of claim 1, wherein said step of assigning a final word list score is comprised of the step of assigning a final word list score according to the following equation
    Afi [j]=((Dj (f(Ai [j]))).SIGMA.Ai [j]).

    15. The method of claim 1, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.
    16. The method of claim 1, further comprising the steps of:

    17. The method of claim 1, further comprising the steps of:

    18. The method of claim 2, wherein said step of assigning a language salience score Si to each input word is comprised of the step of determining the language salience score for each input word from the frequency count fi of each word in a large corpus of text as follows:
    Si =0, if fi >fmax ;

    Si =log (fmax /(fi -T2 +T)), if T2 i max ;

    Si =log (fmax /T), if Ti2 ;and

    Si =.epsilon.+((fi /T)(log(fmax /T)-.epsilon.)), if fi <=T,
where .epsilon. and T are user-definable values, and where fmax represents a point where the sum of frequencies of occurrence above the point equals the sum of frequencies of occurrence below the point.
    19. The method of claim 18, wherein said step of assigning a language salience score Si to each input word further comprises the step of allowing the user to over-ride the language salience score for a particular word with a user-definable language salience score.
    20. The method of claim 19, wherein said step of assigning an input-word score to each input word is comprised of the step of assigning an input-word score where said input-word score is selecting from the group consisting of mSi ßi and (Si m)ßi, where m is the number of times the corresponding input word occurs in the text.
    21. The method of claim 20, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary.
    22. The method of claim 21, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary that is in a different language than the text.
    23. The method of claim 22, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: Ai,t [j]=W(ßj, t).SIGMA.Ai,t-1 [k]Rk,j, where Ri,j =Dj /.SIGMA.Dk), where .SIGMA.Dk represents the sum of the dictionary saliences of the words in the definition of word wi, where Djj (Sj log(dmax /dj)) 0.5, where dt is the number of dictionary terms that use the corresponding word in its definition, and where dmax is the number of times the most frequently used word in the dictionary is used.
    24. The method of claim 23, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: Ai,t [j]=W(ßj,t).SIGMA.Ai,t-1 [k]Rk, j, where Ri, j =Dj /.SIGMA.Dk), where .SIGMA.Dk represents the sum of the dictionary saliences of the words in the definition of word wi, where Djj (Sj log (dm /.DELTA.j)) 0.5, where .DELTA.j =max(dj, .epsilon.), and dm is chosen such that a fixed percentage of the observed values of the dj 's are larger than dm.
    25. The method of claim 24, wherein said step of assigning a definition-word score is comprised of the step of assigning a score to each definition word that is user-definable.
    26. The method of claim 25, wherein said step of collapsing each tree structure is comprised of collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains only salient input words and definition words in a particular tree structure having the highest score while ignoring lower scoring definition words in that tree structure even if the lower scoring definition words score higher than definition words contained in other tree structures.
    27. The method of claim 26, wherein said step of assigning a tree-word-list score to each word in each tree-word list is comprised of assigning a tree-word-list score that is the sum of the scores associated with the word in its corresponding tree structure.
    28. The method of claim 27, wherein said step of assigning a final word list score is comprised of the step of assigning a final word list score according to the following equation
    Afi [j]=((Dj (f(Ai [j]))).SIGMA.Ai [j]).

    29. The method of claim 28, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.
    30. The method of claim 29, further comprising the steps of:

    31. The method of claim 30, further comprising the steps of:



Foreign References:

none

(No patents reference this one)


See 11 TIF images of patent with text and diagrams:

http://cryptome.org/nsa-vox-pat.zip  (319K)