US5937422: Automatically generating a topic description for text and searching and sorting text by topic using the same

Inventor(s):

Nelson; Douglas J., Columbia, MD
Schone; Patrick John, Elkridge, MD
Bates; Richard Michael, Greenbelt, MD

Applicant(s):

The United States of America as represented by the National Security Agency, Washington, DC

Issued/Filed Dates:

Aug. 10, 1999 / April 15, 1997

Application Number:

US1997000834263

IPC Class:

G06F 017/30;

Class:

707/531; 707/004; 707/532; 707/535; 707/512;

Field of Search:

704/010 707/512,532,535,531,3-5,7

Abstract:

A method of automatically generating a topical description of text by receiving the text containing input words; stemming each input word to its root form; assigning a user-definable part-of-speech score to each input word; assigning a language salience score to each input word; assigning an input-word score to each input word; creating a tree structure under each input word, where each tree structure contains the definition of the corresponding input word; assigning a definition-word score to each definition word; collapsing each tree structure to a corresponding tree-word list; assigning a tree-word-list score to each entry in each tree-word list; combining the tree-word lists into a final word list; assigning each word in the final word list a final-word-list score; and choosing the top N scoring words in the final word list as the topic description of the input text. Document searching and sorting may be accomplished by performing the method described above on each document in a database and then comparing the similarity of the resulting topical descriptions.

Attorney, Agent, or Firm:

Morelli; Robert D.;

Primary/Assistant Examiners:

Amsbury; Wayne; Channavajjala; Srirama

U.S. References:

(No patents reference this one)

Patent	Issued	Inventor(s)	Title
US4965763	10/1990	Zamora	Computer method for automatic extraction of commonly specified information from business correspondence
US5371673	12/1994	Fan	Information processing analysis system for sorting and scoring text
US5384703	1/1995	Withgott et al.	Method and apparatus for summarizing documents according to theme
US5434962	7/1995	Kyojima et al.	Method and system for automatically generating logical structures of electronic documents
US5619410	4/1997	Emori et al.	Keyword extraction apparatus for Japanese texts
US5845278	12/1998	Kirsch et al.	Method for automatically selecting collections to search in full text searches
US5873660	2/1999	Walsh et al.	Morphological search and replace

CLAIMS:

What is claimed is:
1. A method of automatically generating a topical description of text, comprising the steps of:

a) receiving the text, where the text consists of one or more input words;
b) stemming each input word to its root form;
c) assigning a user-definable part-of-speech score ß_i to each input word;
d) assigning a language salience score S_i to each input word;
e) assigning an input-word score to each input word that is a function of the corresponding input word's part-of-speech score ß_i, language salience score S_i, and the number of times the corresponding input word appears in the text;
f) creating a tree structure under each input word, where each tree structure contains the definition of the corresponding input word, where each definition word may be further defined to a user-definable number of levels;
g) assigning a definition-word score A_i,t [j] to each definition word in each tree structure based on the definition word's part-of-speech score ß_j, the language salience score of the word the definition word defines, a relational salience score R_k,j, and a user-definable factor W;
h) collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains the unique words contained in the corresponding tree structure;
i) assigning a tree-word-list score to each word in each tree-word list, where each tree-word-list score is a function of the scores of the corresponding word that existed in the corresponding uncollapsed tree structure;
j) combining the tree-word lists into a final word list, where the final word list contains the unique words contained in the tree-word lists;
k) assigning a final-word-list score A_fi [j] to each word in the final word list, where A_fi [j] is a function of the corresponding word's dictionary salience and tree-word-list scores; and
l) choosing the top N scoring words in the final word list as the topic description of the input text, where the value N may be defined by the user.
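
Steps (a) through (l) can be walked through on toy data. The following is only an illustrative sketch under stated assumptions, not the patented implementation: lowercasing stands in for stemming, each definition word receives an equal share of its parent's score in place of the relational salience R_k,j, and the final score is a plain sum rather than the A_fi [j] function. The name `topic_description` and all of its parameters are hypothetical.

```python
from collections import Counter

def topic_description(words, definitions, salience, pos_score,
                      levels=1, weight=0.5, top_n=3):
    """Toy walk-through of steps (a)-(l); all scoring is simplified."""
    # (a)-(b): receive the words; lowercasing is a stand-in for real stemming.
    counts = Counter(w.lower() for w in words)
    final = Counter()
    for word, m in counts.items():
        # (c)-(e): input-word score as m * S_i * beta_i (assumed product form).
        root = m * salience.get(word, 0.0) * pos_score.get(word, 1.0)
        tree_list = Counter({word: root})      # (h)-(i): collapsed list, scores summed
        frontier = {word: root}
        for _ in range(levels):                # (f): expand definitions level by level
            nxt = Counter()
            for parent, score in frontier.items():
                defs = definitions.get(parent, [])
                for d in defs:                 # (g): equal split replaces R_k,j here
                    nxt[d] += weight * score / len(defs)
            tree_list.update(nxt)
            frontier = nxt
        final.update(tree_list)                # (j)-(k): merge lists by summing scores
    return [w for w, _ in final.most_common(top_n)]   # (l): top N words

# Hypothetical two-entry dictionary and salience table.
toy_definitions = {"dog": ["animal", "canine"], "cat": ["animal", "feline"]}
toy_salience = {"dog": 1.0, "cat": 1.0}
print(topic_description(["Dog", "cat", "dog"], toy_definitions, toy_salience, {}))
# prints ['dog', 'cat', 'animal']
```

Note that "animal" reaches the description even though it never occurs in the input, because both definition trees contribute to its score.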

    2. The method of claim 1, wherein said step of receiving the text is comprised of the step of receiving text wherein said text is selected from the group consisting of speech-based text, optical-character-read text, stop-word-filtered text, stutter-phrase-filtered text, and lexical-collocation-filtered text.
    3. The method of claim 1, wherein said step of assigning a language salience score S_i to each input word is comprised of the step of determining the language salience score for each input word from the frequency count f_i of each word in a large corpus of text as follows:
    S_i = 0, if f_i > f_max;

    S_i = log(f_max / (f_i - T² + T)), if T² < f_i <= f_max;

    S_i = log(f_max / T), if T < f_i <= T²; and

    S_i = .epsilon. + ((f_i / T)(log(f_max / T) - .epsilon.)), if f_i <= T,
where .epsilon. and T are user-definable values, and where f_max represents a point where the sum of frequencies of occurrence above the point equals the sum of frequencies of occurrence below the point.
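
The piecewise salience of claim 3 can be sketched as a small function. This is a minimal illustration, assuming a natural logarithm (the claim does not state the base), T >= 1, and f_max > T; the function and parameter names are hypothetical.

```python
import math

def language_salience(f_i, f_max, T, eps):
    """Language salience S_i of a word with corpus frequency f_i."""
    if f_i > f_max:                  # very common words carry no salience
        return 0.0
    if f_i > T * T:                  # T^2 < f_i <= f_max
        return math.log(f_max / (f_i - T * T + T))
    if f_i > T:                      # T < f_i <= T^2: plateau
        return math.log(f_max / T)
    # f_i <= T: linear ramp from eps up to the plateau value
    return eps + (f_i / T) * (math.log(f_max / T) - eps)
```

Under these assumptions the branches meet at f_i = T and f_i = T², so the score does not jump at either threshold.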
    4. The method of claim 3, wherein said step of assigning a language salience score S_i to each input word further comprises the step of allowing the user to over-ride the language salience score for a particular word with a user-definable language salience score.
    5. The method of claim 1, wherein said step of assigning an input-word score to each input word is comprised of the step of assigning an input-word score where said input-word score is selected from the group consisting of mS_i ß_i and (S_i m)ß_i, where m is the number of times the corresponding input word occurs in the text.
    6. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary.
    7. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a database selected from a group consisting of a thesaurus, an encyclopedia, and a word-based relational database.
    8. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary that is in a different language than the text.
    9. The method of claim 1, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: A_i,t [j] = W(ß_j,t) .SIGMA. A_i,t-1 [k] R_k,j, where R_i,j = D_j / (.SIGMA.D_k), where .SIGMA.D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = ß_j (S_j log(d_max / d_j))^0.5, where d_j is the number of dictionary terms that use the corresponding word in its definition, and where d_max is the number of times the most frequently used word in the dictionary is used.
    10. The method of claim 1, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: A_i,t [j] = W(ß_j,t) .SIGMA. A_i,t-1 [k] R_k,j, where R_i,j = D_j / (.SIGMA.D_k), where .SIGMA.D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = ß_j (S_j log(d_m / .DELTA._j))^0.5, where .DELTA._j = max(d_j, .epsilon.), and d_m is chosen such that a fixed percentage of the observed values of the d_j's are larger than d_m.
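
The dictionary salience, relational salience, and level-to-level score propagation of claim 9 can be sketched as follows. This is a hedged illustration only: it reads the trailing 0.5 as an exponent (a square root), treats W as a constant `weight` rather than a function of the part-of-speech score and level, and assumes 1 <= d_j <= d_max so the logarithm is non-negative. All function and parameter names are hypothetical.

```python
import math

def dictionary_salience(beta_j, s_j, d_j, d_max):
    """D_j = beta_j * sqrt(S_j * log(d_max / d_j)), the claim 9 form."""
    return beta_j * math.sqrt(s_j * math.log(d_max / d_j))

def relational_salience(definition_words, D):
    """Share of a parent's score given to each word in its definition:
    D_j divided by the sum of D over the (unique) definition words."""
    total = sum(D[w] for w in definition_words)
    if total == 0:
        return {w: 0.0 for w in definition_words}
    return {w: D[w] / total for w in definition_words}

def next_level_scores(prev_scores, definitions, D, weight):
    """A_t[j] = weight * sum over k of A_{t-1}[k] * R_k,j."""
    scores = {}
    for k, a_k in prev_scores.items():
        R = relational_salience(definitions.get(k, []), D)
        for j, r in R.items():
            scores[j] = scores.get(j, 0.0) + weight * a_k * r
    return scores
```

Because the relational saliences under one parent sum to 1, each level passes down exactly `weight` times the parent's score in this sketch.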
    11. The method of claim 1, wherein said step of assigning a definition-word score is comprised of the step of assigning a score to each definition word that is user-definable.
    12. The method of claim 1, wherein said step of collapsing each tree structure is comprised of collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains only salient input words and definition words in a particular tree structure having the highest score while ignoring lower scoring definition words in that tree structure even if the lower scoring definition words score higher than definition words contained in other tree structures.
    13. The method of claim 1, wherein said step of assigning a tree-word-list score to each word in each tree-word list is comprised of assigning a tree-word-list score that is the sum of the scores associated with the word in its corresponding tree structure.
    14. The method of claim 1, wherein said step of assigning a final word list score is comprised of the step of assigning a final word list score according to the following equation
    A_fi [j]=((D_j (f(A_i [j]))).SIGMA.A_i [j]).

    15. The method of claim 1, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.
    16. The method of claim 1, further comprising the steps of:

a) receiving a plurality of documents, where one of said plurality of documents is identified as the document of interest;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to the topic description of said document of interest; and
d) returning each of said plurality of documents that has a topic description that is sufficiently similar to the topic description of said document of interest.
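
The search of claim 16 can be sketched once each document already has a topic description. The claim does not say how similarity is measured or what counts as "sufficiently similar"; the Jaccard overlap and the `threshold` value below are assumptions chosen for illustration, and all names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two word lists: |intersection| / |union| (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def search_by_topic(topics, doc_of_interest, threshold=0.5):
    """Return ids of documents whose topic description is close to the target's."""
    target = topics[doc_of_interest]
    return [doc for doc, desc in topics.items()
            if doc != doc_of_interest and jaccard(desc, target) >= threshold]
```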

17. The method of claim 1, further comprising the steps of:

a) receiving a plurality of documents;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to each other of said plurality of documents; and
d) sorting said plurality of documents by topic description.
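
One possible reading of the sorting in claim 17 is to group documents whose topic descriptions overlap. The claim does not specify a grouping procedure; the greedy pass and the `min_shared` cutoff below are assumptions made for this sketch, and the names are hypothetical.

```python
def sort_by_topic(topics, min_shared=2):
    """Greedily group document ids by shared topic words.

    Each group is a list of ids; its first member acts as the representative
    that later documents are compared against.
    """
    groups = []
    for doc, desc in topics.items():
        for group in groups:
            representative = topics[group[0]]
            if len(set(desc) & set(representative)) >= min_shared:
                group.append(doc)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([doc])
    return groups
```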

    18. The method of claim 2, wherein said step of assigning a language salience score S_i to each input word is comprised of the step of determining the language salience score for each input word from the frequency count f_i of each word in a large corpus of text as follows:
    S_i = 0, if f_i > f_max;

    S_i = log(f_max / (f_i - T² + T)), if T² < f_i <= f_max;

    S_i = log(f_max / T), if T < f_i <= T²; and

    S_i = .epsilon. + ((f_i / T)(log(f_max / T) - .epsilon.)), if f_i <= T,
where .epsilon. and T are user-definable values, and where f_max represents a point where the sum of frequencies of occurrence above the point equals the sum of frequencies of occurrence below the point.
    19. The method of claim 18, wherein said step of assigning a language salience score S_i to each input word further comprises the step of allowing the user to over-ride the language salience score for a particular word with a user-definable language salience score.
    20. The method of claim 19, wherein said step of assigning an input-word score to each input word is comprised of the step of assigning an input-word score where said input-word score is selected from the group consisting of mS_i ß_i and (S_i m)ß_i, where m is the number of times the corresponding input word occurs in the text.
    21. The method of claim 20, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary.
    22. The method of claim 21, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary that is in a different language than the text.
    23. The method of claim 22, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: A_i,t [j] = W(ß_j, t) .SIGMA. A_i,t-1 [k] R_k,j, where R_i,j = D_j / (.SIGMA.D_k), where .SIGMA.D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = ß_j (S_j log(d_max / d_j))^0.5, where d_j is the number of dictionary terms that use the corresponding word in its definition, and where d_max is the number of times the most frequently used word in the dictionary is used.
    24. The method of claim 23, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows: A_i,t [j] = W(ß_j,t) .SIGMA. A_i,t-1 [k] R_k,j, where R_i,j = D_j / (.SIGMA.D_k), where .SIGMA.D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = ß_j (S_j log(d_m / .DELTA._j))^0.5, where .DELTA._j = max(d_j, .epsilon.), and d_m is chosen such that a fixed percentage of the observed values of the d_j's are larger than d_m.
    25. The method of claim 24, wherein said step of assigning a definition-word score is comprised of the step of assigning a score to each definition word that is user-definable.
    26. The method of claim 25, wherein said step of collapsing each tree structure is comprised of collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains only salient input words and definition words in a particular tree structure having the highest score while ignoring lower scoring definition words in that tree structure even if the lower scoring definition words score higher than definition words contained in other tree structures.
    27. The method of claim 26, wherein said step of assigning a tree-word-list score to each word in each tree-word list is comprised of assigning a tree-word-list score that is the sum of the scores associated with the word in its corresponding tree structure.
    28. The method of claim 27, wherein said step of assigning a final word list score is comprised of the step of assigning a final word list score according to the following equation
    A_fi [j]=((D_j (f(A_i [j]))).SIGMA.A_i [j]).

    29. The method of claim 28, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.
    30. The method of claim 29, further comprising the steps of:

a) receiving a plurality of documents, where one of said plurality of documents is identified as the document of interest;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to the topic description of said document of interest; and
d) returning each of said plurality of documents that has a topic description that is sufficiently similar to the topic description of said document of interest.

31. The method of claim 30, further comprising the steps of:

a) receiving a plurality of documents;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to each other of said plurality of documents; and
d) sorting said plurality of documents by topic description.

Foreign References:

none

See 11 TIF images of patent with text and diagrams:

http://cryptome.org/nsa-vox-pat.zip (319K)