US7197449B2 - Method for extracting name entities and jargon terms using a suffix tree data structure - Google Patents
- Publication number
- US7197449B2 (Application US10/017,408)
- Authority
- US
- United States
- Prior art keywords
- phrase
- frequently occurring
- phrases
- rules
- filtering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Definitions
- This invention relates generally to natural language processing, and more specifically, to an improved technique for the extraction of name entities and jargon terms.
- Natural language processing encompasses computer understanding, analysis, manipulation, and generation of natural language. From simplistic natural language processing applications, such as string manipulation (e.g., stemming) to higher-level tasks such as machine translation and question answering, the ability to identify and extract entity names and jargon terms in a text corpus is very important. Being able to identify proper names in the text is important to understanding and using the text. For example, in a Chinese-English machine translation system, if a person name is identified, it can be converted to pinyin (system for transliterating Chinese characters into the Latin alphabet) rather than being directly translated.
- Entity names include the names of people, places, organizations, dates, times, monetary amounts and percentages, for example.
- Name entity and jargon term extraction involves identifying named entities in the context of a text corpus. For example, a name entity extraction process must differentiate between “white house” as an adjective-noun combination and “White House” as a named organization or a named location. In English, the use of uppercase and lowercase letters may be indicative, but cannot be relied on to substantially determine name entities and jargon terms. Moreover, case does not aid name entity and jargon term recognition and extraction in languages in which case does not indicate proper nouns (e.g., Chinese) or in non-text modalities (e.g., speech).
- FIG. 1 is a process flow diagram in accordance with one embodiment of the present invention.
- FIGS. 2–4 illustrate examples of the incremental addition of clauses to the suffix tree in accordance with one embodiment of the present invention.
- FIG. 5 illustrates examples of filtering high frequency phrases to obtain more probable entity name and jargon term candidates in accordance with one embodiment of the present invention.
- FIG. 6 is a diagram illustrating an exemplary computing system 600 for implementing the name entity and jargon term recognition and extraction process of the present invention.
- An embodiment of the present invention includes the creation of a suffix tree data structure as part of a process to perform entity name and jargon term extraction on a text corpus.
- The text to be analyzed is preprocessed.
- The form and extent of preprocessing is typically language dependent; for example, Chinese may require spaces to be added between words.
- The text is then separated into clauses and a suffix tree is created for the text.
- The suffix tree is used to determine repetitious segments.
- The set of repetitious segments is then filtered to obtain a set of possible entity names and jargon terms.
- The set of possible entity names and jargon terms is then analyzed and filtered using known natural language processing techniques for entity name and jargon term recognition and extraction.
- An embodiment of the present invention is based on the fact that an unrecognized text fragment, occurring with a high frequency, has a comparably high probability of being a name entity or jargon term.
- The use of a suffix tree structure to efficiently and accurately determine the text fragment frequencies may greatly improve name entity and jargon term recognition and extraction.
- For example, the phrase “Jack and Jill” may be a name entity referring to a poem or book title, or may be a jargon term referring to a couple.
- A typical name entity extraction technique may inaccurately identify “Jack” and “Jill” as separate name entities, and may discard “and” because it is a common and frequently occurring word.
- However, it may be determined that “Jack” is connected to “Jill” with “and” exclusively, or with high frequency, throughout the document. Such analysis would indicate that “Jack and Jill” may be a name entity.
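- As a hedged illustration of the co-occurrence check described above, the following Python sketch measures how often one word is followed by a given continuation. The function name `continuation_ratio` and the sample text are assumptions made for illustration; they do not come from the patent, which obtains such frequencies from a suffix tree rather than from a linear scan.

```python
def continuation_ratio(words, head, tail):
    """Return the fraction of occurrences of `head` that are followed by `tail`.

    `words` is a list of tokens, `head` a single token, and `tail` a list of
    tokens. A ratio near 1.0 suggests that head + tail behaves as one unit.
    """
    head_count = 0
    joined_count = 0
    for i, word in enumerate(words):
        if word != head:
            continue
        head_count += 1
        if words[i + 1:i + 1 + len(tail)] == tail:
            joined_count += 1
    return joined_count / head_count if head_count else 0.0


text = "Jack and Jill went up the hill . Jack and Jill fell down . Jack and Jill laughed"
ratio = continuation_ratio(text.split(), "Jack", ["and", "Jill"])
print(ratio)  # 1.0: every "Jack" is followed by "and Jill" in this sample
```

A ratio close to 1.0 is the kind of evidence that would support treating “Jack and Jill” as a single candidate rather than as two separate names.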
- A suffix tree is a type of data structure used to simplify and accelerate text-string searches in extensive text corpora.
- Suffix tree algorithms have evolved and become more efficient over the past twenty-five years.
- Suffix trees allow a one-time commitment of processing resources to construct a suffix tree.
- A fast and efficient search may then be made of any patterns/substrings within the suffix tree.
- Suffix trees may be applied to a wide variety of text-string problems occurring in text editing, free-text search, and other search pattern applications involving a large amount of text.
- Suffix trees are primarily used for searching text.
- An embodiment of the present invention employs the suffix tree concept as part of a name entity and jargon term recognition and extraction process.
- A brief explanation of how a suffix tree is created and used in accordance with one embodiment of the present invention is included below in reference to FIGS. 2–4.
- FIG. 1 is a process flow diagram in accordance with one embodiment of the present invention.
- The process 100 begins at operation 105, in which the text corpus is preprocessed. The form and extent of preprocessing is language dependent. The intended result of the preprocessing is to have the text corpus separated into clauses. Some languages, such as Chinese, in which there are no spaces between words, may require that the text be separated into words and that spaces be inserted between the words. Separating the text into clauses facilitates the construction of a suffix tree.
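- The clause separation step can be sketched in Python as follows. This is a minimal illustration, not the patent's preprocessing: the punctuation set used as clause boundaries and the name `split_into_clauses` are assumptions, and the sketch relies on whitespace between words, so it does not perform the word segmentation that a language such as Chinese would need first.

```python
import re

# Punctuation treated as clause boundaries; this set is an assumption.
CLAUSE_BOUNDARY = re.compile(r"[.,;:!?]+")


def split_into_clauses(text):
    """Split raw text into clauses, each returned as a list of words."""
    clauses = []
    for raw_clause in CLAUSE_BOUNDARY.split(text):
        words = raw_clause.split()
        if words:  # skip empty fragments between adjacent boundaries
            clauses.append(words)
    return clauses


clauses = split_into_clauses("George Bush said hello. George said, goodbye!")
print(clauses)
# [['George', 'Bush', 'said', 'hello'], ['George', 'said'], ['goodbye']]
```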
- A suffix tree is created by adding all clauses to the suffix tree incrementally.
- FIGS. 2–4 illustrate examples of how clauses may be added, incrementally, to the suffix tree in accordance with one embodiment of the present invention.
- The process involves two parameters: a start node, which is an existing portion of the suffix tree, and a string suffix to be added to the suffix tree.
- FIG. 2 illustrates the addition of a string suffix having no overlap with the existing suffix tree structure.
- Structure 205 comprises fork node 1 and leaf node 2 representing string edge “a”, which may represent a word or phrase.
- Structure 210 shows the addition of string suffix “b” having no common elements.
- Node 3 has been created to represent the string suffix “b” added to the suffix tree. What is represented by the suffix tree, then, are those string edges obtained by traversing from a fork node, or series of fork nodes, to a leaf node.
- FIG. 3 illustrates the addition of a string suffix, “ab”, to string edge “a”.
- The string edge and the string suffix share the common element “a”.
- Initially, fork node 1 and leaf node 2 represent the string edge “a”.
- After the addition, nodes 1 and 2 are fork nodes and leaf node 3 has been created.
- FIG. 4 illustrates a more complex addition to the suffix tree.
- Structure 405 comprises fork node 1 and leaf node 2 in which string edge “abc” is represented.
- Each element (“a”, “b”, “c”) of string edge “abc” may represent a word in a phrase, for example, “George Bush said” where “a” represents “George”, “b” represents “Bush”, and “c” represents “said”.
- Structure 410 shows the incremental addition of a clause to the suffix tree.
- The clause added to the suffix tree is string suffix “ac”, in this case representing the phrase “George said”.
- The string edge “abc” has been split at the point of overlap (i.e., after element “a”).
- A new fork node, node 3, has been created, as well as a new leaf node, node 4.
- Traversal from fork node 1, through fork node 3, to leaf node 4 represents the string suffix to be added, “ac”.
- The original string edge, “abc”, is represented by traversing from fork node 1, through fork node 3, to leaf node 2.
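- The three cases of FIGS. 2–4 differ only in how much of an existing string edge overlaps the string suffix being added. The Python sketch below computes that overlap for sequences of words; the helper name `split_overlap` is hypothetical and the code is an illustration of the idea, not the patent's implementation.

```python
def split_overlap(edge, suffix):
    """Return (overlap, edge_left, suffix_left) for two token sequences.

    `overlap` is the longest common prefix; the other two values are what
    remains of each input after that prefix is removed.
    """
    n = 0
    while n < len(edge) and n < len(suffix) and edge[n] == suffix[n]:
        n += 1
    return edge[:n], edge[n:], suffix[n:]


# FIG. 2: "a" and "b" share nothing, so "b" becomes a new leaf under the root.
print(split_overlap(["a"], ["b"]))                 # ([], ['a'], ['b'])
# FIG. 3: "ab" extends "a"; only "b" remains to be added below the edge.
print(split_overlap(["a"], ["a", "b"]))            # (['a'], [], ['b'])
# FIG. 4: "abc" and "ac" overlap on "a"; the edge is split after "a".
print(split_overlap(["a", "b", "c"], ["a", "c"]))  # (['a'], ['b', 'c'], ['c'])
```

An empty overlap means a new leaf is attached to the start node, an empty `edge_left` means insertion continues below the existing edge, and a non-empty `edge_left` means the edge must be split with a new fork node.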
- An exemplary algorithm for constructing a suffix tree in accordance with the present invention is included as Appendix A. This algorithm may be replaced with faster or more efficient suffix tree algorithms known in the art. Because suffix trees are general data structures that are independent of language, the method of one embodiment of the present invention is language independent. This is an important advantage in improving the performance for entity name extraction algorithms for languages such as Chinese that are less structured, and therefore more difficult to process, than, for example, English.
- The repetitious phrases are determined from the suffix tree.
- The frequency of each phrase is stored in each corresponding fork node as the suffix tree is created.
- Those phrases that are unrecognized (as compared, for example, to a dictionary) and occur with high frequency are collected.
- The high-frequency phrases are then sorted in inverse lexicographical order.
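- The collection step can be approximated in a few lines of Python. This sketch makes several assumptions that are not in the patent: it counts word sequences with a `collections.Counter` as a stand-in for the frequencies stored in the fork nodes, it limits phrases to `max_len` words, it uses a hypothetical `min_count` threshold for "high frequency", and it reads "inverse lexicographical order" as descending lexicographical order.

```python
from collections import Counter


def phrase_frequencies(clauses, max_len=4):
    """Count every contiguous word sequence (up to max_len words) in each clause."""
    counts = Counter()
    for words in clauses:
        for start in range(len(words)):
            for end in range(start + 1, min(start + max_len, len(words)) + 1):
                counts[" ".join(words[start:end])] += 1
    return counts


def frequent_unrecognized(counts, dictionary, min_count=2):
    """Keep phrases that are frequent and absent from the dictionary,
    sorted in descending lexicographical order."""
    kept = [p for p, c in counts.items() if c >= min_count and p not in dictionary]
    return sorted(kept, reverse=True)


clauses = [
    ["George", "Bush", "said"],
    ["George", "Bush", "said"],
    ["George", "said"],
]
counts = phrase_frequencies(clauses)
print(frequent_unrecognized(counts, dictionary={"said"}))
# ['George Bush said', 'George Bush', 'George', 'Bush said', 'Bush']
```

With this ordering, a longer phrase is listed immediately before its own prefixes, which places related candidates next to each other for the filtering that follows.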
- An embodiment of the invention is based on the fact that unrecognized text strings that occur at an unusually high frequency have a correspondingly high probability of being entity names or jargon terms. Of course, not all high frequency phrases are entity names or jargon terms. Therefore, at operation 120, the set of high frequency phrases is filtered to produce a smaller and more concise set (a set of high frequency phrases with a greater likelihood of being entity names or jargon terms).
- FIG. 5 illustrates three examples of filtering high frequency phrases to obtain more likely entity name and jargon term candidates in accordance with one embodiment of the present invention.
- Example 1: A phrase “AB”, composed of component phrases “A” and “B”, appears with a frequency comparable to that of its component phrases.
- For example, the phrase “AB” corresponds to “Bill Clinton”, with “A” corresponding to “Bill” and “B” corresponding to “Clinton”.
- Example 2: “B”, a component of “AB”, appears with a frequency much higher than “A” or “AB”, although “A” and “AB” appear frequently. This indicates that “A” and “B” may be name entities or jargon terms, but that “AB” may be safely discarded.
- For example, “AB” may represent a phrase such as “George Bush said” with “A” representing “George Bush” and “B” representing “said”. This indicates that “AB” (“George Bush said”) is probably not a name entity, but that separately “A” (“George Bush”) and “B” (“said”) may be.
- “said” is not actually a name entity and will be eliminated as a candidate, as will all common words, through comparison to a dictionary.
- Example 3: Similar to Example 1, similar frequencies between “AB” and “B” indicate that “B” is a substring of “AB” and may be discarded as a name entity or jargon term candidate. Thus it is possible, using relative frequencies, to reduce substantially, and with a high degree of confidence, the number of high frequency phrases that are likely name entity or jargon term candidates.
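- The relative-frequency reasoning of the three examples can be sketched as a small Python function. Everything specific here is an assumption for illustration: the function name `filter_candidates`, the `similar` and `dominant` thresholds, and the sample counts are not from the patent. The sketch also assumes that in Example 1 the components are the ones discarded, by analogy with Example 3, and that all counts are positive.

```python
def filter_candidates(freq_a, freq_b, freq_ab, similar=0.8, dominant=5.0):
    """Return which of "A", "B", "AB" stay candidates, using relative frequency.

    Two counts are "similar" when smaller / larger >= `similar`; a component
    "dominates" when its count is at least `dominant` times that of "AB".
    """
    def is_similar(x, y):
        return min(x, y) / max(x, y) >= similar

    kept = {"A", "B", "AB"}
    if freq_a >= dominant * freq_ab or freq_b >= dominant * freq_ab:
        # Example 2: a component is far more frequent than the whole phrase.
        kept.discard("AB")
        return kept
    if is_similar(freq_a, freq_ab):
        kept.discard("A")  # "A" rarely occurs outside "AB"
    if is_similar(freq_b, freq_ab):
        kept.discard("B")  # "B" rarely occurs outside "AB"
    return kept


print(sorted(filter_candidates(10, 11, 10)))   # Example 1: ['AB']
print(sorted(filter_candidates(10, 100, 8)))   # Example 2: ['A', 'B']
print(sorted(filter_candidates(40, 11, 10)))   # Example 3: ['A', 'AB']
```

The dominance test runs first so that a very common word such as “said” removes the combined phrase before the similarity test can discard “George Bush”.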
- FIG. 6 is a diagram illustrating an exemplary computing system 600 for implementing the name entity and jargon term recognition and extraction process of the present invention.
- The text processing, creation of a suffix tree, and candidate set filtering described herein can be implemented and utilized within computing system 600, which can represent a general-purpose computer, portable computer, or other like device.
- The components of computing system 600 are exemplary; one or more components can be omitted or added.
- For example, one or more memory devices can be utilized for computing system 600.
- Computing system 600 includes a central processing unit (CPU) 602 and a signal processor 603 coupled to a display circuit 605, main memory 604, static memory 606, and mass storage device 607 via bus 601.
- Computing system 600 can also be coupled to a display 621, keypad input 622, cursor control 623, hard copy device 624, input/output (I/O) devices 625, and audio/speech device 626 via bus 601.
- Bus 601 is a standard system bus for communicating information and signals.
- CPU 602 and signal processor 603 are processing units for computing system 600 .
- CPU 602 or signal processor 603 or both can be used to process information and/or signals for computing system 600 .
- CPU 602 includes a control unit 631, an arithmetic logic unit (ALU) 632, and several registers 633, which are used to process information and signals.
- Signal processor 603 can also include components similar to those of CPU 602.
- Main memory 604 can be, e.g., a random access memory (RAM) or some other dynamic storage device, for storing information or instructions (program code), which are used by CPU 602 or signal processor 603 .
- Main memory 604 may store temporary variables or other intermediate information during execution of instructions by CPU 602 or signal processor 603 .
- Static memory 606 can be, e.g., a read only memory (ROM) and/or other static storage devices, for storing information or instructions, which can also be used by CPU 602 or signal processor 603 .
- Mass storage device 607 can be, e.g., a hard or floppy disk drive or optical disk drive, for storing information or instructions for computing system 600.
- Display 621 can be, e.g., a cathode ray tube (CRT) or liquid crystal display (LCD). Display 621 displays information or graphics to a user.
- Computing system 600 can interface with display 621 via display circuit 605.
- Keypad input 622 is an alphanumeric input device with an analog to digital converter.
- Cursor control 623 can be, e.g., a mouse, a trackball, or cursor direction keys, for controlling movement of an object on display 621.
- Hard copy device 624 can be, e.g., a laser printer, for printing information on paper, film, or some other like medium.
- A number of input/output devices 625 can be coupled to computing system 600.
- CPU 602 or signal processor 603 can execute code or instructions stored in a machine-readable medium, e.g., main memory 604.
- The machine-readable medium may include a mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine such as a computer or digital processing device.
- For example, a machine-readable medium may include a read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, and flash memory devices.
- The code or instructions may be represented by carrier-wave signals, infrared signals, digital signals, and other like signals.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Description
APPENDIX A
FOR all clauses in the document
    FOR all suffixes in the clause
        Add_a_Suffix (headnode, Suffix);

Add_a_Suffix (NODE *startNode, char *strSuffix)
{
    Find out whether there exists an edge strEdge that has the same prefix as the input string strSuffix;
    IF the edge exists
    {
        Let strOverlap = strEdge ∩ strSuffix;
        (Here strOverlap is the prefix shared between strEdge and strSuffix,
        e.g. if strEdge is ‘abc’ and strSuffix is ‘ab’, then strOverlap is ‘ab’.)
        strEdgeLeft = strEdge − strOverlap;
        (Here strEdgeLeft is the part of strEdge that is left when removing
        strOverlap from it, e.g. in the last example, it is ‘c’.)
        strSentLeft = strSuffix − strOverlap;
        (Here strSentLeft is the part of strSuffix that is left when removing
        strOverlap from it, e.g. in the last example, it is NULL.)
        IF strEdgeLeft == NULL
        {
            Add_a_Suffix (edge->end, strSentLeft);
            RETURN;
            (The recursive call adds the remainder, so nothing more is done at this level.)
        }
        ELSE
            Create a new fork node, and split the edge;
    }
    Create a new leaf node;
}
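The pseudocode above can be rendered as a short runnable Python sketch. This is one possible reading under stated assumptions, not the patent's implementation: the `Node` class, the use of word tuples as edge labels, and the names `add_suffix` and `build_tree` are hypothetical; the sketch adds no leaf when nothing of the suffix remains after the overlap; and it omits the per-node frequency counts described in the text.

```python
class Node:
    """A suffix-tree node; `edges` maps an edge label (tuple of words) to a child."""

    def __init__(self):
        self.edges = {}


def add_suffix(start, suffix):
    """Insert `suffix` (a tuple of words) below `start`, splitting an edge if needed."""
    if not suffix:
        return
    for label, child in list(start.edges.items()):
        n = 0
        while n < len(label) and n < len(suffix) and label[n] == suffix[n]:
            n += 1
        if n == 0:
            continue  # this edge shares no prefix with the suffix
        if n == len(label):
            # Whole edge matched: continue below it with the rest of the suffix.
            add_suffix(child, suffix[n:])
        else:
            # Partial match: split the edge at the overlap with a new fork node.
            fork = Node()
            del start.edges[label]
            start.edges[label[:n]] = fork
            fork.edges[label[n:]] = child
            add_suffix(fork, suffix[n:])
        return
    start.edges[suffix] = Node()  # no overlapping edge: create a new leaf


def build_tree(clauses):
    """Add every suffix of every clause to a fresh tree and return its root."""
    root = Node()
    for clause in clauses:
        for i in range(len(clause)):
            add_suffix(root, tuple(clause[i:]))
    return root


root = Node()
add_suffix(root, ("a", "b", "c"))
add_suffix(root, ("a", "c"))
fork = root.edges[("a",)]
print(sorted(fork.edges))  # [('b', 'c'), ('c',)]
```

The final lines reproduce the FIG. 4 case: adding “ac” to a tree holding “abc” splits the edge after “a”, leaving a fork whose two outgoing edges are “bc” and “c”.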
Claims (27)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/017,408 US7197449B2 (en) | 2001-10-30 | 2001-10-30 | Method for extracting name entities and jargon terms using a suffix tree data structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/017,408 US7197449B2 (en) | 2001-10-30 | 2001-10-30 | Method for extracting name entities and jargon terms using a suffix tree data structure |
Publications (2)
Publication Number | Publication Date |
---|---|
US20030083862A1 (en) | 2003-05-01 |
US7197449B2 (en) | 2007-03-27 |
Family
ID=21782421
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/017,408 Expired - Fee Related US7197449B2 (en) | 2001-10-30 | 2001-10-30 | Method for extracting name entities and jargon terms using a suffix tree data structure |
Country Status (1)
Country | Link |
---|---|
US (1) | US7197449B2 (en) |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050222837A1 (en) * | 2004-04-06 | 2005-10-06 | Paul Deane | Lexical association metric for knowledge-free extraction of phrasal terms |
US20070143282A1 (en) * | 2005-03-31 | 2007-06-21 | Betz Jonathan T | Anchor text summarization for corroboration |
US20070150800A1 (en) * | 2005-05-31 | 2007-06-28 | Betz Jonathan T | Unsupervised extraction of facts |
US20070198600A1 (en) * | 2006-02-17 | 2007-08-23 | Betz Jonathan T | Entity normalization via name normalization |
US20070198597A1 (en) * | 2006-02-17 | 2007-08-23 | Betz Jonathan T | Attribute entropy as a signal in object normalization |
US20080162113A1 (en) * | 2006-12-28 | 2008-07-03 | Dargan John P | Method and Apparatus for for Predicting Text |
US7792837B1 (en) * | 2007-11-14 | 2010-09-07 | Google Inc. | Entity name recognition |
US7966291B1 (en) | 2007-06-26 | 2011-06-21 | Google Inc. | Fact-based object merging |
US7970766B1 (en) | 2007-07-23 | 2011-06-28 | Google Inc. | Entity type assignment |
US7991797B2 (en) | 2006-02-17 | 2011-08-02 | Google Inc. | ID persistence through normalization |
US8122026B1 (en) | 2006-10-20 | 2012-02-21 | Google Inc. | Finding and disambiguating references to entities on web pages |
US20120109638A1 (en) * | 2010-10-27 | 2012-05-03 | Hon Hai Precision Industry Co., Ltd. | Electronic device and method for extracting component names using the same |
US8239350B1 (en) | 2007-05-08 | 2012-08-07 | Google Inc. | Date ambiguity resolution |
US8260785B2 (en) | 2006-02-17 | 2012-09-04 | Google Inc. | Automatic object reference identification and linking in a browseable fact repository |
US8347202B1 (en) | 2007-03-14 | 2013-01-01 | Google Inc. | Determining geographic locations for place names in a fact repository |
US8650175B2 (en) | 2005-03-31 | 2014-02-11 | Google Inc. | User interface for facts query engine with snippets from information sources that include query terms and answer terms |
US8682913B1 (en) | 2005-03-31 | 2014-03-25 | Google Inc. | Corroborating facts extracted from multiple sources |
US8738643B1 (en) | 2007-08-02 | 2014-05-27 | Google Inc. | Learning synonymous object names from anchor texts |
US8812435B1 (en) | 2007-11-16 | 2014-08-19 | Google Inc. | Learning objects and facts from documents |
US8996470B1 (en) | 2005-05-31 | 2015-03-31 | Google Inc. | System for ensuring the internal consistency of a fact repository |
US20160085742A1 (en) * | 2014-09-23 | 2016-03-24 | Kaybus, Inc. | Automated collective term and phrase index |
US20160140104A1 (en) * | 2005-05-05 | 2016-05-19 | Cxense Asa | Methods and systems related to information extraction |
US9508054B2 (en) | 2011-07-19 | 2016-11-29 | Slice Technologies, Inc. | Extracting purchase-related information from electronic messages |
US9563904B2 (en) | 2014-10-21 | 2017-02-07 | Slice Technologies, Inc. | Extracting product purchase information from electronic messages |
US9575958B1 (en) * | 2013-05-02 | 2017-02-21 | Athena Ann Smyros | Differentiation testing |
US9641474B2 (en) | 2011-07-19 | 2017-05-02 | Slice Technologies, Inc. | Aggregation of emailed product order and shipping information |
US9875486B2 (en) | 2014-10-21 | 2018-01-23 | Slice Technologies, Inc. | Extracting product purchase information from electronic messages |
US11032223B2 (en) | 2017-05-17 | 2021-06-08 | Rakuten Marketing Llc | Filtering electronic messages |
US20230118640A1 (en) * | 2020-03-25 | 2023-04-20 | Metis Ip (Suzhou) Llc | Methods and systems for extracting self-created terms in professional area |
US11803883B2 (en) | 2018-01-29 | 2023-10-31 | Nielsen Consumer Llc | Quality assurance for labeled training data |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8266215B2 (en) | 2003-02-20 | 2012-09-11 | Sonicwall, Inc. | Using distinguishing properties to classify messages |
US7299261B1 (en) | 2003-02-20 | 2007-11-20 | Mailfrontier, Inc. A Wholly Owned Subsidiary Of Sonicwall, Inc. | Message classification using a summary |
US7289956B2 (en) * | 2003-05-27 | 2007-10-30 | Microsoft Corporation | System and method for user modeling to enhance named entity recognition |
JP2006277103A (en) * | 2005-03-28 | 2006-10-12 | Fuji Xerox Co Ltd | Document translating method and its device |
US8131722B2 (en) * | 2006-11-20 | 2012-03-06 | Ebay Inc. | Search clustering |
US8620936B2 (en) * | 2008-05-05 | 2013-12-31 | The Boeing Company | System and method for a data dictionary |
US10291492B2 (en) | 2012-08-15 | 2019-05-14 | Evidon, Inc. | Systems and methods for discovering sources of online content |
CN105224520B (en) * | 2015-09-28 | 2018-03-13 | 北京信息科技大学 | A kind of Chinese patent document term automatic identifying method |
US10049108B2 (en) * | 2016-12-09 | 2018-08-14 | International Business Machines Corporation | Identification and translation of idioms |
US9916307B1 (en) * | 2016-12-09 | 2018-03-13 | International Business Machines Corporation | Dynamic translation of idioms |
US10055401B2 (en) * | 2016-12-09 | 2018-08-21 | International Business Machines Corporation | Identification and processing of idioms in an electronic environment |
CN110032630B (en) * | 2019-03-12 | 2023-04-18 | 创新先进技术有限公司 | Dialectical recommendation device and method and model training device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5384703A (en) * | 1993-07-02 | 1995-01-24 | Xerox Corporation | Method and apparatus for summarizing documents according to theme |
US5638543A (en) * | 1993-06-03 | 1997-06-10 | Xerox Corporation | Method and apparatus for automatic document summarization |
US6098034A (en) * | 1996-03-18 | 2000-08-01 | Expert Ease Development, Ltd. | Method for standardizing phrasing in a document |
US20030014448A1 (en) * | 2001-07-13 | 2003-01-16 | Maria Castellanos | Method and system for normalizing dirty text in a document |
US7020587B1 (en) * | 2000-06-30 | 2006-03-28 | Microsoft Corporation | Method and apparatus for generating and managing a language model data structure |
Non-Patent Citations (11)
Title |
---|
Chien, Lee-Feng. "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval", Annual ACM Conference on Research and Development in Information Retrieval, 1997, pp. 50-58. * |
Edward M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the Association for Computing Machinery, vol. 23, No. 2, Apr. 1976, pp. 262-272. |
Esko Ukkonen, Constructing Suffix Trees On-line In Linear Time, University of Helsinki, Helsinki, Finland. |
Esko Ukkonen, On-line Construction of suffix trees, Algorithmica, University of Helsinki, Finland. |
Hsin-Hsi Chen, et al., Description Of The NTU System Used For MET2, National Taiwan University, Taipei, Taiwan. |
Jagadish, H. Ng, R. Srivastava, D. "Substring selectivity estimation" Symposium on Principles of Database Systems pp. 249-260, 1999. * |
Mark Nelson, Fast String Searching With Suffix Trees, Dr. Dobb's Journal, Aug. 1996. |
Shihong Yu, et al., Description of The Kent Ridge Digital Labs System Used For MUC-7, Kent Ridge Digital Labs, Singapore. |
Walter Daelemans, et al., Rapid Development Of NLP Modules With Memory-Based Learning, ILK Computational Linguistics, Tilburg University, Tilburg, The Netherlands. |
Walter Daelemans, et al., TiMBL: Tilburg Memory-Based Learner, version 5.1, Reference Guide, ILK Technical Report-ILK 04-02, Tilburg University, Dec. 31, 2004, Tilburg, The Netherlands. |
Yamamoto, M. Church, K. "Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus" Computational Linguistics vol. 27, issue 1, Mar. 2001, pp. 1-30. * |
Cited By (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8078452B2 (en) * | 2004-04-06 | 2011-12-13 | Educational Testing Service | Lexical association metric for knowledge-free extraction of phrasal terms |
US20050222837A1 (en) * | 2004-04-06 | 2005-10-06 | Paul Deane | Lexical association metric for knowledge-free extraction of phrasal terms |
US7739103B2 (en) * | 2004-04-06 | 2010-06-15 | Educational Testing Service | Lexical association metric for knowledge-free extraction of phrasal terms |
US20100250238A1 (en) * | 2004-04-06 | 2010-09-30 | Educational Testing Service | Lexical Association Metric for Knowledge-Free Extraction of Phrasal Terms |
US20070143282A1 (en) * | 2005-03-31 | 2007-06-21 | Betz Jonathan T | Anchor text summarization for corroboration |
US8682913B1 (en) | 2005-03-31 | 2014-03-25 | Google Inc. | Corroborating facts extracted from multiple sources |
US8650175B2 (en) | 2005-03-31 | 2014-02-11 | Google Inc. | User interface for facts query engine with snippets from information sources that include query terms and answer terms |
US9208229B2 (en) | 2005-03-31 | 2015-12-08 | Google Inc. | Anchor text summarization for corroboration |
US9672205B2 (en) * | 2005-05-05 | 2017-06-06 | Cxense Asa | Methods and systems related to information extraction |
US20160140104A1 (en) * | 2005-05-05 | 2016-05-19 | Cxense Asa | Methods and systems related to information extraction |
US8996470B1 (en) | 2005-05-31 | 2015-03-31 | Google Inc. | System for ensuring the internal consistency of a fact repository |
US9558186B2 (en) | 2005-05-31 | 2017-01-31 | Google Inc. | Unsupervised extraction of facts |
US8825471B2 (en) | 2005-05-31 | 2014-09-02 | Google Inc. | Unsupervised extraction of facts |
US20070150800A1 (en) * | 2005-05-31 | 2007-06-28 | Betz Jonathan T | Unsupervised extraction of facts |
US9092495B2 (en) | 2006-01-27 | 2015-07-28 | Google Inc. | Automatic object reference identification and linking in a browseable fact repository |
US8244689B2 (en) | 2006-02-17 | 2012-08-14 | Google Inc. | Attribute entropy as a signal in object normalization |
US20070198600A1 (en) * | 2006-02-17 | 2007-08-23 | Betz Jonathan T | Entity normalization via name normalization |
US9710549B2 (en) | 2006-02-17 | 2017-07-18 | Google Inc. | Entity normalization via name normalization |
US8260785B2 (en) | 2006-02-17 | 2012-09-04 | Google Inc. | Automatic object reference identification and linking in a browseable fact repository |
US20070198597A1 (en) * | 2006-02-17 | 2007-08-23 | Betz Jonathan T | Attribute entropy as a signal in object normalization |
US7991797B2 (en) | 2006-02-17 | 2011-08-02 | Google Inc. | ID persistence through normalization |
US10223406B2 (en) | 2006-02-17 | 2019-03-05 | Google Llc | Entity normalization via name normalization |
US8682891B2 (en) | 2006-02-17 | 2014-03-25 | Google Inc. | Automatic object reference identification and linking in a browseable fact repository |
US8700568B2 (en) | 2006-02-17 | 2014-04-15 | Google Inc. | Entity normalization via name normalization |
US8751498B2 (en) | 2006-10-20 | 2014-06-10 | Google Inc. | Finding and disambiguating references to entities on web pages |
US9760570B2 (en) | 2006-10-20 | 2017-09-12 | Google Inc. | Finding and disambiguating references to entities on web pages |
US8122026B1 (en) | 2006-10-20 | 2012-02-21 | Google Inc. | Finding and disambiguating references to entities on web pages |
US20080162113A1 (en) * | 2006-12-28 | 2008-07-03 | Dargan John P | Method and Apparatus for for Predicting Text |
US8195448B2 (en) * | 2006-12-28 | 2012-06-05 | John Paisley Dargan | Method and apparatus for predicting text |
US9892132B2 (en) | 2007-03-14 | 2018-02-13 | Google Llc | Determining geographic locations for place names in a fact repository |
US8347202B1 (en) | 2007-03-14 | 2013-01-01 | Google Inc. | Determining geographic locations for place names in a fact repository |
US8239350B1 (en) | 2007-05-08 | 2012-08-07 | Google Inc. | Date ambiguity resolution |
US7966291B1 (en) | 2007-06-26 | 2011-06-21 | Google Inc. | Fact-based object merging |
US7970766B1 (en) | 2007-07-23 | 2011-06-28 | Google Inc. | Entity type assignment |
US8738643B1 (en) | 2007-08-02 | 2014-05-27 | Google Inc. | Learning synonymous object names from anchor texts |
US7792837B1 (en) * | 2007-11-14 | 2010-09-07 | Google Inc. | Entity name recognition |
US8812435B1 (en) | 2007-11-16 | 2014-08-19 | Google Inc. | Learning objects and facts from documents |
US20120109638A1 (en) * | 2010-10-27 | 2012-05-03 | Hon Hai Precision Industry Co., Ltd. | Electronic device and method for extracting component names using the same |
US9508054B2 (en) | 2011-07-19 | 2016-11-29 | Slice Technologies, Inc. | Extracting purchase-related information from electronic messages |
US9641474B2 (en) | 2011-07-19 | 2017-05-02 | Slice Technologies, Inc. | Aggregation of emailed product order and shipping information |
US9563915B2 (en) | 2011-07-19 | 2017-02-07 | Slice Technologies, Inc. | Extracting purchase-related information from digital documents |
US9846902B2 (en) | 2011-07-19 | 2017-12-19 | Slice Technologies, Inc. | Augmented aggregation of emailed product order and shipping information |
US9575958B1 (en) * | 2013-05-02 | 2017-02-21 | Athena Ann Smyros | Differentiation testing |
US20160085742A1 (en) * | 2014-09-23 | 2016-03-24 | Kaybus, Inc. | Automated collective term and phrase index |
US9864741B2 (en) * | 2014-09-23 | 2018-01-09 | Prysm, Inc. | Automated collective term and phrase index |
US9875486B2 (en) | 2014-10-21 | 2018-01-23 | Slice Technologies, Inc. | Extracting product purchase information from electronic messages |
US9563904B2 (en) | 2014-10-21 | 2017-02-07 | Slice Technologies, Inc. | Extracting product purchase information from electronic messages |
US11032223B2 (en) | 2017-05-17 | 2021-06-08 | Rakuten Marketing Llc | Filtering electronic messages |
US11803883B2 (en) | 2018-01-29 | 2023-10-31 | Nielsen Consumer Llc | Quality assurance for labeled training data |
US20230118640A1 (en) * | 2020-03-25 | 2023-04-20 | Metis Ip (Suzhou) Llc | Methods and systems for extracting self-created terms in professional area |
Also Published As
Publication number | Publication date |
---|---|
US20030083862A1 (en) | 2003-05-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7197449B2 (en) | | Method for extracting name entities and jargon terms using a suffix tree data structure |
US5890103A (en) | | Method and apparatus for improved tokenization of natural language text |
US8660834B2 (en) | | User input classification |
US6269189B1 (en) | | Finding selected character strings in text and providing information relating to the selected character strings |
Cohen et al. | | Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods |
Kiss et al. | | Unsupervised multilingual sentence boundary detection |
US8510097B2 (en) | | Region-matching transducers for text-characterization |
US20070011132A1 (en) | | Named entity translation |
JPH06110948A (en) | | How to identify, search and classify documents |
Oflazer et al. | | Bootstrapping morphological analyzers by combining human elicitation and machine learning |
Doush et al. | | A novel Arabic OCR post-processing using rule-based and word context techniques |
Boros et al. | | Assessing the impact of OCR noise on multilingual event detection over digitised documents |
U Rahman | | Towards Sindhi corpus construction |
Patil et al. | | Issues and challenges in marathi named entity recognition |
Zhang et al. | | A trainable method for extracting Chinese entity names and their relations |
Hassler et al. | | Text preparation through extended tokenization |
Wong et al. | | iSentenizer‐μ: Multilingual Sentence Boundary Detection Model |
US20070179932A1 (en) | | Method for finding data, research engine and microprocessor therefor |
Ando et al. | | Mostly-unsupervised statistical segmentation of Japanese kanji sequences |
Azmi et al. | | Light diacritic restoration to disambiguate homographs in modern Arabic texts |
JP2003323425A (en) | | Bilingual dictionary creation device, translation device, bilingual dictionary creation program, and translation program |
Alian et al. | | Arabic real time entity resolution using inverted indexing |
Weiss et al. | | From textual information to numerical vectors |
Dave et al. | | A Systematic Review of Stemmers of Indian and Non-Indian Vernacular Languages |
Chidiebere et al. | | Analysis and representation of Igbo text document for a text-based system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AS | Assignment | Owner name: INTEL CORPORATION, CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HU, ZENGJIAN;ZHANG, YIMIN;ZHOU, JOE F.;REEL/FRAME:012823/0880;SIGNING DATES FROM 20020227 TO 20020312 |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | CC | Certificate of correction | |
| | REMI | Maintenance fee reminder mailed | |
| | FPAY | Fee payment | Year of fee payment: 4 |
| | SULP | Surcharge for late payment | |
| | FPAY | Fee payment | Year of fee payment: 8 |
| | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20190327 |