David Graus

PhD Candidate, Semantic Search in eDiscovery

Simple keyword extraction in Python: choices, choices.

Monday, September 12, 2011

As explained in an earlier post, I am working on a simple method of extracting ‘important words’ from a text entry. The methods I am using at the moment are frequency distributions and word collocations. I’ve bumped into some issues while fine-tuning them. Read on for a short explanation of my approaches and the issues I’ve run into.

Frequency Distribution: POS-tagging y/n?

Extracting keywords by frequency distribution is nothing more than counting words and sorting the list of words by occurrence. Before doing so, I filter stopwords from the text entry. In short, this is how I do it (source code available on GitHub):

» Tokenize the text (using NLTK’s WordPunctTokenizer)
» Lowercase all the words
» ‘Clean’ the list by removing common English stopwords (using NLTK’s stopword list)
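
In code, these steps boil down to something like the sketch below (freqWords is the function used in the examples; the body here is a reconstruction rather than the GitHub source, and it assumes NLTK and its stopwords corpus are installed):

import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords

def freqWords(text, n):
    """Return the n most frequent non-stopword tokens in text."""
    tokens = WordPunctTokenizer().tokenize(text)         # tokenize
    words = [w.lower() for w in tokens]                   # lowercase
    stop = set(stopwords.words('english'))                # NLTK's English stopword list
    words = [w for w in words if w not in stop]           # remove stopwords
    freq = nltk.FreqDist(words)                           # count occurrences
    return [word for word, count in freq.most_common(n)]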

This is straightforward enough. An example of the results (from the Wikipedia page on ‘Apoptosis’):

>>> cyttron.wikiGet('Apoptosis')
Apoptosis in wikiTxt
>>> freqWords(cyttron.wikiTxt,15)
['apoptosis', 'cell', '160', 'apoptotic', 'cells', 'caspase', 'death',
'.&#', 'proteins', 'tnf', 'bcl', 'protein', 'also', 'caspases']

Earlier I was thinking about using POS-tagging (part-of-speech tagging, to identify word types) in order to extract only frequently occurring nouns. I figured that losing relevant adjectives (such as ‘red’ in ‘red blood cell’) could be compensated for by the word-collocation extraction. POS-tagging the tokenized text and retrieving only the most frequent nouns results in:

>>> freqNouns(cyttron.wikiTxt,15)
['apoptosis', 'cell', 'caspase', 'death', 'protein', 'tnf', 'pathway',
'activation', 'membrane', 'p53', 'response', 'family', 'gene', 'greek']
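
A noun-only variant could look roughly like this (again a sketch rather than the actual freqNouns source; using NLTK’s default POS tagger and keeping tokens tagged NN* is an assumption):

import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords

def freqNouns(text, n):
    """Return the n most frequent nouns (NN* POS tags) in text."""
    tokens = WordPunctTokenizer().tokenize(text)
    tagged = nltk.pos_tag(tokens)                          # POS-tag with NLTK's default tagger
    stop = set(stopwords.words('english'))
    nouns = [word.lower() for word, tag in tagged
             if tag.startswith('NN') and word.lower() not in stop]
    return [word for word, count in nltk.FreqDist(nouns).most_common(n)]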

My problem here is that I’m not sure which of the two is ‘better’ (if either), or whether I should use a combination of both. Also, I haven’t decided yet how to handle non-alphabetic words. Initially I planned to use regular expressions to filter out non-alphabetic strings, but I later realized that this wouldn’t make sense in my case: in the above example it would drop ‘p53’, a tumor suppressor protein, which is very relevant.

While playing around with POS-tagging earlier, I noticed the precision was not quite high enough to perform chunk extraction (looking for specific phrases or grammatical constructions). Extracting only nouns does seem to do the job quite well, even if I still miss some nouns and pick up some false positives.

Word Collocations: Stopword filtering y/n?

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. I generate bigram and trigram word collocations, i.e. 2-word and 3-word sequences. My issue here is whether or not to use stopword filtering. Below are the results of the word-collocation function on the same Wikipedia page, the first list being the bigram collocations and the second the trigrams. Example without stopword filtering:

>>> wordCollo(cyttron.wikiTxt,10,clean=False)
['such as', 'cell death', 'of the', 'due to', 'leads to', 'programmed cell',
'has been', 'bone marrow', 'have been', 'an increase']
['adp ribose polymerase', 'amino acid composition', 'anatomist walther flemming',
'boston biologist robert', 'break itself down', 'combining forms preceded',
'count falls below', 'german scientist carl', 'homologous antagonist killer',
'mdm2 complexes displaces']

Example with stopword filtering:

>>> wordCollo(cyttron.wikiTxt,10,clean=True)
['cell death', 'programmed cell', 'bone marrow', 'university aberdeen',
'calcium concentration', 'adenovirus e1b', 'british journal', 'citation needed',
'highly conserved', 'nitric oxide']
['adp ribose polymerase', 'agar gel electrophoresis', 'amino acid composition',
'anatomist walther flemming', 'appearance agar gel', 'awarded sydney brenner',
'boston biologist robert', 'carl vogt first', 'ceases respire aerobically',
'closely enough warrant']

As you can see, there is a lot of garbage in the first example, but it still contains some collocations that do not appear in the cleaned version. Similar to the noun-extraction issue in the previous approach, I wonder whether I should pick one of the two or combine them.
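
For reference, such a wordCollo function can be built on NLTK’s collocation finders along these lines (a sketch: the likelihood-ratio scoring, the frequency filter and the exact behaviour of the clean flag are assumptions, not the original code):

import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

def wordCollo(text, n, clean=False):
    """Print the n 'best' bigram and trigram collocations in text."""
    words = [w.lower() for w in WordPunctTokenizer().tokenize(text)]
    if clean:
        stop = set(stopwords.words('english'))
        words = [w for w in words if w.isalpha() and w not in stop]

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    bi_finder = BigramCollocationFinder.from_words(words)
    tri_finder = TrigramCollocationFinder.from_words(words)
    bi_finder.apply_freq_filter(2)      # ignore pairs that occur only once
    tri_finder.apply_freq_filter(2)

    bigrams = [' '.join(b) for b in bi_finder.nbest(bigram_measures.likelihood_ratio, n)]
    trigrams = [' '.join(t) for t in tri_finder.nbest(trigram_measures.likelihood_ratio, n)]
    print(bigrams)
    print(trigrams)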

In other news

Finding Gensim has been a life-saver! Instead of using difflib to compare two strings, I now use a proper text-similarity metric: cosine similarity. I do so by creating a TF-IDF-weighted corpus out of the (stopword-cleaned) descriptions of the ontology terms I use, and calculating the cosine similarity between an input string and each entry in the corpus. Gensim makes this all a breeze. An example of the output:

>>> wikiGet('alzheimer')
alzheimer in wikiTxt
>>> descMatch(wikiTxt,5)
Label: Alzheimer's disease
Similarity: 0.236387
Description: A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.

Label: vascular dementia
Similarity: 0.192565
Description: A dementia that involves impairments in cognitive function caused by problems in blood vessels that feed the brain.

Label: dementia
Similarity: 0.157553
Description: A cognitive disease resulting from a loss of brain function affecting memory, thinking, language, judgement and behavior.

Label: cognitive disease
Similarity: 0.13909
Description: A disease of mental health that affects cognitive functions including memory processing, perception and problem solving.

Label: encephalitis
Similarity: 0.138719
Description: Encephalitis is a nervous system infectious disease characterized as an acute inflammation of the brain. The usual cause is a viral infection, but bacteria can also cause it. Cases can range from mild to severe. For mild cases, you could have flu-like symptoms. Serious cases can cause severe headache, sudden fever, drowsiness, vomiting, confusion and seizures.

I’m not sure whether the similarity scores it produces mean I’m doing something wrong (none of them is particularly high), but intuitively I would say the results make sense.
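
A minimal gensim sketch of this setup (descMatch is the function from the example above; the toy data, variable names and the use of MatrixSimilarity are assumptions, not the original code):

from gensim import corpora, models, similarities

# Toy stand-ins for the stopword-cleaned, tokenized ontology-term descriptions:
labels = ["Alzheimer's disease", "vascular dementia", "dementia"]
descriptions = [
    "dementia results progressive memory loss impaired thinking".split(),
    "dementia involves impairments cognitive function blood vessels brain".split(),
    "cognitive disease loss brain function affecting memory thinking language".split(),
]

dictionary = corpora.Dictionary(descriptions)                   # term <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in descriptions]  # bag-of-words vectors
tfidf = models.TfidfModel(bow_corpus)                           # TF-IDF weighting
index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                      num_features=len(dictionary))  # cosine-similarity index

def descMatch(text, n):
    """Print the n ontology descriptions most similar to the input text."""
    query = dictionary.doc2bow(text.lower().split())
    sims = index[tfidf[query]]                                  # cosine similarity to each entry
    ranked = sorted(enumerate(sims), key=lambda x: x[1], reverse=True)[:n]
    for doc_id, score in ranked:
        print("Label: %s" % labels[doc_id])
        print("Similarity: %f" % score)
        print("Description: %s\n" % ' '.join(descriptions[doc_id]))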

  • http://twitter.com/seinecle Clement Levallois

    Hi David,
    I’m working on these issues right now, and I find it fascinating how the art of text cleaning can be fine-tuned! I’m very happy with the solution I came up with; here it is in pseudocode:

    1. for each document of the corpus:
       - tokenize
       - toLowerCase, trim
       - delete all non-printable characters with a regex
       - trim again
       - delete multiple white spaces
       - 1.1. loop over each token of this document:
         - lemmatization by replacing plurals with singulars using simple heuristics (it takes me just 15 lines of code)
       - append the resulting string to the global string containing all documents.

    2. extract n-grams (unigrams, bigrams, trigrams, 4-grams) of the global string and count their frequency

    3. remove n-grams with length < 3

    4. remove n-grams which appear just once or twice (an unjustified but reasonable absolute cut-off, but it helps to clean a lot!)

    5. remove stop words
    // there are many criteria here, but the main ones are:
    - if it is a unigram, remove it if it is in the list of stopwords
    - if it is a bigram or above, remove it if any of its tokens belongs to the list of stopwords

    6. keep only the n most frequent n-grams (n depends on the size of your corpus and your goals)

    7. remove redundant n-grams
    // e.g.: if a = "University of" and b = "University of Amsterdam" are both in the list of most frequent n-grams, remove a because it is contained in b, and because it is not n times more frequent than b (I found that an n of 2 or 3 works fine).

    8. count the occurrences of each remaining n-gram in each document

    9. proceed with the rest of the text analysis!

    As you can see, I found that stopword removal should come quite late in the process, and in any case after the detection of n-grams. I also improved my results a lot by fine-tuning my list of stopwords and by creating several lists of stopwords. I believe you should not hesitate to use a long list; in my case I get excellent results with about 5,500 stopwords. This is a topic in itself; we should discuss it some time!
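
    In Python, the pipeline above might look roughly like the sketch below (a reconstruction of the pseudocode, not actual working code from this comment; the stopword list, helper names and cut-off values are placeholders, and the plural heuristic is only a stand-in):

    import re
    from collections import Counter
    from itertools import chain

    STOPWORDS = {"of", "the", "a", "an", "and", "in", "to"}  # placeholder; use a much longer list

    def clean(document):
        """Steps 1-1.1: lowercase, strip non-printable characters and extra whitespace."""
        text = re.sub(r'[^ -~]', ' ', document.lower())   # keep printable ASCII only
        text = re.sub(r'\s+', ' ', text).strip()
        # crude plural -> singular heuristic, standing in for the 15-line lemmatizer
        return [t[:-1] if t.endswith('s') and len(t) > 3 else t for t in text.split()]

    def ngrams(tokens, n):
        return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def extract_terms(documents, top=50):
        tokens = list(chain.from_iterable(clean(d) for d in documents))
        # step 2: unigrams up to 4-grams over the global token stream
        counts = Counter(chain.from_iterable(ngrams(tokens, k) for k in range(1, 5)))
        # steps 3-5: length and frequency cut-offs, then stopword-based removal
        counts = {g: c for g, c in counts.items()
                  if len(g) >= 3 and c > 2
                  and not any(tok in STOPWORDS for tok in g.split())}
        # step 6: keep only the most frequent n-grams
        frequent = sorted(counts, key=counts.get, reverse=True)[:top]
        # step 7: drop an n-gram contained in a longer frequent n-gram,
        # unless it is at least twice as frequent as that longer n-gram
        return [g for g in frequent
                if not any(g != h and g in h and counts[g] < 2 * counts[h] for h in frequent)]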

    • http://bkvirendra.github.com/ Virendra

      A tutorial for this will be really appreciated :)

  • Jayakrishnan B

    Selection of noun phrases will not guarantee that the term is a key term. Is there a mechanism to make sure that the selected noun phrase is a key term?

    • csrgxtu

      use TF-IDF

      • Jayakrishnan

        But will TF-IDF alone ensure that the selected term is a key phrase?
        Can we apply any sort of machine-learning technique to do this?