
A basic keyword extractor using NLTK

When analysing a text, we often need to extract its keywords to understand what the document is about. Here is a sample using NLTK: we remove all stopwords (is, on, of, or, the) and punctuation (commas, apostrophes, brackets, etc.) from the text and show the twenty most frequent words.

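import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
with open('dnd.txt', encoding='utf-8') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
# Lowercase before filtering so that "The" also matches the lowercase stopword list
words = (tok.lower() for tok in tokens)
# isalpha() keeps purely alphabetic tokens, which drops punctuation and numbers
fdist = nltk.FreqDist(w for w in words if w.isalpha() and w not in stop_words)
print(fdist.most_common(20))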
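
To see what the filtering steps do without downloading any NLTK data, here is a minimal sketch of the same idea using only the standard library. It is not the NLTK approach itself: a regular expression stands in for `nltk.word_tokenize`, `collections.Counter` stands in for `nltk.FreqDist`, and `STOP_WORDS`, `top_keywords` and the sample sentence are hypothetical names and data made up for illustration. The regex only matches ASCII letters, so it is cruder than the NLTK tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini stopword list (assumption); NLTK's English list is much longer.
STOP_WORDS = {"is", "on", "of", "or", "the", "a", "and"}

def top_keywords(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent non-stopword words in text."""
    # Lowercase first, then keep runs of ASCII letters only,
    # which drops punctuation and digits in one step.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = "The dragon is on the hill. The dragon guards the gold, and the gold is cursed!"
print(top_keywords(sample, n=2))
```

On the sample sentence, the two most frequent words that survive the filter are "dragon" and "gold", each with a count of 2. Swapping in the NLTK tokenizer and stopword list should only require changing the two lines that build `words` and `stop_words`.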

Published in NLP Python
