Language Processing

class litstudy.nlp.Corpus(docs, filters, max_tokens)

Contains the word-frequency vectors for a set of documents. See build_corpus for more information.

dictionary: The dictionary that maps indices to words (gensim.corpora.Dictionary).

frequencies: List of word frequency vectors. Each vector corresponds to one document and consists of (word_index, frequency) tuples.

litstudy.nlp.build_corpus(docs: DocumentSet, *, remove_words=None, min_word_length=3, min_docs=5, max_docs_ratio=0.75, max_tokens=5000, replace_words=None, custom_bigrams=None, ngram_threshold=None) → Corpus

Build a Corpus object.

This function takes the words from the title/abstract of the given documents, preprocesses the tokens, and returns a corpus consisting of a word frequency vector for each document. The preprocessing stage is highly customizable, so it is worth experimenting with the parameters.

Note that a small document set with no abstracts available might not yield a Corpus, since words are then less likely to occur in more than one document.
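The effect of the document-frequency filters on a small document set can be sketched in plain Python. The helper filter_vocabulary below is hypothetical and is not litstudy's implementation. It only mirrors the documented meaning of min_word_length, min_docs, max_docs_ratio and max_tokens, and it assumes that "most common" means highest document frequency.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_word_length=3, min_docs=5,
                      max_docs_ratio=0.75, max_tokens=5000):
    """Hypothetical helper: return the set of words that survive filtering."""
    num_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    kept = [
        (word, count) for word, count in doc_freq.items()
        if len(word) >= min_word_length
        and count >= min_docs
        and count <= max_docs_ratio * num_docs
    ]
    # Keep only the max_tokens words with the highest document frequency (assumption)
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {word for word, _ in kept[:max_tokens]}

titles = [
    ["deep", "learning", "for", "graphs"],
    ["graph", "learning", "survey"],
    ["learning", "to", "rank"],
]
# With the default min_docs=5, no word can occur in enough of these 3 documents
empty = filter_vocabulary(titles)
# Lowering min_docs and relaxing max_docs_ratio keeps the one shared word
relaxed = filter_vocabulary(titles, min_docs=2, max_docs_ratio=1.0)
```

In this toy setting the default thresholds leave an empty vocabulary, which suggests lowering min_docs when working with only a handful of titles.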

Parameters:

remove_words -- List of words that should be ignored while building the word frequency vectors.
min_word_length -- Words shorter than this are ignored.
min_docs -- Words that occur in fewer than this many documents are ignored.
max_docs_ratio -- Words that occur in more than this fraction of the documents are ignored. Should be a ratio between 0 and 1.
max_tokens -- Only the max_tokens most common tokens are preserved.
replace_words -- Replace words by other words. Must be a dict containing original word to replacement word pairs.
custom_bigrams -- Add custom bigrams. Must be a dict where keys are (first, second) tuples and values are replacements. For example, the key can be ("Big", "Data") and the value "BigData".
ngram_threshold -- Threshold used for n-gram detection. Is passed to gensim.models.phrases.Phrases to detect common n-grams.

Returns:

a Corpus object.
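The shapes expected by replace_words and custom_bigrams can be illustrated with a small stand-alone sketch. The helper apply_replacements is hypothetical. The order of the two steps and the case-sensitive matching are assumptions, and litstudy's own preprocessing may differ.

```python
def apply_replacements(tokens, replace_words=None, custom_bigrams=None):
    """Hypothetical helper: rewrite one list of tokens."""
    replace_words = replace_words or {}
    custom_bigrams = custom_bigrams or {}

    # Single-word replacements first (ordering is an assumption)
    tokens = [replace_words.get(t, t) for t in tokens]

    # Then merge adjacent pairs that match a custom bigram key
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in custom_bigrams:
            merged.append(custom_bigrams[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["Big", "Data", "analyses", "of", "Big", "graphs"]
result = apply_replacements(
    tokens,
    replace_words={"analyses": "analysis"},
    custom_bigrams={("Big", "Data"): "BigData"},
)
```

Only the adjacent pair ("Big", "Data") is merged; the later "Big" followed by "graphs" is left alone.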

class litstudy.nlp.TopicModel(dictionary, doc2topic, topic2token)

Topic model trained by one of the train_*_model functions.

doc2topic: N x T matrix that stores the weights towards each of the T topics for the N documents.

topic2token: T x M matrix that stores the weights towards each of the M tokens for each of the T topics

best_documents_for_topic(topic_id: int, limit=5) → List[int]: Returns the documents that most strongly belong to the given topic.

document_topics(doc_id: int): Returns a numpy array indicating the weights towards the different topics for the given document. These weights sum up to one.

best_token_weights_for_topic(topic_id: int, limit=5): Returns a list of (token, weight) tuples for the tokens that most strongly belong to the given topic.

best_tokens_for_topic(topic_id: int, limit=5): Returns the top tokens that most strongly belong to the given topic.

best_token_for_topic(topic_id: int) → str: Returns the token that most strongly belongs to the given topic.

best_topic_for_token(token) → int: Returns the index of the topic to which the given token most strongly belongs.

best_topic_for_documents() → List[int]: Returns, for each document, the topic to which that document most strongly belongs.
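The lookups above amount to ranking rows or columns of the two weight matrices. The sketch below reproduces a few of them on toy data with plain lists. The function names mirror the methods for readability only; the bodies are assumptions about their behaviour, not litstudy's code, and the token list and weights are made up.

```python
# Toy weights: N=3 documents, T=2 topics, M=4 tokens (all values are made up)
tokens = ["graph", "network", "protein", "gene"]
doc2topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
topic2token = [
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.6],
]

def top_indices(weights, limit):
    """Indices of the `limit` largest weights, strongest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:limit]

def best_documents_for_topic(topic_id, limit=5):
    # Rank documents by their weight in the topic's column of doc2topic
    column = [row[topic_id] for row in doc2topic]
    return top_indices(column, limit)

def best_token_weights_for_topic(topic_id, limit=5):
    # Rank tokens by their weight in the topic's row of topic2token
    row = topic2token[topic_id]
    return [(tokens[i], row[i]) for i in top_indices(row, limit)]

def best_topic_for_token(token):
    # Pick the topic with the largest weight in the token's column
    j = tokens.index(token)
    column = [row[j] for row in topic2token]
    return top_indices(column, 1)[0]

def best_topic_for_documents():
    # Pick, per document, the topic with the largest weight in its row
    return [top_indices(row, 1)[0] for row in doc2topic]
```

Each row of doc2topic in this toy example sums to one, matching the description of document_topics.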

litstudy.nlp.train_nmf_model(corpus: Corpus, num_topics: int, seed=0, max_iter=500) → TopicModel

Train a topic model using NMF.

Parameters:

num_topics -- The number of topics to train.
seed -- The seed used for random number generation.
max_iter -- The maximum number of iterations to use for training. More iterations may improve the results at the cost of longer training times.
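To make the roles of num_topics, seed and max_iter concrete, the following is a minimal pure-Python sketch of non-negative matrix factorization using the classic multiplicative update rules. It is an illustration only: litstudy's own training routine may use a different algorithm, and the helper names (matmul, transpose, frobenius_error, nmf) and the eps parameter are hypothetical. In the sketch, w plays the role of doc2topic and h the role of topic2token.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, num_topics, seed=0, max_iter=500, eps=1e-9):
    """Factor the N x M matrix v into non-negative w (N x T) and h (T x M)."""
    rng = random.Random(seed)  # the seed fixes the random initialization
    n, m = len(v), len(v[0])
    w = [[rng.random() for _ in range(num_topics)] for _ in range(n)]
    h = [[rng.random() for _ in range(m)] for _ in range(num_topics)]
    for _ in range(max_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(num_topics)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(num_topics)]
             for i in range(n)]
    return w, h

# Four documents over four words, with two obvious word groups
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]
w_short, h_short = nmf(v, num_topics=2, max_iter=1)
w_long, h_long = nmf(v, num_topics=2, max_iter=200)
```

Comparing frobenius_error(v, w_short, h_short) with frobenius_error(v, w_long, h_long) shows how additional iterations affect the fit in this toy setting, while reusing the same seed reproduces the same factors.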

litstudy.nlp.train_lda_model(corpus: Corpus, num_topics, seed=0, **kwargs) → TopicModel

Train a topic model using LDA.

Parameters:

num_topics -- The number of topics to train.
seed -- The seed used for random number generation.
kwargs -- Arguments passed to gensim.models.lda.LdaModel (gensim3) or gensim.models.ldamodel.LdaModel (gensim4).

litstudy.nlp.train_elda_model(corpus: Corpus, num_topics, num_models=4, seed=0, **kwargs) → TopicModel

Train a topic model using ensemble LDA.

Parameters:

num_topics -- The number of topics to train.
num_models -- The number of models to train.
seed -- The seed used for random number generation.
kwargs -- Arguments passed to gensim.models.ensemblelda.EnsembleLda (gensim4).

litstudy.nlp.compute_word_distribution(corpus: Corpus, *, limit=None) → DataFrame: Returns a DataFrame that indicates, for each word, the number of documents that mention that word.
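The per-word document count can be sketched directly on the (word_index, frequency) vectors of a corpus. The helper word_document_counts below is hypothetical and returns a plain list rather than a DataFrame. The sort order and the interpretation of limit as a maximum number of rows are assumptions.

```python
from collections import Counter

def word_document_counts(index_to_word, frequencies, limit=None):
    """Hypothetical helper: (word, number_of_documents) pairs, most common first."""
    counts = Counter()
    for vector in frequencies:
        # Each word counts once per document, regardless of its in-document frequency
        counts.update({index for index, freq in vector if freq > 0})
    # Sort by document count (descending), then alphabetically to break ties
    ranked = sorted(counts.items(),
                    key=lambda pair: (-pair[1], index_to_word[pair[0]]))
    return [(index_to_word[i], n) for i, n in ranked[:limit]]

index_to_word = {0: "graph", 1: "search", 2: "index"}
frequencies = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(1, 3)]]
table = word_document_counts(index_to_word, frequencies)
```

A table like this is a convenient starting point for choosing values for min_docs and max_docs_ratio, or for spotting candidates for remove_words.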

litstudy.nlp.generate_topic_cloud(model: TopicModel, topic_id: int, cmap=None, max_font_size=75, background_color='white') → WordCloud

Generate a word cloud for the given topic from the given topic model.

Parameters:

cmap -- The color map used to color the words.
max_font_size -- Size of the word which most strongly belongs to the topic. The other words are scaled accordingly.
background_color -- Background color.

litstudy.nlp.calculate_embedding(corpus: Corpus, *, rank=2, svd_dims=50, perplexity=30, seed=0)

Calculate a document embedding that assigns each document in the corpus an N-d position based on its word usage.

Returns: A list of N-d tuples for the documents in the corpus.