Language Processing
- class litstudy.nlp.Corpus(docs, filters, max_tokens)
Contains the word-frequency vectors for a set of documents. See build_corpus for more information.
- dictionary
The dictionary that maps indices to words (gensim.corpora.Dictionary).
- frequencies
List of word frequency vectors. Each vector corresponds to one document and consists of (word_index, frequency) tuples.
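Example (sketch; assumes a Corpus named corpus has already been built with build_corpus, documented below):
```python
# Map the (word_index, frequency) tuples of the first document
# back to readable tokens via the gensim dictionary.
vector = corpus.frequencies[0]
for word_index, frequency in vector:
    token = corpus.dictionary[word_index]  # gensim Dictionary lookup
    print(f"{token}: {frequency}")
```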
- litstudy.nlp.build_corpus(docs: DocumentSet, *, remove_words=None, min_word_length=3, min_docs=5, max_docs_ratio=0.75, max_tokens=5000, replace_words=None, custom_bigrams=None, ngram_threshold=None) Corpus
Build a Corpus object.
This function takes the words from the title/abstract of the given documents, preprocesses the tokens, and returns a corpus consisting of a word frequency vector for each document. This preprocessing stage is highly customizable, so it is advisable to experiment with the many parameters.
Note that a small document set, especially one without abstracts, might not yield a useful Corpus, since words are then less likely to occur in more than one document and thus fail the frequency filters.
- Parameters:
remove_words -- list of words that should be ignored while building the word frequency vectors.
min_word_length -- Words shorter than this are ignored.
min_docs -- Words that occur in fewer than this many documents are ignored.
max_docs_ratio -- Words that occur in more than this fraction of the documents are ignored. Should be a ratio between 0 and 1.
max_tokens -- Only this many of the most common tokens are preserved.
replace_words -- Replace words by other words. Must be a dict that maps each original word to its replacement.
custom_bigrams -- Add custom bigrams. Must be a dict where keys are (first, second) tuples and values are replacements. For example, the key can be ("Big", "Data") and the value "BigData".
ngram_threshold -- Threshold used for n-gram detection. It is passed to gensim.models.phrases.Phrases to detect common n-grams.
- Returns:
a Corpus object.
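Example (sketch; assumes docs is an existing DocumentSet loaded elsewhere, and the word lists and replacements below are illustrative placeholders rather than recommended defaults):
```python
import litstudy

# `docs` is assumed to be a litstudy DocumentSet loaded beforehand.
corpus = litstudy.nlp.build_corpus(
    docs,
    remove_words=["method", "approach"],          # illustrative stop words
    min_word_length=3,
    min_docs=2,                                   # relax for small document sets
    replace_words={"learning": "learn"},          # map variants to one form
    custom_bigrams={("Big", "Data"): "BigData"},  # example from the parameter list
)
```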
- class litstudy.nlp.TopicModel(dictionary, doc2topic, topic2token)
Topic model trained by one of the train_*_model functions.
- doc2topic
N x T matrix that stores the weights towards each of the T topics for the N documents.
- topic2token
T x M matrix that stores the weights towards each of the M tokens for each of the T topics.
- best_documents_for_topic(topic_id: int, limit=5) List[int]
Returns the documents that most strongly belong to the given topic.
- document_topics(doc_id: int)
Returns a numpy array indicating the weights towards the different topics for the given document. These weights sum to one.
- best_token_weights_for_topic(topic_id: int, limit=5)
Returns a list of (token, weight) tuples for the tokens that most strongly belong to the given topic.
- best_tokens_for_topic(topic_id: int, limit=5)
Returns the top tokens that most strongly belong to the given topic.
- best_token_for_topic(topic_id: int) str
Returns the token that most strongly belongs to the given topic.
- best_topic_for_token(token) int
Returns the index of the topic that most strongly belongs to the given token.
- best_topic_for_documents() List[int]
Returns, for each document, the index of the topic that most strongly belongs to that document.
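Example (sketch; assumes model is a TopicModel produced by one of the train_*_model functions documented below):
```python
topic_id = 0

# Tokens that characterize the topic, with and without weights.
print(model.best_token_for_topic(topic_id))
print(model.best_tokens_for_topic(topic_id, limit=5))
print(model.best_token_weights_for_topic(topic_id, limit=5))

# Documents most strongly associated with the topic.
print(model.best_documents_for_topic(topic_id, limit=3))

# Per-document topic weights (sum to one) and the dominant topic per document.
weights = model.document_topics(0)
dominant = model.best_topic_for_documents()
```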
- litstudy.nlp.train_nmf_model(corpus: Corpus, num_topics: int, seed=0, max_iter=500) TopicModel
Train a topic model using NMF.
- Parameters:
num_topics -- The number of topics to train.
seed -- The seed used for random number generation.
max_iter -- The maximum number of iterations to use for training. More iterations mean better results, but longer training times.
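Example (sketch; assumes corpus was built with build_corpus as shown above):
```python
model = litstudy.nlp.train_nmf_model(corpus, num_topics=10, seed=0, max_iter=500)

# Print the top tokens for each of the 10 topics.
for topic_id in range(10):
    print(topic_id, model.best_tokens_for_topic(topic_id, limit=5))
```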
- litstudy.nlp.train_lda_model(corpus: Corpus, num_topics, seed=0, **kwargs) TopicModel
Train a topic model using LDA.
- Parameters:
num_topics -- The number of topics to train.
seed -- The seed used for random number generation.
kwargs -- Arguments passed to gensim.models.lda.LdaModel (gensim3) or gensim.models.ldamodel.LdaModel (gensim4).
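Example (sketch; the extra keyword argument is forwarded to gensim's LdaModel and is only an illustration):
```python
# `passes` is a gensim LdaModel argument forwarded via **kwargs.
model = litstudy.nlp.train_lda_model(corpus, num_topics=10, seed=0, passes=5)
```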
- litstudy.nlp.train_elda_model(corpus: Corpus, num_topics, num_models=4, seed=0, **kwargs) TopicModel
Train a topic model using ensemble LDA.
- Parameters:
num_topics -- The number of topics to train.
num_models -- The number of models to train.
seed -- The seed used for random number generation.
kwargs -- Arguments passed to gensim.models.ensemblelda.EnsembleLda (gensim4).
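Example (sketch; again assumes the corpus from build_corpus):
```python
# Train 8 LDA models and combine them into an ensemble topic model.
model = litstudy.nlp.train_elda_model(corpus, num_topics=10, num_models=8, seed=0)
```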
- litstudy.nlp.compute_word_distribution(corpus: Corpus, *, limit=None) DataFrame
Returns a dataframe that indicates, for each word, the number of documents that mention that word.
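Example (sketch; the exact layout of the returned dataframe is not specified here):
```python
dist = litstudy.nlp.compute_word_distribution(corpus)
print(dist.head(25))  # inspect the first rows of the word/document-count table
```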
- litstudy.nlp.generate_topic_cloud(model: TopicModel, topic_id: int, cmap=None, max_font_size=75, background_color='white') WordCloud
Generate a word cloud for the given topic from the given topic model.
- Parameters:
cmap -- The color map used to color the words.
max_font_size -- Font size of the word that most strongly belongs to the topic. The other words are scaled accordingly.
background_color -- Background color.
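Example (sketch; the returned WordCloud can be rendered with matplotlib, assuming model is a trained TopicModel):
```python
import matplotlib.pyplot as plt

cloud = litstudy.nlp.generate_topic_cloud(model, topic_id=0, max_font_size=75)
plt.imshow(cloud, interpolation="bilinear")  # WordCloud objects can be drawn directly
plt.axis("off")
plt.show()
```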