Data Types
There are two core datatypes in litstudy: Document and DocumentSet.
Document is an abstract base class (ABC) that provides access to the metadata of documents in a unified way. Different backends provide their own implementations of this class (for example, ScopusDocument, BibTexDocument, etc.).
DocumentSet is a set of Document objects. All set operations are supported, making it possible to create a new set from existing sets. For instance, it is possible to load documents from two sources (obtaining two DocumentSets) and merge them (obtaining one large DocumentSet).
- class litstudy.types.DocumentSet(docs, data=None)
 Represents a set of documents.
DocumentSet stores a list of Document objects. Optionally, a pandas data frame can be provided which stores additional properties on the documents.
All set operations are accepted by DocumentSet (union, intersection, difference), allowing for new sets to be created from existing sets.
Note that a DocumentSet is immutable and its content cannot be changed. Most methods below therefore return a new DocumentSet rather than performing modifications in place.
- add_property(name: str, values) DocumentSet
 Returns a new set which has an additional property added.
- Parameters:
 name -- Name of the new property.
values -- List of values. Should be the same length as the number of documents in this set.
- Returns:
 The new document set.
- remove_property(name: str) DocumentSet
 Returns a new set which has the given property removed.
- Parameters:
 name -- Name of the property.
- Returns:
 The new document set.
- filter_docs(predicate) DocumentSet
 Returns a new set for which the provided predicate returned True.
- Parameters:
 predicate -- A function Document -> bool.
- Returns:
 The new document set.
- filter(predicate) DocumentSet
 Returns a new set for which the provided predicate returned True.
- Parameters:
 predicate -- A function Document, dict -> bool. The provided dict stores the properties of the document.
- Returns:
 The new document set.
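The two predicate shapes accepted by filter_docs and filter differ only in whether the property dict is passed along. The sketch below writes one predicate of each kind. Since metadata such as publication_year may be None, the predicates guard against that. The SimpleNamespace objects are stand-ins for real Document instances, and the "relevant" property is a hypothetical one, of the kind that could be attached with add_property.

```python
from types import SimpleNamespace


def is_recent(doc, since=2015):
    """Predicate of the form Document -> bool (the shape filter_docs expects)."""
    year = doc.publication_year  # may be None if the backend lacks this field
    return year is not None and year >= since


def is_relevant_and_recent(doc, props):
    """Predicate of the form Document, dict -> bool (the shape filter expects)."""
    # "relevant" is a hypothetical property name used only for illustration.
    return bool(props.get("relevant")) and is_recent(doc)


# Stand-ins for Document objects and their per-document property dicts.
docs = [
    SimpleNamespace(title="A", publication_year=2012),
    SimpleNamespace(title="B", publication_year=2019),
    SimpleNamespace(title="C", publication_year=None),
    SimpleNamespace(title="D", publication_year=2021),
]
props = [{"relevant": True}, {"relevant": False}, {"relevant": True}, {"relevant": True}]

kept_by_doc = [d.title for d in docs if is_recent(d)]
kept_by_props = [d.title for d, p in zip(docs, props) if is_relevant_and_recent(d, p)]
```

With a real DocumentSet named docs, the same functions would presumably be passed as docs.filter_docs(is_recent) and docs.filter(is_relevant_and_recent).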
- select(indices) DocumentSet
 Returns a new set which contains only the documents at the provided indices.
- Parameters:
 indices -- Any input accepted by pandas.DataFrame.iloc, such as a list of integers.
- Returns:
 The new document set.
- intersect(other: DocumentSet) DocumentSet
 Returns a new set which contains the documents provided in both self and other. This is also available as the & operator.
- Returns:
 The new document set.
- difference(other: DocumentSet) DocumentSet
 Returns a new set which contains the documents provided in self but not in other. This is also available as the - operator.
- Returns:
 The new document set.
- union(other: DocumentSet) DocumentSet
 Returns a new set which contains the documents provided in either self or other. Duplicate documents in other that also appear in self are discarded. This is also available as the | operator.
- Returns:
 The new document set.
- concat(other: DocumentSet) DocumentSet
 Returns a new set which contains the documents of both self and other. Duplicate documents are not removed; see union for that behavior. This is also available as the + operator.
- Returns:
 The new document set.
- unique() DocumentSet
 Returns a new set which has all duplicate documents removed.
- Returns:
 The new document set.
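The difference between union, concat, intersect, difference, and unique is easiest to see on a small example. The sketch below models the described semantics on plain lists of strings. It is not litstudy's implementation: real DocumentSets compare documents through their identifiers rather than by string equality, and the order-preserving behavior shown here is an assumption.

```python
def union(a, b):
    """Everything in a, plus the items of b that a does not already contain."""
    return a + [x for x in b if x not in a]


def concat(a, b):
    """Everything in a followed by everything in b; duplicates are kept."""
    return a + b


def intersect(a, b):
    """Items of a that also appear in b."""
    return [x for x in a if x in b]


def difference(a, b):
    """Items of a that do not appear in b."""
    return [x for x in a if x not in b]


def unique(a):
    """Drop repeated items, keeping the first occurrence of each."""
    seen = []
    for x in a:
        if x not in seen:
            seen.append(x)
    return seen


# Two hypothetical sources that share one document.
source_a = ["doc-1", "doc-2", "doc-3"]
source_b = ["doc-3", "doc-4"]

merged = union(source_a, source_b)
stacked = concat(source_a, source_b)
shared = intersect(source_a, source_b)
only_a = difference(source_a, source_b)
```

On real DocumentSets a and b, the corresponding expressions are a | b, a + b, a & b, a - b, and (a + b).unique().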
- sample(n, seed=0) DocumentSet
 Returns a new set which contains n randomly chosen documents from self.
- Returns:
 The new document set.
- itertuples()
 Returns an iterator over (Document, dict) tuples, where the dict contains the properties of this document.
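Because a DocumentSet is immutable, add_property and remove_property hand back a new set and leave the original untouched. The following toy class sketches that pattern, together with itertuples-style iteration. ToySet is a hypothetical stand-in, not the library's class: it keeps properties in plain dicts instead of a pandas data frame, and raising ValueError on a length mismatch is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySet:
    docs: tuple
    props: tuple = ()

    def __post_init__(self):
        # Give every document an empty property dict when none are supplied.
        if not self.props:
            object.__setattr__(self, "props", tuple({} for _ in self.docs))

    def add_property(self, name, values):
        values = list(values)
        if len(values) != len(self.docs):
            raise ValueError("expected one value per document")
        new_props = tuple({**p, name: v} for p, v in zip(self.props, values))
        return ToySet(self.docs, new_props)  # self is left unchanged

    def remove_property(self, name):
        new_props = tuple(
            {k: v for k, v in p.items() if k != name} for p in self.props
        )
        return ToySet(self.docs, new_props)

    def itertuples(self):
        # Yields (document, properties) pairs.
        return zip(self.docs, self.props)


base = ToySet(("paper A", "paper B"))
labelled = base.add_property("relevant", [True, False])
stripped = labelled.remove_property("relevant")
```

Iterating over labelled.itertuples() yields each document next to its property dict, while base still has no properties at all.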
- class litstudy.types.DocumentIdentifier(title, **attr)
 Represents an identifier for a document.
Uniquely identifying a scientific document is often difficult since a single document might have multiple identifiers assigned to it (e.g., DOI, PubMed ID, Scopus ID, SemanticScholar ID) and not all data sources might provide all these identifiers. This class stores all possible identifiers that a document has.
- property title: str | None
 Returns the title.
- property doi: str | None
 Returns the DOI (example: 10.1093/ajae/aaq063).
- property pubmed: str | None
 Returns the PubMed ID.
- property arxivid: str | None
 Returns the arXiv ID.
- property scopusid: str | None
 Returns the Scopus ID.
- property s2id: str | None
 Returns the Semantic Scholar ID.
- matches(other: DocumentIdentifier) bool
 Returns True if and only if these two identifiers are equivalent.
Two documents are considered to be equivalent if all identifiers they have in common are equal. For example, if both documents have a DOI then these should be the same. If two documents have not a single identifier in common, a fuzzy match based on the title is performed.
- merge(other) DocumentIdentifier
 Returns a new DocumentIdentifier which adds the identifiers of other to self.
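The matching rule described for matches can be sketched on plain dicts as follows. This is an illustration of the rule, not litstudy's code. In particular, the fuzzy title comparison via difflib.SequenceMatcher, the 0.9 threshold, and the choice that existing values win during a merge are all assumptions made for the example.

```python
from difflib import SequenceMatcher

ID_KEYS = ("doi", "pubmed", "arxivid", "scopusid", "s2id")


def matches(a, b, title_threshold=0.9):
    """True if all identifiers present on both sides agree."""
    shared = [k for k in ID_KEYS if a.get(k) is not None and b.get(k) is not None]
    if shared:
        return all(a[k] == b[k] for k in shared)
    # No identifier in common: fall back to a fuzzy comparison of the titles.
    title_a, title_b = a.get("title"), b.get("title")
    if not title_a or not title_b:
        return False
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= title_threshold


def merge(a, b):
    """New identifier dict: the fields of b, overridden by the non-None fields of a."""
    merged = dict(b)
    merged.update({k: v for k, v in a.items() if v is not None})
    return merged


x = {"title": "Deep learning for code search", "doi": "10.1093/ajae/aaq063"}
y = {"title": "A differently worded title", "doi": "10.1093/ajae/aaq063", "s2id": "abc"}
z = {"title": "Deep Learning for Code Search", "arxivid": "1234.5678"}
w = {"title": "Unrelated work", "doi": "10.1000/other"}

combined = merge(x, y)
```

Here x matches y through the shared DOI despite the different titles, x matches z through the title alone because no identifier is shared, and x does not match w because the DOIs disagree.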
- class litstudy.types.Document(identifier: DocumentIdentifier)
 Stores the metadata of a document.
This is an interface which provides several methods that can be overridden by child classes. Any method may therefore return None if the child class does not override it.
- property id: DocumentIdentifier
 The DocumentIdentifier of this document.
- abstract property title: str
 The title of this document.
- property affiliations: List[Affiliation] | None
 The affiliations associated with the authors of this document.
- property publisher: str | None
 The publisher of this document.
- property language: str | None
 The language this document is written in.
- property publication_date: date | None
 The date of publication.
- property publication_year: int | None
 The year of publication.
- property publication_source: str | None
 The name of the publication source (e.g., journal name, conference name).
- property source_type: str | None
 The type of publication source (e.g., journal, conference proceedings, book).
- property keywords: List[str] | None
 The keywords of this document. What exactly constitutes a keyword depends on the data source (author keywords, generated keywords, topic categories), but it should be a list of strings.
- property abstract: str | None
 The abstract of this document.
- property citation_count: int | None
 The number of citations that this document received.
- property references: List[DocumentIdentifier] | None
 The list of other documents that are cited by this document.
- property citations: List[DocumentIdentifier] | None
 The list of other documents that cite this document.
- mentions(term: str) bool
 Returns True if this document mentions the given term in the title, abstract, or keywords.
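The interface pattern described for Document (one abstract property, optional properties that default to None, and helpers built on top of them) can be sketched with Python's abc module. SketchDocument and TitleOnlyDocument are hypothetical names, not litstudy classes, and the case-insensitive substring test used for mentions is an assumption about how such a check might work.

```python
from abc import ABC, abstractmethod


class SketchDocument(ABC):
    """Stand-in for an interface with one required and several optional fields."""

    @property
    @abstractmethod
    def title(self):
        """Child classes must provide a title."""

    @property
    def abstract(self):
        return None  # optional: None unless a child class overrides it

    @property
    def keywords(self):
        return None  # optional: None unless a child class overrides it

    def mentions(self, term):
        # Search title, abstract, and keywords, skipping fields that are None.
        needle = term.lower()
        fields = [self.title, self.abstract, *(self.keywords or [])]
        return any(needle in field.lower() for field in fields if field)


class TitleOnlyDocument(SketchDocument):
    """A minimal backend that only knows the title."""

    def __init__(self, title):
        self._title = title

    @property
    def title(self):
        return self._title


doc = TitleOnlyDocument("Energy-efficient GPU computing")
has_gpu = doc.mentions("gpu")
has_quantum = doc.mentions("quantum")
```

Code that consumes documents from mixed backends should therefore check optional properties for None before using them, as mentions does here.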
- class litstudy.types.Affiliation
 Represents the affiliation of an author.
- abstract property name: str
 Name of the affiliation.
- property city: str | None
 City the affiliation is located in.
- property country: str | None
 Country the affiliation is located in.
- class litstudy.types.Author
 Represents the author of a document.
- abstract property name: str
 The name of the author.
- property orcid: str | None
 The ORCID of the author.
- property s2id: str | None
 The SemanticScholar ID of the author.
- property affiliations: list[Affiliation] | None
 The affiliations this author is associated with.