Data Types
There are two core datatypes in litstudy: Document and DocumentSet.
Document is an abstract base class (ABC) that provides access to the metadata of documents in a unified way. Different backends provide their own implementations of this class (for example, ScopusDocument, BibTexDocument, etc.).
DocumentSet is a set of Document objects. All set operations are supported, making it possible to create a new set from existing sets. For instance, it is possible to load documents from two sources (obtaining two DocumentSets) and merge them (obtaining one large DocumentSet).
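The merge described above can be sketched with a small stand-in class. ToySet below is hypothetical and is not litstudy's implementation: it treats documents as plain strings compared with ==, whereas a real DocumentSet compares documents through their identifiers. Only the method and operator names mirror the ones documented for DocumentSet.

```python
class ToySet:
    """Hypothetical stand-in for DocumentSet; documents are plain strings."""

    def __init__(self, docs):
        self.docs = list(docs)

    def union(self, other):
        # Keep self as-is, then add documents of other not already in self.
        extra = [d for d in other.docs if d not in self.docs]
        return ToySet(self.docs + extra)

    def intersect(self, other):
        # Documents present in both sets.
        return ToySet(d for d in self.docs if d in other.docs)

    def difference(self, other):
        # Documents present in self but not in other.
        return ToySet(d for d in self.docs if d not in other.docs)

    def concat(self, other):
        # Plain concatenation; duplicates are kept.
        return ToySet(self.docs + other.docs)

    # Operator aliases, matching the operators documented for DocumentSet.
    __or__ = union
    __and__ = intersect
    __sub__ = difference
    __add__ = concat


scopus = ToySet(["paper A", "paper B"])
bibtex = ToySet(["paper B", "paper C"])

print((scopus | bibtex).docs)  # ['paper A', 'paper B', 'paper C']
print((scopus & bibtex).docs)  # ['paper B']
print((scopus - bibtex).docs)  # ['paper A']
print((scopus + bibtex).docs)  # ['paper A', 'paper B', 'paper B', 'paper C']
```

Every operation returns a new ToySet and leaves its operands untouched, which mirrors the immutability that DocumentSet documents.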
- class litstudy.types.DocumentSet(docs, data=None)
Represents a set of documents.
DocumentSet stores a list of Document objects. Optionally, a pandas data frame can be provided which stores additional properties on the documents.
All set operations are accepted by DocumentSet (union, intersection, difference), allowing for new sets to be created from existing sets.
Note that a DocumentSet is immutable: its content cannot be changed. Most methods below therefore return a new DocumentSet rather than modifying the set in place.
- add_property(name: str, values) DocumentSet
Returns a new set which has an additional property added.
- Parameters:
name -- Name of the new property.
values -- List of values. Should be the same length as the number of documents in this set.
- Returns:
The new document set.
- remove_property(name: str) DocumentSet
Returns a new set which has the given property removed.
- Parameters:
name -- Name of the property.
- Returns:
The new document set.
- filter_docs(predicate) DocumentSet
Returns a new set containing only the documents for which the provided predicate returns True.
- Parameters:
predicate -- A function Document -> bool.
- Returns:
The new document set.
- filter(predicate) DocumentSet
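Returns a new set containing only the documents for which the provided predicate returns True. Unlike filter_docs, the predicate also receives the properties of each document.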
- Parameters:
predicate -- A function Document, dict -> bool. The provided dict stores the properties of the document.
- Returns:
The new document set.
- select(indices) DocumentSet
Returns a new set which contains only the documents at the provided indices.
- Parameters:
indices -- Any input accepted by pandas.DataFrame.iloc, such as a list of integers.
- Returns:
The new document set.
- intersect(other: DocumentSet) DocumentSet
Returns a new set which contains the documents provided in both self and other. This is also available as the & operator.
- Returns:
The new document set.
- difference(other: DocumentSet) DocumentSet
Returns a new set which contains the documents provided in self but not in other. This is also available as the - operator.
- Returns:
The new document set.
- union(other: DocumentSet) DocumentSet
Returns a new set which contains the documents provided in either self or other. Duplicate documents in other that also appear in self are discarded. This is also available as the | operator.
- Returns:
The new document set.
- concat(other: DocumentSet) DocumentSet
Returns a new set which contains the documents of both self and other. Duplicate documents are not removed; see union for that. This is also available as the + operator.
- Returns:
The new document set.
- unique() DocumentSet
Returns a new set which has all duplicate documents removed.
- Returns:
The new document set.
- sample(n, seed=0) DocumentSet
Returns a new set which contains n randomly chosen documents from self.
- Returns:
The new document set.
- itertuples()
Returns an iterator over (Document, dict) tuples, where the dict contains the properties of this document.
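The two predicate shapes accepted by filter_docs and filter can be sketched as follows. This is a minimal illustration that runs without litstudy: the documents are hypothetical SimpleNamespace stand-ins, and the properties are assumed to arrive as a plain dict, as described above. The property name "relevant" is made up for the example.

```python
from types import SimpleNamespace


def is_recent(doc, min_year=2015):
    """Shape expected by filter_docs: Document -> bool."""
    year = doc.publication_year  # may be None if the backend lacks it
    return year is not None and year >= min_year


def is_recent_and_relevant(doc, props):
    """Shape expected by filter: (Document, dict) -> bool."""
    return bool(props.get("relevant")) and is_recent(doc)


# Hypothetical stand-ins for three documents.
docs = [
    SimpleNamespace(title="Old survey", publication_year=2009),
    SimpleNamespace(title="New method", publication_year=2021),
    SimpleNamespace(title="Undated report", publication_year=None),
]

# One value per document, the same length rule that add_property states.
relevant = [True, True, False]

# Mimic the (Document, dict) pairs that itertuples is described to yield.
pairs = [(doc, {"relevant": flag}) for doc, flag in zip(docs, relevant)]

kept_by_doc = [d.title for d in docs if is_recent(d)]
kept_by_props = [d.title for d, p in pairs if is_recent_and_relevant(d, p)]
print(kept_by_doc)    # ['New method']
print(kept_by_props)  # ['New method']
```

With a real DocumentSet held in a variable such as docset (a hypothetical name), the same functions would be passed as docset.filter_docs(is_recent), or as docset.add_property("relevant", relevant).filter(is_recent_and_relevant), following the signatures listed above.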
- class litstudy.types.DocumentIdentifier(title, **attr)
Represents an identifier for a document.
Uniquely identifying a scientific document is often difficult, since a single document might have multiple identifiers assigned to it (e.g., DOI, PubMed ID, Scopus ID, SemanticScholar ID) and not all data sources provide all of these identifiers. This class stores all identifiers that are known for a document.
- property title: str | None
Returns the title.
- property doi: str | None
Returns the DOI (example: 10.1093/ajae/aaq063).
- property pubmed: str | None
Returns the PubMed ID.
- property arxivid: str | None
Returns the arXiv ID.
- property scopusid: str | None
Returns the Scopus ID.
- property s2id: str | None
Returns the Semantic Scholar ID.
- matches(other: DocumentIdentifier) bool
Returns True if and only if these two identifiers are equivalent.
Two documents are considered to be equivalent if all identifiers they have in common are equal. For example, if both documents have a DOI then these should be the same. If two documents have not a single identifier in common, a fuzzy match based on the title is performed.
- merge(other) DocumentIdentifier
Returns a new DocumentIdentifier which adds the identifiers of other to those of self.
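The matching rule described for matches can be sketched in plain Python. This is an illustration under stated assumptions, not litstudy's code: identifiers are represented as dicts, the fuzzy title comparison uses difflib.SequenceMatcher from the standard library, and the 0.9 threshold is an arbitrary choice. The fuzzy matching that litstudy actually performs may differ.

```python
from difflib import SequenceMatcher

# Identifier fields documented on DocumentIdentifier (title is handled apart).
ID_FIELDS = ("doi", "pubmed", "arxivid", "scopusid", "s2id")


def ids_match(a, b, title_threshold=0.9):
    """Hypothetical helper: apply the documented rule to two dicts."""
    shared = [f for f in ID_FIELDS
              if a.get(f) is not None and b.get(f) is not None]
    if shared:
        # Every identifier the two have in common must be equal.
        return all(a[f] == b[f] for f in shared)

    # No identifier in common: fall back to a fuzzy title comparison.
    title_a, title_b = a.get("title"), b.get("title")
    if not title_a or not title_b:
        return False
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= title_threshold


from_scopus = {"title": "Deep learning for code search", "doi": "10.1000/xyz"}
from_pubmed = {"title": "Deep Learning for Code Search.",
               "doi": "10.1000/xyz", "pubmed": "123456"}
other_doi = {"title": "Deep learning for code search", "doi": "10.1000/abc"}
title_only = {"title": "Deep Learning for Code Search"}

print(ids_match(from_scopus, from_pubmed))  # True: the shared DOI is equal
print(ids_match(from_scopus, other_doi))    # False: the shared DOI differs
print(ids_match(from_scopus, title_only))   # True: no shared ID, titles match
```

The DOIs and PubMed ID above are made-up placeholders. Note that a shared identifier takes precedence over the title: two records with identical titles but different DOIs are treated as different documents.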
- class litstudy.types.Document(identifier: DocumentIdentifier)
Stores the metadata of a document.
This is an interface whose properties and methods can be overridden by child classes. Any optional property that a child class does not override returns None.
- property id: DocumentIdentifier
The DocumentIdentifier of this document.
- abstract property title: str
The title of this document.
- property affiliations: List[Affiliation] | None
The affiliations associated with the authors of this document.
- property publisher: str | None
The publisher of this document.
- property language: str | None
The language this document is written in.
- property publication_date: date | None
The date of publication.
- property publication_year: int | None
The year of publication.
- property publication_source: str | None
The name of the publication source (e.g., journal name, conference name).
- property source_type: str | None
The type of publication source (e.g., journal, conference proceedings, book).
- property keywords: List[str] | None
The keywords of this document. What exactly constitutes a keyword depends on the data source (author keywords, generated keywords, topic categories), but it should be a list of strings.
- property abstract: str | None
The abstract of this document.
- property citation_count: int | None
The number of citations that this document received.
- property references: List[DocumentIdentifier] | None
The list of other documents that are cited by this document.
- property citations: List[DocumentIdentifier] | None
The list of other documents that cite this document.
- mentions(term: str) bool
Returns True if this document mentions the given term in the title, abstract, or keywords.
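Because most Document properties may be None, code that consumes them needs to guard against missing values. The sketch below shows one way to do that, using hypothetical SimpleNamespace stand-ins instead of real Document objects. The helper mentions_term only approximates the documented behaviour of mentions: the case-insensitive substring test is an assumption, and the library's own matching may differ.

```python
from collections import Counter
from types import SimpleNamespace


def mentions_term(doc, term):
    """Hypothetical approximation of Document.mentions."""
    needle = term.lower()
    # keywords may be None, so fall back to an empty list.
    fields = [doc.title, doc.abstract, *(doc.keywords or [])]
    # Skip fields that are None or empty before searching them.
    return any(needle in text.lower() for text in fields if text)


def year_histogram(docs):
    """Count documents per publication year, ignoring unknown years."""
    return Counter(d.publication_year for d in docs
                   if d.publication_year is not None)


paper = SimpleNamespace(title="Graph neural networks", abstract=None,
                        keywords=["GNN", "deep learning"],
                        publication_year=2020)
note = SimpleNamespace(title="Workshop notes", abstract="Informal notes.",
                       keywords=None, publication_year=None)

print(mentions_term(paper, "gnn"))   # True
print(mentions_term(note, "gnn"))    # False
print(year_histogram([paper, note])) # Counter({2020: 1})
```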
- class litstudy.types.Affiliation
Represents the affiliation of an author.
- abstract property name: str
Name of the affiliation.
- property city: str | None
City the affiliation is located in.
- property country: str | None
Country the affiliation is located in.
- class litstudy.types.Author
Represents the author of a document.
- abstract property name: str
The name of the author.
- property orcid: str | None
The ORCID of the author.
- property s2id: str | None
The SemanticScholar ID of the author.
- property affiliations: list[Affiliation] | None
The affiliations this author is associated with.
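Both Document.affiliations and Author.affiliations may be None, and so may the country of each Affiliation. A minimal sketch of a helper that tolerates both cases is shown below. The function name countries_of and the SimpleNamespace stand-ins are hypothetical and only mimic the attributes documented above.

```python
from types import SimpleNamespace


def countries_of(affiliations):
    """Return the sorted, de-duplicated countries of a list of affiliations."""
    if affiliations is None:
        # The data source did not provide affiliations at all.
        return []
    return sorted({a.country for a in affiliations if a.country is not None})


# Hypothetical stand-ins for Affiliation objects.
affiliations = [
    SimpleNamespace(name="University A", city="Amsterdam", country="Netherlands"),
    SimpleNamespace(name="Institute B", city="Utrecht", country="Netherlands"),
    SimpleNamespace(name="Lab C", city=None, country=None),
]

print(countries_of(affiliations))  # ['Netherlands']
print(countries_of(None))          # []
```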