Data Types

There are two core data types in litstudy: Document and DocumentSet.

Document is an abstract base class (ABC) that provides access to the metadata of documents in a unified way. Different backends provide their own implementations of this class (for example, ScopusDocument, BibTexDocument, etc.).

DocumentSet is a set of Document objects. All set operations are supported, making it possible to create a new set from existing sets. For instance, it is possible to load documents from two sources (obtaining two DocumentSets) and merge them (obtaining one large DocumentSet).

class litstudy.types.DocumentSet(docs, data=None)

Represents a set of documents.

DocumentSet stores a list of Document objects. Optionally, a pandas data frame can be provided which stores additional properties on the documents.

All set operations are accepted by DocumentSet (union, intersection, difference), allowing for new sets to be created from existing sets.

Note that a DocumentSet is immutable: its content cannot be changed. Most methods below therefore return a new DocumentSet instead of modifying the set in place.

add_property(name: str, values) → DocumentSet

Returns a new set which has an additional property added.

Parameters:

name -- Name of the new property.
values -- List of values. Should be the same length as the number of documents in this set.

Returns:

The new document set.

remove_property(name: str) → DocumentSet

Returns a new set which has the given property removed.

Parameters: name -- Name of the property.
Returns: The new document set.
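
The copy-on-change behaviour of add_property and remove_property can be sketched as follows. TinyDocumentSet is a hypothetical stand-in written for this illustration. It keeps the properties in one dict per document instead of a pandas data frame, so it should be read as a sketch of the pattern and not as litstudy's implementation.

```python
class TinyDocumentSet:
    """Hypothetical stand-in for illustration; not litstudy's DocumentSet."""

    def __init__(self, docs, data=None):
        self.docs = tuple(docs)
        # Assumption: one dict of properties per document
        # (the real class stores properties in a pandas data frame).
        rows = data if data is not None else [{} for _ in self.docs]
        self.data = tuple(dict(row) for row in rows)

    def add_property(self, name, values):
        values = list(values)
        if len(values) != len(self.docs):
            raise ValueError("values must contain one entry per document")
        rows = [{**row, name: value} for row, value in zip(self.data, values)]
        return TinyDocumentSet(self.docs, rows)  # self is left untouched

    def remove_property(self, name):
        rows = [{k: v for k, v in row.items() if k != name} for row in self.data]
        return TinyDocumentSet(self.docs, rows)  # self is left untouched


base = TinyDocumentSet(["doc A", "doc B"])
tagged = base.add_property("relevant", [True, False])
stripped = tagged.remove_property("relevant")
```

Here base still has no properties after both calls; only tagged carries the "relevant" property.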

filter_docs(predicate) → DocumentSet

Returns a new set containing only the documents for which the provided predicate returns True.

Parameters: predicate -- A function Document -> bool.
Returns: The new document set.

filter(predicate) → DocumentSet

Returns a new set containing only the documents for which the provided predicate returns True.

Parameters: predicate -- A function (Document, dict) -> bool. The provided dict stores the properties of the document.
Returns: The new document set.
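
The two predicate shapes can be illustrated without any backend. In this sketch the SimpleNamespace objects, the titles, and the "relevant" property are hypothetical stand-ins for Document objects and their property dicts; the list comprehensions only mimic what filter_docs and filter select.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Document objects and their property dicts.
docs = [
    SimpleNamespace(title="Graph mining on GPUs", publication_year=2015),
    SimpleNamespace(title="Scalable topic models", publication_year=2020),
    SimpleNamespace(title="Untitled preprint", publication_year=None),
    SimpleNamespace(title="Citation networks", publication_year=2021),
]
props = [{"relevant": True}, {"relevant": True}, {"relevant": True}, {"relevant": False}]


def is_recent(doc):
    # Shape expected by filter_docs: Document -> bool.
    # publication_year may be None, so guard before comparing.
    year = doc.publication_year
    return year is not None and year >= 2018


def is_recent_and_relevant(doc, properties):
    # Shape expected by filter: (Document, dict) -> bool.
    return is_recent(doc) and properties.get("relevant", False)


kept_by_doc = [d.title for d in docs if is_recent(d)]
kept_by_both = [d.title for d, p in zip(docs, props) if is_recent_and_relevant(d, p)]
```

With a real DocumentSet, the same functions would be passed directly, as in filter_docs(is_recent) and filter(is_recent_and_relevant).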

select(indices) → DocumentSet

Returns a new set which contains only the documents at the provided indices.

Parameters: indices -- Any input accepted by pandas.DataFrame.iloc, such as a list of integers.
Returns: The new document set.

intersect(other: DocumentSet) → DocumentSet

Returns a new set which contains the documents provided in both self and other. This is also available as the & operator.

Returns: The new document set.

difference(other: DocumentSet) → DocumentSet

Returns a new set which contains the documents provided in self but not in other. This is also available as the - operator.

Returns: The new document set.

union(other: DocumentSet) → DocumentSet

Returns a new set which contains the documents provided in either self or other. Duplicate documents in other that also appear in self are discarded. This is also available as the | operator.

Returns: The new document set.

concat(other: DocumentSet) → DocumentSet

Returns a new set which contains all documents of both self and other. Duplicate documents are not removed; see union for that. This is also available as the + operator.

Returns: The new document set.
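
The difference between intersect, difference, union, and concat can be sketched with plain lists. The helper functions below are illustrative only, and plain string equality stands in for the document matching that litstudy performs; the labels are made up.

```python
def concat(a, b):
    # Keep everything, duplicates included (the + operator).
    return a + b


def union(a, b):
    # Keep a, then add only the entries of b that a lacks (the | operator).
    return a + [x for x in b if x not in a]


def intersect(a, b):
    # Keep the entries of a that also occur in b (the & operator).
    return [x for x in a if x in b]


def difference(a, b):
    # Keep the entries of a that do not occur in b (the - operator).
    return [x for x in a if x not in b]


# Strings stand in for documents loaded from two sources.
first_source = ["paper A", "paper B"]
second_source = ["paper B", "paper C"]

merged = union(first_source, second_source)
stacked = concat(first_source, second_source)
shared = intersect(first_source, second_source)
only_first = difference(first_source, second_source)
```

In this sketch merged holds three entries while stacked holds four, because concat keeps "paper B" twice.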

unique() → DocumentSet

Returns a new set which has all duplicate documents removed.

Returns: The new document set.

sample(n, seed=0) → DocumentSet

Returns a new set which contains n randomly chosen documents from self.

Returns: The new document set.
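
A seed parameter with a fixed default usually means that repeated calls draw the same documents. The sketch below shows that idea with Python's random module. Whether litstudy uses this generator internally is an assumption, and sample_items is a hypothetical helper.

```python
import random


def sample_items(items, n, seed=0):
    # A private generator seeded on every call makes the draw repeatable.
    rng = random.Random(seed)
    return rng.sample(list(items), n)


titles = [f"paper {i}" for i in range(20)]
first_draw = sample_items(titles, 5)
second_draw = sample_items(titles, 5)          # same seed, same draw
other_draw = sample_items(titles, 5, seed=1)   # a different seed for a different draw
```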

itertuples(): Returns an iterator over (Document, dict) tuples, where the dict contains the properties of this document.

class litstudy.types.DocumentIdentifier(title, **attr)

Represents an identifier for a document.

Uniquely identifying a scientific document is often difficult since a single document might have multiple identifiers assigned to it (e.g., DOI, PubMed ID, Scopus ID, SemanticScholar ID) and not all data sources might provide all these identifiers. This class stores all possible identifiers that a document has.

property title: str | None: Returns the title.

property doi: str | None: Returns the DOI (example: 10.1093/ajae/aaq063).

property pubmed: str | None: Returns the PubMed ID.

property arxivid: str | None: Returns the arXiv ID.

property scopusid: str | None: Returns the Scopus ID.

property s2id: str | None: Returns the Semantic Scholar ID.

matches(other: DocumentIdentifier) → bool

Returns True if and only if these two identifiers are equivalent.

Two documents are considered to be equivalent if all identifiers they have in common are equal. For example, if both documents have a DOI then these should be the same. If two documents have no identifier in common, a fuzzy match based on the title is performed.
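
The matching rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: identifiers are represented as plain dicts, identifiers_match is a hypothetical helper, and the fuzzy title comparison uses difflib with an arbitrary threshold of 0.9. How litstudy compares titles internally is not specified here.

```python
from difflib import SequenceMatcher

ID_FIELDS = ("doi", "pubmed", "arxivid", "scopusid", "s2id")


def identifiers_match(a, b, title_threshold=0.9):
    # Identifiers that both sides actually provide.
    shared = [k for k in ID_FIELDS if a.get(k) is not None and b.get(k) is not None]
    if shared:
        # Every shared identifier must agree.
        return all(a[k] == b[k] for k in shared)

    # No shared identifier: fall back to a fuzzy title comparison.
    # Assumption: case-insensitive similarity ratio with an arbitrary cutoff.
    title_a, title_b = a.get("title"), b.get("title")
    if not title_a or not title_b:
        return False
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= title_threshold


# Placeholder identifier values, made up for the example.
with_doi = {"title": "Deep learning for graphs", "doi": "10.1000/a", "pubmed": "111"}
same_doi = {"title": "A different title", "doi": "10.1000/a"}
other_doi = {"title": "Deep learning for graphs", "doi": "10.1000/b"}
title_only = {"title": "Deep Learning for Graphs"}
unrelated = {"title": "Soil chemistry of wetlands"}
```

In this sketch with_doi matches same_doi through the shared DOI even though the titles differ, while title_only can only be compared by title.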

merge(other) → DocumentIdentifier: Returns a new DocumentIdentifier which adds the identifiers of other to self.
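
Merging can be pictured as filling in the identifiers that one side lacks. In this sketch identifiers are plain dicts and merge_identifiers is a hypothetical helper. Keeping the value of self when both sides provide the same field is an assumption, since the precedence is not stated above.

```python
def merge_identifiers(a, b):
    merged = dict(a)  # copy, so that a is not modified
    for key, value in b.items():
        # Assumption: values already present in a take precedence.
        if merged.get(key) is None and value is not None:
            merged[key] = value
    return merged


# Placeholder identifier values, made up for the example.
from_bibtex = {"title": "Deep learning for graphs", "doi": "10.1000/a", "pubmed": None}
from_api = {"title": "Deep learning for graphs", "pubmed": "111", "s2id": "abc"}

combined = merge_identifiers(from_bibtex, from_api)
```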

class litstudy.types.Document(identifier: DocumentIdentifier)

Stores the metadata of a document.

This is an interface whose methods can be overridden by child classes. Any method that a child class does not override returns None.
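
The override pattern can be sketched with Python's abc module. SketchDocument and InMemoryDocument are hypothetical classes written for this illustration. They omit the identifier argument and most properties of the real Document class, and only show how abstract properties and None defaults interact.

```python
from abc import ABC, abstractmethod


class SketchDocument(ABC):
    """Hypothetical stand-in mirroring the pattern; not litstudy's Document."""

    @property
    @abstractmethod
    def title(self):
        """Abstract: every backend must provide a title."""

    @property
    def publisher(self):
        return None  # default when a backend has no such field

    @property
    def publication_year(self):
        return None  # default when a backend has no such field


class InMemoryDocument(SketchDocument):
    def __init__(self, title, year=None):
        self._title = title
        self._year = year

    @property
    def title(self):
        return self._title

    @property
    def publication_year(self):
        return self._year


doc = InMemoryDocument("A survey of literature tools", year=2020)
```

Code that consumes documents from several backends should therefore check optional properties for None before using them.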

property id: DocumentIdentifier: The DocumentIdentifier of this document.

abstract property title: str: The title of this document.

abstract property authors: List[Author] | None: The authors of this document.

property affiliations: List[Affiliation] | None: The affiliations associated with the authors of this document.

property publisher: str | None: The publisher of this document.

property language: str | None: The language this document is written in.

property publication_date: date | None: The date of publication.

property publication_year: int | None: The year of publication.

property publication_source: str | None: The name of the publication source (e.g., journal name or conference name).

property source_type: str | None: The type of publication source (e.g., journal, conference proceedings, or book).

property keywords: List[str] | None: The keywords of this document. What exactly constitutes a keyword depends on the data source (author keywords, generated keywords, topic categories), but it should be a list of strings.

property abstract: str | None: The abstract of this document.

property citation_count: int | None: The number of citations that this document received.

property references: List[DocumentIdentifier] | None: The list of other documents that are cited by this document.

property citations: List[DocumentIdentifier] | None: The list of other documents that cite this document.

mentions(term: str) → bool: Returns True if this document mentions the given term in the title, abstract, or keywords.
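
A check of this kind has to tolerate missing fields, since abstract and keywords may be None. The sketch below uses a hypothetical helper, mentions_term, and assumes a case-insensitive substring search; the exact matching rules used by litstudy (for example word boundaries) are not specified here.

```python
def mentions_term(term, title=None, abstract=None, keywords=None):
    needle = term.lower()
    # Collect the searchable fields; keywords may be None or a list of strings.
    fields = [title, abstract, *(keywords or [])]
    # Assumption: plain case-insensitive substring search, skipping empty fields.
    return any(needle in field.lower() for field in fields if field)


in_title = mentions_term("gpu", title="GPU scheduling for stencils")
in_keywords = mentions_term("gpu", title="Fast stencils", keywords=["GPU", "HPC"])
nowhere = mentions_term("gpu", title="Cache-aware sorting", abstract=None, keywords=None)
```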

class litstudy.types.Affiliation

Represents the affiliation of an author.

abstract property name: str: Name of the affiliation.

property city: str | None: City the affiliation is located in.

property country: str | None: Country the affiliation is located in.

class litstudy.types.Author

Represents the author of a document.

abstract property name: str: The name of the author.

property orcid: str | None: The ORCID of the author.

property s2id: str | None: The SemanticScholar ID of the author.

property affiliations: list[Affiliation] | None: The affiliations this author is associated with.