Data Types

There are two core datatypes in litstudy: Document and DocumentSet.

Document is an abstract base class (ABC) that provides unified access to document metadata. Each backend provides its own implementation of this class (for example, ScopusDocument, BibTexDocument, etc.).

DocumentSet is a set of Document objects. All set operations are supported, making it possible to create a new set from existing sets. For instance, it is possible to load documents from two sources (obtaining two DocumentSets) and merge them (obtaining one large DocumentSet).

class litstudy.types.DocumentSet(docs, data=None)

Represents a set of documents.

DocumentSet stores a list of Document objects. Optionally, a pandas data frame can be provided which stores additional properties on the documents.

All set operations are accepted by DocumentSet (union, intersection, difference), allowing for new sets to be created from existing sets.

Note that a DocumentSet is immutable and its content cannot be changed. Most methods below therefore return a new DocumentSet rather than modifying the set in place.

add_property(name: str, values) DocumentSet

Returns a new set which has an additional property added.

Parameters:
  • name -- Name of the new property.

  • values -- List of values. Should be the same length as the number of documents in this set.

Returns:

The new document set.

remove_property(name: str) DocumentSet

Returns a new set which has the given property removed.

Parameters:

name -- Name of the property.

Returns:

The new document set.
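The copy-on-change behaviour of add_property and remove_property can be illustrated with a minimal sketch. ToyDocumentSet below is a hypothetical stand-in written for this example; it stores properties in a plain dict rather than a pandas data frame and is not litstudy's implementation.

```python
class ToyDocumentSet:
    """Hypothetical stand-in for DocumentSet (illustration only)."""

    def __init__(self, docs, data=None):
        self.docs = tuple(docs)
        # One tuple of values per property name, aligned with self.docs.
        self.data = {name: tuple(vals) for name, vals in (data or {}).items()}

    def add_property(self, name, values):
        values = tuple(values)
        if len(values) != len(self.docs):
            raise ValueError("values must have one entry per document")
        # Build a new set; the original is left untouched.
        return ToyDocumentSet(self.docs, {**self.data, name: values})

    def remove_property(self, name):
        remaining = {k: v for k, v in self.data.items() if k != name}
        return ToyDocumentSet(self.docs, remaining)


docs = ToyDocumentSet(["paper A", "paper B"])
tagged = docs.add_property("relevant", [True, False])
untagged = tagged.remove_property("relevant")
```

In this sketch, docs still has no properties after add_property is called; only the returned object carries the new column.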

filter_docs(predicate) DocumentSet

Returns a new set for which the provided predicate returned True.

Parameters:

predicate -- A function Document -> bool.

Returns:

The new document set.

filter(predicate) DocumentSet

Returns a new set for which the provided predicate returned True.

Parameters:

predicate -- A function Document, dict -> bool. The provided dict stores the properties of the document.

Returns:

The new document set.
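The two filtering methods differ only in what the predicate receives. The sketch below mimics both signatures with plain functions; ToyDoc, filter_with_properties, and the "relevant" property are hypothetical names introduced for illustration, and the helpers operate on lists rather than on a real DocumentSet.

```python
from collections import namedtuple

# Hypothetical stand-in for a Document; only two fields are modelled.
ToyDoc = namedtuple("ToyDoc", ["title", "publication_year"])


def filter_docs(docs, predicate):
    """Keep the documents for which predicate(doc) is true."""
    return [doc for doc in docs if predicate(doc)]


def filter_with_properties(docs, properties, predicate):
    """Keep the documents for which predicate(doc, props) is true."""
    return [doc for doc, props in zip(docs, properties) if predicate(doc, props)]


docs = [ToyDoc("Paper A", 2018), ToyDoc("Paper B", 2021), ToyDoc("Paper C", 2023)]
properties = [{"relevant": True}, {"relevant": False}, {"relevant": True}]

# Predicate on the document only (the filter_docs style).
recent = filter_docs(docs, lambda doc: doc.publication_year >= 2020)

# Predicate on the document and its properties (the filter style).
recent_relevant = filter_with_properties(
    docs,
    properties,
    lambda doc, props: props["relevant"] and doc.publication_year >= 2020,
)
```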

select(indices) DocumentSet

Returns a new set which contains only the documents at the provided indices.

Parameters:

indices -- Any input accepted by pandas.DataFrame.iloc, such as a list of integers.

Returns:

The new document set.

intersect(other: DocumentSet) DocumentSet

Returns a new set which contains the documents provided in both self and other. This is also available as the & operator.

Returns:

The new document set.

difference(other: DocumentSet) DocumentSet

Returns a new set which contains the documents provided in self but not in other. This is also available as the - operator.

Returns:

The new document set.

union(other: DocumentSet) DocumentSet

Returns a new set which contains the documents provided in either self or other. Duplicate documents in other that also appear in self are discarded. This is also available as the | operator.

Returns:

The new document set.

concat(other: DocumentSet) DocumentSet

Returns a new set which contains the documents of self together with those of other. Duplicate documents are not removed; see union for that. This is also available as the + operator.

Returns:

The new document set.
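The difference between intersect, difference, union, and concat is easiest to see side by side. The sketch below uses a hypothetical ToySet class in which documents are plain strings compared with ==; litstudy compares documents by their identifiers instead, and the ordering of results shown here is an assumption of the sketch.

```python
class ToySet:
    """Hypothetical stand-in; documents are plain strings compared with ==."""

    def __init__(self, docs):
        self.docs = list(docs)

    def intersect(self, other):
        return ToySet([d for d in self.docs if d in other.docs])

    def difference(self, other):
        return ToySet([d for d in self.docs if d not in other.docs])

    def union(self, other):
        # Keep everything in self, then add documents of other not yet present.
        extra = [d for d in other.docs if d not in self.docs]
        return ToySet(self.docs + extra)

    def concat(self, other):
        # Plain concatenation: duplicates are kept.
        return ToySet(self.docs + other.docs)

    # Operator aliases matching the ones named in the method descriptions.
    __and__ = intersect
    __sub__ = difference
    __or__ = union
    __add__ = concat


a = ToySet(["A", "B", "C"])
b = ToySet(["B", "C", "D"])

both = (a & b).docs      # ['B', 'C']
only_a = (a - b).docs    # ['A']
merged = (a | b).docs    # ['A', 'B', 'C', 'D']
stacked = (a + b).docs   # ['A', 'B', 'C', 'B', 'C', 'D']
```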

unique() DocumentSet

Returns a new set which has all duplicate documents removed.

Returns:

The new document set.

sample(n, seed=0) DocumentSet

Returns a new set which contains n randomly chosen documents from self.

Returns:

The new document set.
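The seed parameter presumably makes the random draw reproducible. The sketch below shows that idea with Python's standard random module; sample_docs is a hypothetical helper, and which random number generator litstudy uses internally is not stated here.

```python
import random


def sample_docs(docs, n, seed=0):
    """Draw n distinct documents using a generator seeded with `seed`."""
    rng = random.Random(seed)
    # random.Random.sample raises ValueError if n exceeds len(docs).
    return rng.sample(list(docs), n)


docs = [f"paper {i}" for i in range(10)]
first = sample_docs(docs, 3, seed=42)
second = sample_docs(docs, 3, seed=42)
```

With the same seed, both calls in this sketch return the same three documents.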

itertuples()

Returns an iterator over (Document, dict) tuples, where the dict contains the properties of this document.

class litstudy.types.DocumentIdentifier(title, **attr)

Represents an identifier for a document.

Uniquely identifying a scientific document is often difficult since a single document might have multiple identifiers assigned to it (e.g., DOI, PubMed ID, Scopus ID, Semantic Scholar ID) and not all data sources might provide all these identifiers. This class stores all possible identifiers that a document has.

property title: str | None

Returns the title.

property doi: str | None

Returns the DOI (example: 10.1093/ajae/aaq063).

property pubmed: str | None

Returns the PubMed ID.

property arxivid: str | None

Returns the arXiv ID.

property scopusid: str | None

Returns the Scopus ID.

property s2id: str | None

Returns the Semantic Scholar ID.

matches(other: DocumentIdentifier) bool

Returns True if and only if these two identifiers are equivalent.

Two documents are considered to be equivalent if all identifiers they have in common are equal. For example, if both documents have a DOI, then the two DOIs must be the same. If two documents have no identifier in common, a fuzzy match based on the title is performed.
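The matching rule described above can be sketched in a few lines. The function below works on plain dicts instead of DocumentIdentifier objects; the use of difflib.SequenceMatcher and the 0.9 threshold are assumptions made for this illustration, since the fuzzy title comparison that litstudy actually performs is not specified here.

```python
from difflib import SequenceMatcher

# Identifier fields taken from the properties of DocumentIdentifier.
ID_FIELDS = ("doi", "pubmed", "arxivid", "scopusid", "s2id")


def matches(a: dict, b: dict, title_threshold: float = 0.9) -> bool:
    """Sketch of the matching rule; the threshold is a hypothetical parameter."""
    shared = [f for f in ID_FIELDS if a.get(f) is not None and b.get(f) is not None]
    if shared:
        # Every identifier present on both sides must agree.
        return all(a[f] == b[f] for f in shared)

    # No identifier in common: fall back to a fuzzy title comparison.
    title_a, title_b = a.get("title"), b.get("title")
    if not title_a or not title_b:
        return False
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= title_threshold


x = {"title": "Deep Learning", "doi": "10.1/abc", "pubmed": "123"}
y = {"title": "Deep learning.", "doi": "10.1/abc"}
z = {"title": "Deep Learning", "doi": "10.1/xyz"}
w = {"title": "deep learning"}
v = {"title": "Graph databases"}
```

Here x and y match through their shared DOI, x and z do not match despite identical titles because their DOIs differ, and x and w are compared by title only because they share no identifier.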

merge(other) DocumentIdentifier

Returns a new DocumentIdentifier which adds the identifiers of other to those of self.

class litstudy.types.Document(identifier: DocumentIdentifier)

Stores the metadata of a document.

This is an interface whose members can be overridden by child classes. Any member that a child class does not override may therefore return None.
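A backend typically implements the abstract members and overrides whichever optional ones it has data for. The sketch below mirrors that pattern with hypothetical ToyDocument and ToyBibDocument classes. It omits the DocumentIdentifier constructor argument and represents authors as plain strings, so it is an illustration of the pattern and not litstudy's source.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class ToyDocument(ABC):
    """Hypothetical mirror of the Document interface (illustration only)."""

    @property
    @abstractmethod
    def title(self) -> str:
        ...

    @property
    @abstractmethod
    def authors(self) -> Optional[List[str]]:
        ...

    @property
    def publisher(self) -> Optional[str]:
        # Optional metadata defaults to None unless a backend overrides it.
        return None

    @property
    def publication_year(self) -> Optional[int]:
        return None


class ToyBibDocument(ToyDocument):
    """Hypothetical backend that reads metadata from a dict."""

    def __init__(self, entry: dict):
        self.entry = entry

    @property
    def title(self) -> str:
        return self.entry["title"]

    @property
    def authors(self) -> Optional[List[str]]:
        return self.entry.get("authors")

    @property
    def publication_year(self) -> Optional[int]:
        return self.entry.get("year")


doc = ToyBibDocument({"title": "A Study", "year": 2021})
```

Because publisher is not overridden in this sketch, doc.publisher is None, which is why callers should check optional members for None before using them.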

property id: DocumentIdentifier

The DocumentIdentifier of this document.

abstract property title: str

The title of this document.

abstract property authors: List[Author] | None

The authors of this document.

property affiliations: List[Affiliation] | None

The affiliations associated with the authors of this document.

property publisher: str | None

The publisher of this document.

property language: str | None

The language this document is written in.

property publication_date: date | None

The date of publication.

property publication_year: int | None

The year of publication.

property publication_source: str | None

The name of the publication source (e.g., journal name or conference name).

property source_type: str | None

The type of publication source (e.g., journal, conference proceedings, or book).

property keywords: List[str] | None

The keywords of this document. What exactly constitutes a keyword depends on the data source (author keywords, generated keywords, topic categories), but it should be a list of strings.

property abstract: str | None

The abstract of this document.

property citation_count: int | None

The number of citations that this document received.

property references: List[DocumentIdentifier] | None

The list of other documents that are cited by this document.

property citations: List[DocumentIdentifier] | None

The list of other documents that cite this document.

mentions(term: str) bool

Returns True if this document mentions the given term in the title, abstract, or keywords.
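A search across title, abstract, and keywords could look like the sketch below. The standalone mentions function is a hypothetical helper, and the case-insensitive substring matching is an assumption of this example; the exact matching rule used by litstudy (for instance, whether it requires whole words) is not specified here.

```python
import re


def mentions(title, abstract, keywords, term):
    """Return True if `term` occurs in the title, abstract, or any keyword."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    # Any of the fields may be None, so missing fields are skipped.
    fields = [title, abstract, *(keywords or [])]
    return any(text is not None and pattern.search(text) is not None for text in fields)


in_title = mentions("GPU Computing Survey", None, ["hpc"], "gpu")
in_keywords = mentions("GPU Computing Survey", None, ["hpc"], "HPC")
nowhere = mentions("GPU Computing Survey", "We study accelerators.", None, "quantum")
```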

class litstudy.types.Affiliation

Represents the affiliation of an author.

abstract property name: str

Name of the affiliation.

property city: str | None

City the affiliation is located in.

property country: str | None

Country the affiliation is located in.

class litstudy.types.Author

Represents the author of a document.

abstract property name: str

The name of the author.

property orcid: str | None

The ORCID of the author.

property s2id: str | None

The Semantic Scholar ID of the author.

property affiliations: List[Affiliation] | None

The affiliations this author is associated with.