Example of using litstudy
This notebook shows an example of how to use litstudy from inside a Jupyter notebook. It shows how to load a dataset, plot statistics, perform topic modeling, do network analysis, and use some more advanced features.
This notebook focuses on the topic of programming models for GPUs. GPUs (Graphics Processing Units) are specialized processors that are used in many data centers and supercomputers for data processing and machine learning. However, programming these devices remains difficult, which is why there is a plethora of research on developing programming models for GPUs.
Imports
[1]:
# Import other libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbs
# Options for plots
plt.rcParams['figure.figsize'] = (10, 6)
sbs.set('paper')
# Import litstudy
path = os.path.abspath(os.path.join('..'))
if path not in sys.path:
    sys.path.append(path)
import litstudy
Collecting the dataset
For this example, we have queried both IEEE Xplore and Springer Link for "GPU" and "programming model". IEEE Xplore gives five CSV files (one per page) and Springer Link gives a single CSV file. We load all files and merge the resulting document sets.
[2]:
# Load the CSV files
docs1 = litstudy.load_ieee_csv('data/ieee_1.csv')
docs2 = litstudy.load_ieee_csv('data/ieee_2.csv')
docs3 = litstudy.load_ieee_csv('data/ieee_3.csv')
docs4 = litstudy.load_ieee_csv('data/ieee_4.csv')
docs5 = litstudy.load_ieee_csv('data/ieee_5.csv')
docs_ieee = docs1 | docs2 | docs3 | docs4 | docs5
print(len(docs_ieee), 'papers loaded from IEEE')
docs_springer = litstudy.load_springer_csv('data/springer.csv')
print(len(docs_springer), 'papers loaded from Springer')
# Merge the two document sets
docs_csv = docs_ieee | docs_springer
print(len(docs_csv), 'papers loaded from CSV')
441 papers loaded from IEEE
1000 papers loaded from Springer
1441 papers loaded from CSV
We can also exclude some papers that we are not interested in. Here, we load a document set from a RIS file and subtract these documents from our original document set.
[3]:
docs_exclude = litstudy.load_ris_file('data/exclude.ris')
docs_remaining = docs_csv - docs_exclude
print(len(docs_exclude), 'papers were excluded')
print(len(docs_remaining), 'papers remaining')
1 papers were excluded
1440 papers remaining
The amount of metadata provided by the CSV files is minimal. To enhance the metadata, we can find the corresponding articles on Scopus using refine_scopus
. This function returns two sets: the set of documents that were found on Scopus and the set of original documents that were not found. There are two options for handling these two sets: (1) merge them back into one set, or (2) discard the documents that were not found. We choose the second option here for simplicity (a sketch of option (1) follows the output below).
[4]:
import logging
logging.getLogger().setLevel(logging.CRITICAL)
docs_scopus, docs_notfound = litstudy.refine_scopus(docs_remaining)
print(len(docs_scopus), 'papers found on Scopus')
print(len(docs_notfound), 'papers were not found and were discarded')
100%|██████████| 1440/1440 [00:03<00:00, 361.20it/s]
1387 papers found on Scopus
53 papers were not found and were discarded
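For reference, option (1) would simply merge the two sets back together with the same | operator used earlier. A minimal sketch (note that the merged set then mixes Scopus-enriched and original CSV metadata):
# Sketch of option (1): keep the documents that were not found on Scopus.
# The merged set mixes Scopus-enriched and original CSV metadata.
docs_all = docs_scopus | docs_notfound
print(len(docs_all), 'papers in the merged set')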
Next, we plot the number of documents per publication year.
[5]:
litstudy.plot_year_histogram(docs_scopus);
In this example, we discover that one document was published in 1997. This document should not be in our set since GPUs were not used for general purpose computing before 2006. We can remove this document by filtering on year of publication.
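Before filtering, we could also inspect the suspicious documents first. A quick sketch, reusing only filter_docs, len, indexing, and the title attribute as they appear elsewhere in this notebook:
# Sketch: list the titles of documents published before 2000.
# The None check guards against documents without a publication year.
suspicious = docs_scopus.filter_docs(
    lambda d: d.publication_year is not None and d.publication_year < 2000
)
for i in range(len(suspicious)):
    print(suspicious[i].title)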
[6]:
docs = docs_scopus.filter_docs(lambda d: d.publication_year >= 2000)
Print how many papers are left
[7]:
print(len(docs), 'papers remaining')
1386 papers remaining
General statistics
litstudy supports plotting many general statistics of the document set as histograms. We show some simple examples below.
[8]:
litstudy.plot_year_histogram(docs, vertical=True);
[9]:
litstudy.plot_affiliation_histogram(docs, limit=15);
[10]:
litstudy.plot_author_histogram(docs);
[11]:
litstudy.plot_language_histogram(docs);
[12]:
litstudy.plot_number_authors_histogram(docs);
[13]:
# These names are long, so we provide short abbreviations.
mapping = {
    "IEEE International parallel and distributed processing symposium IPDPS": "IEEE IPDPS",
    "IEEE International parallel and distributed processing symposium workshops IPDPSW": "IEEE IPDPS Workshops",
}
litstudy.plot_source_histogram(docs, mapper=mapping, limit=15);
[14]:
litstudy.plot_country_histogram(docs, limit=15);
[15]:
litstudy.plot_continent_histogram(docs);
Network analysis
The network below shows an example of a co-citation network. In this type of network, nodes represent documents and edges connect pairs of documents that have been cited together by other papers. The strength of an edge indicates how often the two documents have been cited together. Two papers with a high co-citation strength (i.e., a stronger edge) are usually highly related.
[16]:
litstudy.plot_cocitation_network(docs, max_edges=500)
100%|██████████| 1000/1000 [00:00<00:00, 1752.38it/s]
BarnesHut Approximation took 0.14 seconds
Repulsion forces took 0.32 seconds
Gravitational forces took 0.01 seconds
Attraction forces took 0.01 seconds
AdjustSpeedAndApplyForces step took 0.04 seconds
[16]:
Topic modeling
litstudy supports automatic topic discovery based on the words used in document abstracts. We show an example below. First, we need to build a corpus from the document set. Note that build_corpus supports many arguments to tweak the preprocessing stage of building the corpus. In this example, we pass ngram_threshold=0.8. This argument adds commonly used n-grams (i.e., frequent consecutive words) to the corpus. For instance, artificial and intelligence form a bigram, so the token artificial_intelligence is added to the corpus.
[17]:
corpus = litstudy.build_corpus(docs, ngram_threshold=0.8)
We can compute a word distribution using litstudy.compute_word_distribution, which shows how often each word occurs across all documents. In this example, we focus only on n-grams by selecting tokens that contain a _. We see that words such as artificial_intelligence and trade_offs have indeed been recognized as common bigrams.
[18]:
litstudy.compute_word_distribution(corpus).filter(like='_', axis=0).sort_index()
[18]:
token | count |
---|---|
artificial_intelligence | 13 |
author_exclusive | 41 |
berlin_heidelberg | 83 |
chinese_academy | 6 |
coarse_grained | 16 |
... | ... |
synthetic_aperture | 7 |
trade_offs | 10 |
unified_device | 108 |
xeon_phi | 21 |
zhejiang_university | 6 |
63 rows × 1 columns
Let’s visualize the word distribution from this corpus.
[19]:
plt.figure(figsize=(20, 3))
litstudy.plot_word_distribution(corpus, limit=50, title="Top words", vertical=True, label_rotation=45);
This word distribution looks reasonable. Next, we train an NMF topic model. Topic modeling is a technique from natural language processing for discovering abstract "topics" in a set of documents. We need to manually select the number of desired topics; here we choose 15 topics. It is recommended to experiment with more or fewer topics to obtain topics that are more fine-grained or more coarse-grained (a short sketch after the topic listing below shows one way to do this).
[20]:
num_topics = 15
topic_model = litstudy.train_nmf_model(corpus, num_topics, max_iter=250)
To understand the result of NMF, we can print the top words for each topic.
[21]:
for i in range(num_topics):
    print(f'Topic {i+1}:', topic_model.best_tokens_for_topic(i))
Topic 1: ['cluster', 'mpi', 'node', 'hybrid', 'communication']
Topic 2: ['mapreduce', 'big', 'data', 'hadoop', 'cloud']
Topic 3: ['simulation', 'particle', 'numerical', 'fluid', 'flow']
Topic 4: ['learning', 'network', 'deep', 'deep_learning', 'training']
Topic 5: ['fpga', 'memory', 'access', 'cache', 'bandwidth']
Topic 6: ['openacc', 'compiler', 'openmp', 'directive', 'language']
Topic 7: ['image', 'segmentation', 'algorithm', 'medical', 'sensing']
Topic 8: ['sequence', 'alignment', 'protein', 'database', 'search']
Topic 9: ['video', 'decoding', 'encoding', 'ldpc', 'motion']
Topic 10: ['gpgpu', 'cuda', 'code', 'general_purpose', 'general']
Topic 11: ['energy', 'heterogeneous', 'power', 'consumption', 'systems']
Topic 12: ['graph', 'vertex', 'framework', 'analytics', 'edge']
Topic 13: ['scheduling', 'task', 'heterogeneous', 'resources', 'execution']
Topic 14: ['intel', 'matrix', 'phi', 'xeon', 'cloud']
Topic 15: ['opencl', 'portability', 'benchmark', 'platforms', 'sycl']
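As suggested above, it is worth experimenting with the number of topics. A minimal sketch that reuses train_nmf_model and best_tokens_for_topic exactly as above; the topic counts 5 and 30 are purely illustrative:
# Sketch: train a coarser and a finer model and compare their top tokens.
for k in (5, 30):
    model = litstudy.train_nmf_model(corpus, k, max_iter=250)
    print(f'--- {k} topics ---')
    for i in range(k):
        print(f'Topic {i+1}:', model.best_tokens_for_topic(i))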
An alternative way to visualize the output of NMF is to plot each discovered topic as a word cloud. The size of each word in a cloud indicates the importance of that word for that topic.
[22]:
plt.figure(figsize=(15, 5))
litstudy.plot_topic_clouds(topic_model, ncols=5);
These 15 topics look promising. For example, there is one topic on graphs, one on OpenACC (the open accelerators programming standard), one on OpenCL (the Open Computing Language), one on FPGAs (field-programmable gate arrays), etc.
We can visualize the results as a "landscape" plot. This is a visually appealing way to place documents on a 2D plane. The documents are placed such that similar documents are located close to each other. Note that this is a non-linear embedding, so the distances between documents should not be interpreted linearly.
[23]:
plt.figure(figsize=(20, 20))
litstudy.plot_embedding(corpus, topic_model);
Advanced topic modeling
We can combine the results of topic modeling with the plotting of statistics. Here we show a simple example.
One of the topics appears to be about deep learning. First, we find the id of the topic to which the token deep_learning most strongly belongs.
[24]:
topic_id = topic_model.best_topic_for_token('deep_learning')
Let's print the top 10 papers that most strongly belong to this topic to check the results. We see that these are indeed documents on the topic of deep learning.
[25]:
for doc_id in topic_model.best_documents_for_topic(topic_id, limit=10):
    print(docs[int(doc_id)].title)
High performance networked computing in media, services and information management
What do Programmers Discuss about Deep Learning Frameworks
Sparse evolutionary deep learning with over one million artificial neurons on commodity hardware
Deep learning for intelligent traffic sensing and prediction: recent advances and future challenges
SOLAR: Services-oriented learning architectures: Deep learning as a service
Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment
Network Management 2030: Operations and Control of Network 2030 Services
SOLAR: Services-Oriented Deep Learning Architectures-Deep Learning as a Service
DLPlib: A Library for Deep Learning Processor
Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey
Next, we annotate the document set with a "dl_topic" tag for documents that strongly belong to this topic (i.e., have a weight above a certain threshold).
After this, we define two groups: documents that have the "dl_topic" tag and documents that do not. Now we can, for instance, plot the publications over the years to see whether interest in deep learning has increased or decreased over time.
[26]:
threshold = 0.2
dl_topic = topic_model.doc2topic[:, topic_id] > threshold
docs = docs.add_property('dl_topic', dl_topic)
groups = {
    'deep learning related': 'dl_topic',
    'other': 'not dl_topic',
}
litstudy.plot_year_histogram(docs, groups=groups, stacked=True);
The histogram shows that interest in deep learning has clearly risen over the years. We can quantify this by computing the percentage of documents on deep learning for each year. The table below shows that this percentage has increased from just 3.4% in 2011 to 13.6% in 2021.
[27]:
table = litstudy.compute_year_histogram(docs, groups=groups)
table.div(table.sum(axis=1), axis=0) * 100
[27]:
year | deep learning related | other |
---|---|---|
2005 | 0.000000 | 100.000000 |
2006 | 0.000000 | 100.000000 |
2007 | 0.000000 | 100.000000 |
2008 | 6.250000 | 93.750000 |
2009 | 0.000000 | 100.000000 |
2010 | 2.127660 | 97.872340 |
2011 | 3.409091 | 96.590909 |
2012 | 2.941176 | 97.058824 |
2013 | 3.738318 | 96.261682 |
2014 | 3.305785 | 96.694215 |
2015 | 9.565217 | 90.434783 |
2016 | 7.272727 | 92.727273 |
2017 | 3.448276 | 96.551724 |
2018 | 7.200000 | 92.800000 |
2019 | 9.027778 | 90.972222 |
2020 | 9.734513 | 90.265487 |
2021 | 13.600000 | 86.400000 |
2022 | 14.285714 | 85.714286 |
Alternatively, we can plot the two groups per publication source. We can see that some journals/conferences have a strong focus on deep learning (e.g., "Neural Computing and Applications"), while others have few or no publications on deep learning (e.g., "Journal of Real Time Image Processing").
[28]:
plt.figure(figsize=(10, 10))
litstudy.plot_source_histogram(docs, groups=groups, limit=25, stacked=True);
We can even determine the most popular publication venues for deep learning in our dataset using a few simple Pandas operations. It appears that "Neural Computing and Applications" is the most popular publication venue.
[29]:
# Compute histogram by publication venue
table = litstudy.compute_source_histogram(docs, groups=groups)
# Add column 'total'
table['total'] = table['deep learning related'] + table['other']
# Remove rare venues that have less than 5 publications
table = table[table['total'] >= 5]
# Add column 'ratio'
table['ratio'] = table['deep learning related'] / table['total'] * 100
# Sort by ratio in descending order
table.sort_values(by='ratio', ascending=False)
[29]:
source | deep learning related | other | total | ratio |
---|---|---|---|---|
Neural Computing and Applications | 3 | 3 | 6 | 50.000000 |
Computing | 3 | 12 | 15 | 20.000000 |
IEEE International Symposium on Workload Characterization IISWC | 1 | 4 | 5 | 20.000000 |
Science China Information Sciences | 2 | 8 | 10 | 20.000000 |
IEEE High Performance Extreme Computing Conference HPEC | 1 | 7 | 8 | 12.500000 |
European Physical Journal C | 1 | 7 | 8 | 12.500000 |
Journal of Grid Computing | 1 | 7 | 8 | 12.500000 |
Journal of Computer Science and Technology | 3 | 21 | 24 | 12.500000 |
Multimedia Tools and Applications | 3 | 25 | 28 | 10.714286 |
Journal of Supercomputing | 20 | 205 | 225 | 8.888889 |
BMC Bioinformatics | 2 | 22 | 24 | 8.333333 |
Journal of Big Data | 1 | 11 | 12 | 8.333333 |
Cluster Computing | 3 | 52 | 55 | 5.454545 |
International Journal of Parallel Programming | 5 | 87 | 92 | 5.434783 |
IEEE International Parallel and Distributed Processing Symposium Workshops IPDPSW | 1 | 21 | 22 | 4.545455 |
Journal of Signal Processing Systems | 1 | 42 | 43 | 2.325581 |
Soft Computing | 0 | 16 | 16 | 0.000000 |
BMC Research Notes | 0 | 6 | 6 | 0.000000 |
Journal of the Brazilian Society of Mechanical Sciences and Engineering | 0 | 5 | 5 | 0.000000 |
IEEE International Conference on Cluster Computing CLUSTER | 0 | 5 | 5 | 0.000000 |
IEEE International Symposium on High Performance Computer Architecture HPCA | 0 | 5 | 5 | 0.000000 |
IEEE International Conference on Parallel and Distributed Systems ICPADS | 0 | 5 | 5 | 0.000000 |
Journal of Real Time Image Processing | 0 | 57 | 57 | 0.000000 |
Journal of Scientific Computing | 0 | 5 | 5 | 0.000000 |
Computing and Visualization in Science | 0 | 5 | 5 | 0.000000 |
Visual Computer | 0 | 7 | 7 | 0.000000 |
IEEE ACM International Symposium on Cluster Cloud and Grid Computing CCGrid | 0 | 6 | 6 | 0.000000 |
Euromicro International Conference on Parallel Distributed and Network Based Processing PDP | 0 | 6 | 6 | 0.000000 |
IEEE Transactions on Parallel and Distributed Systems | 0 | 7 | 7 | 0.000000 |
International Conference for High Performance Computing Networking Storage and Analysis SC | 0 | 7 | 7 | 0.000000 |
Frontiers of Computer Science | 0 | 8 | 8 | 0.000000 |
VLDB Journal | 0 | 8 | 8 | 0.000000 |
Computer Science Research and Development | 0 | 10 | 10 | 0.000000 |
IEEE International Parallel and Distributed Processing Symposium IPDPS | 0 | 11 | 11 | 0.000000 |
International Conference on High Performance Computing and Simulation HPCS | 0 | 5 | 5 | 0.000000 |