Python port of the PubMatrixR R package.
For every pair of search terms (A, B), it counts how many PubMed or PMC publications mention both. Good for mapping relationships between genes, diseases, and pathways across the literature.
Based on: Becker et al. (2003) PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics 4:61. https://doi.org/10.1186/1471-2105-4-61
- Pairwise literature search — automatically searches every combination of terms from two lists
- PubMed or PMC — query MEDLINE abstracts or PMC full text via NCBI E-utilities
- Heatmap visualisation — overlap-percentage heatmaps with optional hierarchical clustering
- Export to CSV or ODS — results include clickable hyperlinks to the matching PubMed search
- Date filtering — restrict searches to a publication year range
- Flexible input — pass term lists directly, or load them from a text file
- Concurrency — n_workers for parallel queries, respecting NCBI rate limits
- Disk caching — cache_dir persists query results between runs
- Progress tracking — built-in progress bar for long searches
- Gene–disease association studies — explore literature connections between genes and diseases
- Pathway analysis — investigate co-occurrence of genes within or across biological pathways
- Drug–target research — analyse relationships between compounds and potential targets
- Systematic literature reviews — quantify research coverage across multiple topics
- Knowledge gap identification — find under-researched combinations of terms
- Bibliometric analysis — measure research activity in a domain over time
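As a rough illustration of the pairwise search idea, the sketch below builds one NCBI E-utilities ESearch count query per (A, B) pair without sending any requests. The helper name build_pair_queries, the "A AND B" query form, and the choice of parameters are assumptions made for illustration; the package may construct its queries differently.

```python
from itertools import product
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pair_queries(a_terms, b_terms, database="pubmed", daterange=None):
    """Return {(b, a): url} with one ESearch count query per term pair.

    Hypothetical helper: the boolean "a AND b" term and the parameter
    choices are assumptions, not the package's confirmed query format.
    """
    queries = {}
    for a, b in product(a_terms, b_terms):
        params = {
            "db": database,
            "term": f"{a} AND {b}",
            "rettype": "count",  # ask only for the number of hits
            "retmode": "json",
        }
        if daterange is not None:
            # Restrict by publication date via ESearch's date parameters.
            params["datetype"] = "pdat"
            params["mindate"] = str(daterange[0])
            params["maxdate"] = str(daterange[1])
        # Key by (row term, column term): rows come from B, columns from A.
        queries[(b, a)] = f"{ESEARCH_URL}?{urlencode(params)}"
    return queries


queries = build_pair_queries(["WNT1", "WNT2"], ["obesity", "diabetes", "cancer"])
print(len(queries))  # 2 x 3 = 6 pairwise queries
print(queries[("obesity", "WNT1")])
```

The number of queries grows as len(A) × len(B), which is why rate limits and caching matter for larger term lists.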
Install from PyPI with your package manager of choice.
pip install pubmatrixpython
uv add pubmatrixpython
pixi add --pypi pubmatrixpython
ODS export requires the optional odfpy dependency:
pip install pubmatrixpython[ods]
Working from a source checkout requires uv. Install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone and install dependencies:
git clone <repo-url>
cd PubMatrixPython
uv sync --all-groups
All uv commands must be run from the project root (PubMatrixPython/), where pyproject.toml lives.
cd /path/to/PubMatrixPython
uv run jupyter lab
Then open any notebook from the notebooks/ folder in the browser.
uv run python
from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
A = ["WNT1", "WNT2", "CTNNB1"]
B = ["obesity", "diabetes", "cancer"]
result = pubmatrix(A=A, B=B)
print(result)
plot_pubmatrix_heatmap(result, title="WNT × Disease")
Create a file my_analysis.py:
from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
A = ["WNT1", "WNT2", "WNT3A", "WNT5A", "CTNNB1"]
B = ["obesity", "diabetes", "cancer", "inflammation"]
result = pubmatrix(
    A=A,
    B=B,
    database="pubmed",
    daterange=[2010, 2024],  # optional date filter
    outfile="results",
    export_format="csv",  # saves results_result.csv with PubMed hyperlinks
)
print(result)
plot_pubmatrix_heatmap(
    result,
    title="WNT Genes × Disease",
    filename="heatmap.png",  # saves to file instead of displaying
)
Run it with:
uv run python my_analysis.py
Create terms.txt:
WNT1
WNT2
CTNNB1
#
obesity
diabetes
cancer
from pubmatrix import pubmatrix_from_file
result = pubmatrix_from_file("terms.txt")
print(result)
uv run python my_analysis.py
Query PubMed and return a pandas.DataFrame (rows = B, cols = A).
pubmatrix(
    A,                   # list of str — column terms
    B,                   # list of str — row terms
    api_key=None,        # NCBI API key (10 req/s vs 3 req/s default)
    database="pubmed",   # "pubmed" or "pmc"
    daterange=None,      # e.g. [2015, 2024]
    outfile=None,        # base filename for export
    export_format=None,  # None | "csv" | "ods"
    n_tries=2,           # retries on network failure
    n_workers=1,         # parallel workers for concurrent queries
    timeout=30,          # HTTP request timeout in seconds
    cache_dir=None,      # directory to cache query results on disk
)
Load terms from a plain-text file and run pubmatrix().
File format:
WNT1
WNT2
#
obesity
diabetes
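To make the file layout concrete, here is a minimal parsing sketch. It assumes that terms before the # line form list A and terms after it form list B, as the examples suggest, and that blank lines are ignored. The function name parse_terms is hypothetical and is not part of the package.

```python
def parse_terms(text):
    """Split term-file text into (A, B) lists around the '#' separator line.

    Assumption: terms before '#' form list A and terms after it form list B;
    blank lines are skipped.
    """
    a_terms, b_terms = [], []
    current = a_terms
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "#":
            # Everything after the separator goes into the second list.
            current = b_terms
            continue
        current.append(line)
    return a_terms, b_terms


sample = "WNT1\nWNT2\n#\nobesity\ndiabetes\n"
A, B = parse_terms(sample)
print(A)  # ['WNT1', 'WNT2']
print(B)  # ['obesity', 'diabetes']
```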
result = pubmatrix_from_file("terms.txt", database="pubmed")
Heatmap of overlap percentages with optional hierarchical clustering. Returns (fig, ax).
fig, ax = plot_pubmatrix_heatmap(
    matrix,              # DataFrame from pubmatrix()
    title="PubMatrix Co-occurrence Heatmap",
    cluster_rows=True,
    cluster_cols=True,
    show_numbers=True,
    color_palette=None,  # list of hex colours
    filename=None,       # save to PNG if set
    width=10, height=8,
    scale_font=True,
    show=False,          # call plt.show() after plotting
)
Quick wrapper around plot_pubmatrix_heatmap() with all defaults. Returns (fig, ax).
When outfile and export_format are set, results are written to
{outfile}_result.{extension} (.csv or .ods). Each cell contains the
publication count and a hyperlink to the matching PubMed search. Row names
come from B, column names from A.
ODS export requires the optional odfpy dependency — see Installation.
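The naming rule above can be sketched in a few lines. The helpers export_path and pubmed_search_link are hypothetical, and the "A AND B" form of the link is an assumption about how the two terms are combined; how the package actually embeds the link in each cell is not shown here.

```python
from pathlib import Path
from urllib.parse import quote_plus


def export_path(outfile, export_format):
    """Build the '{outfile}_result.{extension}' path described above."""
    if export_format not in ("csv", "ods"):
        raise ValueError("export_format must be 'csv' or 'ods'")
    return Path(f"{outfile}_result.{export_format}")


def pubmed_search_link(a, b):
    """Hypothetical helper: a PubMed web search URL for one term pair.

    The 'a AND b' query form is an assumption about how terms are combined.
    """
    return "https://pubmed.ncbi.nlm.nih.gov/?term=" + quote_plus(f"{a} AND {b}")


print(export_path("results", "csv"))  # results_result.csv
print(pubmed_search_link("WNT1", "obesity"))
```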
Without a key: 3 requests/second. With a key: 10 requests/second. Get one at https://account.ncbi.nlm.nih.gov/
result = pubmatrix(A=A, B=B, api_key="YOUR_KEY_HERE")
- Performance notes — rate limits, caching, concurrency
- Troubleshooting — empty results, rate limiting, slow searches
- Full reference notebook — every parameter and feature, with output
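For readers curious how the request limits quoted above (3 per second without a key, 10 with one) can be respected when several workers share one connection budget, the following is a minimal client-side throttle. It is a sketch under stated assumptions, not the package's implementation: the Throttle class and requests_per_second helper are hypothetical names.

```python
import threading
import time


def requests_per_second(api_key=None):
    """Request budget from the limits quoted above: 10/s with a key, else 3/s."""
    return 10 if api_key else 3


class Throttle:
    """Space calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0
        self._lock = threading.Lock()  # lets several workers share one budget

    def wait(self):
        """Block until the next request slot is free, then reserve it."""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed)
            if start > now:
                self._sleep(start - now)
            self._next_allowed = start + self.min_interval


throttle = Throttle(requests_per_second(api_key=None))
print(round(throttle.min_interval, 3))  # 0.333
```

Each worker would call throttle.wait() immediately before sending a request. The clock and sleep arguments are injectable so the spacing logic can be checked without real delays.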
This project is licensed under the MIT License — see LICENSE.md.
If you use PubMatrixPython in your research, please cite:
Becker KG, Hosack DA, Dennis G Jr, Lempicki RA, Bright TJ, Cheadle C, Engel J. PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics. 2003 Dec 10;4:61. https://doi.org/10.1186/1471-2105-4-61
Developers:
- Tyler Laird (Author, original PubMatrixR)
- Enrique Toledo (Author, maintainer)

