Initial release · snap-stanford/GreaseLM@079acd4 · GitHub

Commit 079acd4

Initial release
1 parent 2d83a5b commit 079acd4

22 files changed

Lines changed: 5328 additions & 0 deletions

.gitignore

Lines changed: 136 additions & 0 deletions

README.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
# GreaseLM: Graph REASoning Enhanced Language Models

This repo provides the source code & data of our paper "GreaseLM: Graph REASoning Enhanced Language Models".

<p align="center">
  <img src="./figs/greaselm.png" width="600" title="GreaseLM model architecture" alt="">
</p>

## Usage
### 1. Dependencies

- [Python](https://www.python.org/) == 3.8
- [PyTorch](https://pytorch.org/get-started/locally/) == 1.8.0
- [transformers](https://github.com/huggingface/transformers/tree/v3.4.0) == 3.4.0
- [torch-geometric](https://pytorch-geometric.readthedocs.io/) == 1.7.0

Run the following commands to create a conda environment (assuming CUDA 10.1):
```bash
conda create -y -n greaselm python=3.8
conda activate greaselm
pip install numpy==1.18.3 tqdm
pip install torch==1.8.0+cu101 torchvision -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==3.4.0 nltk spacy
pip install wandb
conda install -y -c conda-forge tensorboardx
conda install -y -c conda-forge tensorboard

# for torch-geometric
pip install torch-scatter==2.0.7 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-cluster==1.5.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-spline-conv==1.2.1 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-geometric==1.7.0 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
```
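
After installing, it can be useful to confirm that the pinned versions are the ones actually present in the environment. The following is a minimal sketch, not part of the repo: the helper name `check_versions` is hypothetical, and the distribution names used as keys (`torch`, `transformers`, `torch-geometric`) are assumed to match what `pip` registered.

```python
from importlib import metadata

# Pins copied from the dependency list above; keys are assumed PyPI distribution names.
EXPECTED = {
    "torch": "1.8.0",
    "transformers": "3.4.0",
    "torch-geometric": "1.7.0",
}


def check_versions(expected):
    """Return {package: (installed_or_None, wanted, ok)} for each pinned package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu101" when comparing.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, wanted, ok)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} wanted={wanted} [{status}]")
```

A package that is missing is reported with `installed=None` rather than raising, so the script prints a full report in one pass.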


### 2. Download data

Download all the raw data (ConceptNet, CommonsenseQA, OpenBookQA) by running
```
./download_raw_data.sh
```

You can preprocess the raw data by running
```
CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes>
```
You can specify the GPU you want to use at the beginning of the command via `CUDA_VISIBLE_DEVICES=...`. The script will:
* Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types)
* Convert the QA datasets into .jsonl files (e.g., stored in `data/csqa/statement/`)
* Identify all mentioned concepts in the questions and answers
* Extract subgraphs for each q-a pair
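
To make the first step above more concrete, here is a rough sketch of filtering English edges out of a ConceptNet assertions dump and collapsing relation types. It is not the code in `preprocess.py`: the function name `english_edges` is hypothetical, and `MERGED_RELATION` is a small illustrative subset, not the actual 42-to-17 mapping. It assumes the dump is tab-separated with the columns edge URI, relation, start node, end node, JSON metadata, and that English concepts have URIs beginning with `/c/en/`.

```python
import csv
import io

# Illustrative subset only (assumption): the real 42 -> 17 mapping is defined
# by the repo's preprocessing code, not here.
MERGED_RELATION = {
    "/r/AtLocation": "atlocation",
    "/r/LocatedNear": "atlocation",
    "/r/IsA": "isa",
    "/r/InstanceOf": "isa",
}


def english_edges(tsv_file):
    """Yield (merged_relation, head, tail) for English-to-English edges."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed lines
        _, rel, start, end = row[:4]
        # Keep only edges whose two endpoints are both English concepts.
        if not (start.startswith("/c/en/") and end.startswith("/c/en/")):
            continue
        if rel not in MERGED_RELATION:
            continue
        # "/c/en/dog/n" -> "dog": drop the language prefix and any sense tag.
        head = start.split("/")[3]
        tail = end.split("/")[3]
        yield MERGED_RELATION[rel], head, tail


# Tiny in-memory stand-in for the real CSV file.
sample = io.StringIO(
    "/a/1\t/r/IsA\t/c/en/dog/n\t/c/en/animal\t{}\n"
    "/a/2\t/r/LocatedNear\t/c/en/chair\t/c/en/table\t{}\n"
    "/a/3\t/r/IsA\t/c/fr/chien\t/c/en/dog\t{}\n"
)
edges = list(english_edges(sample))
```

In this sample the third row is dropped because its start node is French, and `/r/LocatedNear` is folded into the same merged type as `/r/AtLocation`.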

**TL;DR**. The preprocessing may take a long time; for your convenience, you can download all the processed data [here](https://drive.google.com/drive/folders/1T6B4nou5P3u-6jr0z6e3IkitO8fNVM6f?usp=sharing) into the top-level directory of this repo and run
```
unzip data_preprocessed.zip
```

The resulting file structure should look like this:

```plain
.
├── README.md
└── data/
    ├── cpnet/                 (preprocessed ConceptNet)
    └── csqa/
        ├── train_rand_split.jsonl
        ├── dev_rand_split.jsonl
        ├── test_rand_split_no_answers.jsonl
        ├── statement/             (converted statements)
        ├── grounded/              (grounded entities)
        ├── graphs/                (extracted subgraphs)
        ├── ...
```
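
A quick way to confirm that the data landed in the right place is to check for the entries listed in the tree above. This is a small sketch with hypothetical names (`EXPECTED_PATHS`, `missing_paths`), not a script shipped with the repo. The subgraph directory is deliberately left out of the list, because the tree above names it `graphs/` while `download_raw_data.sh` creates `graph/`.

```python
from pathlib import Path

# Entries taken from the tree above. The subgraph directory is omitted on
# purpose: the README tree says `graphs/`, download_raw_data.sh creates `graph/`.
EXPECTED_PATHS = [
    "data/cpnet",
    "data/csqa/train_rand_split.jsonl",
    "data/csqa/dev_rand_split.jsonl",
    "data/csqa/test_rand_split_no_answers.jsonl",
    "data/csqa/statement",
    "data/csqa/grounded",
]


def missing_paths(repo_root, expected=EXPECTED_PATHS):
    """Return the expected entries that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected data paths are present.")
```

Run it from the top-level directory of the repo; an empty result means every listed file and directory was found.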

### 3. Training GreaseLM
To train GreaseLM on CommonsenseQA, run
```
CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh csqa --data_dir data/
```
You can specify up to 2 GPUs at the beginning of the command via `CUDA_VISIBLE_DEVICES=...`.

Similarly, to train GreaseLM on OpenBookQA, run
```
CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh obqa --data_dir data/
```

### 4. Pretrained model checkpoints
You can download a GreaseLM model pretrained on CommonsenseQA [here](https://drive.google.com/file/d/1QPwLZFA6AQ-pFfDR6TWLdBAvm3c_HOUr/view?usp=sharing), which achieves an in-house dev (IH-dev) accuracy of `79.0` and an in-house test (IH-test) accuracy of `74.0`.

You can also download a GreaseLM model pretrained on OpenBookQA [here](https://drive.google.com/file/d/1-QqyiQuU9xlN20vwfIaqYQ_uJMP8d7Pv/view?usp=sharing), which achieves a test accuracy of `84.8`.

### 5. Evaluating a pretrained model checkpoint
To evaluate a pretrained GreaseLM model checkpoint on CommonsenseQA, run
```
CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh csqa --data_dir data/ --load_model_path /path/to/checkpoint
```
Again, you can specify up to 2 GPUs at the beginning of the command via `CUDA_VISIBLE_DEVICES=...`.

Similarly, to evaluate a pretrained GreaseLM model checkpoint on OpenBookQA, run
```
CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh obqa --data_dir data/ --load_model_path /path/to/checkpoint
```

### 6. Use your own dataset
- Convert your dataset to `{train,dev,test}.statement.jsonl` files (see `data/csqa/statement/train.statement.jsonl` for the format)
- Create a directory `data/{yourdataset}/` to store the .jsonl files
- Modify `preprocess.py` and perform subgraph extraction for your data
- Modify `utils/parser_utils.py` to support your own dataset
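
As a starting point for the first step, the sketch below writes multiple-choice examples as one JSON object per line. The field names (`id`, `answerKey`, `question.stem`, `question.choices`, `statements`) are assumptions modeled on CommonsenseQA-style records, and the statement text is a naive question-plus-answer concatenation; check both against `data/csqa/statement/train.statement.jsonl` before relying on them. The helper names are hypothetical.

```python
import json

LABELS = "ABCDEFGH"


def to_statement_record(example_id, question, choices, answer_index):
    """Build one record; the field names are assumptions (see the note above)."""
    return {
        "id": example_id,
        "answerKey": LABELS[answer_index],
        "question": {
            "stem": question,
            "choices": [
                {"label": LABELS[i], "text": text} for i, text in enumerate(choices)
            ],
        },
        # Placeholder statement construction: question followed by the answer text.
        "statements": [
            {"label": i == answer_index, "statement": f"{question} {text}"}
            for i, text in enumerate(choices)
        ],
    }


def write_jsonl(records, path):
    """Write one JSON object per line, as the .jsonl extension implies."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


record = to_statement_record("q1", "Where do fish live?", ["water", "sky"], 0)
```

Calling `write_jsonl([record], "data/{yourdataset}/train.statement.jsonl")` with your own directory name would then produce a file in the layout the remaining steps expect, assuming the field names match.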

## Acknowledgment
This repo is built upon the following work:
```
QA-GNN: Question Answering using Language Models and Knowledge Graphs
https://github.com/michiyasunaga/qagnn
```
Many thanks to the authors and developers!

download_raw_data.sh

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# download ConceptNet
mkdir -p data/
mkdir -p data/cpnet/
wget -nc -P data/cpnet/ https://s3.amazonaws.com/conceptnet/downloads/2018/edges/conceptnet-assertions-5.6.0.csv.gz
cd data/cpnet/
yes n | gzip -d conceptnet-assertions-5.6.0.csv.gz
# download ConceptNet entity embedding
wget https://csr.s3-us-west-1.amazonaws.com/tzw.ent.npy
cd ../../




# download CommonsenseQA dataset
mkdir -p data/csqa/
wget -nc -P data/csqa/ https://s3.amazonaws.com/commensenseqa/train_rand_split.jsonl
wget -nc -P data/csqa/ https://s3.amazonaws.com/commensenseqa/dev_rand_split.jsonl
wget -nc -P data/csqa/ https://s3.amazonaws.com/commensenseqa/test_rand_split_no_answers.jsonl

# create output folders
mkdir -p data/csqa/grounded/
mkdir -p data/csqa/graph/
mkdir -p data/csqa/statement/



# download OpenBookQA dataset
wget -nc -P data/obqa/ https://s3-us-west-2.amazonaws.com/ai2-website/data/OpenBookQA-V1-Sep2018.zip
yes n | unzip data/obqa/OpenBookQA-V1-Sep2018.zip -d data/obqa/

# create output folders
mkdir -p data/obqa/fairseq/official/
mkdir -p data/obqa/grounded/
mkdir -p data/obqa/graph/
mkdir -p data/obqa/statement/

eval_greaselm.sh

Lines changed: 22 additions & 0 deletions

figs/greaselm.png

134 KB