maastrichtlawtech/awesome-legal-nlp: 📖 A curated list of LegalNLP resources from all around the web.

Legal Natural Language Processing

🗂 Datasets

Legal Judgement Prediction (LJP)

| Dataset | Links | Domain | Language | Size |
| --- | --- | --- | --- | --- |
| FSCS (Niklaus et al., 2021) | 📄 🤗 💻 | Swiss court judgments | 🇩🇪 🇫🇷 🇮🇹 | 85K cases w/ 2 outcomes |
| ECtHR (Chalkidis et al., 2021) | 📄 🤗 | European Court of Human Rights judgments | 🇬🇧 | 11K cases w/ 11 outcomes |
| ECHR (Aletras et al., 2019) | 📄 💾 | European Court of Human Rights judgments | 🇬🇧 | 11.5K cases w/ 11 outcomes |
| CAIL (Xiao et al., 2018) | 📄 💻 | Chinese court judgements | 🇨🇳 | 2.6M cases w/ 6 outcomes |
| AnnoCaseLaw (2025) | 📄 💻 | US Appeals Court negligence cases | 🇺🇸 | 471 annotated cases with expert labels |
| IndianBailJudgments-1200 (2025) | 📄 🤗 💻 | Indian court bail decisions | 🇮🇳 | 1.2K judgments with 20+ structured attributes |
| CaseSumm (2025) | 📄 🤗 | US Supreme Court opinions | 🇺🇸 | 25.6K opinions with official syllabuses |
| JUSTICE (2022) | 📄 💻 | US Supreme Court cases | 🇺🇸 | Benchmark for judgment prediction |
| Cambridge Law Corpus (CLC) (2023) | 📄 | UK court cases | 🇬🇧 | 258K+ cases (16th century–present) |
| Super-SCOTUS (2025) | 📄 💻 | US Supreme Court decisions | 🇺🇸 | Decision direction and related tasks |

Legal Text Classification (LTC)

| Dataset | Links | Domain | Language | Size |
| --- | --- | --- | --- | --- |
| GLC (Papaloukas et al., 2021) | 📄 💻 | Greek legislation | 🇬🇷 | 47.5K laws w/ 2.7K labels |
| CUAD (Hendrycks et al., 2021) | 📄 🤗 💻 | Contracts | 🇬🇧 | 510 contracts w/ 41 classes |
| MultiEURLEX (Chalkidis et al., 2021) | 📄 🤗 💻 | EU legislation | 🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇪🇸 (18+) | 65K laws w/ 4.5K labels |
| LEDGAR (Tuggener et al., 2020) | 📄 💾 | Contracts | 🇬🇧 | 60.5K contracts w/ 12.6K labels |
| Contract Discovery (Borchmann et al., 2020) | 📄 💻 | Contracts | 🇬🇧 | 2.6K clauses w/ 21 classes |
| EURLEX-57K (Chalkidis et al., 2019) | 📄 💾 | EU legislation | 🇬🇧 | 57K laws w/ 4.3K labels |
| Unfair-ToS (Lippi et al., 2018) | 📄 💾 | Contracts | 🇬🇧 | 9.4K sentences w/ 9 classes |
| Contract Elements (Chalkidis et al., 2017) | 📄 💾 | Contracts | 🇬🇧 | 2.4K contracts w/ 10 classes |
| OPP-115 (Wilson et al., 2016) | 📄 💾 | Privacy policies | 🇬🇧 | 115 policies w/ 23K labels |
| FairLex (2022) | 📄 🤗 💻 | Multi-jurisdictional legal texts | 🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇨🇳 | Fairness-focused classification datasets |
| Legal Case Document Summarization (Kaggle) | 📄 | Legal case summaries | Various | Large-scale dataset |
| Legal Citation Text Classification Dataset (Kaggle) | 📄 | General legal documents | 🇬🇧 | 25K cases with catchphrases and citations |

Legal Information Retrieval (LIR)

| Dataset | Links | Domain | Language | Size |
| --- | --- | --- | --- | --- |
| BSARD (Louis et al., 2022) | 📄 🤗 💻 | Belgian legislation | 🇫🇷 | 1.1K questions w/ 22.6K candidate statutory articles |
| EU2UK (Chalkidis et al., 2021) | 📄 💾 | EU & UK legislation | 🇬🇧 | 2K query documents w/ 52.5K candidate documents |
| UK2EU (Chalkidis et al., 2021) | 📄 💾 | EU & UK legislation | 🇬🇧 | 2.1K query documents w/ 3.9K candidate documents |
| COLIEE-Case-Law-Retrieval (Rabelo et al., 2020) | 📄 💾 | Canadian precedents | 🇬🇧 | 650 query cases w/ 128K candidate cases |
| COLIEE-Statute-Law-Retrieval (Rabelo et al., 2020) | 📄 💾 | Japanese legislation | 🇬🇧 🇯🇵 | 808 questions w/ 768 candidate statutory articles |
| CAIL2019-SCM (Xiao et al., 2019) | 📄 💻 | Chinese court judgements | 🇨🇳 | 8.9K triplets of cases |
| CLERC (2024) | 📄 🤗 💻 | Legal case retrieval | 🇬🇧 | Large corpus for retrieval and RAG |
| LEAD (2024) | 📄 💻 | Legal case retrieval | Various | 100K+ pairs of similar legal cases |
| Legal IR Philippines (2024) | 📄 | Philippine legal documents | 🇵🇭 | Datasets with synthetic queries |

Legal Question Answering (LQA)

| Dataset | Links | Domain | Language | Size |
| --- | --- | --- | --- | --- |
| CaseHOLD (Zheng et al., 2021) | 📄 💻 | US case holdings | 🇬🇧 | 53.1K multiple-choice questions |
| JEC-QA (Zhong et al., 2019) | 📄 💾 | Chinese law | 🇨🇳 | 26.3K multiple-choice questions |
| CJRC (Duan et al., 2019) | 📄 💻 | Chinese court judgements | 🇨🇳 | 50K question-answers from 10K documents |
| PrivacyQA (Ravichander et al., 2019) | 📄 💻 | Privacy policies | 🇬🇧 | 1.7K question-answers from 35 documents |
| LLeQA (2024) | 📄 🤗 💻 | French-Belgian statutes | 🇫🇷 | 1,868 expert-annotated long-form QA |
| IndicLegalQA (2025) | 📄 | Indian Supreme Court judgments | 🇮🇳 | 10K QA pairs from 1,256 judgments |
| GerLayQA (2024) | 📄 💻 | German civil law | 🇩🇪 | 21K laymen legal Qs with lawyer answers |
| LEGAL-UQA (2024) | 📄 | Legal questions | 🇵🇰 | 619 parallel Urdu–English QA pairs |

Legal Textual Entailment (LTE)

| Dataset | Links | Domain | Language | Size |
| --- | --- | --- | --- | --- |
| COLIEE-Case-Law-Entailment (Rabelo et al., 2020) | 📄 💾 | Canadian precedents | 🇬🇧 | 425 cases w/ related case |
| COLIEE-Statute-Law-Entailment (Rabelo et al., 2020) | 📄 💾 | Japanese legislation | 🇬🇧 🇯🇵 | 808 questions w/ related statutory article |
| LAR-ECHR (2024) | 📄 | European Court of Human Rights | 🇬🇧 | Legal argument reasoning task dataset |
| δ-Stance (2025) | 📄 | US legal argumentation | 🇺🇸 | Large-scale stances and arguments |

Legal Text Summarization (LTS)

| Dataset | Links | Domain | Language | Size |
| --- | --- | --- | --- | --- |
| UK-Abs (Shukla et al., 2022) | 📄 💻 💾 | UK court cases | 🇬🇧 | 793 pairs of (case, abstractive summary) from the UK Supreme Court |
| IN-Abs (Shukla et al., 2022) | 📄 💻 💾 | Indian court cases | 🇬🇧 | 7.1K pairs of (case, abstractive summary) from the Indian Supreme Court |
| IN-Ext (Shukla et al., 2022) | 📄 💻 💾 | Indian court cases | 🇬🇧 | 50 pairs of (case, extractive summary) from the Indian Supreme Court |
| TOS;DR (Keymanesh et al., 2020) | 📄 💻 | Terms of service | 🇬🇧 | 1.6K pairs of (agreement text, summary) from data privacy policies |
| BillSum (Kornilova et al., 2019) | 📄 💻 💾 | US Congressional bills | 🇬🇧 | 22.2K pairs of (bill, summary) |
| TL;DRLegal (Manor et al., 2019) | 📄 💻 | Terms of service | 🇬🇧 | 84 pairs of (agreement text, summary) from software licenses |
| TOS;DR (Manor et al., 2019) | 📄 💻 | Terms of service | 🇬🇧 | 421 pairs of (agreement text, summary) from data privacy policies |
| BVA Cases (Zhong et al., 2019) | 📄 💻 | US court cases | 🇬🇧 | 92 pairs of (case, summary) from the US Board of Veterans' Appeals |
| LCR (Galgani et al., 2012) | 📄 💾 | Australian court cases | 🇬🇧 | 3.9K pairs of (case, catchphrases) |
| EurLexSummarization (2022) | 📄 🤗 💻 | EU legislation | 🌍 | Multilingual summarization across 24 languages |
| Multi-LexSum (2025) | 📄 | Legal documents | 🇬🇧 | 40K+ documents with 9K+ expert summaries |
| CaseSumm (2025) | 📄 🤗 | US Supreme Court opinions | 🇬🇧 | 25.6K opinions with official syllabuses |

Legal Language Modeling (LLM)

| Dataset | Links | Language | Size |
| --- | --- | --- | --- |
| Pile of Law (Henderson et al., 2022) | 📄 🤗 💻 | 🇬🇧 | ~256GB of legal and administrative text |
| MultiLegalPile (2024) | 📄 🤗 | 🌍 | 689GB multilingual legal corpus from 17 jurisdictions |

Benchmarks

| Dataset | Links | Language | Tasks |
| --- | --- | --- | --- |
| FairLex (Chalkidis et al., 2022) | 📄 🤗 💻 | 🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇨🇳 | Classification (x1), legal judgement prediction (x3) |
| LexGLUE (Chalkidis et al., 2022) | 📄 🤗 💻 | 🇬🇧 | Classification (x6), multiple-choice QA (x1) |

🔥 Models

| Model | Links | Language | Size |
| --- | --- | --- | --- |
| Legal-HeBERT (Chriqui et al., 2022) | 📄 🤗 💻 | 🇮🇱 | 110M |
| PoL-BERT-Large (Henderson et al., 2022) | 📄 🤗 💻 | 🇬🇧 | 336M |
| Italian-LEGAL-BERT (Licari and Comande, 2022) | 📄 🤗 | 🇮🇹 | 110M |
| JuriBERT (Douka et al., 2021) | 📄 💾 | 🇫🇷 | {6M, 15M, 42M, 110M} |
| Custom-LEGAL-BERT (Zheng et al., 2021) | 📄 🤗 💻 | 🇬🇧 | 110M |
| LEGAL-BERT (Chalkidis et al., 2020) | 📄 🤗 | 🇬🇧 | {35M, 110M} |
| LEGAL-GPT-{1,2} (Borchmann et al., 2020) | 📄 💻 | 🇬🇧 | {117M, 1.5B} |
| MultiLegalPile Models (2024-2025) | 📄 🤗 | 🌍 | RoBERTa (multilingual + 24 monolingual), Longformer |
| Legal-BERT Fine-tuned (2024) | 📄 | 🇬🇧 | Domain-adapted classification models |
| LegalCore Models (2025) | 📄 | 🌍 | Event coreference resolution for legal texts |
| Legal LLaMA (2025) | 📄 | 🇨🇳 | Chinese legal domain adaptations |
| FairLex Domain Models (2024-2025) | 🤗 | 🌍 | Domain-specific BERT models for 4 jurisdictions |

📚 Books

  • [2017] Artificial Intelligence and Legal Analytics: New Tools for Law Practice in the Digital Age, K. Ashley. [link]

  • [2024] Large Language Models and International Law, Chicago Journal of International Law [🌐]

  • [2024] Computational Legal Studies Comes of Age, SSRN [📄]

📄 Surveys

  • [2020-05] How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence, H. Zhong et al. [pdf]

  • [2019-09] A Brief History of the Changing Roles of Case Prediction in AI and Law, K. Ashley [pdf]

  • [2018-12] Deep learning in law: early adaptation and legal word embeddings trained on large corpora, I. Chalkidis et al. [pdf]

  • [2024] Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models and Challenges, F. Ariai et al. [📄]

  • [2025] Computational Law: Datasets, Benchmarks, and Ontologies, D. Küçük & F. Can [📄]

  • [2025] A Comprehensive Survey on Legal Summarization, arXiv [📄]

  • [2024] Large Language Models in Law: A Survey, J. Lai et al. [📄]

  • [2025] Large Language Models in Argument Mining: A Survey, arXiv [📄]

  • [2024] When Large Language Models Meet Law: Dual-Lens Survey, arXiv [📄]

🎙 Talks

  • [2019-06] Law as Data: The Promise and Challenges of Natural Language Processing for Legal Research, A. Dyevre. [slides]
  • [2019-04] Artificial Intelligence and Law – An Overview and History, H. Surden. [video]

🗓 Conferences & Workshops

  • The Natural Legal Language Processing (NLLP) Workshop [website]
  • The International Conference on Artificial Intelligence and Law (ICAIL) [website]
  • The International Conference on Legal Knowledge and Information Systems (JURIX) [website]
  • The EXplainable AI in Law (XAILA) Workshop [website]
  • The International Workshop on Juris-informatics (JURISIN) [website]
  • The Competition on Legal Information Extraction/Entailment (COLIEE) [website]
  • The International Workshop on Legal Information Retrieval [website]

2025 Conferences

  • NLLP 2025 - Natural Legal Language Processing Workshop (EMNLP 2025, Suzhou) [🌐]
  • RegNLP 2025 - Regulatory Natural Language Processing Workshop (COLING 2025) [🌐]
  • JURIX 2025 - 38th International Conference on Legal Knowledge and Information Systems (Turin, December 9-11, 2025) [🌐]
  • ICAIL 2025 - 20th International Conference on Artificial Intelligence and Law (Chicago, June 16-20, 2025) [🌐]
  • MWAiL 2025 - Multilingual Workshop on AI & Law Research (Chicago, June 20, 2025) [🌐]
  • LLMFinLegal 2025 - Workshop on Large Language Models for Finance and Legal (COLING 2025) [🌐]
  • 8th World Legal Tech and AI Summit (Berlin, September 18-19, 2025) [🌐]

Industry & Professional Events

  • AI Legal Summit 2025 - Various industry conferences on AI in legal practice [🌐]
  • Legal AI Conferences Online Platform - Centralized platform for legal AI events [🌐]

🧰 Tools & Evaluation

Evaluation Tools

  • Embedding Benchmarking Tools: MTEB, Hugging Face evaluate, LegalBench, COLIEE [🌐]
  • Legal Argument Mining Tools: RMU:ECHR corpus and mining models [💻]
  • Multilingual Legal Processing: Evaluation pipelines for multilingual legal LLMs [📄]

Quality Assessment Frameworks

  • LegalEval-Q: Quality evaluation for LLM-generated legal text [📄]
  • FairLex Evaluation: Bias and fairness assessment [🌐]

Last Updated: 2025-09-30 | Research Coverage: 2024-01 to 2025-09 | Sources: 180+ academic papers, datasets, and conference proceedings
