GitHub - amirhpd/text_classifier: NLP project - News articles classifier · GitHub

Implementation and Evaluation of Text Classifiers

Abstract:

This project aims to implement and compare different types of text classifiers on news text data. A dataset of labeled text will be used to train the models. Each method will be evaluated on the test dataset, and the performance of the classifiers will be compared.

Steps:

  1. Dataset Collection: A combination of three datasets of news articles will be used:

    • AG's News Topic Classification Dataset [1]:
      Contains 120,000 training samples and 7,600 test samples of news texts, labeled with 4 categories: World, Sports, Business, and Science.
    • COVID-19 News Articles Open Research Dataset [2]:
      Contains more than 5,000 news text items about COVID-19.
    • Covid-19 Public Media Dataset by Anacode [3]:
      Contains over 50,000 text articles about COVID-19 from different online sources. From these, news article items will be selected for this project.

    Combining these datasets will provide a larger dataset with 5 labels: World, Sports, Business, Science, and Corona.
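The merge described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `combine_datasets`, the `(text, label)` tuple layout, and the toy rows are assumptions made for the example, not the repository's actual loading code.

```python
import random

# The five target labels named in the text above.
LABELS = ["World", "Sports", "Business", "Science", "Corona"]

def combine_datasets(ag_news, covid_texts, seed=0):
    """Merge labeled AG's News rows with unlabeled COVID-19 articles.

    ag_news:     list of (text, label) tuples with one of the four AG labels
    covid_texts: list of article strings; each one receives the label "Corona"
    """
    combined = list(ag_news) + [(text, "Corona") for text in covid_texts]
    # Shuffle with a fixed seed so the merged order is reproducible.
    random.Random(seed).shuffle(combined)
    return combined

# Toy rows invented for illustration; they are not taken from the real datasets.
ag_sample = [
    ("Team wins the final match", "Sports"),
    ("Central bank raises interest rates", "Business"),
]
covid_sample = ["New vaccine trial starts", "Case numbers fall this week"]

combined = combine_datasets(ag_sample, covid_sample)
```

Shuffling after the merge keeps the Corona items from clustering at the end of the dataset, which matters if the data is later split sequentially into train and test parts.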

  2. Text pre-processing: The following operations will be done on the text data:

    • Tokenization
    • Punctuation and number removal
    • Stop word removal
    • Conversion to lowercase
    • Lemmatization
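The operations listed above can be sketched as one small function. This is a hedged illustration, not the project's pipeline: the stop word set is a tiny hypothetical sample, `naive_lemma` is a crude stand-in for a real lemmatizer, the regular expression assumes English ASCII text, and lowercasing is applied first for convenience rather than in the order listed.

```python
import re

# Tiny illustrative stop word set (an assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "was", "were"}

def naive_lemma(token):
    """Crude stand-in for lemmatization: strips a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Return the cleaned token list for one article."""
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation and numbers
    tokens = text.split()                    # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [naive_lemma(t) for t in tokens]  # simplified lemmatization

tokens = preprocess("The 3 markets were rising in 2020!")
```

For the sample sentence, `tokens` is `["market", "rising"]`: the number, punctuation, and stop words are dropped, and the plural is reduced to its singular form.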

  3. Text Vectorization: Text data will be converted to vectors. Two different NLP techniques will be applied and their results compared:
    • TF-IDF
    • Word2Vec
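To make the first technique concrete, here is a minimal pure-Python sketch of TF-IDF using the plain textbook weighting, term frequency times log(N / document frequency). It covers TF-IDF only, not Word2Vec. The function name `tfidf_vectors` and the toy documents are assumptions for illustration; library implementations typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of dense vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors

# Toy documents invented for illustration.
docs = [["market", "rise"], ["match", "rise"]]
vocab, vectors = tfidf_vectors(docs)
```

In this toy example "rise" appears in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing to either vector, while "market" and "match" carry all the weight.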

  4. Classification: Different supervised classification techniques will be applied and their results will be compared. In addition, some NLP-based techniques will be tried:
    • Naive Bayes classifier
    • SVM
    • Neural networks
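As an illustration of the first classifier in the list, the following is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing, working directly on token lists. The class name `NaiveBayes`, its methods, and the toy training data are assumptions for the example; it is not the repository's implementation, and SVMs and neural networks are not covered here.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        self.n_docs = len(labels)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_label / self.n_docs)
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data invented for illustration.
train_docs = [["goal", "match", "team"], ["stock", "market", "bank"], ["virus", "vaccine", "case"]]
train_labels = ["Sports", "Business", "Corona"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["market", "stock"])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that is unseen for one class from forcing that class's probability to zero.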

  5. Evaluation: The results of all methods will be compared. A benchmark based on train/test accuracy, memory usage, and processing time will be delivered.
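One way such a benchmark could be collected is sketched below with the standard library only. The `benchmark` function, its `train_fn`/`predict_fn` callback interface, and the majority-class baseline are hypothetical names introduced for this example. Note that `tracemalloc` tracks only memory allocated through Python, so native allocations made inside compiled libraries may not be counted.

```python
import time
import tracemalloc
from collections import Counter

def benchmark(train_fn, predict_fn, train_set, test_set):
    """Measure train/test accuracy, training time, and peak Python memory.

    train_fn(texts, labels) -> model; predict_fn(model, text) -> label.
    train_set and test_set are lists of (text, label) pairs.
    """
    train_x, train_y = zip(*train_set)
    tracemalloc.start()
    start = time.perf_counter()
    model = train_fn(train_x, train_y)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    def accuracy(dataset):
        hits = sum(predict_fn(model, x) == y for x, y in dataset)
        return hits / len(dataset)

    return {
        "train_accuracy": accuracy(train_set),
        "test_accuracy": accuracy(test_set),
        "train_seconds": elapsed,
        "peak_memory_bytes": peak,
    }

# Majority-class baseline used as a placeholder model for the demonstration.
def train_majority(texts, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, text):
    return model

# Toy data invented for illustration.
train_set = [("a", "Sports"), ("b", "Sports"), ("c", "World")]
test_set = [("d", "Sports"), ("e", "World")]
report = benchmark(train_majority, predict_majority, train_set, test_set)
```

Running every classifier through the same function keeps the accuracy, time, and memory figures comparable across methods.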

Follow this guide to execute the files.
Here is the Slide set (subject to copyright)




[1]: https://www.kaggle.com/amananandrai/ag-news-classification-dataset
[2]: https://www.kaggle.com/ryanxjhan/cbc-news-coronavirus-articles-march-26
[3]: https://www.kaggle.com/jannalipenkova/covid19-public-media-dataset