Enhancing Clickbait Headline Identification Performance Without Preprocessing Through Feature Reduction and Sentiment Analysis

Authors

  • Moch Deny Pratama, Universitas Negeri Surabaya
  • Anisa Nur Azizah, Universitas Wijaya Putra
  • Misbachul Falach Asy'ari, Institut Teknologi Sepuluh Nopember
  • Dimas Novian Aditia Syahputra, Universitas Negeri Surabaya
  • M Adamu Islam Mashuri, Universitas Negeri Surabaya
  • Binti Kholifah, Universitas Negeri Surabaya
  • Rifqi Abdillah, Universitas Negeri Surabaya
  • Adinda Putri Pratiwi, Institut Teknologi Sepuluh Nopember
  • Dina Zatusiva Haq, Universitas Pembangunan Nasional “Veteran” Jawa Timur

Keywords:

Clickbait Detection, Feature Reduction, Sentiment Analysis, Embedding Technique, Machine Learning

Abstract

This study addresses the challenge of identifying clickbait headlines without relying on conventional text preprocessing, which can be resource-intensive and may degrade contextual integrity. To improve detection performance, we examine three feature extraction methods: TF-IDF, Word2Vec, and Headline2Vec, an embedding technique designed for short texts such as headlines. The extracted features are then reduced with feature selection algorithms, namely Pearson Correlation Coefficient (PCC), Neighborhood Component Analysis (NCA), and Relief, to lower dimensionality while retaining the most relevant signal. Sentiment polarity is also integrated as a complementary feature. A comparative evaluation is conducted with several machine learning classifiers, namely Support Vector Classifier (SVC), Random Forest, LightGBM, and XGBoost, across all combinations of feature extraction and selection methods. Results show that the best configuration, Headline2Vec with Relief and SVC, achieves the highest accuracy at 94.40%, outperforming the other configurations. This demonstrates the effectiveness of combining semantic vectorization and feature selection for clickbait detection in the absence of traditional preprocessing. The findings support the development of streamlined and scalable classification models that maintain high accuracy while reducing preprocessing overhead, making the proposed method particularly suitable for real-time and large-scale content moderation and news verification systems.
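
As a rough illustration of one of the combinations named in the abstract (TF-IDF features reduced with PCC), the following minimal sketch scores each TF-IDF feature by its absolute Pearson correlation with the clickbait label and keeps the top k. It is not the authors' implementation: the function names (`tfidf_matrix`, `pearson`, `select_by_pcc`), the whitespace-only tokenization, the smoothed IDF formula, and the four toy headlines are all assumptions made for this example. Headline2Vec, Relief, NCA, the sentiment feature, and the classifiers are not shown.

```python
import math
from collections import Counter


def tfidf_matrix(headlines):
    """Build TF-IDF rows from raw headlines.

    Assumption: tokens come from a plain whitespace split, with no
    lowercasing, stemming, or stop-word removal.
    """
    docs = [h.split() for h in headlines]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of headlines containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty headlines
        rows.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, rows


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_by_pcc(vocab, rows, labels, k):
    """Keep the k features with the largest |PCC| against the labels."""
    scored = []
    for j, tok in enumerate(vocab):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, labels)), tok))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [tok for _, tok in scored[:k]]


# Toy data invented for this sketch (1 = clickbait, 0 = not clickbait).
headlines = [
    "You Won't Believe What Happened Next",
    "You Won't Believe These 10 Tricks",
    "Parliament Passes Budget Bill",
    "Central Bank Holds Interest Rates",
]
labels = [1, 1, 0, 0]

vocab, rows = tfidf_matrix(headlines)
selected = select_by_pcc(vocab, rows, labels, k=3)
print(selected)
```

In this toy set, the tokens shared by both clickbait headlines rank highest because their TF-IDF columns line up exactly with the labels. The reduced feature set would then be passed to a classifier such as SVC.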

Published

2025-07-30

Section

Articles