IMPLEMENTATION OF CONVOLUTIONAL NEURAL NETWORK (CNN) AND CONTRASTIVE LANGUAGE-IMAGE PRETRAINING (CLIP) FOR MOVIE GENRE PREDICTION BASED ON POSTER ANALYSIS
Abstract
The film industry continues to develop rapidly, producing thousands of films annually. Film genre classification has become crucial for categorization and recommendation systems. Film posters, as primary visual elements, often represent genres through objects, colors, and design, while textual information such as the plot is equally significant. This research compares the performance of a Convolutional Neural Network (CNN) and Contrastive Language-Image Pretraining (CLIP) in multi-label film genre classification using poster and plot analysis. A dataset drawn from IMDb and OMDb was prepared through preprocessing stages. The CNN model used the BiT-ResNet50 architecture, while CLIP used ViT-B/16, ViT-L/14, and RN50x16 for posters, along with BERT for plot analysis. Experiments involved variations in batch size, learning rate, and optimizer. Results show that CLIP (ViT-L/14) outperformed the CNN, reaching 83.2% accuracy and a Hamming Loss of 0.1678, compared to the CNN's 77.9% accuracy. Integrating plot analysis using BERT improved accuracy by approximately 5% over poster-only methods. This study demonstrates that the combination of a vision-language model (CLIP) and text analysis (BERT) is more effective than a conventional CNN for film genre classification.
Keywords—film genre classification, CNN, CLIP, deep learning, movie posters, multi-label classification.
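The Hamming Loss reported in the abstract (0.1678 for CLIP ViT-L/14) is the standard multi-label metric: the fraction of individual genre labels predicted incorrectly, averaged over all films and all genres. A minimal sketch of the computation, using hypothetical binary genre vectors rather than data from the study:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly across all samples."""
    total = sum(len(t) for t in y_true)
    wrong = sum(ti != pi
                for t, p in zip(y_true, y_pred)
                for ti, pi in zip(t, p))
    return wrong / total

# Hypothetical multi-label vectors for 3 films over 4 genres,
# e.g. [action, comedy, drama, horror]; 1 = genre present.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 1]]

print(hamming_loss(y_true, y_pred))  # 2 of 12 labels wrong -> ~0.167
```

A lower value is better: a Hamming Loss of 0.1678 means roughly 17% of all per-genre decisions were wrong, which complements the subset accuracy figures quoted above.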
DOI: https://doi.org/10.46576/syntax.v6i1.6492
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Syntax: Journal of Software Engineering, Computer Science and Information Technology
This work is distributed under a Creative Commons Attribution 4.0 International License.