Math 3544 Probability and Math Statistics

Mid-Term Project 2nd Report: Artistic Style Image Classification

Spring 2025 (at most 7 people per group)

[Instructions: Replace text in red with appropriate information and turn it in in class on the due date.
Keep everything else as is. The write-up of the report and future reports should be typeset well.]

PROJECT GOAL AND SCOPE 

Traditional image classification models such as CNNs perform well on real-world photographs but struggle when applied to artworks. Paintings are inherently subjective and often lack the rigid structure of photographic data. They contain abstract elements, diverse brushwork, color irregularities, and symbolic components that standard classification models may misinterpret. The challenge lies in developing a model that can learn these abstract representations without overfitting to low-level patterns.

We aim to develop a reliable and interpretable image classification system that automatically categorizes artworks into different artistic styles. These include Abstract Expressionism, Analytical Cubism, Action Painting, and Art Nouveau Modern. The classification should remain robust under visual ambiguity, style overlap, and limited labeled data.

We propose a comparative framework combining both convolution-based and attention-based deep learning methods. Specifically, we apply classical CNN architectures (ResNet50, EfficientNetB0) alongside Vision Transformer (ViT), a self-attention-based approach. Models are trained and evaluated on a curated subset of the WikiArt dataset. Evaluation metrics include classification accuracy, macro F1-score, and confusion matrix. Furthermore, we implement PCA to analyze learned features and Grad-CAM to visualize attention mechanisms.
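
As a concrete illustration of the evaluation pipeline, the short sketch below computes these three metrics with scikit-learn; the label arrays are placeholders standing in for the validation labels and a model's predictions.

```python
# Hedged sketch: computing the evaluation metrics named above with scikit-learn.
# y_true and y_pred are placeholder arrays, not results from any trained model.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([0, 1, 2, 3, 1, 0])   # placeholder integer style labels
y_pred = np.array([0, 1, 2, 1, 1, 0])   # placeholder model predictions

acc = accuracy_score(y_true, y_pred)                  # overall classification accuracy
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
cm = confusion_matrix(y_true, y_pred)                 # rows: true style, columns: predicted style

print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
print(cm)
```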

BACKGROUND/LIT REVIEW

Image classification has been widely studied using deep neural networks. Krizhevsky et al. (2012) introduced AlexNet, which sparked the widespread use of CNNs. CNNs rely on convolutional filters to extract spatial hierarchies of features, which has proven effective on structured datasets such as ImageNet. ResNet introduced residual (skip) connections, mitigating the vanishing-gradient problem and allowing much deeper networks to be trained.
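
To make the idea of a residual connection concrete, here is a minimal PyTorch sketch of a skip-connection block; it is illustrative rather than the exact bottleneck block used inside ResNet50, and the channel count is arbitrary.

```python
# Minimal sketch of a residual (skip-connection) block in the spirit of ResNet.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                        # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)    # gradients flow directly through the addition

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))       # output shape is preserved: (1, 64, 56, 56)
```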

EfficientNet, proposed by Tan and Le (2019), scales network width, depth, and input resolution jointly using a compound coefficient; its largest variant, EfficientNet-B7, reaches roughly 84% top-1 accuracy on ImageNet with significantly fewer parameters than comparably accurate ResNet-style models.
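
The compound scaling rule can be written as a few lines of arithmetic. The sketch below uses the alpha, beta, and gamma constants reported by Tan and Le (2019); the base depth, width, and resolution values are illustrative placeholders, not the exact EfficientNet-B0 configuration.

```python
# Sketch of the compound-scaling rule from Tan and Le (2019): depth, width,
# and resolution are scaled together by a single coefficient phi.
# alpha, beta, gamma are the constants reported in the paper; the base
# dimensions passed in below are illustrative only.
alpha, beta, gamma = 1.2, 1.1, 1.15   # chosen so that alpha * beta**2 * gamma**2 is approximately 2

def scale(base_depth: int, base_width: int, base_resolution: int, phi: int):
    depth = round(base_depth * alpha ** phi)           # more layers
    width = round(base_width * beta ** phi)            # more channels per layer
    resolution = round(base_resolution * gamma ** phi) # larger input images
    return depth, width, resolution

print(scale(base_depth=18, base_width=32, base_resolution=224, phi=1))
```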

ViT, introduced by Dosovitskiy et al. (2020), replaces convolution with self-attention, treating an image as a sequence of patches. While ViT shows excellent results, with 84.2% top-1 accuracy for ViT-L/16, it requires a large amount of training data and generalizes differently from CNNs.
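
The sketch below illustrates the patch-to-sequence idea in PyTorch: a 224x224 image is split into 16x16 patches by a strided convolution, projected to the embedding dimension, and passed through one self-attention layer. The dimensions match ViT-B/16, but this is a toy illustration, not a full ViT implementation.

```python
# Toy sketch of how ViT turns an image into a sequence of patch tokens.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# A strided convolution performs "cut into patches + linear projection" in one step.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = to_patches(image)                     # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)     # (1, 196, 768): a sequence of 196 patch embeddings

attention = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)
out, _ = attention(tokens, tokens, tokens)     # self-attention across all 196 patches
print(out.shape)                               # torch.Size([1, 196, 768])
```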

In the context of fine art classification, He et al. (2023) combined contextual embeddings and visual features, reporting F1-scores above 0.80 using hybrid attention-CNN architectures. This supports the hypothesis that self-attention mechanisms can be effective in tasks involving abstract, symbolic images.

Our study builds upon these approaches to determine which architecture—convolutional or transformer—performs better for fine art classification under data constraints.

CASE STUDY

In this study, we draw upon Dosovitskiy et al. (2020) as a key methodological foundation due to the Vision Transformer’s ability to model long-range dependencies in image data, which is particularly well-suited for capturing the abstract and stylistic nuances present in artwork. Their approach is not only state-of-the-art in image classification tasks but also aligns with the broader goals of this course in terms of exploring emerging architectures for real-world applications.

Our primary goal is to evaluate the comparative effectiveness of three deep learning architectures—ResNet50, EfficientNetB0, and ViT—on this task. In addition to these deep models, we employ logistic regression and random forest classifiers as baselines using bottleneck features. All models are evaluated using accuracy and F1-score, with a special emphasis on F1 due to the artistic domain’s sensitivity to false negatives, which could lead to the misrepresentation of styles.
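
A minimal sketch of these baselines, assuming the bottleneck features have already been extracted (here simulated with random 2048-dimensional vectors per painting), might look like this:

```python
# Sketch of the baseline classifiers on bottleneck features. `features` stands in
# for penultimate-layer activations from a pretrained CNN; here it is random data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

features = np.random.randn(500, 2048)          # placeholder bottleneck features
labels = np.random.randint(0, 4, size=500)     # placeholder labels for the four styles

X_tr, X_va, y_tr, y_va = train_test_split(features, labels, test_size=0.2, random_state=0)

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=200)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, f1_score(y_va, clf.predict(X_va), average="macro"))
```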

Throughout the study, we utilize interpretability tools such as Grad-CAM to visualize what parts of each image the model attends to during classification. We also plot confusion matrices to understand better which classes are often confused. In the event of underperformance, we incorporate augmentation techniques and learning rate adjustments to improve results and consider ensemble strategies combining CNN and Transformer outputs. This methodologically diverse setup ensures that our project rigorously explores the advantages and limitations of both traditional convolutional and modern transformer-based vision models.
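
As a rough sketch of the Grad-CAM procedure, the code below hooks the last convolutional block of a torchvision ResNet50, backpropagates the top-class score, and forms a coarse heatmap from gradient-weighted feature maps. The random input tensor is a placeholder for a preprocessed painting.

```python
# Hedged Grad-CAM sketch over a torchvision ResNet50: gradients of the predicted
# class score w.r.t. the last convolutional feature maps are pooled into channel
# weights and used to form a coarse localization heatmap.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out             # feature maps of the hooked block

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0]       # gradient of the score w.r.t. those maps

layer = model.layer4[-1]                   # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)   # placeholder preprocessed painting
scores = model(x)
scores[0, scores.argmax()].backward()      # backpropagate the top-class score

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # global-average-pool the gradients
cam = F.relu((weights * activations["value"]).sum(dim=1))     # weighted sum of feature maps
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear", align_corners=False)
print(cam.shape)                           # (1, 1, 224, 224) heatmap to overlay on the painting
```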

DATA

The dataset used in our project is a carefully selected and preprocessed subset of the WikiArt collection, which includes 500 paintings evenly distributed across four artistic styles. All images were resized to 224x224 pixels and normalized. Labels were encoded using one-hot encoding, and the data was split into training and validation sets. Preliminary visualizations using PCA confirmed a degree of separability among the styles, with clusters forming in reduced-dimensional space. These visual groupings suggest that the model detects stylistic differences even at the feature level.
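
A sketch of this preprocessing and the PCA check is given below. The directory layout (one subfolder per style under data/wikiart_subset/) is an assumption made for illustration; the resize, normalization, one-hot encoding, train/validation split, and two-dimensional PCA projection follow the steps described above.

```python
# Sketch of the preprocessing and PCA check; the folder layout is an assumption.
import numpy as np
from pathlib import Path
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

root = Path("data/wikiart_subset")                     # assumed: one subfolder per style
styles = sorted(p.name for p in root.iterdir() if p.is_dir())

images, labels = [], []
for idx, style in enumerate(styles):
    for path in (root / style).glob("*.jpg"):
        img = Image.open(path).convert("RGB").resize((224, 224))
        images.append(np.asarray(img, dtype=np.float32) / 255.0)   # normalize to [0, 1]
        labels.append(idx)

X = np.stack(images)                                   # (N, 224, 224, 3)
y = np.eye(len(styles))[labels]                        # one-hot encoded style labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 2-D PCA projection of the flattened images to eyeball style separability.
coords = PCA(n_components=2).fit_transform(X_train.reshape(len(X_train), -1))
print(coords.shape)                                    # (N_train, 2)
```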

TAKE-HOME DELIVERABLES


We fine-tuned a ResNet50 model pretrained on ImageNet using a subset of 500 images from the WikiArt dataset, covering four distinct artistic genres. The model was trained for 10 epochs with a batch size of 8, and training performance was tracked across both training and validation sets.
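
The following sketch shows roughly how such a fine-tuning run could be set up in PyTorch: a pretrained ResNet50 with its final layer replaced by a 4-way head, trained for 10 epochs with batch size 8. The datasets and the learning rate are placeholders; hyperparameters beyond those stated above are assumptions, not the project's exact configuration.

```python
# Hedged sketch of the fine-tuning setup (pretrained ResNet50, 4 classes,
# 10 epochs, batch size 8). The datasets and learning rate are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 4)          # replace the 1000-way ImageNet head

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate is an assumption
criterion = nn.CrossEntropyLoss()                           # expects integer class indices

# Placeholder datasets standing in for the preprocessed WikiArt subset.
train_ds = TensorDataset(torch.randn(40, 3, 224, 224), torch.randint(0, 4, (40,)))
val_ds = TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 4, (8,)))
train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=8)

for epoch in range(10):
    model.train()
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, targets in val_loader:
            correct += (model(images).argmax(1) == targets).sum().item()
            total += targets.size(0)
    print(f"epoch {epoch + 1}: val accuracy {correct / total:.3f}")
```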

Accuracy Trends
Training accuracy started at 80.86% in the first epoch and remained relatively stable, oscillating between roughly 79% and 82%. Validation accuracy began at 80.0% and rose gradually to 81.0%, indicating solid generalization to unseen images. This stability demonstrates the effectiveness of the pretrained ResNet50 backbone, even when applied to complex, highly subjective art images with large stylistic variance.

Loss Behavior
Training loss increased modestly from 0.6061 to 0.6807. By itself this is not evidence of overfitting; overfitting would show the opposite pattern of falling training loss with rising validation loss, so the increase more likely reflects optimization noise on a small dataset. The validation loss, by contrast, decreased progressively from 0.6113 to 0.5955, a positive signal of improving generalization. The falling validation loss together with the rising validation accuracy indicates that the model was learning meaningful patterns rather than simply memorizing training examples.

Training Dynamics
As the training and validation curves show, both loss and accuracy remained smooth, with no abrupt divergence. Validation accuracy plateaued early but stayed high, and the validation loss continued to decline, showing healthy optimization behavior. No early stopping was triggered, and performance remained steady throughout the 10 epochs.

These results suggest that ResNet50 performs reliably in the context of artistic image classification, achieving over 81% validation accuracy on a limited dataset. Its pretrained convolutional filters seem to transfer well to the domain of paintings, capturing stylistic nuances despite their abstract and diverse nature.

FUTURE DELIVERABLES

1. Train and Evaluate EfficientNet and ViT Models

We will implement EfficientNetB0, known for its parameter efficiency, and Vision Transformer (ViT), which captures long-range dependencies using self-attention. The same dataset and preprocessing pipeline will be used to ensure fair comparison across architectures. Metrics such as validation accuracy, F1-score, and confusion matrix will be collected to compare model performance.
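
As a sketch of how these two backbones could be instantiated with matching 4-way heads so they plug into the same training pipeline, torchvision's pretrained weights are one possible choice (the exact variants are still to be decided):

```python
# Sketch: EfficientNetB0 and ViT-B/16 with 4-way heads, reusing the ResNet50 pipeline.
import torch.nn as nn
from torchvision.models import efficientnet_b0, vit_b_16

effnet = efficientnet_b0(weights="IMAGENET1K_V1")
effnet.classifier[1] = nn.Linear(effnet.classifier[1].in_features, 4)   # replace final linear layer

vit = vit_b_16(weights="IMAGENET1K_V1")
vit.heads.head = nn.Linear(vit.heads.head.in_features, 4)               # replace classification head
```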

2. Side-by-Side Model Comparison

We will compare all three models (ResNet50, EfficientNetB0, ViT) on accuracy and F1-score, computational efficiency, and visual interpretability (e.g., Grad-CAM heatmaps).
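
Once all three models are trained, the comparison could be tabulated roughly as follows; the metric entries other than the ResNet50 validation accuracy reported above are placeholders to be filled in, and the parameter counts are the commonly cited approximate values for each architecture.

```python
# Sketch of the side-by-side comparison table; metric placeholders to be filled in later.
import pandas as pd

results = pd.DataFrame(
    {
        "model": ["ResNet50", "EfficientNetB0", "ViT-B/16"],
        "val_accuracy": [0.81, None, None],      # EfficientNetB0 and ViT results pending
        "macro_f1": [None, None, None],          # pending for all models
        "parameters_M": [25.6, 5.3, 86.6],       # approximate published parameter counts
    }
)
print(results.to_string(index=False))
```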

REFERENCES
[1] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS).
[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).
[4] He, T., Zhang, L., & Wang, Y. (2023). Integrating Contextual Knowledge to Visual Features for Fine Art Classification. arXiv preprint.
[5] Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. International Conference on Machine Learning (ICML).

Dataset: https://www.kaggle.com/datasets/steubk/wikiart/data

