A Single Graph Convolution is All You Need: Efficient Grayscale Image Classification (2024)

Abstract

Image classifiers for domain-specific tasks like Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) and chest X-ray classification often rely on convolutional neural networks (CNNs). These networks, while powerful, experience high latency due to the number of operations they perform, which can be problematic in real-time applications. Many image classification models are designed to work with both RGB and grayscale datasets, but classifiers that operate solely on grayscale images are less common. Grayscale image classification has critical applications in fields such as medical imaging and SAR ATR. In response, we present a novel grayscale image classification approach that uses a vectorized view of images. Leveraging the lightweight nature of multi-layer perceptrons (MLPs), we treat each image as a vector and restrict the problem to single-channel grayscale classification. Our approach incorporates a single graph convolutional layer applied batch-wise, enhancing accuracy and reducing performance variance. Additionally, we develop a customized accelerator for our model on FPGA, incorporating several optimizations to improve performance. Experimental results on benchmark grayscale image datasets demonstrate the effectiveness of our approach, achieving significantly lower latency (up to 16× lower on MSTAR) and competitive or superior performance compared to state-of-the-art models for SAR ATR and medical image classification.

Index Terms— GCN, grayscale, MLP, low-latency

1 Introduction

As the demand and popularity of real-time systems increase, low-latency machine learning has become increasingly important. With more and more consumers interacting with machine learning models through the cloud, the speed at which those models can deliver results is critical. Consumers expect fast and accurate results; any latency can lead to a poor user experience. Moreover, low-latency machine learning is essential in real-time applications, such as autonomous vehicles or stock market trading, where decisions must be made quickly and accurately. In these scenarios, delays caused by high latency can result in severe consequences and even cause inaccurate downstream calculations [1].

A particular instance where low-latency machine learning is needed is grayscale image classification for SAR ATR. For example, a targeting system on a satellite is costly, and decisions must be made from SAR data efficiently and accurately. Examples like this are where low-latency grayscale image classification comes into play. Image classifiers often work on both RGB and grayscale datasets, but seldom do modern image classifiers focus solely on the grayscale setting. RGB models are overkill for the grayscale setting, where the problem reduces to a single channel. Models focusing on grayscale image classification are naturally more efficient, as they can concentrate on one channel rather than three. Thus, many image classifiers that generalize to grayscale image classification are not truly optimized for the grayscale case. For these reasons, we present a lightweight grayscale image classifier capable of achieving up to 16× lower latency than other state-of-the-art machine learning models on the MSTAR dataset.

From a trustworthy visual data processing perspective, the demand for grayscale image classification requires data to be collected from various domains with high resolution and correctness so that we can train a robust machine learning model. Additionally, recent advances in machine learning rely on convolutional neural networks, which often suffer from high computation costs and large memory requirements, resulting in poor inference latency, limited scalability, and weak trustworthiness.

The inherent novelties of our model are as follows. Our proposed method is the first to vectorize an image in a fully connected manner and feed the result into a single-layer graph convolutional network (GCN). We find that a single GCN layer is enough to stabilize the performance of our shallow model. Additionally, our proposed method benefits from a batch-wise attention term, allowing our shallow model to capture interdependencies between images and form connections for classification. Finally, by targeting grayscale imagery, we can design a streamlined method for single-channel classification rather than accommodating the RGB setting. A result of these novelties is extremely low latency and high throughput for SAR ATR and medical image classification.

  • We present a lightweight, graph-based neural network for grayscale image classification. Specifically, we (1) apply image vectorization, (2) construct a graph for each batch of images and apply a single graph convolution, and (3) propose a weighted-sum mechanism to capture batch-wise dependencies.

  • We implement our proposed method on FPGA, including the following design methodology: (1) a portable and parameterized hardware template using high-level synthesis, (2) layer-by-layer design to maximize runtime hardware resource utilization, and (3) a one-time data load strategy to reduce external memory accesses.

  • Experiments show that our model achieves competitive or leading accuracy with respect to other popular state-of-the-art models while vastly reducing latency and model complexity for SAR ATR and medical image classification.

  • We implement our model on a state-of-the-art FPGA board, Xilinx Alveo U200. Compared with a state-of-the-art GPU implementation, our FPGA implementation achieves comparable latency and throughput with only 1/41 of the peak performance and 1/10 of the memory bandwidth.

2 Problem Definition

The problem is to design a lightweight system capable of handling high volumes of data with low latency. The solution should be optimized for performance and scalability while minimizing resource utilization, a necessary component of many real-time machine learning applications. The system should be able to process and respond to requests quickly, with minimal delays. High throughput and low latency are critical requirements for this system, which must handle many concurrent requests without compromising performance. We define latency and throughput in the following ways:

$$\text{Throughput} = \frac{\text{Total number of images processed}}{\text{Total inference time}}$$
$$\text{Latency} = \text{Total time for a single inference}$$

Latency refers to the total time (from start to finish) it takes to gather predictions for a model in one batch. A lightweight machine learning model aims to maximize throughput and accuracy while minimizing latency.
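For concreteness, the following minimal PyTorch sketch measures both quantities for batch-wise inference; the model handle, input tensor, batch size, and device are placeholders and do not reflect the exact evaluation harness used in our experiments.

```python
import time
import torch

def measure(model, images, batch_size=64, device="cuda"):
    """Report latency (ms per batch inference) and throughput (imgs/ms)."""
    model = model.to(device).eval()
    batches = images.split(batch_size)          # split along the batch dimension
    total_time, total_images = 0.0, 0
    with torch.no_grad():
        for batch in batches:
            batch = batch.to(device)
            if device == "cuda":
                torch.cuda.synchronize()        # make sure timing brackets the kernel work
            start = time.perf_counter()
            model(batch)
            if device == "cuda":
                torch.cuda.synchronize()
            total_time += time.perf_counter() - start
            total_images += batch.shape[0]
    latency_ms = (total_time / len(batches)) * 1e3   # average time for a single (batch) inference
    throughput = total_images / (total_time * 1e3)   # images processed per millisecond
    return latency_ms, throughput
```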

3 Related Work

3.1 MLP Approaches

Our model combines various components of simple models and is inherently different from current works in low-latency image classification. Some recent architectures involve simple MLP-based models. Touvron et al. introduced ResMLP [2], an image classifier based solely on MLPs. ResMLP is trained in a self-supervised manner with patches interacting together. Touvron et al. highlight their model's high throughput and accuracy. ResMLP takes patches from the image and alternates linear layers, where patches interact, with a two-layer feed-forward network, where channels interact independently per patch. Additionally, MLP-Mixer [3] uses a similar patching method and attains competitive accuracy on RGB image datasets compared to other CNNs and transformer models. Our proposed method feeds the output of a single-layer MLP into a graph neural network, discarding the three-channel RGB setting and considering only the single-channel grayscale problem. This is inherently different from the methods mentioned earlier: they use patching approaches, while we focus on the vectorization of pixels.

3.2 Graph Image Construction Methods

The dense graph mapping, which uses each pixel as a node in a graph, is described in [4, 5]. We adopt the same terminology in this paper. Additionally, Zhang et al. presented a novel graph neural network architecture and examined its low-latency properties on the MSTAR dataset using the dense graph [6]. Our proposed method differs from dense graph methods, as we vectorize an image rather than using the entire pixel grid as a graph.

Han et al. [7] form a graph from the image by splitting the image into patches, much like a transformer. A deep graph neural network then learns on the patches, similarly to a transformer but in a graph structure. Our structure does not form a graph where each patch is a node; instead, we create a graph from the result of a vectorized image passed through a fully connected layer.

Mondal et al. proposed applying graph neural networks on a minibatch of images [8] and claim that this method improves robustness to adversarial attacks. We adopt this idea to stabilize the performance of a highly shallow model. The graph neural network, in this case, allows learning to be conducted in graph form, connecting images that contain similar qualities.

Besides the model proposed by Zhang et al., all the methods mentioned focus on the RGB setting. This is overkill for grayscale image classification. Focusing on a single channel allows us to develop a more streamlined solution rather than forcing a model to operate on RGB datasets and having the grayscale setting come as an afterthought. Doing so allows us to reduce computational costs.

4 Overview and Architecture

This section describes our model architecture (GECCO: Grayscale Efficient Channelwise Classification Operative). The overall process is summarized in Figure 1.

[Figure 1: Overall architecture of the proposed model (GECCO).]

Overall Architecture. Many existing methods do not focus on the latency of their design and its implications. Additionally, the vast majority of image classification models focus on performance in the RGB setting, rarely reporting performance on datasets from other domains. We address these problems by presenting a novel architecture focused on low latency and the grayscale image setting.

Our model vectorizes a batch of images, allowing us to use a fully connected (FC) layer pixel-wise for low computation time rather than relying on convolutional neural networks. We vectorize the input images and feed them into a fully connected layer. Then, we use a graph convolutional layer to learn similarities between images batch-wise. Finally, we apply a batch-wise attention term, whose output is fed into an FC layer for classification. (We make our code publicly available at https://github.com/GECCOProject/GECCO.)

Image Vectorization. For each image in a batch, we view the image as a vector. For a tensor $\mathbf{X} \in \mathbb{R}^{B \times H \times W}$, where $B$ is the batch size and $H$ and $W$ are the height and width of an image, we flatten the tensor to $\mathbf{X}_1 \in \mathbb{R}^{B \times (H \cdot W)}$. Viewing an image as a vector allows our model to skip the traditional convolutional neural network, which views the image as a grid, and cuts computation time.

Fully Connected Layer. We input $\mathbf{X}_1$ into an FC layer with output dimensionality $D_{out}$. Formally, $\mathbf{X}_2 = \sigma(\mathbf{X}_1 \mathbf{W}_1 + \mathbf{b}_1)$, where $\sigma$ is the ReLU function, $\mathbf{W}_1$ is a learned weight matrix, $\mathbf{b}_1$ is a bias term, and $\mathbf{X}_2 \in \mathbb{R}^{B \times D_{out}}$.

After the fully connected layer, we apply a dropout layer and the ReLU function to $\mathbf{X}_2$, yielding $\mathbf{X}_3$, such that the resulting dimensionality of $\mathbf{X}_3$ is $\mathbb{R}^{B \times D_{out}}$.
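A minimal PyTorch sketch of these first two stages (vectorization followed by the FC layer, dropout, and ReLU) is given below; the dropout probability is an assumed value, since it is a tunable hyperparameter not fixed in the text.

```python
import torch
import torch.nn as nn

class VectorizeAndProject(nn.Module):
    """Flatten each grayscale image to a vector, then project it to D_out features."""
    def __init__(self, height, width, d_out, p_drop=0.5):  # p_drop is an assumed value
        super().__init__()
        self.fc = nn.Linear(height * width, d_out)          # W_1, b_1
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                    # x: (B, H, W)
        x1 = x.flatten(start_dim=1)          # (B, H*W) -- image vectorization
        x2 = torch.relu(self.fc(x1))         # X_2 = ReLU(X_1 W_1 + b_1), shape (B, D_out)
        x3 = torch.relu(self.drop(x2))       # dropout followed by ReLU, shape (B, D_out)
        return x3
```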

Graph Construction. We construct a graph batch-wise from $\mathbf{X}_3$. For each batch, every vectorized image is a node in the graph with feature size $\mathbb{R}^{D_{out}}$, and each image is connected to every other image in the batch. Formally, we set the adjacency matrix $\mathbf{A}$ to $\mathbf{A}_{ij} = 1$ for all $i, j$, which connects all nodes.

Graph Convolution. Our single graph convolutional layer learns from similar features of images within its minibatch. Generally, a graph convolutional layer updates the representation of each node by aggregating its neighbors' representations. We can write a graph convolutional layer as $\bm{h}^{\prime}_i = f_{\theta}\left(\bm{h}_i, \text{AGGREGATE}\left(\{\bm{h}_j \mid j \in \mathcal{N}_i\}\right)\right)$. In our case, the input for each node $\bm{h}_i$ is the corresponding vector of $\mathbf{X}_3$.

Applying the graph convolution to $\mathbf{X}_3$ yields $\mathbf{X}_4$. Formally, $\mathbf{X}_4 = \sigma\left(\mathbf{A}\mathbf{X}_3\mathbf{W}_2\right)$, where $\mathbf{W}_2$ is a learned weight matrix and $\sigma$ is the sigmoid function. After the graph convolution, we apply batch normalization and max-pooling operations to $\mathbf{X}_4$, resulting in a dimensionality of $\mathbb{R}^{B \times \lfloor D_{out}/2 \rfloor}$.
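Because the batch graph is fully connected ($\mathbf{A}_{ij} = 1$), the aggregation reduces to a sum over all images in the batch. The sketch below illustrates this single graph convolution followed by batch normalization and max pooling; the pooling kernel size of 2 is an assumption consistent with the stated output dimensionality.

```python
import torch
import torch.nn as nn

class BatchGraphConv(nn.Module):
    """Single graph convolution over a fully connected batch graph (A_ij = 1)."""
    def __init__(self, d_out):
        super().__init__()
        self.w2 = nn.Linear(d_out, d_out, bias=False)   # learned weight W_2
        self.bn = nn.BatchNorm1d(d_out)
        self.pool = nn.MaxPool1d(kernel_size=2)         # halves the feature length (assumed kernel size)

    def forward(self, x3):                              # x3: (B, D_out)
        b = x3.shape[0]
        a = torch.ones(b, b, device=x3.device)          # all-ones adjacency matrix
        x4 = torch.sigmoid(a @ self.w2(x3))             # X_4 = sigmoid(A X_3 W_2), shape (B, D_out)
        x4 = self.bn(x4)                                 # batch normalization
        x4 = self.pool(x4.unsqueeze(1)).squeeze(1)       # (B, floor(D_out / 2))
        return x4
```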

Batch-wise Attention, Residual Connections, & Output. We propose a batch-wise attention term defined as

$$\mathbf{X}_5 = \left(\frac{\sigma\left(\mathbf{X}_4 \mathbf{X}_4^{\top}\right)}{\sum_{i=1}^{B} \sigma\left(\mathbf{X}_4 \mathbf{X}_4^{\top}\right)_i}\right) \mathbf{X}_4$$

where $\sigma$ is the sigmoid function. This term allows the model to capture features that are similar from one image to another batch-wise.

The residual connection is defined as $\mathbf{X}_6 = \mathbf{X}_5 + \mathbf{X}_4$. The residual term makes the learning process easier and more stable. By multiplying a softmax-like term with the output of the previous graph convolution, we weigh each image's correspondence to other similar images within the batch. We then feed the residual term into an FC layer whose output is passed to the softmax function for classification.
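The sketch below combines the batch-wise attention term, the residual connection, and the classification head; the normalization axis of the softmax-like term is one reading of the formula above, and num_classes is dataset-dependent.

```python
import torch
import torch.nn as nn

class BatchAttentionHead(nn.Module):
    """Batch-wise attention, residual connection, and FC classifier."""
    def __init__(self, d_feat, num_classes):
        super().__init__()
        self.classifier = nn.Linear(d_feat, num_classes)

    def forward(self, x4):                                 # x4: (B, d_feat)
        scores = torch.sigmoid(x4 @ x4.T)                  # (B, B) pairwise image similarities
        weights = scores / scores.sum(dim=0, keepdim=True) # softmax-like normalization over the batch
        x5 = weights @ x4                                  # batch-wise attention term X_5
        x6 = x5 + x4                                       # residual connection X_6 = X_5 + X_4
        logits = self.classifier(x6)                       # single FC layer
        return torch.softmax(logits, dim=-1)               # class probabilities
```

Composed with the earlier stages, a batch therefore flows as $\mathbf{X} \in \mathbb{R}^{B \times H \times W} \rightarrow \mathbf{X}_3 \in \mathbb{R}^{B \times D_{out}} \rightarrow \mathbf{X}_4 \in \mathbb{R}^{B \times \lfloor D_{out}/2 \rfloor} \rightarrow$ class probabilities.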

Model Structure Discussion

We justify our model’s design choices by considering the following theoretical aspects.

  1. The batch-wise attention term allows the model to further capture features that are similar from one image to another batch-wise. Relating similar properties of images to each other boosts accuracy in our case of a shallow model. Additionally, our batch-wise attention term is similar in spirit to the mechanism proposed by [9], which allows the model to capture long-range dependencies across the entire image.

  2. The batch-size hyperparameter is crucial in our model. A larger batch size allows the model to capture more dependencies across images, which is crucial for understanding complex image patterns. We refer to the work of [10] for a detailed analysis of the impact of batch size on the performance of GNNs.

  3. If the batch size for a given dataset is 1, the model eliminates the graph construction phase, and the term $\mathbf{X}_3$ is fed directly into the FC layer and softmax for classification.

  4. The residual connection term makes the learning process easier and more stable. We refer to [11] for a more detailed analysis of the impact of residual connections on shallow models.

5 Experiments

5.1 Datasets

Datasets from several domains are examined to gauge the effectiveness of GECCO in diverse settings. We use the SAR ATR dataset MSTAR and a medical imaging dataset, CXR [12].

  • MSTAR is a SAR ATR dataset with a training size of 2747 and a testing size of 2425 SAR images across 10 different vehicle categories. We resize each image in the dataset to (128, 128) pixels.

  • CXR is a chest X-ray dataset containing 5863 X-ray images in 2 categories (Pneumonia/Normal). The images are (224, 224) pixels. The training size is 5216, and the testing size is 624. (A loading and preprocessing sketch follows this list.)
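A possible loading and preprocessing pipeline for these grayscale datasets is sketched below; the use of torchvision, the ImageFolder directory layout, and everything beyond the stated image sizes are assumptions.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical loader: images are converted to single-channel tensors and resized
# to the sizes stated above (128x128 for MSTAR, 224x224 for CXR).
mstar_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((128, 128)),
    transforms.ToTensor(),          # yields tensors of shape (1, H, W); squeeze the
])                                  # channel dimension before vectorization

train_set = datasets.ImageFolder("data/mstar/train", transform=mstar_transform)  # assumed directory layout
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
```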

Our goal is to create a real-time system. That is, we wish to minimize the inference latency and maximize the throughput of our model while maintaining leading or competitive accuracy on each dataset. In the following sections, we measure inference latency and throughput as described in Section 2.

5.2 Results

5.2.1 Backbone

For Table 1, we choose ResConv as the backbone of our model because it has the most desirable characteristics for applying a graph convolutional layer.

Table 1: Comparison of graph convolution backbones (MSTAR).
Convolutional Layer | Top-1 Accuracy | Throughput (imgs/ms) | Latency (ms)
GCN [13]      | 98.89% | 50.04 ± 6.85 | 5.86 ± 0.98
TAGConv [14]  | 99.05% | 47.87 ± 7.11 | 6.24 ± 1.32
SAGEConv [15] | 99.08% | 51.77 ± 8.19 | 5.95 ± 0.87
ChebConv [16] | 98.56% | 45.37 ± 5.99 | 6.83 ± 1.27
ResConv [17]  | 99.29% | 52.98 ± 9.04 | 5.22 ± 1.03

We use the hyperparameters listed in Table 2 for our experiments.

Table 2: Hyperparameters used in our experiments.
Dataset | Feature Length | Optimizer | Batch Size
MSTAR   | 86             | Adam      | 64
CXR     | 112            | Adam      | 64

5.2.2 Experimental Performance

Experimental performance includes the top-1 accuracy, inference throughput, and inference latency. We perform our inference batch-wise as a means to reduce latency. These metrics vary across each dataset.

We summarize our findings in Tables 3 and 4. We report the best-performing accuracy, average throughputs, and latencies with their standard deviations. Our model outperforms every other model in terms of throughput and latency across all datasets, leads accuracy on the MSTAR dataset, and performs competitively in terms of accuracy on all datasets.

We perform the remaining experiments on a state-of-the-art NVIDIA RTX A5000 GPU. We compare our model to the top-performing variants of VGG [18], the popular ViT [19], the ViT for small-size datasets (SS-ViT) [20], FastViT [21], Swin Transformer [22], and ResNet [23]. We use the open-source packages PyTorch and HuggingFace for model building and the PyTorch Op-Counter for operation counting. Running all experiments on the same hardware system is vital for a fair comparison across models.

Table 3: Performance comparison on MSTAR.
Model    | Top-1 Accuracy | Throughput (imgs/ms) | Latency (ms)
Swin-T   | 86.04% | 1.36 ± 0.10  | 46.98 ± 3.20
SS-ViT   | 95.61% | 2.29 ± 0.43  | 27.97 ± 5.26
VGG16    | 93.13% | 1.69 ± 0.33  | 37.89 ± 7.52
FastViT  | 91.78% | 1.04 ± 0.13  | 61.44 ± 7.69
ResNet34 | 98.64% | 3.13 ± 0.22  | 20.48 ± 1.39
GECCO    | 99.29% | 12.26 ± 2.42 | 5.22 ± 1.03

Table 4: Performance comparison on CXR.
Model    | Top-1 Accuracy | Throughput (imgs/ms) | Latency (ms)
Swin-T   | 73.66% | 0.27 ± 0.05 | 236.71 ± 46.09
SS-ViT   | 71.09% | 1.03 ± 0.21 | 62.35 ± 12.85
VGG16    | 82.01% | 0.76 ± 0.25 | 84.10 ± 28.43
FastViT  | 75.46% | 1.06 ± 0.14 | 60.30 ± 14.24
ResNet34 | 78.31% | 0.60 ± 0.11 | 105.84 ± 19.39
GECCO    | 77.57% | 2.63 ± 0.55 | 24.32 ± 5.08

5.2.3 Model Complexity Metrics

Model complexity metrics for this paper include the number of multiply-accumulate operations (MACs), the number of model parameters, the model size, and the number of layers. In other words, suppose an accumulator $a$ counts an operation over arbitrary $b, c \in \mathbb{R}$; we count one multiply-accumulate operation as $a \leftarrow a + (b \times c)$. Additionally, the layer count is an essential factor of latency: decreasing the number of layers also improves a model's inference latency. The goal of an effective machine learning model, in our case, is to maximize throughput while minimizing the number of MACs and the number of layers.
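As a toy illustration of this counting rule, the sketch below multiplies two small matrices while accumulating one MAC per product-accumulate; profilers such as the PyTorch Op-Counter report the same quantity without the explicit loops.

```python
import numpy as np

def matmul_with_mac_count(B, C):
    """Multiply two matrices while counting multiply-accumulate operations."""
    n, k = B.shape
    k2, m = C.shape
    assert k == k2, "inner dimensions must match"
    A = np.zeros((n, m))
    macs = 0
    for i in range(n):
        for j in range(m):
            for t in range(k):
                A[i, j] += B[i, t] * C[t, j]   # a <- a + (b * c)
                macs += 1                       # one MAC per product-accumulate
    return A, macs

_, macs = matmul_with_mac_count(np.ones((4, 3)), np.ones((3, 2)))
print(macs)  # 24 MACs for a (4x3)(3x2) product
```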

We measure the model complexity of our model against the other popular machine learning models chosen in Table 5. Our model outperforms in all of our chosen model complexity metrics, highlighting its lightweight nature.

Table 5: Model complexity comparison.
Model    | # MACs        | # Parameters | Model Size (Mb) | # Layers
Swin-T   | 2.12 × 10^10  | 2.75 × 10^7  | 109.9           | 167
SS-ViT   | 1.55 × 10^10  | 4.85 × 10^6  | 19.62           | 79
VGG16    | 9.51 × 10^9   | 4.69 × 10^6  | 18.75           | 20
FastViT  | 7.16 × 10^8   | 4.02 × 10^6  | 16.1            | 226
ResNet34 | 4.47 × 10^9   | 2.13 × 10^7  | 85.1            | 92
GECCO    | 5.10 × 10^4   | 5.08 × 10^4  | 0.19            | 16

5.2.4 Ablation Study

We perform an ablation study to verify that the components of our proposed model contribute positively to the overall accuracy of the MSTAR dataset.

Ablation study of model components (accuracy on MSTAR).
Mini-batch GNN | Weighted-Sum Residual Term | Accuracy on MSTAR
99.29%
97.94%
88.04%
78.64%

Additionally, we find that only a single graph convolutional layer is enough to reduce the variance and increase the accuracy of our model. Refer to Figure 2.

[Figure 2: Accuracy and variance versus the number of graph convolutional layers.]

5.3 Discussion

Across multiple datasets, GECCO achieves leading or competitive accuracy compared to other state-of-the-art image classifiers. GECCO outperforms other machine learning models regarding model complexity, highlighting our model’s low latency and lightweight properties.

It is difficult for our model to generalize to the RGB setting. We attribute this challenge to the vectorization process that our model uses: because GECCO is very shallow and simple, learning across three separate channels poses a complexity challenge. Additionally, our model is optimized for a low-complexity dataset regime; datasets such as CIFAR and ImageNet are too complex for our shallow model.

Our proposed method does not make use of positional embeddings or class tokens. GECCO can learn essential features using the weighted residual term. Additionally, we tested the addition of positional embeddings and class tokens and found no improvement in accuracy across various datasets. We note that the $\mathbf{X}_5$ attention-like term adds positional awareness to the model.

5.4 FPGA Implementation

We develop an accelerator for the proposed model on a state-of-the-art FPGA, the Xilinx Alveo U200 [24], to further highlight the model's efficiency and compatibility with hardware. The board has 3 Super Logic Regions (SLRs), 4 DDR memory banks, 1182k look-up tables (LUTs), 6840 DSPs, 75.9 Mb of BRAM, and 270 Mb of URAM. The FPGA kernels are developed using the Xilinx High-Level Synthesis (HLS) tool to expedite the design process.

Our FPGA design incorporates several novel features: (1) Portability of the design: We design a parameterized hardware template using HLS. It is portable to different FPGA platforms, including embedded and data-center FPGAs. We present our hardware mapping algorithm in Algorithm 1. (2) Resource sharing: The model is executed layer-by-layer. Each layer in the model is decomposed into basic kernel functions. The basic kernel functions, including matrix multiplication, elementwise activation, column-wise and row-wise summations, max pooling, and various other elementwise operations, are implemented separately and subsequently invoked within their corresponding layers. Due to the reuse of these fundamental kernel functions across multiple layers, FPGA resources are shared among the different layers, maximizing runtime hardware resource utilization. (3) Single-load strategy: We employ a one-time data load strategy to load the required data from DDR only once. All other data required for the computations are stored in on-chip memory, reducing inference latency. Figure 3 illustrates the overall hardware architecture of our design.

We utilize the Vitis tool [25] for hardware synthesis and place-and-route to determine the achieved frequency. The Vitis Analyzer tool is then used to report resource utilization and the number of clock cycles. The latency is calculated by multiplying the achieved clock period by the number of cycles. Table 8 reports the results obtained for the MSTAR dataset. Given the compact design and resource efficiency of the model, it can be accommodated within a single SLR. Hence, we deploy one accelerator instance per SLR, which increases inference throughput. Table 8 shows the latency obtained for a single inference and the throughput achieved by running the design on 3 SLRs concurrently.

Table 7: Comparison of hardware platforms.
                              | GPU           | Our Design
Platform                      | NVIDIA A5000  | Alveo U200
Technology                    | Samsung 8 nm  | TSMC 16 nm
Frequency                     | 1.17 GHz      | 200 MHz
Peak Performance (TFLOPS)     | 27.7          | 0.66
On-chip Memory                | 6 MB          | 35 MB
Memory Bandwidth              | 768 GB/s      | 77 GB/s
Latency on MSTAR (ms)         | 5.22          | 5.65
Throughput on MSTAR (imgs/ms) | 12.26         | 33.98

Algorithm 1: Hardware mapping of the model onto kernel units.

Input: Model f() and the input images
Output: Execution result
for each layer i in f() do
    if layer i is a fully connected layer then
        Map to the matrix multiplication unit
    if layer i is a graph convolution layer then
        Map to the matrix multiplication unit
        Map to the activation unit
        Map to the elementwise operation unit
        Map to the matrix addition unit
    if layer i is a batch-wise attention layer then
        Map to the matrix multiplication unit
        Map to the activation unit
        Map to the elementwise operation unit
    if layer i is a max pooling layer then
        Map to the max pooling unit
    if layer i is an activation layer then
        Map to the activation unit
    if layer i is a batch normalization layer then
        Map to the batch normalization unit
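A compact Python rendering of this mapping is given below; the layer-type and kernel-unit names are illustrative labels, not identifiers from the actual HLS implementation.

```python
# Hypothetical layer-type -> kernel-unit mapping mirroring Algorithm 1.
KERNEL_MAP = {
    "fully_connected":      ["matrix_multiplication"],
    "graph_convolution":    ["matrix_multiplication", "activation",
                             "elementwise", "matrix_addition"],
    "batch_wise_attention": ["matrix_multiplication", "activation", "elementwise"],
    "max_pooling":          ["max_pooling"],
    "activation":           ["activation"],
    "batch_normalization":  ["batch_normalization"],
}

def map_model_to_kernels(layers):
    """Return, for each layer type in execution order, the shared kernel units it runs on."""
    return [(layer, KERNEL_MAP[layer]) for layer in layers]

# Example: the shared matrix multiplication unit serves the FC, graph convolution,
# and batch-wise attention layers, which is what enables resource sharing.
print(map_model_to_kernels(["fully_connected", "graph_convolution", "batch_wise_attention"]))
```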

We compare our FPGA implementation with the baseline GPU implementation. The GPU baseline is executed on an NVIDIA RTX A5000, which operates at 1170 MHz and has a memory bandwidth of 768 GB/s, whereas the FPGA operates at 200 MHz with an external memory bandwidth of 77 GB/s. We compare the hardware features of the two platforms in Table 7. Although the GPU has 41× higher peak performance and 10× higher memory bandwidth, our FPGA implementation achieves a comparable latency of 5.65 ms and an improved throughput of 33.98 imgs/ms.

[Figure 3: Overall hardware architecture of our FPGA design.]

Table 8: FPGA implementation results and resource utilization on MSTAR.
Latency    | 5.65 ms
Throughput | 33.98 imgs/ms
BRAMs      | 956 (22%)
URAMs      | 228 (24%)
DSPs       | 1226 (17%)
LUTs       | 459K (38%)
FFs        | 597K (25%)

6 Conclusion and Future Work

This work introduced a novel architecture combining fully connected and graph convolutional layers, benchmarked on popular grayscale image datasets. The model demonstrated strong performance and low complexity, highlighting the importance of lightweight, low-latency image classifiers for various applications. Its efficacy was shown across SAR ATR and medical image classification, with an FPGA implementation underscoring its hardware friendliness. Key innovations include using a single-layer GCN, which, along with batch-wise attention, enhances accuracy and reduces variance. Future work should explore extending this approach to color image datasets and other domains, optimizing the architecture for even greater efficiency, and further investigating the potential of graph neural networks in shallow models.

7 Acknowledgement

This work is supported by the DEVCOM Army Research Lab (ARL) under grant W911NF2220159. Distribution Statement A: Approved for public release. Distribution is unlimited.

References

  • [1] Kaoru Ota, Minh Son Dao, Vasileios Mezaris, and Francesco G. B. De Natale, "Deep learning for mobile multimedia: A survey," ACM Trans. Multimedia Comput. Commun. Appl., vol. 13, no. 3s, Jun. 2017.
  • [2] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou, "ResMLP: Feedforward networks for image classification with data-efficient training," 2021.
  • [3] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy, "MLP-Mixer: An all-MLP architecture for vision," 2021.
  • [4] Benjamin Sanchez-Lengeling, Emily Reif, Adam Pearce, and Alexander B. Wiltschko, "A gentle introduction to graph neural networks," Distill, 2021, https://distill.pub/2021/gnn-intro.
  • [5] Naman Goyal and David Steiner, "Graph neural networks for image classification and reinforcement learning using graph representations," 2022.
  • [6] Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, and Carl Busart, "Accurate, low-latency, efficient SAR automatic target recognition on FPGA," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), Aug. 2022, IEEE.
  • [7] Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu, "Vision GNN: An image is worth graph of nodes," in NeurIPS, 2022.
  • [8] Arnab Kumar Mondal, Vineet Jain, and Kaleem Siddiqi, "Mini-batch graphs for robust image classification," 2021.
  • [9] Qishang Cheng, Hongliang Li, Qingbo Wu, and King Ngi Ngan, "BA2M: A batch aware attention module for image classification," 2021.
  • [10] Yaochen Hu, Amit Levi, Ishaan Kumar, Yingxue Zhang, and Mark Coates, "On batch-size selection for stochastic training for graph neural networks," 2021.
  • [11] Shuzhi Yu and Carlo Tomasi, "Identity connections in residual nets improve noise stability," 2019.
  • [12] Daniel Kermany, "Labeled optical coherence tomography (OCT) and chest X-ray images for classification," 2018.
  • [13] Thomas N. Kipf and Max Welling, "Semi-supervised classification with graph convolutional networks," 2017.
  • [14] Jian Du, Shanghang Zhang, Guanhang Wu, Jose M. F. Moura, and Soummya Kar, "Topology adaptive graph convolutional networks," 2018.
  • [15] William L. Hamilton, Rex Ying, and Jure Leskovec, "Inductive representation learning on large graphs," 2018.
  • [16] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," 2017.
  • [17] Xavier Bresson and Thomas Laurent, "Residual gated graph ConvNets," 2018.
  • [18] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015.
  • [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2021.
  • [20] Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song, "Vision transformer for small-size datasets," CoRR, vol. abs/2112.13492, 2021.
  • [21] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan, "FastViT: A fast hybrid vision transformer using structural reparameterization," 2023.
  • [22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," 2021.
  • [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [24] Xilinx, "Xilinx Alveo U200 board," https://docs.xilinx.com/r/en-US/ds962-u200-u250/FPGA-Resource-Information.
  • [25] "Vitis HLS," https://www.xilinx.com/products/design-tools/vitis/vitis-hls.html.