A Single Graph Convolution is All You Need: Efficient Grayscale Image Classification (2024)

Abstract

Image classifiers for domain-specific tasks like Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) and chest X-ray classification often rely on convolutional neural networks (CNNs). These networks, while powerful, experience high latency due to the number of operations they perform, which can be problematic in real-time applications. Many image classification models are designed to work with both RGB and grayscale datasets, but classifiers that operate solely on grayscale images are less common. Grayscale image classification has critical applications in fields such as medical imaging and SAR ATR. In response, we present a novel grayscale image classification approach that uses a vectorized view of images. Leveraging the lightweight nature of multi-layer perceptrons (MLPs), we treat each image as a vector and restrict the problem to single-channel grayscale classification. Our approach incorporates a single graph convolutional layer applied batch-wise, enhancing accuracy and reducing performance variance. Additionally, we develop a customized accelerator for our model on FPGA, incorporating several optimizations to improve performance. Experimental results on benchmark grayscale image datasets demonstrate the effectiveness of our approach, achieving significantly lower latency (up to 16× lower on MSTAR) and competitive or superior performance compared to state-of-the-art models for SAR ATR and medical image classification.

Index Terms— GCN, grayscale, MLP, low-latency

1 Introduction

As the demand and popularity of real-time systems increase, low-latency machine learning has become increasingly important. With more and more consumers interacting with machine learning models through the cloud, the speed at which those models can deliver results is critical. Consumers expect fast and accurate results; any latency can lead to a poor user experience. Moreover, low-latency machine learning is essential in real-time applications, such as autonomous vehicles or stock market trading, where decisions must be made quickly and accurately. In these scenarios, delays caused by high latency can result in severe consequences and even cause inaccurate downstream calculations [1].

A particular instance where low-latency machine learning is needed is grayscale image classification for SAR ATR. For example, a targeting system on a satellite is costly, and decisions must be made from SAR data efficiently and accurately. Examples like this are where low-latency grayscale image classification comes into play. Image classifiers often work on both RGB and grayscale datasets, but seldom do modern image classifiers focus solely on the grayscale setting. RGB models are overkill for the grayscale setting, where the problem reduces to a single channel. Models focusing on grayscale image classification are naturally more efficient, as they can concentrate on one channel rather than three. Thus, many image classifiers that generalize to grayscale image classification are not truly optimized for the grayscale case. For these reasons, we present a lightweight grayscale image classifier capable of achieving up to 16× lower latency than other state-of-the-art machine learning models on the MSTAR dataset.

From a trustworthy visual data processing perspective, the demand for grayscale image classification requires data to be collected from various domains with high resolution and correctness so that we can train a robust machine learning model. Additionally, recent advances in machine learning rely on convolutional neural networks, which often suffer from high computation costs and large memory requirements, resulting in poor inference latency, limited scalability, and weak trustworthiness.

The inherent novelties of our model are as follows. Our proposed method is the first to vectorize an image in a fully connected manner and feed the result into a single-layer graph convolutional network (GCN). We find that a single GCN layer is enough to stabilize the performance of our shallow model. Additionally, our proposed method benefits from a batch-wise attention term, allowing our shallow model to capture interdependencies between images and form connections for classification. Finally, by targeting grayscale imagery, we can design a streamlined method for single-channel classification rather than accommodating the RGB setting. A result of these novelties is extremely low latency and high throughput for SAR ATR and medical image classification.

  • We present a lightweight, graph-based neural network for grayscale image classification. Specifically, we (1) apply image vectorization, (2) construct a graph for each batch of images and apply a single graph convolution, and (3) propose a weighted-sum mechanism to capture batch-wise dependencies.

  • We implement our proposed method on FPGA, including the following design methodology: (1) a portable and parameterized hardware template using high-level synthesis, (2) layer-by-layer design to maximize runtime hardware resource utilization, and (3) a one-time data load strategy to reduce external memory accesses.

  • Experiments show that our model achieves competitive or leading accuracy with respect to other popular state-of-the-art models while vastly reducing latency and model complexity for SAR ATR and medical image classification.

  • We implement our model on a state-of-the-art FPGA board, Xilinx Alveo U200. Compared with a state-of-the-art GPU implementation, our FPGA implementation achieves comparable latency and throughput with only 1/41 of the peak performance and 1/10 of the memory bandwidth.

2 Problem Definition

The problem is to design a lightweight system capable of handling high volumes of data with low latency. The solution should be optimized for performance and scalability while minimizing resource utilization, a necessary component of many real-time machine learning applications. The system should be able to process and respond to requests quickly, with minimal delays. High throughput and low latency are critical requirements for this system, which must handle many concurrent requests without compromising performance. We define latency and throughput in the following ways:

$$\text{Throughput} = \frac{\text{Total number of images processed}}{\text{Total inference time}}$$
$$\text{Latency} = \text{Total time for a single inference}$$

Latency refers to the total time (from start to finish) it takes to gather predictions for a model in one batch. A lightweight machine learning model aims to maximize throughput and accuracy while minimizing latency.
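For concreteness, the following minimal PyTorch sketch measures both quantities for batch-wise inference; the model handle, input tensor, batch size, and device are placeholders and do not reflect the exact evaluation harness used in our experiments.

```python
import time
import torch

def measure(model, images, batch_size=64, device="cuda"):
    """Report latency (ms per batch inference) and throughput (imgs/ms)."""
    model = model.to(device).eval()
    batches = images.split(batch_size)          # split along the batch dimension
    total_time, total_images = 0.0, 0
    with torch.no_grad():
        for batch in batches:
            batch = batch.to(device)
            if device == "cuda":
                torch.cuda.synchronize()        # make sure timing brackets the kernel work
            start = time.perf_counter()
            model(batch)
            if device == "cuda":
                torch.cuda.synchronize()
            total_time += time.perf_counter() - start
            total_images += batch.shape[0]
    latency_ms = (total_time / len(batches)) * 1e3   # average time for a single (batch) inference
    throughput = total_images / (total_time * 1e3)   # images processed per millisecond
    return latency_ms, throughput
```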

3 Related Work

3.1 MLP Approaches

Our model combines various components of simple models and is inherently different from current works in low-latency image classification. Some recent architectures involve simple MLP-based models. Touvron et al. introduced ResMLP [2], an image classifier based solely on MLPs. ResMLP is trained in a self-supervised manner with patches interacting together. Touvron et al. highlight their model's high throughput and accuracy. ResMLP takes patches from the image and alternates linear layers, where patches interact, with a two-layer feed-forward network, where channels interact independently per patch. Additionally, MLP-Mixer [3] uses a similar patching method and attains competitive accuracy on RGB image datasets compared to other CNNs and transformer models. Our proposed method feeds the output of a single-layer MLP into a graph neural network, discarding the three-channel RGB setting and considering only the single-channel grayscale problem. This is inherently different from the methods mentioned earlier: they use patching approaches, while we focus on the vectorization of pixels.

3.2 Graph Image Construction Methods

The dense graph mapping, which uses each pixel as a node in a graph, is described in [4, 5]. We adopt the same terminology in this paper. Additionally, Zhang et al. presented a novel graph neural network architecture and examined its low-latency properties on the MSTAR dataset using the dense graph [6]. Our proposed method differs from dense graph methods, as we vectorize an image rather than using the entire pixel grid as a graph.

Han et al. [7] form a graph from the image by splitting the image into patches, much like a transformer. A deep graph neural network then learns on the patches, similarly to a transformer but in a graph structure. Our structure does not form a graph where each patch is a node; instead, we create a graph from the result of a vectorized image passed through a fully connected layer.

Mondal et al. proposed applying graph neural networks on a minibatch of images [8] and claim that this method improves robustness to adversarial attacks. We adopt this idea to stabilize the performance of a highly shallow model. The graph neural network, in this case, allows learning to be conducted in graph form, connecting images that contain similar qualities.

Besides the model proposed by Zhang et al., all the methods mentioned focus on the RGB setting. This is overkill for grayscale image classification. Focusing on a single channel allows us to develop a more streamlined solution rather than forcing a model to operate on RGB datasets and having the grayscale setting come as an afterthought. Doing so allows us to reduce computational costs.

4 Overview and Architecture

This section describes our model architecture (GECCO: Grayscale Efficient Channelwise Classification Operative). The overall process is summarized in Figure 1.

[Figure 1: Overall architecture of the proposed model (GECCO).]

Overall Architecture. Many existing methods do not focus on the latency of their design and its implications. Additionally, the vast majority of image classification models focus on performance in the RGB setting, rarely reporting performance on datasets from other domains. We address these problems by presenting a novel architecture focused on low latency and the grayscale image setting.

Our model vectorizes a batch of images, allowing us to use a fully connected (FC) layer pixel-wise for low computation time rather than relying on convolutional neural networks. We vectorize the input images and feed them into a fully connected layer. Then, we use a graph convolutional layer to learn similarities between images batch-wise. Finally, we apply a batch-wise attention term, whose output is fed into an FC layer for classification. (We make our code publicly available at https://github.com/GECCOProject/GECCO.)

Image Vectorization. For each image in a batch, we view the image as a vector. For a tensor $\mathbf{X} \in \mathbb{R}^{B \times H \times W}$, where $B$ is the batch size and $H$ and $W$ are the height and width of an image, we flatten the tensor to $\mathbf{X}_1 \in \mathbb{R}^{B \times (H \cdot W)}$. Viewing an image as a vector allows our model to skip the traditional convolutional neural network, which views the image as a grid, and cuts computation time.

Fully Connected Layer. We input $\mathbf{X}_1$ into an FC layer with output dimensionality $D_{out}$. Formally, $\mathbf{X}_2 = \sigma(\mathbf{X}_1 \mathbf{W}_1 + \mathbf{b}_1)$, where $\sigma$ is the ReLU function, $\mathbf{W}_1$ is a learned weight matrix, $\mathbf{b}_1$ is a bias term, and $\mathbf{X}_2 \in \mathbb{R}^{B \times D_{out}}$.

After the fully connected layer, we apply a dropout layer and the ReLU function to $\mathbf{X}_2$, yielding $\mathbf{X}_3$, such that the resulting dimensionality of $\mathbf{X}_3$ is $\mathbb{R}^{B \times D_{out}}$.
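A minimal PyTorch sketch of these first two stages (vectorization followed by the FC layer, dropout, and ReLU) is given below; the dropout probability is an assumed value, since it is a tunable hyperparameter not fixed in the text.

```python
import torch
import torch.nn as nn

class VectorizeAndProject(nn.Module):
    """Flatten each grayscale image to a vector, then project it to D_out features."""
    def __init__(self, height, width, d_out, p_drop=0.5):  # p_drop is an assumed value
        super().__init__()
        self.fc = nn.Linear(height * width, d_out)          # W_1, b_1
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                    # x: (B, H, W)
        x1 = x.flatten(start_dim=1)          # (B, H*W) -- image vectorization
        x2 = torch.relu(self.fc(x1))         # X_2 = ReLU(X_1 W_1 + b_1), shape (B, D_out)
        x3 = torch.relu(self.drop(x2))       # dropout followed by ReLU, shape (B, D_out)
        return x3
```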

Graph Construction. We construct a graph batch-wise from $\mathbf{X}_3$. For each batch, every vectorized image is a node in the graph with feature size $\mathbb{R}^{D_{out}}$, and each image is connected to every other image in the batch. Formally, we set the adjacency matrix $\mathbf{A}$ to $\mathbf{A}_{ij} = 1$ for all $i, j$, which connects all nodes.

Graph Convolution. Our single graph convolutional layer learns from similar features of images within its minibatch. Generally, a graph convolutional layer updates the representation of each node by aggregating its neighbors' representations. We can write a graph convolutional layer as $\bm{h}^{\prime}_i = f_{\theta}\left(\bm{h}_i, \text{AGGREGATE}\left(\{\bm{h}_j \mid j \in \mathcal{N}_i\}\right)\right)$. In our case, the input for each node $\bm{h}_i$ is the corresponding vector of $\mathbf{X}_3$.

Applying the graph convolution to $\mathbf{X}_3$ yields $\mathbf{X}_4$. Formally, $\mathbf{X}_4 = \sigma\left(\mathbf{A}\mathbf{X}_3\mathbf{W}_2\right)$, where $\mathbf{W}_2$ is a learned weight matrix and $\sigma$ is the sigmoid function. After the graph convolution, we apply batch normalization and max-pooling operations to $\mathbf{X}_4$, resulting in a dimensionality of $\mathbb{R}^{B \times \lfloor D_{out}/2 \rfloor}$.
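Because the batch graph is fully connected ($\mathbf{A}_{ij} = 1$), the aggregation reduces to a sum over all images in the batch. The sketch below illustrates this single graph convolution followed by batch normalization and max pooling; the pooling kernel size of 2 is an assumption consistent with the stated output dimensionality.

```python
import torch
import torch.nn as nn

class BatchGraphConv(nn.Module):
    """Single graph convolution over a fully connected batch graph (A_ij = 1)."""
    def __init__(self, d_out):
        super().__init__()
        self.w2 = nn.Linear(d_out, d_out, bias=False)   # learned weight W_2
        self.bn = nn.BatchNorm1d(d_out)
        self.pool = nn.MaxPool1d(kernel_size=2)         # halves the feature length (assumed kernel size)

    def forward(self, x3):                              # x3: (B, D_out)
        b = x3.shape[0]
        a = torch.ones(b, b, device=x3.device)          # all-ones adjacency matrix
        x4 = torch.sigmoid(a @ self.w2(x3))             # X_4 = sigmoid(A X_3 W_2), shape (B, D_out)
        x4 = self.bn(x4)                                 # batch normalization
        x4 = self.pool(x4.unsqueeze(1)).squeeze(1)       # (B, floor(D_out / 2))
        return x4
```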

Batch-wise Attention, Residual Connections, & Output. We propose a batch-wise attention term defined as

$$\mathbf{X}_5 = \left(\frac{\sigma\left(\mathbf{X}_4 \mathbf{X}_4^{\top}\right)}{\sum_{i=1}^{B} \sigma\left(\mathbf{X}_4 \mathbf{X}_4^{\top}\right)_i}\right) \mathbf{X}_4$$

where $\sigma$ is the sigmoid function. This term allows the model to capture features that are similar from one image to another batch-wise.

The residual connection is defined as $\mathbf{X}_6 = \mathbf{X}_5 + \mathbf{X}_4$. The residual term makes the learning process easier and more stable. By multiplying a softmax-like term with the output of the previous graph convolution, we weigh each image's correspondence to other similar images within the batch. We then feed the residual term into an FC layer whose output is passed to the softmax function for classification.
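The sketch below combines the batch-wise attention term, the residual connection, and the classification head; the normalization axis of the softmax-like term is one reading of the formula above, and num_classes is dataset-dependent.

```python
import torch
import torch.nn as nn

class BatchAttentionHead(nn.Module):
    """Batch-wise attention, residual connection, and FC classifier."""
    def __init__(self, d_feat, num_classes):
        super().__init__()
        self.classifier = nn.Linear(d_feat, num_classes)

    def forward(self, x4):                                 # x4: (B, d_feat)
        scores = torch.sigmoid(x4 @ x4.T)                  # (B, B) pairwise image similarities
        weights = scores / scores.sum(dim=0, keepdim=True) # softmax-like normalization over the batch
        x5 = weights @ x4                                  # batch-wise attention term X_5
        x6 = x5 + x4                                       # residual connection X_6 = X_5 + X_4
        logits = self.classifier(x6)                       # single FC layer
        return torch.softmax(logits, dim=-1)               # class probabilities
```

Composed with the earlier stages, a batch therefore flows as $\mathbf{X} \in \mathbb{R}^{B \times H \times W} \rightarrow \mathbf{X}_3 \in \mathbb{R}^{B \times D_{out}} \rightarrow \mathbf{X}_4 \in \mathbb{R}^{B \times \lfloor D_{out}/2 \rfloor} \rightarrow$ class probabilities.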

Model Structure Discussion

We justify our model’s design choices by considering the following theoretical aspects.

  1. The batch-wise attention term allows the model to further capture features that are similar from one image to another batch-wise. Relating similar properties of images to each other boosts accuracy in our case of a shallow model. Additionally, our batch-wise attention term is similar in spirit to the mechanism proposed by [9], which allows the model to capture long-range dependencies across the entire image.

  2. The batch-size hyperparameter is crucial in our model. A larger batch size allows the model to capture more dependencies across images, which is crucial for understanding complex image patterns. We refer to the work of [10] for a detailed analysis of the impact of batch size on the performance of GNNs.

  3. If the batch size for a given dataset is 1, the model eliminates the graph construction phase, and the term $\mathbf{X}_3$ is fed directly into the FC layer and softmax for classification.

  4. The residual connection term makes the learning process easier and more stable. We refer to [11] for a more detailed analysis of the impact of residual connections on shallow models.

5 Experiments

5.1 Datasets

Datasets from several domains are examined to gauge the effectiveness of GECCO in diverse settings. We use the SAR ATR dataset MSTAR and a medical imaging dataset, CXR [12].

  • MSTAR is a SAR ATR dataset with a training size of 2747 and a testing size of 2425 SAR images across 10 different vehicle categories. We resize each image in the dataset to (128, 128) pixels.

  • CXR is a chest X-ray dataset containing 5863 X-ray images in 2 categories (Pneumonia/Normal). The images are (224, 224) pixels. The training size is 5216, and the testing size is 624. (A loading and preprocessing sketch follows this list.)
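A possible loading and preprocessing pipeline for these grayscale datasets is sketched below; the use of torchvision, the ImageFolder directory layout, and everything beyond the stated image sizes are assumptions.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical loader: images are converted to single-channel tensors and resized
# to the sizes stated above (128x128 for MSTAR, 224x224 for CXR).
mstar_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((128, 128)),
    transforms.ToTensor(),          # yields tensors of shape (1, H, W); squeeze the
])                                  # channel dimension before vectorization

train_set = datasets.ImageFolder("data/mstar/train", transform=mstar_transform)  # assumed directory layout
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
```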

Our goal is to create a real-time system. That is, we wish to minimize the inference latency and maximize the throughput of our model while maintaining leading or competitive accuracy on each dataset. In the following sections, we measure inference latency and throughput as described in Section 2.

5.2 Results

5.2.1 Backbone

For Table 1, we choose ResConv as the backbone of our model because it has the most desirable characteristics for applying a graph convolutional layer.

Table 1: Comparison of graph convolution backbones (MSTAR).
Convolutional Layer | Top-1 Accuracy | Throughput (imgs/ms) | Latency (ms)
GCN [13]      | 98.89% | 50.04 ± 6.85 | 5.86 ± 0.98
TAGConv [14]  | 99.05% | 47.87 ± 7.11 | 6.24 ± 1.32
SAGEConv [15] | 99.08% | 51.77 ± 8.19 | 5.95 ± 0.87
ChebConv [16] | 98.56% | 45.37 ± 5.99 | 6.83 ± 1.27
ResConv [17]  | 99.29% | 52.98 ± 9.04 | 5.22 ± 1.03

We use the hyperparameters listed in Table 2 for our experiments.

Table 2: Hyperparameters used in our experiments.
Dataset | Feature Length | Optimizer | Batch Size
MSTAR   | 86             | Adam      | 64
CXR     | 112            | Adam      | 64

5.2.2 Experimental Performance

Experimental performance includes the top-1 accuracy, inference throughput, and inference latency. We perform our inference batch-wise as a means to reduce latency. These metrics vary across each dataset.

We summarize our findings in Tables 3 and 4. We report the best-performing accuracy, average throughputs, and latencies with their standard deviations. Our model outperforms every other model in terms of throughput and latency across all datasets, leads accuracy on the MSTAR dataset, and performs competitively in terms of accuracy on all datasets.

We perform the remaining experiments on a state-of-the-art NVIDIA RTX A5000 GPU. We compare our model to the top-performing variants of VGG [18], the popular ViT [19], the ViT for small-size datasets (SS-ViT) [20], FastViT [21], Swin Transformer [22], and ResNet [23]. We use the open-source packages PyTorch and HuggingFace for model building and the PyTorch Op-Counter for operation counting. Running all experiments on the same hardware system is vital for a fair comparison across models.

Table 3: Performance comparison on MSTAR.
Model    | Top-1 Accuracy | Throughput (imgs/ms) | Latency (ms)
Swin-T   | 86.04% | 1.36 ± 0.10  | 46.98 ± 3.20
SS-ViT   | 95.61% | 2.29 ± 0.43  | 27.97 ± 5.26
VGG16    | 93.13% | 1.69 ± 0.33  | 37.89 ± 7.52
FastViT  | 91.78% | 1.04 ± 0.13  | 61.44 ± 7.69
ResNet34 | 98.64% | 3.13 ± 0.22  | 20.48 ± 1.39
GECCO    | 99.29% | 12.26 ± 2.42 | 5.22 ± 1.03

Table 4: Performance comparison on CXR.
Model    | Top-1 Accuracy | Throughput (imgs/ms) | Latency (ms)
Swin-T   | 73.66% | 0.27 ± 0.05 | 236.71 ± 46.09
SS-ViT   | 71.09% | 1.03 ± 0.21 | 62.35 ± 12.85
VGG16    | 82.01% | 0.76 ± 0.25 | 84.10 ± 28.43
FastViT  | 75.46% | 1.06 ± 0.14 | 60.30 ± 14.24
ResNet34 | 78.31% | 0.60 ± 0.11 | 105.84 ± 19.39
GECCO    | 77.57% | 2.63 ± 0.55 | 24.32 ± 5.08

5.2.3 Model Complexity Metrics

Model complexity metrics for this paper include the number of multiply-accumulate operations (MACs), the number of model parameters, the model size, and the number of layers. In other words, suppose an accumulator $a$ counts an operation over arbitrary $b, c \in \mathbb{R}$; we count one multiply-accumulate operation as $a \leftarrow a + (b \times c)$. Additionally, the layer count is an essential factor of latency: decreasing the number of layers also improves a model's inference latency. The goal of an effective machine learning model, in our case, is to maximize throughput while minimizing the number of MACs and the number of layers.
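As a toy illustration of this counting rule, the sketch below multiplies two small matrices while accumulating one MAC per product-accumulate; profilers such as the PyTorch Op-Counter report the same quantity without the explicit loops.

```python
import numpy as np

def matmul_with_mac_count(B, C):
    """Multiply two matrices while counting multiply-accumulate operations."""
    n, k = B.shape
    k2, m = C.shape
    assert k == k2, "inner dimensions must match"
    A = np.zeros((n, m))
    macs = 0
    for i in range(n):
        for j in range(m):
            for t in range(k):
                A[i, j] += B[i, t] * C[t, j]   # a <- a + (b * c)
                macs += 1                       # one MAC per product-accumulate
    return A, macs

_, macs = matmul_with_mac_count(np.ones((4, 3)), np.ones((3, 2)))
print(macs)  # 24 MACs for a (4x3)(3x2) product
```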

We measure the model complexity of our model against the other popular machine learning models chosen in Table 5. Our model outperforms in all of our chosen model complexity metrics, highlighting its lightweight nature.

Table 5: Model complexity comparison.
Model    | # MACs        | # Parameters | Model Size (Mb) | # Layers
Swin-T   | 2.12 × 10^10  | 2.75 × 10^7  | 109.9           | 167
SS-ViT   | 1.55 × 10^10  | 4.85 × 10^6  | 19.62           | 79
VGG16    | 9.51 × 10^9   | 4.69 × 10^6  | 18.75           | 20
FastViT  | 7.16 × 10^8   | 4.02 × 10^6  | 16.1            | 226
ResNet34 | 4.47 × 10^9   | 2.13 × 10^7  | 85.1            | 92
GECCO    | 5.10 × 10^4   | 5.08 × 10^4  | 0.19            | 16

5.2.4 Ablation Study

We perform an ablation study to verify that the components of our proposed model contribute positively to the overall accuracy of the MSTAR dataset.

Ablation study of model components (accuracy on MSTAR).
Mini-batch GNN | Weighted-Sum Residual Term | Accuracy on MSTAR
99.29%
97.94%
88.04%
78.64%

Additionally, we find that only a single graph convolutional layer is enough to reduce the variance and increase the accuracy of our model. Refer to Figure 2.

[Figure 2: Accuracy and variance versus the number of graph convolutional layers.]

5.3 Discussion

Across multiple datasets, GECCO achieves leading or competitive accuracy compared to other state-of-the-art image classifiers. GECCO outperforms other machine learning models regarding model complexity, highlighting our model’s low latency and lightweight properties.

It is difficult for our model to generalize to the RGB setting. We attribute this challenge to the vectorization process that our model uses: because GECCO is very shallow and simple, learning across three separate channels poses a complexity challenge. Additionally, our model is optimized for a low-complexity dataset regime; datasets such as CIFAR and ImageNet are too complex for our shallow model.

Our proposed method does not make use of positional embeddings or class tokens. GECCO can learn essential features using the weighted residual term. Additionally, we tested the addition of positional embeddings and class tokens and found no improvement in accuracy across various datasets. We note that the $\mathbf{X}_5$ attention-like term adds positional awareness to the model.

5.4 FPGA Implementation

We develop an accelerator for the proposed model on a state-of-the-art FPGA, the Xilinx Alveo U200 [24], to further highlight the model's efficiency and compatibility with hardware. The board has 3 Super Logic Regions (SLRs), 4 DDR memory banks, 1182k look-up tables (LUTs), 6840 DSPs, 75.9 Mb of BRAM, and 270 Mb of URAM. The FPGA kernels are developed using the Xilinx High-Level Synthesis (HLS) tool to expedite the design process.

Our FPGA design incorporates several novel features: (1) Portability of the design: We design a parameterized hardware template using HLS. It is portable to different FPGA platforms, including embedded and data-center FPGAs. We present our hardware mapping algorithm in Algorithm 1. (2) Resource sharing: The model is executed layer-by-layer. Each layer in the model is decomposed into basic kernel functions. The basic kernel functions, including matrix multiplication, elementwise activation, column-wise and row-wise summations, max pooling, and various other elementwise operations, are implemented separately and subsequently invoked within their corresponding layers. Due to the reuse of these fundamental kernel functions across multiple layers, FPGA resources are shared among the different layers, maximizing runtime hardware resource utilization. (3) Single-load strategy: We employ a one-time data load strategy to load the required data from DDR only once. All other data required for the computations are stored in on-chip memory, reducing inference latency. Figure 3 illustrates the overall hardware architecture of our design.

We utilize the Vitis tool [25] for hardware synthesis and place-and-route to determine the achieved frequency. The Vitis Analyzer tool is then used to report resource utilization and the number of clock cycles. The latency is calculated by multiplying the achieved clock period by the number of cycles. Table 8 reports the results obtained for the MSTAR dataset. Given the compact design and resource efficiency of the model, it can be accommodated within a single SLR. Hence, we deploy one accelerator instance per SLR, which increases inference throughput. Table 8 shows the latency obtained for a single inference and the throughput achieved by running the design on 3 SLRs concurrently.

Table 7: Comparison of hardware platforms.
                              | GPU           | Our Design
Platform                      | NVIDIA A5000  | Alveo U200
Technology                    | Samsung 8 nm  | TSMC 16 nm
Frequency                     | 1.17 GHz      | 200 MHz
Peak Performance (TFLOPS)     | 27.7          | 0.66
On-chip Memory                | 6 MB          | 35 MB
Memory Bandwidth              | 768 GB/s      | 77 GB/s
Latency on MSTAR (ms)         | 5.22          | 5.65
Throughput on MSTAR (imgs/ms) | 12.26         | 33.98

Algorithm 1: Hardware mapping of the model onto kernel units.

Input: Model f() and the input images
Output: Execution result
for each layer i in f() do
    if layer i is a fully connected layer then
        Map to the matrix multiplication unit
    if layer i is a graph convolution layer then
        Map to the matrix multiplication unit
        Map to the activation unit
        Map to the elementwise operation unit
        Map to the matrix addition unit
    if layer i is a batch-wise attention layer then
        Map to the matrix multiplication unit
        Map to the activation unit
        Map to the elementwise operation unit
    if layer i is a max pooling layer then
        Map to the max pooling unit
    if layer i is an activation layer then
        Map to the activation unit
    if layer i is a batch normalization layer then
        Map to the batch normalization unit
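A compact Python rendering of this mapping is given below; the layer-type and kernel-unit names are illustrative labels, not identifiers from the actual HLS implementation.

```python
# Hypothetical layer-type -> kernel-unit mapping mirroring Algorithm 1.
KERNEL_MAP = {
    "fully_connected":      ["matrix_multiplication"],
    "graph_convolution":    ["matrix_multiplication", "activation",
                             "elementwise", "matrix_addition"],
    "batch_wise_attention": ["matrix_multiplication", "activation", "elementwise"],
    "max_pooling":          ["max_pooling"],
    "activation":           ["activation"],
    "batch_normalization":  ["batch_normalization"],
}

def map_model_to_kernels(layers):
    """Return, for each layer type in execution order, the shared kernel units it runs on."""
    return [(layer, KERNEL_MAP[layer]) for layer in layers]

# Example: the shared matrix multiplication unit serves the FC, graph convolution,
# and batch-wise attention layers, which is what enables resource sharing.
print(map_model_to_kernels(["fully_connected", "graph_convolution", "batch_wise_attention"]))
```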

We compare our FPGA implementation with the baseline GPU implementation. The GPU baseline is executed on an NVIDIA RTX A5000, which operates at 1170 MHz and has a memory bandwidth of 768 GB/s, whereas the FPGA operates at 200 MHz with an external memory bandwidth of 77 GB/s. We compare the hardware features of the two platforms in Table 7. Although the GPU has 41× higher peak performance and 10× higher memory bandwidth, our FPGA implementation achieves a comparable latency of 5.65 ms and an improved throughput of 33.98 imgs/ms.

[Figure 3: Overall hardware architecture of our FPGA design.]

Table 8: FPGA implementation results and resource utilization on MSTAR.
Latency    | 5.65 ms
Throughput | 33.98 imgs/ms
BRAMs      | 956 (22%)
URAMs      | 228 (24%)
DSPs       | 1226 (17%)
LUTs       | 459K (38%)
FFs        | 597K (25%)

6 Conclusion and Future Work

This work introduced a novel architecture combining fully connected and graph convolutional layers, benchmarked on popular grayscale image datasets. The model demonstrated strong performance and low complexity, highlighting the importance of lightweight, low-latency image classifiers for various applications. Its efficacy was shown across SAR ATR and medical image classification, with an FPGA implementation underscoring its hardware friendliness. Key innovations include using a single-layer GCN, which, along with batch-wise attention, enhances accuracy and reduces variance. Future work should explore extending this approach to color image datasets and other domains, optimizing the architecture for even greater efficiency, and further investigating the potential of graph neural networks in shallow models.

7 Acknowledgement

This work is supported by the DEVCOM Army Research Lab (ARL) under grant W911NF2220159. Distribution Statement A: Approved for public release. Distribution is unlimited.

References

  • [1] Kaoru Ota, Minh Son Dao, Vasileios Mezaris, and Francesco G. B. De Natale, "Deep learning for mobile multimedia: A survey," ACM Trans. Multimedia Comput. Commun. Appl., vol. 13, no. 3s, Jun. 2017.
  • [2] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou, "ResMLP: Feedforward networks for image classification with data-efficient training," 2021.
  • [3] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy, "MLP-Mixer: An all-MLP architecture for vision," 2021.
  • [4] Benjamin Sanchez-Lengeling, Emily Reif, Adam Pearce, and Alexander B. Wiltschko, "A gentle introduction to graph neural networks," Distill, 2021, https://distill.pub/2021/gnn-intro.
  • [5] Naman Goyal and David Steiner, "Graph neural networks for image classification and reinforcement learning using graph representations," 2022.
  • [6] Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, and Carl Busart, "Accurate, low-latency, efficient SAR automatic target recognition on FPGA," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), Aug. 2022, IEEE.
  • [7] Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu, "Vision GNN: An image is worth graph of nodes," in NeurIPS, 2022.
  • [8] Arnab Kumar Mondal, Vineet Jain, and Kaleem Siddiqi, "Mini-batch graphs for robust image classification," 2021.
  • [9] Qishang Cheng, Hongliang Li, Qingbo Wu, and King Ngi Ngan, "BA2M: A batch aware attention module for image classification," 2021.
  • [10] Yaochen Hu, Amit Levi, Ishaan Kumar, Yingxue Zhang, and Mark Coates, "On batch-size selection for stochastic training for graph neural networks," 2021.
  • [11] Shuzhi Yu and Carlo Tomasi, "Identity connections in residual nets improve noise stability," 2019.
  • [12] Daniel Kermany, "Labeled optical coherence tomography (OCT) and chest X-ray images for classification," 2018.
  • [13] Thomas N. Kipf and Max Welling, "Semi-supervised classification with graph convolutional networks," 2017.
  • [14] Jian Du, Shanghang Zhang, Guanhang Wu, Jose M. F. Moura, and Soummya Kar, "Topology adaptive graph convolutional networks," 2018.
  • [15] William L. Hamilton, Rex Ying, and Jure Leskovec, "Inductive representation learning on large graphs," 2018.
  • [16] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," 2017.
  • [17] Xavier Bresson and Thomas Laurent, "Residual gated graph ConvNets," 2018.
  • [18] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015.
  • [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2021.
  • [20] Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song, "Vision transformer for small-size datasets," CoRR, vol. abs/2112.13492, 2021.
  • [21] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan, "FastViT: A fast hybrid vision transformer using structural reparameterization," 2023.
  • [22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," 2021.
  • [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [24] Xilinx, "Xilinx Alveo U200 board," https://docs.xilinx.com/r/en-US/ds962-u200-u250/FPGA-Resource-Information.
  • [25] "Vitis HLS," https://www.xilinx.com/products/design-tools/vitis/vitis-hls.html.