Feature-enhanced fusion of U-NET-based improved brain tumor images segmentation

The field of medical image segmentation, particularly in the context of brain tumor delineation, plays an instrumental role in aiding healthcare professionals with diagnosis and accurate lesion quantification. Recently, Convolutional Neural Networks (CNNs) have demonstrated substantial efficacy in a range of computer vision tasks. However, a notable limitation of CNNs lies in their inadequate capability to encapsulate global and distal semantic information effectively. In contrast, Transformers, which have established their prowess in natural language processing and computer vision, offer a promising alternative. This is primarily attributed to their self-attention mechanisms that facilitate comprehensive modeling of global information. This research delineates an innovative methodology to augment brain tumor segmentation by synergizing UNET architecture with Transformer technology (denoted as UT), and integrating advanced feature enhancement (FE) techniques, specifically Modified Histogram Equalization (MHE), Contrast Limited Adaptive Histogram Equalization (CLAHE), and Modified Bi-histogram Equalization Based on Optimization (MBOBHE). This integration fosters the development of highly efficient image segmentation algorithms, namely FE1-UT, FE2-UT, and FE3-UT. The methodology is predicated on three pivotal components. Initially, the study underscores the criticality of feature enhancement in the image preprocessing phase. Herein, techniques such as MHE, CLAHE, and MBOBHE are employed to substantially ameliorate the visibility of salient details within the medical images. Subsequently, the UT model is meticulously engineered to refine segmentation outcomes through a customized configuration within the UNET framework. The integration of Transformers within this model is instrumental in imparting contextual comprehension and capturing long-range data dependencies, culminating in more precise and context-sensitive segmentation. 
Empirical evaluation of the model on two extensively acknowledged public datasets yielded accuracy rates exceeding 99%.


Introduction
Brain tumors, notably characterized by their uncontrolled growth within the brain, represent a significant health challenge due to their complex origins and the severe impact they have on patients' lives and well-being. Among these, gliomas are particularly significant, constituting about 35% of all brain tumors [1]. They originate from glial cells and are known for their invasive nature, ranging from low-grade benign forms to highly malignant types like glioblastoma. The early detection of these tumors is critically important because of their high malignancy and the typically short survival time for affected patients, underscoring the urgent need for effective diagnostic procedures [2].
In the realm of modern diagnostics, several methods are employed, including ultrasound imaging, CT scans, X-ray, and notably Magnetic Resonance Imaging (MRI). MRI stands out for its non-invasive nature, the detailed insights it offers without exposing patients to harmful ionizing radiation, and its exceptional ability to differentiate soft tissues, such as tumors [3]. The varied imaging sequences available with MRI enable physicians to gain a comprehensive understanding of the tumor's characteristics, making it an indispensable tool in the diagnosis of brain tumors. The significance of medical imaging in modern medical diagnostics cannot be overstated, as these images are crucial in visualizing the internal structures of the human body [4,5]. Medical image processing, which encompasses detection, segmentation, registration, and fusion, is essential in this context. Currently, the focus of medical image segmentation is on images of various human organs, tissues, and cells, segmenting them into regions based on similarities or differences. Enhancing these techniques, especially in MRI, is vital in advancing our ability to accurately diagnose and effectively treat brain tumors, ultimately improving patient outcomes [6].
Over the past few years, the field of medical image segmentation has witnessed a continuous stream of research endeavors, leading to the development and proposition of numerous techniques and methods. These approaches span a diverse array, including threshold-based segmentation, region-based segmentation, and edge detection-based segmentation methods. Notably, traditional machine learning techniques, including decision trees, random forests, and clustering algorithms, have demonstrated their effectiveness in achieving precise image segmentation. Nevertheless, these methods are inherently reliant on feature engineering, and their performance is constrained by the limited expressiveness of the features they extract [7,8].
In recent years, deep learning methods, especially those based on convolutional neural networks (CNNs), have demonstrated strong feature recognition capabilities. They have generally outperformed traditional machine learning methods in areas like medical image segmentation. Consequently, deep learning-based medical image segmentation methods have garnered increasing attention and application [9,10]. In this domain, medical image segmentation is a cornerstone, facilitating the differentiation of distinct regions within images, including discerning between healthy tissues and anomalies [11]. Deep learning techniques, such as the Fully Convolutional Network (FCN) [12], DeepLab [13], and notably the UNET architecture [14], have been pivotal in enhancing the precision of this vital task. Yet, while UNET has found tremendous success, it is not devoid of limitations, primarily its rigidity in adapting to datasets of different sizes and potential inefficiencies in leveraging skip connections.
UNET, a powerful tool in medical image segmentation, has notable limitations that must be considered when applying it in healthcare settings. One significant drawback is its substantial data appetite. UNETs require large and diverse datasets for training, which can be difficult to obtain, particularly for rare conditions. Moreover, deep learning models like UNET are susceptible to overfitting, especially when the training dataset is limited. This means that while they may perform exceptionally well on training data, their ability to generalize to new, unseen cases can be compromised [15]. The computational demands of UNETs can also be a hindrance, as they necessitate robust hardware resources, both for training and inference, making them less accessible to smaller healthcare facilities. Another critical limitation is the model's interpretability, or rather, the lack thereof. UNETs are often regarded as "black boxes," making it challenging to explain how they arrive at their decisions, a crucial concern in healthcare where transparent decision-making is imperative. Additionally, UNETs may struggle with precise boundary delineation, potentially producing slightly irregular object boundaries. Variations in image quality, acquisition devices, and protocols can also pose challenges for the model's robustness. Addressing these limitations is paramount for the successful integration of UNETs into clinical practice.
The major contributions of this study in the field of medical image segmentation are as follows:
The remaining sections of the paper are structured as follows:
• Section II provides an in-depth comparison of our novel methods with existing approaches.
• In Section III, we offer a concise overview of the structure of our innovative techniques.
• Section IV is dedicated to discussing the experimental results, including comprehensive discussions and comparisons with established methodologies.
• Concluding the paper, we present our final remarks and conclusions in Section VI.

Related work
This section is structured into two main categories: segmentation methods based on Convolutional Neural Network (CNN)-UNET approaches and segmentation methods using Transformer-based techniques. This division allows for a more in-depth exploration of the specific techniques and approaches within these two prominent branches of medical image segmentation.

Image segmentation with CNN and UNET
The groundbreaking concept of Convolutional Neural Networks (CNNs) was introduced by LeCun et al. [16]. Their work achieved remarkable success in recognizing handwritten digits, notably with the construction of the LeNet-5 network. As computing power continued to advance, CNNs garnered widespread attention from researchers, gaining prominence in various domains. CNNs found their application in image segmentation, excelling not only in segmentation tasks but also in related areas such as image classification and object detection [17]. They have emerged as one of the most influential algorithms in the realm of deep learning. In the domain of medical image segmentation, CNN-based research predominantly falls into two categories:
Image Block Classification: In this approach, the task of image segmentation is transformed into the classification of local image blocks, where each pixel's location within the image plays a crucial role. For instance, researchers like Arkapravo Chattopadhyay and Mausumi Maitra have devised CNN-based models for brain tumor segmentation [18]. These models make extensive use of both local and global image features, enhancing their segmentation capabilities. The incorporation of fully connected layers at the end of the model significantly accelerates network training.
Semantic Segmentation based on Fully Convolutional Networks (FCN): This approach predicts the class to which each pixel within an input image belongs, enabling pixel-level semantic segmentation. Notably, Long and his team introduced the concept of Fully Convolutional Networks (FCN), capable of pixel-wise classification through forward propagation. This technology transforms image input into image output, enabling end-to-end segmentation [19]. FCN-based semantic segmentation has attracted substantial research efforts, with novel techniques emerging to facilitate hierarchical feature learning, classification optimization, and the creation of dense predictions for entire images.
Furthermore, advanced 3D networks, inspired by U-net-like topologies, have been introduced to extract contextual information from adjacent slices within 3D volumes used extensively in clinical practice [20]. Notable examples include 3D U-net and V-net, which leverage context from neighboring slices to enhance segmentation accuracy. In recent years, FCN-based semantic segmentation has dominated the landscape of medical image analysis. A significant proportion of international competitions, approximately 70%, focus on this particular area. Consequently, this section will delve into the exploration of fully convolutional neural networks, with a primary focus on the research status of the U-Net model in the domain of medical image segmentation.

Image segmentation with transformers
Although Convolutional Neural Networks (CNNs) have been around for many years, it was not until the introduction of AlexNet that CNNs became the mainstream deep learning model in the field of computer vision. Since then, deeper and more effective deep network models have gradually been proposed, such as ResNet, GoogleNet, DenseNet, and others [21]. In addition to exploring network architectures, these studies also included improvements to CNN itself, such as the introduction of dilated convolutions and depth-wise separable convolutions. One of the primary advantages of CNNs compared to traditional machine learning methods is that CNNs extract richer and more expressive features, eliminating the need for manual feature engineering. In different application scenarios, selecting suitable handcrafted features can be challenging, while CNNs do not require manual feature selection and can perform end-to-end feature extraction.
However, one limitation of CNNs is their local operation, meaning that they have limited receptive fields [22]. To address the problem of limited convolutional receptive fields, commonly used operations like dilated convolutions effectively increase the receptive field without reducing resolution. Dilated convolution uses convolution kernels with different dilation rates to extract features at different scales, to some extent alleviating the limitations of standard convolution operations. Feature pyramidal pooling, on the other hand, uses different sizes of pooling combinations to obtain multiscale feature information, enhancing classification accuracy. In the field of medical image segmentation, despite the success of models based on Convolutional Neural Networks (CNNs) like U-Net, there are still limitations in terms of segmentation accuracy and granularity due to the complexity of medical images, difficulty in data labeling, and limited annotated data.
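As a brief worked illustration of how dilation enlarges the receptive field, the relation below gives the effective extent of a single kernel. It is standard convolution arithmetic rather than a formula taken from this study; the symbols k (kernel size) and d (dilation rate) are introduced here only for illustration.

```latex
k_{\text{eff}} = k + (k - 1)(d - 1)
```

For example, a 3 × 3 kernel with d = 2 spans a 5 × 5 window, and with d = 4 it spans a 9 × 9 window, while the number of weights stays at nine.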
Researchers have proposed various variations of the U-Net model to address these limitations. For example, U-Net++ introduced mesh-like connectivity by using denser skip connections to link different stages of features [23]. R2U-Net [24] ensured segmentation continuity by introducing recurrent convolution modules and Long Short-Term Memory (LSTM) networks. SA-U-Net incorporated spatial attention modules to suppress irrelevant areas of feature maps, enhancing classifier discriminative accuracy, and used Dropout layers to mitigate overfitting [25]. However, these variations are primarily focused on improving convolutional models and do not fundamentally address the lack of global information in convolutional features. These improved variant networks still struggle to handle long-range semantic interactions in CNNs.
The Transformer was initially introduced in natural language processing research and was later applied to computer vision tasks, such as ImageNet image classification [26], through models like ViT (Vision Transformer) [27], achieving unprecedented success. Transformers divide images into fixed-size image patches, project them to a specified dimension through linear projection, and represent them as token sequences, offering a novel segmentation approach. Transformers model global information without downsampling, allowing for global information modeling while maintaining image resolution [28]. This offers a fresh approach to semantic segmentation. Without relying on operations like dilated convolutions and Feature Pyramid Networks (FPN) used in convolutional methods, the Transformer expands receptive fields and obtains feature responses from a global perspective.
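The patch-and-project step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the tokenizer of any model cited here: the patch size of 16, the embedding dimension of 64, the 240 × 240 × 4 input, and the function names `patchify` and `embed_patches` are all hypothetical, and random weights stand in for a learned projection.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches, embed_dim, rng):
    """Linearly project each flattened patch to an embed_dim-sized token."""
    # Random weights stand in for a learned projection matrix (assumption).
    weights = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
    return patches @ weights

rng = np.random.default_rng(0)
image = rng.random((240, 240, 4))            # hypothetical 4-channel MRI slice
patches = patchify(image, patch_size=16)     # 15 x 15 = 225 patches
tokens = embed_patches(patches, embed_dim=64, rng=rng)
print(patches.shape, tokens.shape)           # (225, 1024) (225, 64)
```

Each row of `tokens` is one image patch expressed as a fixed-length vector, which is the token sequence a Transformer encoder would consume.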
Transformers, based on multi-layer self-attention and multi-layer perceptrons, achieved significant success in natural language processing [29]. ViT was the first successful application of Transformers in computer vision, outperforming many advanced models in image recognition tasks. However, ViT is more suitable for large datasets. Touvron et al. [30] and others [31] improved ViT's performance on small datasets through various training strategies. The Swin-UNET model, utilizing a pure Transformer U-shaped network architecture, achieved excellent results in liver image segmentation [32]. Due to the high computational complexity of core self-attention computations in Transformers, Swin Transformer introduced the concept of shifted windows, reducing the cost of calculating self-attention by restricting it to each window while still enabling communication across windows.
While Transformer structures may perform relatively poorly on medical image datasets with limited data, some researchers have made progress in applying Transformers to image processing with promising results. The SETR model proposed using Transformers exclusively for semantic segmentation and introduced context information dependencies at every stage, removing the previous limitations of relying on dilated convolutions and attention mechanisms to increase receptive fields [33]. TransUNET was the first model to combine Transformer and CNN in a U-shaped lightweight network for abdominal organ segmentation. It used conventional CNNs to extract low-level information, serialized feature maps in the last stage of the Encoder using patches to obtain tokens, and then obtained global information through Transformers [34]. The TransFuse model employed a dual-branch structure with Swin-Transformer and CNN for feature encoding, capturing both local information and global dependencies. It introduced the Bifusion module to fuse multiscale features, achieving state-of-the-art results in Polyp dataset segmentation. DS-TransUNET [35] used two different patch sizes for partitioning and introduced a dual-branch Swin Transformer to extract different scale feature representations. It proposed the TIF fusion strategy to combine the results of two different scales. In the Decoder stage, it also introduced Swin-Transformer to establish global dependencies during upsampling. The Medical Transformer model used Gated axial attention and decomposed global spatial attention into two axial directions [36], significantly reducing parameter counts. It also introduced the Local branch and Global branch to fuse global and local segmentation results [37]. However, these methods have several limitations. For instance, while Transformers can establish global context dependencies, they may disrupt the shallow features of the convolutional network, which contain crucial local information for improving edge segmentation accuracy. Therefore, designing a more suitable fusion model that retains low-level information while establishing long-term dependencies is a key challenge to address.

Method
The initial phase of the feature-enhanced UNET-based Transformer (FE-UT) model involves enhancing image features through a series of preprocessing steps. These steps utilize Contrast-Limited Adaptive Histogram Equalization (CLAHE) [38], Modified Histogram Equalization (MHE) [39], and Modified Brightness and Contrast Enhancement (MBOBHE) [40] techniques. These image enhancement methods are applied to enhance the contrast and visibility of the input image, ensuring that it is well-prepared for subsequent analysis and processing. In Fig. 1, the comprehensive implementation strategy for all algorithms is visually presented, outlining the various steps involved in this process.

Image enhancement
The preprocessing stage incorporates the application of MHE, CLAHE, and MBOBHE to leverage image enhancement techniques aimed at augmenting the contrast and visibility of the input image prior to any subsequent analysis or processing. Each of these methods possesses unique characteristics and brings specific advantages to the enhancement process. The enhancement methods are described as follows:

i) MBOBHE method
MBOBHE operates with the explicit goal of simultaneously addressing three critical aspects of image enhancement: contrast enhancement, brightness preservation and detail preservation.
Hum et al. [40] have conducted extensive research to demonstrate the superior performance of MBOBHE in comparison to existing bi-Histogram Equalization methods. Both quantitative and qualitative results substantiate the effectiveness of MBOBHE, highlighting its ability to provide a holistic view of image enhancement. Notably, MBOBHE excels in striking the delicate balance between preserving image brightness, retaining intricate details, and enhancing contrast in the final enhanced images (Figs. 2 and 3).

ii) MHE method
For each sub-region I_i with histogram H_i, the cumulative distribution function is

$$\mathrm{CDF}_i(j) = \frac{\sum_{k=0}^{j} H_i(k)}{\text{total number of pixels in } I_i} \tag{2}$$

After equalizing all the sub-regions, combine them to reconstruct the final equalized image.
Multipeak Histogram Equalization can be particularly useful for enhancing the contrast in images where different objects or regions have varying illumination conditions or intensity characteristics. By equalizing each mode separately, it preserves the relative differences between modes while enhancing the contrast within each mode.

iii) Contrast-limited adaptive histogram equalization (CLAHE)
Contrast-Limited Adaptive Histogram Equalization (CLAHE) is a widely used technique in image processing to enhance the contrast of an image while limiting the amplification of noise in flat or low-contrast regions. CLAHE is particularly useful when dealing with images that have uneven lighting conditions or regions with varying contrasts. The basic idea behind CLAHE is to divide the image into small tiles or blocks and perform histogram equalization within each tile. However, to prevent excessive amplification of noise, CLAHE also limits the contrast enhancement for each tile by clipping the histogram.
The steps of CLAHE, along with the corresponding equations, are as follows:

Divide the image into tiles
Divide the input image I into non-overlapping tiles or blocks. Let us denote these tiles as I(x, y), where (x, y) represents the coordinates of the top-left corner of each tile.

Calculate the histogram for each tile
For each tile I(x, y), compute the histogram H(x, y) that represents the distribution of pixel intensities within that tile.

Clip the histogram
Apply contrast limiting by clipping the histogram. This is done by setting a predefined threshold T. If any bin in the histogram exceeds this threshold, the excess pixels are redistributed to other bins. The formula for this clipping is as follows:
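One common way to write the clipping and redistribution step is sketched below. It is the usual textbook formulation with uniform redistribution, given here as an assumption and not necessarily the exact equation used in this study. H(k) denotes bin k of the histogram of a single tile, and the symbols H_clip, E (the total clipped excess), H' (the redistributed histogram), and L (the number of histogram bins) are introduced only for illustration.

```latex
H_{\text{clip}}(k) = \min\bigl(H(k),\, T\bigr), \qquad
E = \sum_{k=0}^{L-1} \max\bigl(H(k) - T,\, 0\bigr), \qquad
H'(k) = H_{\text{clip}}(k) + \frac{E}{L}
```

Because the excess E is added back, the total pixel count of the tile is unchanged; only the shape of the histogram is flattened.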

Calculate the Cumulative Distribution Function (CDF)
Compute the cumulative distribution function (CDF) for each clipped and normalized histogram:
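A standard form of this cumulative distribution function is sketched below as an assumption, since the exact equation of this study is not reproduced here. H'(k) stands for bin k of the clipped histogram of one tile, L for the number of intensity levels, and N for the number of pixels in the tile; these symbols are introduced only for illustration.

```latex
\mathrm{CDF}(j) = \frac{1}{N} \sum_{k=0}^{j} H'(k), \qquad
N = \sum_{k=0}^{L-1} H'(k), \qquad j = 0, 1, \dots, L - 1
```

With this normalization the CDF rises monotonically from a value near 0 to exactly 1 at j = L − 1.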

Reconstruct the image
Combine the equalized tiles to form the final CLAHE-enhanced image.
The key parameter in CLAHE is the contrast threshold (T). Adjusting this threshold controls the degree of contrast enhancement and noise amplification. A lower value of T clips the histogram more aggressively, which reduces contrast enhancement but also limits noise amplification, while a higher value of T results in stronger contrast enhancement but may increase noise.
CLAHE is a powerful technique for enhancing local contrast in images and is commonly used in medical image processing and other applications where contrast is crucial.Its adaptability to local image content makes it a valuable tool for various image enhancement tasks.
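The CLAHE steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: the clipped excess is redistributed uniformly over all bins, tiles are equalized independently without the bilinear blending between neighboring tiles that production implementations typically add, and the function names, tile size, threshold, and synthetic input are all hypothetical.

```python
import numpy as np

def equalize_tile(tile, clip_limit, levels=256):
    """Clipped histogram equalization of one tile of integer intensities."""
    hist = np.bincount(tile.ravel(), minlength=levels).astype(np.float64)
    # Clip each bin at the threshold T and spread the excess uniformly.
    excess = np.clip(hist - clip_limit, 0, None).sum()
    hist = np.minimum(hist, clip_limit) + excess / levels
    # Normalized cumulative distribution function of the clipped histogram.
    cdf = np.cumsum(hist) / hist.sum()
    # Map every pixel through the CDF onto the full intensity range.
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)
    return lut[tile]

def clahe_tiles(image, tile_size, clip_limit):
    """Apply equalize_tile independently to non-overlapping tiles."""
    out = np.empty_like(image)
    h, w = image.shape
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            block = image[y:y + tile_size, x:x + tile_size]
            out[y:y + tile_size, x:x + tile_size] = equalize_tile(block, clip_limit)
    return out

rng = np.random.default_rng(0)
# Hypothetical low-contrast 8-bit slice with intensities confined to 100..139.
image = rng.integers(100, 140, size=(240, 240), dtype=np.uint8)
enhanced = clahe_tiles(image, tile_size=60, clip_limit=40.0)
range_before = int(image.max()) - int(image.min())
range_after = int(enhanced.max()) - int(enhanced.min())
print(range_before, range_after)
```

On this synthetic input the intensity range after enhancement is wider than before, which is the contrast stretch the method is meant to provide.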

Improved U-net segmentation
The U-Net architecture, renowned for its exceptional performance in medical science and bioinformatics image segmentation tasks, has garnered significant attention among researchers [41]. Its name is derived from the network's structural shape, which bears a resemblance to the letter "U." This architecture encompasses two fundamental paths: a contracting (encoder) path and an expansive (decoder) path.

Self-attention-transformer
Recent advancements have seen the incorporation of Transformers, which excel in capturing long-range dependencies and contextual information [42].The Transformer blocks can be inserted at various points in the U-Net architecture to enhance feature extraction and segmentation performance.By attending to and aggregating information across the feature maps, Transformers contribute to a deeper understanding of image context, allowing for more context-aware and accurate segmentation.
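To make the idea of attending across a feature map concrete, the following NumPy sketch applies multi-head self-attention to a flattened feature map. It uses 4 heads and a key dimension of 64, the values stated for the Multi-Head Attention layer in the Algorithm section; everything else is an assumption for illustration, including the 30 × 30 × 256 feature-map size, the function names, and the random matrices that stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, num_heads, key_dim, rng):
    """Scaled dot-product self-attention over an (N, C) token matrix."""
    n, c = tokens.shape
    # Random matrices stand in for learned query/key/value/output weights.
    wq, wk, wv = (rng.normal(scale=c ** -0.5, size=(num_heads, c, key_dim))
                  for _ in range(3))
    wo = rng.normal(scale=(num_heads * key_dim) ** -0.5,
                    size=(num_heads * key_dim, c))
    q = np.einsum("nc,hck->hnk", tokens, wq)
    k = np.einsum("nc,hck->hnk", tokens, wk)
    v = np.einsum("nc,hck->hnk", tokens, wv)
    # Every position attends to every other position: shape (H, N, N).
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(key_dim))
    heads = weights @ v                                   # (H, N, key_dim)
    merged = heads.transpose(1, 0, 2).reshape(n, num_heads * key_dim)
    return merged @ wo, weights

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(30, 30, 256))   # hypothetical encoder output
tokens = feature_map.reshape(-1, 256)          # 900 spatial positions as tokens
attended, weights = multi_head_self_attention(tokens, num_heads=4, key_dim=64, rng=rng)
print(attended.shape, weights.shape)           # (900, 256) (4, 900, 900)
```

The (N, N) attention matrix per head is what lets a position at one corner of the feature map draw on information from any other position in a single step, in contrast to the local window of a convolution.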

Algorithm
In the CLAHE equalization step, each pixel p of tile (x, y) is mapped through the CDF of that tile:

$$I_{\text{equalized}}(x, y, p) = \operatorname{round}\bigl(\mathrm{CDF}(x, y, I(x, y, p)) \times (\text{number of intensity levels} - 1)\bigr) \tag{6}$$

This code implements a convolutional neural network (CNN) based on the U-Net architecture with an additional Transformer module for image segmentation tasks. Below is a detailed explanation of the code in points:

• Input Shape and Layers Initialization
The input shape of the images is defined as (240, 240, 4), indicating images with a resolution of 240x240 pixels and 4 channels.
The code initializes an input layer (inply) using the defined input shape.

• Encoder
Convolutional Layers: The input passes through a series of convolutional layers (conv1, conv2, conv3) with increasing filters (64, 128, 256) and 3x3 kernel size, followed by ReLU activation and same padding. This extracts essential features from the input image.
MaxPooling and Dropout: After each set of convolutional layers, max-pooling is applied to reduce spatial dimensions, and dropout is used for regularization to prevent overfitting.

• Transformer Module
Dropout: A dropout layer with a dropout rate of 0.1 is added to the output of the encoder (drop3).
Multi-Head Attention: The dropout output is fed into a Multi-Head Attention layer with 4 heads and a key dimension of 64. This layer captures complex patterns and long-range dependencies in the input features.

• Decoder
Convolutional Transpose Layers: The output from the Transformer module is passed through a series of transpose convolutional layers (tran1, tran2, tran3). These layers upsample the features to reconstruct the spatial dimensions of the image.
Concatenation: At each stage of the decoder, the upsampled features are concatenated with the corresponding features from the encoder to provide skip connections. This helps the network to retain detailed information from the encoder.
Convolutional Layers and Dropout: After concatenation, the features go through several additional convolutional layers (conv4, conv5, conv6) with ReLU activation and same padding. Dropout is applied after each convolutional layer for regularization.

• Output Layer
Convolutional Layer with Softmax Activation: The output from the decoder is passed through a 1 × 1 convolutional layer with 4 filters (for 4 segmentation classes) and same padding. Softmax activation is applied to obtain the final segmentation probabilities for each class.
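The layer sequence described above can be checked with a small shape walk-through in plain Python. It is a sketch under stated assumptions, not the authors' code: 2 × 2 max-pooling and stride-2 transposed convolutions are assumed (the text does not give pool sizes or strides), the decoder filter counts are assumed to mirror the encoder, and the function name `trace_shapes` and the layer labels `pool*` and `concat*` are hypothetical.

```python
def trace_shapes(size=240, channels=4, filters=(64, 128, 256), num_classes=4):
    """Return (layer_name, output_shape) pairs for the described network."""
    shapes = [("input", (size, size, channels))]
    skips = []
    side = size
    for i, f in enumerate(filters, start=1):
        shapes.append((f"conv{i}", (side, side, f)))   # 3x3, same padding
        skips.append((side, f))                         # kept for the decoder
        side //= 2                                      # assumed 2x2 max-pooling
        shapes.append((f"pool{i}", (side, side, f)))
    # Attention operates on the flattened grid of spatial positions.
    shapes.append(("attention", (side * side, filters[-1])))
    for i, (skip_side, f) in enumerate(reversed(skips), start=1):
        side *= 2                                       # assumed stride-2 transposed conv
        assert side == skip_side, "skip connection must match spatially"
        shapes.append((f"tran{i}", (side, side, f)))
        shapes.append((f"concat{i}", (side, side, 2 * f)))
        shapes.append((f"conv{i + 3}", (side, side, f)))
    shapes.append(("output", (side, side, num_classes)))  # 1x1 conv + softmax
    return shapes

layer_shapes = trace_shapes()
for name, shape in layer_shapes:
    print(f"{name:10s} {shape}")
```

Under these assumptions the spatial size falls from 240 to 30 through the encoder, the attention layer sees 900 tokens of 256 features, and the decoder restores a 240 × 240 map with one probability per class at each pixel.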

Experimental setting and results
Results with experimental settings are described in this section.

Evaluation metrics
This study employed a range of evaluation metrics to assess the performance of the model. The equations for these evaluation methods are as follows:

$$\text{Balanced Accuracy (BA)} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$

where:
FN (False Negative): This refers to cases where the model or classifier incorrectly predicted the negative class when the true class was actually positive. In other words, it is a situation where a positive instance is missed or not detected.
FP (False Positive): This refers to cases where the model or classifier incorrectly predicted the positive class when the true class was actually negative. In this situation, the model made a positive prediction when it should not have.
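As a small worked example of these metrics, the sketch below computes sensitivity, specificity, precision, and balanced accuracy from a binary prediction mask and its ground truth. The definitions are the standard confusion-matrix ones; the function names and the ten-pixel toy masks are hypothetical, and the code assumes that both classes occur in the ground truth and that at least one pixel is predicted positive, so that no denominator is zero.

```python
import numpy as np

def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two binary masks of equal shape."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))      # predicted positive, truly positive
    tn = int(np.sum(~pred & ~truth))    # predicted negative, truly negative
    fp = int(np.sum(pred & ~truth))     # predicted positive, truly negative
    fn = int(np.sum(~pred & truth))     # predicted negative, truly positive
    return tp, tn, fp, fn

def segmentation_metrics(pred, truth):
    tp, tn, fp, fn = confusion_counts(pred, truth)
    sensitivity = tp / (tp + fn)        # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here TP = 3, FN = 1, FP = 1, and TN = 5, so sensitivity and precision are both 0.75, specificity is 5/6, and balanced accuracy is their mean with sensitivity, about 0.7917.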

Dataset description
The datasets used in this study are the BraTS dataset [43] and the Medical Segmentation Decathlon (MSD) [44]. The Medical Segmentation Decathlon (MSD) dataset is a comprehensive collection of medical images and corresponding segmentation masks, designed for evaluating and developing medical image segmentation algorithms. It covers various imaging modalities and anatomical regions, making it versatile for different medical tasks. Researchers use it to benchmark and compare the accuracy of segmentation algorithms, making it a valuable resource in medical image analysis research for applications like organ segmentation, disease diagnosis, and treatment planning.
The BRATS (BraTS) 2020 dataset is a widely recognized collection of medical images designed for the evaluation and development of algorithms related to brain tumor segmentation and diagnosis. It contains multi-modal magnetic resonance imaging (MRI) scans, including T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and FLAIR (Fluid Attenuated Inversion Recovery) images. The dataset provides annotations for brain tumor regions, including gliomas, making it invaluable for machine learning and deep learning research in the field of medical image analysis. Researchers and practitioners use the BRATS 2020 dataset to develop and benchmark segmentation and classification algorithms for brain tumor detection and analysis, contributing to advancements in neuro-oncology and medical imaging.

Experimental parameters setting
The hardware environment for this experiment includes a CPU with a clock speed of 3.40 GHz, an NVIDIA GTX 1070 GPU, and 16 GB of memory. The models in this study were constructed using TensorFlow and Keras as the backend framework. To optimize performance, the Adam optimizer, known for its robustness, was chosen. Furthermore, the preprocessing steps include normalization to avoid complexities in later processing. For the dataset, we adopted a random split, allocating approximately 80% of the data to the training set and reserving the remaining 20% for the test set. Specific parameter configurations for all the algorithms employed in this research are provided in detail in Table 1.
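The normalization and random 80/20 split mentioned above could look like the following NumPy sketch. The text does not specify which normalization was used, so per-channel min-max scaling is an assumption made here for illustration, as are the function names, the fixed seed, and the sample count of 100.

```python
import numpy as np

def normalize(volume):
    """Min-max scale each channel of an (H, W, C) array to [0, 1]."""
    lo = volume.min(axis=(0, 1), keepdims=True)
    hi = volume.max(axis=(0, 1), keepdims=True)
    # The small floor avoids division by zero for constant channels.
    return (volume - lo) / np.maximum(hi - lo, 1e-8)

def random_split(num_samples, train_fraction=0.8, seed=0):
    """Return disjoint index arrays for the training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    cut = int(round(train_fraction * num_samples))
    return order[:cut], order[cut:]

rng = np.random.default_rng(1)
sample = rng.random((240, 240, 4)) * 4000.0    # hypothetical raw MRI intensities
scaled = normalize(sample)
train_idx, test_idx = random_split(100)
print(scaled.min(), scaled.max(), len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible across runs, which matters when several model variants are compared on the same test set.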

Proposed methods segmentation performance evaluation
In our study, the three algorithms are compared using different validation criteria. Balanced accuracy takes into account both the sensitivity and specificity of the segmentation results, making it a more reliable measure in situations where the class distribution is imbalanced, as often seen in medical imaging. This metric provides a balanced evaluation of how well the algorithm identifies both positive and negative regions within the image, ensuring that neither class is disproportionately favored in the assessment, which is crucial for accurate and fair evaluation of medical image segmentation models. For Dataset 1, FE1-UT achieves a balanced accuracy of 98.64%, which is higher than that of FE2-UT (98.53%) and FE3-UT (98.42%). Similarly, precision is 98.19% for FE1-UT, which is higher than that of FE2-UT (97.98%) and FE3-UT (97.85%). The higher precision of FE1-UT means that a greater proportion of the pixels or regions identified as belonging to a particular class (e.g., a specific anatomical structure or lesion) are indeed correct or true positives. The high precision of all three algorithms therefore shows that they produce few false positives and are good at correctly identifying the regions of interest in the medical image. This is particularly important in medical applications, where misclassifying or missing important structures can have serious clinical consequences. Another important metric is recall, which is 98.18% for FE1-UT, compared with 97.90% for FE2-UT and 97.85% for FE3-UT. Higher recall means that the segmentation algorithm has correctly identified a greater proportion of the actual positive regions (e.g., important anatomical structures or abnormalities) within the image. Similarly higher results are observed for FE1-UT on Dataset 2 across all metrics, which shows that CLAHE enhancement has a greater impact on image segmentation. Figures 4 and 5 show the comparative performance of the proposed algorithms against different image segmentation metrics. CLAHE enhances local contrast, which helps in better delineation of subtle details and boundaries in medical images. U-Net provides excellent spatial feature extraction and segmentation capabilities. The Transformer

Traditional methods comparison with proposed methods
In medical image segmentation, for example when using a U-Net architecture, sensitivity is an important evaluation metric because it measures the ability of the model to correctly identify positive instances (i.e., true positives) within the dataset. Sensitivity is also known as the true positive rate, recall, or hit rate, and it quantifies the model's ability to detect all instances of a particular class, typically the presence of a specific medical condition or object of interest within an image. To validate the results of our proposed method against different state-of-the-art (SOTA) methods, we compared the sensitivity of our proposed method with that of other algorithms. Table 2 shows the results of the different methods, with the best results in bold and the second-best results underlined. The loss function and accuracy graph with epoch is a vital visualization tool for monitoring the training progress and evaluating the performance of deep learning models in medical image segmentation. The loss function graph shows how the loss (e.g., Dice loss or cross-entropy loss) decreases as training progresses. A decreasing loss indicates that the model is converging and learning to produce better segmentations. In contrast, the accuracy graph measures the similarity between predicted and ground truth segmentations, which is crucial for assessing the model's performance. An increasing accuracy suggests that the model is improving in segmenting medical images accurately. These graphs help researchers and practitioners fine-tune models, detect overfitting or underfitting, and decide when to stop training, ensuring that the model achieves the desired level of segmentation accuracy for clinical applications. In Fig. 8, we present the average loss and accuracy graphs per epoch for both training and testing datasets.

Latest methods comparison with proposed models
We conducted a comparative analysis of our proposed models against recent studies in the field of MRI segmentation, including Zhang et al. [45], Nizamani et al. [46], and Huang et al. [47], which have demonstrated commendable performance. Huang et al.'s model [47] is distinguished by its improved segmentation using patch-based feature extraction. Nizamani et al.'s work [46] encompasses segmentation, clustering, and the application of CLAHE with UNET for feature extraction and tumor classification.
The results, reported in Tables 3 and 4 for a 70% training split, show that our proposed model outperforms the other feature-based segmentation method. Notably, the other CLAHE-based approach does not perform optimally because of its limited efficiency in separating intricate datasets. Additionally, the studies by Huang et al. [48] and Aamir et al. [49] exhibit suboptimal performance, primarily due to their limited semantic understanding of complex data structures.

Ablation experiments
Additionally, we performed ablation experiments by removing the Transformer and applying the filters to the UNET module directly. The results in Tables 5 and 6 show that adding the Transformer together with CLAHE produces better results for the FE1-UT method on both datasets, which demonstrates the superiority of our method.

Discussion
The practical significance of our study underscores the promising advantages of harnessing sophisticated deep learning methods in the realm of medical image segmentation, with a particular focus on the analysis of brain tumor MRI scans. Nonetheless, it is imperative for researchers and healthcare professionals to remain cognizant of the study's limitations and proactively work towards mitigating them to ensure the secure and efficient integration of these techniques into clinical applications.

Practical applications
There are many practical implications of our proposed method:
• Brain Tumor Detection and Segmentation: The primary focus of the study is enhancing the precision of brain tumor MRI image segmentation. This technology can be deployed in clinical settings to assist radiologists and oncologists in accurately delineating tumor boundaries, which is crucial for treatment planning and monitoring disease progression [50][51][52][53][54][55].
• Tumor Volume Assessment: Accurate segmentation of tumors allows for precise measurement of tumor volumes over time. This is essential for tracking treatment response, assessing disease progression, and adjusting treatment strategies accordingly.
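To illustrate the tumor volume assessment use case, the sketch below converts a 3D binary segmentation mask into a volume in millilitres and reports the relative change between two time points. It assumes NumPy and assumes that the voxel spacing (in millimetres) is available from the image header; the function names `tumor_volume_ml` and `relative_change`, the spacing values, and the synthetic masks are hypothetical and are not taken from this study.

```python
import numpy as np


def tumor_volume_ml(mask, voxel_spacing_mm):
    """Volume of a 3D binary mask in millilitres (1 mL = 1000 mm^3)."""
    mask = np.asarray(mask, dtype=bool)
    if mask.ndim != 3 or len(voxel_spacing_mm) != 3:
        raise ValueError("expected a 3D mask and three spacing values")
    voxel_mm3 = float(np.prod(voxel_spacing_mm))  # volume of one voxel
    return int(mask.sum()) * voxel_mm3 / 1000.0


def relative_change(baseline_ml, followup_ml):
    """Fractional volume change from baseline to follow-up."""
    if baseline_ml == 0:
        raise ValueError("baseline volume is zero")
    return (followup_ml - baseline_ml) / baseline_ml


# Synthetic example: assumed spacing of 1 x 1 x 2 mm (illustrative only).
spacing = (1.0, 1.0, 2.0)

baseline = np.zeros((10, 10, 10), dtype=np.uint8)
baseline[2:6, 2:6, 2:6] = 1      # 4 * 4 * 4 = 64 tumor voxels

followup = np.zeros((10, 10, 10), dtype=np.uint8)
followup[2:5, 2:6, 2:6] = 1      # 3 * 4 * 4 = 48 tumor voxels

v0 = tumor_volume_ml(baseline, spacing)
v1 = tumor_volume_ml(followup, spacing)
change = relative_change(v0, v1)
print(f"baseline={v0:.3f} mL followup={v1:.3f} mL change={change:+.0%}")
```

Because the volume scales linearly with the voxel count, any systematic over- or under-segmentation at the tumor boundary translates directly into a volume error, which is why segmentation precision matters for longitudinal monitoring.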

Conclusion
In conclusion, the precision of medical image segmentation is crucial in the modern healthcare landscape, significantly impacting diagnosis and treatment planning. Recent strides in deep learning have harnessed the capabilities of UNETs and Transformers to automate labor-intensive manual segmentation processes. However, despite these advancements, challenges persist, particularly when dealing with intricate anatomical structures and indistinct features, which can compromise accuracy.
Our study presents an innovative and effective solution to elevate the precision of brain tumor MRI image segmentation. We achieve this by integrating the UNET architecture with Transformers and incorporating feature improvement techniques, specifically MHE, CLAHE, and MBOBHE, to develop the high-performance image segmentation algorithms FE1-UT, FE2-UT, and FE3-UT.
Our approach relies on three fundamental pillars. Firstly, we emphasize the significance of feature improvement during the image preprocessing stage. Through techniques like MHE, CLAHE, and MBOBHE, which employ contrast enhancement, we enhance the visibility of critical details within medical images. Secondly, our UT model is designed to enhance segmentation results through customized layering within the UNET architecture. The incorporation of Transformers brings in contextual understanding and facilitates the capture of long-range dependencies in the data, thereby enabling more precise and context-aware segmentation.
The resulting model represents a comprehensive framework for achieving precise medical image segmentation, combining the power of UNETs, Transformers, and feature-enhanced filters. Our approach is not merely theoretical; it has been validated through experimental evaluations, which affirm its effectiveness in distinguishing complex brain tissues. In essence, our research contributes to the ongoing transformation of healthcare practices. By offering a highly accurate, automated solution for brain tumor MRI image segmentation, it has the potential to enhance the quality of patient care, expedite diagnosis, and streamline treatment planning. The combination of deep learning, feature improvement, and advanced network architectures offers a promising path forward for medical image analysis and improved patient outcomes.

Fig. 1 Feature Enhanced Model for MRI segmentation using UNET and transformers

Fig. 2 Layered architecture of U-Net in this study

Fig. 4 Fig. 5
Fig. 4 Performance of proposed algorithms in MSD dataset

Fig. 6 Fig. 7
Fig. 6 Visual comparison of proposed methods in database 1

Fig. 8 Impact on performance with epochs

Table 1
Layer and parameter settings

Table 2
Sensitivity comparison of different algorithms with proposed methods

• Radiotherapy Planning: Medical image segmentation plays a vital role in radiotherapy planning. The technology can help radiation oncologists identify tumor regions and healthy tissues, enabling them to create treatment plans that deliver radiation therapy precisely to the affected area while sparing surrounding healthy tissue.

Table 3
Proposed methods comparison with latest methods in Brats dataset

Table 4
Proposed methods comparison with latest methods in MSD dataset

Table 5
Ablation experiments for Brats dataset

Table 6
Ablation experiments for MSD dataset

Overfit models may perform exceedingly well on the training data but generalize poorly to new, unseen cases. Regularization techniques and data augmentation are employed to mitigate this issue, but it remains a concern.
• Imaging Variability: Medical images can exhibit substantial variability due to differences in acquisition equipment, protocols, and conditions. The model's ability to handle such variability may be limited, potentially leading to decreased accuracy in real-world clinical settings.
• Clinical Validation: Although the model demonstrates high accuracy on publicly available datasets, its performance in a real clinical setting might differ due to variations in image quality, patient population, and clinical practices. Clinical validation and integration into healthcare systems are critical steps that must be addressed.
• Ethical and Privacy Concerns: The use of deep learning models in healthcare raises ethical and privacy concerns related to patient data security and consent. Proper data handling and adherence to ethical guidelines are essential when implementing such systems.
• Algorithm Bias: If the training data is not representative of the entire population, the model may exhibit bias, potentially leading to disparities in diagnosis and treatment recommendations.
• Deployment Challenges: Integrating deep learning models into clinical workflows and ensuring their seamless operation can be challenging. Healthcare institutions may require significant infrastructure and expertise for deployment and maintenance.