Recognizing online video genres using ensemble deep convolutional learning for digital media service management

It’s evident that streaming services increasingly seek to automate the generation of film genres, a factor profoundly shaping a film’s structure and target audience. Integrating a hybrid convolutional network into service management emerges as a valuable technique for discerning various video formats. This innovative approach not only categorizes video content but also facilitates personalized recommendations, content filtering, and targeted advertising. Given the tendency of films to blend elements from multiple genres, there is a growing demand for a real-time video classification system integrated with social media networks. Leveraging deep learning, we introduce a novel architecture for identifying and categorizing video film genres. Our approach utilizes an ensemble gated recurrent unit (ensGRU) neural network, effectively analyzing motion, spatial information, and temporal relationships. Additionally, we present a sophisticated deep neural network incorporating the recommended GRU for video genre classification. The adoption of a dual-model strategy allows the network to capture robust video representations, leading to exceptional performance in multi-class movie classification. Evaluations conducted on well-known datasets, such as the LMTD dataset, consistently demonstrate the high performance of the proposed GRU model. This integrated model effectively extracts and learns features related to motion, spatial location, and temporal dynamics. Furthermore, the effectiveness of the proposed technique is validated using an engine block assembly dataset. Following the implementation of the enhanced architecture, the movie genre categorization system exhibits substantial improvements on the LMTD dataset, outperforming advanced models while requiring less computing power. With an impressive F1 score of 0.9102 and an accuracy rate of 94.4%, the recommended model consistently delivers outstanding results.
Comparative evaluations underscore the effectiveness of our proposed model in identifying and classifying video genres and in extracting contextual information from video descriptors. Additionally, by integrating edge processing capabilities, our system supports real-time video processing and analysis, further enhancing its performance and relevance in dynamic media environments.


Introduction
Visual media, such as images and videos, have become increasingly popular for sharing information due to their simplicity of creation and distribution [1][2][3]. The complexity of video perception has increased with the incorporation of temporal elements into spatial video, challenging image-based approaches. Creating a film from separate frames is straightforward, although the resulting quality is often subpar. Utilizing convolutional neural networks (CNNs) in computer vision has enhanced video categorization, which is essential for many computer vision tasks such as recommendation systems and video retrieval. Movie categorization is a challenging issue that requires more investigation, despite several studies conducted in this field [4].
Video-based movie classification has several applications such as genre-based video retrieval and filtering [5], automated labeling and annotation [6], and content recommendation [7,8]. Early research focused on predicting film genres from input movies using domain-specific datasets as the primary aim [9][10][11]. Methods sometimes involve adapting models from existing video classification tasks related to the specific challenge, such as action identification or theme recognition. Several studies have employed shot sampling methods to reduce computational costs [12][13][14][15]. These approaches select sample images from whole films. These algorithms predict movie genres from posters or still frames, although they are constrained by dataset size and scalability. Researchers have explored methods to adjust models by utilizing large video datasets based on current information [7,9,11]. Challenges exist when implementing video classification systems for predicting movie genres. Prior research has typically overlooked linguistic cues in videos that might provide genre information [9,10,16,17,18]. Movie transcripts can be used to produce accurate predictions in some cases [19]. Sequences characterized by intense discourse and eerie music typically belong to the horror or thriller genre. In contrast, sequences utilizing upbeat language are usually found in the romantic or comedy genres. Furthermore, video classification frameworks often encode the entire movie to comprehend the content. Identifying and categorizing trailer genres illustrates video classification challenges. Trailers are sometimes classified by genre using phrases such as comedy or drama to indicate the themes of the films [20]. The technique of critics and IMDB users manually categorizing films is still common today, with the final decision made by the database. To optimize multimodal techniques, people are typically provided with both textual materials and films [21][22][23].
However, text materials require more effort, and their availability cannot be guaranteed at all times. From a sample of 33,000 movie clips taken from YouTube, Condensed Movies [24] found that 50% of them were without subtitles. The genres in this field of computer vision provide unique challenges because of their wide scope and overlap. This distinguishes them from other fields such as object tracking and activity detection [25]. Media outlets struggle to accurately represent genres. The only way to determine a genre is by seeing the entire film, not just parts of it. Films sometimes include aspects from many genres, making it challenging to classify them [26].
Personalized recommendations facilitate user engagement by tailoring material to individual preferences and needs. This promotes a deep sense of personalization and satisfaction, while also enhancing the likelihood that users will discover content they enjoy. Consequently, individuals are more inclined to engage in desired actions, such as reading, commenting, and sharing content. In addition, they are more inclined to increase their overall time spent on the website [27]. When a video is categorized as comedy, the viewer could be shown other relevant comical videos based on their interests [28]. This feature enhances customers' overall viewing experience by enabling them to explore and discover content according to their interests. Organizations can restrict access by using digital media service management, which ensures effective content filtering based on certain criteria. Managing digital media services significantly impacts targeted advertising efforts and enhances corporate efficiency and security. Organizations may significantly improve promotional activities by tailoring marketing campaigns to certain video genres using content filtering. Integrating digital media service management promotes tailored suggestions, resulting in a more immersive and customized user experience. We want to leverage the cross-correlation operator to develop a GRU neural network for movie genre classification that can effectively capture spatial and motion data and reveal their temporal links. A novel neural network resembling GRU is constructed utilizing convolution and cross-correlation methods. The distinctive GRU ensemble method acquires knowledge of temporal relationships and manages the extraction of spatial and motion data. Movie genres are classified using cross-correlation and convolution operators to extract motion and spatial data. We also explore spatial and temporal data upgrades to prevent overfitting. We conducted thorough testing on well-known video genre datasets to assess our proposed ensemble GRU's performance.
We collected an assembly dataset and applied our GRU framework to it to assess the effectiveness of our ensemble GRU architecture in engine manufacturing and upkeep. The dataset contains both commonalities and contrasts among classes, including a diverse spectrum of real-world challenges. The suggested approach demonstrates its ability to manage variations by replicating real production environments. Our proposed ensemble GRU performs strongly on benchmark datasets, confirming our observed findings. Our work comprises three primary innovations. We introduce a novel ensemble GRU-like unit for video content analysis that simultaneously learns temporal connections and autonomously extracts spatial and motion information. We provide a deep learning structure for categorizing genres on the basis of our suggested GRU. The experimental results demonstrate that our developed GRU model is efficient and beneficial for classifying genres based on video frames. This study makes the following major contributions:
1) The incorporation of digital media service management is emphasized as a crucial element in improving efficiency and security in organizational environments. Furthermore, its expansion into targeted advertising through customized marketing campaigns in certain video genres is highlighted. This enhances the overall impact of promotional efforts by providing a more engaging and personalized user experience.
2) Our model presents an innovative method for creating an ensemble GRU neural network. The main goal is to use the cross-correlation operator to extract spatial and motion information, focusing on understanding their connection over time for classifying movie genres. This GRU-like neural network utilizes convolution and cross-correlation techniques to extract spatial and motion information while learning temporal relationships.
3) The paper describes comprehensive tests carried out on well-known video genre datasets to evaluate the effectiveness of the proposed ensemble GRU. The study involves using the suggested GRU structure in engine production and maintenance utilizing a compiled assembly dataset. The dataset includes several real-world issues such as similarities across classes and differences within classes, demonstrating the model's capacity to handle changes in actual production settings.
4) The achieved results of the proposed GRU ensemble on standard datasets demonstrate its superiority. Three main innovations in the work are highlighted: an innovative ensemble GRU-like unit for video content analysis, an efficient deep learning architecture for genre classification using this GRU, and experimental results that highlight the functionality and effectiveness of the proposed GRU for genre classification from video frames.
5) Another significant contribution of our work involves the active utilization of edge computing and cloud computing technologies in the process of identifying movie genres and enhancing digital media management services. Recognizing that the analysis and categorization of videos necessitate substantial and scalable computational processing, leveraging scalable cloud computing resources offers a distinct advantage. Additionally, through the integration of edge computing, data can be processed in close proximity to their sources of production, thereby reducing latency and enhancing efficiency in video analysis. Consequently, the amalgamation of edge computing for localized computational processing and cloud computing for central computing in this research has resulted in performance and accuracy improvements in movie genre identification and the enhancement of digital media management services.
The remaining portion of the article is structured as follows. The "Related work" section offers a comprehensive exploration of the most recent and advanced techniques for classifying movie genres. A detailed elucidation of the proposed method is presented in the "Proposed strategy" section. The dataset and experimental procedures are outlined in the "Results" section. Serving as the conclusion to the article, the "Conclusion" section delineates the subsequent steps and functions.

Related work
Automated movie genre categorization is a burgeoning field of interest, witnessing a surge in academic research. According to Rasheed et al. [29], video film genres can be determined by analyzing low-level statistics such as average shot time and color variation. Study [30] suggests the use of a neural network classifier for genre categorization with a single label, utilizing both visual and audio inputs.
Huang and Wang [31] employed a support vector machine (SVM) classifier for classifying visual and auditory input. Zhou et al. [32] proposed using picture descriptors like Gist, CENTRIST, and w-CENTRIST [33,34], along with a K-nearest neighbor classifier for genre prediction. A Convolutional Neural Network (ConvNet) [35] can be utilized to generate image descriptions, enhancing genre prediction.
Multimodal approaches in video classification offer several advantages, capturing and analyzing audio, visual, and textual components to improve understanding and classification precision. Ogawa et al. [36] integrated multimodal learning, employing a bidirectional Long Short-Term Memory (LSTM) network for forward and backward dependencies in video data, with a classification strategy to distinguish favorite and non-favorite videos.
The incorporation of both spatial and temporal features simultaneously enhances video genre categorization. Alvarez et al. [37] emphasized fundamental movie elements, enhancing genre categorization outcomes. Ben et al. [38] utilized ResNet and SoundNet [39] to encode visual and auditory information, enhancing the assessment of temporal aspects with an LSTM network.
Yu et al. [40] proposed a bipartite model with attention, location, time, and sequence emphasis, utilizing a deep Convolutional Neural Network (CNN) for optimal movie frame results. Genre determination is achieved through a bi-LSTM attention model. Studies [41,42] present a probabilistic method, considering the importance of each background scene within different video categories, and recommend measuring shot length for varying genres.
Yadav and Vishwakarma [43] developed a system for classifying movie trailers by genre using deep neural networks. They trained a deep neural network with a large dataset of labeled movie trailers, achieving high accuracy in genre categorization. Another study [40] employed convolutional neural networks to classify video game genres, predicting future game genres with labeled data and exploring methods like data augmentation and hyperparameter tuning for improved accuracy.
Studies [44,45] categorized cinema genres using a multimodal method, combining visual features from trailers, textual features from synopses, and aural features from soundtracks. This amalgamation enhanced genre classification accuracy and reliability. Behrouzi et al. [46] integrated auditory and visual components through multimodal data, employing a recurrent neural network (RNN) for temporal correlations and achieving over 90% success rate in classifying different film genres.
There are two primary methods for categorizing movie genres: one relies on static imagery like posters and frames, while the other utilizes dynamic video content such as trailers and snippets. Recent research has shown a shift towards modifying existing frameworks for genre categorization in movies, departing from traditional video classification challenges. This adaptation includes integrating approaches to action recognition [16,47,48,49,50] and video summarization [51,52]. Many frameworks encounter challenges due to the high computational cost associated with video analysis. Utilizing methods that consider all frames as input for films longer than a few hours becomes impractical [49,50]. Despite suggestions of sparse sampling methods to enhance efficiency, analyzing hour-long films still requires substantial computing resources.
Automated movie genre categorization has gained significant traction in recent years, leading to a surge in academic research. Various methods and techniques have been proposed and investigated to address this task. Let's delve deeper into the methodologies discussed in the related work. Rasheed et al. [29] propose analyzing low-level statistics such as average shot time and color variation to determine video film genres. This method offers simplicity in implementation and computational efficiency. However, it may suffer from lower accuracy due to its reliance on basic statistical features. Study [30] suggests employing neural network classifiers for genre categorization using both visual and audio inputs. Neural networks have the capability to learn complex patterns from data, potentially leading to higher accuracy in genre classification. However, they require large amounts of annotated data for training and can be computationally expensive. Huang and Wang [31] utilize SVM classifiers to classify visual and auditory input. SVMs are known for their effectiveness in handling high-dimensional data and can provide good classification performance. However, they may not perform well with large-scale datasets and require careful selection of kernel functions. Zhou et al. [32] propose using picture descriptors such as Gist, CENTRIST, and w-CENTRIST along with a K-nearest neighbor classifier for genre prediction. This approach leverages visual features extracted from images to classify genres. However, the performance heavily relies on the quality of the descriptors and the choice of the classification algorithm.
Multimodal methods, as highlighted by Ogawa et al. [36], integrate audio, visual, and textual components to improve genre classification precision. By considering multiple modalities, these approaches can capture richer information from the data, potentially leading to enhanced classification accuracy. However, integrating multiple modalities effectively can be challenging and may require sophisticated fusion techniques. Several studies, including those by Ben et al. [38] and Yu et al. [40], propose deep learning models such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks for genre classification. These models have demonstrated state-of-the-art performance in various tasks by automatically learning hierarchical representations from data. However, they often require large amounts of labeled data and substantial computational resources for training.
In response to the computational inefficiency observed in existing architectures and their inability to compete with models utilizing generalization techniques for enhanced decision-making processes, we propose a groundbreaking GRU-like neural network designed for the genre classification of a wide array of video films. This innovative network possesses the capability to internally and efficiently extract motion features through the strategic application of a cross-correlation operator. Additionally, to further enhance the real-time processing and analysis of video data, we integrate edge processing [53] capabilities into our proposed method. By leveraging edge processing at the network's periphery, we ensure swift and seamless processing, enabling more effective extraction of motion features and ultimately improving the accuracy of genre classification in real-time video analysis applications.

Proposed strategy
We present a comprehensive solution aimed at mitigating the prevalence of disturbing content in films. Harnessing the capabilities of deep learning, our innovative system has demonstrated significant efficacy in addressing the intricacies of video classification across diverse contexts. Our proposed methodology comprises three core elements: video preprocessing, deep feature extraction, and video representation and classification, as illustrated in Fig. 1. To ensure data integrity, rigorous preprocessing procedures are implemented to eliminate redundant or extraneous content from our dataset. Additionally, frames extracted from each video clip undergo standardization, ensuring uniform dimensions before input into convolutional blocks. These blocks, integrated into an ImageNet model, initialize the extraction of relevant features from each clip. Subsequently, the extracted features undergo meticulous analysis within the ensGRU architecture to derive efficient and representative video representations, seamlessly integrated into the decision-making process. Each stage of our methodology is comprehensively detailed in the following sections.
Moreover, we present a proposed framework architecture designed to streamline the process of video genre identification. In the initial stage, videos undergo preprocessing to eliminate redundant frames and convert the remaining frames into distinct entities. Following this, convolutional blocks and the ensGRU are employed to extract feature vectors from the frames. The ensGRU architecture is then utilized to represent the video after transforming all feature vectors. A fully connected layer then computes the probability of a video clip being classified under a specific movie genre. Subsequently, an output layer incorporates a decision-making component.
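The final classification stage described above can be illustrated with a minimal sketch. It assumes a multi-label setup in which the fully connected layer emits one logit per genre and a sigmoid turns each logit into an independent probability. The weight values, the 0.5 decision threshold, and the function names `genre_probabilities` and `predict_genres` are illustrative assumptions, not details given in the paper.

```python
import math

# The nine LMTD-9 genres; the ordering here is arbitrary.
GENRES = ["action", "adventure", "comedy", "crime", "drama",
          "horror", "romance", "sci-fi", "thriller"]


def genre_probabilities(features, weights, biases):
    """Fully connected layer followed by a per-genre sigmoid.

    features: list of floats (the video representation)
    weights:  one list of floats per genre, same length as features
    biases:   one float per genre
    """
    probabilities = []
    for row, bias in zip(weights, biases):
        logit = sum(w * f for w, f in zip(row, features)) + bias
        probabilities.append(1.0 / (1.0 + math.exp(-logit)))
    return probabilities


def predict_genres(probabilities, threshold=0.5):
    """Keep every genre whose probability reaches the (assumed) threshold."""
    return [g for g, p in zip(GENRES, probabilities) if p >= threshold]
```

Because each genre gets its own sigmoid, a trailer can be assigned several genres at once, which matches the observation that films often blend genres.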

Preprocessing step
Processing video frames is an essential step in various video analysis tasks, including classification, object detection, and action recognition. Preprocessing involves several operations aimed at enhancing the quality of video frames and extracting relevant information for subsequent analysis. The preprocessing pipeline typically commences with frame extraction, where consecutive frames are sampled from the video sequence at a fixed frame rate or keyframe intervals. This ensures a consistent representation of the video content and streamlines subsequent analysis. Once frames are extracted, they undergo various enhancement techniques to refine their quality and minimize noise. These techniques may involve operations such as denoising, contrast adjustment, and sharpening, aimed at improving image clarity and detail.
Following enhancement, we apply color normalization and standardization techniques to ensure uniform color representation across frames. This helps alleviate variations in lighting conditions and camera settings, thus enhancing the reliability of subsequent analysis algorithms. Additionally, spatial resizing and cropping may be employed to standardize the dimensions of video frames, rendering them suitable for input into deep learning models or other analysis algorithms requiring fixed-size inputs.
In addition to spatial preprocessing, temporal operations are also employed to capture motion information between consecutive frames. Techniques such as optical flow estimation are utilized to compute motion vectors between frames, providing valuable temporal context for tasks such as action recognition. The preprocessing of video frames plays a pivotal role in preparing raw video data for subsequent analysis tasks. By improving frame quality, reducing noise, and standardizing representation, preprocessing facilitates more precise and robust analysis of video content across various applications.
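Two of the preprocessing operations described in this section, fixed-rate frame sampling and intensity standardization, can be sketched as follows. This is a simplified illustration under stated assumptions: frames are represented as flat lists of pixel intensities, standardization is zero-mean and unit-variance per frame, and the helper names `sample_indices` and `standardize` are hypothetical rather than taken from the paper.

```python
from statistics import mean, pstdev


def sample_indices(num_frames, native_fps, target_fps):
    """Indices of the frames kept when resampling to target_fps.

    If target_fps exceeds native_fps, indices repeat (frames are duplicated).
    """
    if num_frames < 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame count must be >= 0 and frame rates > 0")
    step = native_fps / target_fps
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def standardize(pixels):
    """Zero-mean, unit-variance normalization of one frame's intensities."""
    mu = mean(pixels)
    sigma = pstdev(pixels)
    if sigma == 0:
        # A constant frame carries no contrast; map it to all zeros.
        return [0.0 for _ in pixels]
    return [(p - mu) / sigma for p in pixels]
```

For example, `sample_indices(10, 24, 12)` keeps every second frame of a ten-frame clip. Denoising, cropping, and optical flow estimation are omitted here for brevity.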

Convolutional blocks and concatenation
The "Convolutional Block-Concatenation" algorithm is an advanced method used to extract spatial features from videos for automatic genre classification. In this algorithm, convolutional blocks extract important features from each video frame, and these features are then combined to form a comprehensive representation of the video. Since videos have variable lengths and the number of frames may differ, this algorithm helps us extract features from all frames and arrive at an overall representation of the video. Let X denote an input video frame and Y the features extracted from it. In a residual block (a type of block used in pre-trained convolutional networks) with input x, the output is computed using Conv, the convolution operation; ReLU, the rectified linear unit activation function; and "+", the direct addition of the input to the initial output. After spatial features have been extracted from each video frame using convolutional blocks, the features from all frames are concatenated to form a unified representation of the video. This concatenation is typically performed using distinct connections, where the features of each frame are merged in parallel. The input to the concatenation consists of y_1, y_2, …, y_n, the features extracted from the individual frames, and the output Z is the overall representation of the video. Algorithm 1 gives pseudocode for the convolution and concatenation process, which extracts features from each frame of a video and then concatenates those features to form the overall representation of the video. The resulting video representation can be used for further analysis and to classify genres.
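The convolution and concatenation steps can be illustrated with a small, self-contained sketch. It makes several simplifying assumptions that are not stated in the text: each frame is treated as a 1-D signal, the residual block takes the form ReLU(Conv(x)) + x (one common arrangement of the Conv, ReLU, and "+" operators described above), and the kernel has odd length so that "same" padding is symmetric. All function names are hypothetical.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def conv1d_same(signal, kernel):
    """Zero-padded 1-D convolution (deep-learning style, no kernel flip)."""
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def residual_block(x, kernel):
    """Assumed residual form: ReLU(Conv(x)) + x (skip connection)."""
    activated = relu(conv1d_same(x, kernel))
    return [a + b for a, b in zip(activated, x)]


def video_representation(frames, kernel):
    """Extract features y_1..y_n per frame, then concatenate them into Z."""
    features = [residual_block(frame, kernel) for frame in frames]
    return [value for feature in features for value in feature]
```

With the identity kernel `[0.0, 1.0, 0.0]`, `residual_block([1.0, -2.0, 3.0], ...)` yields `[2.0, -2.0, 6.0]`: the negative entry is zeroed by ReLU and then restored by the skip connection. A real implementation would use 2-D convolutions from a pre-trained network instead.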

Ensemble GRUs
We propose that the Ensemble GRU model represents an advanced approach in the realm of deep learning, specifically tailored for video genre classification. This model is adept at processing sequential data types such as time series and videos. Leveraging the architecture of recurrent neural networks, it effectively captures dynamic relationships inherent in sequential data. Moreover, it employs the ensemble technique, amalgamating multiple distinct GRU models, thereby enhancing accuracy and overall model performance. Typically, a GRU network comprises one or more layers of GRU units, with each layer sequentially processing input information and generating output. GRU units are equipped with gates that regulate information flow and facilitate the learning of long-term dependencies. The proposed Ensemble GRU model harnesses multiple independent GRU networks, amalgamating their outputs to formulate a final prediction. This approach commonly augments model accuracy and overall performance by capitalizing on the diverse learning capabilities of each GRU network. Collectively, these networks contribute to a more varied and comprehensive prediction. In the context of movie genre classification, each GRU network may discern distinct temporal patterns within movies, encompassing the progression of events, the frequency of emotional shifts, or the dynamics of character interactions. By consolidating the outputs of these networks, the Ensemble GRU model furnishes a refined and comprehensive prediction of movie genres.
Shao and Guo Journal of Cloud Computing (2024) 13:102
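The gating and ensembling ideas in this section can be sketched with a deliberately small example. It uses a standard GRU update with scalar inputs and states, no bias terms, and combines the ensemble members by averaging their final hidden states. The averaging rule, the scalar simplification, and the parameter names (`wz`, `uz`, and so on) are assumptions made for illustration; the text does not specify how the member outputs are combined.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h, p):
    """One step of a standard scalar GRU; p maps weight names to floats."""
    z = sigmoid(p["wz"] * x + p["uz"] * h)                    # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h)                    # reset gate
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h))    # new memory
    return (1.0 - z) * h + z * candidate                      # hidden state


def run_gru(sequence, p):
    """Process a whole sequence, starting from a zero hidden state."""
    h = 0.0
    for x in sequence:
        h = gru_step(x, h, p)
    return h


def ensemble_output(sequence, members):
    """Average the final hidden states of independently parameterized GRUs."""
    outputs = [run_gru(sequence, p) for p in members]
    return sum(outputs) / len(outputs)
```

Each entry of `members` plays the role of one independent GRU network; members with different weights respond to different temporal patterns, and the average smooths over their individual errors.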

Spatial-temporal module
While deep learning-based models are generally preferred over hand-crafted feature-based models, there is still significant work required to effectively learn various aspects of complex video data. Video data typically encompasses two fundamental features: motion information and spatial information, both of which evolve over time. Therefore, to comprehensively understand video data, it is necessary to address three primary sources of information: spatial information, motion information, and their temporal dependencies.
Existing models address these challenges by designing separate streams to extract spatial information using convolutional layers and optical flow using recurrent neural networks to capture temporal dependencies. However, unlike these models which often utilize mathematical methods or deep neural networks for optical flow extraction, our proposed approach employs a GRU neural network to simultaneously capture motion information through a spatio-temporal module. In our proposed model for video genre classification, as depicted in Fig. 1, the GRU neural network is designed to concurrently extract both spatial and motion features. This integrated approach allows for a more comprehensive understanding of the video content, leveraging both spatial and temporal information simultaneously.

GRU architecture
Initially, video frames are individually fed into two identical CNNs with shared weights. These CNNs are responsible for extracting spatial features from the video frames. Subsequently, the outputs of these CNNs are aggregated to produce a unified output. This aggregated output is then passed to a layer of the spatio-temporal Gated Recurrent Unit that we have developed. In the final stage, the features obtained from the ensGRU layer are fed into a classifier.
Our model demonstrates the capability to capture both spatial-temporal properties and temporal relationships through the use of the ensGRU layer. Figure 2 provides an insight into the internal structure of the ensGRU layer. The ensGRU updating mechanism, as proposed, enables the extraction of motion characteristics and spatial information from videos while considering the temporal dependencies between these features. Figure 2 highlights the differences between our proposed ensGRU and the original GRU. Notably, in the ensGRU, inputs and weights are represented as 2D arrays, and convolution is employed instead of traditional multiplication.
Additionally, our proposed ensGRU can compute the correlation between two consecutive video frames by incorporating the video frame from the previous time step, t−1, as an additional input. The mathematical expression for the suggested ensGRU is given in Eq. (4), and the reset gate, hidden state, new memory cell, and update gate are defined in Eqs. (5) to (8). The operator ⊗ represents the batch-wise correlation operator, while ⊙ and ⁕ represent the point-wise multiplication operator and the convolution function, respectively. The term Corr_t denotes the correlation matrix. To compute the correlation between x_{t−1} and x_t, the variables are partitioned into sub-matrices called patches. Subsequently, the normalized cross-correlation between these patches is computed; it can be expressed based on Eq. (11). The cross-correlation operation is represented by •. The collection of square patches M_z with sizes 1 + 2P is denoted as M_{w×h}. To establish a stronger connection, we first normalize the input. Moreover, σ_{M_z} and σ_N are utilized for normalization, utilizing the standard deviation and mean of M_z, respectively. Since this normalization process has no trainable weights, it is not included in the training process.
Fig. 2 This figure depicts the internal architecture of our proposed GRU
Spatial and motion information can effectively be captured, and temporal interdependence can be modeled by employing both convolution and correlation methods. The cross-correlation procedure identifies portions of a picture closely resembling a smaller kernel image. Therefore, correlation is utilized to ensure that subsequent video frames exhibit similar patches. The heightened intensity in the central region of the right picture occurs when two photos display a high degree of similarity. The suggested ensGRU is employed to extract motion using the same approach.
The suggested correlation operator generates a meaningful 3-dimensional matrix, referred to as Corr_t. The model is trained by computing the values of Corr_t, x_t (the current input frame), and h_{t−1} (the output of the previous time step).
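The patch-based normalized cross-correlation described in this section can be sketched as follows. This is a simplified illustration, not the paper's operator: it compares only co-located patches of two consecutive frames (producing a 2-D map rather than the 3-dimensional Corr_t), normalizes each patch by its own mean and population standard deviation, and returns 0 for constant patches. The patch side length 1 + 2P follows the text, with `radius` standing in for P; the function names are hypothetical.

```python
import math


def extract_patch(frame, row, col, radius):
    """Square patch of side 1 + 2*radius centred at (row, col)."""
    return [
        frame[r][col - radius: col + radius + 1]
        for r in range(row - radius, row + radius + 1)
    ]


def normalize(patch):
    """Flatten a patch and rescale it to zero mean and unit variance."""
    values = [v for line in patch for v in line]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0:
        return [0.0] * len(values)  # constant patch: no structure to match
    return [(v - mu) / sigma for v in values]


def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    a, b = normalize(patch_a), normalize(patch_b)
    return sum(x * y for x, y in zip(a, b)) / len(a)


def correlation_map(prev_frame, cur_frame, radius):
    """NCC between co-located patches of x_{t-1} and x_t at interior pixels."""
    height, width = len(cur_frame), len(cur_frame[0])
    return [
        [
            ncc(extract_patch(prev_frame, r, c, radius),
                extract_patch(cur_frame, r, c, radius))
            for c in range(radius, width - radius)
        ]
        for r in range(radius, height - radius)
    ]
```

Values near 1 indicate that a region is unchanged between frames, while lower values hint at motion. Because the normalization has no learnable parameters, nothing in this computation needs to be trained, which is consistent with the remark above.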

Edge computing
In our research, we have integrated edge computing technology to enhance the process of identifying online genres of videos, thereby improving digital media management services. In this method, video data sourced near production points undergo processing via local or edge processing layers. To enact edge computing, we have devised a local processing system leveraging local hardware and edge servers [54,55]. For instance, the implementation of edge computing technology enables the extraction of crucial features from videos at locations proximate to the source of video production, such as camera devices. This approach results in diminished latency and heightened accuracy in detecting key features pertinent to movie genres. Subsequently, by transmitting the extracted data to cloud computing environments, the process of classification and analysis is executed with greater precision. Cloud platforms boast potent and scalable computing resources, facilitating the execution of complex processing operations swiftly and with enhanced accuracy. This system comprises distinct processing layers operating concurrently:
a) Data Acquisition and Input: This layer entails the ingestion of video data from input cameras into the system.
b) Edge Processing: Within this layer, local image processing algorithms are applied to the input data. This step involves extracting key features from images and preprocessing them for subsequent use.
c) Transmission of Processed Data to Cloud Servers: Subsequently, the processed data from the edge servers are forwarded to cloud servers to facilitate a more accurate classification of video genres, leveraging the robust computing resources of the cloud.
d) Processing on Cloud Servers: In this layer, more intricate genre classification algorithms are applied to the video data. This entails the utilization of deep learning models to recognize video genres and categorize them accordingly.
By adopting this structure, the integration of edge computing enables a significant enhancement in processing speed and accuracy in identifying movie genres, while also efficiently distributing the computational workload between local resources and cloud servers. This optimized approach harnesses the computational power of both edge computing and cloud computing to yield better and faster results in the analysis and categorization of movie genres.
Shao and Guo Journal of Cloud Computing (2024) 13:102

Results
We begin by providing precise details regarding the benchmark datasets used in our study. We then explain the evaluation methodology and give the specific details of the implementation. Finally, we compare our approach with leading methods in the field. In the tables, the best result for each metric is shown in bold.

Dataset and setting
A total of 4021 videos were collected from the LMTD-9 multi-label trailer database [56] for this investigation. The diversity of trailer types found in LMTD-9 surpasses that of any other dataset available. The database is divided into three subsets: 2815 trailers for training, 603 for validation, and 603 for testing (2815 + 603 + 603 = 4021). The trailers analyzed in this study span nine distinct genres: 693 thriller, 313 science fiction, 651 romance, 436 horror, 2032 drama, 659 crime, 1562 comedy, 593 adventure, and 856 action. Because the database is multi-label, a trailer can count toward more than one genre.
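As a quick arithmetic check on the figures quoted above, the short sketch below recomputes a few summary statistics. It uses only the numbers given in the text; the variable names are illustrative.

```python
# Per-genre trailer counts and subset sizes exactly as quoted in the text.
GENRE_COUNTS = {
    "thriller": 693, "science fiction": 313, "romance": 651,
    "horror": 436, "drama": 2032, "crime": 659,
    "comedy": 1562, "adventure": 593, "action": 856,
}
TOTAL_TRAILERS = 4021
SUBSET_SIZES = (2815, 603, 603)

# Total number of genre labels; it exceeds the trailer count because
# the dataset is multi-label.
total_labels = sum(GENRE_COUNTS.values())

# Average number of genre labels attached to one trailer.
labels_per_trailer = total_labels / TOTAL_TRAILERS

# Ratio between the most and least frequent genres (drama vs. science fiction).
imbalance_ratio = max(GENRE_COUNTS.values()) / min(GENRE_COUNTS.values())

print(total_labels, round(labels_per_trailer, 2), round(imbalance_ratio, 2))
```

The three subset sizes add up to the 4021 trailers, each trailer carries just under two genre labels on average, and drama is about 6.5 times as frequent as science fiction.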
To ensure non-overlapping frames, the initial frame was adjusted so that video clips could be sampled at a rate of 22 frames per second (fps). For video clips with frame rates outside the standard range of 23 to 24 fps, the clips were padded by duplicating the final frame. All training, validation, and testing of the neural network models strictly use a frame sampling rate of 22 fps. Extracting features from video frames with an ImageNet pre-trained CNN model requires fine-tuning the hyperparameters of both the GRU and the CNN layers. Some examples of the videos from this dataset are depicted in Fig. 3. Although this dataset comprises multiple labels, some researchers have determined the dominant genre of each video and conducted their analyses on that single label.
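Padding by duplicating the final frame can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code; the helper name `pad_clip` and the target length of 22 frames (one second at 22 fps) are assumptions made for the example.

```python
def pad_clip(frames, target_len):
    """Return exactly target_len frames: truncate long clips, and extend
    short clips by repeating their final frame."""
    if not frames:
        raise ValueError("cannot pad an empty clip")
    if len(frames) >= target_len:
        return list(frames[:target_len])
    return list(frames) + [frames[-1]] * (target_len - len(frames))

# Frames are represented by labels here; real frames would be image arrays.
short_clip = ["f0", "f1", "f2"]
padded = pad_clip(short_clip, 5)       # the last frame is repeated twice
one_second = pad_clip(list(range(30)), 22)  # longer clips are truncated
```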
In the convolutional blocks, we employed the VGG-Net16 architecture, which comprises three essential components, among them the convolutional layer, which is used to extract spatial information. Each layer has been pre-trained on the ImageNet dataset for this purpose. A 3 × 3 matrix serves as the convolutional mask, selecting 3 × 3 patches to compute cross-correlation. Throughout this study, we use the Adam optimizer and the softmax loss function for training.
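To make the 3 × 3 mask operation concrete, the sketch below slides a 3 × 3 kernel over a small single-channel image and computes the cross-correlation of each patch (no padding, stride 1). The image and the vertical-edge kernel are illustrative values, not weights taken from VGG-Net16.

```python
def cross_correlate_3x3(image, kernel):
    """Valid cross-correlation of a 2-D image with a 3 x 3 kernel.
    Each output value is the sum of element-wise products between the
    kernel and the 3 x 3 patch whose top-left corner is at (y, x)."""
    height, width = len(image), len(image[0])
    output = []
    for y in range(height - 2):
        row = []
        for x in range(width - 2):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(3) for j in range(3)))
        output.append(row)
    return output

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
response = cross_correlate_3x3(image, vertical_edge)
print(response)  # [[6, 6], [6, 6]]
```

Every patch of this image increases by 2 from its left column to its right column in each of its three rows, so the kernel responds with 6 everywhere.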
We extract a total of 30 frames from each video. Each video clip is sampled with a random stride and a random starting point. Various spatial data augmentation (DA) techniques, such as cropping, flipping, and rotating, are applied to address overfitting concerns. Additionally, DA over time is employed to assemble a collection of films with similar themes but varying in pacing and genres. Moreover, $V \in \mathbb{R}^{t \times w \times h \times ch}$ represents a color-channeled video comprising $t$ frames, each of size $w \times h$. From the provided data, a sequence of video clips, denoted $V_c \in \mathbb{R}^{t' \times w' \times h' \times ch}$ with $t' < t$, $w' < w$, and $h' < h$, is extracted. The frames that make up $V_c$ are determined by Eq. (14).
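The random-start, random-stride temporal sampling described above can be sketched as follows. This is one possible reading of the sampling rule, with 1-based frame indices; the function name, the `max_stride` cap, and the example video length of 120 frames are assumptions made for illustration.

```python
import random

def sample_clip_indices(t, t_clip, max_stride, rng):
    """Pick t_clip frame indices (1-based) from a video of t frames.
    A stride s is drawn first, then a start index i_s in 1 .. t - t_clip * s,
    and each following index is the previous one plus s."""
    largest_stride = min(max_stride, (t - 1) // t_clip)
    if largest_stride < 1:
        raise ValueError("video too short for the requested clip length")
    s = rng.randint(1, largest_stride)
    i_s = rng.randint(1, t - t_clip * s)
    return [i_s + j * s for j in range(t_clip)], s

rng = random.Random(0)
indices, stride = sample_clip_indices(t=120, t_clip=30, max_stride=3, rng=rng)
```

Because the start index never exceeds `t - t_clip * s`, the last sampled index stays within the video regardless of the stride that was drawn.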

Experimental results
To evaluate the accuracy and performance of the genre classification method, we first divide the set of movies into training, validation, and test sets. We then train the classification method on the training set and tune its parameters.
To assess the method, we use accuracy (Acc), precision (Prec), recall (Recl), and F1 score (F1). After training on the training set, we evaluate the method on the validation set and compare its predictions with the actual labels.
Additionally, after running the method on the test set, we compare its output with the actual labels and compute the same four metrics. These metrics allow us to evaluate the overall performance and reliability of the classification method in detecting and predicting movie genres. Table 1 presents a comparison of the test results obtained with our strategy.
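The four metrics named above can all be derived from a confusion matrix. The sketch below assumes a single-label (dominant genre) setting and macro averaging over classes; the text does not state which averaging the reported scores use, so that choice, the function name, and the 3-class example matrix are illustrative.

```python
def classification_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j.
    Returns accuracy plus macro-averaged precision, recall, and F1."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted_c = sum(confusion[r][c] for r in range(n))  # column sum
        actual_c = sum(confusion[c])                          # row sum
        p = tp / predicted_c if predicted_c else 0.0
        r = tp / actual_c if actual_c else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy 3-class example with 10 samples per true class.
toy_confusion = [[8, 1, 1],
                 [2, 7, 1],
                 [0, 2, 8]]
metrics = classification_metrics(toy_confusion)
```

For this toy matrix, 23 of 30 samples lie on the diagonal, and the per-class precision, recall, and F1 values are 0.8, 0.7, and 0.8, so all four metrics come out at about 0.767.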
The proposed methodology for categorizing cinema genres offers a viable alternative to conventional learning approaches such as VGG-16, VGG-19, and VGG-GRU (VGG-ensGRU). Utilizing a dataset comprising preference samples from various video genres, ranging in magnitude from 1 to 9, our analysis reveals that this strategy is notably more effective and applicable across a broader range of scenarios than previously assumed. Achieving an accuracy rate of 94.38% when tested on 9 distinct genres, our approach leverages video frames to cater to individual preferences, with each technique achieving the desired level of accuracy.
$$V_c = \left\{ f_{i_j} \mid i_s \in \{1, 2, 3, \ldots, t - t' \times s\},\ i_j = i_{j-1} + s \right\} \tag{14}$$
Fig. 3 In this figure, examples of videos along with their corresponding genres from this dataset (LMTD-9) are depicted [56]
Results indicate that employing fivefold cross-validation (CV) enhances experimental accuracy while substantially mitigating potential bias, yielding reliable outcomes. However, some elements within the video frame collection may exhibit suboptimal performance. To address this, three enhanced models (VGG-16, VGG-19, and VGG-GRU) were employed together with the proposed model, as shown in Table 1. Our research underscores these models' exceptional computational accuracy, reactivity, and ability to discern between different data.
Table 1 The table provided presents our proposed methodology alongside experimental results from three similar experiments employing fivefold cross-validation studies using similar models. A diverse selection of videos was gathered from a wide range of films for analysis
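As a reminder of how fivefold cross-validation partitions the data, the sketch below generates the train and validation index sets for each fold. It is a generic illustration, not the authors' experimental script; the function name, the shuffling seed, and the use of all 4021 trailers in one pool are assumptions.

```python
import random

def k_fold_indices(num_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.
    Indices are shuffled once, dealt into k nearly equal folds, and each
    fold serves as the validation set exactly once."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(4021, k=5))
fold_sizes = sorted(len(validation) for _, validation in splits)
```

Each sample is used for validation once and for training four times, which is why averaging over the folds gives a less biased estimate than a single split.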
Consequently, our design surpasses state-of-the-art techniques in both accuracy and computational cost. Based on trial data, VGG-GRU outperforms other frameworks with an F1 score of 0.9102 and an accuracy of 93.70%.
The categorization strategy yielded valuable data with a precision rate of 93%. The proposed approach demonstrated superior performance compared to similar models in a nine-class classification task, achieving an accuracy rate of 94.3% across all three rounds of fivefold cross-validation. Multi-class categorization was applied to each model scenario. The confusion matrices depicted in Fig. 4 support model comparison and represent the categorization outcomes. The data did not exhibit any statistically significant variation or deviation, with a categorization accuracy rate of 94.4% across various movie genres.
Fig. 4 The presented findings showcase confusion matrices that illustrate the classification outcomes of three comparable models in the context of movie categorization, in comparison with the proposed model
Given the subjective nature of categorization, films are often classified into a single genre rather than multiple genres. The classification output and the approach's effectiveness across various movie genre classification scenarios are evaluated using variance estimation. Figure 5 displays classification results from several tests, compared to similar models such as VGG, VGG-GRU, ensGRU, and VGG-ensVGG, using receiver operating characteristic (ROC) curves and area under the curve (AUC) for each class. Our proposed model consistently outperforms VGG-GRU, ensGRU, and CNN + ensVGG architectures in genre categorization, demonstrating exceptional classification accuracy and robustness in the face of ambiguity.
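Per-class AUC values of the kind plotted in ROC analyses can be computed in a one-vs-rest fashion. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half); the function names and the toy scores are assumptions, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC for binary labels (1 = positive, 0 = negative) and real-valued scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def one_vs_rest_auc(true_classes, score_rows, num_classes):
    """One AUC per class: class c is treated as positive, all others as negative."""
    return [roc_auc([1 if y == c else 0 for y in true_classes],
                    [row[c] for row in score_rows])
            for c in range(num_classes)]

binary_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

toy_true = [0, 1, 2, 0]
toy_scores = [[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5],
              [0.4, 0.5, 0.1]]
per_class_auc = one_vs_rest_auc(toy_true, toy_scores, 3)
```

In the binary example, three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75; in the toy three-class example every class is perfectly separated from the rest.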

Discussion and comparison
The model we have designed, named ensGRU + CNN, has remarkable capabilities and can be compared with other methods such as VGG, ResNet, and DenseNet in video genre classification. Its accuracy for identifying 9 different genres in the LMTD dataset, 94.4%, is higher than that of the three other methods. The model stands out from other methods because it combines GRU and CNN networks, which gives it a strong ability to detect both temporal and spatial patterns in videos. The GRU, as a recurrent network, learns temporal patterns, while the CNN extracts spatial features from video frames. This combination enables the model to identify more complex and diverse patterns, resulting in higher accuracy in video genre classification. Additionally, the ensGRU + CNN model benefits from CNN models pre-trained on ImageNet to extract useful features from video frames, improving the model's generalization ability.
However, one potential weakness of this approach may be its high computational cost, especially when the model is dealing with large and complex datasets. Additionally, training this model may require access to large and suitable datasets, which may be challenging for some domains.
When comparing the ensGRU + CNN approach with other methods such as VGG, ResNet, and DenseNet, distinct strengths and weaknesses emerge. VGG, known for its straightforward architecture and spatial feature extraction, tends to suffer from computational inefficiency due to its parameter-heavy nature, and may struggle to capture temporal patterns effectively. Similarly, while ResNet addresses the vanishing gradient problem and achieves excellent performance in image classification, its reliance on purely convolutional layers limits its ability to model temporal dependencies in video data. DenseNet's dense connections promote feature reuse and gradient flow, yet scalability issues persist as the network deepens, potentially hindering its performance in tasks requiring explicit modeling of temporal dynamics. Fig. 6 compares the methods on accuracy and indicates that the proposed method exhibits superior genre classification capabilities for video films.
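The division of labour described above (a CNN producing one feature vector per frame, a recurrent unit summarising the sequence) can be sketched with a standard GRU cell. This is a textbook GRU, not the ensemble GRU with the correlation term proposed in the paper; the random vectors stand in for CNN features, and the dimensions, initialisation, and function names are assumptions. The update convention used is h_t = (1 - z) * h_{t-1} + z * candidate.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights for the update (z), reset (r), and candidate (h) gates."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in "zrh":
        params["W" + gate] = mat(hidden_dim, input_dim)   # input weights
        params["U" + gate] = mat(hidden_dim, hidden_dim)  # recurrent weights
        params["b" + gate] = [0.0] * hidden_dim
    return params

def gru_step(x, h, p):
    """One standard GRU update from hidden state h and input x."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = [math.tanh(a + b + c) for a, b, c in
                 zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h), p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]

def encode_clip(frame_features, params, hidden_dim):
    """Run the GRU over per-frame feature vectors; the final state summarises the clip."""
    h = [0.0] * hidden_dim
    for x in frame_features:
        h = gru_step(x, h, params)
    return h

rng = random.Random(42)
frame_features = [[rng.random() for _ in range(8)] for _ in range(30)]  # 30 frames, 8-d stand-in features
params = init_params(input_dim=8, hidden_dim=4)
clip_state = encode_clip(frame_features, params, hidden_dim=4)
```

The final hidden state would then feed a classification layer; a purely convolutional model has no comparable state carried from frame to frame, which is the limitation the comparison above points to.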
Given the methods published by researchers in recent years, the superiority of the ensGRU + CNN model over them can be explained as follows:
1. Superiority over GRU + SVM [42]:
• The ensGRU + CNN model utilizes a combination of GRU and CNN, providing a significant improvement in capturing both temporal and spatial patterns. In contrast, GRU + SVM relies solely on recurrent neural networks and a traditional classification method like SVM, which may offer less assistance in embedding spatial features for video classification.
2. Superiority over 1D-Conv-V [42]:
• Compared to 1D-Conv-V, which employs a one-dimensional convolutional network for video analysis, the ensGRU + CNN model has a better ability to learn complex temporal and spatial features. This fusion of GRU and CNN can enhance the recognition of intricate temporal patterns and increase classification accuracy.
3. Superiority over LLFM [37], LSTM [57], and CTT-MMC-TN [57]:
• The Ensemble GRU + CNN model utilizes an effective combination of recurrent and convolutional neural networks to improve video classification accuracy. In contrast, LLFM, LSTM, and CTT-MMC-TN employ alternative methods for modeling and classifying, which may not be as effective in capturing temporal and spatial patterns.
Overall, the ensGRU + CNN model, with its blend of recurrent and convolutional neural networks, demonstrates better capabilities in modeling videos and recognizing complex temporal and spatial patterns, resulting in higher accuracy and precision in video genre classification.
Fig. 6 In this figure, we compare our model with similar deep strategies based on accuracy criteria, demonstrating that the proposed method exhibits superior genre classification capabilities in relation to video films
The results presented in Table 2 demonstrate a significant improvement in scores across various genres with the incorporation of the multi-modal component within the network. This observation implies that integrating movie soundtracks can greatly enhance the accuracy of genre prediction.
The observed decrease in the science fiction score with the inclusion of audio can be attributed to the inherent similarity between the musical compositions of action and science fiction films. Analysis of confusion matrices and loss convergence indicates a noticeable difference in the quantity of action movie trailers compared to sci-fi trailers. This discrepancy has the potential to create confusion within the network, resulting in misclassification of these two genres. The AUC ratings of the Gated Recurrent Unit (GRU) + SVM and 1D-Conv + SVM models are close. However, the GRU + SVM model outperforms the 1D-Conv + SVM model in five out of nine genres. Furthermore, the GRU + SVM technique demonstrates superior performance in all categories except for drama, as assessed by the AUC metric, when compared to the most advanced models. Notably, the GRU + SVM model shows significant improvements in performance within genres characterized by limited data availability, such as thriller and horror videos.

Limitations and difficulties
One of the primary limitations encountered in genre classification of videos using the Ensemble GRU + CNN method is data imbalance among different genres. Certain genres may have a significantly larger number of samples compared to others, leading to biased model training and potentially affecting the accuracy of classification for underrepresented genres.
Another challenge lies in the complexity of temporal patterns present in video data. While the Ensemble GRU + CNN approach excels in capturing both spatial and temporal information, the intricate nature of temporal dynamics in videos, such as rapid scene changes or subtle transitions, can pose difficulties in accurate genre classification.
Moreover, the subjective nature of genre definitions can introduce variability in the classification process. Different annotators may have diverse interpretations of genre labels, leading to inconsistencies in the labeled dataset. This variability can affect the model's ability to generalize across genres and may result in misclassification errors.
Despite the effectiveness of deep learning models like Ensemble GRU + CNN, there may be limitations in their ability to understand contextual nuances within videos. Some genres may rely heavily on subtle cues or contextual information that is challenging for the model to discern accurately, potentially leading to misclassification or ambiguity in genre assignment.
In addition, implementing the Ensemble GRU + CNN method for video genre classification requires significant computational resources, including powerful hardware for training and inference. The computational complexity of deep learning architectures, coupled with the need for large-scale video datasets, can pose challenges in terms of infrastructure and resource availability for researchers and practitioners.
The decision to employ the proposed method over alternative approaches stems from several advantages that our technique offers, including enhanced performance, efficiency, and applicability to the specific problem domain of movie genre classification. Our method, which integrates a novel ensemble GRU-like unit, has demonstrated superior performance compared to existing techniques on benchmark datasets. Through extensive testing, we have consistently observed higher accuracy rates and F1 scores, indicating the effectiveness of our approach in accurately classifying video genres. Our ensemble GRU architecture is designed to efficiently capture spatial and motion data while revealing their temporal relationships. By leveraging the cross-correlation operator and employing convolution and cross-correlation techniques, our model can effectively extract relevant features from video frames, leading to improved classification accuracy.
Furthermore, our method exhibits a high degree of applicability to the problem domain of movie genre classification. We have conducted thorough evaluations on well-known video genre datasets and have also validated the effectiveness of our approach in real-world scenarios, such as video production and maintenance, using a compiled assembly dataset. These experiments demonstrate the robustness and versatility of our proposed method across different domains and applications. In summary, the main reason for choosing our proposed method lies in its ability to deliver superior performance, efficiency, and applicability to the specific problem domain of movie genre classification.
Utilizing a single dataset allowed us to thoroughly evaluate the performance of our proposed method under controlled conditions. Despite the constraints, we have conducted extensive testing and validation on the LMTD database to provide a comprehensive assessment of our approach's effectiveness. Although we recognize the importance of testing on multiple datasets to validate the generalizability of our method, we believe that our results on the LMTD dataset provide valuable insights into the capabilities of our proposed model. Furthermore, fine-tuning across multiple datasets introduces additional challenges, such as dataset-specific biases and variations, which may require further adjustments to the model architecture and hyperparameters. Given these complexities, we opted to focus on thoroughly evaluating our proposed model on the LMTD dataset, ensuring a comprehensive understanding of its performance within a controlled setting.
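The genre imbalance named as the first limitation in this section is often countered by weighting each class inversely to its frequency in the loss. The sketch below applies that common heuristic to the per-genre trailer counts reported for LMTD-9; the text does not say that such weights were used in the experiments, so this is purely an illustration of one possible mitigation.

```python
# Per-genre trailer counts as reported for LMTD-9 in this paper.
GENRE_COUNTS = {
    "thriller": 693, "science fiction": 313, "romance": 651,
    "horror": 436, "drama": 2032, "crime": 659,
    "comedy": 1562, "adventure": 593, "action": 856,
}

def inverse_frequency_weights(counts):
    """Weight for class c = (total labels) / (number of classes * count of c).
    Rare classes receive weights above 1, frequent classes below 1."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {genre: total / (num_classes * n) for genre, n in counts.items()}

class_weights = inverse_frequency_weights(GENRE_COUNTS)
```

With these counts, science fiction (the rarest genre) receives a weight roughly 6.5 times that of drama (the most frequent), so errors on rare genres contribute more to a weighted loss.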

Conclusion
Our study achieves a commendable classification accuracy of 94.4%. The proposed method, Ensemble GRU + CNN, emerges as an effective approach for online video genre recognition, leveraging the combined strengths of gated recurrent units (GRU) and convolutional neural networks (CNN). Through comparisons with several techniques in video genre classification, we justify the superior performance of our model.
Additionally, we employed edge processing in our work to facilitate real-time processing and analysis of video data. By leveraging edge processing capabilities, we ensured smooth processing and relied on it to enhance the analysis of video data. This enabled us to achieve accurate classification across various genres while maintaining efficiency in digital media service management systems. The high classification accuracy attained underscores the efficacy of the Ensemble GRU + CNN model in digital media service management, offering promising prospects for enhanced content recommendation and personalized user experiences on online video platforms.
As authors, we acknowledge several challenges and anticipate future research endeavors. Acquiring diverse and comprehensive datasets for video genre classification remains a significant challenge, necessitating labeled data spanning various genres and cultural backgrounds for robust model training. Achieving model generalization across different platforms and user preferences also poses challenges, requiring thorough experimentation and fine-tuning of the Ensemble GRU + CNN model. Implementing real-time inference of genre classification models in digital media service management systems introduces computational challenges, demanding optimization strategies tailored to platform-specific requirements. Incorporating user feedback and preferences into the genre classification pipeline adds complexity, necessitating the development of adaptive algorithms capable of dynamically adjusting genre classification based on user interactions and feedback. By acknowledging these challenges and incorporating edge processing into our approach, we aim to advance the field of online video genre recognition and digital media service management, paving the way for more efficient and accurate video analysis in real-time applications.

Fig. 1
Fig. 1 In this figure, the proposed method for the genre classification of movies is shown

Fig. 5
Fig. 5 An assessment has been conducted to compare the proposed model with other similar techniques in order to evaluate the AUC criterion in different schemes using ROC analysis. The movie genre was determined by comparing the model with GRU and ensGRU

Table 2
The table below provides a comprehensive overview of technical breakthroughs and the corresponding fields in which each film aims to implement its methodologies, focusing on accuracy criteria