1 Introduction

Accurate detection and identification of pharmaceutical pills are essential to maintaining medication safety, ensuring effective clinical treatment, and meeting regulatory standards in contemporary healthcare systems. With the increasing integration of automated drug-dispensing machines in pharmacies and hospitals [1], the need for reliable and scalable inspection mechanisms has become more urgent. These machines, although capable of efficiently interpreting electronic prescriptions and sorting medications, are still susceptible to dispensing errors, necessitating secondary verification to ensure delivery accuracy. A commonly adopted method for this post-dispensing validation is visual inspection using digital imaging systems. Traditional approaches rely on rule-based analyses or template matching techniques, comparing extracted features from pill images with pre-stored references [2]. However, such methods lack adaptability in the presence of visual ambiguities, novel pill geometries, variable lighting, and diverse backgrounds.

To overcome these limitations, recent research has embraced deep learning-based solutions for pill classification and recognition. For instance, collaborative multi-label classification models that jointly learn label semantics and label-specific visual features have demonstrated improved classification accuracy and interpretability [3], though their computational complexity poses limitations for real-time applications. Furthermore, AI-driven models have shown high performance in detecting standard pill types such as circular and oval tablets [4], but these systems often falter when exposed to irregular shapes or uncommon designs, highlighting the need for diverse and representative datasets. One study using Microsoft’s Azure Custom Vision platform trained on over 26,000 pill images from hospital environments achieved precision and recall values of 98.7% and 95.1%, respectively, yet experienced significant performance variation under real-world deployment scenarios [5]. This underlines the importance of designing user-centered interfaces and supporting effective AI-human interaction in clinical practice.

In parallel, computer vision techniques have been applied to medication adherence monitoring, such as through blister pack analysis using the Circular Hough Transform, which achieved over 95% detection accuracy in both full and partially used blister packs [6, 7]. Despite these advances, key challenges remain, including high visual variability among pills in terms of shape, color, size, and surface imprint, often complicated by the presence of pills with similar appearances but differing pharmacological compositions [16, 17, 18]. Additionally, real-time recognition using webcams or mobile devices introduces technical barriers such as motion blur, occlusions, background noise, and inconsistent lighting conditions [8, 9, 19, 20, 21]. Compounding this issue is the limited availability of comprehensive public datasets, particularly those covering rare, generic, or region-specific pills, which constrains model generalizability [11, 22].

Recent innovations seek to address these issues through advanced image preprocessing techniques, including contrast normalization, denoising, and super-resolution methods, aimed at enhancing image quality under challenging conditions [23].
Moreover, multimodal learning approaches that integrate image data with associated textual descriptions or spectral signatures are emerging as powerful tools to bolster robustness and context-aware recognition [24, 25]. Finally, the advent of edge computing architectures enables real-time inference at the point of care, reducing latency and dependence on cloud infrastructure while supporting mobile and decentralized applications of pill detection and classification. To address these issues, we propose an automated pill detection and identification system based on YOLOv5s, a compact variant of the YOLO (You Only Look Once) family of object detectors known for its real-time efficiency and high accuracy.

This paper presents a deep learning-based approach for pill detection and recognition using the YOLOv5s algorithm, integrating multiple techniques to enhance accuracy and robustness. The key contributions of our work are:

  • YOLOv5s-based pill detection

We employ the YOLOv5s object detection model for localizing pills and their imprints in high-resolution images. YOLOv5s offers a favorable trade-off between detection accuracy and computational efficiency, particularly in small-object detection scenarios common in pharmaceutical applications.

  • Confidence-based non-maximum suppression (NMS)

To address challenges arising from overlapping pills or imprints, we implement a confidence-based NMS strategy that suppresses redundant detections while preserving high-confidence candidates. This improves localization precision in cluttered environments.

  • Deep text spotter (DTS) for imprint recognition

A retrained Deep Text Spotter (DTS) module is incorporated to detect and interpret imprinted alphanumeric characters on pill surfaces. The module is fine-tuned using pill-specific text textures under various lighting and occlusion conditions, enabling robust imprint detection.

  • Character-level RNN with coordinate encoding

To refine OCR outputs and correct misrecognized or spatially distorted characters, we introduce a character-level recurrent neural network (RNN) augmented with coordinate encoding. This mechanism considers spatial layout information, thereby improving recognition of partially occluded or rotated imprints.

While these contributions align with established methodologies, their specific implementations and integrations within the proposed system offer practical advancements in pill detection and classification.

2 Dataset

To train and evaluate the proposed pill detection and recognition framework, two complementary datasets were employed: (i) a publicly available benchmark dataset from the U.S. National Library of Medicine (NLM), and (ii) a custom real-world pill image dataset collected under uncontrolled conditions. This dual-dataset strategy ensured both standardized evaluation and validation of the system’s robustness under deployment-oriented scenarios.

The NLM dataset contains 24,404 images corresponding to 1,000 distinct pharmaceutical products. Each product is represented by reference images (captured under standardized conditions with front and back views) and consumer-grade images (captured via mobile devices under varied lighting, background, and device settings). For our experiments, a representative subset of 3,887 pill images was selected, reflecting morphological diversity across shapes and dosage forms. Within this subset, 1,000 images were used for model refinement (training), while 2,887 were reserved exclusively for evaluation. The split was carefully designed to prevent data leakage: images corresponding to the same pill identity or near-duplicate captures were not shared across subsets.

The custom real-world dataset was independently collected to simulate challenging deployment conditions. Images were captured with a standard smartphone camera (12 MP, f/1.8 aperture, autofocus enabled) to mimic typical user and clinical capture scenarios, under uncontrolled lighting, cluttered backgrounds, and partial occlusion representative of clinical, household, and point-of-care environments. This dataset was used as an additional validation resource to assess generalization performance. The same leakage-prevention strategy was applied, ensuring that identical pills or duplicate captures did not appear across training and test splits.
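A minimal sketch of how such a leakage-free split can be constructed is shown below; it assumes each image record carries a pill-identity key (the `pill_id` field and the split ratio are illustrative, not taken from our implementation).

```python
import random
from collections import defaultdict

def split_by_pill_identity(records, train_ratio=0.26, seed=42):
    """Split image records so that all captures of one pill identity fall into a single subset.
    Each record is assumed to be a dict carrying an illustrative 'pill_id' key."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["pill_id"]].append(rec)

    pill_ids = sorted(groups)
    random.Random(seed).shuffle(pill_ids)          # deterministic shuffle of identities

    target_train = int(train_ratio * len(records))
    train, test, count = [], [], 0
    for pid in pill_ids:
        if count < target_train:                   # fill the training subset identity by identity
            train.extend(groups[pid])
            count += len(groups[pid])
        else:
            test.extend(groups[pid])
    return train, test
```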

A concise overview of dataset composition, imaging conditions, and train/validation/test splits is provided in Table 1.

Table 1 Dataset Composition

Proposed methodology

The proposed system adopts a modular deep learning framework for robust and real-time pill detection, classification, and imprint recognition, optimized for deployment on mobile or edge-enabled platforms. At its core, the system leverages the YOLOv5s (You Only Look Once, version 5 small) object detection model to localize both pharmaceutical pills and their surface imprints within natural RGB images, typically captured using mobile device cameras. The YOLOv5s model is custom-trained to detect small-scale and fine-grained features, which are critical in the pharmaceutical domain, where pills often exhibit subtle differences in shape, size, and imprint configuration. Upon receiving an input image, the YOLOv5s model outputs a set of bounding boxes, class labels, and associated confidence scores, effectively identifying both the pill bodies and their imprinted characters. This performance is enabled by YOLOv5s’s Cross Stage Partial Network (CSPNet) backbone for efficient gradient flow and feature extraction, along with a multi-scale detection head that supports recognition of objects at varied resolutions. To improve detection accuracy, particularly in scenes involving overlapping pills or partial occlusions, a confidence-based Non-Maximum Suppression (NMS) algorithm is employed. This method filters redundant or low-confidence detections, producing a refined set of localized bounding boxes for downstream processing.

Following detection, the identified pill regions and imprint areas are cropped from the input image and processed independently. For pill classification, each cropped region is passed through a ResNet-32 feature extractor, which captures high-level visual attributes such as shape, color, and contour. The resulting compact feature vectors serve as input for classification tasks, enabling the system to discriminate between pill types with high precision. Simultaneously, the imprint regions undergo a specialized recognition pipeline. These subimages, either in raw form or texture-enhanced via pre-processing filters, are passed to a retrained Deep Text Spotter (DTS) model. The DTS model is optimized for character detection under challenging conditions, such as distorted fonts, uneven lighting, or partial occlusion. Its output is subsequently refined by a character-level Recurrent Neural Network (RNN) augmented with coordinate encoding, which integrates spatial sequence information to resolve character ambiguities and correct optical character recognition (OCR) errors. This integration of visual and spatial-linguistic cues enables the system to accurately reconstruct imprint text even in the presence of noise, blur, or partially degraded characters.

The final system output includes (i) the classified pill identity, (ii) the recognized imprint string, and (iii) corresponding confidence scores, with optional visualization overlays displaying bounding boxes and imprint annotations. This architecture provides a comprehensive and scalable solution for automated pill inspection, with potential applications in pharmaceutical dispensing validation, clinical safety checks, and mobile health (mHealth) tools.
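The modular flow described above can be summarized in a short orchestration sketch. All object names and method signatures (`detector.detect`, `classifier.predict`, `text_spotter.read`, `corrector.correct`) are illustrative placeholders for the components detailed in the following subsections, and the image is assumed to be a NumPy array.

```python
def identify_pills(image, detector, classifier, text_spotter, corrector,
                   conf_thres=0.45, iou_thres=0.50):
    """High-level pipeline sketch: detect -> crop -> classify -> read imprint -> correct imprint.
    All module objects and their methods are illustrative placeholders; `image` is a NumPy array."""
    detections = detector.detect(image, conf_thres=conf_thres, iou_thres=iou_thres)
    results = []
    for det in detections:
        crop = image[det.y1:det.y2, det.x1:det.x2]           # cropped pill region
        pill_class, cls_conf = classifier.predict(crop)       # shape/colour/form classification
        chars, coords = text_spotter.read(crop)                # raw imprint characters + 2D positions
        imprint = corrector.correct(chars, coords)             # coordinate-aware RNN correction
        results.append({
            "bbox": (det.x1, det.y1, det.x2, det.y2),
            "class": pill_class,
            "imprint": imprint,
            "confidence": min(det.conf, cls_conf),
        })
    return results
```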

2.1 Network model

We propose a real-time framework for pill detection and classification based on the YOLOv5s object detection architecture, incorporating a Cross Stage Partial Network (CSPNet) backbone to facilitate efficient and robust feature extraction. The task of pharmaceutical pill recognition is inherently challenging due to the presence of subtle visual distinctions among pills, including variations in shape, color, size, and imprint content. These factors demand a high-resolution detection pipeline capable of capturing fine-grained features under diverse imaging conditions. YOLOv5s, the most lightweight model in the YOLOv5 series, offers an optimal trade-off between computational efficiency and detection accuracy, making it particularly advantageous for real-time applications in resource-constrained environments. Its compact design and high inference speed render it well-suited for embedded platforms, such as mobile health (mHealth) devices and edge-based clinical monitoring systems, where rapid and accurate identification of tablets and capsules is essential.

2.1.1 Backbone: CSPDarknet with CSPNet

At the core of the proposed framework lies the YOLOv5s detection architecture, which employs a CSPNet-based backbone known as CSPDarknet to enhance the network’s representational capacity while maintaining computational efficiency. In this architecture, the input feature map X is partitioned into two components, X1 and X2 ​. The subset X1​ is propagated through a series of convolutional layers, denoted by the function F(⋅), and its output is subsequently concatenated with the bypassed pathway X2, as formulated below:

$$\text{Y} = \text{concat} (\text{F}(\text{X}_{1}), \text{X}_2)$$
(1)

This design facilitates diverse feature learning while minimizing computational redundancy and improving gradient propagation across layers. Such improvements are particularly valuable in the context of pill detection, where the ability to discern fine-grained visual features—such as embossed imprint characters, logos, or surface texture variations—is essential for accurate identification [26].
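Equation (1) can be realized as a compact PyTorch module; the sketch below is illustrative rather than the exact YOLOv5s block, reduces F(·) to two convolutions, and assumes an even channel count.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Minimal Cross Stage Partial block: split the channels, transform one half, concatenate (Eq. 1)."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 2 == 0, "channel count assumed even for the split"
        half = channels // 2
        self.f = nn.Sequential(                      # F(.) applied to X1 (two conv layers for brevity)
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)            # partition X into X1, X2 along channels
        return torch.cat([self.f(x1), x2], dim=1)    # Y = concat(F(X1), X2)
```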

The overall YOLOv5s architecture comprises three main components: the backbone (CSPDarknet), the neck (Path Aggregation Network, PANet), and the detection head. The CSPDarknet backbone is tasked with extracting hierarchical, multiscale features from input pill images, thereby enabling robust representation of both global shape attributes and localized imprint patterns. The PANet neck further refines these features by integrating cross-scale semantic information, while the detection head predicts bounding boxes, object classes, and confidence scores across multiple spatial resolutions.

2.1.2 Neck: PANet for multi-scale feature aggregation

The neck component of the YOLOv5s architecture plays a pivotal role in enhancing multi-scale feature representation through the incorporation of the Path Aggregation Network (PANet). Specifically, PANet facilitates both top-down and bottom-up feature fusion, enabling the model to aggregate contextual and spatial information across multiple layers. This design significantly improves the detection of objects—particularly small-sized pills—that may appear at varying resolutions and positions within an input image. The bottom-up path augmentation introduced by PANet enhances localization precision and enriches the semantic understanding of low-level features [27].

Following feature fusion, the detection head predicts bounding box coordinates, object confidence scores, and class probabilities for each identified region. The bounding box coordinates are derived from the neural network’s raw output values using the following transformations:

$$\hat{x}=\sigma(t_{x})+ c_{x}, \qquad \hat{y}=\sigma(t_{y})+ c_{y}$$
(2)
$$\hat{w}=P_{w}e^{t_{w}},\qquad \hat{h}=P_{h}e^{t_{h}}$$
(3)

where tx, ty, tw, th represent the raw outputs of the network for the center coordinates and dimensions, respectively, σ denotes the sigmoid activation function, cx, cy​ are the top-left coordinates of the grid cell, and pw, ph​ are the dimensions of the predefined anchor boxes. The confidence score for objectness and classification is computed as:

$$S = \sigma(t_{o}) \cdot \operatorname{softmax}(t_{c})$$
(4)

where to​ is the objectness score, indicating the likelihood that an object exists within the predicted bounding box, and tc​ is the class probability vector representing the likelihood distribution over possible pill categories. These formulations ensure that the model not only accurately localizes pills but also assigns high-confidence classifications, even in cluttered or low-contrast visual environments.
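A hedged sketch of the decoding step in Eqs. (2)-(4) for a single grid cell and anchor is given below; the raw outputs are assumed to be tensors, and argument names are illustrative.

```python
import torch

def decode_prediction(t_xywh, t_obj, t_cls, cell_xy, anchor_wh):
    """Decode raw network outputs for one anchor into box geometry and class scores (Eqs. 2-4)."""
    tx, ty, tw, th = t_xywh                       # raw centre/size outputs
    cx, cy = cell_xy                              # top-left corner of the grid cell
    pw, ph = anchor_wh                            # anchor box dimensions

    x_hat = torch.sigmoid(tx) + cx                # Eq. (2): predicted centre
    y_hat = torch.sigmoid(ty) + cy
    w_hat = pw * torch.exp(tw)                    # Eq. (3): predicted width/height
    h_hat = ph * torch.exp(th)

    score = torch.sigmoid(t_obj) * torch.softmax(t_cls, dim=-1)   # Eq. (4): objectness x class probs
    return (x_hat, y_hat, w_hat, h_hat), score
```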

2.1.3 Loss function

The proposed pill detection framework is trained using a composite loss function that integrates three distinct components: classification loss, objectness loss, and bounding box regression loss. The total loss is defined as:

$$L_{\text{total}} = \lambda_{\text{cls}} \cdot L_{\text{cls}} + \lambda_{\text{obj}} \cdot L_{\text{obj}} + \lambda_{\text{box}} \cdot L_{\text{box}}$$
(5)

where λcls, λobj, and λbox are scalar weighting factors for the respective loss components. The classification loss (Lcls) and objectness loss (Lobj) are computed using the binary cross-entropy (BCE) function, ensuring effective discrimination among pill classes and reliable detection of object presence within bounding boxes.

For bounding box regression, the Complete Intersection over Union (CIoU) loss is employed, which extends standard IoU by incorporating distance, aspect ratio, and overlap constraints. The CIoU loss is defined as follows:

$$L_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v$$
(6)

In this formulation, ρ(⋅) denotes the Euclidean distance between the centers of the predicted bounding box b and the ground-truth box b^gt, c is the diagonal length of the smallest enclosing box covering both b and b^gt, v measures the aspect-ratio consistency between the predicted and ground-truth boxes, and α is a dynamically computed balancing coefficient.
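The CIoU term in Eq. (6) can be sketched as follows for axis-aligned boxes in (x1, y1, x2, y2) format; this is a minimal reference implementation for illustration, not the exact loss code used in training.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Complete IoU loss (Eq. 6) for boxes given as (N, 4) tensors in (x1, y1, x2, y2) format."""
    # intersection and union -> IoU
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared centre distance rho^2 and squared diagonal c^2 of the smallest enclosing box
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency v and balancing coefficient alpha
    v = (4 / math.pi ** 2) * (
        torch.atan((target[:, 2] - target[:, 0]) / (target[:, 3] - target[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```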

Model training was conducted using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 16. The learning rate was progressively adapted over 300 training epochs using a cosine annealing schedule, which gradually reduces the learning rate to improve convergence during later stages of training. To prevent overfitting, early stopping was employed based on validation loss, and model checkpointing was used to preserve the best-performing weights throughout the training process. All experiments and model implementations were conducted using the PyTorch deep learning framework.
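The described configuration maps onto standard PyTorch utilities as sketched below; `train_one_epoch`, `validate`, and the early-stopping patience are placeholders (the patience value in particular is an assumption).

```python
import torch

def build_training_setup(model, epochs=300, lr=1e-3):
    """Adam optimizer with cosine annealing over the full training run, as described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

def train(model, train_one_epoch, validate, epochs=300, patience=20):
    """Epoch loop with early stopping on validation loss and checkpointing of the best weights.
    `train_one_epoch` and `validate` are placeholder callables wrapping the data loaders."""
    optimizer, scheduler = build_training_setup(model, epochs)
    best_val, best_state, stale = float("inf"), None, 0
    for epoch in range(epochs):
        train_one_epoch(model, optimizer)                 # one pass over the training set (batch size 16)
        val_loss = validate(model)                        # composite loss on the validation split
        scheduler.step()
        if val_loss < best_val:                           # checkpoint the best-performing weights
            best_val, stale = val_loss, 0
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:                         # early stopping
                break
    model.load_state_dict(best_state)
    return model
```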

2.2 Non-maximum suppression (NMS) in pill detection

In computer vision applications such as pharmaceutical pill detection using the YOLOv5s algorithm, Non-Maximum Suppression (NMS) serves as a crucial post-processing step to refine the set of predicted bounding boxes. During inference, the model may generate multiple overlapping predictions for a single object—particularly in cases where pills share similar visual features such as color, shape, or size. Without proper filtering, these redundant detections can significantly impair the accuracy and interpretability of the system.

NMS addresses this issue by systematically eliminating lower-confidence predictions that overlap significantly with higher-confidence detections. The procedure begins by sorting all predicted bounding boxes in descending order based on their associated confidence scores Ci​. The bounding box with the highest confidence score, denoted as Bmax​, is selected as the reference. Subsequently, the Intersection over Union (IoU) is computed between Bmax​ and each of the remaining boxes Bj to quantify their spatial overlap. The IoU is defined as:

$$\text{IoU}(B_{\max}, B_{j}) = \frac{\text{Area}(B_{\max} \cap B_{j})}{\text{Area}(B_{\max} \cup B_{j})}$$
(7)

Bounding boxes with an IoU exceeding a predefined threshold (typically 0.5) are suppressed, as they are considered redundant detections of the same object. This process is iteratively repeated for the next highest-ranked bounding box in the sorted list until all boxes have been either retained or discarded. The result is a non-overlapping set of high-confidence bounding boxes, significantly enhancing the precision of the detection system and reducing false positives in densely populated pill scenes.
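A minimal, framework-free sketch of this confidence-based NMS procedure, using the 0.5 IoU threshold mentioned above:

```python
def confidence_nms(boxes, scores, iou_thresh=0.5):
    """Greedy confidence-based NMS: boxes are (x1, y1, x2, y2) tuples, scores are the confidences C_i."""
    def iou(a, b):                                          # Eq. (7)
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / (union + 1e-9)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                                 # B_max: highest remaining confidence
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep                                             # indices of retained detections
```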

2.3 Feature extraction

In this section, we describe the feature extraction procedure, which combines a multitask learning strategy with ResNet. Our model exploits residual learning to improve training accuracy. In particular, we use a 32-layer deep residual network, ResNet-32 [28], to extract visual characteristics from pill images, including shape, colour, and form.

The model was trained to recognise two forms (f), sixteen colours (c), and eleven different shapes (s).

ResNet-32 creates a feature vector of hidden dimension size h that describes the appearance of a pill given an input image of the pill. This vector is then used to derive task-specific feature outputs by multiplying it with three weight matrices:

  • w[s] ∈ R^(h×s) for shapes,

  • w[c] ∈ R^(h×c) for colors,

  • w[f] ∈ R^(h×f) for forms.

Each output is passed through a softmax function to produce predictions. For a given pill i, the shape prediction is computed as:

$$z[i,s] = \operatorname{softmax}\left(w[s] \cdot p[i]\right)$$
(8)

where p[i] is the feature vector produced by ResNet-32 for pill i. Similarly, the color prediction is:

$$z[i,c] = \operatorname{softmax}\left(w[c] \cdot p[i]\right)$$
(9)

Although pills may exhibit multiple colors (e.g., capsules), we simplify the task by selecting a single representative color per pill and applying a cross-entropy loss. Each of the three tasks (shape, color, and form) has its own cross-entropy loss function. These individual losses are summed to form the overall loss, which is minimized during training. To enhance classification performance across all features, the optimisation process [29, 30] is guided by this total loss, which updates both the ResNet-32 weights and the task-specific matrices. Lastly, we separate the pill types into capsules and tablets; the form prediction for pill i is:

$$z[i,f] = \operatorname{softmax}\left(w[f] \cdot p[i]\right)$$
(10)

Each feature’s loss functions are as follows:

$$L_{k} = -\sum_{j=1}^{N} I_{i,j}^{k} \log\left(z_{i,j}^{k}\right)$$
(11)

In this case, k stands for one of the feature types: form (f), colour (c), or shape (s). The number of available categories for each feature type is indicated by N, which is 2 for forms, 11 for shapes, and 16 for colours. I[i, j, k] ∈ {0, 1} provides the ground truth label, indicating whether pill i falls within category j for feature type k.

The overall loss for the feature extraction module is the sum of the individual cross-entropy losses over all classes and samples. To balance their impact, each loss is weighted by a corresponding hyperparameter [31]. We empirically set all weights to 1 in our implementation, treating the shape, colour, and form categorisation tasks of the multitask learning objective equally.
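A minimal sketch of the three task-specific heads and the summed cross-entropy objective (Eqs. 8-11) is shown below; the ResNet-32 backbone is abstracted as any feature extractor producing an h-dimensional vector, and softmax is folded into the cross-entropy loss.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskPillHead(nn.Module):
    """Shape/colour/form projections on top of a shared feature vector p[i] (Eqs. 8-10)."""
    def __init__(self, hidden_dim, n_shapes=11, n_colors=16, n_forms=2):
        super().__init__()
        self.w_s = nn.Linear(hidden_dim, n_shapes, bias=False)   # w[s]
        self.w_c = nn.Linear(hidden_dim, n_colors, bias=False)   # w[c]
        self.w_f = nn.Linear(hidden_dim, n_forms, bias=False)    # w[f]

    def forward(self, p):
        # p: (batch, hidden_dim) feature vectors from the ResNet-32 backbone
        return self.w_s(p), self.w_c(p), self.w_f(p)             # logits; softmax folded into the loss

def multitask_loss(logits, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task cross-entropy losses (Eq. 11); all weights set to 1 as in the text."""
    (z_s, z_c, z_f), (y_s, y_c, y_f) = logits, targets
    return (weights[0] * F.cross_entropy(z_s, y_s)
            + weights[1] * F.cross_entropy(z_c, y_c)
            + weights[2] * F.cross_entropy(z_f, y_f))
```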

2.4 Pill recognition

2.4.1 Imprint detection

As shown in the lower portion of Fig. 1, the proposed model extracts imprinted text information from pills in two steps: first, it identifies candidate text regions, and then it uses the retrained DTS model [1] to recognise the letters or symbols within those regions. The model uses the modified bilinear sampling technique from [1] to accommodate different text region shapes and orientations. This method reshapes each identified region into a canonical tensor with uniform dimensions. Specifically, a detected text region

$$r \in \mathbb{R}^{w \times h \times C}$$

is normalized into a tensor of fixed height,

$$r_{n} \in \mathbb{R}^{(wH'/h) \times H' \times C},$$

computed as:

$$r_{n}(x', y') = \sum_{x=1}^{w}\sum_{y=1}^{h} r(x, y)\,\max\left(0,\, 1 - \left|x - \tau_{x}(x')\right|\right)\,\max\left(0,\, 1 - \left|y - \tau_{y}(y')\right|\right)$$
(12)

In the proposed scheme, the fixed height H′ used for normalisation was set to 32, and τ_x, τ_y denote point-wise coordinate translations. The text recognition component was modified to operate on a texture image, obtained by applying a low-pass filter to the RGB input image.

The model uses Connectionist Temporal Classification (CTC) to transform each normalised text region rn into a conditional probability distribution [10]. This enables the model to choose the most likely symbol sequence that matches the text that is imprinted.

After being trained on the alphabet A, the text recogniser of the DTS model generates a prediction matrix Mt of size w̄/4 × |A|, where w̄ = wH′/h is the normalized width of the input and |A| is the size of the alphabet.

Each row i of the matrix Mt contains a vector

$$v^{i} = \left(v_{1}^{i}, \ldots, v_{j}^{i}, \ldots, v_{|A|}^{i}\right),$$

where each v_j^i represents the likelihood of the jth symbol of the alphabet at position i. The probability of a label sequence s within the detected region rn is given by:

$$P(s \mid v) = \prod_{i=1}^{\bar{w}/4} v_{s_{i}}^{i}, \qquad s \in A^{\bar{w}/4}$$
(13)

To eliminate blanks and repeated labels, a many-to-one mapping $M_{A} : A^{\bar{w}/4} \rightarrow A^{\le \bar{w}/4}$ is applied, producing the final sequence s_f. The objective function used to train the text recognition network is then:

$$P(s_{f} \mid v) = \sum_{s:\, M_{A}(s) = s_{f}} P(s \mid v), \qquad s_{f} \in A^{\le \bar{w}/4}$$
(14)
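This objective corresponds to the standard CTC formulation; a hedged sketch of applying it to the recognizer's prediction matrix with PyTorch's built-in loss is shown below (tensor shapes follow torch.nn.CTCLoss conventions, the blank index is assumed to be 0, and all sizes are illustrative).

```python
import torch
import torch.nn as nn

# Prediction matrix M_t for a batch: (T, N, |A|) log-probabilities, with T = w_bar / 4 time steps.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_objective(log_probs, targets, input_lengths, target_lengths):
    """log_probs: (T, N, |A|) from log_softmax over M_t; targets: concatenated label indices.
    nn.CTCLoss internally sums P(s|v) over all alignments s with M_A(s) = s_f, as in Eq. (14)."""
    return ctc(log_probs, targets, input_lengths, target_lengths)

# Illustrative shapes only: 16 time steps, a batch of 2 regions, a 37-symbol alphabet (incl. blank).
log_probs = torch.randn(16, 2, 37).log_softmax(dim=-1)
targets = torch.tensor([3, 7, 12, 5, 9])                     # two target sequences, concatenated
loss = ctc_objective(log_probs, targets,
                     input_lengths=torch.tensor([16, 16]),
                     target_lengths=torch.tensor([3, 2]))
```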

After retraining the network ftext(⋅), the following steps are performed:

  1. Generate the texture map It from the RGB input IRGB.

  2. Use a text detection network to identify all possible text regions.

  3. Select the top two ranked text regions and concatenate them to form one normalized tensor rn.

  4. Feed rn into the recognition network to obtain the matrix Mt = ftext(rn) ∈ R^(w̄/4 × |A|).

To obtain the final text probability vector vt ∈ R^(1 × |A|), the model averages across the width (position) dimension of Mt:

$$v_{t,j} = \frac{\sum_{i=1}^{\bar{w}/4} v_{j}^{i}}{\bar{w}/4}$$
(15)

This vector vt​ captures the average likelihood of each symbol in the alphabet appearing in the detected pill text regions. The entire process of transforming It​ to vt is denoted as:

$$f_{\text{text}}^{\text{avg}}(M_{t}) = v_{t}$$
(16)

Once retrained, the DTS model is fixed and subsequently integrated into a fusion network, where it is combined with other feature streams. Intuitively, vt provides meaningful cues about the presence of imprinted symbols on pills and serves to complement the outputs from the other three visual streams in the final classification stage.
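A small sketch of the averaging step in Eqs. (15)-(16), assuming the recognizer output Mt is available as a (w̄/4, |A|) probability matrix (sizes illustrative):

```python
import torch

def average_text_probabilities(m_t):
    """Average the recognizer output M_t over its width (position) dimension to obtain v_t (Eq. 15).
    m_t: tensor of shape (w_bar // 4, |A|) holding per-position symbol probabilities."""
    return m_t.mean(dim=0, keepdim=True)          # v_t: shape (1, |A|)

# Example with illustrative sizes: 8 positions over a 37-symbol alphabet.
m_t = torch.softmax(torch.randn(8, 37), dim=-1)
v_t = average_text_probabilities(m_t)
```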

2.4.2 Imprint correction

To improve pill identification, we incorporated imprinted characters and pill characteristics into the model by treating them as a sequence and context, respectively. The model was trained to rectify character sequences to more accurately represent the desired imprints found on pills. The imprinted text frequently contains crucial information such as the active ingredients, their dosages, and the manufacturer’s name, rendering it highly pertinent for precise identification. Given the common occurrence of numerical sequences in pill imprints—many of which are structurally and visually alike—the imprint calibration module can be effectively trained. This module receives character sequences along with their 2D coordinates, which are extracted by the imprint detection module. We propose a character-level RNN enhanced with Gated Recurrent Units (GRUs) [31–35, 42] to function as a language model for correcting imprinted characters. Furthermore, we introduce a novel coordinate encoding technique specifically designed for pill data rather than general text corpora. Each character’s 2D position on the pill is encoded as part of its input representation, allowing the model to better comprehend spatial relationships between characters. This character correction module not only enhances the accuracy of individual characters but also improves their order, utilizing the spatial information derived from the coordinates. To accomplish this, we implement an attention-based, bidirectional sequence-to-sequence model [35] with a many-to-many RNN architecture, inspired by techniques in machine translation. For a character t with spatial coordinates (xt,yt), the input vector is created by concatenating its 2D coordinates with its one-hot encoding. The encoder processes this input sequence to produce a context representation, which the decoder subsequently uses to output one character at a time. The decoder depends on attention weights, which are calculated based on the current decoder state and all encoder hidden states. The model optimizes the following conditional log-likelihood:

$$\log p(\text{tgt} \mid \text{src}) = \sum_{j=1}^{m} \log p\left(t_{j} \mid t_{<j}, \text{src}\right)$$
(17)

where tgt = {t₁, …, t_m} denotes the target imprint character sequence, src denotes the input source sequence (imprint characters with spatial coordinates), and m is the length of the imprint. The conditional probability at each decoding step depends on the previously generated characters t_{<j}, the encoded source features, and the attention mechanism.

2.4.2.1 Coordinate-encoded character-level RNN

To refine OCR outputs, we extend a character-level recurrent neural network (RNN) by integrating explicit spatial encoding of detected character regions. Specifically, for each candidate character identified by the Deep Text Spotter, we extract its 2D centroid coordinates (x, y), normalized by image width and height to ensure scale invariance. These spatial features are concatenated with the one-hot (or embedded) character predictions before being fed into the RNN sequence model. In this way, the RNN learns to jointly model both character sequence order and spatial arrangement, allowing it to correct misrecognitions such as misplaced, rotated, or overlapping characters. Unlike standard CRNNs, which assume linear text flow, the proposed coordinate encoding explicitly incorporates positional context, which is crucial for pill imprints that often follow non-linear layouts (e.g., arcs, circular arrangements, or multi-line engravings). This lightweight encoding strategy offers a task-specific alternative to more computationally intensive attention-based spatial models used in generic OCR, providing robustness in scenarios with partial occlusion, rotation, or irregular spacing.
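A minimal sketch of this coordinate encoding, concatenating normalized centroid coordinates with character embeddings before a bidirectional GRU encoder; layer sizes and the interface are illustrative, not the exact network used in our experiments.

```python
import torch
import torch.nn as nn

class CoordinateEncodedEncoder(nn.Module):
    """Bidirectional GRU encoder whose inputs concatenate a character embedding with the
    normalized (x, y) centroid of each detected character (sizes are illustrative)."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim + 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids, coords, image_size):
        # char_ids: (batch, seq_len) OCR character indices
        # coords:   (batch, seq_len, 2) pixel centroids as floats; image_size: (width, height)
        w, h = image_size
        norm_xy = coords / coords.new_tensor([float(w), float(h)])   # scale-invariant positions
        x = torch.cat([self.embed(char_ids), norm_xy], dim=-1)
        outputs, hidden = self.gru(x)             # encoder states consumed by the attention decoder
        return outputs, hidden
```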

3 Results and discussions

In this study, we evaluated the performance of the proposed pill identification framework, based on the YOLOv5s object detection model, on a publicly available dataset and on a real-time pill image dataset collected in real-world settings; together these contain thousands of annotated pill images with variations in color, shape, imprint, and orientation. The model was trained to detect pills, accurately localize their bounding boxes, and classify them into the corresponding categories.

3.1 Detection performance

The performance of the YOLOv5s model was measured using standard object detection metrics, including precision, recall, and mean average precision at an IoU threshold of 0.5 (mAP@0.5).

Table 2 Performance metric of YOLOv5s

The results show that the YOLOv5s model achieves high precision and recall, indicating that the model is both effective in detecting pills and efficient in reducing false positives and false negatives. The mAP values reported in Table 2 summarize detection performance at the chosen IoU threshold and confirm the robustness of the model under varied detection conditions.

To provide a more comprehensive evaluation, we also report mAP@[0.5:0.95] = 87.1%, which accounts for performance across varying IoU thresholds. Confidence and NMS thresholds (0.45 and 0.50, respectively) are explicitly specified to ensure reproducibility. These results confirm the robustness of YOLOv5s in handling cluttered pill environments and small-object scenarios.

3.2 Imprint recognition and correction

Following pill detection, text recognition and imprint correction were applied using a retrained DTS model and RNN with coordinate encoding. The imprint recognition module achieved an average character-level accuracy of 89.7%, while the imprint correction module improved this to 95.1% by correcting common OCR errors and misordered characters.

This correction was particularly effective for pills with complex imprints or multiple text regions, where the coordinate-aware sequence model outperformed baseline character recognition systems. The attention mechanism helped re-order characters based on their spatial arrangement, emulating human-like interpretation.

Multitask feature recognition

The model also incorporated a multitask ResNet-32 classifier trained to predict pill shape (11 classes), color (16 classes), and form (2 classes). The classification accuracies were:

  • Shape accuracy: 96.3%

  • Color accuracy: 94.1%

  • Form accuracy: 98.7%

The high accuracy across all three attributes demonstrates the effectiveness of the multitask learning framework and the contribution of residual connections in extracting discriminative features from pill images.

To assess performance more comprehensively, multiple evaluation metrics were considered as indicated in Table 3. The Normalized Edit Distance (NED) remained low at 0.07, indicating minimal character-level errors. Importantly, the system exhibited resilience to challenging conditions, maintaining accuracy under partial occlusion (91.5%) and rotation (92.3%). These results highlight the effectiveness of the combined DTS + RNN approach in robust imprint interpretation under real-world imaging conditions.

Table 3 OCR and imprint correction results across multiple evaluation metrics

3.3 Ablation study

To systematically assess the impact of individual components within the proposed framework, an ablation study was performed. Four model configurations were evaluated by progressively integrating key modules, including imprint recognition, visual feature extraction, and spatial correction mechanisms. The results, summarized in Table 4, demonstrate the incremental performance gains achieved through component integration.

Table 4 Ablation study

As shown in Table 4, utilizing only imprint recognition results in moderate classification performance (88.4%), while incorporating visual features alone provides a notable improvement (91.5%). The joint use of both imprint and visual cues further boosts accuracy to 93.2%, underscoring the benefits of multimodal information fusion. Finally, the inclusion of the imprint correction module, which employs a character-level RNN with coordinate encoding, yields the highest classification accuracy of 97.8%. These findings confirm that the integration of visual, textual, and geometric representations significantly enhances the model’s robustness, particularly in the presence of real-world imaging challenges such as occlusion, distortion, and imprint ambiguity.

Table 5 Summary of experimental settings for training and evaluation

Table 5 summarizes the standardized experimental configuration applied across all model variants. The same training and evaluation protocol was maintained for both the proposed framework and baseline models to ensure fairness and reproducibility. For ablation studies, identical configurations were used.

The final pill classification step integrates detection and OCR outputs with shape and color cues. As reported in Table 6, Top-1 accuracy reached 92.4% (micro-averaged) and 90.7% (macro-averaged), the latter accounting for dataset imbalance. Top-5 accuracy was 97.8%, reflecting a high likelihood of the true label being included in the top candidates. Even when OCR errors occurred, classification accuracy remained robust at 89.1%, demonstrating the complementary role of multimodal features in the decision process.

Table 6 Multitask attribute prediction results for pill classification
Table 7 Computational cost and efficiency of ablation variants

As shown in Table 7, the ablation study highlights not only the incremental accuracy benefits of each module but also the corresponding computational costs. While the addition of imprint recognition and visual feature extraction moderately increases parameter count and inference latency, these components provide substantial gains in recognition accuracy. The coordinate-encoded RNN correction module, despite introducing the largest computational overhead (≈ 2 M additional parameters and ~ 2.7 ms/image increase), delivers the most significant accuracy boost, raising performance from 93.2% to 97.8%. This trade-off demonstrates that the proposed correction mechanism is computationally lightweight relative to its contribution in robustness, particularly in scenarios involving occlusion or rotated imprints. Moreover, the overall inference time of the full pipeline (< 20 ms per image) remains suitable for real-time or near real-time applications in clinical and consumer settings.

3.4 Discussion

The proposed framework demonstrates strong potential for real-world applications in pharmaceutical verification, counterfeit pill detection, and clinical decision support systems. In contrast to conventional Optical Character Recognition (OCR)-based models or unimodal classifiers, the integration of deep learning with domain-specific optimizations—notably, coordinate-aware character correction and multitask feature learning—enables the proposed system to outperform existing state-of-the-art approaches in both accuracy and efficiency.

Despite its robust performance, the model continues to face challenges when dealing with low-resolution images, blurred or partially occluded imprints, and complex backgrounds. These limitations present opportunities for future research, including the application of self-supervised pretraining, domain adaptation techniques, and real-time image enhancement modules to improve generalizability across diverse imaging scenarios.

3.4.1 Comparative evaluation with state-of-the-art approaches

Table 8 compares the proposed model with state-of-the-art pill recognition approaches. Results for the proposed model and the CNN + OCR baseline were reproduced and evaluated on the NLM Pill Image Database under the same experimental settings. Results for DeepPill [14], PillRecNet [13], and OCR-RCNN [12] are reported as published in their original works, since the corresponding datasets were not fully available for reproduction. The comparison covers architectural components, datasets used, classification accuracy, inference latency, and distinguishing features.

Table 8 Comparative evaluation of State-of-the-art algorithms for pill identification using the NLM pill dataset

The YOLOv5s-based model performs favourably within the scope of the tested datasets (e.g., the NLM Pill Image Database and our real-time dataset) in terms of both accuracy and inference time, indicating its suitability for real-time deployment in environments such as pharmacies, clinical kiosks, and mobile health applications. In contrast to traditional OCR pipelines, which often suffer from spatial misalignment and imprint ambiguity, the proposed model leverages a coordinate-aware RNN to correct character sequences and reduce recognition errors. Furthermore, the multitask learning design enables simultaneous extraction of visual features such as pill shape, color, and contour, reducing over-reliance on imprint recognition alone.

3.4.2 Evaluation on real-world and benchmark datasets

To assess the model’s generalizability, performance was evaluated on two datasets: the NLM Pill Image Dataset and a Real-Time Captured Pill Dataset, the latter of which contains images acquired under uncontrolled conditions simulating actual usage scenarios. The results are summarized in Table 9.

Table 9 Performance metric of the YOLOv5s model on the NLM pill image dataset and a real-time captured pill dataset

These metrics confirm that the model achieves a high degree of sensitivity and specificity, both of which are critical for healthcare applications where false negatives may lead to missed detections and false positives may result in incorrect dispensing. The F1 score indicates balanced performance across precision and recall, while the AUC-ROC and AUC-PR values demonstrate strong discriminative capability. Furthermore, the model’s high mAP@0.5 value indicates effective object localization performance.

Importantly, the slight performance drop on real-time images—while expected due to lighting, occlusion, and device variability—can be further mitigated through domain-specific augmentations and transfer learning strategies. These results underscore the model’s practicality for deployment in diverse healthcare settings, and highlight avenues for further optimization.

To comprehensively assess the performance and generalization capability of the proposed YOLOv5s-based pill detection framework, we evaluated it across five distinct datasets that vary in quality, imaging conditions, and real-world complexity. The datasets include both standardized benchmarks and unstructured real-world data, thereby facilitating a robust evaluation of the model’s applicability across clinical and non-clinical environments. The detailed results are summarized in Table 10.

Table 10 YOLOv5s model performance for different datasets

The model demonstrates excellent accuracy and precision on clean, curated datasets such as the NLM Pill Dataset and the RxIMAGE Dataset, highlighting its ability to perform reliable object localization and classification under standardized conditions. These results establish a strong baseline and validate the model’s architectural efficacy in controlled scenarios.

However, as the complexity of the data increases—such as in the Real-Time Pill Dataset and the Custom Clinical Dataset—a moderate decline in performance is observed. This drop is attributable to challenges such as background clutter, pill overlap, motion blur, and lighting inconsistencies, all of which are commonly encountered in practical deployment environments. Despite these challenges, the YOLOv5s model maintains a high level of performance, with accuracy remaining above 92%, indicating its robustness and suitability for real-world applications such as pharmacy automation, clinical kiosks, and mobile health platforms.

The model’s ability to maintain high precision-recall balance, as reflected in F1 scores and mean Average Precision (mAP), confirms its resilience to occlusions, class imbalance, and visual variation across datasets. These findings suggest that with minimal image pre-processing—such as background filtering, normalization, or contrast enhancement—the system can be reliably integrated into healthcare workflows where accuracy and inference speed are paramount.

Figure 1 illustrates a comparative bar chart visualizing key performance metrics (accuracy, precision, recall, and F1 score) across all evaluated datasets, highlighting the consistency and scalability of the proposed system.

Figure 2 illustrates the evolution of model accuracy across 50 training epochs for the YOLOv5s model on the NLM Pill Dataset. Both training and validation accuracy exhibit consistent improvements over successive epochs, with performance plateauing near 98% toward the final epochs. This trend reflects the model’s effective learning dynamics and aligns with the high accuracy metrics observed during quantitative evaluation, thereby confirming the robustness of the training strategy and architectural design.

Fig. 1 YOLOv5s performance metrics across pill image datasets

Fig. 2 YOLOv5s model accuracy on the NLM Pill dataset

The left panel Fig. 3 illustrates the evolution of training and validation accuracy alongside precision across 50 epochs, demonstrating a consistent upward trend indicative of effective learning and generalization. The right panel displays the corresponding training and validation loss curves, which exhibit a steady decline throughout the training process. Together, these plots reflect the model’s training stability, convergence behavior, and capability to accurately capture both spatial and semantic features of pharmaceutical pills under controlled conditions.

Fig. 3 Dual-panel visualization of the YOLOv5s model’s training performance on the NLM Pill Dataset

Figure 4 illustrates representative input images employed to evaluate the performance of the proposed YOLOv5s-based object detection framework for pill detection and classification. These images were captured under unconstrained, real-world conditions using a standard smartphone camera. Pills exhibiting diverse morphological characteristics—including variations in shape, size, color, and imprint—were placed against a plain white paper background and photographed from multiple angles to emulate scenarios commonly encountered in domestic, pharmacy, or clinical environments.

The dataset presents several challenges typical of realistic deployment settings, including inconsistent lighting, shadow artifacts, partial occlusion, and overlapping pills. Moreover, the visual similarity among pills—particularly those sharing similar colors or shapes—further increases the complexity of both localization and classification tasks. These challenges necessitate the use of a high-performance object detection model capable of fine-grained visual discrimination, a requirement well addressed by the YOLOv5s architecture owing to its real-time, single-stage detection pipeline.

To validate the practical applicability of the proposed system, a Graphical User Interface (GUI)-based application was developed. This application integrates advanced image processing techniques with deep learning-based recognition to support pill identification from both stored images and real-time camera feeds. Upon image acquisition, the system executes segmentation, feature extraction, and classification in a streamlined and automated pipeline. The application achieves high accuracy and low latency in varied real-world scenarios, demonstrating its potential as a viable solution for clinical decision support, pharmaceutical verification, and counterfeit detection.

Overall, this implementation represents a meaningful advancement toward the integration of AI-powered visual recognition systems into healthcare workflows, offering a scalable, responsive, and user-friendly solution for medication identification across diverse environments.

Fig. 4 Real-world pill images of Supradyn, Cipcal, Minolast, Doxcef, and Paracetamol samples used for pill detection and classification

Figure 5 demonstrates the effectiveness of the proposed YOLOv5s-based model in detecting and classifying pharmaceutical pills within a realistic setup. The left panel depicts the unprocessed input image, showcasing a variety of pills differing in shape, color, size, and imprint, arranged against a neutral white background. The image was captured under natural lighting conditions, simulating a typical scenario encountered in domestic or clinical medication management.

The right panel presents the model’s inference output. Detected pills are enclosed within bounding boxes, each labeled with the corresponding pill name and associated confidence score. Notably, the model accurately recognizes and classifies several pills including Doxcef (confidence: 0.86), Supradyn (0.84), Minolast (0.90), Cipcal (0.90), and Paracetamol (0.82), underscoring its capability to distinguish between visually similar objects in complex scenes.

These results validate the robustness and practical applicability of the YOLOv5s model in multi-class pill detection tasks. The system’s ability to deliver high-confidence predictions in uncontrolled settings supports its integration into healthcare automation systems, particularly for use cases such as automated pill verification, medication adherence monitoring, and counterfeit detection. The demonstrated accuracy reinforces the model’s potential to enhance safety and efficiency in digital health solutions.

Figure 6 illustrates the model’s performance in a real-time pill classification scenario, where input images are captured dynamically under uncontrolled environmental conditions. This visualization highlights the model’s ability to accurately detect and classify pills in situ, even when faced with variable lighting, occlusions, and background noise—conditions commonly encountered in daily medication management practices.

3.4.3 Evaluation on overlapping pill scenarios

In addition to benchmark and real-world datasets, we further evaluated the proposed YOLOv5s-based framework on a dedicated subset of images containing partially overlapping and closely spaced pills, as such scenarios are frequently encountered in real-world medication settings.

To address challenges related to duplicate or ambiguous detections, a confidence-based Non-Maximum Suppression (NMS) strategy was employed. This mechanism prioritizes bounding boxes with higher confidence scores while suppressing redundant or low-confidence predictions.

The results, summarized in Table 11, demonstrate that the proposed system maintains strong performance under overlapping conditions, achieving an accuracy of 94.6%, precision of 92.8%, and recall of 91.7%, with an F1 score of 92.2%. While there is a modest performance drop compared to well-separated pill scenarios, the results confirm the robustness of the confidence-guided NMS in suppressing false positives and retaining valid detections.

Table 11 YOLOv5s performance on overlapping pill subset

The successful classification of multiple pills in real time underscores the practical utility of the YOLOv5s framework for healthcare-oriented applications, including but not limited to smart pill organizers, mobile health assistants, and automated medication compliance monitoring systems. The model’s fast inference and high classification accuracy make it well-suited for deployment in resource-constrained or edge computing environments, facilitating real-time decision-making and enhancing patient safety through reliable pill identification.

Fig. 5 Illustration of YOLOv5s performance on stored pill images captured under real-world conditions

Figure 7 presents the Precision–Confidence Threshold curve for the detection and classification of Ibuprofen using the YOLOv5s model. The results were evaluated on both the standardized NLM Pill Dataset and a real-time image dataset. As shown, the model consistently achieves high precision on the NLM dataset across varying thresholds, reflecting its robustness under controlled imaging conditions. In contrast, a noticeable decline in precision is observed on the real-time dataset, particularly at lower confidence thresholds. This reduction can be attributed to challenges inherent in real-world scenarios, including lighting variability, background clutter, partial occlusion, and imprint degradation.

These findings underscore the importance of dataset domain alignment and highlight the need for further improvements such as domain adaptation, enhanced pre-processing, or self-supervised learning to maintain precision in unconstrained environments. The curve serves as a valuable diagnostic tool for fine-tuning confidence thresholds to optimize precision in different deployment contexts.

Fig. 6 Real-time pill classification using the proposed YOLOv5s-based detection framework

Figure 8 illustrates precision versus confidence threshold for multiple pill types using the YOLOv5s model on the NLM Pill Dataset. The chart shows how precision varies across confidence thresholds for five representative pills: Doxcef, Supradyn, Minolast, Cipcal, and Paracetamol. As expected, precision generally increases at higher confidence thresholds, reflecting the trade-off between detection certainty and inclusiveness. This analysis is valuable for optimizing decision thresholds in real-time pill identification systems, especially in clinical or mobile environments.

Fig. 7 Precision vs. confidence threshold for Ibuprofen

Fig. 8 Precision vs. confidence threshold for multiple pill types

4 Conclusion

This study presented a robust and computationally efficient framework for pill detection and classification, leveraging the YOLOv5s object detection model in conjunction with a custom-curated dataset of pharmaceutical tablets. The proposed system effectively identifies pills in both static images and real-time camera feeds by extracting visual attributes such as shape, color, size, and imprints. The integration of this model within a graphical user interface (GUI) further demonstrates its practical utility for deployment in clinical, pharmaceutical, and mobile health applications. The model achieved high detection accuracy and real-time inference speeds, underscoring its suitability for applications in medication verification, adherence monitoring, and automated pill dispensing. However, to further enhance generalizability, future work will focus on diversifying the dataset to encompass a broader range of pill types, including capsules, generics, and region-specific formulations. Additional improvements may include the incorporation of multimodal learning, integrating visual features with imprint recognition, metadata, or even spectral and chemical information to increase robustness in challenging conditions.

Moreover, optimization for deployment on edge and mobile devices using model compression techniques, such as pruning, quantization, or migrating to lightweight architectures like YOLOv8 or EfficientDet, represents a promising avenue for scalability. Clinical validation through user studies involving healthcare professionals will be essential to assess usability and effectiveness in real-world scenarios. Ensuring regulatory compliance and interoperability with electronic health record (EHR) systems will also be prioritized to facilitate seamless integration into existing healthcare workflows. Within the scope of the tested datasets (e.g., the NLM Pill Image Database and our real-time dataset), the proposed approach advances automated pill recognition and offers a reliable, scalable solution with the potential to enhance medication safety and operational efficiency across various healthcare environments.