A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information


Training the topic model with Afaan Oromo health documents

The aim of topic modeling is to mine the main topics of unstructured documents17. In our methodology, we used LDA, a generative probabilistic model and a popular topic modeling technique for discrete data sets such as text corpora23,30,41. The LDA model receives the pre-processed texts as input and then groups the words in the text documents according to the most similar meanings42. The words in each topic are associated with a conditional probability representing a confidence degree, that is, how relevant each word or term is to that cluster's topic. The resulting groups of words represent the different topics in the document. The main difficulty of using the LDA model is that it does not label the extracted topics, as mentioned in35,43. To solve this issue, we manually label these groups of words as themes. In our work, we limit the output to the top 10 words and 10 topics. When domain experts act as human judges to interpret the topic model results, the most effective technique is to carefully analyze the underlying keywords of the model output. In this way, the issue of unlabeled extracted topics is resolved. After labeling the acquired clusters of words, we extract the most representative keywords of each topic and each document. We initially extract a topic vs. document matrix, in which every term is associated with the probability of belonging to a specific topic.
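As a minimal sketch of this input preparation, assuming the pre-processed Afaan Oromo documents are already available as lists of tokens (the variable names and example tokens below are illustrative, not taken from the corpus):

    from gensim.corpora import Dictionary

    # Illustrative pre-processed corpus: each document is a list of Afaan Oromo tokens.
    preprocessed_docs = [
        ["gurra", "keessa", "waraana", "ciniina"],
        ["ulfaa", "hordoffi", "laguu", "dhiigatu"],
        # ... the full AOHD corpus in practice
    ]

    # Map each token to an integer id and convert every document into a
    # bag-of-words vector, the discrete representation LDA expects.
    dictionary = Dictionary(preprocessed_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]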

The following summarizes the steps carried out during LDA model training:

  1. The Gensim multicore LDA model is used to implement topic detection and to compute the topic-keyword and topic-document distributions.

  2. The best number of topics found by running the LDA model on our corpus is 10, and for each topic we used 10 keywords.

  3. The final step assigns topics to each document by computing, for each sentence, the probability distribution over the generated topics.

Finally, we obtain the topics of the documents and, for each topic, assign a label along with the most illustrative words associated with those documents.
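The training steps above can be sketched with Gensim's multicore LDA implementation; the 10 topics and 10 keywords follow the paper, while passes, workers, and random_state are illustrative settings:

    from gensim.models import LdaMulticore

    # Steps 1-2: train the multicore LDA model with 10 topics and inspect the
    # 10 most probable keywords of each topic (topic-keyword distribution).
    lda = LdaMulticore(
        corpus=bow_corpus,
        id2word=dictionary,
        num_topics=10,
        passes=10,        # illustrative
        workers=3,        # illustrative
        random_state=42,  # illustrative
    )

    for topic_id, keywords in lda.show_topics(num_topics=10, num_words=10, formatted=False):
        print(topic_id, [(word, round(prob, 4)) for word, prob in keywords])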

A. Dominant topic extraction and output visualization

Table 2 reports some topics extracted from the electronic health documents. As observed in this experiment, every topic consists of 10 keywords associated with the topic, and each extracted word has a corresponding probability ratio. We used the 10 most probable keywords for each topic. A few words may be meaningless for their topics, but most are relevant to the associated topics. Some topics are slightly mixed and their meanings can diverge, but most of the extracted topics make sense and can be clustered as categories in the document. To achieve this optimal topic and keyword selection, we relied on several model evaluation metrics, namely coherence values for assessing topic interpretability and perplexity scores for measuring the model's ability to predict the data. These metrics were evaluated iteratively by varying the number of topics and observing the impact on model performance. The selection procedure was guided by both quantitative metrics and qualitative assessment by the domain experts to ensure transparency and relevance. Each keyword is listed in descending order of its probability, and the meaning of each keyword is provided in English in the context of medical concepts.

Table 2 Extracted topics from the AOHD dataset when the LDA model is applied.

As stated in the training topic modeling section, every text document in the corpus is assigned to one theme class based on the highest probability contribution to the dominant topic. After training the LDA models, we present the distribution of topics in the document for each sentence, together with the weights of the topics and keywords, in a neatly formatted output; sample results are shown in Table 3. Figure 4 illustrates the document topic distribution weights for all documents in our corpus.
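A sketch of this per-document output, assuming the lda model and bow_corpus from the previous snippets, might look as follows:

    # Step 3: per-document topic distribution and dominant topic, mirroring the
    # formatted output summarized in Table 3.
    for doc_idx, bow in enumerate(bow_corpus):
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        dominant_id, dominant_w = max(dist, key=lambda t: t[1])
        weights = ", ".join(f"topic {tid}: {w:.3f}" for tid, w in dist)
        print(f"document {doc_idx} -> dominant topic {dominant_id} ({dominant_w:.3f}) | {weights}")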

Table 3 Sample document topic distributions after applying the LDA model to the AOHD dataset.
Fig. 4 Topic distributions of documents in the corpus of the AOHD dataset.

B. Topic labeling

In this experiment, topic modeling produces word distributions that yield meaningful topics according to the word-topic probability distribution for the chosen number of topics. Each word has a probability in each document and a relevance of one topic relative to another. After the LDA model has been generated, a document can be characterized by the distribution of topics describing its collection of terms. The LDA model uses a bag-of-words (BOW) assumption, meaning the order of words in the document is ignored. From our experimental results, the generated topics with their important keywords are listed in Table 2. Each term under each topic or theme is assigned a weight, with the highest-ranking words having greater weights.

Defining the core meaning of these extracted topics is crucial. However, the model does not tag topic labels automatically. The labels are created manually by carefully observing the grouped words in each topic, since LDA only returns groups of keywords known as topics. Despite developments in statistical methods, the interpretability of the model output is not assured because of the difficulty of the language44. Therefore, when merging the word matrix of each topic, artificial labels must be assigned to correctly reflect the internal association and context of the corresponding keywords. To label the extracted topics, we involved three domain-expert annotators. The labeling process was guided by balancing interpretability and coherence values to ensure accurate and meaningful labels. When disagreements arose among the annotators, particularly over ambiguous words such as homonyms or polysemes, we used a majority vote to resolve them. This procedure was intended to maintain consistency and strengthen the interpretability of the extracted topics in the medical context. The extracted topics were labeled on this basis, together with their distribution results, so that we could easily describe each topic. In Topic 0, most keywords, such as “dhagna”, “dhufee”, “gogsa”, “roqomsiisa”, “dugda”, and “qorachiisu”, show that the topic is about nervous-system-related disease; the most common words in Topic 1, “ulfaa”, “hordoffi”, “laguu”, “dhiigatu”, “darbee”, “garmalee”, and “baayyata”, imply that the topic is about gynecology; Topic 2 is mental illness, Topic 3 is eye disease, Topic 4 is ear disease and disorders, Topic 5 is internal disease, Topic 6 is chronic disease, Topic 7 is skin disease, Topic 8 is orthopedic conditions, and Topic 9 is dental-related disease, as presented in Table 4.
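A minimal sketch of how the agreed labels can be attached to the model output, reusing the lda model from the earlier snippets (the mapping follows the labels listed above and in the Fig. 7 note; the helper name is ours):

    # Manual labels agreed by the three annotators for the ten topics (Table 4,
    # using the numbering also given in the note to Fig. 7).
    topic_labels = {
        0: "Nervous disease",
        1: "Gynecology",
        2: "Mental illness",
        3: "Eye disease",
        4: "Ear disease and disorders",
        5: "Internal disease",
        6: "Chronic disease",
        7: "Skin disease",
        8: "Orthopedic conditions",
        9: "Dental disease",
    }

    def label_document(bow):
        """Return the manual label of a document's dominant LDA topic."""
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        dominant_id, _ = max(dist, key=lambda t: t[1])
        return topic_labels[dominant_id]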

Table 4 Manually formed labels for the extracted keywords per topic.
C. Topic model evaluation

To evaluate the output of the model, we used two metrics: the coherence value (Cv) score from the topic coherence framework, a qualitative measure of interpretability, and the quantitative perplexity measure39,40.

Perplexity.

Perplexity is a standard quantitative method for evaluating topic models. It measures how well the model predicts the sample word probabilities over the documents based on term occurrences in the topics. Perplexity45 is algebraically equivalent to the inverse of the geometric mean per-word probability and is calculated for the test corpus Dtest by computing the natural exponent of the negative mean log-likelihood of the corpus words, as presented in Eq. 1. In topic modeling, a lower perplexity value indicates better generalization ability.

$$\mathrm{Perplexity}(D_{test}) = \exp\!\left(-\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d}\right)$$

(1)

Where Dtest is the test dataset, M is the number of textual documents in the test dataset, wd represents the words in document d, Nd is the number of words in document d, and p(wd) is the likelihood of document d. The perplexity of our LDA model at 10 topics, based on the experiment, is −6.403. We used perplexity to guide the choice of the number of topics k in the model assessment of our dataset, since the model requires the parameter K to be specified. Calculating the log perplexity yields a negative value because it is the logarithm of a probability smaller than one.
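In Gensim this quantity can be obtained roughly as sketched below; note that Gensim's log_perplexity returns the per-word likelihood bound (the reported −6.403 is of this form), and Gensim's own convention converts that bound to a perplexity as 2 raised to the negative bound:

    # Per-word likelihood bound on a held-out BOW corpus (negative, like -6.403).
    bound = lda.log_perplexity(bow_corpus)
    # Gensim's conventional conversion of the bound to a perplexity value.
    perplexity = 2 ** (-bound)
    print(f"per-word bound = {bound:.3f}, perplexity = {perplexity:.2f}")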

Coherence value.

In topic modeling, the coherence value is a score measuring the semantic similarity between the words in the extracted topics17. When the generated topics are semantically interpretable, topic coherence scores are high. Several measures based on the word co-occurrence scores of the most significant terms of each distinct topic have been applied in the topic extraction literature. The best practice for determining how interpretable a topic is, is to evaluate its coherence. Human topic ranking is the gold standard for assessing coherence, but it is expensive. In the current experiment, we implement a particular case of topic coherence using Eq. 2.

$$C_v = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{sim}(w_i, w_j)$$

(2)

Where N is the total number of terms in the top-N list for the given topic, and sim(wi, wj) is the Jaccard similarity between the sets of documents containing words wi and wj.

In our experiment, topic modeling is run for 2, 4, 6, 8, and 10 topics, and the coherence score is nearly the same for each generated model, ranging from 0.5878 to 0.5918. In this work, the extracted topics are best interpretable at 10 topics, considering both the coherence values and human expert judgment. The coherence values measuring the interpretability of the topics in our methodology are illustrated in Fig. 5.
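A sketch of this sweep with Gensim's CoherenceModel, reusing the corpus objects defined earlier (hyperparameters other than the topic counts are illustrative):

    from gensim.models import CoherenceModel, LdaMulticore

    # Train a candidate model for each topic count used in the experiment and
    # compute its C_v coherence on the tokenized documents.
    coherence_by_k = {}
    for k in (2, 4, 6, 8, 10):
        candidate = LdaMulticore(corpus=bow_corpus, id2word=dictionary,
                                 num_topics=k, passes=10, workers=3, random_state=42)
        cm = CoherenceModel(model=candidate, texts=preprocessed_docs,
                            dictionary=dictionary, coherence="c_v")
        coherence_by_k[k] = cm.get_coherence()

    print(coherence_by_k)  # values close to 0.59 in our experiments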

Fig. 5 Topic model coherence values using the LDA algorithm.

In this work, we assess our model results using interpretability quality criteria through human inspection and topic coherence metrics. For the human judgment, the decision on each resulting topic is manually recorded to state whether the topic conveys a meaning or not.

Now, in order to mine more specific medical information, a technique is needed to improve retrieval performance. Finding the relevant documents within this collection in response to a user query is known as information retrieval46. Because information retrieval must deal with ambiguous knowledge, exact-match processing techniques are not appropriate47; probabilistic models are more suitable for these tasks. Within this model, relations are given probability weights corresponding to the significance of the document, and these weights reflect different degrees of relevance. The results of current IR schemes are usually sorted lists of documents in which the top results are the most likely to be relevant to the search index according to the system. In some methods, users can judge the returned results and tell the system which ones are applicable to them; the method then re-sorts the result set. Documents that contain many of the query terms are ranked with a higher probability value48. These relevance feedback processes are known to greatly improve IR performance. As a search key into the corpus, we can use the document ID, document name, or document keywords. In our experiment, we use document identity numbers as search queries to retrieve the corresponding topic information through the LDA model. The model then tries to retrieve the relevant topic information for the query entered by the user. Under the LDA model assumptions, every document is a mixture of topics, and the model can also visualize this topic mix and the dominant topics in each document. For instance, when the user provides document ID 1018 to the developed model, which contains “gurra keessa waraana, ciniina”, the corresponding topic weight distributions are displayed. This means the model finds information about the given document ID in our corpus. The topic mix for sample documents is presented in Table 5.
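A rough sketch of this ID-based lookup, assuming a mapping from document IDs to corpus positions is maintained (the mapping and the index value below are hypothetical):

    # Hypothetical index from document ID to corpus position; document 1018 is
    # the ear-related example discussed in the text.
    docid_to_index = {1018: 517}

    def topic_mix(doc_id):
        """Return the topic weights of the document with the given ID, highest first."""
        bow = bow_corpus[docid_to_index[doc_id]]
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        return sorted(dist, key=lambda t: t[1], reverse=True)

    print(topic_mix(1018))  # expected to be dominated by the ear-disease topic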

Table 5 Information retrieved from our corpus based on document ID.

Document classification analysis and evaluation

As mentioned in the dominant topic extraction section above, each document was allocated to one category based on the highest topic distribution weight. This is important when we repeatedly run topic models on the same dataset to find the best number of topics k. We trained the LDA model to extract the topics; after that, we needed to go further with the classification of Afaan Oromo health documents using the model. We carried out document categorization experiments according to the topics. Once a topic classifier is trained using the LDA models, we need to test the topic distribution in the documents.

We randomly selected 600 documents (20%) from the total dataset as a test set. After running the experiment, we manually labeled each document in this test set with a specific label name. We then compared the manually labeled test documents with the LDA-labeled documents, using the common metrics of accuracy and F1 score. Finally, we computed the effectiveness of the LDA model output in order to gauge how well the applied model generalizes to the rest of the dataset. Although our model classifies most of the problem statements accurately and gives the correct class labels for the test documents, we also analyzed a few examples where our model could not give accurate results due to constraints of our dataset.
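A minimal sketch of this comparison with scikit-learn, assuming test_bow_corpus holds the BOW vectors of the 600 held-out documents and manual_labels their hand-assigned class names (both are assumed to be prepared as described above):

    from sklearn.metrics import accuracy_score, f1_score

    # Predicted label = manual label of each test document's dominant LDA topic.
    predicted_labels = [label_document(bow) for bow in test_bow_corpus]

    accuracy = accuracy_score(manual_labels, predicted_labels)
    weighted_f1 = f1_score(manual_labels, predicted_labels, average="weighted")
    print(f"accuracy = {accuracy:.4f}, F1 = {weighted_f1:.4f}")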

In the context of topic modeling and NLP, polysemy (words with multiple meanings) and homonymy (words spelled the same but with different meanings) can significantly impact the interpretability and accuracy of the model. For instance, in our dataset the word “Dhukkuba” can mean “disease” or “pain”; in a medical document it could refer to a specific illness, general pain, or a symptom, depending on its context. The word “Nyaata” can mean “food” (noun) or “eats” (verb); the context of usage determines whether it refers to the food itself or to the act of eating. This can make it difficult for the model to clearly distinguish between topics. Nevertheless, our model shows promising results when we evaluate the test documents. From this experiment, we obtained 79.17% accuracy and a 79.66% F1 score for the LDA model results evaluated on the test documents. These metrics are promising, especially in the context of unsupervised topic modeling, where such performance levels are often considered robust. In this setting, the scores show the model's ability to generate coherent topics that align well with the underlying structure of the data.

Medical specialist recommendation to patients

In this work, we present a recommender method that helps users find solutions for their health conditions. Individuals spend considerable time and effort searching for doctors and medical departments according to the specialties offered. For this task, we have successfully created the corpus, the topics, and the documents associated with specific labels. The proposed method provides efficient medical specialist recommendations to fill this gap. The model takes an input inquiry from the user. This inquiry is sent to our model, which finds the most satisfying output in terms of medical specialists, and from that we generate the most probable suggestions for the patient based on the most similar weights. The model chooses the most similar, highly predictive documents for personalized recommendation to the target user. Our integrated system recommends the documents with the highest predicted score value to the user, as presented in Fig. 6. The model result is deemed accurate if the patient's consultation query falls within the disease class in which the suggested medical professionals are highly skilled.

Fig. 6 Medical specialist prediction model architecture.

Finally, we use our model to predict suitable medical specialists for new documents or patient queries provided by users. In this experiment, the output of the LDA results, namely the topics and topic probabilities associated with the trained documents, yields labeled documents that are finally useful for classifying unseen documents. Figure 7 illustrates the document topic distribution weights for the ten topics, or specialist departments. Each document under each topic is assigned a weight, with the higher-ranking documents having greater weights. Based on this ranking, the model recommends the highest-weighted topic name for each inserted document, as presented in Table 6. The users can also view the model-generated response for the provided unseen documents.
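A sketch of this inference step for an unseen patient query, reusing the dictionary, lda, and topic_labels objects from the previous snippets (the helper name and the trivial tokenizer are ours, not part of the described system):

    def recommend_specialist(query_text, tokenize=str.split):
        """Infer the topic mix of an unseen patient query and return the
        specialist department (topic label) with the highest predicted weight."""
        # tokenize should apply the same pre-processing as the training corpus;
        # str.split is only a placeholder.
        bow = dictionary.doc2bow(tokenize(query_text))
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        dominant_id, weight = max(dist, key=lambda t: t[1])
        return topic_labels[dominant_id], weight

    # Illustrative query describing ear pain; the expected recommendation is the
    # ear disease and disorders department (Topic 4).
    print(recommend_specialist("gurra keessa waraana ciniina"))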

Table 6 Our model's topic label predictions for unseen documents of the AOHD dataset.
Fig. 7 Graphical representation of the document topic distribution of unseen data using our model (Note: Nervous disease = 0, Gynecology = 1, Mental illness = 2, Eye disease = 3, Ear disease and disorders = 4, Internal disease = 5, Chronic disease = 6, Skin disease = 7, Orthopedic conditions = 8, Dental disease = 9).
