Data source: online caregiving forum
ALZConnected.org is a US-based website for ADRD patients and their caregivers, officially supported by the Alzheimer’s Association20. The caregiver forum of ALZConnected.org is a place where caregivers, mostly informal caregivers of people with ADRD, actively seek and provide advice and information on caring for their loved ones4. The forum requires official registration as a member in order to write an original post, and only members may reply to posts. However, all forum posts are public and visible to anyone without registration. This public forum has previously been used for content analysis on other health topics among caregivers and ADRD patients21. The Institutional Review Board of Stanford University deemed this study exempt from human subjects research.
Data extraction: web-scraping
Web scraping was performed to retrieve public forum posts from March 1, 2018 to February 28, 2022. We scraped only publicly available information, including the title, main body, and replies to the body of each post, the date the post was added, the username of the poster, and the date the user joined the forum. We pre-specified ten mental health keywords to selectively extract posts that contained caregivers’ mental distress and related stressors: depression, anxiety health, little interest, hopelessness, nervousness, worrying, loneliness, mental health, and mental distress. The ten keywords were adapted from items of a widely used mental distress screening tool, the Patient Health Questionnaire-4, as well as terms commonly used to describe mental distress in online health forums16. The scraper was programmed to extract posts containing any of the listed keywords, and the extracted information was imported into a Microsoft Excel spreadsheet.
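The keyword filter described above can be sketched as follows. This is a minimal illustration, not the study's actual scraper: the post schema (`title`, `body`, `replies`) and the function names are assumptions, and only a subset of the pre-specified keywords is shown.

```python
import re

# Subset of the pre-specified mental health keywords (illustrative).
KEYWORDS = ["depression", "anxiety", "little interest", "hopelessness",
            "nervousness", "worrying", "loneliness", "mental health",
            "mental distress"]

def contains_keyword(text):
    """Return True if any keyword appears as a whole word/phrase in the text."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(k) + r"\b", lowered)
               for k in KEYWORDS)

def filter_posts(posts):
    """Keep posts whose title, body, or replies contain any keyword.

    `posts` is assumed to be a list of dicts with 'title', 'body',
    and optional 'replies' fields (a hypothetical schema).
    """
    return [p for p in posts
            if contains_keyword(" ".join([p["title"], p["body"]]
                                         + p.get("replies", [])))]
```

Matching on word boundaries (rather than raw substrings) avoids false hits such as "worryingly" matching only by accident of spelling; whether the original scraper did this is an assumption.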
Feasibility of studying mental health care needs of caregivers using online forum data
Feasibility was determined by whether we could investigate informal caregivers’ mental distress, related stressors, and needed support through the online caregiving forum data, both qualitatively and quantitatively. The titles of the posts were screened for this purpose, given that they typically encapsulate the essence of the posts (e.g., “Emotionally tired,” “Feeling frustrated,” “Caregiver recovery program?”). Three independent researchers (JK, YC, and ZRC) reviewed the titles for the feasibility assessment. The primary researcher (JK) reviewed the posts and categorized each one whose title intended to (1) express the caregiver’s own mental distress or negative emotions, or (2) seek advice or support for their own mental and emotional distress. If a post qualified for both, it was assigned to the seeking-advice category. If a title did not provide enough information to determine the category, the body of the post was read as needed. The secondary researcher (YC) reviewed and confirmed the categorization, adapting a member-checking approach. To further ensure the quality of the categorization, the third researcher (ZRC) iteratively validated it by randomly selecting 100 titles from the total posts to review and categorize independently. The third researcher’s (ZRC) categorization was compared with the original categorization (JK and YC), and intercoder reliability was reported in terms of Cohen’s kappa.
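The intercoder reliability check above can be illustrated with a small sketch of Cohen's kappa over two coders' labels. The category names and sample labels here are hypothetical; the formula itself is the standard one (observed agreement corrected for chance agreement).

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa between two coders' category labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each coder's
    marginal label frequencies.
    """
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, with 10 titles coded into the two categories (hypothetical labels "distress" and "advice"), 9 agreements out of 10 with these marginals yield kappa ≈ 0.74, illustrating how kappa discounts chance agreement relative to raw percent agreement.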
Data analysis using natural language processing (NLP)
Natural language processing (NLP) is a subarea of linguistics, computer science, and artificial intelligence and is widely used for large-scale data analysis in health-related fields22,23. A machine learning (ML)-based NLP pipeline was used to perform three NLP techniques: tokenization, lemmatization and stemming, and topic modeling. For text pre-processing, we used functions from NLTK (Natural Language Toolkit). First, NLTK’s word tokenizer split texts into groups of words, or tokens (e.g., “Thanks for any insight here.” → ‘Thanks’, ‘for’, ‘any’, ‘insight’, ‘here’, ‘.’). Second, we removed stopwords (e.g., ‘the’, ‘of’, ‘to’) listed in the NLTK corpus, along with additional stopwords defined at a later stage; this step also removed articles and punctuation. Third, an NLTK lemmatizer (WordNetLemmatizer) was used to strip words down to their most basic form (e.g., being → be, walks → walk, Thank → thank). Fourth, with these pre-processed words, topic modeling was conducted using Latent Dirichlet Allocation (LDA). LDA is a three-level hierarchical Bayesian model for collections of discrete data, including text corpora24. When applied to text modeling, LDA generates a set of topic probabilities, in which each topic probability can represent a single prominent topic. The LDA model, implemented with the ‘gensim’ and ‘pyLDAvis’ libraries, creates groups of topics containing the words most likely to belong to each topic. For the topic modeling, we built bigram and trigram models to capture two- or three-word groups that commonly appear together, aiding interpretation of the topic modeling results.
Validity assessment of the topics generated by the NLP/ML-based topic modeling
The validity assessment had two primary objectives. The first was to validate the performance of the NLP/ML model, that is, whether it could generate topics representative of the discussions in the online forum. For this purpose, the topics generated by the NLP/ML topic modeling were compared with themes qualitatively generated by human readers to see whether the modeled topics matched the qualitative themes. As a first step in this validation, two researchers labeled the NLP-modeled topics. The primary researcher (SJR) interpreted and labeled the topics using the word clouds from the NLP model. The secondary researcher (SO) reviewed and confirmed the labels based on the member-checking approach; disagreements were resolved through discussion. As a second step, thematic analysis was applied, which has been widely used to identify and analyze patterns of themes, topics, or ideas in online data for individuals with ADRD and their caregivers21. Two trained researchers, neither of whom was involved in labeling the NLP topics (to avoid potential bias), qualitatively analyzed the text data. The primary researcher (JK) read the posts carefully to generate tentative themes and refined them after multiple reviews. Of the posts containing at least one mental health keyword, we initially selected 100 at random to create the initial theme codebook, then progressively added posts to the analysis until no new topics could be discerned. The secondary researcher (YC) independently read posts at random to review, check, and further refine the initial codes. The two coders met to reconcile conflicts through discussion and finalize the codebook. Then, to validate the codes, the two researchers independently read the posts (body and title) and applied the identified codes to approximately 3% of the posts.
The two coders discussed conflicts for reconciliation and theme confirmation, and a meeting with a third coder was planned in case the two primary coders were unable to resolve a disagreement.
Upon completion of the thematic analysis, a trained researcher (ZRC), who was involved in neither the thematic analysis nor the NLP topic labeling, assessed the validity of the NLP topic modeling approach in extracting key information from the large text dataset, as done previously25. The NLP-modeled topics were matched against the themes from the qualitative analysis, and the comparison was reported as a table10.
The second objective was to validate whether the online caregiving forum content aligns with existing knowledge of caregivers’ mental distress and related stressors. For this purpose, we compared the NLP/ML-modeled topics and themes with a well-developed framework for caregiving burden and strain among informal caregivers of people with ADRD4. Two researchers (ZRC and JK), who were familiar with the context of the online caregiving forum, independently examined whether the forum content (NLP/ML topics and themes) was consistent with previously reported informal caregivers’ mental distress and related stressors. The assessments were compared and reconciled through discussion, and the agreed validity assessment was reported as a comparison table.