By DataKind San Francisco
An unresolved conflict at work, unconscious bias from a boss, an unexpected layoff. Most of us have experienced something like this at some point and understand how a seemingly peaceful workplace can suddenly turn tough, and how important it is to have someone to talk to and guide us through these obstacles. But what if you don't have such a supporter to confide your stress and frustration to, or a wise mentor to point you to the light at the end of the tunnel? Certainly, you can turn to professional therapists, but what if you can't afford one or have little or no prior experience with them? Enter Empower Work, a nonprofit organization that aims to solve this exact problem by providing free, confidential counseling services to help distressed employees with their work-related issues.
From November 2018 to July 2019, DataKind San Francisco was privileged to work with Empower Work to improve their services using what we do best — data science.
Outlining the Partnership
Empower Work recruits volunteers and trains them to become skilled counselors, who then interact directly with users via text message to talk them through a specific issue and brainstorm a solution. At the time of our partnership, the organization had collected about 500 conversations (de-identified to protect user privacy) and asked us to take on the following two tasks:
- Understanding Key Drivers of Success Analysis: Provide quantitative analysis to complement a qualitative understanding of what drives conversations to successful completion. Identify specific trends in successful vs. unsuccessful conversations. For example, are there specific tools that correlate with better outcomes? Or is it something on the texter’s side that is out of the counselor’s control? Empower Work is interested in this kind of information to potentially improve its counseling approach, which is grounded in coaching best practices.
- Tagging Automation Approach: Each conversation is annotated with one or more tags describing its issues, themes, and milestones. To date, all of the tags have been assigned by the counselors themselves. Going forward, to alleviate this manual labeling effort, Empower Work asked us to explore a machine learning approach to automate the process.
Below, we describe our approach to and conclusions from each of these two tasks. Read on!
Understanding Key Drivers of Success Analysis
We approached this question as a classic inference problem and built a logistic regression model to understand how different factors correlate with the outcome (a minimal sketch of the setup follows the list below). In particular:
- Our dependent variable is a binary variable indicating whether a conversation has been completed or not, as determined by the counselors themselves.
- Our independent variables include the following:
- Counseling Tools: Based on our review of a sample of conversations, we identified a few common counseling techniques such as articulation (i.e., putting into words something the texters have expressed but not clearly stated, and reflecting it back to them), acknowledgement (i.e., acknowledging the texters for their values, actions, and accomplishments), and validation (i.e., validating their emotions). Using regular expressions, we created a set of dummy variables that automatically indicate whether a given practice is used in a given conversation.
- Text Features: These are features extracted directly from the textual data itself. Some of the common ones we considered include the sentiment scores (whether positive or negative words were used), message lengths, and the similarity between the messages sent by the texters and their respective counselors.
- Non-Text Features: These are mostly behavioral features such as the response lags, the aforementioned human-labeled issue tags, and the user platforms.
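To make this concrete, here is a minimal sketch of the setup using pandas and statsmodels; the file name, column names, and exact feature set are illustrative rather than the actual project data.

```python
# A minimal sketch of the inference setup, assuming the features described
# above have already been computed; all names here are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("conversation_features.csv")  # hypothetical feature table

feature_cols = [
    "used_acknowledgement",   # dummy: counselor acknowledged the texter
    "used_articulation",      # dummy: counselor reflected back what was said
    "used_validation",        # dummy: counselor validated the texter's emotions
    "texter_sentiment",       # e.g., VADER compound score of texter messages
    "avg_response_lag_min",   # behavioral feature: average response delay
]

X = sm.add_constant(df[feature_cols])
y = df["completed"]  # binary outcome, as labeled by the counselors

result = sm.Logit(y, X).fit()
print(result.summary())  # coefficients, p-values, and confidence intervals
```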
After fitting our model with these features, we found the following factors particularly insightful:
- On the counselor's side, we found that acknowledging the texters has a high positive impact. For example, counselors frequently acknowledge their texters for being thoughtful and courageous and admire their values and integrity. Given how much we need validation and appreciation in a distressed state, it's no surprise that this particular counseling style proves to be highly effective.
- In addition to acknowledging the texters, we found that simply articulating what they have said is very effective. For example, counselors often recognize and confirm what they’re hearing by saying “I’m hearing that…”, “it sounds like…”, and “I can sense that…” Doing so may seem trivial at first glance, but this confirms the importance of being heard. In addition, this gives the texters the opportunity to let the counselor know if they disagree and to offer their own take.
- Switching to the texters' side, we found a few dominant factors that are effectively out of the counselor's control. For example, when a texter comes in with a relatively positive sentiment, as reflected in their messages' sentiment scores, there is a significantly higher chance that the conversation will be completed than if they come in with a negative sentiment.
In addition to the significant factors described above, we also identified a few counseling practices that, although statistically insignificant in our model, are interesting enough to warrant an A/B test. For example, recurring patterns in counselor messages include questions designed to draw the solution out of the texters themselves (e.g., "What would you hope the outcome to be?") and reframing (e.g., "What would you do if you were in their shoes?"). After all, our model was built on fewer than 500 conversations, which limits its statistical power.
As a result of these findings, Empower Work has been able to refine its training and support to stress the use of tools that correlate to better outcomes.
Tagging Automation Approach
We built one XGBoost model per tag to predict whether that tag applies to a given conversation. As for the model features, in addition to the same counseling-style and text features used in the inference task above, we included a TF-IDF matrix that represents each conversation as a numeric vector (i.e., a "bag of words"), weighting each word by its within-document frequency and its inverse cross-document frequency. This simple representation turned out to be quite effective in our case, given that certain words have an almost one-to-one relationship with certain tags. A minimal sketch of this setup follows.
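While our actual pipeline had more moving parts, the core of the one-model-per-tag setup can be sketched as follows; scikit-learn and xgboost are assumed, and the function and variable names are ours for illustration.

```python
from typing import Dict, List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier


def train_tag_models(conversations: List[str],
                     tag_labels: Dict[str, List[int]]) -> Dict[str, XGBClassifier]:
    """Fit one binary XGBoost classifier per tag on shared TF-IDF features."""
    # Unigrams and bigrams appearing in at least 5% of the conversations,
    # as described in the appendix.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=0.05)
    X = vectorizer.fit_transform(conversations)

    models = {}
    for tag, y in tag_labels.items():  # y is a 0/1 label per conversation
        X_train, _, y_train, _ = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y)
        clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
        clf.fit(X_train, y_train)
        models[tag] = clf
    return models
```

Sharing one TF-IDF matrix across all tags keeps the feature space consistent, so only the label vector changes from model to model.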
One challenge was that there were over 100 tags across the 500 conversations, and half of them appeared in fewer than 10 conversations. Obviously, there is little we can do for those rare tags, but where should we draw the line in terms of minimum sample size? Traditional rules of thumb suggest cutoffs such as 30, but how do we derive and validate a cutoff empirically for our particular task?
To do that, we first built a simple model for each of the 100-plus tags, regardless of its sample size, and validated its performance on a separate holdout set. In addition, to measure the stability of the models, we built five versions of each with different random seeds and computed the standard deviation of their performance scores. Intuitively, a good model should have a high performance score (the F1 score, in our case) and a low variance. After running this experiment, we visualized the results by plotting the average F1 score and its standard deviation against each tag's sample size, as shown below. As expected, the more data a given tag has, the better our model is at predicting it (indicated by a higher F1 score) and the more stable the model is (indicated by a lower standard deviation). Based on this empirical analysis, we picked 50 as our threshold and only built models for the tags with at least 50 data points.
We selected the F1 score as the metric in this case because it balances the model’s precision (i.e., when a model predicts a tag exists, how often does it actually exist?) vs. its recall (i.e., when a tag actually exists, how often does the model catch it?), a tradeoff similar to balancing false positives vs. false negatives. We also examined the model’s performance from an accuracy perspective, where some of our better-performing models exceeded 90% accuracy. However, accuracy can be misleading for the less common tags. As an extreme example, an algorithm that predicts “no tag” for all the examples might achieve high accuracy but would, of course, be useless.
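To make the stability experiment concrete, here is a minimal sketch of it, reusing the TF-IDF features and per-tag labels from the training sketch above. The seeds and model settings are illustrative; here, each seed controls both the train/holdout split and the model's own randomness.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier


def f1_across_seeds(X, y, seeds=(0, 1, 2, 3, 4)):
    """Return the mean and std of holdout F1 over several random seeds."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        clf = XGBClassifier(n_estimators=200, max_depth=4,
                            eval_metric="logloss", random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_test, clf.predict(X_test)))
    return np.mean(scores), np.std(scores)
```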
Now that the models were built, though, they would be of limited use just sitting on our laptops. To make them easier for the organization to use, we built a web app using Flask that accepts one or more conversations (saved in a CSV file) and outputs the predicted probability for each tag. A screenshot of our interface is shown below.
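The app itself is not public, but a minimal sketch of such an endpoint might look like the following; the artifact file, route, and CSV column name are hypothetical.

```python
# A minimal sketch of a Flask tagging endpoint; the artifact file, route,
# and column name are hypothetical, not Empower Work's actual app.
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the fitted vectorizer and per-tag models produced by the training step.
with open("tagging_artifacts.pkl", "rb") as f:  # hypothetical artifact file
    artifacts = pickle.load(f)
vectorizer, models = artifacts["vectorizer"], artifacts["models"]


@app.route("/predict", methods=["POST"])
def predict():
    """Accept a CSV of conversations and return per-tag probabilities."""
    df = pd.read_csv(request.files["file"])       # one conversation per row
    X = vectorizer.transform(df["conversation"])  # hypothetical column name
    return jsonify({tag: clf.predict_proba(X)[:, 1].tolist()
                    for tag, clf in models.items()})


if __name__ == "__main__":
    app.run()
```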
As a proof of concept, our model has demonstrated how a machine learning approach can help identify best practices, which in turn have improved the way Empower Work trains and supports volunteers to provide the greatest impact to people who come to them for help. With more data gathered in the future, we’re excited, along with Empower Work, to retrain the model and further expand its impact.
Conclusion & Acknowledgments
For this project, we collaborated with Empower Work to analyze the potential factors that impact the outcome of a counseling conversation and made recommendations to improve the service. In addition, we built models to predict the relevant tags for a given conversation to alleviate the manual labeling effort in the future.
We want to thank Empower Work for entrusting us with this interesting and impactful project. We have learned a lot from it. We’d also like to recognize our talented and devoted volunteers from DataKind, namely, Runze Wang (Data Ambassador), Vishal Motwani, Edwin Zhang, and Peter Adelson. The project wouldn’t have materialized without your hard work.
Lastly, but importantly, we'd like to take this opportunity to raise awareness of work-induced stress and the importance of mental health. If you feel subject to pressure, discrimination, or burnout at work, please don't feel alone in taking action. Talking is itself often an effective first step, and if you can't find anyone readily available, come and talk to the good folks at Empower Work! There's always someone willing to lend an ear. You just need to speak up.
Join Us
To get involved in current and upcoming projects with DataKind San Francisco, please check the DataKind website or follow DataKind San Francisco on Facebook or LinkedIn for more information.
Appendix
In this appendix, we briefly describe the steps we took to process the raw text data; minimal sketches of a few of these steps follow the list. If you're working on something similar, we hope this will give you some ideas.
- Remove automated messages: Because of the nature of the service, some messages are sent automatically instead of by human counselors. We removed all of them because they do not contribute to the actual counseling sessions and confound features like response delays and sentiment scores.
- Aggregate messages that were sent disjointly: Many texters break one long message into several shorter ones and send them separately. To accurately compute message-level features such as average length, sentiment, and similarity, we aggregated consecutive messages sent by the same party within the same day into a single message.
- Compute sentiment scores: We used the off-the-shelf rule-based VADER sentiment analyzer from NLTK.
- Tokenize messages: We used spaCy to tokenize our messages because it’s easy to parallelize and comes with additional features like word embeddings out of the box, which were later used to compute sentence similarities.
- Compute message similarities: As described above, we used word embeddings to compute the similarities between the messages sent by the texters and those sent by their counselors in order to measure the coherence of the conversations.
- Identify counseling styles: As mentioned in the post itself, we developed a few regular expressions to capture counseling practices such as praise and acknowledgment.
- Create TF-IDF matrix: Lastly, in the tag prediction task, we constructed a TF-IDF matrix using both unigrams and bigrams that appear in at least 5% of the conversations. As mentioned in the post, this simple vectorization proved to be quite effective in capturing the keywords that determine the main issues of a conversation.
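Below are minimal sketches of a few of these steps. In each case, the column names, patterns, and example sentences are illustrative; they are not the actual project code or data.

Aggregating disjoint messages, assuming a DataFrame with `sender`, `sent_at` (a parsed datetime column), and `text` columns:

```python
import pandas as pd


def aggregate_bursts(messages: pd.DataFrame) -> pd.DataFrame:
    """Merge uninterrupted same-day messages from the same sender into one."""
    messages = messages.sort_values("sent_at")
    same_sender = messages["sender"].eq(messages["sender"].shift())
    same_day = messages["sent_at"].dt.date == messages["sent_at"].shift().dt.date
    burst_id = (~(same_sender & same_day)).cumsum()  # new burst on any change
    return (messages.groupby(burst_id)
            .agg(sender=("sender", "first"),
                 sent_at=("sent_at", "first"),
                 text=("text", " ".join))
            .reset_index(drop=True))
```

Computing sentiment scores with NLTK's VADER analyzer:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()
# The compound score ranges from -1 (most negative) to +1 (most positive).
print(sia.polarity_scores("Thank you, that really helps!")["compound"])
```

Computing message similarity with spaCy, assuming a model that ships with word vectors (such as en_core_web_md) is installed:

```python
import spacy

nlp = spacy.load("en_core_web_md")
texter = nlp("I feel like my manager never listens to me.")
counselor = nlp("It sounds like you don't feel heard at work.")
# Doc.similarity averages the word vectors and returns the cosine similarity.
print(texter.similarity(counselor))
```

Flagging counseling styles with regular expressions (our actual patterns were more extensive; these are illustrative):

```python
import re

STYLE_PATTERNS = {
    "articulation": re.compile(
        r"\b(i'?m hearing that|it sounds like|i can sense)\b", re.I),
    "acknowledgement": re.compile(
        r"\byou('?ve| have)? (been|shown|were) (so )?(thoughtful|courageous|brave)\b",
        re.I),
}


def detect_styles(counselor_text: str) -> dict:
    """Return one 0/1 dummy indicator per counseling style."""
    return {style: int(bool(pattern.search(counselor_text)))
            for style, pattern in STYLE_PATTERNS.items()}


print(detect_styles("It sounds like you've already tried talking to HR."))
```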
Because we needed to run the same processing pipeline for different tasks and subsets of the data, we modularized our development code into a library that can be easily called and extended, which allowed us to iterate quickly. Sometimes it does pay off to spend time upfront to facilitate things in the long run.
The header image above is courtesy of Empower Work.