Interpersonal communication is often difficult: we have to switch our communication style based on the context (e.g. work or social), our conversational partner (romantic partner vs. parents vs. colleagues), and the topic (e.g. general chit-chat, joking around, or discussing serious issues). Written communication in particular is prone to misunderstandings and failures to convey subtle messaging, which often results in insecurity about the best wording of messages and emails. At the same time, we are sometimes faced with manipulative or harmful communication patterns employed by conversational partners, which can be hard to recognise.
Analysing written interactions and offering feedback on the communication style employed by users and their conversational partners has the potential to improve interpersonal written communication in various contexts. This includes questions of tone (e.g. is this email to my boss too passive-aggressive? Is this feedback constructive enough?) as well as analysis of improvable or potentially toxic relationship dynamics (e.g. recognising and flagging when a conversational partner is being manipulative, or detecting grooming in the messages of teenagers).
We aimed to build a locally running, LLM-based personal assistant that analyses written conversations from various media (email, WhatsApp, etc.) and provides the user with feedback about 1) their communication style, and 2) their relationship dynamics with the conversational partner. The user can communicate with the AI to get advice on how to change/improve their communication and what to do to adapt existing relationship dynamics. The assistant was intended to be adaptable depending on the social context and type of conversational partner, meaning the feedback would look different for friends or family as opposed to colleagues and employers, for instance.
Technical Approach
The main challenge in building an assistant that provides advice users perceive as helpful was finding suitable training data. We did not find any public datasets of advice on interpersonal communication. We did find some datasets of (toxic or manipulative) conversations, but most of them were narrowly focused on specific topics or domains, which led us to take a different approach to data curation, leveraging the capabilities of the most powerful large language models available at the time. We followed a multi-step approach: we first created sample chats, which then served as a basis for soliciting advice. This advice was rated in a study employing crowd workers, and the best-rated pieces of advice were chosen as training data.
1. Chat Generation and Curation
The first step towards collecting relevant data consisted of researching common manipulation strategies employed in different types of interpersonal relationships and using GPT-4 (gpt-4) and Claude Opus (claude-3-opus-20240229) to generate chats for each relationship-manipulation strategy combination via the OpenAI and Anthropic APIs, respectively, passing the following prompt:
{'role': 'system', 'content': 'You are an experienced psychologist and relationship and communication expert. You know everything about how people communicate and can pinpoint and recognize the most subtle forms of manipulation and toxic behaviour.'} (Note: the system prompt was passed only to GPT-4),
{'role': 'user', 'content': 'Create an example for a highly realistic [whatsapp chat/email thread] between a [romantic couple (A and B)/two or more friends/two or more colleagues/one or more colleagues and their boss].
The chat should contain one person being manipulative without the other person noticing by using the following conversational tactic: [tactic].
Return only the messages and the speakers and nothing else. Do not explain the chat.'}
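As an illustration, the sketch below shows how a chat for one relationship-tactic combination could be generated via the two APIs. It assumes the official openai and anthropic Python clients with API keys set in the environment; the helper names and the example tactic ("gaslighting") are ours and not necessarily those used in the actual pipeline.

```python
# Sketch: generating one chat per relationship/manipulation-tactic combination.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set in the environment.
from openai import OpenAI
import anthropic

SYSTEM_PROMPT = (
    "You are an experienced psychologist and relationship and communication expert. "
    "You know everything about how people communicate and can pinpoint and recognize "
    "the most subtle forms of manipulation and toxic behaviour."
)

def build_user_prompt(medium: str, relationship: str, tactic: str) -> str:
    # Fills the prompt template shown above.
    return (
        f"Create an example for a highly realistic {medium} between a {relationship}.\n"
        "The chat should contain one person being manipulative without the other person "
        f"noticing by using the following conversational tactic: {tactic}.\n"
        "Return only the messages and the speakers and nothing else. Do not explain the chat."
    )

def generate_chat_gpt4(medium, relationship, tactic):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # system prompt only used for GPT-4
            {"role": "user", "content": build_user_prompt(medium, relationship, tactic)},
        ],
    )
    return response.choices[0].message.content

def generate_chat_claude(medium, relationship, tactic):
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": build_user_prompt(medium, relationship, tactic)}],
    )
    return response.content[0].text

# Example call; "gaslighting" stands in for one of the researched tactics.
chat = generate_chat_gpt4("whatsapp chat", "romantic couple (A and B)", "gaslighting")
```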
The table below gives an overview of the combinations. Following this, we manually read through the generated chats and identified the most realistic and useful ones for our use case.
We selected 33 well-suited chats (all displayed manipulative behaviour by at least one speaker, in varying degrees of subtlety). This selection served as the basis for advice solicitation.
2. Advice Generation
We generated advice for each chat using different highly capable large language models (gpt-4, claude-3-opus-20240229, mixtral-8x22b-instruct, llama-v3-70b-instruct), the rationale being that the advice generated by these powerful models can serve as training data for smaller, less capable models with the ability to run on a local machine.
When generating advice, we differentiated between three different scenarios:
- Scenario 1: The manipulated person feels uneasy about the interaction, but is not sure why, leading them to ask for advice.
- Scenario 2: The manipulator feels like they may have acted wrong in the situation and asks for advice.
- Scenario 3: The manipulated person does not know what to say next and asks for advice.
We solicited advice for each of these scenarios from each of the models listed above via the OpenAI (GPT-4), Anthropic (Claude), and Fireworks (Mixtral, Llama3) APIs, using the following prompt template.
Below is a(n) [Whatsapp chat/email conversation] between me ([name of manipulated person]) and my [boss/colleagues/friends/romantic partner] ([name of manipulator]):
[Chat]
[The conversation leaves me uneasy, but I do not know why. What went wrong in this interaction?/I feel I might have acted wrong in the conversation. Did I make any mistakes? What could I do better?/What should I say next? Why would that be the right reaction?]
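The sketch below illustrates how this template could be filled and sent to the Fireworks-hosted models through Fireworks' OpenAI-compatible endpoint; the model identifier, helper names, and scenario keys are illustrative rather than taken from our code.

```python
# Sketch: soliciting advice for one chat and scenario from a Fireworks-hosted model.
# Assumes FIREWORKS_API_KEY is set; the model ID shown is illustrative.
import os
from openai import OpenAI

SCENARIO_QUESTIONS = {
    "uneasy": "The conversation leaves me uneasy, but I do not know why. What went wrong in this interaction?",
    "acted_wrong": "I feel I might have acted wrong in the conversation. Did I make any mistakes? What could I do better?",
    "next_message": "What should I say next? Why would that be the right reaction?",
}

def build_advice_prompt(medium, asker, relationship, other, chat, scenario):
    # Fills the advice prompt template shown above.
    return (
        f"Below is a(n) {medium} between me ({asker}) and my {relationship} ({other}):\n\n"
        f"{chat}\n\n"
        f"{SCENARIO_QUESTIONS[scenario]}"
    )

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

chat_text = "..."  # one of the 33 selected chats from step 1
response = client.chat.completions.create(
    model="accounts/fireworks/models/mixtral-8x22b-instruct",  # or llama-v3-70b-instruct
    messages=[{"role": "user", "content": build_advice_prompt(
        "Whatsapp chat", "A", "romantic partner", "B", chat_text, "uneasy")}],
)
advice = response.choices[0].message.content
```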
3. Advice Quality Judgements
Following the generation of advice, we set up a Prolific study, employing crowd workers to judge the quality of the advice. Participants were first presented with the respective chat and asked to judge how they would feel if they were the manipulator/manipulated person in the chat, to make sure they understood the context of the chat correctly (see screenshots below).
Following this, they were presented with advice and asked to judge it on a scale of one (very bad) to five (very helpful). We also solicited free-text explanations of what they liked or disliked about the advice. Each participant was asked to rate three pieces of advice, each corresponding to a different chat. This way we collected between 2 and 5 (mean = 3) ratings for each piece of advice included in the study. The study interface was created with Django and Docker; the code is included with the submission.
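The study backend itself is not reproduced here, but a minimal Django model along the following lines could store the collected judgements; the class and field names are hypothetical, not taken from our codebase.

```python
# Hypothetical sketch of a Django model for storing advice judgements
# (the actual study interface code is included with the submission).
from django.db import models

class AdviceRating(models.Model):
    chat_id = models.CharField(max_length=64)           # generated chat the advice refers to
    advice_id = models.CharField(max_length=64)          # which model/scenario produced the advice
    participant_id = models.CharField(max_length=64)     # Prolific participant identifier
    score = models.PositiveSmallIntegerField()           # 1 (very bad) to 5 (very helpful)
    explanation = models.TextField(blank=True)           # free-text likes/dislikes
    created_at = models.DateTimeField(auto_now_add=True)
```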
4. Few-shot Learning Smaller LLMs
After we had gathered ratings and judgements of the advice generated by the more powerful LLMs, we used the highest-rated pieces of advice as few-shot examples for three smaller open-source LLMs: Phi3-mini, Llama3-8b-instruct, and Mistral-7b-instruct. A comparative evaluation of their performance with and without few-shot learning is presented in the evaluation section. To select suitable few-shot examples, we calculated the mean rating for each piece of advice and discarded all pieces with a mean rating lower than 4, ending up with 2-6 few-shot examples per relationship-conversational context combination. Two scenarios did not contain any advice with a mean rating of at least 4: advice on what to say next in the relational contexts of friendships and of the workplace without hierarchy between the conversational partners. As a workaround, we used the corresponding few-shot examples from romantic relationships and from the workplace with hierarchy, respectively, in these two cases.
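A minimal sketch of this selection step is shown below, assuming the ratings are gathered in a pandas DataFrame; the column names, label values, and fallback mapping are illustrative and mirror the workaround described above.

```python
# Sketch: selecting few-shot examples from the crowdsourced ratings.
# Column names (advice_id, relationship, scenario, rating, advice_text) are illustrative.
import pandas as pd

ratings = pd.read_csv("advice_ratings.csv")

# Mean rating per piece of advice, keeping its relationship/scenario labels.
mean_ratings = (
    ratings.groupby(["advice_id", "relationship", "scenario", "advice_text"], as_index=False)
    ["rating"].mean()
)

# Keep only advice with a mean rating of at least 4.
selected = mean_ratings[mean_ratings["rating"] >= 4]

# Fallback for combinations without any sufficiently rated advice.
FALLBACK = {
    ("friends", "next_message"): ("romantic", "next_message"),
    ("colleagues_no_hierarchy", "next_message"): ("colleagues_hierarchy", "next_message"),
}

def few_shot_examples(relationship, scenario):
    subset = selected[(selected.relationship == relationship) & (selected.scenario == scenario)]
    if subset.empty and (relationship, scenario) in FALLBACK:
        rel, scen = FALLBACK[(relationship, scenario)]
        subset = selected[(selected.relationship == rel) & (selected.scenario == scen)]
    return subset["advice_text"].tolist()
```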
5. Advice Rater Model
To evaluate the efficacy of our few-shot approach, we used Google's Gemini 1.5 Flash as a rater model. To test whether this is feasible, we split the advice ratings into training and test sets and used the training examples as few-shot samples passed to Gemini. We then had it rate the test data and calculated the mean absolute error, which was 0.85. We considered this sufficient: while not completely accurate in absolute terms, the model appears to capture the general trend of which advice is better and which is worse, allowing for comparisons between pieces of advice. For the final rater model, we passed all pieces of advice with their respective ratings and rating explanations as few-shot examples to Gemini, followed by the final piece of advice to rate.
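A sketch of the rater setup and the feasibility test is shown below, using the google-generativeai Python client; the prompt wording and helper names are illustrative, and train/test stand for the two halves of the crowdsourced ratings.

```python
# Sketch: using Gemini 1.5 Flash as a few-shot rater and computing the MAE on held-out ratings.
# Assumes GOOGLE_API_KEY is set; train and test are lists of dicts with keys
# chat, advice, rating, explanation (split from the crowdsourced ratings).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def build_rater_prompt(train_examples, chat, advice):
    parts = [
        "Rate the following advice for the given chat on a scale of 1 (very bad) to 5 (very helpful).",
        "Here are rated examples:",
    ]
    for ex in train_examples:
        parts.append(
            f"Chat:\n{ex['chat']}\nAdvice:\n{ex['advice']}\n"
            f"Rating: {ex['rating']}\nExplanation: {ex['explanation']}"
        )
    parts.append(f"Now rate this advice. Return only a number.\nChat:\n{chat}\nAdvice:\n{advice}")
    return "\n\n".join(parts)

def rate(train_examples, chat, advice):
    response = model.generate_content(build_rater_prompt(train_examples, chat, advice))
    return float(response.text.strip())

# Mean absolute error on the held-out test split.
predictions = [rate(train, ex["chat"], ex["advice"]) for ex in test]
mae = sum(abs(p - ex["rating"]) for p, ex in zip(predictions, test)) / len(test)
```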
6. Evaluation of Performance of Small Open Source Models
With the rater model in place, we evaluated our few-shot learning approach on advice for the manipulated person in chats from the context of romantic relationships. To this end, we selected all chats concerning romantic relationships that we had initially generated but not included in our crowdsourcing study (a total of 14 chats) and used each of the small LLMs mentioned above (Phi3, Llama3, and Mistral) to generate one piece of advice for the manipulated person in a zero-shot and one in a few-shot setting. The resulting responses were sent to the evaluator model to obtain ratings.
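The comparison can be sketched roughly as follows, assuming the three small models are served locally behind an OpenAI-compatible endpoint (e.g. Ollama) and reusing the rate() helper and few-shot examples from the sketches above; the model tags are illustrative.

```python
# Sketch: comparing zero-shot vs. few-shot advice quality for the small local models.
# Assumes an OpenAI-compatible local endpoint (e.g. Ollama at localhost:11434/v1);
# held_out_chats are the 14 romantic-relationship chats, examples the selected few-shot advice,
# and rate()/train come from the rater sketch above.
from openai import OpenAI
from statistics import mean

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
MODELS = ["phi3:mini", "llama3:8b-instruct", "mistral:7b-instruct"]  # illustrative model tags

def generate_advice(model, chat, examples=None):
    prompt = ""
    if examples:  # few-shot condition: prepend highly rated example advice
        prompt += "Here are examples of good advice for similar situations:\n\n" + "\n\n".join(examples) + "\n\n"
    prompt += (
        f"Below is a Whatsapp chat between me and my romantic partner:\n\n{chat}\n\n"
        "The conversation leaves me uneasy, but I do not know why. What went wrong in this interaction?"
    )
    response = local.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

results = {}
for model in MODELS:
    zero_shot = [rate(train, chat, generate_advice(model, chat)) for chat in held_out_chats]
    few_shot = [rate(train, chat, generate_advice(model, chat, examples)) for chat in held_out_chats]
    results[model] = {"zero-shot": mean(zero_shot), "few-shot": mean(few_shot)}
```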
Both Mistral and Llama showed improved scores when few-shot learning was applied, with the difference being stronger for Mistral. The quality of advice given by Phi3 did not improve. Since Mistral benefitted most from few-shot learning and was rated best overall, we chose Mistral-7b-instruct as the model to incorporate into our system.
7. Programming the GUI
The graphical user interface for interacting with the assistant was created using Gradio. At this point, it is a simple interface that allows the user to paste a chat (for instance from WhatsApp), select the type of relationship they have with the other person, and choose what type of advice they want. In the chat interface, they are then presented with the generated advice.
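A minimal Gradio layout along these lines could look as follows; the component labels and the get_advice() placeholder stand in for our actual implementation, which calls the few-shot-prompted Mistral-7b-instruct.

```python
# Sketch of the Gradio interface: paste a chat, pick relationship and advice type, get advice.
# Labels and the get_advice() body are placeholders for the actual implementation.
import gradio as gr

def get_advice(chat_text, relationship, advice_type):
    # In the real system this calls Mistral-7b-instruct with the matching few-shot examples.
    return f"(advice for a chat with your {relationship}, scenario: {advice_type})"

with gr.Blocks(title="Communication Assistant") as demo:
    chat_box = gr.Textbox(lines=15, label="Paste your chat here (e.g. exported from WhatsApp)")
    relationship = gr.Dropdown(
        ["romantic partner", "friends", "colleagues", "boss"], label="Relationship")
    advice_type = gr.Dropdown(
        ["Why do I feel uneasy?", "Did I act wrong?", "What should I say next?"], label="Advice type")
    submit = gr.Button("Get advice")
    output = gr.Textbox(label="Advice", lines=10)
    submit.click(get_advice, inputs=[chat_box, relationship, advice_type], outputs=output)

demo.launch()
```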
While in our experiments so far we only used the rater model to judge the efficacy of our few-shot learning approach, it could also be an effective tool for reinforcement learning in future iterations.
As of now, we have implemented a simple prototype and shown the efficacy of our approach. In the future, we plan to improve the pervasiveness of the assistant by allowing it to index the user's interactions automatically and proactively interact with users when a problematic situation arises. This would necessitate the creation of a "manipulation detector". Considering the effectiveness of few-shot learning in improving the advice generation process, we expect the detection of manipulative interactions to be similarly feasible.
Conclusion
In summary, our project aimed to develop a locally running, LLM-based personal assistant capable of analyzing written conversations and providing feedback on communication style and relationship dynamics. The system was designed to be adaptable to different social contexts and conversational partners, offering tailored advice for various scenarios.
We tackled the challenge of data scarcity by generating realistic chats using advanced language models and crowdsourcing the evaluation of generated advice. This process enabled us to create a training dataset for smaller, open-source models through few-shot learning, which demonstrated improved performance in generating helpful advice.
Our prototype includes a functional GUI that allows users to input chats and receive advice based on their relationship context. Future improvements could involve integrating proactive detection of manipulative interactions and using our evaluator model as a reward signal in a reinforcement learning approach.
Overall, our project showcases the potential of leveraging large language models for enhancing interpersonal written communication and provides a foundation for further development and improvement in this area.
A demo of the functioning system is available at https://assistant.samyateia.de.