This is my work together with Shotallo Kato and Katrien Verbert for IUI 2022, a conference on intelligent user interfaces. I am particularly proud of this work because it is based on Shotallo's master's thesis, which I guided. We studied the impact of (placebo) explanations on adolescents' trust in a platform that recommends mathematics exercises.
Paper
Download the preprint below or read it online on ResearchGate or the ACM Digital Library.
Abstract
In the scope of explainable artificial intelligence, explanation techniques are heavily studied to increase trust in recommender systems. However, studies on explaining recommendations typically target adults in e-commerce or media contexts; e-learning has received less research attention. To address these limits, we investigated how explanations affect adolescents’ initial trust in an e-learning platform that recommends mathematics exercises with collaborative filtering. In a randomized controlled experiment with 37 adolescents, we compared real explanations with placebo and no explanations. Our results show that real explanations significantly increased initial trust when trust was measured as a multidimensional construct of competence, benevolence, integrity, intention to return, and perceived transparency. Yet, this result did not hold when trust was measured one-dimensionally. Furthermore, not all adolescents attached equal importance to explanations and trust scores were high overall. These findings underline the need to tailor explanations and suggest that dynamically learned factors may be more important than explanations for building initial trust. To conclude, we thus reflect upon the need for explanations and recommendations in e-learning in low-stakes and high-stakes situations.
Main takeaways
- Multidimensional trust measures are more nuanced than one-dimensional trust measures
- Dynamically learned factors (e.g., perceived accuracy of recommendations, exercises’ quality) may be more important than explanations for building initial trust
- Placebo explanations are a useful baseline, especially when combined with qualitative data
- No explanations may be acceptable in low-stakes situations, but tailoring explanations remains important in high-stakes situations
Demo
Here is a short demo video to get a better understanding of what our mathematics platform looked like. The demo shows how students can practise, pick recommended exercises, and see explanations for why those exercises were recommended.
Presentation
Due to the COVID-19 pandemic, the IUI 2022 conference took place virtually. Here are the prerecorded presentation and the slide handouts.
Transcript
Hi everyone! My name is Jeroen Ooge. I am a PhD researcher at KU Leuven in Belgium and I'm very excited to present the work that I did with Shotallo Kato and my supervisor Katrien Verbert. As you can see in the title, we worked on explanations for recommendations in e-learning and their effects on adolescents' trust.
Let me first give you a bit of background. There has been a lot of research on recommendations and how to explain them for a specific target audience. Especially nowadays, this topic is gaining a lot of new traction in the wider field of explainable artificial intelligence. While we see a lot of very interesting progress, we also find some limitations in the current state of the art. A first limitation is that studies are mainly framed in e-commerce or media contexts, whereas e-learning is less studied. A second is that study participants are often university students or adults, and rarely adolescents. And finally, on a methods level, we also see that studies often compare an explanation interface only with interfaces without explanations.
We tried to address these limitations by doing a study in e-learning with adolescents and using two baselines for a comparative study. In more detail, we built a learning platform for mathematics called Wiski and Wiski recommended exercises based on the level of students. These students were middle or high school students, so they were adolescents. And then finally, we also explained the recommendations and compared our explanation interface with two baselines: one interface without explanations and one with placebo explanations. These placebo explanations, by the way, are a kind of pseudo-explanations that do not really reveal anything about the underlying algorithm. Our general research question was then how these explanations affect initial trust in Wiski for recommending exercises.
Before going to the results, let me tell you something about our methods. All questions and all students on Wiski had an Elo rating that changed while students were solving exercises. When we wanted to recommend exercises to a student, we used those Elo ratings in a first step to find suitable questions. In the next step, we ranked those questions with collaborative filtering, looking at students who had solved similar questions. In the end, we recommended the top three questions that were suitable for the student.
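To make this pipeline a bit more concrete, below is a minimal sketch of an Elo update plus the two-step recommendation. The function names, the K-factor, the rating window, and the peer-overlap heuristic are illustrative assumptions, not Wiski's actual implementation.

```python
# Rough sketch of the two-step pipeline described above: Elo-based candidate
# selection, then collaborative-filtering ranking. All constants and the
# similarity heuristic are hypothetical.

def expected_score(student_rating: float, question_rating: float) -> float:
    """Elo expectation: estimated probability that the student solves the question."""
    return 1.0 / (1.0 + 10 ** ((question_rating - student_rating) / 400))


def update_ratings(student_rating: float, question_rating: float,
                   solved: bool, k: float = 32.0) -> tuple[float, float]:
    """After an attempt, shift the student's and the question's ratings in opposite directions."""
    delta = k * ((1.0 if solved else 0.0) - expected_score(student_rating, question_rating))
    return student_rating + delta, question_rating - delta


def recommend(student_id, student_ratings, question_ratings, solved_sets, top_n=3):
    """Step 1: keep unseen questions whose rating is close to the student's.
    Step 2: rank them by how many students with an overlapping history solved them."""
    rating = student_ratings[student_id]
    my_solved = solved_sets.get(student_id, set())
    candidates = [q for q, r in question_ratings.items()
                  if abs(r - rating) <= 100 and q not in my_solved]

    def cf_score(question):
        # Count peers who share solved questions with this student and also solved `question`.
        return sum(1 for peer, solved in solved_sets.items()
                   if peer != student_id and solved & my_solved and question in solved)

    return sorted(candidates, key=cf_score, reverse=True)[:top_n]


# Example: a 1500-rated student solves a 1520-rated question
new_student, new_question = update_ratings(1500, 1520, solved=True)
```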
To explain those recommendations, we built an interface in three iterations and several think-alouds. And this user-centered approach allowed us to evolve from a full-fledged tutorial that provided full transparency to a single-screen explanation.
Here is the explanation in more detail. As you can see, it consists of different parts. The first part is a why-statement that simply said the exercise was recommended based on the user's level. The second is a justification that stated that Wiski expected the student to need X attempts before solving the exercise correctly. And then finally, students could compare themselves with others in this bar chart, which also showed the number of attempts of other students. At the bottom, in green, students could go to an overview of exercises to manually pick exercises instead of following the recommended ones. At the top right, you see our placebo explanation interface, which contained just a single sentence stating that the exercise was recommended because the algorithm of Wiski said so. And then finally, the no-explanation interface of course did not give any motivation for the recommendation.
With these explanation interfaces, we conducted a randomized controlled experiment with three groups. Obviously, each group got one explanation interface. Here's the general flow of our study. Students could select exercises, solve them, and then saw an explanation interface. They did this five times, so every student solved five exercises and saw the explanation interface five times before filling out a post-study questionnaire.
The post-study questionnaire was focused on our research question about initial trust. Here are the 19 Likert-type questions that directly measured initial trust. We did this in two ways: one question measured one-dimensional trust, and the other questions measured multidimensional trust, which is an average of three components. First is intention to return, then perceived transparency, and finally trusting beliefs, which is itself an average of competence, benevolence, and integrity. After each of these groups, we also added an open text field where students could further elaborate on their answers. Besides these direct measurements, we also indirectly measured initial trust by logging whether students accepted recommendations or not.
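For concreteness, here is a minimal sketch of how such a multidimensional trust score could be aggregated, assuming equal-weight averages over the constructs and a 1-5 Likert scale; the paper's exact scoring procedure may differ.

```python
# Illustrative aggregation of the trust constructs described above;
# construct names and equal weighting are assumptions for this sketch.

def mean(values):
    return sum(values) / len(values)


def multidimensional_trust(competence, benevolence, integrity,
                           intention_to_return, perceived_transparency):
    """Each argument is a list of Likert scores (e.g., 1-5) for one construct."""
    trusting_beliefs = mean([mean(competence), mean(benevolence), mean(integrity)])
    return mean([trusting_beliefs,
                 mean(intention_to_return),
                 mean(perceived_transparency)])


# Example: one student's questionnaire responses per construct
score = multidimensional_trust(
    competence=[4, 5, 4], benevolence=[4, 4], integrity=[5, 4],
    intention_to_return=[3, 4], perceived_transparency=[4, 4, 5],
)
```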
Okay, time for the results. We found that real explanations did increase multidimensional initial trust. And as you can see in the tables, both trusting beliefs and perceived transparency increased significantly, whereas intention to return did not. However, at the same time, we also found that one-dimensional trust did not increase significantly, which was quite a surprise. And finally, we found that students with real explanations accepted significantly more recommended exercises. So there are a lot of things we can learn from these results. I will share two main lessons. The first one is that multidimensional trust measures seem to be more nuanced than one-dimensional trust measures. You also see this in the box plots, where the multidimensional trust scores are more finely distributed than the one-dimensional trust scores. A second thing we see here is that overall trust scores were actually rather high. This led us to believe that the explanation interface we built was perhaps not the most important factor for gaining initial trust. Rather, we think that dynamically learned factors were important, for example, perceived accuracy of recommendations, quality of the exercises, the way the platform looked... This was further backed by a correlation analysis. Here you see, for example, that one-dimensional trust was barely correlated with perceived transparency, whereas it was highly correlated with both competence and integrity. So this seems to suggest that the explanation interfaces increased, for example, competence, which in turn increased initial trust.
Regarding the placebo explanations, we found that they did not increase initial trust compared to no explanations. This is different from previous research. We would therefore advise against using placebo explanations as a surrogate for real explanations, for example in situations where real explanations cannot be provided because of a cold-start problem, because, as we saw, placebo explanations may undermine perceived integrity, which in turn leads to lower trust. However, we do think that placebo explanations are a very useful baseline, especially when they are combined with qualitative data. In our study, for example, the open comments allowed us to better understand how critically students viewed the explanations they saw and how much transparency they actually needed.
To conclude, let me take a quick step back and reflect a bit on explanations in e-learning. We distinguish between two contexts. On the one hand, we have low-stakes contexts, for example when students are drilling and just doing a lot of exercises in a short amount of time. In those contexts, we think it may not be necessary to provide full-fledged explanations; rather, it may be sufficient to just show an indication of the difficulty level. For example, in our exercise overview, we showed tags which indicated that exercises were either easy, medium, or hard for the logged-in student. On the other hand, we also have high-stakes contexts, for example when students are preparing for an exam and time is limited. Then it becomes more important to explain the recommendations, and specifically, it's very important to tailor those explanations both to the context and to the students. For example, in our think-alouds, we found that middle school students had a harder time quickly understanding the histogram in our explanation interface.
So I think time is up. On behalf of myself and my colleagues, I would really like to thank you for watching this presentation. I hope you found it inspiring and I hope you continue our work on explaining recommendations for adolescents in e-learning. If you have any questions or comments, do not hesitate to reach out via email or social media. Also, our paper contains lots more details and interesting discussions, so hopefully you find some time to read that as well. Thanks a lot and see you around!