| Article ID | Journal | Published Year | Pages | File Type |
|---|---|---|---|---|
| 559006 | Computer Speech & Language | 2015 | 19 | |
•We integrate user appraisals into a POMDP-based dialogue management procedure.
•We employ additional socially-inspired rewards in an RL setup to guide the learning.
•A unified framework for speeding up policy optimisation and user adaptation.
•We consider potential-based reward shaping with a sample-efficient RL algorithm.
•Evaluated using both a user simulator (information retrieval) and user trials (HRI).
This paper investigates the conditions under which polarised user appraisals, gathered throughout the course of a vocal interaction between a machine and a human, can be integrated into a reinforcement learning-based dialogue manager. More specifically, we discuss how this information can be cast into socially-inspired rewards to speed up policy optimisation for both efficient task completion and user adaptation in an online learning setting. For this purpose, a potential-based reward-shaping method is combined with a sample-efficient reinforcement learning algorithm, offering a principled framework to cope with these potentially noisy interim rewards. The proposed scheme greatly facilitates system development by allowing the designer, in the early stages of training, to teach the system through explicit positive/negative feedback given as hints about task progress. At a later stage, the approach is used to ease the adaptation of the dialogue policy to specific user profiles. Experiments carried out using a state-of-the-art goal-oriented dialogue management framework, the Hidden Information State (HIS), support our claims in two configurations: firstly, with a user simulator in the tourist information domain (and thus simulated appraisals), and secondly, in the context of human–robot dialogue with real user trials.
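To make the core idea concrete, the sketch below illustrates potential-based reward shaping in a simplified tabular Q-learning loop; it is not the paper's HIS/POMDP implementation nor its sample-efficient algorithm. Here the potential Phi(s) is assumed to accumulate polarised user appraisals (+1/-1) attached to dialogue states, and the shaping term F(s, s') = γΦ(s') − Φ(s) is added to the task reward; state/action sizes and hyperparameters are hypothetical.

```python
import numpy as np

# Hypothetical sizes and hyperparameters for illustration only.
GAMMA = 0.95       # discount factor
ALPHA = 0.1        # learning rate
N_STATES, N_ACTIONS = 10, 4

Q = np.zeros((N_STATES, N_ACTIONS))   # tabular Q-function
phi = np.zeros(N_STATES)              # potentials built from (noisy) user appraisals


def record_appraisal(state: int, appraisal: float) -> None:
    """Fold a polarised user appraisal (+1 / -1) into the potential of a state."""
    phi[state] += appraisal


def shaped_update(s: int, a: int, r: float, s_next: int) -> None:
    """One Q-learning step using the task reward r plus the shaping term F(s, s')."""
    f = GAMMA * phi[s_next] - phi[s]            # potential-based shaping reward
    td_target = (r + f) + GAMMA * Q[s_next].max()
    Q[s, a] += ALPHA * (td_target - Q[s, a])
```

Because the shaping term is a difference of potentials, it leaves the optimal policy of the underlying task unchanged while biasing exploration towards states the user has appraised positively, which is what allows such interim feedback to speed up learning without distorting the task objective.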