Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
559006 | 875029 | 2015 | 19-page PDF | Free download |
• We integrate user appraisals in a POMDP-based dialogue manager procedure.
• We employ additional socially-inspired rewards in an RL setup to guide the learning.
• A unified framework for speeding up the policy optimisation and user adaptation.
• We combine potential-based reward shaping with a sample-efficient RL algorithm.
• Evaluated using both a user simulator (information retrieval) and user trials (HRI).
This paper investigates conditions under which polarized user appraisals, gathered over the course of a vocal interaction between a machine and a human, can be integrated into a reinforcement learning-based dialogue manager. More specifically, we discuss how this information can be cast as socially-inspired rewards to speed up policy optimisation, for both efficient task completion and user adaptation, in an online learning setting. For this purpose, a potential-based reward shaping method is combined with a sample-efficient reinforcement learning algorithm to offer a principled framework for coping with these potentially noisy interim rewards. The proposed scheme will greatly facilitate system development by allowing the designer to teach the system through explicit positive/negative feedback, given as hints about task progress, in the early stage of training. At a later stage, the approach will be used to ease the adaptation of the dialogue policy to specific user profiles. Experiments carried out using a state-of-the-art goal-oriented dialogue management framework, the Hidden Information State (HIS), support our claims in two configurations: first, with a user simulator in the tourist information domain (and thus simulated appraisals), and second, in the context of man–robot dialogue with real user trials.
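
To make the reward shaping idea in the abstract concrete, the following is a minimal Python sketch of generic potential-based shaping, where each environment reward is augmented with the term γΦ(s') − Φ(s). It is not the paper's implementation: the function names, the rule that builds potentials from a running sum of +1/−1 appraisals, the `scale` parameter, and the example reward values are all assumptions made for illustration. The sketch also assumes the appraisal history is treated as part of the dialogue state, so that the potential is a function of state.

```python
from typing import List, Sequence


def potentials_from_appraisals(appraisals: Sequence[int], scale: float = 0.5) -> List[float]:
    """Hypothetical potential: scaled running sum of +1/-1 user appraisals.

    Returns one potential per visited state (len(appraisals) + 1 values).
    The initial and terminal potentials are set to zero.
    """
    potentials = [0.0]
    total = 0.0
    for appraisal in appraisals:
        total += scale * appraisal
        potentials.append(total)
    potentials[-1] = 0.0  # terminal state carries no potential
    return potentials


def shaped_rewards(rewards: Sequence[float], potentials: Sequence[float], gamma: float) -> List[float]:
    """Add the shaping term gamma * phi(s') - phi(s) to each environment reward."""
    if len(potentials) != len(rewards) + 1:
        raise ValueError("need one potential per state, i.e. len(rewards) + 1")
    return [
        reward + gamma * potentials[t + 1] - potentials[t]
        for t, reward in enumerate(rewards)
    ]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of a reward sequence."""
    return sum((gamma ** t) * reward for t, reward in enumerate(rewards))


# Illustrative four-turn dialogue: a per-turn penalty and a final success bonus
# (values chosen for the example only).
gamma = 0.95
env_rewards = [-1.0, -1.0, -1.0, 20.0]
appraisals = [+1, -1, +1, +1]

potentials = potentials_from_appraisals(appraisals)
shaped = shaped_rewards(env_rewards, potentials, gamma)
```

Because the shaping terms telescope, the discounted return of the shaped sequence differs from the original only by γ^T Φ(s_T) − Φ(s_0), which is zero here. The interim rewards change turn by turn, yet the total return of the episode is preserved, which is the usual motivation for choosing a potential-based form when the extra signal may be noisy.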
Journal: Computer Speech & Language - Volume 34, Issue 1, November 2015, Pages 256–274