1 Introduction

In this paper we describe a frameworkFootnote 1 for building knowledgeable, collaborative, explainable, multimodal dialogue systems. To illustrate the approach, we present in detail Eva, a fully functional, neuro-symbolic, domain-independent collaborative dialogue system that takes seriously the tenet that the purpose of task-oriented dialogue is to assist the user. Eva attempts to collaborate with its users by inferring and debugging their plans, then planning to overcome obstacles to achieving their higher-level goals. In order to do so, Eva represents and reasons with beliefs, goals and intentions (BDIFootnote 2) of the user and the system itself. Because the dialogue engine is a planner, as the dialogue proceeds the system is able to go beyond scripted, slot-filling, or finite-state dialogue behavior to flexibly generate, execute, and potentially repair its plans using both non-communicative actions and speech acts. As part of its reasoning, Eva performs plan/goal recognition on the user's mental state. Importantly, the system itself decides what to say, not the developer, by obeying the well-studied principles of persistent goals and intentions (see Cohen and Levesque [1]). Moreover, because the framework is centered on the underlying BDI machinery, Eva is able to explain its actions and its plans, thus achieving more trustworthy interactions. Our purpose here is not to argue that Eva is the best framework for implementing task-oriented conversational systems, and certainly not that systems built on top of Eva will necessarily have better performance on any of the common benchmarks. Rather, we stress that the capabilities just mentioned, and which we will detail in this paper, are essential for conversational agents engaged in meaningful, goal-oriented dialogue, yet have been and continue to be neglected by much of the otherwise extensive body of work in this area.

A useful and meaningful dialogue system in a rich task-oriented natural language conversational setting must be collaborative. Indeed, collaboration is so essential to society that we teach our children to be collaborative at a very early age [2]. True collaboration is more than just being “helpful”, in that one could help someone else merely by setting up the “environment” such that the other agent succeeds. For example, we might be helpful with children in such a way that they do not know what we have done to help them. However, most conversational systems, even those dubbed “assistants,” do not know how to be helpful, much less how to collaborate. At the dialogue level, they are generally incapable of inferring and responding to the intention that motivated the utterance. We and others have argued that deep collaboration involves agents’ (mutual) beliefs and joint intentions to ensure that the joint goals are achieved [3,4,5,6]. Whereas physical actions are planned to alter the physical world, communicative acts are planned to alter the (joint) mental and social states of the interlocutors. Dialogue is a special case of collaboration that has properties of its own.

A collaborative dialogue system is able to combine information from a representation of domain and communicative actions, a representation of the world and of its interlocutors’ mental states, and a set of planning and plan recognition algorithms, in order to achieve its communicative goals. Among the actions that are planned are speech acts, some of whose definitions we have given in various papers of ours (e.g., [7,8,9]). The system thus plans its speech acts (e.g., to ask for the user’s name) just like it plans its domain actions (e.g., to make an appointment for the user). The approach dates back to work done at Bolt Beranek and Newman [10,11,12], at the University of Toronto [9, 13, 14], and at the University of Rochester (e.g., [15,16,17,18,19]). Such systems attempt to infer their conversants’ plan that resulted in the communication, and then to ensure that the plans succeed. Recent works in this vein include [5, 17, 20, 21].

We claim this expectation for dialogue and task collaboration derives from implicit joint commitments or shared plans [1, 5, 6, 22] among the conversants, here the task-oriented dialogue system and its user, towards the achievement of the user’s goals. Such a joint commitment implies that the parties will help each other to achieve the jointly committed-to goal, and will inform one another if that goal becomes impossible.Footnote 3 Whereas it is possible to build dialogue systems that reason directly with the formalization of joint intention/commitment [23], the present system attempts to behave according to the principles of joint intention (JI) theory (cf. [18, 19, 24, 25]).Footnote 4 A central feature of this approach is that the system will attempt to infer as much of the user’s plan as it can, will try to identify obstacles to its success, and will plan to overcome those obstacles in order to help the user achieve his/her higher-level goals. Thus, plan recognition and planning are essential to Eva’s architecture and processing.

Though the collaborative plan-based approach to dialogue is an attractive theory that has received many years of research (e.g., [1, 8,9,10, 13, 15, 16, 30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45]), few full-scale implementations incorporate all the components needed to create a system that can engage in useful collaborative dialogues.Footnote 5 In Eva, a dialogue system’s internal state is declaratively specified as logical expressions, which provides a basis for a reasoning system, as well as a formal semantics for the system’s internal states and its inferences. As a result, both we and the system itself are able to explain its inner workings. In other words, the system’s state is not merely latent [46], but transparent.

The planning-based approach detailed here is quite different from approaches in present research or commercial dialogue systems, and even plan-based approaches of the past. Essentially, rather than just populate a system “belief state” made up of so-called “intents” with “slots” and “values” as many present conversational systems do [47], Eva maintains a far more expressive representation of beliefs, goals, and intentions that drive the collaborative planning and plan recognition system. Furthermore, the same reasoning mechanism supports multi-agent (person or bot) dialogues. In doing so, the system uses a rich knowledge representation that describes its user and its domain of discourse.

Although much of the basic theory has been available for some time, among the novelties of this framework is how the theoretical pieces are combined into an operational dialogue system that provides collaborative dialogue capabilities. For example, current generation “intent+slot” systems have great difficulty with the user’s answering a slot-filling question (“what time do you want the reservation?”) with a statement of a constraint (“not before 11am”). Furthermore, such systems assume that it is the user who fills slots. But systems can propose potential slot fillers (“would you like 9 am?”), which a user may then accept, reject, or negotiate. The parties have together filled the slot, not just one of them.Footnote 6 We accomplish the reasoning with a Horn-clause interpreter (realized in PrologFootnote 7), which enables computable inference within its subset of logic. Many researchers have thought the logic of intention too complex to implement and to support a dialogue system. However, Eva systems operate in real time using spoken language and driving a complex avatar. This paper shows the components of Eva that make it possible to develop such conversational systems.

1.1 Sample dialogue

To give the reader an idea of the kinds of dialogues that Eva can support, we present one example set in a vaccination appointment scheduling scenarioFootnote 8. A sample of the assertions comprising the specification of the domain for implementing the dialogue system used here can be found in Appendix C.

1. U: Are there any covid vaccination centers nearby

2. S: Yes, the CVS Covid vaccination center is located at 130 Main St and this location is 6 miles away from you via the following route: turn right, go straight, then turn left.

3. S: Would you like to be vaccinated at a Covid vaccination center?

4. U: Yes

5. S to CVS: Do you have covid vaccine available?

6. CVS to S: Yes

7. S to U: OK, I believe that the CVS covid vaccination center has Covid vaccine.

8. S: How old are you?

9. U: why do you ask

10. S: The reason is that I need to determine whether you are eligible for the vaccine.

11. S: Sorry to repeat myself… How old are you?

12. U: 45 years old

13. S: What is your occupation?

14. U: why

15. S: The reason is I need to determine if you are an essential worker

16. S: Sorry to ask you again, but what is your occupation?

17. U: I am a teacher

18. S: You are eligible for the covid vaccine and appointments are available at CVS on June 26 at 9 am.

19. S: Would you like me to make an appointment at a covid vaccination center?

20. U: Yes, please

21. S: What time would you like the appointment?

22. U: Monday the earliest time available

23. S: The earliest time available is 8 am.

24. S: Would you like me to make an appointment for you at 8 am?

25. U: Yes, please

26. S: Ok I made an appointment for you at CVS on June 26 2023 at 8 am.

27. S: Is there anything else I can help you with?

28. U: yes, i actually cannot make it on monday at 8

29. S: Would you like me to reschedule your appointment at CVS that is currently scheduled on June 26 2023 at 8?

30. U: Yes

31. S: What date would you like me to reschedule your appointment at CVS to?

32. U: Monday at 10

33. S: Ok, I’ve rescheduled your appointment at CVS to a new time on June 26 2023 at 10 am.

34. S: Is there anything else I can help you with?

35. U: No, thanks.

36. S: Glad to be of service. Goodbye!

What this sample dialogue (revisited in greater detail in Appendix B) illustrates is that the system can:

  • Plan its actions to achieve its goals;

  • Infer the goals and intentions that the user likely has;

  • Plan to assist the user by proactively informing what it believes the user needs to know (e.g., the route);

  • Handle the user’s over-answering;

  • Handle constraints (e.g., “the earliest time available”);

  • Reason with domain rules (e.g., a person who is less than 50 years old is eligible for the covid vaccine if that person is an essential worker);

  • Explain its actions (e.g., its reasons for asking about the user’s age and occupation);

  • Generate new knowledge acquisition goals;

  • Reason about who knows the answers to its knowledge acquisition goals (e.g., CVS);

  • Engage in dialogue with multiple parties (e.g., ask CVS a question to find information relevant to user’s goals);

  • Develop and execute a plan to accommodate the user’s changing her mind by achieving the user’s revised goal. In doing so, the system undoes what it had already done in pursuit of the user’s original goal (i.e., it reschedules the user’s appointment from Monday at 8 to Monday at 10).

The rest of the paper describes how the Eva framework makes the above features possible.

1.2 Map of the paper

In Section 2 we describe the system architecture, including its essential representations, its basic operating loop, its embodiment as an avatar, and its logical form meaning representations. Section 3 presents the formalism, mostly drawn from Cohen and Levesque [1], including the base Horn clause logic, the modal operators, and the action representation. Section 4 shows how Eva’s Horn clause modal logic meta-interpreters can reason about and maintain the system’s rational balance among its mental states. Given this machinery, Section 5 discusses how speech acts are represented as planning operators. Section 6 presents our approach to collaboration, especially its reliance on planning and plan recognition. Eva is driven by its beliefs about its own and its user’s goals and intentions, so it is important to know where goals come from. Therefore, we discuss there how goals arise during the course of planning. Section 7 shows how a model of the user’s mental states underlies “slot-filling”, thus formalizing the core feature of most alternative approaches. Section 8 describes our BDI architecture, which reasons with mental state formulas to form plans of action incorporating both domain and communicative actions. This architecture is shown to have an operational semantics based on the specification of rational interaction provided by Cohen and Levesque [1]. Section 9 discusses how the system chooses which of its many intentions to execute next, and Section 10 presents Eva’s approach to maintaining and using context. Among the important features of the example and system is how Eva handles requests for explanation during the dialogue, which we discuss in Section 11.

Because this research digs deeply into the past four decades of a number of branches of AI research, we cannot possibly do justice to all the relevant work, but refer the reader to surveys of some of the relevant literature [49, 50]. We provide some of the major references in the text, and discuss in Section 12 some of the recent work that is treading similar ground, highlighting what makes the Eva framework unique. The appendices provide the details of the logical formalism, a comprehensive walk-through of the system’s behavior during the dialogue shown above, an illustration of the knowledge requirements for defining a dialogue system in Eva, and some details of a small-scale exploratory study on the behavior of agents based on large language models (LLMs).

2 Overall system architecture

As shown in Fig. 1, in Eva we assume a dialogue system that takes as input speech and other modalities (e.g., gesture, sketch, touch, vision, etc.), parses them into meaning representations and fuses their meanings into logical forms (LFs) that incorporate one or more speech acts (sometimes referred to as dialogue acts). Using the same representation of speech acts for planning and plan recognition [8, 9, 14, 17, 20, 51,52,53], the LF is input to a plan recognition process that attempts to infer why the user said/did what was observed. Once a user’s plan is derived, Eva adopts the user’s goals as its own if they do not conflict with its other goals and obligations. The system then collaborates by attempting to find obstacles to the plan [8, 9, 13, 17, 20, 51, 52], which it plans to overcome in order to help the user achieve their higher-level goals, resulting in intended actions. Afterwards, it executes (some of) those intended actions, which may well involve communicating with the user, generating linguistic and multimodal output, including text-to-speech, graphics, and avatar behavior. In the course of this processing, the system may access backend databases and commercial systems to gather and/or update required information and take needed actions. We only discuss the dialogue manager in this paper, but first we mention the system’s inputs and outputs.

Fig. 1 The architecture of a collaborative, planning-based dialogue system

2.1 Natural language parsing and generation

For many current task-oriented dialogue systems, the meaning representation is simply an “intent+slot” representation of an action and its arguments that, it is assumed, the system is being requested to perform [46, 54,55,56]. However, this is too simplistic a meaning representation to support logically expressive dialogues. Eva’s LF meaning representation involves more complex formulas that express both the speech actions that the parties are performing and the content of their utterances, which includes not only domain actions to performFootnote 9, but also complex operator combinations (e.g., comparatives, superlatives, Boolean combinations, temporal constraints, etc.), and operators over actions (e.g., “quitting smoking,” “permitting to be interviewed”). We provide in Section 12.2 a detailed comparison of the slot-filling approach to dialogue with our plan-based approach that incorporates true logical forms (see also [48]). Eva maps utterances into “surface speech acts” [9, 14], from which it infers the intended meaning of indirect speech acts.

Although this paper will not discuss natural language processing per se, we briefly mention in reference to Fig. 1 that Eva employs a semantic parser based on a pre-trained language modelFootnote 10 that we have fine-tuned on pairs of utterances and logical form (LF) meaning representations. The parser returns an n-best list of LF representations that incorporate both the surface speech act and its propositional content. These “surface” LFs are further interpreted by a domain-independent rule-based component in EvaFootnote 11 that determines the best intended speech actFootnote 12 and resolves entity references (e.g., indexicals) based on the discourse context. Finally, the resulting LF is passed to the plan recognition component that starts reasoning about the user’s speech act(s) and what s/he was trying to achieve in performing them. The training of the parser for a new domain is begun with “canonical” (affectionately known as “clunky”) utterances generated from logical forms that are derived from the backend application, as in the “Overnight” approach [58]. These canonical utterances are then paraphrased into natural language, using automated tools and/or crowd-sourcing. We typically also augment the training data with synthetic data, but reduce our reliance on such data as soon as we have collected actual user inputs during system testing and, eventually, actual usage of the system. Because the system makes use of a large-scale multi-lingual language model during the parsing process, when building the system to be used in a language other than English, a relatively small number of human-generated paraphrases of the canonical utterances can be gathered in that language and added to the training data [59].

Natural language generation uses a hierarchical generator driven off the logical form structure that creates “canonical” utterances (the “clunky form”), which are then post-processed with a small set of rules to produce reasonable English output. This output is further passed on to translation services to produce output in other languages supported by our system. We have also successfully explored using LLM-based paraphrasing to go directly from clunky form to human-like surface output. Thus, our symbolic dialogue system determines what to say while the LLM decides how to say it. We do not use large language models by themselves as generators because they are not sensitive to what the system is intending to convey, potentially resulting in inappropriate or untruthful utterances at that stage of the dialogue.
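As a hedged illustration of this division of labor (the canonical wording below is ours, for exposition, and is not the system’s literal intermediate output): the logical form behind utterance 26 of the sample dialogue might first be rendered as the clunky form “I inform you that I have done the action of making an appointment such that the business is CVS and the date is June 26 2023 and the time is 8 am,” which the post-processing rules or the LLM paraphraser would then smooth into “Ok I made an appointment for you at CVS on June 26 2023 at 8 am.”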

Because this paper is about collaborative dialogue, not the natural language processing itself, we will not delve further into the details of the NLP in the rest of this paper. Let us assume the parser can provide a proper logical form and a generator can produce natural language from a logical form.

2.2 Multimodal input/output

Eva has been given a multimodal avatar embodiment (Fig. 2) that accepts spoken language, camera-based input, and text, and produces spoken, textual, GUI, and face/head gestures as output. Fig. 2 is a screenshot of a recent dialogue. The system tracks various emotional states and engagement in real time (shown in the lower-left panel), enabling it to generate responses to the user’s signaled states using its model of the user’s beliefs and intentions [60] (shown in the lower-right panel). Conversely, the system can generate facial gestures and rapport-building utterances to signal various emotional states, such as sadness. The cues for emotions (visual and via text-to-speech) are based on the words being generated, the logical forms generated in context, and discourse-level features (e.g., topic shifts). As an example, suppose the system asks the user “Was there a reaction to that vaccination?” and the user says, “Yes.” The system can generate an empathic utterance (e.g., “That’s too bad!”), even though the user issued a “positive” or “neutral” utterance. There is more to say about Eva’s multimodal capabilities, which will be covered in another paper.

Fig. 2 Eva avatar, vision-based emotion recognition, dialogue, and a snapshot of the system’s beliefs, goals, and intentions

3 Knowledge representation and inference

Below we present the knowledge representation that underlies this framework, which is encoded as modal logic formulas describing the system’s and the users’ beliefs, goals, and intentions. The reason for encoding the system’s knowledge in a logic is that it affords the more expressive representations required for engaging in substantive dialogues about tasks. In the sections that follow, we will provide extensive examples of those representations and the associated reasoning. Our purpose in this framework is to develop the tools sufficient to engage in expressive dialogues, not to focus on logic and reasoning per se. Rather, we believe any system sophisticated enough to engage in such dialogues will need to make the distinctions that are encoded herein.

In this section and in Appendix A, we describe the formalism and its semantics, drawn from Cohen and Levesque [1]. The Eva system uses a Horn clause-encoded first-order modal logic, with constants, variables, typed terms, n-ary predicates, conjunctions, disjunctions, negation of literals, and existential quantification. In addition, we include second-order predicates, such as superlatives, set operators, etc. Predicates and actions take a list of arguments, with each argument specified by a role name, a variable, and a type drawn from an ontology. Syntactically, we adopt the Prolog convention of beginning variable names with capital letters. Thus, an argument list will be of the form: [Role:Variable#Type …]. Note that this use of the term ‘role’ is derived from natural language processing, as opposed to the planning literature. From here on, assume all arguments presented have this form, though we will typically specify only the Variable. Importantly, Eva’s knowledge representation incorporates a belief operator (bel), along with two operators defined in terms of bel, for knowing whether a formula holds (knowif) and knowing the referent of a description (knowref). Likewise, it incorporates persistent goal (pgoal) and intend as modal operators, with pgoal taking a formula as an argument, and intend taking an action expression (see Section 3.5). Attribution of mental states, as well as plan recognition and planning, involves uncertainty. Each mental state has a probability argument, which will not be discussed in this paper. The probability of a pgoal or intention declines as more rules are applied, and as fan-outs (disjunctions) are created during planning and plan recognition. Where available, prior probabilities on intended actions are multiplied, lowering the score. A separate utility calculation could be performed for each intended action, with some domain predicates having negative utility (e.g., dead(usr)), and some positive utility (e.g., vaccinated(usr)). We make no claims to having a theory of what utilities various actions and states should have.

To aid in understanding the formulas and reasoning that Eva uses, we note the following briefly.Footnote 13 Predicates over action expressions include do, doing, and done. We also allow specific times to be arguments and have special “before” and “after” operators that apply to them. Eva takes advantage of basic Horn clause reasoning (as in Prolog), and incorporates two meta-interpreters for forward (‘→’) and backward (‘istrue’) modal logic reasoning, which are used to assert and to prove formulas, respectively. Prolog incorporates the not-provable operator ‘\+’ and the does-not-unify operator ‘\=’, both of which we also use in the meta-interpreter. Thus, the system tries its best to prove a proposition P given the rest of the system’s mental states and inference rules, but non-provability of P is different from proving ~P. The system is capable of saying “I don’t know”.
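The following minimal Prolog sketch illustrates this three-way behavior. It is purely illustrative: istrue here is a one-clause stand-in for Eva’s actual meta-interpreter, and answer, fact, and has_vaccine are our own names, not Eva’s.

:- op(900, fy, ~).                 % prefix negation used in stored formulas
answer(P, yes) :- istrue(P), !.    % P is provable
answer(P, no)  :- istrue(~P), !.   % ~P is provable
answer(_P, dont_know).             % neither is provable: "I don't know"
istrue(P) :- fact(P).              % stand-in for the real meta-interpreter
fact(has_vaccine(cvs)).
% ?- answer(has_vaccine(cvs), A).  yields A = yes
% ?- answer(eligible(usr), A).     yields A = dont_know, not A = no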

3.1 Mental state modal operators

The attitude modalities we employ are Belief (bel), Persistent Goal (pgoal), and Intending to Perform Actions (intend),Footnote 14 along with two defined operators, knowif and knowref.

3.1.1 Belief

Modal logics of knowledge and belief have been well studied, dating back to Kripke [61] and Hintikka [62]. We will use a Kripke-style possible worlds semantics of belief, such that the propositions an agent believes are those that are true in all possible worlds related to the given one, i.e., in all the worlds the agent thinks s/he might be in. Thus, for a well-formed formula P we say:

Syntax: bel(X, P) means agent X believes formula P; it holds if P is true in all of X’s belief-related worlds. (See Appendix A).

If P is true in some of X’s belief-related worlds, but not in all of them, then the agent neither believes P nor believes ~P.

The bel modal operator also has a positive introspection property — if the agent has a belief that P, it has a belief that it believes that P, and conversely. However, we do not adopt a negative introspection property — if the agent does not believe that P, it does not have to believe that it does not believe that P.

For example, we will want to be able to state formulas such as (after utterance 21 in the example from Section 1.1) that the user believes that the system wants to find out the date on which the user wants the system to make an appointment for the user. One can see that there are multiple embeddings of modal operators in such a formula. By the end of this section, the reader will be able to see how such formulas can easily be expressed. The syntax and formal semantics of our modal operators are given in Appendix A.
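Anticipating the notation developed in the rest of this section (the make_appointment arguments follow the example used in Section 3.3.3; the formula is a sketch rather than a verbatim entry from Eva’s database), that state after utterance 21 can be written as:

$$bel(usr, pgoal(sys, knowref(sys, {Date}^{\wedge} pgoal(usr, done(sys, make\_appointment(usr, Business, Date, Time)), Q)), R))$$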

3.1.2 Goal

The model of goals adopted in Cohen and Levesque [1] is that goals encode agents’ choices. This is modeled in a possible worlds semantics by having goal formulas P be true in all of the agent’s chosen worlds. These are not simple desires because desires can be inconsistent. However, the goals P can be false at some time and true at others, as would be the case for achievement goals. Among the worlds compatible with the agent’s beliefs are all the ones compatible with the agent’s goals (choices). In other words, by assumption, the chosen worlds are a subset of the worlds consistent with the agent’s beliefs.Footnote 15 That does not mean that if the agent has P as a goal, it believes P is true. Instead, we require the converse (see Appendix A) — if the agent believes formula P is true, it is true in all belief-compatible worlds, so it is true in all chosen worlds, which are a subset of the belief-compatible ones. The agent must choose what it believes to be true “now.”

Syntax: goal(X, P) — X has a goal that P is true, if P is true in all of X’s chosen worlds. Because of the way formulas are evaluated in terms of a world and time, P is true in the chosen world at the time of evaluation. Most often, however, P will incorporate a temporal predication.

Agents are not unsure of their goals – if an agent has the goal that P, it believes it has the goal that P and vice-versa. Given the realism constraint discussed above, the goal modal operator also has a positive introspection property — if the agent has a goal that P, it has a goal that it has the goal that P, and conversely. However, as with belief, we do not have a negative introspection property of goal itself. Finally, based on the semantics given in Appendix A, an agent cannot have as a goal a formula that it believes to be impossible (the agent believes it will never be true.)

3.2 Basic axioms of the modal logic

Given these operators, we provide axioms that they need to satisfy. We assume all axioms of first-order logic. The system supports belief and goal reasoning with a KD4 semantics and axiom schema [64, 65] (see below and Appendix A for details). Specifically, bel(X,P) and goal(X,P) mean that P “follows from X’s beliefs/goals,” respectively. ‘\(\models\)’ means ‘is satisfied in all worlds’:Footnote 16

K: If P is a theorem, \(\models\) bel(X, P) and \(\models\) goal(X, P) – theorems are true in all worlds

\(\models\) bel(X, P \(\supset\) Q) \(\supset\) (bel(X, P) \(\supset\) bel(X, Q)) – agents can reason with their beliefs

D: \(\models\) bel(X, P) \(\supset\) ~bel (X, ~P) – agents’ beliefs are consistent

\(\models\) goal(X, P) \(\supset\) ~goal (X, ~P) – agents’ goals are consistentFootnote 17

4: \(\models\) bel(X, P) ≡ bel(X, bel(X, P)) – positive belief introspection

\(\models\) goal(X, P) ≡ goal(X, goal(X, P)) – positive goal introspection

Realism: \(\models\) bel(X, P) \(\supset\) goal(X, P) – agents’ chosen worlds include what they believe to be currently true

For this system, material implication ‘\({\supset }\)’ is approximated using Prolog’s Horn clause reasoning.

Whereas the axioms above license the system’s belief reasoning as sound relative to the semantic model, the system is incomplete and does not derive all the logical consequences of its beliefs.

3.3 Defined modal operators

A critical aspect of a system that reasons about its users’ mental states is that it can only have an incomplete model of them. Specifically, we need the system to be able to represent that an agent “knows whether or not” a proposition is true, without its knowing what the agent believes to be the case. For example, we might model that the user knows whether or not her car is driveable. Likewise, the system needs to represent that an agent knows the value of a function or description, without knowing what the agent thinks that value is. For example, the user knows his/her birthdate, social security number, etc. Because Eva does not use a modal operator for knowledge, we define the concepts in terms of belief below.Footnote 18 Whereas a number of recent epistemic planning systems incorporate a knowif operator, as applied to propositional formulas, none of them incorporate a modal logic knowref operator as applied to first-order formulas and their arguments.

3.3.1 Knowing whether (knowif)

Given the modal operator for belief, we can define operators that capture incomplete knowledge about another agent’s beliefs. Specifically, we define an agent “knowing if/whether” a formula P holds as follows [8, 9, 13, 14, 39, 66, 67]:

$$knowif\left(X,P\right)\ {=}_{def}\ bel\left(X,P\right) \lor bel\left(X,{\sim}P\right)$$

In words, an agent X’s ‘knowing whether or not’ P is true is defined as: either X believes P is true, or X believes P is false. Notice that this is different from X’s believing that (P or ~P) holds, which is a tautology. One agent X can believe that another agent Y knows whether or not P holds without X’s knowing what Y in fact believes. However, X believes Y is not undecided about the truth of P. For this reason, in our example dialogue S can ask CVS a yes-no question as to whether CVS has covid vaccine, because S believes CVS knows whether or not it has the vaccine.
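For instance, with an illustrative domain predicate has_vaccine (not necessarily the predicate used in the actual domain specification), the belief that licenses addressing that yes-no question to CVS can be sketched as:

$$bel(sys, knowif(cvs, has\_vaccine(cvs)))$$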

3.3.2 Knowing the referent of a description (knowref)

Of critical importance to dialogue systems is the ability to represent and reason with incomplete information about functions, values, etc. Specifically, a system needs to represent that an agent knows what the value of the function or term is without knowing what that agent thinks that value is (or else the system would not need to ask). For example, we should be able to represent that John knows Mary’s social security number without knowing what John thinks it is. This is weaker than believing that the agent thinks the value is a constant, but stronger than believing that the agent thinks there merely is a value (i.e., Mary has a social security number). The classical way to represent such expressions in the philosophy of mind is via a quantified-in belief formula [68,69,70]. Accordingly, we define an agent X’s knowing the referent of a description asFootnote 19:

$$knowref (X, {Var}^{\wedge}\negmedspace Pred)\ {=}_{def} \ \exists Var\ bel(X, Pred), \text{where}\ Var\ \text{occurs free in}\ Pred$$

The semantics for quantifying into beliefs and goals is given in Appendix A; in short, this expresses that there is some entity that the agent believes has the property Pred. Semantically, a value is assigned to the variable Var, and it remains the same value in all of the agent’s belief-compatible worlds. For example, the system may believe that agent X knows the referent of X’s social security number, without the system’s knowing what that number is. Notice that this is different from the system’s believing that X believes X has a social security number, for which the existential quantifier is within the scope of X’s belief. We argue below that the overriding concern of the present task-oriented dialogue literature involving “slot-filling” is better modeled with goals to knowref.
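With an illustrative predicate ssn(Person, Number), the contrast is between

$$knowref(john, {N}^{\wedge} ssn(mary, N)) \qquad \text{versus} \qquad bel(john, \exists N\ ssn(mary, N)):$$

the first says there is a particular number that John believes to be Mary’s social security number, while the second says only that John believes some such number exists.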

Building systems that could represent and reason with this formula was a major accomplishment of the early work in plan-based dialogue systems [8, 9, 14, 34, 51, 71, 72], and see also Moore [73]. Yet, apart from the work from Allen’s group, no recent systems have undertaken to plan speech acts (or any acts) using such representations. Indeed, the many epistemic planning systems that use the PDDL language or other propositional logic base cannot do so. Cohen [48] shows that the task-oriented dialogue paradigm can be recast in our modal logic incorporating these expressions for so-called “slot-filling” goals. Moreover, we show there and here how one needs to be able to quantify into modalities other than just belief, such as persistent goals.

3.3.3 Intention as a species of persistent goal (pgoal)

Intention is a concept critical to virtually all analyses of human cognition, including common sense, legal, philosophical, psychological, practical, linguistic, pragmatic, and social analyses. The philosophies of mind and language include subspecialities on theories of intention, including notable works by Anscombe [74], Bratman [75], Grice [26], Searle [76], and many others. Following Grice, intention is a critical concept to enable an agent to reason about what a person meant in saying something, which led the Toronto group to focus on plans and plan recognition. However, the concepts that their plans used in that early work were only initial approximations to mental states. Cohen and Levesque’s analysis [28], which was further developed in Cohen and Levesque [1], provided a formalization of intention in a modal logic that shared with Bratman’s the notion of commitment. However, unlike Bratman, Cohen and Levesque [1] defined intention in terms of goals (choices) that persist subject to certain conditions driven by the semantics and by the relativization of intentions on other mental states. Such internal commitments were shown to satisfy Bratman’s desiderata for a theory of intention.

The representation of intention as a species of persistent goal (pgoal) is the basis for Eva’s operation.Footnote 20 The idea is that simple goals can be ephemeral, so one cannot count on a simple goal or choice as having any consequences on an agent’s future actions. Rather, agents need to be committed to their choices [28, 75]. We will typically assume in this paper that the phrases “want” and “need” will be translated into pgoal in the logical forms.

We represent by pgoal(X, P, Q) the fact that agent X has a persistent goal to bring about the state of affairs satisfying formula P, relative to formula Q, and define it asFootnote 21:

$$pgoal(X, P, Q)\ {=}_{def}$$
$$goal(X, \Diamond P) \land {\sim}bel(X, P) \land before((bel(X, P) \lor bel(X,\Box {\sim}P) \lor {\sim}bel(X, Q)),{\sim}goal(X, \Diamond P))$$

That is, X does not currently believe P to be true and has a goal that P eventually be true (P is thus an achievement goal); X will not give up P as an achievement goal at least until it believes P to be true, impossible, or irrelevant. By ‘irrelevant’ we mean that some other condition Q, typically encoding the agent’s reasons for adopting P, is no longer believed to be true.Footnote 22 We frequently will relativize one persistent goal to another as part of planning. If the agent does not believe Q, it can drop the persistent goal. However, if none of these conditions hold, the agent will keep the persistent goal. Cohen and Levesque [1] showed conditions under which having a persistent goal would lead to planning and intention formation. See the formal semantics in Appendix A.

For example, in utterance 19 of the sample dialogue, the system asks the user if the user wants (has as pgoal) that the system make an appointment for the user. The pgoal that led it to plan this question is represented asFootnote 23:

$$pgoal(sys, knowif(sys, pgoal(usr, P, Q)), R),$$

where

$$P=done(sys, make\_appointment(usr, Business, Date, Time))$$

That is, the system has a persistent goal relative to R to come to know whether the user has a persistent goal to achieve P, relative to Q. When the user says “yes”, the system adopts the persistent goal to make an appointment, which is relative to the user’s persistent goal that the system do so (R). If the system learns that the user no longer wants the system to make the appointment, the system can drop its persistent goal and any parts of the plan that depend on it. Otherwise, the system will keep that persistent goal and plan to achieve it.

As a second example, in utterance 21 the system asks what date the user wants the appointment. This question is planned because the system has created a pgoal, relative to its wanting to make an appointment for the user, to know the Date such that the user has a pgoal (relative to some other proposition Q) that the system make an appointment on that dateFootnote 24.

$$pgoal(sys, knowref(sys,{Date}^{\wedge }pgoal\left(usr, P, Q\right)), R),$$

where

$$P=done(sys, make\_appointment(usr, Business, Date, Time))$$

3.3.4 Intention to do (intend)

Agent X’s intending to perform action Action relative to formula Q is defined to be a persistent goal to achieve X’s eventually having done Action:

$$intend(X, Action, Q)\ {=}_{def}\ pgoal(X, done(X, Action), Q).$$

In other words, X is committed to eventually having done the action Action relative to Q. An intention is thus a commitment to act in the future.Footnote 25 Notice that intentions take (potentially complex) actions as arguments and are relativized to other formulas, such as pgoals, other intentions, and beliefs. If a relativization condition is not believed to be true, the agent can drop the intention. For example, in the sample dialogue at utterance 21, we can see the system’s forming the intention to ask a wh-question relative to the persistent goal to knowref, and then executing the intention.
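As a sketch of that state (the wh-question speech act name askref and its argument order here are illustrative; the actual speech act definitions appear in Section 5), with P = done(sys, make_appointment(usr, Business, Date, Time)) as in the previous subsection:

$$intend(sys, askref(sys, usr, {Date}^{\wedge} pgoal(usr, P, Q)),\ pgoal(sys, knowref(sys, {Date}^{\wedge} pgoal(usr, P, Q)), R))$$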

This notion of intention is different from the current (mis)use of the term “intent” in the “intent+slot” dialogue literature, as it incorporates an internal commitment to act as part of a plan and relative to having other mental states. “Intent” in the dialogue literature is basically taken to refer to an action that the user supposedly wants the system to perform, or occasionally, to refer to an ordinary predicate. Here, we treat intend as a modal operator, with a formal logic and semantics (see Cohen and Levesque [1]).

3.4 Defaults

Eva operates with a collection of defaults [79], many of which are targeted at its user model and domain facts. The general default schema is, using the Prolog operators ‘:-’ (‘is true if’), ‘\=’ (‘not unifiable’), and ‘\+’ (‘not provable’):

$$istrue\left(P\right)\ \text{:-}\ default\left(P\right), P\ \backslash\!\!=\ {\sim}Q,\backslash\!\!+\ {\sim}P.$$

That is, P is true if P is stated as a default predicate, P is not the negation of a formula Q, and it cannot be proven that ~P. If P is a default formula, but its negation can be proven, then istrue(P) does not hold. These are among the normal defaults [79].

For example, we might have the following default schemas:

  • default(driveable(car_of(usr))) — by default, the user’s car is driveable.

  • default(knowif(usr, damaged(mobile_phone(usr)))) — by default, the user knows whether the user’s phone is damaged.

  • default(knowif(usr, pgoal(usr, Pred, Q))) — by default, the user knows whether the user has a pgoal that Pred be true.

  • default(knowif(usr, knowref(usr, Var^Pred))) — by default, the user knows whether the user knows the referent of Var^Pred.

The previous two default schemas have schematic variables for the formulas and variables of interest (Pred and Var, and of course the relativizer Q). When those are instantiated with actual formulas and variables, a specific default can be queried.
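To illustrate with the driveable default above (a sketch of the intended behavior only; the real istrue meta-interpreter has many more clauses than the single schema shown here):

?- istrue(driveable(car_of(usr))).
% Succeeds: the default is stated and ~driveable(car_of(usr)) cannot be proven.
% If the system later comes to believe the contrary, e.g., the user says
% "my car broke down" and ~driveable(car_of(usr)) becomes provable, then the
% \+ ~P test fails, the default clause no longer applies, and the same query fails.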

3.5 Action expressions and their description

Of particular importance to a planning system is the representation of actions. In this section, we provide action expressions that incorporate both primitive and composite actions. Using an extended STRIPS-like description of actions, primitive actions are described using an action signature, which consists of the action name and a list of arguments, where each argument is of the form Role:Value#Type.Footnote 26 The only requirement for an action is that there be an agent role and filler of type agent (which is a concept in the ontology). Fig. 3 shows an English gloss of the description of the action of a vaccine center’s vaccinating a patient at a location, date and time. Composite action expressions will be detailed below. We will use the term “action” to mean both primitive and composite action expressions.

Fig. 3 The action description of a Vaccination Center’s vaccinating a Patient

Action descriptions state, in addition to the signature, the action’s Precondition, Effect, Applicability Condition, and Constraint, which are well-formed formulas. The Precondition is that the Patient has an appointment at a certain Date and Time, and that the Patient is located at location Loc, which the Constraint stipulates to be the location of the vaccination center. The Effect states conditions that the agent wants (i.e., has as a pgoal) to be true at the end of the action. The Effect stated here is that the Patient is vaccinated by the Center.
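A hedged rendering of Fig. 3’s English gloss in the Role:Variable#Type notation of this section might look roughly as follows (the role, type, and predicate names are illustrative and not taken verbatim from Eva’s domain files; the Applicability Condition anticipates the discussion later in this subsection):

vaccinate([agent:Center#vaccination_center, patient:Patient#person, location:Loc#location, date:Date#date, time:Time#time])
  Precondition: has_appointment(Patient, Center, Date, Time) and located_at(Patient, Loc)
  Effect: vaccinated(Patient, Center)
  Applicability Condition: has_vaccine(Center) and eligible(Patient)
  Constraint: Loc = location_of(Center)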

We define four functions over action expressions Action, providing the well-formed formulas that are the action’s precondition, effect, constraint, and applicability condition:

$$\begin{aligned}&precondition(Action, P),\\& effect(Action, E),\\& constraint(Action, C),\\& applicability\_condition(Action, A)\end{aligned}$$

Because the action description is just that, a description, rather than a modal operator (as in dynamic logic), the planner and reasoner cannot conclude that after an action is performed, the effect E in fact holds, or that it holds conditioned on the precondition. Furthermore, there is no attempt here to prove that after a (potentially complex) action has been performed, the complex action expression or plan is a valid way to achieve the effect (cf. [52]). Rather, as a description of an action, the stated effect E is described as the desired or intended effect, which is realized by an agent’s having a pgoal to achieve effect E.

Given a pgoal to achieve P, Eva’s planning subsystem will attempt to find an action expression whose effect E as stated in the action description unifies with P. If P is a complex formula, planning rules will decompose it, and attempt to find more primitive actions that achieve the simpler components. For example, it may decompose effects that are conjunctions into individual plans to achieve the conjuncts. But, as is well-known, this may well be problematic (cf. [80]), and thus Eva’s planning is only a heuristic approximation to finding a plan that truly achieves the overall goal P. We will discuss below a special case in which we do in fact engage in such goal decomposition. Still, this limitation may not matter for a dialogue system, as we are not attempting to reason in advance about a long sequence of utterances during an interactive dialogue. Rather, the system attempts to take a step towards its goal and react to the inputs it receives, much as other BDI architectures do [81]. However, unlike such architectures, Eva engages in backward-chaining and plan recognition, and reasons about other agents’ mental states. Moreover, other BDI architectures are oriented towards communication among artificial agents using KQML communicative actions, which we have criticized elsewhere [82]. Instead, we use a well-founded set of speech acts inspired by natural language communication.

If the agent has the pgoal that a formula E holds, and that formula unifies with the effect of some action expression A, it would add a pgoal to perform A (i.e., an intention to do A). It would then add A’s precondition as a pgoal to the plan, if it does not believe the precondition holds. This backward-chaining may continue via effect-act-precondition reasoning. Finally, the Applicability Condition (AC) states formulas that must be true, but cannot be made true by the agent. For example, Fig. 3 shows the description of the vaccination action. The AC here is that the vaccine center has the vaccine and the Patient is eligible for the vaccine. If such ACs are believed to be false (vs. not believed to be true), the agent believes the action A is impossible to perform.Footnote 27 Thus, the agent would not create a persistent goal or intention to perform action A because this violates the definition of persistent goal. If the system does not know whether the AC holds, it creates a pgoal to knowif(sys,AC) and blocks the intention from further inference. If it learns the AC is false, and the pgoal to do the action has been created, it would remove that pgoal and any pgoals that depend on it. Further discussion can be found in Section 6. Finally, the system may represent actions hierarchically with actions optionally having a Body, which would decompose into a complex action described with the dynamic logic operators below.
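The backward-chaining step just described can be sketched as a pair of planning rules in the spirit of Eva’s rule base (a simplification that omits probabilities, applicability-condition handling, and other conditions; ‘\(\Rightarrow\)’ is the planning/plan recognition rule operator introduced in Section 4):

pgoal(X, E, Q), effect(Act, E) \(\Rightarrow\) intend(X, Act, pgoal(X, E, Q))

intend(X, Act, Q), precondition(Act, P), \+ istrue(bel(X, P)) \(\Rightarrow\) pgoal(X, P, intend(X, Act, Q))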

Notice that this action expression language is an extension of the hierarchical variants of the Planning Domain Definition Language (PDDL) [83] such as HDDL [84]. In particular, we provide descriptions of composite actions and add preconditions and effects on higher-level actions in addition to the primitives [85]. Also of importance, Eva allows action expressions as arguments to other actions, supporting directive and commissive speech acts like requesting, recommending, etc. We will see such a speech action definition in Section 5.8.

As part of the action’s signature, the system keeps track of the agent doing it. The action description action(Agent, Action, Constraint) indicates that agent Agent is the agent of the Action expression such that Constraint holds. However, Action itself has an agent role (call its value Agent1). In most cases, the two agents are the same, but they need not be. By allowing for them to be different, we can represent that Agent does something to help or get Agent1 to perform the Action. An example might be submitting an insurance claim on behalf of someone else. Doing actions on behalf of (to the benefit of) someone else may require explicit agreement or permission to do it.

3.5.1 Predicates over action expressions

These predicates allow us to say that an action will be done in the future (do), is being executed (doing), or has occurred in the past (done). Unlike in Cohen and Levesque [1], we also provide explicit time arguments for these predicates, for which it is simple to provide semantics using the logical tools in Cohen and Levesque [1]:Footnote 28

  • do(action(Agent, Action, Constraint), Location, Time) — Action will be done at location Location and Time in the future.Footnote 29

  • done(action(Agent, Action, Constraint), Location, Time) — Action has been done at Location and a past Time.Footnote 30

  • doing(action(Agent, Action, Constraint), Location, Time) — Action is ongoing at Location and Time.

One additional predicate that we adopt that was not in Cohen and Levesque [1] is failed(Action, Reason). Eva uses this predicate when it sends an action invocation request to a backend system (e.g., asking a credit card validation system to validate a particular card number). If the backend system returns a failure, the Reason parameter encodes the cause of the failure, such as the card is invalid, overlimit, etc. It can be text sent via the backend or a logical form. Eva assumes that a failure means the action is impossible, so it drops the intention to perform it, though it may try to find another way to achieve the higher-level goal.
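For example (the backend action validate_card and the reason token are illustrative), a declined card validation might come back as

$$failed(action(sys, validate\_card(usr, CardNumber), Constr), overlimit),$$

whereupon Eva drops its intention to perform that action and, if possible, plans another way to achieve the higher-level goal.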

3.5.2 Complex actions combinators

We provide complex action expressions with combinators drawn from dynamic logic [86], in addition to a combinator for hierarchical decomposition.Footnote 31 As we develop the action expressions below, the predicates will become more realistic.

Specifically, we have:

  • Sequential actions: seq(A, B), e.g., seq(find_identification_number(X,P), informref(X,Y,Q))

  • Conditional actions: condit(P, A), e.g., condit(eligible(X), make_appointment(X,B,D,T))

  • Non-deterministic mutually exclusive OR: disj(A, B), e.g., disj(inform(S,U,P), inform(S,U,~P))

In addition to the four action description functions discussed above, we now add two predicates for defining hierarchical action expressions.

$$\begin{array}{l}body(Action, ActionBody)\\ in\_body(ActionBodyElement, Action)\end{array}$$

The first predicate (body) maps an action into an ActionBody compound action expression. There may be more than one Action that decomposes into the ActionBody, but each Action only has one Body. The in_body predicate relates an element of an ActionBody with the higher level Action of which it is a part. It could be one of the sequential actions, or a conditional or disjunctive action. There could be more than one Action that contains ActionBodyElement. The predicate in_body looks for any action within a named complex action expression, searching through the entire complex action library.

As an example of action decomposition relationships, we have the following:

$$\begin{aligned}&body(informif(S,H,P), disj(inform(S,H,P),inform(S,H,{\sim}P)))\\& \ in\_body(inform(S,H,P), informif(S,H,P))\end{aligned}$$

This example shows that an informif speech act from S to H that P (i.e., informing whether P is true) can be decomposed into the disjunctive act of S’s informing that P or informing that ~P. The precondition and effect of the informif action are determined by the disjunctions of the preconditions and effects of the constituent inform actions [30, 39]. Thus, the precondition can be shown to be knowif(S,P). The in_body assertion shows that an inform action is part of the body of the informif action.

We define one more predicate to provide the unbound or unknown variables in the Action. In order to execute any Action, the system needs to know what that Action is [73]. Therefore, we define

$$unk\_oblig\_arg(Action, Role:Var\#Type)$$

to say that the intending agent Agt does not know the value of one of the obligatory variables (namely Var, of type Type, filling the given role Role) of an action she intends to execute. That is,

$$\backslash\!\!+\ knowref (Agt,{Var}^{\wedge }intend(Agt, Action, Q))$$

with Var being a free variable in Action. If Agt does not know the value of the variable, but a value for that variable is required for the successful execution of Action, the system will eventually create a pgoal to knowref that value which may then lead to planning a question (this process would be repeated for all other unknown obligatory arguments of Action).

4 Reasoning about mental states and their combinations

We now discuss how Eva reasons with the above formulas. There are two meta-interpreters used to reason about modal formulas, one for proving (istrue) and one for asserting (→). Modal formulas are proven with istrue invoked from a standard Prolog rule ‘:-’. Non-modal formulas are put into negation normal form (negations only take literals as their scope), and are proven using a standard Prolog interpreter. The assertional meta-interpreter → handles assertions of modal formulas, whereby instead of asserting the left-hand side (LHS), the right-hand side (RHS) is asserted. With it, we ensure that the least embedded formula possible is entered into the database, subject to the logical semantics. This also means that the LHS clause would not be found in the database because the → meta-interpreter is rewriting the LHS into the RHS. Finally, we have a rule operator ‘\(\Rightarrow\)’ for planning and plan recognition rules. The LHS of ‘\(\Rightarrow\)’ is proven using istrue, and the right-hand side is asserted with →.

The system is driven by its maintaining the rational balance among mental states. Thus, it is able to reason about one agent’s goals to get an agent (itself or another agent) to believe or come to know whether a formula is true, to intend to perform an action, or come to know the referent of a description. The semantics of the individual modalities is expressed in the possible worlds framework, which describes the meanings of the combinations of these formulas and the implications of their becoming true. Moreover, the theory of intention in Cohen and Levesque [1] is able to provide a semantics for the internal commitments that the intender takes on, and the conditions under which those commitments can be given up. The system is built to provide an operational semantics that accords with the logical ones through its inference rules and BDI architecture.

In addition to the axioms for belief and goal found in Section 3.1, we give below examples of inference and rewriting rules that the system employs with respect to cross-modal inference. These rules together are an attempt to maintain the least embedded formulae possible, subject to the semantics of bel, goal/pgoal, intend, knowif, and knowref. Sometimes the same formula will be in both forward and backward reasoning rules below because there may be cases where a condition needs to be proved, even though the system has not yet tried to assert it and thus create the more compact version.

4.1 Meta-Logical Interpreter: Proving

Proving is performed via the meta-predicate istrue, using a number of rules shown here in Prolog notation:Footnote 32

  • istrue(bel(X, bel(X,P))) :- istrue(bel(X, P))

  • istrue(bel(X, P \({\land }\) Q)) :- istrue(bel(X,P)), istrue(bel(X,Q))

  • istrue(bel(X, pgoal(X,P))) :- istrue(pgoal(X,P))

  • istrue(bel(X, knowref(X, Var^Pred))) :- istrue(knowref(X, Var^Pred))

  • istrue(bel(X, knowref(X, Var^(Pred, Cond)))) :- istrue(knowref(X, Var^Pred)), istrue(Cond)

  • istrue(bel(X, exists(Var^Pred))) :- istrue(knowref(X, Var^Pred))

  • istrue(knowif(X, P)) :- istrue(bel(X, P)) \(\vee\) istrue(bel(X, ~P))

  • istrue(bel(X, P)) :- istrue(bel(X, (P :- Q))), istrue(bel(X,Q)). This models the system’s being able to reason about another agent’s belief reasoning. Here, the agent (X) has a belief about a Horn clause rule ‘P :- Q’. There would be assertions in the database about the agent’s domain-specific Horn clause rules.Footnote 33 It approximatesFootnote 34 with Horn clauses the material implication in axiom K:

    $$\models bel(X,P\supset Q)\supset (bel(X,P)\supset bel(X,Q))$$

For the sake of concision, we omit other rules of no particular interest (e.g., meta-interpreting conjunctions and disjunctions).

4.1.1 Proving done as applied to complex action expressions

There are several predications that take action expressions as an argument, namely do, done and doing. These predicates take a list of arguments, the first of which specifies the agent, then the action, and finally the location and time. We take do and done to be satisfied at a future/prior time, respectively (as they take a specific time as an argument). In addition to taking primitive actions as arguments, these predicates are defined over complex action expressions, such as conditional, sequential, and disjunctive action expressions. For example, a disjunctive action expression has been done at some location and time if either of its disjuncts has been done at that location and time.

$$\begin{array}{l}istrue\left(done\left(action\left(Agt, disj\left(Act1, Act2\right), Constr\right), Loc, Time\right)\right)\ \text{:-}\\ \qquad istrue(done(action(Agt, Act1, Constr), Loc, Time)) \lor istrue(done(action(Agt, Act2, Constr), Loc, Time)).\end{array}$$

Currently, we have found it sufficient to say that a conditional action condit(P,Act) has been done if Act has been done and the predicate P is true.Footnote 35

$$\begin{array}{l}istrue(done(action(Agt, condit(Pred, Act), Constr), Loc, Time))\ \text{:-}\\ \qquad istrue(Pred)\land istrue(done(action(Agt, Act, Constr), Loc, Time)).\end{array}$$

We say a sequential action seq(Act1, Act2) has been done if Act1 has been done, and Act2 has been done afterwards in a circumstance in which Act1 had already been done.

$$\begin{aligned}&{}istrue(done(action(Agt, seq(Act1, Act2), Constr), Loc, Time))\ \text{:-}\\& istrue\left(done\left(action\left(Agt, Act1, Constr\right), Loc1, Time1\right)\right),\\&istrue(done(action(Agt, condit(done(action(Agt, Act1, Constr), Loc1, Time1), Act2), \\&Constr), Loc, Time)).\end{aligned}$$

The doing predicate applies to actions that have a hierarchical decomposition, such that once the higher-level action has been decomposed into its body, and the system is executing one element of the body, then the system asserts that it is doing the higher-level action. If the system has done the last element of the body, then, for the higher-level action doing is retracted and done is asserted.

We currently do not reason with the full linear temporal logic of Cohen and Levesque [1]; see Gutierrez et al. [87] for an example of how to do that in a multi-agent context.
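As an illustration of how done can be proved over complex action expressions, the following self-contained Prolog sketch mirrors the three rules above; hist/3 (a record of completed primitive acts), the sample history, and the single istrue/1 clause are our own assumptions.

  % Illustrative sketch: proving done over complex action expressions.
  :- dynamic hist/3.

  % istrue/1 restricted to what this sketch needs; Section 4.1 gives the fuller version.
  istrue(done(A, L, T)) :- done(A, L, T).

  done(action(Agt, Act, Constr), Loc, Time) :-                 % primitive action
      hist(action(Agt, Act, Constr), Loc, Time).
  done(action(Agt, disj(Act1, Act2), Constr), Loc, Time) :-    % disjunction
      (   done(action(Agt, Act1, Constr), Loc, Time)
      ;   done(action(Agt, Act2, Constr), Loc, Time)
      ).
  done(action(Agt, condit(Pred, Act), Constr), Loc, Time) :-   % conditional
      istrue(Pred),
      done(action(Agt, Act, Constr), Loc, Time).
  done(action(Agt, seq(Act1, Act2), Constr), Loc, Time) :-     % sequence
      done(action(Agt, Act1, Constr), Loc1, Time1),
      done(action(Agt, condit(done(action(Agt, Act1, Constr), Loc1, Time1), Act2),
                  Constr), Loc, Time).

  % Hypothetical history:
  hist(action(usr, drive_to(cvs), none), home, t1).
  hist(action(usr, get_vaccinated, none), cvs, t2).
  % ?- done(action(usr, seq(drive_to(cvs), get_vaccinated), none), cvs, t2).  % succeeds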

4.2 Meta-logical interpreter: asserting as rewriting

The meta-logical interpreter uses rules of the form (LHS → RHS) to infer that, whenever it is supposed to assert the left-hand side (LHS), it should instead assert the right-hand side (RHS). Importantly, these are rewriting rules, not inference rules, in that LHS is not asserted to be true. In the rules below we will use the expression ‘and’ to mean that what follows it is a constraint. That is, when the agent is trying to assert the first literal, if the formula following ‘and’ is true, then the literal is rewritten as the right-hand side. A small illustrative sketch of this rewriting pass follows the rules below.

  • bel(X, bel(X,P)) → bel(X, P)

  • bel(X, P \({\land }\) Q) → bel(X, P), bel(X, Q)

  • bel(X, knowref(X, Var^Pred)) → knowref(X, Var^Pred)

  • knowref(X, Var^bel(X, Pred)) → knowref(X, Var^Pred)

  • knowref(X, Var^Pred) and Var is a constant → bel(X, Pred)

  • pgoal(X, P \({\land }\) Q, R) → pgoal(X, P, R), pgoal(X, Q, R)Footnote 36

  • pgoal(X, pgoal(X, P, Q), Q) → pgoal(X, P, Q)Footnote 37

  • pgoal(X, intend(X, A, Q), Q) → intend(X, A, Q), where A is a (potentially complex) action. If agent X wants to intend to do action A, then the agent in fact intends to do A.

  • pgoal(X, knowref(X, Var^Pred), Q) and Var is a constant → pgoal(X, bel(X, Pred), Q)

During reasoning, Var can become bound to a constant. Because knowref is defined to existentially quantify Var into the agent’s beliefs, the possible worlds semantics shows that the agent believes Pred is true of that constant.

  • bel(X, pgoal(X, P, Q)) → pgoal(X, P, Q) — If the system tries to assert that the agent believes it has a goal, then assert that it does have the goal.
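One way to realize this asserting-as-rewriting step, shown purely as an illustration, is a normalization pass applied before a formula is stored. The assert_fact/1 entry point, the fact/1 store, and the use of nonvar/1 to test "Var is a constant" are our own assumptions.

  % Illustrative sketch: rewrite to the most compact form, then store it.
  :- dynamic fact/1.

  assert_fact(F0) :-
      normalize(F0, Fs),
      forall(member(F, Fs), assertz(fact(F))).

  % normalize(+Formula, -ListOfFormulasToStore)
  normalize(bel(X, bel(X, P)), Fs)       :- !, normalize(bel(X, P), Fs).
  normalize(bel(X, (P , Q)), Fs)         :- !, normalize(bel(X, P), Fs1),
                                              normalize(bel(X, Q), Fs2),
                                              append(Fs1, Fs2, Fs).
  normalize(bel(X, knowref(X, V^P)), Fs) :- !, normalize(knowref(X, V^P), Fs).
  normalize(knowref(X, V^bel(X, P)), Fs) :- !, normalize(knowref(X, V^P), Fs).
  normalize(knowref(X, V^P), Fs)         :- nonvar(V), !,      % Var already bound
                                              normalize(bel(X, P), Fs).
  normalize(bel(X, pgoal(X, P, Q)), Fs)  :- !, normalize(pgoal(X, P, Q), Fs).
  normalize(pgoal(X, (P , Q), R), Fs)    :- !, normalize(pgoal(X, P, R), Fs1),
                                              normalize(pgoal(X, Q, R), Fs2),
                                              append(Fs1, Fs2, Fs).
  normalize(pgoal(X, pgoal(X, P, Q), Q), Fs)  :- !, normalize(pgoal(X, P, Q), Fs).
  normalize(pgoal(X, intend(X, A, Q), Q), Fs) :- !, normalize(intend(X, A, Q), Fs).
  normalize(F, [F]).                                           % already compact

  % ?- assert_fact(bel(usr, bel(usr, (over_65(usr) , carer(usr))))).
  %    stores fact(bel(usr, over_65(usr))) and fact(bel(usr, carer(usr))).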

Finally, a word about soundness and completeness. We make no claim that the reasoning system is complete; however, relative to the approximations we have made, we believe it is sound.Footnote 38 There is a propositional variant of the logic in Cohen and Levesque [1], from which we draw, that has been shown to be sound and complete [89]. Rao and Georgeff [90] also proposed sound and complete branching-time multi-modal propositional logics for belief, desire, and intention. Their intention operator corresponds to Cohen and Levesque’s goal operator, and they have no analogue of persistent goal or intend as we use them here. In comparison to both approaches [89, 90], however, we cannot use a propositional logic for our system, since that would make it impossible to reason about knowref, which is essential to task-oriented dialogue. Likewise, we need actions whose arguments may be existentially quantified (“I intend that someone tow the car to the repair shop”). Another major difference with the aforementioned logics [89, 90] is that they do not consider intentions to perform actions, but only to make formulas true. In Eva, intentions to perform (possibly complex) actions are key to its functioning.Footnote 39

4.3 Equality and reference resolution

In the course of planning, the system generates equalities between variables appearing in different goals. Unlike graph representations that can accommodate multiple pointers to the same node, sentential reasoning requires that variables appearing in different formulas be explicitly stated as being equalFootnote 40. For example, we record that the covid vaccination center that the user (u1) wants to go to is the same as the covid vaccination center at which the user wants to be vaccinated. Likewise, the time that the user intends to be vaccinated is the same as the time at which the user wants to have an appointment. In the sample dialogue, at some point the system reasons that ‘CVS’ is a covid vaccination center that satisfies the user’s goals, which then enables all of the equalities to be resolved if needed.

In general, entities that are equal should be intersubstitutable in formulas. However, equality reasoning is prevented from crossing modal operators. For example, Frege's [91] famous examples must be jointly satisfiable: I may believe that the morning star = the evening star = Venus, while John does not know that the morning star = the evening star. I cannot use my beliefs about equality to reason about what John believes. However, the system can reason with “the X that John believes is the morning star = the Y that Mary believes is the evening star” (quantifying X and Y into their respective beliefs). Because the variables are not in the scope of the agents’ beliefs (or, in other cases, pgoals), the system can reason about X and Y without attributing those beliefs to John or Mary. Eva reasons about equality among variables by maintaining equivalence classes of equalities. Fig. 4 shows an example of such equivalence classes created after the first sentence of the example shown in Section 1.1. Notice that these are all quantified-in variables, but the system does not attribute these equalities to the user’s beliefs, because the semantics of quantified-in goals is that the value is the same in all the agent’s goal worlds, but not necessarily in all the agent’s belief worlds.

Fig. 4 Equivalence classes of typed variables (of the form X#Type). U is the user

This same equality reasoning mechanism enables the system to represent and resolve co-referential and anaphoric references. Because the equality relationship is declarative, if an ambiguous reference is detected, the system can generate a request to the user to disambiguate the intended referent (See Section 5 for the definitions of speech acts):

$$\begin{aligned} intend(sys, request(sys, usr, disj(&inform(usr, sys, (<\negmedspace\text{referring expression}\negmedspace> = \textit{'a'})), \\&inform(usr, sys, (<\negmedspace\text{referring expression}\negmedspace> = \textit{'b'})))), Q) \end{aligned}$$

Of course, how the system finds candidates for coreference or resolves them is a topic of much research (e.g., [92,93,94]). Eva incorporates mechanisms for representing and resolving certain referential expressions (e.g., “Does it have vaccine?”, “That should be 8am!”), but a lot more could be done in this area.
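A minimal sketch of how such equivalence classes of typed variables could be maintained is given below; the eq_class/1 store, the merge strategy, and the operator declaration for the X#Type notation are our own illustrative choices. As discussed above, the classes relate only quantified-in variables and are never pushed inside an agent's belief or goal operators.

  % Illustrative sketch: equivalence classes of typed variable identifiers,
  % merged whenever a new equality is recorded.
  :- op(100, xfx, #).            % typed-term notation X#Type, as in the paper
  :- dynamic eq_class/1.         % eq_class(ListOfMembers)

  assert_equal(A, B) :-
      class_of(A, CA),
      class_of(B, CB),
      (   CA == CB                                  % already in the same class
      ->  true
      ;   retract(eq_class(CA)),
          retract(eq_class(CB)),
          append(CA, CB, Merged),
          assertz(eq_class(Merged))
      ).

  class_of(X, C)   :- eq_class(C), memberchk(X, C), !.
  class_of(X, [X]) :- assertz(eq_class([X])).       % new singleton class

  % ?- assert_equal(x1#vacc_center, x2#vacc_center),
  %    assert_equal(x2#vacc_center, cvs#vacc_center),
  %    eq_class(C).
  % C = [x1#vacc_center, x2#vacc_center, cvs#vacc_center]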

5 Speech acts

In the study of semantics, philosophers and linguists once only analyzed utterances as being true or false. Austin [27] upended this approach by arguing that utterances are actions – dubbed illocutionary acts or speech acts – that change the world. This radical shift in thinking began the study of linguistic pragmatics. John Searle ([76], and in subsequent books) provides a detailed analysis of many different types of speech acts, at the level of philosophical argumentation. Based in part on some initial analyses of Bruce [10], the plan-based theory of speech acts [8, 14, 51] argued that people plan their speech acts to affect their listener’s mental and social states, and showed how speech acts could be modeled as operators in a planning system. Thus, a system can likewise plan speech acts to affect its listener’s mental states, and reason about the effects that its listener’s speech acts were intended to have on its own mental states. Planning and plan recognition became essential to such pragmatic analyses of language in context because speech acts could then be incorporated into an agent’s task-related plans when the agent determines that it needs to know/believe something, or needs to get the user to intend to perform an action.

Given the logic we have provided, especially the tools for describing actions, below are some of the speech acts implemented to date. Note that speech act definitions are domain independent.

5.1 Inform

The first action expression that we will consider is that of an inform by a Speaker to a Listener that formula Pred is true. The precondition for this action is that the Speaker believes what she is saying. The intended effect is stated as the Listener’s believing that Pred holds. Recall that we said the listed effect of this action description does not become true as a result of performing the action. Rather, the Listener comes to believe that the Speaker had a persistent goal that the effect holds. In the descriptions of the speech acts we will ignore the constraint parameter and trivial applicability conditions.Footnote 41

$$\begin{array}{l}\boldsymbol{inform}(Speaker, Listener, Pred)\\ \qquad precondition: bel(Speaker, Pred)\\ \qquad effect: bel(Listener, Pred)\end{array}$$
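In an implementation, an operator like this can simply be recorded as facts that the planner consults. The encoding below, which follows the precondition/2 and effect/3 arities used by the planning rules in Section 6.1, is our own illustrative rendering rather than Eva's actual schema.

  % Illustrative encoding of inform as planner-consultable facts.
  precondition(inform(Speaker, _Listener, Pred), bel(Speaker, Pred)).
  effect(Speaker, inform(Speaker, Listener, Pred), bel(Listener, Pred)).

  % ?- effect(sys, inform(sys, usr, has_vaccine(cvs)), E).
  % E = bel(usr, has_vaccine(cvs))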

5.2 Assert

Assertions differ from informs in that the intended effect of an assert is that the listener come to believe that the speaker believes the propositional content, whereas the intended effect of an inform is that the listener come to believe the propositional content itself. Thus, we have:

$$\begin{array}{l}\boldsymbol{assert}(Speaker, Listener, Pred)\\ \qquad precondition: bel(Speaker, Pred)\\ \qquad effect: bel(Listener, bel(Speaker, Pred))\end{array}$$

5.3 Informref

An informref is a speech act whose intended effect is that the listener come to know the value of the variable Var such that Pred is true of it (Var must be free in Pred). For example, when the user says “Monday” in response to the wh-question “when do you want to eat?”, the intended effect is that the listener come to know that the referent of “the date the user wants to eat” is Monday. The precondition is that the speaker knows the value of Var such that Pred holds.

$$\begin{array}{l}\boldsymbol{informref}(Speaker, Listener, {Var}^{\wedge }\negmedspace Pred)\\ \qquad precondition: knowref(Speaker, {Var}^{\wedge }\negmedspace Pred)\\ \qquad effect: knowref(Listener, {Var}^{\wedge }\negmedspace Pred)\end{array}$$

5.4 Informif

The informif(S, L, P) speech action can be defined as a disjunctive action [30]:

$$\boldsymbol{informif}(S, L, P)\equiv disj(inform(S, L, P), inform(S, L, {\sim}P))$$

5.5 Assertref

Assertref is similar to informref in that the intended effect of the speech act is that the listener come to know the referent of the variable such that the speaker believes Pred is true of it. assertref can be used to find out what the speaker believes, even if the speaker is not trying to convince the listener. For example, teacher-student questions or verification questions employ assertref.

$$\begin{array}{l}\boldsymbol{assertref}(Speaker, Listener, {Var}^{\wedge }\negmedspace Pred)\\ \qquad precondition: knowref(Speaker, {Var}^{\wedge }\negmedspace Pred)\\ \qquad effect: knowref(Listener, {Var}^{\wedge }bel(Speaker, Pred))\end{array}$$

Note that assertref can be defined in terms of informref as:

$$assertref(S, L,{V}^{\wedge }P)\equiv informref(S,L,{V}^{\wedge }bel(S, P))$$

5.6 Wh-Questions

A Speaker asks a wh-question to Listener about the referent of Var such that Pred is true (as usual, Var must be free in Pred):Footnote 42

$$\begin{array}{l}\boldsymbol{wh\text{-}q}(Speaker, Listener, {Var}^{\wedge }\negmedspace Pred)\\ \qquad precondition: knowref(Listener, {Var}^{\wedge }\negmedspace Pred)\\ \qquad effect: knowref(Speaker, {Var}^{\wedge }\negmedspace Pred)\end{array}$$

Recall that we do not claim the effect of a speech act becomes true in virtue of the act’s being performed. Because these are planning operators, the effect becomes a pgoal of the planning agent. Conversely, on observing an agent performing an action, including a speech act, the observer comes to believe the planning agent had the effect as a pgoal.Footnote 43 So, on hearing a wh-question, the listener comes to believe that the speaker has a pgoal that the speaker come to know the referent of the description.

During backward-chaining, the planner may unify the content of a goal formula with the effect of an action and choose to consider that action as a means to achieve the goal. However, matching the effect may not provide a binding for the Listener. If the Listener is not specified, evaluating the precondition may enable the planner to determine who knows the answer and to direct the wh-question to that agent.

5.7 Yes-No questions

A yes-no question is described as:

$$\begin{array}{l}\boldsymbol{ynq}(Speaker, Listener, Pred)\\ \qquad precondition: knowif(Listener, Pred)\\ \qquad effect: knowif(Speaker, Pred)\end{array}$$

Thus, if the system has the pgoal to achieve a knowif formula, it can adopt the pgoal to perform a ynq directed at someone whom it believes knows whether Pred is true. A yes-no question from speaker S to listener L whether predicate P is true can be decomposed as a sequence of the speaker’s requesting that L do an informif action, followed by the informif action. That is:

$$ynq(S, L, P)\equiv seq(request(S, L, informif(L, S, P)), informif(L, S, P))$$

5.8 Requests

Requests are a paradigmatic example of a larger class of speech actions, the directives [76], which also includes commands, recommendations, suggestions, etc. The intended effect of the request is that the listener form the intention to do the requested action. From this class, Eva currently uses requests and recommendations, which differ in terms of whether the action being requested/recommended benefits the speaker or the listener, respectively. Notice that some of the parameters must be computed based on the embedded Act.

$$\begin{aligned}\boldsymbol{request}&(Speaker, Listener, Act)\\&constraint: bel(Speaker, Cond)\\& precondition: bel(Speaker, Pre)\\&effect: intend(Listener,do(action(Listener, Act, Cond),Loc, Time), Q)\end{aligned}$$

where Pre and Cond are, respectively, the precondition and the constraint of the requested Act, which benefits Speaker:

$$\begin{array}{l}precondition(Act, Pre),\\ constraint(Act, Cond),\\ benefits(Act,Speaker).\end{array}$$

5.9 Verification questions

A number of application scenarios require that a user be verified by answering questions for which the system already has an answer in its database. The system’s goal here is not for the system to come to know the answer, but for the system to come to know what the user thinks is the answer. This can be accomplished via the assertref action. Thus, in planning the verification question, the system requests this assertref action.Footnote 44 Notice that the effect of the assertref involves an existential quantifier whose scope is the Listener’s belief of the Speaker’s belief. We leave as an exercise for the reader to derive the preconditions and effects of the verification question from the preconditions and effects of the constituent actions.

$$\begin{array}{l}\boldsymbol{verifyref}(Speaker, Listener, Var^{\wedge}\negmedspace Pred)\equiv \\ \qquad \begin{aligned}seq(&request(Speaker, Listener, assertref(Listener, Speaker, {Var}^{\wedge }\negmedspace Pred)),\\ \qquad\;\;\;\;\;\;&assertref(Listener, Speaker, {Var}^{\wedge }\negmedspace Pred))\end{aligned}\end{array}$$
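The composite speech acts above lend themselves to a compact definitional encoding. The sketch below reuses the body/2 relation of the Act-Body planning rule (P6 in Section 6.1) together with the disj and seq constructors; the exact clauses are our own illustration.

  % Illustrative definitions of composite speech acts.
  body(informif(S, L, P),
       disj(inform(S, L, P), inform(S, L, neg(P)))).
  body(assertref(S, L, V^P),
       informref(S, L, V^bel(S, P))).
  body(ynq(S, L, P),
       seq(request(S, L, informif(L, S, P)), informif(L, S, P))).
  body(verifyref(S, L, V^P),
       seq(request(S, L, assertref(L, S, V^P)), assertref(L, S, V^P))).

  % ?- body(ynq(sys, cvs, has_vaccine(cvs)), B).
  % B = seq(request(sys, cvs, informif(cvs, sys, has_vaccine(cvs))),
  %         informif(cvs, sys, has_vaccine(cvs)))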

6 Collaborative planning

As discussed in Section 1, collaboration is so essential to society that we teach our children to be collaborative at a very early age [2]. However, present day conversational systems generally do not know how to be helpful or collaborate, stemming from their inability to infer and respond to the intention that motivated the user’s utterance. To overcome this failing, we have built a collaborative planning-based system designed to assist its conversant(s) in achieving his/her goals. As previously mentioned, the approach dates back to work done at Bolt Beranek and Newman [10,11,12], at the University of Toronto [9, 13, 14], and at the University of Rochester (e.g., [15,16,17,18,19]). Such systems attempt to infer their conversants’ plan that resulted in the communication, and then to ensure that the plans succeed. Indeed, as noted in Section 1, a central feature of our approach is that the system will attempt to infer as much of the user’s plan as it can, will try to identify obstacles to its success, and plan to overcome those obstacles in order to help the user achieve his/her higher-level goals. Thus, plan recognition and planning are essential to Eva’s architecture and processing. Below we provide a description of Eva’s collaborative planning and plan recognition.

6.1 Planning rules

Rather than representing a plan as a pure graph structure of actions, Eva’s plans consist of a web of interdependent logical forms describing mental states, notably beliefs, persistent goals and intended actions (cf. [37]). Based on the epistemic planning approach first described in [8, 9, 13, 31, 51] and recast in the logic and speech act theory provided in Cohen & Levesque [1, 7], Eva’s planning and plan recognition make extensive use of reasoning, as it derives and attributes new mental states of the user. The system generalizes the original plan-based dialogue research program by planning with multiple declaratively-specified mental states, including persistent goal, intention, belief, knowif, knowref.

The system has a hybrid planning algorithm [95] as it both engages in backward chaining from desired effect to one or more chosen actions that could achieve those effects, and decomposes hierarchically defined actions into more primitive ones as a hierarchical task network planner does [80, 96,97,98]. Others have described plan recognition in terms of “inverse planning” [99, 100]. Our planning algorithm is similar in some respects to that used by BDI systems in that the purpose is to determine the next action to perform, rather than to plan sequences of actions, though it can do that. One would not expect a dialogue planning system to engage in advance planning of back-and-forth interactions as in a game, unless the world were very constrained. We also are not interested in finding a machine-learned “optimal” response, given the rather arbitrary numerical weights/probabilities and the vast space of potential logical forms (not to mention natural language utterances) that a learning system driven by a user simulator [101] might generate. Rather, we want the system to do something appropriate and reasonable, as long as it has the means to recover from errors, respond to users’ clarification questions, etc. Because the system interleaves planning, execution and so-called “execution-monitoring” that involves observing the user’s actions in response to the system’s (i.e., to have a dialogue), there are many opportunities to revector a dialogue towards success.

Eva’s planning involves the following rules, for which we use ‘\(\Rightarrow\)’ to separate antecedent and consequent. The formulas in the antecedent are proven true via the meta-interpreter’s istrue rules. The result of applying a planning rule is to assert the consequent. These assertions can be rewritten by the → rules.

Effect-Action

If an agent Agt has pgoal to achieve a proposition P, and Agt can find an action Act that achieves P as an effect, then the planner creates a pgoal to do the action relative to the pgoal to bring about P.

  • (P1) pgoal(Agt, P, Q) and effect(Agt, Act, P) \(\Rightarrow\) pgoal(Agt, done(Agt, Act), pgoal(Agt, P, Q))

Given the definition of intention provided earlier, the formula on the right side is the expansion of the intention to do Act:

$$intend(Agt, Act, pgoal(Agt, P, Q))$$

We will use intend formulas wherever possible.

If more than one action can be found, the planner creates a disjunctive action (see also [39]).
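A schematic, runnable rendering of the Effect-Action rule in forward-chaining style is shown below. The fact/1 working store, the simplified assert_fact/1 (without the rewriting pass of Section 4.2), and the sample appointment action are our own illustrative assumptions.

  % Illustrative sketch of planning rule P1 (Effect-Action).
  :- dynamic fact/1.
  assert_fact(F) :- ( fact(F) -> true ; assertz(fact(F)) ).

  effect_action :-
      fact(pgoal(Agt, P, Q)),
      effect(Agt, Act, P),
      assert_fact(pgoal(Agt, done(Agt, Act), pgoal(Agt, P, Q))),  % i.e., intend(Agt, Act, ...)
      fail.                                   % iterate over all applicable instances
  effect_action.

  % Hypothetical domain action and goal:
  effect(sys, make_appointment(usr, cvs, Day, Time),
         has_appointment(usr, cvs, Day, Time)).
  fact(pgoal(sys, has_appointment(usr, cvs, monday, t0800), user_request)).

  % ?- effect_action, fact(pgoal(sys, done(sys, A), _)).
  % A = make_appointment(usr, cvs, monday, t0800)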

Act-Applicability condition

If the planning agent believes an Act’s applicability condition, AC, is false, then the action is impossible and the intention is marked as impossible. During the main loop of the system, intentions that are impossible are retracted, as are any persistent goals or intentions that are relativized to them.

We adopt the following rule for applicability conditions: given action A, applicability condition AC, agent Agt, and a relativizing condition, Q:

  • (P2) applicability_condition(A, AC) and bel(Agt, ~AC) \(\Rightarrow\) bel(Agt, impossible(done(Agt, A))), ~intend(Agt, A, Q)

Recall that for a given action, the applicability conditions cannot be made true. The above rule means that, if AC is an applicability condition to do action A, and the agent believes it is false, the agent itself cannot possibly do anything to make AC true, so then the agent would drop (or not adopt) an intention to do A.

If the planning agent Agt does not know whether AC holds, the following rule is used to create a pgoal to knowif that AC is true, relative to the intention to do the Act.

  • (P3) intend(Agt, Act, Q) and applicability_condition(Act, AC) and \+knowif(Agt, AC) \(\Rightarrow\) pgoal(Agt, knowif (Agt, AC), intend(Agt, Act, Q)) and blocked(intend(Agt, Act, Q))

The created pgoal to knowif potentially leads the agent to ask a question. In addition, the persistent goal/intention to perform the Act is blocked, such that no more expansion of any plans passing through that action can be accomplished until the system knows whether the AC holds. Considering the system as agent, if the system comes to believe AC holds, the relevant blocked pgoal/intention becomes unblocked and planning to achieve that pgoal/intention continues. Hereafter, we suppress the condition on all rules that the intention and/or persistent goal to do an action is not blocked.

Act-Precondition

In backward-chaining, if the planner determines that a precondition (PC) to an intended Act is believed to be false, the planner creates a persistent goal to achieve PC, relative to the intention to perform Act.

  • (P4) intend(Agt, Act, Q) and precondition(Act, PC) and bel(Agt, ~PC) \(\Rightarrow\) pgoal(Agt, PC, intend(Agt, Act, Q))

If the agent does not know whether or not PC holds, then it adopts the pgoal to knowif PC is true.

  • (P5) intend(Agt, Act, Q) and precondition(Act, PC) and ~bel(Agt, PC) \(\Rightarrow\) pgoal(Agt, knowif(Agt, PC), intend(Agt, Act, Q))

Act-Body

This rule enables the planner to engage in hierarchical planning. When the planner creates an intention to perform an Act that has a decomposition (Body), it creates an intention to perform Body relative to the higher-level intention to perform Act. The intention to perform Body could then lead to planning with conditionals, disjunctions, and sequences.

  • (P6) intend(Agt, Act, Q) and body(Act, Body) \(\Rightarrow\) intend(Agt, Body, intend(Agt, Act, Q))

As discussed in Section 3.5.2, various expansions and relativizations are created between the Body action and the higher-level action. Note that the preconditions and effects of the higher-level action are derived from the structure of the Body action. In particular, the precondition of the higher-level act is the precondition of the first act in the decomposition. The effect of the higher-level act depends on the decomposition. For instance, the effect of a sequence is the effect of the last act in the sequence. The effects of the intermediate acts may or may not persist until the end, so we do not attempt to derive their status. Other rules are provided by the forward meta-interpreter, which handles the assertion of intending complex actions in terms of intending their decomposed constituents.

Act-Knowref

If an agent Agt has an intention to do an action Act (relative to Q), the agent has a pgoal, relative to that intention, to knowref the value of each obligatory argument of that action that it does not know. That is, for each such Var, it creates a persistent goal to know what the Var is such that the agent intends to do the Act for which that Var is a parameter.

  • (P7) intend(Agt, Act, Q) and unk_oblig_arg(Act, Role:Var#Type) \(\Rightarrow\) pgoal(Agt, knowref(Agt, Var^intend(Agt, Act, Q)), intend(Agt, Act, Q))

The creation of such goals may lead to planning and executing wh-questions (so-called “slot-filling” questions) by the agent. Conversely, because of helpful goal adoption (Section 6.3.1) after the user asks a question indicating that the user has a pgoal to know the value of the variable in a predicate, the system may come to have a pgoal that the user know what that value is and may infer an action that the user wants to perform. Such a goal could lead the system to tell the user what s/he needs to know in order to do an inferred action, even if not explicitly asked. Also, if the user changes his/her mind about intending Act, the system can drop the pgoal to find out the values of Act’s parameters.

Intended complex actions

If an agent intends to do a conditional action and does not know whether the condition holds, the agent forms a persistent goal to come to know whether the condition holds, relative to the intention to do the conditional action.

  • (P8) intend(X, condit(P,A)) and \+knowif(X,P) \(\Rightarrow\) pgoal(X, knowif(X,P), intend(X, condit(P,A)))

If an agent intends to do a conditional action, and believes the condition is true, then the agent intends to do the action relative to the intention to do the conditional action.

  • (P9) intend(X, condit(P,A), Q) and bel(X,P) \(\Rightarrow\) intend(X, A, intend(X, condit(P,A), Q))Footnote 45

Intending a mutually exclusive disjunctive action disj(A,B) results in two intentions: an intention to do action A provided action B has not been done, and similarly for B. So, whichever gets done first, will cause the other intention to be removed because the relativized intention for the disjunctive act has been achieved.

  • (P10) intend(X, disj(A,B), Q) \(\Rightarrow\) intend(X, condit(~done(X,B), A), intend(X, disj(A,B), Q)) and intend(X, condit(~done(X,A), B), intend(X, disj(A,B), Q))

An agent X’s intending to do the sequential action seq(A,B) results in two intentions: first in the agent X’s intending to do the first action A, relative to the intention to do the sequence, and in X’s intending to do the second action when done(X,A) is true, again relative to the intention to do the sequence.Footnote 46

  • (P11) intend(X, seq(A,B), Q) \(\Rightarrow\) intend(X, A, intend(X, seq(A,B), Q)) and intend(X, condit(done(X,A),B), intend(X,seq(A,B), Q))
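The expansion of intentions over complex actions (rules P9-P11) can be sketched as follows; as before, the fact/1 store and the assert_fact/1 helper are illustrative stand-ins rather than Eva's actual code.

  % Illustrative sketch of rules P9-P11.
  :- dynamic fact/1.
  assert_fact(F) :- ( fact(F) -> true ; assertz(fact(F)) ).

  expand_intention(intend(X, condit(P, A), Q)) :-                   % P9
      fact(bel(X, P)),
      assert_fact(intend(X, A, intend(X, condit(P, A), Q))).
  expand_intention(intend(X, disj(A, B), Q)) :-                     % P10
      assert_fact(intend(X, condit(neg(done(X, B)), A), intend(X, disj(A, B), Q))),
      assert_fact(intend(X, condit(neg(done(X, A)), B), intend(X, disj(A, B), Q))).
  expand_intention(intend(X, seq(A, B), Q)) :-                      % P11
      assert_fact(intend(X, A, intend(X, seq(A, B), Q))),
      assert_fact(intend(X, condit(done(X, A), B), intend(X, seq(A, B), Q))).

  % ?- expand_intention(intend(sys, seq(ask_age(sys, usr), check_eligibility(sys, usr)), q0)).
  %    asserts an intention to ask now and a conditional intention to check afterwards.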

6.2 Plan recognition rules

The system engages in plan recognition by expanding the user’s plan, expressed in terms of appropriately relativized persistent goals and intentions, according to various rules similar to those of Allen and Perrault [9].

Act-Effect

If the system has attributed to the user Agt a persistent goal/intention to perform an action Act, then assert that Agt has a persistent goal to achieve the effect E of Act relative to the intention to do Act.

  • (P12) intend(Agt, Act, Q) and effect(Agt, Act, E) \(\Rightarrow\) pgoal(Agt, E, intend(Agt, Act, Q))

Precondition-Act

If the system has attributed to Agt a pgoal(Agt, P, Q) and P is the precondition to an act Act, then attribute to Agt the intention to do Act relative to that pgoal.

  • (P13) pgoal(Agt, P, Q) and precondition(Agt, Act, P) \(\Rightarrow\) intend(Agt, Act, pgoal(Agt, P, Q))

Note that P could enable multiple acts, e.g., A1 and A2. The system would then attribute to Agt the intention:

$$intend(Agt, disj(A1, A2), pgoal(Agt, P, Q))$$
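Rules P12 and P13 can also be rendered relationally, mapping an attitude already attributed to the user to a further attitude to attribute. The clauses and the vaccination example below are our own sketch, with effect/3 and precondition/2 encoded as in the speech-act sketch of Section 5.

  % Illustrative sketch of plan recognition rules P12 (Act-Effect) and
  % P13 (Precondition-Act).
  recognize(intend(Agt, Act, Q), pgoal(Agt, E, intend(Agt, Act, Q))) :-   % P12
      effect(Agt, Act, E).
  recognize(pgoal(Agt, P, Q), intend(Agt, Act, pgoal(Agt, P, Q))) :-      % P13
      precondition(Act, P).

  % Hypothetical domain facts:
  effect(usr, get_vaccinated(usr, C), vaccinated(usr, C)).
  precondition(get_vaccinated(usr, C), located_at(usr, C)).

  % ?- recognize(pgoal(usr, located_at(usr, cvs), q0), Attributed).
  % Attributed = intend(usr, get_vaccinated(usr, cvs), pgoal(usr, located_at(usr, cvs), q0))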

Know-if-exists

If S has attributed to Agt the pgoal to know whether or not there is an X such that Pred, where Pred is a schematic predicate variable that has a free variable X, then attribute to Agt the pgoal to know what “the X” is. Formally,

  • (P14) pgoal(Agt, knowif(Agt, \(\exists\)X^Pred), Q) \(\Rightarrow\) pgoal(Agt, knowref(Agt, X^Pred), R) and Q = pgoal(Agt, knowref(Agt, X^Pred), R)

For example, if Agt wants to know whether there is a nearby vaccine center, then attribute to Agt the pgoal to know the referent of “nearby vaccine center”. This would enable the Val-Action inference below.

Val-Action

If Agt has a pgoal(Agt, knowref(Agt, X#Type^Pred)), and X#Type is a required argument in some action Act and Act has a constraint predicate C, then create a persistent goal to have done Act additionally constrained by Pred.

  • (P15) pgoal(Agt, knowref(Agt, X#Type^Pred)) and unk_oblig_arg(Agt, Act, Role:X#Type) \(\Rightarrow\) pgoal(Agt, knowref(Agt, X#Type^pgoal(Agt, and(done(Agt, Act), constraint(Agt, Act, Pred)), Q)))

For example, if Agt wants to know the location of the nearest vaccination center, then Agt may want to go to that location.Footnote 47

Knowif-Action

If pgoal(Agt, knowif(Agt,P), Q), and P is an applicability condition for an Act, then attribute to the Agt the pgoal to have done that Act (i.e., the intention to do Act). Notice that because this is a plan recognition rule, the relativization argument of the pgoal to knowif is the intention to perform the Act. Formally:

  • (P16) pgoal(Agt, knowif(Agt,AC), Q) and applicability_condition(Act, AC) \(\Rightarrow\) intend(Agt, Act, R) and Q = intend(Agt, Act, R)

Normal-Activity

If Agt has a pgoal to be located at a place P, then ascribe to Agt the pgoal to do the normal activity one does at location P. For example, if Agt has a pgoal to be located at a movie theater, then attribute to Agt the pgoal to watch a movie.Footnote 48

  • (P17) pgoal(Agt, location(Agt, Place), Q) and normal_activity(Place, Act) \(\Rightarrow\) intend(Agt, Act, pgoal(Agt, location(Agt, Place), Q))

Negative state

If Agt is in a negative state (of which there are a list of domain dependent types), infer that the agent wants to be in the corresponding positive state. For example, if the agent has lost her phone, infer that the agent wants to have found her phone. If the agent’s phone has been damaged, infer that Agt wants her phone to be repaired.

  • (P18) bel(Agt, state_of(Agt, NegState)) and bel(Agt, positive_state(NegState, PosState)) \(\Rightarrow\) pgoal(Agt, state_of(Agt, PosState), Q)

Finally, if the probability of an inferred intention of the user is below a modifiable threshold (which could be user-dependent), the system generates a goal to know whether the user indeed has that intention. We see this in utterance 3 of the sample dialogue in Section 1.1.

6.3 Other ways that goals arise

Eva is driven by its persistent goals, which results in its planning to achieve them, and/or helping its user to achieve his/her goals. We described above how many persistent goals are generated. Below we elaborate on other ways that pgoals arise.

6.3.1 Collaborative goal adoption

Because the system is collaborative, it will adopt as its own those goals that it attributes to the user. For example, if it believes the user wants to be vaccinated, it will adopt the goal that the user be vaccinated. However, such goal adoption is not absolute. Thus, if it believes the user or the system is not allowed to adopt the goal, it will not. More formally, if the system believes the user has P as a pgoal, and the system does not have a pgoal that ~P, then the system adopts the pgoal that P relative to the user’s pgoal:Footnote 49

  • (P19) pgoal(usr, P, Q) and \+pgoal(sys, ~P, R) \(\Rightarrow\) pgoal(sys, P, pgoal(usr, P, Q))

For example, if the system believes the user wants to knowref P (e.g., P = the secret formula for Coca Cola), and the system does not want the user not to know it, the system adopts the goal that the user knowref P. However, should the system not want the user to knowref P, then the system does not have to adopt the user’s goal and plan to satisfy it. Notice also that if the system comes to believe that the user no longer wants P, then the system can abandon its pgoal that P, which would then lead to its dropping any intentions it had created to achieving P.
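A minimal forward-chaining sketch of this goal-adoption rule (P19) appears below. Negation as failure (\+) plays the role of checking for a conflicting system pgoal; the fact/1 store and the example goal are our own illustrative assumptions.

  % Illustrative sketch of collaborative goal adoption (P19).
  :- dynamic fact/1.

  adopt_user_goals :-
      fact(pgoal(usr, P, Q)),
      \+ fact(pgoal(sys, neg(P), _)),                   % no conflicting system goal
      \+ fact(pgoal(sys, P, pgoal(usr, P, Q))),         % not already adopted
      assertz(fact(pgoal(sys, P, pgoal(usr, P, Q)))),
      fail.                                             % iterate over all user pgoals
  adopt_user_goals.

  % Hypothetical example:
  fact(pgoal(usr, vaccinated(usr), q0)).
  % ?- adopt_user_goals, fact(pgoal(sys, vaccinated(usr), R)).
  % R = pgoal(usr, vaccinated(usr), q0)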

Among the consequences of the theory of joint intentions [1, 4, 7] are communications that must occur when joint intentions are achieved or are impossible. Rather than implement the entire theory of joint intentions declaratively, we choose to incorporate some of its consequences as rules within Eva. Thus, we have:

  • (P20) done(Agt, condit(bel(sys, bel(usr, pgoal(sys, P, pgoal(usr, P, Q)))), Act)) and bel(sys, P) \(\Rightarrow\) pgoal(sys, bel(usr, P), bel(sys, P))

That is, if some action Act has been done prior to which the system believed that the user believed the system had a pgoal to achieve P relative to the user’s wanting the system to do so, and after the act, the system comes to believe P, then the system then has a pgoal to get the user to believe P (relative to the system’s believing P). Notice that in virtue of the definition of pgoal, the system no longer has the pgoal to achieve P because it believes P is true. For example, if the system offers to make an appointment, and the offer is accepted, then once the appointment is made, the system will inform the user of that fact. Likewise, if the user directly requests the system to make an appointment, and the system agrees, then the system will inform the user once the appointment has been made. In general, these should be mutual beliefs that the system has a pgoal dependent on the user, but we are refraining from incorporating that theoretical construct for the time being.

Notice also that similar reasoning would apply if the system came to believe that P is impossible. That is,

  • (P21) done(Agt, condit(bel(sys, bel(usr, pgoal(sys, P, pgoal(usr, P, Q)))), Act)) and bel(sys, impossible(P)) \(\Rightarrow\) pgoal(sys, bel(usr, impossible(P)), bel(sys, impossible(P)))

In this case, the system will inform the user that P is impossible.

6.3.2 Generating goals to knowif by rule decomposition

Eva generates goals to know whether or not a proposition P is true in numerous ways. First, if P is a precondition to an intended action A, Eva will generate the goal pgoal(sys, knowif(sys, P), intend(sys, A, Q)) (see Section 6.1). If Eva later learns that P is false, it may then attempt to make it true. If the intention to do A is dropped, the pgoal to knowif(P), and anything depending on it, such as the likely intended yes/no question, can be dropped as well. Second, if P is an applicability condition to an intended action A, Eva will attempt to prove that knowif(sys, P). If it cannot prove it, Eva also generates the goal pgoal(sys, knowif(sys, P), intend(sys, A, Q)). In both cases, it blocks the intention to do A such that no further planning is done with that intention until it comes to knowif(sys, P). If it comes to believe the applicability condition is false, then P is impossible to achieve, so it retracts the intention and unwinds the plan subtree that depends on it. If it comes to believe a precondition is false, it will attempt to create a plan to achieve it.

Given a pgoal(sys, knowif(sys, P)), the system can plan a yes-no question (YNQ that P), provided it can find a listener L whom it can ask and of whom it believes knowif(L, P). This may involve asking someone other than the user (in the example, it is the pharmacy CVS).

Another special case of generating goals to knowif(sys, P) arises when P is defined in terms of a disjunction of Prolog clauses. For example, one might state a rule that a person is eligible for the Covid vaccine if:

  • Clause 1: The person’s age is greater than 65, or

  • Clause 2: The person’s age is between 50 and 64, and the person is caring for someone who is disabled, or

  • Clause 3: The person’s age is less than 50 and the person is an essential worker

If the system has a pgoal to know whether the user is eligible for a covid vaccine, i.e.,

$$pgoal\left(sys, knowif(sys, eligible(usr, covid\_vaccine))\right),$$

Eva also generates:

$$pgoal\left(sys,knowif(sys, Clause1), pgoal(sys, knowif(sys, eligible(usr, covid\_vaccine)))\right)$$

as well as pgoals to knowif Clause2 and Clause3. Notice that these three pgoals are made relative to the eligibility pgoal; if any of the pgoals is achieved, eligibility becomes true, and the other pgoals are dropped.
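The decomposition could be realized roughly as follows, with the three disjuncts recorded as def_clause/2 facts; this encoding and the predicate names inside the clause bodies are our own illustrative choices.

  % Illustrative sketch: decomposing a knowif goal over the disjuncts of a rule.
  :- dynamic fact/1.

  def_clause(eligible(U, covid_vaccine), over_65(U)).
  def_clause(eligible(U, covid_vaccine), (between_50_and_64(U), caring_for_disabled(U))).
  def_clause(eligible(U, covid_vaccine), (under_50(U), essential_worker(U))).

  decompose_knowif :-
      fact(pgoal(sys, knowif(sys, P), Q)),
      def_clause(P, Body),
      \+ fact(pgoal(sys, knowif(sys, Body), _)),
      assertz(fact(pgoal(sys, knowif(sys, Body), pgoal(sys, knowif(sys, P), Q)))),
      fail.
  decompose_knowif.

  fact(pgoal(sys, knowif(sys, eligible(usr, covid_vaccine)), q0)).
  % ?- decompose_knowif.
  %    asserts a knowif pgoal for each clause body, relative to the eligibility pgoal.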

7 Semantics of slots

In our framework, slots are quantified-in goals, such as the time you want the system to make an appointment for a vaccination (see also [13, 51]). Assuming the user wants the system to make an appointment, the system’s goal adoption and planning would generate pgoals of knowing the date and the time for which the user wants to make an appointment.

Using the definition of knowref and the semantics in Appendix A, we can now provide a meaning to statements like:

  • The system wants to know the date for which the user wants it to make an appointment for the user at some business b.

This is represented as (showing an existential binding for other variables that we have so far been suppressing):

  • pgoal(sys, knowref(sys, Day^pgoal(usr, \(\exists\)Time^done(sys, make_appointment(usr, b, Day, Time)))), Q)

Expanding knowref into its definition, we see that this formula essentially quantifies into two levels of modal operators – bel and pgoal, namely:

  • pgoal(sys, \(\exists\)Day^bel(sys, pgoal(usr, \(\exists\)Time^done(sys, make_appointment(usr, b, Day, Time)))), Q)

or, in words:

  • The system wants there to be a Day of which the system thinks the user wants there to be a Time such that the system makes an appointment for the user at business b on that day and time.Footnote 50

To make sense of such statements, consider that bel(A, P) means that P is true in all of agent A’s B-related possible worlds (see Appendix A). The meaning of \(\exists\)X^bel(A, p(X)) is that there is some value of X (call it d) assigned to it by the semantics of the “valuation” function v in the world in which the formula is evaluated, such that the same value assigned to X in all of A’s B-related worlds satisfies p(X). Because the existential quantifier out-scopes the universal quantifier, the chosen value d is the same choice in every related world such that p(d). As modal operators are embedded, the corresponding chains of B and G-relatedness continue, with the same d being chosen in all of them (see Fig. 8 in Appendix A).

Assume the system has a pgoal that there be a day on which the system thinks the user wants the system to make an appointment at a vaccine center. This would likely result in a question like “When would you like me to make the appointment?”. The user model contains a default assertion that the user knows what s/he wants, so by default the user knows when s/he wants to have appointments. However, the user might say “I don’t know,” or might say “Mary knows the day,” or say, “Ask Mary”. We adopt the Gricean heuristic that the user would have said the day if s/he knew it; since s/he didn’t, s/he doesn’t know the day. The general default still holds, but a specific neg(knowref(usr, Time^pgoal(…))) would then be asserted, which causes that default to be inapplicable. This prevents the system from asking the same question again, as the precondition would no longer hold.

The system plans the wh-question when the effect of the speech act (Fig. 5) matches the content of the system’s pgoal, provided that the system believes the precondition holds, i.e., the system believes that the user knows the referent of “the time User wants to make an appointment”. If the system does not believe the user knows the referent, but knows of someone else who does, it could then plan a question to that other agent.

Fig. 5 Slot-filling question

7.1 Handling constraints on slots

Every wh-question has a Variable and a Predicate which constrains its value. When the user conjoins another predicate, it further constrains that value. So, if the system wants to know the time that the user wants an appointment, the system has the following pgoal:

$$\begin{aligned}pgoal&(sys, knowref(sys, {Time}^{\wedge }pgoal(usr, done(sys, \\&make\_appointment(usr, b, Day, Time), \boldsymbol{Cond}), R)), Q)\end{aligned}$$

When the user says: “after 10 am”, the system then has (assuming the variable Day has already been given a value):

$$\begin{aligned}pgoal&(sys, knowref(sys, {Time}^{\wedge }pgoal(user, done(sys, make\_appointment(usr, \\&b, Day, Time), \boldsymbol{and(Cond,after(Time, 10am))}), R)), Q)\end{aligned}$$

Critically, as shown here, the after constraint needs to be in the scope of the user’s pgoal, because it is not simply that the system believes the time is after 10am, but that the user wants it to be after 10am.Footnote 51 An example of a more complex constraint (“the earliest time available”) can be found in the sample transcript in Section 1.1.
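The transformation from the first formula to the second can be expressed as a single clause that conjoins the new constraint inside the scope of the user's pgoal; the clause below is our own illustration of that term manipulation.

  % Illustrative sketch: conjoin a user-supplied constraint into the open slot goal.
  conjoin_constraint(
      pgoal(sys, knowref(sys, Var^pgoal(U, done(S, Act, Cond), R)), Q),
      NewC,
      pgoal(sys, knowref(sys, Var^pgoal(U, done(S, Act, and(Cond, NewC)), R)), Q)).

  % ?- conjoin_constraint(
  %        pgoal(sys, knowref(sys, Time^pgoal(usr,
  %              done(sys, make_appointment(usr, b, Day, Time), Cond), r0)), q0),
  %        after(Time, t1000), NewGoal).
  %    NewGoal carries and(Cond, after(Time, t1000)) inside the user's pgoal.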

In the course of this processing, Eva asserts that:

$$neg(knowref(user, {Time}^{\wedge }pgoal(user, done(sys, make\_appointment(usr, b, Day, Time), Cond), R)))$$

That is, the user does not know what time s/he wants the system to make the appointment.

8 Operational semantics – BDI architecture

Belief-desire-intention architectures have been researched for many years, beginning with Bratman et al. [75]. Inspired by philosophical and logical theories, reactive BDI architectures such as PRS [103] essentially sensed changes in the world that updated the system’s state (its “beliefs”). The architecture determined which of its pre-packaged “plans” could be used to achieve its “goals” in that state. These architectures expand the plans hierarchically to decide what to do. Sardiña et al. [104] show that the inner loop of such BDI architectures is isomorphic to HTN (Hierarchical Task Network) planning. However, neither BDI architectures nor HTN-based systems engage in plan formation (e.g., [81]), nor do they reason about other agentsFootnote 52. Sardiña et al. [104] and de Silva et al. [106] present formal theories of how to incorporate plan formation in such architectures, in part using declarative statements of the system’s goals. In our case, the Eva system needs to be more declarative still in order to reason about the user’s beliefs, goals, and intentions. There is also some confusion in the BDI architecture literature in the use of the terms ‘intention’ and ‘plan’. For other BDI architectures, intentions consist of plans, which are contained in a fixed plan library, corresponding to our actions and their hierarchical definitions. In contrast, Eva’s plans consist of its intentions and goals, whose contents are actions (that are contained in a library) and propositions to be made true. Eva’s plan is created at run-time by the hybrid planning algorithm, whereas other BDI architectures’ plans are fixed and reactively executed. In virtue of the Cohen and Levesque [1] semantics, Eva’s intentions also contain a relativization parameter, enabling it to unwind its plans appropriately.

The Eva system performs the basic loop described below, and depicted in Fig. 6. It relies on the declarative representation and reasoning processes described above, as well as the intention formation and commitment processes in [1] that state the relationships among intentions, persistent goals, and beliefs.

Fig. 6 The Basic BDI Architecture

8.1 The main loop

Eva’s operation could be described as looping through the following steps (where S identifies the system, and U stands for the user):

  1) Observe the world, including U’s action(s) AU, among them speech acts.

  2) Assert that U wants (i.e., has a pgoal for) the effect of AU; if AU is a speech act, assert that U wants S to believe U wants the effect. If S trusts U, this collapses “U wants S to believe U wants” into “U wants” the effect.Footnote 53

  3) Assert that U believes that the precondition of AU holds, and if S believes the action AU was successful, assert that S believes the precondition holds.Footnote 54

  4) Apply plan recognition rules until no new beliefs, persistent goals, or intentions are created.

  5) Debug U’s plan, i.e., check the applicability conditions for U’s intention(s) in the plan.Footnote 55

     i) If the applicability condition of an act is false, plan alternative ways to achieve the higher-level effect of the act.

     ii) Retract any intentions or pgoals that are marked as not_applicable, and remove their subtree of dependent pgoals/intentions. If there are no possible actions that can be performed to achieve the applicability condition, inform the user of the failure of that condition (e.g., no vaccination center has vaccine available).Footnote 56

  6) Adopt U’s pgoals to achieve P as the system’s own pgoals, i.e., pgoal(U, P, Q) → pgoal(S, P, pgoal(U, P, Q)), if P does not conflict with the system’s existing pgoals.Footnote 57

  7) For S’s pgoal to achieve proposition E, S plans to achieve E by finding a (possibly complex) action AS that achieves it, resulting in an intention to perform action AS. If AS benefits the user, S also creates a persistent goal to know whether the user would want S to perform AS.Footnote 58

  8) If S does not know whether the applicability condition AC for AS is true, formulate a question to find out whether it is. Also, block the intention to perform AS until the truth or falsity of AC is believed.

  9) Execute (some of the) intended act(s) AS. We provide details regarding the choice of what intentions to execute below.

  10) Remove the intention to perform AS if AS was done.

  11) If AS is deemed impossible (e.g., the applicability condition for AS is believed to be false), unwind the intention to perform AS via the relativization conditions, continuing to unwind through the chain of pgoals and intentions that depend on AS.

  12) If AS terminates a higher-level act that S was doing, retract the doing predicate and assert that the higher-level act was done.

  13) If the execution of AS fails (e.g., failure has been propagated to Eva from some backend system), remove the intention to achieve AS and plan again to achieve the effect for which AS was planned. Inform the user of the failure of the action (e.g., the credit card could not be charged), and, if the backend provides that information, the reason for the failure (e.g., the card number is invalid).

These steps are repeated until no more formulas have been asserted.
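The control regime behind this loop (apply rules until no new formula can be asserted) can be illustrated with a deliberately tiny, runnable Prolog sketch. The rule/2 encoding, the fact/1 store, and the two sample rules (in the spirit of P12 and P19) are our own stand-ins; observation, execution, and intention revision are omitted.

  % Illustrative sketch of the fixed-point core of the main loop.
  :- dynamic fact/1.

  run :- step, !, run.             % repeat while some rule still adds a formula
  run.

  step :-                          % one rule application that adds something new
      rule(Conditions, NewFact),
      all_true(Conditions),
      \+ fact(NewFact),
      assertz(fact(NewFact)).

  all_true([]).
  all_true([C|Cs]) :- fact(C), all_true(Cs).

  % Two illustrative rules (cf. P12 and P19):
  rule([intend(usr, Act, Q), effect(usr, Act, E)],
       pgoal(usr, E, intend(usr, Act, Q))).
  rule([pgoal(usr, P, Q)],
       pgoal(sys, P, pgoal(usr, P, Q))).

  % Hypothetical initial state:
  fact(intend(usr, get_vaccinated(usr, cvs), q0)).
  fact(effect(usr, get_vaccinated(usr, cvs), vaccinated(usr))).
  % ?- run, fact(pgoal(sys, vaccinated(usr), _)).    % the system has adopted the goal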

8.2 Intention satisfaction, abandonment, and revision

The logic of intention prescribes when intentions must be abandoned. For example, in Step 10, if an intended action has been performed, the intention is removed because it is satisfied. Likewise in Step 11, if the system discovers that an applicability condition for an intended action is false, then (because it can do nothing to achieve false applicability conditions) it concludes the intended action is impossible, and so abandons it and unwinds the subtree of pgoals/intentions that depends on the intention to perform the impossible action. For example, if the system forms the intention to achieve the conditional act, condit(P, Act), and it does not know whether P holds, it forms the pgoal to knowif(S, P), relative to the intention to do condit(P, Act). That pgoal to knowif(S, P) may well lead to an intention to ask a yes-no question as to whether P holds. Because of the chain of relativizations, if the intention to do the conditional action is abandoned, say because S no longer wants the effect of Act to which the conditional Act was relativized, the pgoal to achieve the knowif(S, P) will eventually also be dropped.

Intentions may also be abandoned by the system when users change their mind about their goals. For instance, if the system has a persistent goal to achieve P, relative to the user’s persistent goal that P, and the system comes to believe that the user no longer wants to achieve P, the system can drop its pgoal to achieve P, and unwind the plan subtree that depends on P. The plan subtree that depends on P possibly contains additional intentions and goals that have not yet been acted on, but were placed on the agenda in order to achieve P.

We note that in the scenario described in our example (Section 1.1), U has changed her mind after the system has already affected the environment to achieve U’s previously specified goal (i.e., an appointment for Monday at 8 has been made). In our example, in order to achieve U’s goal of not having an appointment on Monday at 8, the system offers to reschedule U’s appointment, since the effect of the rescheduling action is both that U no longer has an appointment at the old date and time, and also that U has an appointment at a new date and time. Thus, the rescheduling action is an example of an action to ‘undo’ the consequences of S’s having achieved U’s old goal.

Regarding Step 13, we assume for now that the system can in fact execute any of its intentions for which the applicability condition and precondition are true at the time of execution. However, the ability to execute an action does not necessarily translate into successful execution when the execution is delegated to some external agent. For example, assume Eva has gathered credit card information and sends the appropriate request to the credit card company to charge the card, only to receive a failure with a response that the card is over the credit limit. In this case the system’s act failed. Eva would inform the user of that failure as an impossibility because it behaves according to the principles of joint intention theory [1], which entails informing the user when a joint intention is impossible to achieve. However, it would then replan to achieve the higher-level goal of finishing the transaction by asking the user for another credit card (or some other plan). It would not retry that specific card because the intention to charge that card is impossible to achieve.

Icard et al. [107], Shoham [108], and Van der Hoek et al. [109], all consider the problem of intention revision with logics similar to or inspired by the one discussed here. The approach we are taking distinguishes itself in several ways. First, Eva relativizes intentions and goals to one another so that if a pgoal to achieve P is undermined because the intention/pgoal to which it is relativized has been given up, then that pgoal to achieve P may be given up as well. Second, if an intention is adopted relative to a belief, and for some reason that belief is retracted, so may the intention (e.g., “I intend to take my umbrella because I believe it may rain”). We cannot adopt Shoham’s [108] use of a simple database of intentions, because we need to quantify into multiple levels of modal operators (see Appendix A.2). In both van der Hoek et al. [109] and Icard et al. [107], a distinction is made between beliefs that the agent adopts because it believes its intentions will be successful, and beliefs that occur from observing the world. Unlike the cited authors, Eva does not infer beliefs because of the effects of actions stated in the action description. Those descriptions are not dynamic logic expressions per se. The agent has a pgoal for the effects, but until the agent does the action, and determines that it was successfulFootnote 59, it does not come to believe the effect. This simplifies joint intention and belief revision, as well as dialogue. Finally, as emphasized in van der Hoek et al. [109], there may be multiple reasons for forming an intention. They argue that the intention should only be retracted if all those reasons are themselves retracted. In Eva’s case, the system might have two intentions to perform the same action A, each relativized to a different proposition. If one relativization is given up causing its relativized intention to be given up, the other intention to perform A that was relativized differently would still remain.

9 What intentions to execute

At any time, the system may have multiple intentions, including communicative intentions, that could be executed. For example, at the same time it could have a rapport-building action (“that’s terrible…”, “I’m sorry that…”), a confirmation, one or more informatives, a question to ask, and a request to pose. Some of these actions may change the “topic” of discussion, or begin work on a new goal, which can be introduced by discourse markers such as “OK, now”, “so”, etc. There may be mixed-initiative digressions created by the user, as seen in the example above when the user answers the question “how old are you?” with another question “why do you ask?”, with the conversation eventually being returned to the original topic by a repetition of the prior speech act. Thus, there may be many possible choices for what action(s) to do next. Although one could imagine learning a policy for deciding among the actions to execute, the needed data for a new domain would be difficult to obtain, even with user simulators [46, 101]. As a result, we have implemented an initial policy in which Eva executes speech actions from its set of enabled intentions in the following order (with examples drawn from the above dialogue):

  1. Rapport-building (“Sorry to have to ask again”)

  2. Confirmations (“OK, you are eligible for the Covid vaccine”)

  3. Informatives on the same topic (“The reason why ….”)

  4. Repetition of a prior speech act in order to return to a prior goal/topic (“but how old are you?”)

  5. Informatives on a new topic

  6. A single directive or interrogative action, requiring a response from the user (“Are you caring for someone who is disabled?”, “What date would you like the appointment?”). Again, continuing on the same topic is preferred to switching to a new one.

In a single turn Eva may execute multiple speech acts, but no more than one directive/interrogative. The Eva framework includes an initial topic analysis based on the structure of the logical form at issue; expanding this analysis is left for future research (cf. [19, 33]).

10 Use of context

Eva maintains several types of context. First, at the logical form and linguistic levels, it maintains a chronologically-ordered done representation of all utterances and their speech acts, as well as domain actions, that have been performed by the dialogue participants. In addition, it keeps track of pgoals and intentions that it currently has, and also previously had but have been retracted. It also maintains the instantiated intended effects of both parties’ speech acts. For example, the system’s wh-question to the user “how old are you?” is represented as the act:

$$whq(sys, usr, Age\#{years}^{\wedge }age\_of(usr, Age\#years)),$$

whose intended effect the system would maintain in the context database as:

$$knowref(sys, Age\#years^{\wedge}age\_of(usr, Age\#years)).$$

This contextual information is used to derive a full logical form when the user answers a wh-question with a fragment, say “45 years old”. In this case the parser generates the expression 45#years, which is unified with the typed variable Var#Type of the contextual knowref formula above. This determines the predicate in question, enabling the system to identify the user’s speech act as an informref(usr, sys, 45#years^age_of(usr, 45#years)). Likewise, in processing answers to yes-no questions, Eva searches its context database for a knowif(sys, Pred), which, if successful, results in the user’s speech act being identified as an inform(usr, sys, Pred) or inform(usr, sys, ~Pred). Context also informs numerous other aspects of the system’s processing (e.g., to generate anaphoric expressionsFootnote 60).
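The unification-based resolution of fragments can be illustrated as follows; the context/1 store, the operator declaration for the X#Type notation, and the simplified speech-act construction are our own assumptions.

  % Illustrative sketch: resolving a fragment answer against the intended effect
  % of the system's pending wh-question.
  :- op(100, xfx, #).              % typed-term notation Value#Type, as in the paper
  :- dynamic context/1.

  context(knowref(sys, Age#years ^ age_of(usr, Age#years))).

  % A fragment such as "45 years old" parses to 45#years; unifying it with the
  % contextual knowref variable yields the full speech act.
  resolve_fragment(Value#Type, informref(usr, sys, Value#Type ^ Pred)) :-
      context(knowref(sys, Value#Type ^ Pred)).

  % ?- resolve_fragment(45#years, SpeechAct).
  % SpeechAct = informref(usr, sys, 45#years ^ age_of(usr, 45#years))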

11 Explanation

A major advantage of planning-based systems is that they provide an immediate mechanism to support the generation of explanations [110]. Unlike black-box machine-learned systems (e.g., [46]), the present system has a plan behind everything that it says or does, such that it can answer questions like “why did you say that?”. The explanation finds the path in the plan starting from the action being referred to, and follows the chain of achievements, enablements, and relativizations backwards to the intentions and persistent goals that led to the action to be explained. For example,

  • S: “how old are you?”

  • U: “why do you ask?”

  • S: “The reason is that I need to determine whether you are eligible for the Covid vaccine”

The pgoals needed to answer this request for explanation are described in Section 7. What would constitute a “good” explanation is a subject of considerable research [67, 111,112,113]. For example, in the dialogue above, a good explanation would not be that the system asked for the user’s age because it wanted to know the answer! Although that is the closest pgoal in the plan, it is likely something the user already knows (S and U are both taken to know what the desired effect of a wh-question speech act is).Footnote 61 On the other hand, a good explanation is also not that the system asks the question because the user wants to be vaccinated, which is at the top of the plan tree. For the answer to the explanation request to be reasonable, Eva finds the lowest pgoal in the plan whose content the user does not already believe. In this case, Eva provides the explanation that the precondition for being vaccinated is eligibility.
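A minimal sketch of this selection criterion (purely illustrative; the plan structure and names below are hypothetical) walks up the plan from the questioned action and returns the lowest pgoal whose content the user is not taken to believe:

# Sketch (not Eva's code): choose the content of an explanation by walking up
# the plan from the questioned action.

def explain(action, parent_of, content_of, user_believes):
    """parent_of maps a plan node to the pgoal/intention it serves (via
    achievement, enablement, or relativization); content_of gives each node's
    proposition; user_believes stands in for Eva's model of the user's beliefs."""
    node = parent_of.get(action)
    while node is not None:
        content = content_of[node]
        if not user_believes(content):
            return content          # lowest pgoal the user does not yet believe
        node = parent_of.get(node)
    return None                     # nothing informative to say

# Toy instance of the dialogue above (hypothetical plan structure):
parent_of = {"ask_age": "know_age", "know_age": "determine_eligibility",
             "determine_eligibility": "user_vaccinated"}
content_of = {"know_age": "sys knows the user's age",
              "determine_eligibility": "sys needs to determine eligibility",
              "user_vaccinated": "the user wants to be vaccinated"}
already_known = {"sys knows the user's age", "the user wants to be vaccinated"}
print(explain("ask_age", parent_of, content_of, lambda p: p in already_known))
# -> 'sys needs to determine eligibility'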

The system also engages in proactive explanation by informing the user of preconditions of actions as goals the system needs to achieve. For example, in an insurance domain, it says: “In order to help you file a claim for this incident, I need to have identified your phone”. Importantly, as with any action, the system checks whether the effect is already true, i.e., for an inform, whether the user already believes the propositional content. In particular, some of the preconditions are stated as being general common knowledge. For example, to have someone receive a vaccination at a vaccination center, the person needs to be located at the vaccination center. Because that proposition’s being a precondition (as opposed to its truth) is common knowledge, the system does not plan an inform speech act of the need for this proposition to be achieved. But notice that it does recognize the user’s plan to achieve her being at the vaccination center, and collaborates by telling the user what she would need to know, namely how to drive there. We leave further work on explanation to future research. However, we note that the overall planning-based dialogue framework inherently supports explanation since elements of the plan are causally linked to one another.

12 Related work

In this section we discuss four threads of research that strongly relate to the topics discussed here: epistemic planning, slot-filling dialogue systems, plan-based dialogue systems, and today’s prevalent approach to building conversational assistants, which relies on pre-trained large language models (LLMs) using the transformer architecture [115].

12.1 Epistemic planning

The planning community has considered planning coupled with sensing actions in order to overcome incomplete knowledge (e.g., [50, 66, 116, 117]). Epistemic planning systems generate plans to influence agents’ beliefs, both those of the planner and of other agents. We argue that an essential feature for any epistemic planner to support a dialogue system is that it be able to handle the incomplete belief/knowledge described by knowref, and be generalized to modal operators other than belief. However, most epistemic planning systems rely on the content of a belief (or multiply-embedded belief operator) being propositional (i.e., not first-order) and therefore do not support such reasoning. There are a few notable exceptions, including the work of Liberman et al. [118] who studied a first-order extension of dynamic epistemic logic (DEL) and appealed to term-modal logics to define the semantics for their language and corresponding epistemic planning system. Liberman et al.’s framework was applied to epistemic social network dynamics, which concerns the flow of information and knowledge through social networks, and how individuals’ and groups’ beliefs and behaviors are impacted (e.g., one might wish to model the spread of misinformation) [119]. Eva goes beyond Liberman et al.’s work by offering a complete implementation of a collaborative dialogue system. Importantly, in addition to belief or knowledge, Eva can reason with a number of modal operators essential to collaborative dialogue (e.g., persistent goal, intention) and can quantify into all of these modalities. Finally, Liberman et al. [118] offer a compact characterization of action schemas, inspired by PDDL, which bridges between research on planning formalisms and DEL. Future work could explore whether Liberman et al.’s action and domain representations could be used for our collaborative dialogue purposes.

The closest epistemic planning work to ours is the PKS planner of [66, 117, 120]. Its developers built a modal logic-based dialogue planning framework that uses one database to keep track of an agent’s knowing whether a proposition is true, along with a different database for an agent’s knowing the values of terms (cf. [51]). Driven by PKS, Petrick and Foster [121] developed an impressive human-robot social interaction system in a bartending domain. While the problems are similar to the ones we tackle, the approach itself is limited by its single level of databases and by its use only for the system’s belief modality. In order to plan speech acts in dialogues, Eva’s planning system needs to create plans to influence another agent’s beliefs, (persistent) goals, and intentions. Specifically, we showed here and in Cohen [48] the desirability of having quantifiers whose scope includes multiple modal operators. Consider, for example, the representation of “John wants to know whether Mary knows the date that Sue wants to eat”, which could lead to the question to Mary “when does Sue want to eat?”. To represent this in a database-based approach would require a goal database within which is a know-if database, containing a know-value database within which is a goal database. Moreover, one could in fact quantify over agents, as in “John wants to know who knows the secret”; then it is not clear which database should be used. Eva currently reasons and creates plans with quantified-in formulas directly, without such databases.
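For concreteness, and using the notation of Section 10, the embedded-attitude example above could be rendered roughly as follows (eat_on is a hypothetical predicate, and the exact operators Eva would use may differ):

$$goal(john, knowif(john, knowref(mary, D\#date^{\wedge} goal(sue, eat\_on(sue, D\#date)))))$$

Note how the nesting mirrors the goal/know-if/know-value/goal layering of databases that would otherwise be required, and how the embedded knowref content is exactly what a subsequent whq to Mary would carry.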

12.2 Slot-filling dialogue systems

The dialogue research community has concentrated on a form of task-oriented dialogue that emphasizes slot-filling, which dates back to the Gus frame-based dialogue system [122]. Wen et al. [46] say this about task/goal-oriented dialogue (emphasis ours):

Given a user input utterance, ut at turn t and a knowledge base (KB), the model needs to parse the input into actionable commands Q and access the KB to search for useful information in order to answer the query. Based on the search result, the model needs to summarise its findings and reply with an appropriate response mt in natural language.Footnote 62

Building a system solely to execute actionable commands is a very limited conception of goal-oriented dialogue. For example, the original study of task-oriented dialogue by Grosz [123] had the system giving the user instructions on performing a task (assembling an air compressor). Numerous researchers (e.g., [14, 19, 124]) have emphasized the need for collaborative tasks, in which both parties can get the other to perform actions, though typically they have not emphasized slot-filling per se. Our framework is squarely in the camp of collaborative dialogue, for which slot-filling is a necessary component.

Although slot-filling is an important step, intent classification and slot-filling are only part of what it takes to engage in a task-oriented dialogue. The Dialogue State Tracking Challenge [55, 125] attempts to standardize a corpus-based test for systems that acquire values for slots. The most explicit definition of “slot” we can find is from Henderson [55] in describing the Dialog State Tracking Challenge (DSTC2/3):

The slots and possible slot values of a slot-based dialog system specify its domain, i.e. the scope of what it can talk about and the tasks that it can help the user complete. The slots inform the set of possible actions the system can take, the possible semantics of the user utterances, and the possible dialog states…

The term dialog state loosely denotes a full representation of what the user wants at any point from the dialog system. The dialog state comprises all that is used when the system makes its decision about what to say next.

Under this approach, slots are considered to be parameters of an action that either are filled by an atomic symbol, are unfilled, or are filled by the atoms dontcare, dontknow, or none. As we discussed in Cohen [48], the meaning of these values, or of their absence, is unclear: are they quantified variables? Is there a negation somehow involved in dontcare and dontknow? If so, how are those embedded negations used in reasoning?

Overall, this intent+slot meaning representation is far too limiting: for example, it does not handle true logical forms, including Booleans, conditionals, superlatives, comparatives, temporal qualifications, etc. More difficult still, dialogues often provide constraints on the values of slots rather than an atomic value such as ‘7 pm’. As discussed in the introduction, such situations may lead to the conversants collaboratively filling the slot, rather than just one party’s doing so. Our framework offers a more general approach.
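To illustrate the difference (a made-up example, not drawn from any particular system), consider a user who says “any time after 7 pm, but not on a Friday”: a slot-filling representation must force this into an atom or a sentinel value, whereas a constraint-based representation keeps it as predicates over a still-uninstantiated value, which the parties can narrow jointly:

# Illustrative contrast between an atomic slot value and a constrained one.
from datetime import time

# Slot-filling view: the value is an atom (or a sentinel like 'dontcare').
atomic_state = {"appointment_time": "19:00"}

# Constraint view: the value is only constrained so far.
constraints = [
    lambda t, day: t >= time(19, 0),   # after 7 pm
    lambda t, day: day != "Friday",    # not on a Friday
]

def satisfies(t, day):
    return all(c(t, day) for c in constraints)

print(satisfies(time(19, 30), "Tuesday"))   # True
print(satisfies(time(18, 0), "Tuesday"))    # False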

12.3 Planning-based dialogue systems

Although we have provided extensive references to planning-based dialogue systems throughout the paper, there are a number of important works with which to compare. The closest implementations have been the ARTIMIS system [39], the systems from Allen’s group at the University of Rochester (TRAINS, TRIPS, COGENT), and a recent plan-based dialogue system from IBM [126].

First, there have been research systems that partially attain some plan-based capabilities. RavenClaw [127] employed fixed hierarchical descriptions of dialogue moves, but did not engage in on-the-fly planning and reasoning. TrindiKit [128] provides the dialogue system developer with generic tools to build rule-based dialogue systems using a simplified dialogue state and set of communicative actions. While these systems were worthy developments, they eschewed representations of the participants’ mental states, which Eva adopts and reasons with directly.

Beyond the early plan-based dialogue work at the University of Toronto, the first system to incorporate a variant of the Cohen and Levesque [1] logic was ARTIMIS [30, 39], which reasoned about beliefs and intentions for a deployed system that engaged users in spoken dialogue about finding services in the French Audiotel telephone network. The system engaged in mixed-initiative question-answering, and had to deal with substantial numbers of speech recognition errors. While very similar in spirit, our framework develops and makes more extensive use of the plan structure, collaboration (plan recognition, plan debugging, planning), goal/subgoal relativization, and explanation. The use of quantified modal operators in Eva is also more extensive, especially as applied to quantification through multiple bel/pgoal/intend operators.

Lemon et al. [129] developed the WITAS multimodal conversational system that drives an autonomous helicopter. WITAS bears some similarities to Eva in its ability to reason about domain actions and to maintain multiple threads of conversation. The system is a derivative of the Information State approach to dialogue [128]. The Information State that WITAS maintains includes an Activity Tree, a Dialogue Move Tree (DMT), a System Agenda, a Pending List, a Salience List, and a Modality Buffer. The Activity Tree corresponds to Eva’s hierarchical action expressions, but appears to allow only decompositions into sequential actions, versus Eva’s use of sequential, conditional, and disjunctive actions. The actions are described as having preconditions and effects, but they are not composed on the fly into future-directed plans. Actions can be obviated if their effects are found to be true in the system’s database, but there is no mechanism to create goals to find out whether the preconditions are true if the system cannot prove those preconditions. Thus, unlike Eva, the robot system is essentially operating in a closed world. The Dialogue Move Tree precomputes how user speech acts can “attach” to nodes in the DMT, though which node receives the attachment seems to depend solely on the dialogue act types rather than on the effects of those acts (which are not specified). The generation of output employs the Gemini unification grammar [130], which renders logical forms as utterances. This is more general than Eva’s natural language generator, which uses a structural decomposition of logical forms to generate simple utterances. WITAS’ LFs are either on the System Agenda or on the Pending List, which stores questions that the system has asked (but not those that the user has asked). Eva’s pending list keeps track of questions from both parties, and also incorporates requestive speech acts, enabling it to engage more generally in multi-threaded interactions. WITAS’ generation process may “aggregate” multiple clauses, rendering “I will fly to the tower and I will land at the parking lot” as “I will fly to the tower and land at the parking lot” (see also [71] for similar plan optimizations). Finally, anaphoric expressions are generated if an object in a logical form is at the top of a “salience list”. Eva generates an anaphoric expression for a logical form element if that element is present in the LF of the last dialogue turn (including both parties’ contributions). Apart from these differences, WITAS appears to do no plan recognition, obstacle detection, or helpful behavior. Finally, its state representation covers only facts that the system believes to be true and goals that the system is pursuing, not a general-purpose representation of both the system’s and the user’s mental states.

An interesting and useful point of comparison is the COGENT framework [19], the latest evolution of the TRIPS system developed by James Allen’s team at the University of Rochester and, more recently, at the Institute for Human and Machine Cognition (IHMC). Like Eva, COGENT specifically addresses the need for dialogue systems to tackle much more complex tasks than the current generation of conversational assistants is capable of [131]. Interestingly, COGENT evolved from the same ideas of using planning and plan recognition to drive dialogue [9], with speech acts serving as essential operators in such a plan [8, 9, 13]. Unlike Eva, however, the TRAINS/TRIPS/COGENT series of systems placed emphasis on natural language understanding, built around the TRIPS parser [131, 132], to create a very rich, domain-independent utterance meaning. The pursuit of domain independence led the group towards a theory of dialogue based on Collaborative Problem Solving (CPS), whereby the agent’s competence in carrying out collaborative dialogue is completely separated from its competence to carry out domain actions. Thus, in COGENT the dialogue model proper is incorporated in the CPS Agent (CPSA), whereas domain-specific planning and execution are incorporated in a separate Behavioral Agent (BA). Dialogue in COGENT is reflected in updates to the CPS state, which keeps track of the status of joint intentions, and those updates happen only after fairly extensive communication between the CPSA and the BA. In this model, a user’s communicative act leads to a CPS act that changes the CPS state, which in turn leads to a change in the problem-solving state of the BA; generation of communicative acts by the system follows this path backwards.Footnote 63 This division of labor, between managing the dialogue itself and the generic linguistic aspects of the interaction on the one hand, and managing the individual problem-solving aspects of the system on the other, is meant to make the COGENT framework attractive to potential developers of collaborative dialogue systems without requiring them to have sophisticated linguistic expertise. However, they would still need to master the complex logical form language embedded in the CPS acts to create mappings to and from their internal representations. And, when utterances in their chosen domain fail to parse, they are out of luck.Footnote 64

By contrast, Eva uses an ML-based semantic parser, which may generate a less detailed LF, but is much easier to train and improve. Although Eva currently does not have as explicit a model of joint intention as COGENT does, its dialogue model does, in fact, ensure that the system’s and user’s plans mesh appropriately in true collaborative fashion. Eva also has a more explicit model of the participants’ beliefs and intentions, which provides it with a solid basis for implementing additional reasoning about joint intentions and shared knowledge, if need be (cf. [133]). In fact, COGENT lacks a deep model of the user’s beliefs, relying on the BA to maintain one. The CPSA’s dialogue model does not depend on having a “theory of mind”, which is a potentially severe limitation. For example, COGENT cannot model a multi-user conversation, in part because it cannot distinguish the different users’ mental states. We think these mental states are crucial, so much so that they are at the core of Eva’s planning-based dialogue model. Whereas Eva has the ability to explain its own behavior, COGENT’s current model has the ability to justify answers and proposed modifications or failures of the CPS acts, but in its current incarnation it cannot answer a question such as “why do you ask?”.Footnote 65

Finally, we consider the recent work of Muise et al. [126], which we will refer to as PGODS (based on the paper title). This work is very much on the same track as ours, but concentrates on the planning aspect rather than the dialogue per se. PGODS attempts to provide a high-level specification methodology for simple dialogue systems centered around a restricted form of planning. For PGODS, the system developer specifies dialogue actions and representations of back-end actions, as well as system responses for each, from which a planner essentially compiles a large tree of all possible actions/utterances the system and end user can take. Based on this, the authors nicely show how a planning-based approach is more compact than having to specify each step of a scripted dialogue.Footnote 66 Prominent among the approaches to dialogue that PGODS leverages is the classical “intent+slots” approach, which we discussed in Section 12.2. Unlike those approaches, the PGODS approach allows one to specify more general types of dialogues, though these are still restricted in their complexity.

A few specific comments will serve to differentiate the PGODS approach from ours. First, the FOND (Fully Observable Non-Deterministic) planner employed [134] uses the PDDL language [83] for expressing preconditions and effects, which allows only atomic symbols to be expressed, rather than the modal logic expressions that Eva uses. This greatly restricts the logical forms that the system can handle and the kind of reasoning it can support. PGODS (and PDDL) does not distinguish between preconditions and applicability conditions, such that the former can be made true, but the latter cannot. Likewise, Eva’s use of hierarchical action descriptions is more expressive than PDDL (and its hierarchical variant, HDDL). The planning formalism used by PGODS encodes all of the possible outcomes of an action in that action’s definition, and this would need to be done for every action. For example, all the possibilities that might arise when an action is deemed to be impossible are folded into that action’s PDDL representation. For all such conditions, PGODS assumes that the developer specifies the system’s response utterances by hand. In Eva, these are technically not part of the plan but result from its execution and are handled by the BDI architecture, which itself depends on the semantics of intention (specifically, the conditions for giving up an intention).

A very substantial difference between the PGODS system and Eva is the representation of incomplete knowledge about the user’s mental states. In particular, the PGODS system appeals to the well-known 0-approximation [135] and represents knowledge and uncertainty via a simple 3-valued logic that supports the use of efficient planning tools. In contrast, Eva’s modal logic framework offers a much richer way of encoding (multiple) agents’ knowledge as well as other mental states such as beliefs, goals, and intentions.

Finally, while not its focus, the PGODS system does not handle multi-agent settings, plan recognition, obstacle detection, goal adoption, or collaboration. There are many more differences, but suffice it to say that Eva operates at a more theoretically grounded level.

Recent systems [136, 137] have parsed utterances into “data flow” graphs that provide a graph query or retrieval “plan” of execution or operations on that graph. Data flow from one query action to another is exactly analogous to shared logical variables in a unification framework. Indeed, the parsing of utterances into logical forms that are executed by a Prolog-based interpreter, dating back at least to the Chat system of Warren and Pereira [138], essentially provides “data flow” during execution. The representation of meta-operations on the dataflow graph, such as referring to an entity, or replacing one entity or subset of logical form elements with another, can be handled by conjoining additional constraints and by replacing predicates in logical forms with others. More generally, the above-cited works do not discuss domain or speech act planning, plan recognition, or collaboration, and thus do not provide a framework for collaborative planning-based task-oriented dialogue.

12.4 Neuro-symbolic dialogue systems

The current trend towards developing conversational assistants is to use prompt-based LLM frameworks for most functionality, with tools (function calling) for interfacing to external services, databases, or knowledge bases. The LLM functions as the de facto agent, its actions being either specified in the prompt or implemented via the aforementioned tools; hence, such systems are often referred to as agentic or simply as LLM agents. To the extent that such tools involve forms of symbolic computation, these approaches qualify as neuro-symbolic approaches. In some situations, it may be useful to split functionality among different LLM sub-agents, managed by a master LLM agent, in a manner similar to distributed AI applications of yore (see, e.g., O’Hare and Jennings [139]). However, this approach should not be confused with one that involves multiple autonomous agents collaborating together and with one or more humans. Whereas there are many possible configurations for such agentic systems (platforms, models, prompts, tools, etc.), the distinguishing feature is that LLMs play the dominant role in reasoning and behavior, with symbolic components, when present, relegated to a subservient role. Thus, we can draw on the substantial body of research on LLMs (and on our own experience) to make some general comments on such approaches and contrast them to the Eva framework, which implements reasoning and agency in a formal, symbolic system, with LLMs playing important, but non-central roles.

The system prompt for an LLM agent is used to describe its core competencies, in particular its domain knowledge and actions it may take (via tools or via its generative capabilities), largely in the form of a natural language description. It may also include data, additional documents with background knowledge, and examples of usage. All these serve to constrain the otherwise unrestricted generative capabilities of the pretrained LLM to the application domain. In Appendix D.1 we provide an example of a system prompt, usable with multiple LLMs to create a dialogue system that can partially handle conversations in the Covid vaccination domain similar to the example in Section 1.1.

There are obvious advantages to this approach: a minimal NL specification of the agent’s capabilities can be provided even by non-experts, thus enabling very fast development.Footnote 67 It is assumed that LLMs have already absorbed large amounts of commonsense knowledge, so there is no need to specify it in detail, or at all. In contrast, formal reasoning requires a lot of detail to work; formalizing this knowledge and making sure the resulting theory is sound and adequate for the envisioned application is generally seen as a difficult task, performed by expert knowledge engineers. Also, LLM agents generally do not require the inclusion or specification of a dialogue model – it is assumed that LLMs know how to carry out dialogue by virtue of having been trained to engage in conversational behavior.Footnote 68 Importantly, all LLM-based TOD systems we are aware of use the slot-filling approach described in Section 12.2 to model dialogue state, and thus retain the shortcomings we identified earlier. But, even with this limitation, to trust that an LLM agent will behave appropriately during a task-oriented dialogue with a user, we have to assume that it will follow prompt instructions, will understand correctly what users want, will use the correct knowledge (via its latent representations), will perform sound inference with that knowledge, will perform the correct actions at the right time to achieve users’ goals, and will perform the appropriate communicative actions to allow users to track progress on the task. As we describe below, these assumptions are not fully warranted.

LLMs as actors

Tools are at the core of the agentic approach: they help the otherwise disembodied LLM sense and act on its environment, and access real-time data about it. For all intents and purposes, tools are the means by which agency is conferred to LLMs in most TOD systems. Yet evidence suggests LLMs struggle to call tools appropriately: they can invent non-existent ones, and they pass incorrect arguments to them. τ-bench [143] is a dataset specifically designed for the evaluation of LLMs’ ability to use tools while engaged in task-oriented dialogues in two domains: retail (ordering products, canceling/modifying orders, etc.) and airline ticket reservations (including canceling/modifying them, getting refunds, etc.). The domains only have a relatively small number of tools available (15 and 13, respectively), yet even the most recent and powerful LLMs struggle to call them correctly.Footnote 69 Error analysis is not generally available, but Yao et al. [143] estimate that about 45% of task failures can be broadly attributed to reasoning/planning failures, and over 55% of them can be attributed to incorrect tool use. This is not reassuring. In the Eva framework none of the above problems can occur: first, only properly instantiated actions are doable, with all the required parameters instantiated (i.e., the system knowrefs their value), and all their preconditions satisfied; second, only actions that are necessary for moving the task state towards the user’s goal are ever executed; and third, Eva’s collaborative action and dialogue model ensures that its plans are transparent and mutually agreed upon.Footnote 70
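A simple sketch of the first of these points (the action encoding below is hypothetical, not Eva’s internal form): an action is doable only if every parameter is instantiated and every precondition is believed, so a malformed or premature call can never be issued:

# Sketch (hypothetical names): check doability before executing any action.
def doable(action, beliefs):
    params_known = all(v is not None for v in action["params"].values())
    preconds_hold = all(p in beliefs for p in action["preconditions"])
    return params_known and preconds_hold

book_appointment = {
    "name": "book_vaccine_appointment",
    "params": {"center": "Downtown Clinic", "date": None},  # date not yet known
    "preconditions": ["eligible(usr)"],
}
beliefs = {"eligible(usr)"}
print(doable(book_appointment, beliefs))   # False: the date is uninstantiated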

LLMs as planners

Despite substantial effort being spent trying to demonstrate that LLMs can plan, there is plenty of evidence to the contrary [145,146,147]. Even the most sophisticated models struggle to develop correct plans that have more than 4-5 steps. Moreover, experiments show that small deviations in problem formulation, which do not change its semantics, drastically reduce performance, which suggests that LLMs are likely performing a form of approximate retrieval [148, 149]. Crucially, LLMs do not plan the way formal planners do; in particular, there is no guarantee that the plans they generate are actually realizable. This can be readily demonstrated by removing tools that define actions. For example, if the LLM agent for the vaccination domain lacks tools for checking vaccine availability or making appointments, it should reject such goals; instead, the agent plans to achieve them (e.g., by saying “I’ll verify if the selected center has vaccines in stock.”) simply based on the text of its instructions (see Dialogue #2 in Appendix D.2). Eva’s planner reasons through all action preconditions, so it would never plan to execute an action that can be determined (statically) to not be doable. Furthermore, as we described in Section 6, Eva can inherently handle multi-agent collaborative plans and conversations; currently this problem is hardly given any attention by advocates of LLMs as planners.

LLMs as reasoners

Planning is but one form of reasoning, and even more considerable effort has been spent on the ability of LLMs to perform correct inference, despite evidence that LLMs are not genuine reasoners (e.g., [150,151,152]) and that their performance on common benchmarks might be inflated due to data contamination [153]. Recently large reasoning models (LRMs) have emerged, which are essentially LLMs with additional training for generating step-by-step solutions to various problems, primarily in coding, math and other STEM domains (e.g., o1 and o3 from OpenAIFootnote 71, and DeepSeek’s R1Footnote 72), and, crucially, with the ability to consider and choose among multiple possible solutions at test time. Whether reasoning capabilities learned from traces of solutions for math problems can transfer to other situations remains to be seen. While LRMs are generally found to perform better than regular LLMs on reasoning benchmarks, the current evidence is that the performance of all models drops precipitously when the complexity of the problems increases [154]. This is evidence that scaling up (larger models, with more test-time compute) is unlikely to lead to sound, reliable logical reasoning. At the same time, LRMs’ use of test-time inference leads to very high latency – tens of seconds to minutes – which makes them completely unsuitable for use in dialogue systemsFootnote 73 (costs for proprietary models are also significantly higher than for non-reasoning LLMs). For this reason, we have not yet tested LRMs, but we have carried out reasoning experiments with several top-tier LLMs and found them lacking. To our surprise, even the simple eligibility criterion for vaccination (over 65 years old, or over 50 and caring for someone with a disability, or working in an essential role) consistently posed trouble to all the models we tried. The two most frequent errors were not recognizing that someone who is an essential worker is eligible regardless of age, and that for persons under 50 it does not matter whether they care for someone else or not (we show examples of each in Appendix D). Symbolic reasoners do not make these kinds of mistakes; they offer guarantees and are orders of magnitude faster. They are also debuggable: had Eva made such reasoning errors, they would be straightforward to fix. However, when LLMs make reasoning or other kinds of errors, it is not at all clear how to correct them.
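For reference, the eligibility criterion just mentioned is a trivial Boolean rule; written out explicitly (our own encoding, not Eva’s internal form), the two error patterns are easy to check:

# The eligibility criterion stated above, as a Boolean rule of the kind a
# symbolic reasoner evaluates without error.
def eligible(age, cares_for_disabled, essential_worker):
    return age > 65 or (age > 50 and cares_for_disabled) or essential_worker

# The two cases the LLMs most often got wrong:
print(eligible(age=30, cares_for_disabled=False, essential_worker=True))
# -> True: an essential worker is eligible regardless of age
print(eligible(age=45, cares_for_disabled=True, essential_worker=False))
# -> False: under 50, so caring for someone is irrelevant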

LLMs as knowledge bases

Even if the capacity of LLMs to reason were to improve, there remains the problem that LLMs – and, as a consequence, LLM agents themselves – are not capable of maintaining a (mostly) consistent set of beliefs about the world. Their training data is not constructed to promote such consistency, and there is no mechanism to “erase” incorrect beliefs. Thus, even correct reasoning steps may result in incorrect results. LLM agents can be endowed with tools to retrieve knowledge from external sources, either in structured form (e.g., Knowledge Graphs, databases) or in unstructured form (documents). In particular, retrieval-augmented generation (RAG) [155] is often used to retrieve relevant information from data stores, which is then passed on to the LLM as additional context for generation. RAG is mostly used for question answering, but can be used in TOD systems as well, for example to replace database lookup with an approximate lookup [156]. While these methods do, in general, augment LLMs with information that is more relevant, accurate and up-to-date than the knowledge embedded in an LLM’s parameters, they are no panacea: retrieval methods are imprecise, so they offer no guarantees, and, even when retrieval is successful, during generation LLMs often combine irrelevant internal knowledge with that which was retrieved, ignore or misconstrue the retrieved information, etc.Footnote 74 Systems based on formal methods, like ours, can carry out truth maintenance and belief revision to keep their knowledge internally consistent. Moreover, if new knowledge is provided by the user that conflicts with the system’s own knowledge, our system might confront the user and seek clarification, or reject the new knowledge altogether. LLMs, on the other hand, are prone to accept users’ stances uncritically and modify their responses to align with those stances (a phenomenon dubbed sycophancy [158]).

LLMs lack a theory of mind

Despite their linguistic capabilities, and despite claims that Theory of Mind (ToM) capabilities have emerged in LLMs [159, 160], LLMs have significant limitations in understanding and reasoning about others' mental states. Recent studies (e.g., [161]) demonstrate that even the most advanced models struggle with ToM tasks, particularly as scenarios become more complex. This limitation is especially pronounced when dealing with multiple agents possessing different knowledge states. Galitsky's recent research [162] identifies several critical weaknesses: LLMs often fail to generate accurate inferences about mental states, struggle to resolve ambiguity between competing possibilities, and hesitate to commit to single interpretations even when they can generate correct explanations. This “reluctance to prioritize or commit to the likeliest inference” fundamentally undermines their ability to perform robustly in tasks requiring nuanced social understanding and contextual reasoning. The core issue is that LLMs lack the cognitive architecture necessary to maintain coherent representations of beliefs, intentions, and knowledge across conversational contexts. While recent work [161] has introduced “thought-tracing” as an inference-time reasoning approach to approximate mental state attribution, these techniques still fall considerably short of the ToM capabilities inherent in symbolic frameworks like Eva, which explicitly represent and reason about mental states. This deficiency may be particularly apparent in multi-party conversations, where tracking different participants' knowledge states becomes essential but challenging for current models. Such limitations severely impact their ability to understand users' intentions and commitments in collaborative dialogues, a critical aspect of the intentional structure identified by Grosz and Sidner [33] as fundamental to meaningful human communication. In contrast, Eva's formal symbolic representation of beliefs, goals, and intentions enables precise reasoning about mental states across multiple agents, which is essential for collaborative task-oriented dialogue where understanding users' intentions, not just their utterances, is paramount.

LLMs cannot explain their behavior

Reviewing the different definitions of explainability found in the literature, Rosenfeld and Richardson [163] proposed that the core function of an explanation is for the explainee to understand the explainer’s logic. On this account, LLMs cannot properly explain themselves, since they do not have access to their internal logic; the explanations they generate are more akin to post-hoc rationalizations, which may well be persuasive, but which cannot be said to faithfully describe what led to the generation of the explanandum. Moreover, when LLMs produce factually incorrect output, they are happy to provide an explanation for it [164]. There is ample evidence that systems able to explain their behavior are trusted more by users; conversely, when users have little understanding of how a system works, they tend to distrust it and potentially stop using it [165]. This can be particularly problematic for LLM agents, which make mistakes very unlike the ones made by humans. Of course, explainability and interpretability are important not just to users, but also to developers, for diagnosis and repair [166]. Eva systems’ decisions and actions are fully explainable (Section 11) and traceable to facts and rules in their knowledge base.

LLMs as dialogue managers

It appears to be an article of faith that LLMs have close to human-level ability in understanding and generating language. Dialogue is a cooperative activity, which imposes certain expectations on the participants’ behavior [167]. Participants should, at the very least, track what the current topic of the conversation is, and should not bring up information that is not relevant to the topic, or switch haphazardly from topic to topic. We find that LLMs are often not very effective communicators: they tend to be too verbose, potentially overwhelming the user with information, and repetitive, disregarding the fact that previously provided information is already in the common ground. Referring again to the criteria for eligibility in our vaccination scenario, the Eva system reasons through the definition, infers what it needs to know, and plans to ask for specific information (age, occupation, etc.). We find that LLMs are rarely capable of doing that; instead, they simply rephrase the full eligibility criteria given in the prompt (even when we add additional disjuncts). In Fig. 7 we show an excerpt from a dialogue using Google’s Gemini 2.0 Flash that evidences multiple breakdowns in the conversation, as well as other problems of understanding and reasoning (the full dialogue and other examples can be found in Appendix D.2). The fact that such errors are even possible is as clear an indication as any of the difference between LLMs’ approach to generating language, backed by text (from context, instructions, etc.), vs. systems like Eva, where dialogue is backed by an intentional structure (cf. Grosz and Sidner [33]).

Fig. 7 Excerpt from a dialogue with Gemini 2.0 showing multiple conversational failures (in red)

In summary, whereas we recognize that LLMs are a powerful technology that has opened up the field to applications that would have been difficult to imagine with prior NLP technologies, we do not see them as the solution to all problems. LLM agents may work fine for open-ended dialogue and for many question answering applications (though factuality is a problem for these settings, as well), but, in our opinion, there are many downsides to their use to drive the behavior of goal-directed dialogue systems, particularly in domains where the cost of errors could be significant (e.g., healthcare, financial transactions, logistics). We have so far limited their use to: (i) semantic parsing, which is a critical component in our architecture, due to LLMs’ ability to robustly handle variation in natural language; and (ii) surface generation, due to their ability to rephrase text to sound more natural. In both of these components we are also relying on LLMs’ ability to translate between languages (e.g., NL to the formal language of logical forms). The implementation of a dialogue system requires extending Eva’s ontology and defining domain-specific predicates and actions. We see potential in the use of LLMs to help with this knowledge acquisition task (cf. [168]), and we are already pursuing research in this area. While most knowledge-based systems have assumed complete knowledge, which made them prone to sharp drops in performance at the edge of their competence, we are also envisioning the possibility of incorporating LLMs as soft reasoners for limited cases. We are taking a cautious approach to introducing LLMs into the Eva architecture, as we would like to retain the desirable properties of the symbolic approach while extending its capabilities with those functionalities that neural techniques such as LLMs can provide robustly and predictably. Somewhat similar ideas about neuro-symbolic integration form the basis of the LLM-Modulo framework proposed by Kambhampati [147], although that framework is specifically for solving planning and scheduling problems, and limits the role of symbolic solvers to that of verifiers of plans produced by LLMs. Mahowald et al. [169] find LLMs to be proficient at formal linguistic competence (the rules of language), but deficient at functional competence, i.e., the ability to use language in real-world situations (formal reasoning, world knowledge, situation modeling, social reasoning). Consistent with how the human brain separates linguistic from non-linguistic cognitive capabilities, they call for modular architectures that “integrate language processing with additional systems that carry out perception, reasoning, and action”. The Eva framework is one such modular neuro-symbolic architecture.

13 Limitations

The current implementation of the Eva framework does have limitations. Eva incorporates algorithms for which some researchers have surely built superior renditions for isolated situations; however, those have not typically been adapted for dialogue. For example, better probabilistic planners and plan recognizers have been developed for academically interesting problems, yet their limitations may preclude their use here if they have not been adapted for multi-agent interaction, or for reasoning with adequate representations of mental states. Of course, it goes without saying that such a system will eventually need to deal with uncertainty. We have left room for probabilistic reasoning, as well as utilities, with regard to planning and plan recognition. Likewise, the belief operator takes a probability argument, and reasoning could in principle take advantage of it. However, Eva often has to deal with embedded beliefs and goals, and it is still an open question how probabilistic reasoning would incorporate those embedded operators. Still, we believe the basic structure of this system can function quite well until researchers have further developed probabilistic multi-agent reasoning for dialogue, which would then require a massive data collection effort in order to incorporate reasonable probabilities. Our contribution here is to situate the problems and initial approaches in the context of a useful cooperative dialogue system. Other components incorporated here that could use improvement are the modal logic reasoner (e.g., to deal better with equality, negation, defaults, uncertainty, and causality) and the semantic parsing of natural language (e.g., to handle anaphoric expressions, failed presuppositions, multi-utterance speech acts, multi-act utterances, time and tense, etc.).

The LLM-based natural language parser needs to be trained to produce logical forms appropriate to the domain, which typically involves the collection of considerable amounts of data to achieve robustness to a wide variety of user inputs. We have been able to successfully supplement such training data with LLM-based synthetic data generation, and we expect further work in this area will likely ease the process and improve performance. Likewise, the language generation component can be improved to better deal with a variety of syntactic constructions: grounding (cf. [53, 170]), confirmations [4, 171], and generation of anaphoric expressions [172] and discourse markers. Currently both parsing and generation process full utterances, which does not allow us to model phenomena such as concurrent utterances (which are common in spoken dialogue). A useful improvement, which could also reduce latencies and increase multi-modal interaction naturalness, would be to process inputs and outputs incrementally (cf. [173, 174]).

Another ongoing topic of our research is to build an LLM-assisted tool to rapidly acquire the knowledge needed for instantiating Eva to a specific domain (ontology, predicate definitions, actions, as shown in Appendix C). This would greatly improve the efficiency with which Eva-based conversational assistants can be developed, tested, and maintained. Our early forays in this area are encouraging, but it remains to be seen to what extent development and maintenance of the domain knowledge can be shifted from knowledge engineering experts to domain experts, which is our ultimate goal. Whether domain knowledge is to be created and maintained manually or in a semi-automated way, a consideration to be taken into account is its complexity. So far, the domains we have implemented have relatively small numbers of concepts, predicates and actions.Footnote 75 It is possible to imagine domains where the scale of the necessary conceptual model would be so large as to pose serious challenges, say, if one were to use Eva to model an open-domain dialogue system. However, we note that all current benchmarks that purport to contain data for commercially relevant domains have very small numbers of intents and slots, which we take as imprecise but reasonable proxies for the size of the conceptual knowledge and the set of actions required to model those domains in Eva (cf. [175]). Scale also matters for execution time because very large knowledge bases could in principle slow down the system to the point where dialogue becomes impossible. We believe the system should continue to perform well for much larger knowledge basesFootnote 76 than the ones we have worked with so far, by using optimization techniques commonly used in modern Prolog implementations (e.g., tabling, external databases, multi-threading).

We developed the Eva framework to account for certain critical aspects of task-oriented collaborative dialogue. A realistic, though necessarily simplified application domain was used throughout to illustrate how Eva’s capabilities manifest themselves in a practical dialogue system (interested readers are encouraged to also view the multi-modal interaction demonstration video referenced in footnote 8). Nevertheless, it remains to be empirically demonstrated to what extent the features afforded by the Eva framework contribute to the development of dialogue systems that achieve high levels of performance on measures such as task completion and user satisfaction. As we have said, the data collection and parser training required to strengthen the system for such user tests are considerable. To analyze the trade-offs between the development costs of dialogue systems based on the Eva platform and their benefits both to users and to businesses deploying such systems, an evaluation of Eva as a platform would be valuable. As noted above, there are several additional developments we think should be pursued before doing so. In this paper our focus was limited to describing the theory and implementation of a neuro-symbolic approach to developing collaborative dialogue systems based on a planning-based model of dialogue and action. We also presented arguments for why and how this approach yields important capabilities lacking in the prevalent alternatives for implementing commercial conversational assistants, particularly the LLM-based agents in favor today.

14 Concluding remarks

Eva is a planning-based framework for implementing dialogue systems that engage users in cooperative task-oriented dialogues by inferring the users’ plans and planning to facilitate them. It formalizes the planning-based approach to speech acts [8,9,10, 13, 51] using the analysis of intention and speech acts in [1, 7]. It adapts that approach with principles derived from existing theories of collaboration [4, 6] in order to provide a principled analysis and reasoning process underlying conversational interaction. Whereas many research works have investigated aspects of this general approach to dialogue, none recently have done so within a declarative BDI reasoning and planning framework. It has been supposed by many that this approach is too computationally intensive to function as a real-time conversational agent. Eva is a counterexample to that supposition, as it engages in real-time, mixed-initiative, domain-dependent collaborative dialogues, using multimodal inputs and outputs. Its conversational capabilities include the planning of a variety of speech acts, context-dependence, constraint processing in response to slot-filling questions, mixed initiative, over-answering by system and user, and multi-agent dialogue based on models of the mental states of more than one agent. The system’s conversational capabilities are domain independent, as applied to domain-dependent actions and knowledge.

The advent of LLMs, particularly the recent crop of large instruction-following models, has opened up possibilities not available using earlier NLP techniques. For example, the aforementioned RAG technique can be used to quickly ramp up a question-answering system backed up by thousands of documents, with minimal or no pre-processing. True, these systems can and do make mistakes, sometimes egregious ones, in large part because they use superficial similarity scoring for retrieving possible answers; but this could happen with prior information retrieval techniques, as well. Still, LLM technology has made it easier than ever to build such systems, with out-of-the-box robustness to variations in inputs and extremely good fluency in answer generation. Those same characteristics, and the large amount of their training data culled from social media make LLMs very adept at carrying out open-domain conversations (chit-chat). However, LLMs struggle to achieve the level of precision (in planning and reasoning, in accessing the correct pieces of knowledge needed, etc.) required in task-oriented dialogue. For example, someone’s claimed insurance loss either is or is not covered, and that decision often depends on complex considerations that need to be carefully assessed according to the insurance policy, not based on general knowledge induced from data culled from online sources, or even from combining “relevant” snippets of text retrieved from the policy itself. Whereas we find LLMs a useful technology for a variety of tasks, in their current form, these models cannot be entrusted with carrying out the kind of collaborative dialogue envisioned in this paper. Thus, we advocate for a hybrid neuro-symbolic architecture, exemplified here by Eva, that uses LLMs for tasks that rely on their ability to handle surface language (such as semantic parsing, surface language generation, as an aid for knowledge acquisition, etc.) and symbolic mechanisms for tasks that involve knowledge and reasoning. We include in the latter category the fundamental issue in dialogue modeling of what the system should say next – as we have shown here, LLMs’ outputs, based on textual context, can be quite inappropriate, whereas planning-based systems like ours produce utterances based on the intentional structure behind the dialogue. This enables Eva-based systems to explain the direction of the dialogue, both proactively and on the user’s request.