Small Intent Models
26 Dec 2024

If we were to model humans as a (grossly oversimplified) system, then their inputs and outputs are:
- What they hear, say sound(t), where the notation x(t) denotes a function of time t,
- What they see, say sight(t), and
- What they speak, say speech(t)
I will collectively refer to these as stimuli(t).
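As a rough sketch (in Python, with entirely made-up field names), stimuli(t) is just a time-indexed bundle of these signals:

```python
from dataclasses import dataclass

@dataclass
class Stimuli:
    """One sample of stimuli(t): the signals observable at time t.

    The field names here are illustrative placeholders, not a real schema.
    """
    t: float      # timestamp
    sound: bytes  # what the person hears, e.g. an audio frame
    sight: bytes  # what the person sees, e.g. a video frame
    speech: str   # what the person says, e.g. a transcript chunk
```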
The hidden/internal variables (again grossly oversimplified) that cannot be easily captured or recorded are:
- What they feel, say feel(t)
- What they intend, say intent(t)
- and so on
These variables also depend on stimuli(t).
In a simple scenario, humans convey their intent to LLMs using a text prompt, say prompt(t), which is a function of intent(t) and stimuli(t).
Today, we cannot easily capture intent(t). But we can record stimuli(t) and obtain an estimate of intent(t), from which we can in turn estimate prompt(t). Essentially, the inputs to the new system are stimuli(t) and the output is prompt(t), which internally captures intent(t).
The process of generating intent(t) from stimuli(t) involves the human brain. The process of generating prompt(t) from intent(t) again involves the human brain. Using this prompt(t), the LLM generates an answer(t) which is then read and interpreted by the human brain.
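Here is a minimal sketch of that pipeline. Every function below is a hypothetical stand-in (the first two for the human brain, the third for an LLM call), not a real API:

```python
def estimate_intent(stimuli: str) -> str:
    # Stand-in for the human brain mapping stimuli(t) to intent(t).
    return f"figure out: {stimuli}"

def generate_prompt(intent: str, stimuli: str) -> str:
    # Stand-in for the human brain mapping intent(t) to prompt(t).
    return f"Given that I just experienced '{stimuli}', help me {intent}"

def llm(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return f"answer({prompt})"

def one_turn(stimuli: str) -> str:
    intent = estimate_intent(stimuli)          # brain: stimuli(t) -> intent(t)
    prompt = generate_prompt(intent, stimuli)  # brain: intent(t) -> prompt(t)
    return llm(prompt)                         # LLM:   prompt(t) -> answer(t)
```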
Oftentimes, the answer(t) provided by the LLM for prompt(t) is unsatisfactory, because it does not serve the intent(t). The user then generates a prompt_1(t) to obtain answer_1(t), and this process continues, say, n times, until the user obtains a satisfactory answer for intent(t). While prompt_n(t) is not significantly different from prompt(t), the user's satisfaction is different.
Users would like to arrive at prompt_n(t) and answer_n(t) as soon as possible. Ideally, n should be one.
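The loop might look like this sketch, where llm and satisfies_intent are toy stand-ins for a real model and the user's private judgment:

```python
def refine_until_satisfied(stimuli: str, max_turns: int = 5):
    """The prompt(t) -> answer(t) -> prompt_1(t) -> ... -> prompt_n(t) loop."""
    def llm(prompt: str) -> str:                # stand-in for a real LLM call
        return f"answer({prompt})"

    def satisfies_intent(answer: str) -> bool:  # the user's private judgment
        return "step by step" in answer         # toy stand-in criterion

    prompt = f"question about: {stimuli}"       # prompt(t), the first attempt
    answer = llm(prompt)
    for n in range(1, max_turns + 1):
        if satisfies_intent(answer):
            return prompt, answer, n            # prompt_n(t), answer_n(t)
        prompt += ", explained step by step"    # a small rewording
        answer = llm(prompt)
    return prompt, answer, max_turns
```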
If there were an “intent model” which takes stimuli(t) as input and directly generates prompt_n(t), allowing LLMs to generate answer_n(t) that satisfies intent(t), it would (see the sketch after the list below):
- Help users save time and effort
- Provide users the best answer for their intent(t)
- Reduce the cost of inference for obtaining the best answer, since users do not iterate on (prompt, answer) pairs
- Allow LLMs to iterate without RLHF, as the intent model can serve as feedback
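With such a model, the whole loop above collapses to a single call. A minimal sketch, with intent_model and llm as hypothetical stand-ins:

```python
def one_shot(stimuli: str) -> str:
    """With an intent model, the refinement loop collapses to one turn."""
    def intent_model(stimuli: str) -> str:    # the hypothetical small model
        return f"best prompt for: {stimuli}"  # emits prompt_n(t) directly

    def llm(prompt: str) -> str:              # stand-in for a real LLM call
        return f"answer({prompt})"

    return llm(intent_model(stimuli))         # answer_n(t), with n = 1
```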
While this appears similar to prompt engineering, there is a critical distinction: this model attempts to understand the user's intent, given their stimuli, and generate the best prompt for those stimuli and that intent. All the improvements in prompt engineering would be necessary, but not sufficient.
The way users would interact with this intent model is by communicating whether the prompt(t) generated by the model best represents their intent. This can be done easily by showing the user (say) three candidate prompts they might send to the LLM, given the stimuli they received in the past (say) 10 minutes. The user clicking on a particular prompt signals that it best represents their intent.
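A minimal sketch of that interaction, assuming a hypothetical candidate_prompts helper standing in for the intent model's top choices:

```python
def collect_feedback(stimuli: str, k: int = 3) -> str:
    """Show k candidate prompts for recent stimuli and record the click."""
    def candidate_prompts(stimuli: str, k: int) -> list[str]:
        # Hypothetical: the intent model's top-k prompts for the
        # stimuli received over the past ~10 minutes.
        return [f"candidate {i}: ask about {stimuli}" for i in range(k)]

    candidates = candidate_prompts(stimuli, k)
    for i, c in enumerate(candidates):
        print(f"[{i}] {c}")
    choice = int(input("Pick the prompt that matches your intent: "))
    # The (stimuli, candidates, choice) triple is exactly the feedback
    # signal the intent model can be trained on.
    return candidates[choice]
```

Each click yields a preference label, which is the feedback the intent model learns from.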
Depending on how frequently a user may choose to interact with this model, the model should be able to digest the stimuli(t) and quickly provide prompt_n(t), helping the user query for almost every intent(t). This requires the “intent model” to be small. Moreover, smartphones can capture the stimuli(t) and would have enough compute to run the “small intent model” to directly generate prompt_n(t), which can be sent to an LLM to obtain answer_n(t).