Siri has recently been attempting to describe images received in Messages when using CarPlay or the announce notifications feature. In typical Siri fashion, the feature is inconsistent, with mixed results.
Nevertheless, Apple forges ahead with the promise of AI. In a newly published research paper, Apple’s AI gurus describe a system in which Siri can do much more than try to recognize what’s in an image. The best part? Apple says its larger models for doing this benchmark better than GPT-4.
In the paper (ReALM: Reference Resolution As Language Modeling), Apple describes something that could give a large language model-enhanced voice assistant a usefulness boost. ReALM takes into account both what’s on your screen and what tasks are active. Here’s a snippet from the paper that describes the job:
1. On-screen Entities: These are entities that are currently displayed on a user’s screen
2. Conversational Entities: These are entities relevant to the conversation. These entities might come from a previous turn for the user (for example, when the user says “Call Mom”, the contact for Mom would be the relevant entity in question), or from the virtual assistant (for example, when the agent provides a user a list of places or alarms to choose from).
3. Background Entities: These are relevant entities that come from background processes that might not necessarily be a direct part of what the user sees on their screen or their interaction with the virtual agent; for example, an alarm that starts ringing or music that is playing in the background.
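To make those three categories concrete, here is a rough Python sketch of how candidate entities could be written out as plain text for a language model to choose from. It is our illustration, not Apple's code: the names `EntityKind`, `Entity`, `build_prompt`, and `parse_answer`, the prompt wording, and the numbered-reply format are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class EntityKind(Enum):
    # The three categories quoted from the paper above.
    ON_SCREEN = "on-screen"
    CONVERSATIONAL = "conversational"
    BACKGROUND = "background"


@dataclass(frozen=True)
class Entity:
    # Hypothetical container: a category plus a short text description.
    kind: EntityKind
    label: str


def build_prompt(request: str, candidates: list[Entity]) -> str:
    """Render the user's request and the numbered candidates as plain text."""
    lines = [f"User request: {request}", "Candidate entities:"]
    for index, entity in enumerate(candidates, start=1):
        lines.append(f"{index}. [{entity.kind.value}] {entity.label}")
    lines.append("Answer with the numbers of the relevant entities.")
    return "\n".join(lines)


def parse_answer(answer: str, candidates: list[Entity]) -> list[Entity]:
    """Map a reply such as "1, 3" back to entities, skipping invalid numbers."""
    picked = []
    for token in answer.replace(",", " ").split():
        if token.isdecimal() and 1 <= int(token) <= len(candidates):
            picked.append(candidates[int(token) - 1])
    return picked


candidates = [
    Entity(EntityKind.ON_SCREEN, "555-0100 (phone number on a web page)"),
    Entity(EntityKind.CONVERSATIONAL, "Mom (contact)"),
    Entity(EntityKind.BACKGROUND, "Alarm ringing"),
]
print(build_prompt("Call Mom", candidates))
```

The model call itself is left out on purpose; `parse_answer` only shows how a reply that picks from the numbered list could be turned back into entities.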
If ReALM works well, that sounds like a recipe for a smarter and more useful Siri. Apple also sounds confident in its ability to complete such a task with impressive speed. The paper benchmarks its models against the GPT-3.5 and GPT-4 variants of OpenAI’s ChatGPT:
As another baseline, we run the GPT-3.5 (Brown et al., 2020; Ouyang et al., 2022) and GPT-4 (Achiam et al., 2023) variants of ChatGPT, as available on January 24, 2024, with in-context learning. As in our setup, we aim to get both variants to predict a list of entities from a set that is available. In the case of GPT-3.5, which only accepts text, our input consists of the prompt alone; however, in the case of GPT-4, which also has the ability to contextualize on images, we provide the system with a screenshot for the task of on-screen reference resolution, which we find helps substantially improve performance.
So how does Apple’s model do?
We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.
Substantially outperforming it, you say? The paper concludes in part as follows:
We show that ReaLM outperforms previous approaches, and performs roughly as well as the state-of-the-art LLM today, GPT-4, despite consisting of far fewer parameters, even for onscreen references despite being purely in the textual domain. It also outperforms GPT-4 for domain-specific user utterances, thus making ReaLM an ideal choice for a practical reference resolution system that can exist on-device without compromising on performance.
Running on-device without compromising on performance seems key for Apple. The next few years of platform development should be interesting, hopefully starting with iOS 18 and WWDC 2024 on June 10.