Skip to main content

How the iPhone maker ensures Apple Intelligence safety: Triggering, red teaming, and more

A research paper explains how Apple Intelligence is designed, and the steps the company takes to ensure the safety of the models.

The paper also gives a glimpse into the scale and complexity of the on-device AI capabilities, noting that the core model which runs entirely on an iPhone, iPad, or Mac has around three billion parameters …

The paper, spotted by John Gruber, was published a couple of weeks ago.

We present foundation language models developed to power Apple Intelligence features, including a ∼3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute [Apple, 2024b].

These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

Proactively seeking out problematic material

One of the big challenges with generative AI is, because it’s been trained on a wide range of user-generated content on the web, it can end up echoing the worst of humanity. Apple says that it proactively seeks to identify and exclude problematic material.

We work continuously to avoid perpetuating stereotypes and systemic biases across our AI tools and models. We take precautions at every stage of our process, including design, model training, feature development, and quality evaluation to identify how our AI tools may be misused or lead to potential harm. We will continuously and proactively improve our AI tools with the help of user feedback […]

Additionally, extensive efforts have been made to exclude profanity, unsafe material, and personally identifiable information from publicly available data.

      Testing with trigger phrases

      One specific approach used is to deliberately test the models with trigger phrases likely to generate unacceptable responses, and then apply a decontamination process to exclude these.

      Apple says it does this with datasets it has licensed, as well as with websites crawled by Applebot.

      Validating output against Apple’s values

      Apple then applies a process known as post-training, which is essentially reviewing outputs in order to validate and fine-tune it.

      We conduct extensive research in post-training methods to instill general-purpose instruction following and conversation capabilities to the pre-trained AFM models. Our goal is to ensure these model capabilities are aligned with Apple’s core values and principles, including our commitment to protecting user privacy, and our Responsible AI principles.

      Four criteria for human review

      Human review is used to compare different outputs, with reviewers asked to rate them on a range of criteria:

      • Accuracy
      • Helpfulness
      • Harmlessness
      • Presentation

      Those ratings are then used to further enhance the model’s understanding of what it is aiming to produce.

      Red teaming

      The company also uses an approach known as “red teaming,” which is effectively penetration testing for AI models. This uses a mix of human and automated attacks to try to find vulnerabilities in the model.

      Red teaming is a fundamentally creative endeavor that requires red teamers to employ combinations of attack vectors to probe known model vulnerabilities, and try to discover new ones. Attack vectors used when engaging with language models include jailbreaks/prompt injections, persuasive techniques [Zeng et al., 2024], and linguistic features known to cause model misbehavior (e.g. slang, code-switching, emojis, typos).

      We employ both manual and automatic red-teaming [Ganguli et al., 2022] to elicit potentially unknown failure modes of the aligned models. More recent works [Touvron et al., 2023] suggest that automated processes can potentially generate even more diverse prompts than humans, previously seen as the “gold” standard for data collection.

      The paper goes into a huge amount of detail of this and more.

      Photo by Kevin Ku on Unsplash

      FTC: We use income earning auto affiliate links. More.

      You’re reading 9to5Mac — experts who break news about Apple and its surrounding ecosystem, day after day. Be sure to check out our homepage for all the latest news, and follow 9to5Mac on Twitter, Facebook, and LinkedIn to stay in the loop. Don’t know where to start? Check out our exclusive stories, reviews, how-tos, and subscribe to our YouTube channel

      Comments

      Author

      Avatar for Ben Lovejoy Ben Lovejoy

      Ben Lovejoy is a British technology writer and EU Editor for 9to5Mac. He’s known for his op-eds and diary pieces, exploring his experience of Apple products over time, for a more rounded review. He also writes fiction, with two technothriller novels, a couple of SF shorts and a rom-com!


      Ben Lovejoy's favorite gear

      Manage push notifications

      notification icon
      We would like to show you notifications for the latest news and updates.
      notification icon
      You are subscribed to notifications
      notification icon
      We would like to show you notifications for the latest news and updates.
      notification icon
      You are subscribed to notifications