Real AI Safety: Threat Modeling a Retrieval Augmented Generation (RAG) System

Or, AI Safety For Cynics With Deadlines (with apologies to the Django Project).

Previously: Why AI safety is impossible as the problem is currently framed, how to better frame the problem instead.

I know I’ve been on a tear recently.  Let’s bring this all home with a worked example.

I’m going to use my Principals-Goals-Adversities-Invariants rubric to threat model an intentionally oversimplified version of a system using Retrieval Augmented Generation (RAG) methods.

(If it seems confusing to be using a “security” rubric on a “safety” problem: sorry, I get that a lot.  I tend to think of ‘security’ as a subset of the overarching problem of ‘safety’—if ‘safety’ is freedom from unacceptable loss, ‘security’ is specifically concerned with unacceptable losses of confidentiality, integrity, and availability, plus a couple of other, more contextual properties.  And since this rubric was inspired by systems safety work that we specialized and framed specifically in a security context, it’s easy enough to zoom back out and use the rubric to consider the broader safety question.)

Threat Model

A system diagram. 1. A user sends arbitrary, untrusted user input to an LLM (ideally, a natural language query). 2. The LLM sends arbitrary, untrusted LLM output to Postgresql (ideally, a valid SQL query). 3. Postgresql sends structured database output to a different LLM ("ideally... no, this really just is what it is"). 4. The second LLM sends arbitrary, untrusted LLM output back to the user ("ideally, a reasonably correct and complete summary of the structured database output"). A couple extra arrows are sketched in with question marks suggesting that perhaps the output of the first LLM, and the structured database output, should be sent directly to the user as well as to the next step in the loop.
An intentionally oversimplified diagram of a system using Retrieval Augmented Generation (RAG) methods. It is intended to allow a user to make queries of a database using natural language and receive back natural language summaries of the results.
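
To make the data flow concrete, here’s a minimal sketch of that loop in Python.  The llm_generate_sql, execute_sql, and llm_summarize helpers are hypothetical stand-ins for whatever model and database calls you’d actually wire up; the point is the shape of the pipeline, not an implementation.

```python
def llm_generate_sql(question: str) -> str:
    """Hypothetical: prompt the first LLM to turn a natural-language
    question into (ideally) a valid SQL query."""
    raise NotImplementedError


def execute_sql(query: str) -> list[tuple]:
    """Hypothetical: run the query against Postgresql and return rows."""
    raise NotImplementedError


def llm_summarize(question: str, rows: list[tuple]) -> str:
    """Hypothetical: prompt the second LLM to summarize the rows in
    natural language."""
    raise NotImplementedError


def answer(question: str) -> str:
    query = llm_generate_sql(question)       # arbitrary, untrusted LLM output
    rows = execute_sql(query)                # structured database output
    summary = llm_summarize(question, rows)  # arbitrary, untrusted LLM output
    return summary                           # sent back to the user
```

Every arrow in the diagram is one line of that function, and every adversity below attaches to one of those lines.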

Principals

  1. A user.
  2. An LLM prompted and potentially specialized to generate SQL queries from natural language queries.
  3. A SQL database (here, without loss of generality, Postgresql).
  4. A second LLM prompted and potentially specialized to generate natural-language summaries from structured database output.

Goals

Allow a user to make queries of a database using natural language and receive back natural language summaries of the results.

Adversities

(“What bad things can happen in the system, either by adversarial action (😈) or by accident (Murphy)?”)

You may find these overlapping or a little bit brain-storm-y.  Never fear, we’ll get to the meat in a minute.  Skip down to the ‘Invariants’ section if you get bored.

I tend to group these by the portion of the system diagram they apply to, so, for example, we’ll start with the adversities which apply to the edge where the user sends a message to the first LLM.

User–Query Generation LLM

  1. The user can send arbitrary text.
  2. The user can send non-English text.
  3. The user can send text in an arbitrary character set (or a mix of character sets).
  4. The user can send text which contains invalid characters in the current character set.
  5. The user can send text which does not contain a question.
  6. The user can send a SQL query.
  7. The user can send text containing a prompt which instructs the LLM what SQL query to generate.
  8. The user can send text which is intended to override the LLM’s prompting.
  9. The user can send text asking for a query which they do not have permissions to execute on the database directly.
  10. The user can send text asking for a query which affects the behavior of the database for other users (e.g. DROP TABLE).
  11. The user can send text asking for a query which asks for an arbitrary amount of data from the database.
  12. The user can send text asking for a query which is not properly quoted.
  13. The user can send text asking for a query where the user’s input is not properly quoted.
  14. The user can send text asking for a query which uses any SQL verb (e.g. EXPLAIN ANALYZE).
  15. The user can send text asking for a query which accesses database system tables or internal variables.
  16. The user can send text asking for a query which accesses any data type (e.g. JSON objects, GIS data, binary blobs).
  17. The user can send text asking for a query which uses arbitrary temporary tables or stored procedures.
  18. The user can send text asking for a query which creates arbitrary temporary tables or stored procedures.
  19. The user can send an arbitrary amount of text.
  20. The user can send text at arbitrary frequency, e.g. thousands of times a second.
  21. The user can send additional text before the LLM has finished processing the previous text.
  22. The user can send additional text before the LLM has received the previous text.
  23. The user can send text which asks for an invalid query.
  24. The user can send text which asks for a query of arbitrary computational complexity.
  25. The user can send text which asks for a query which will not terminate.
  26. The user can send text at arbitrary speed, i.e. very fast or very slow.
  27. The user can not send text to the LLM.
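
A handful of these (arbitrary length, arbitrary frequency, input that is already SQL) are cheap to bound before the text ever reaches the LLM.  Here’s a sketch of what that might look like; the limits are made up, the SQL heuristic is deliberately crude, and nothing here touches the harder prompt-injection adversities.

```python
import re
import time
from collections import defaultdict, deque

MAX_QUESTION_CHARS = 2_000    # adversity 19: bound how much text we accept
MAX_REQUESTS_PER_MINUTE = 10  # adversity 20: bound how often we accept it
# Crude heuristic for adversities 6 and 7; it will have false positives
# ("select the best quarter") and false negatives, so treat it as a
# tripwire, not a defense against prompt injection.
SQL_LIKE = re.compile(r";|\b(insert|update|delete|drop|alter|grant)\b", re.I)

_recent: dict[str, deque] = defaultdict(deque)


def check_user_input(user_id: str, text: str) -> str:
    """Reject obviously out-of-bounds input before it reaches the
    query-generation LLM.  This narrows the list above; it does not
    eliminate it."""
    now = time.monotonic()
    window = _recent[user_id]
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        raise ValueError("rate limit exceeded")
    window.append(now)

    if not text.strip():
        raise ValueError("no question provided")
    if len(text) > MAX_QUESTION_CHARS:
        raise ValueError("question too long")
    if SQL_LIKE.search(text):
        raise ValueError("input looks like SQL rather than a question")
    return text
```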

Query Generation LLM–SQL Database

  1. The LLM can send arbitrary text to the database.
  2. The LLM can send an arbitrary amount of text to the database.
  3. The LLM can send text in an arbitrary character set to the database.
  4. The LLM can send text in multiple character sets to the database.
  5. The LLM can send text containing invalid characters in the current character set.
  6. The LLM can send additional text before the database has finished processing the previous text.
  7. The LLM can send additional text before the database has received the previous text.
  8. The LLM can send arbitrary SQL queries to the database.
  9. The LLM can send invalid SQL queries to the database.
  10. The LLM can send a SQL query to the database which it does not have the permissions to execute.
  11. The LLM can send a SQL query to the database which the user does not have the permissions to execute.
  12. The LLM can send a SQL query to the database which affects the behavior of the database for other users (e.g. DROP TABLE).
  13. The LLM can send a SQL query which is not properly quoted.
  14. The LLM can send a SQL query where the user’s input is not properly quoted.
  15. The LLM can send a query which uses any SQL verb (e.g. EXPLAIN ANALYZE).
  16. The LLM can send a query which accesses database system tables or internal variables.
  17. The LLM can send a query which accesses any data type (e.g. JSON objects, GIS data, binary blobs).
  18. The LLM can send a query which uses arbitrary temporary tables or stored procedures.
  19. The LLM can send a query which creates arbitrary temporary tables or stored procedures.
  20. The LLM can send an arbitrarily large SQL query.
  21. The LLM can send SQL queries at arbitrary frequency, e.g. thousands of times a second.
  22. The LLM can send a SQL query which accesses an arbitrarily large amount of data from the database.
  23. The LLM can send a SQL query which accesses an arbitrarily large number of tables.
  24. The LLM can send a SQL query which accesses an arbitrarily large number of rows.
  25. The LLM can send a SQL query which accesses an arbitrarily large number of columns.
  26. The LLM can send a SQL query which returns an arbitrarily large amount of data.
  27. The LLM can send a SQL query which returns an arbitrarily large number of rows.
  28. The LLM can send a SQL query which returns an arbitrarily large number of columns.
  29. The LLM can send a SQL query of arbitrary computational complexity.
  30. The LLM can send a SQL query which won’t terminate.
  31. The LLM can not send text to the database.
  32. The LLM can send text arbitrarily fast or arbitrarily slow.
  33. The LLM can send a SQL query which does not correctly capture the user’s explicitly-expressed intent.
  34. The LLM can send a SQL query which draws incorrect inferences from the user’s poorly- or un-expressed intent.
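
Many of these get much less scary if the generated query can only ever run as a dedicated, minimally-privileged role with tight resource limits.  Here’s a sketch using psycopg2; it assumes a low-privilege role (I’ve called it rag_readonly, which is made up) already exists with SELECT on only the tables this feature needs.

```python
import psycopg2

# Assumed to already exist: a role with SELECT on only the tables this
# feature needs, and nothing else.  The role's grants, not the prompt,
# decide what a generated query may touch.
DSN = "dbname=app user=rag_readonly"
MAX_ROWS = 1_000  # adversities 26-27: cap what we carry downstream


def run_generated_query(query: str) -> list[tuple]:
    """Execute LLM-generated SQL read-only, under a restricted role, with a
    statement timeout, and cap the number of rows we pass along."""
    # libpq lets us set session settings at connect time; these are
    # belt-and-suspenders on top of the role's permissions, since the
    # query text itself is still arbitrary and untrusted.
    conn = psycopg2.connect(
        DSN,
        options="-c default_transaction_read_only=on -c statement_timeout=5000",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(query)  # adversities 29-30: the timeout bounds this
            # The server has already done the work of computing the result;
            # fetchmany only caps what we hand to the summarization step.
            rows = cur.fetchmany(MAX_ROWS)
        return rows
    finally:
        conn.close()
```

None of this stops the query from being wrong about the user’s intent (adversities 33 and 34); it only bounds how expensive being wrong can get.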

SQL Database–Summarization LLM

  1. The database can send arbitrary data to the LLM.
  2. The database can send an arbitrary amount of data to the LLM.
  3. The database can send text in an arbitrary character set to the LLM.
  4. The database can send text in multiple character sets to the LLM.
  5. The database can send text containing invalid characters in the current character set.
  6. The database can send additional data before the LLM has finished processing the previous data.
  7. The database can send additional text before the LLM has received the previous text.
  8. The database can send the output of a query which uses any SQL verb (e.g. EXPLAIN ANALYZE).
  9. The database can send the output of a query which accesses database system tables or internal variables.
  10. The database can send data of any data type (e.g. JSON objects, GIS data, binary blobs).
  11. The database can send data from arbitrary temporary tables or stored procedures.
  12. The database can send an arbitrarily large amount of data.
  13. The database can send an arbitrarily large number of rows.
  14. The database can send an arbitrarily large number of columns.
  15. The database can send data at arbitrary frequency, e.g. thousands of times a second.
  16. The database can send data at arbitrary speed, e.g. very fast or very slow.
  17. The database can not send data to the LLM.
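
Most of the adversities on this edge are about the volume and type of what comes back.  One way to bound them is to flatten the result into a fixed, text-only shape before it goes anywhere near the summarization prompt.  Here’s a sketch; the limits are, again, made up.

```python
import json

MAX_CELLS = 2_000          # bound rows x columns handed to the summarizer
MAX_CELL_CHARS = 200       # bound any single value (JSON objects, GIS data, ...)
MAX_PROMPT_CHARS = 20_000  # bound the total text placed in the prompt


def rows_for_prompt(column_names: list[str], rows: list[tuple]) -> str:
    """Serialize a query result into a bounded, text-only form suitable for
    pasting into the summarization LLM's prompt."""
    out_rows = []
    cells = 0
    for row in rows:
        if cells + len(row) > MAX_CELLS:
            out_rows.append({"truncated": True})  # say so, rather than omit silently
            break
        out_rows.append({
            name: str(value)[:MAX_CELL_CHARS]  # force everything to capped text
            for name, value in zip(column_names, row)
        })
        cells += len(row)
    text = json.dumps({"columns": column_names, "rows": out_rows},
                      ensure_ascii=False)
    if len(text) > MAX_PROMPT_CHARS:
        raise ValueError("result too large to summarize")
    return text
```

Marking the truncation explicitly matters; a silently truncated result is exactly the kind of thing the summarizer will then confidently misrepresent.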

Summarization LLM–User

  1. The LLM can send arbitrary text to the user.
  2. The LLM can send an arbitrary amount of text to the user.
  3. The LLM can send text in an arbitrary character set to the user.
  4. The LLM can send text in multiple character sets to the user.
  5. The LLM can send text containing invalid characters in the current character set.
  6. The LLM can send additional text before the user has finished processing the previous text.
  7. The LLM can send additional text before the user has received the previous text.
  8. The LLM can send text in any language to the user, including both human and computer languages.
  9. The LLM can send text in no particular language to the user (i.e. gibberish).
  10. The LLM can send text which is incomprehensible to the user.
  11. The LLM can send text to the user with what that user would consider incorrect or unclear grammar.
  12. The LLM can send text to the user which that user would consider socially or culturally offensive.
  13. The LLM can send to the user any information which it receives from the database.
  14. The LLM can not send to the user any information which it receives from the database.
  15. The LLM can appear to send to the user information which it received from the database, but which it did not in fact receive.
  16. The LLM can send text to the user which is not properly quoted (e.g. does not clearly or correctly distinguish between a one-row result which is “carrots, apples, peas” and a three-row result which is “carrots”, “apples”, and “peas”).
  17. The LLM can send to the user the results of a query which uses any SQL verb (e.g. EXPLAIN ANALYZE) if that information is sent to it by the database.
  18. The LLM can send to the user the contents of database system tables or internal variables if that information is sent to it by the database.
  19. The LLM can send to the user data of any data type (e.g. JSON objects, GIS data, binary blobs) if that information is sent to it by the database.
  20. The LLM can send to the user information from arbitrary temporary tables or stored procedures if that information is sent to it by the database.
  21. The LLM can send an arbitrary amount of text to the user.
  22. The LLM can send text to the user at arbitrary frequency, e.g. thousands of times a second.
  23. The LLM can send text to the user representing results over an arbitrarily large number of tables.
  24. The LLM can send text to the user representing results over an arbitrarily large number of rows.
  25. The LLM can send text to the user representing results over an arbitrarily large number of columns.
  26. The LLM can send text to the user representing results of arbitrary computational complexity.
  27. The LLM can not send text to the user.
  28. The LLM can send text arbitrarily fast or arbitrarily slow.
  29. The LLM can send text to the user which misrepresents the context of the data (e.g. attributing results from a previous year to the current year; more likely if the year isn’t included in the query output).
  30. The LLM can send text to the user which misrepresents the data itself (e.g. based on not understanding or misunderstanding the import of column names, or inferring the wrong meaning of a column named with a word which has a homonym).

This is almost certainly not a full and complete list, but I think it’s a start.  If there’s anything glaringly obvious that I missed, please leave a comment or let me know directly!

Also, I suspect people in other disciplines have thought long and hard about all the ways that summaries can lie, where I’ve only scratched the surface here—linguistics? communication theory? library science?—and if you have pointers there I would love them.

Invariants

(“What must be true about this system in order for it to be safe?”)

This is where we get political, and I’m going to keep this brief, both because this post is already long enough as it is, and because every system is different and I don’t feel comfortable making sweeping judgments at this juncture.

That said, if I were designing a system like this with the intent to build it, a few things I would try to enforce leap immediately to mind:

  1. Queries generated by the LLM must never run with greater permissions than they would run with if they were generated and run by the user directly on the database.
  2. The user must always have access to the full text of the query generated by the LLM.  (This is one of the dashed lines on the system diagram—I was already anticipating this as I was drawing it.)
  3. The user must always have access to the underlying data returned by the database.  (This is the other dashed line on the system diagram.)

These are by no means all the invariants I could come up with, or all that might be necessary for the system to be safe, but I think they start to get at the things which scare me most about these systems.
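
If I were sketching what those invariants look like in code, the response object would always carry the generated SQL and the raw rows alongside the summary (invariants 2 and 3), and the generated query would run under the calling user’s own database role rather than a privileged service role (invariant 1).  The sketch below is a minimal version of that, and it assumes something that is itself a design decision: that each application user maps onto a Postgresql role the service is allowed to SET ROLE to.

```python
from dataclasses import dataclass

from psycopg2 import sql


@dataclass
class RagAnswer:
    generated_sql: str     # invariant 2: the user can always see the query
    raw_rows: list[tuple]  # invariant 3: the user always gets the raw data
    summary: str           # the LLM's gloss, clearly labeled as such


def run_as_user(conn, db_role: str, query: str) -> list[tuple]:
    """Invariant 1: run the generated query with the user's own database
    permissions, no more.  Assumes each application user maps onto a
    Postgresql role that the service's session user may SET ROLE to."""
    with conn.cursor() as cur:
        # SET LOCAL reverts when the transaction ends, so a failed query
        # can't leave the connection stuck in someone else's role.
        cur.execute(sql.SQL("SET LOCAL ROLE {}").format(sql.Identifier(db_role)))
        cur.execute(query)
        rows = cur.fetchall()
    conn.rollback()  # read-only path; end the transaction either way
    return rows
```

The exact presentation is a UI question; the invariant is just that the summary is never the only thing the user can get at.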

What other invariants potentially stand out to you from the lists of adversities?

 

For more: Why AI Safety Can’t Work, What AI Safety Should Be