An Elixir tool for computing "runs" of text-to-image and image-to-text models (with outputs fed recursively back in as inputs) and analysing the resulting text-image-text-image trajectories using topological data analysis.
If you've got a sufficiently capable rig you can use this tool to:
- specify text-to-image and image-to-text generative AI models in a "network" (a cycling list of models)
- starting from a specified initial prompt, recursively feed the output of one model in as the input of the next to create a "run" of model invocations
- embed each output into a high-dimensional embedding space using one or more embedding models
- compute persistence diagrams and cluster them to identify topological structure in the trajectories
The results of all the above computations are stored in a local SQLite database for further analysis.
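To give a concrete sense of the run structure described above, here's a minimal Elixir sketch of the recursive loop (the module and function names are hypothetical, not this tool's actual API):

```elixir
# Hypothetical sketch of a single run: cycle through the network,
# feeding each model's output back in as the next model's input.
defmodule RunSketch do
  def run(network, initial_prompt, max_length) do
    network
    |> Stream.cycle()
    |> Enum.take(max_length)
    |> Enum.reduce({initial_prompt, []}, fn model, {input, outputs} ->
      # invoke/2 stands in for whatever actually calls the genAI model:
      # a text-to-image model returns an image, an image-to-text model
      # returns a caption, and that output becomes the next input
      output = invoke(model, input)
      {output, [output | outputs]}
    end)
    |> then(fn {_last, outputs} -> Enum.reverse(outputs) end)
  end

  # placeholder; the real tool dispatches to GPU-backed models
  defp invoke(_model, input), do: input
end
```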
This tool was initially motivated by the PANIC! art installation (first exhibited in 2022); see DESIGN.md for more details. Watching PANIC! in action, it is clear that the trajectories "traced out" by the genAI model outputs have some structure. This tool is an attempt to quantify and understand that structure (see why? below).
- mise for managing Erlang/Elixir versions (see mise.toml)
- a GPU which supports CUDA (for running the genAI and embedding models)
- SQLite3
```bash
# install Erlang & Elixir via mise
mise install

# fetch deps and set up the database
mise exec -- mix setup
```

Experiments are configured via JSON files and run with Mix tasks. Here's an example configuration:
```json
{
  "network": ["SD35Medium", "Moondream"],
  "prompts": ["a red apple"],
  "embedding_models": ["Nomic"],
  "max_length": 100,
  "num_runs": 4
}
```

Fields:
- network: a list of models that cycle (T2I -> I2T -> T2I -> ...)
- prompts: initial text inputs; each prompt creates num_runs runs
- embedding_models: models used in the embeddings stage
- max_length: number of model invocations per run
- num_runs (optional, default 1): how many runs to create per prompt
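To make the field semantics concrete, here's a minimal Elixir sketch of how a decoded config might expand into individual run specs (the module and function names are hypothetical, not this tool's actual API):

```elixir
# Hypothetical sketch: each prompt yields `num_runs` runs, and each run
# will perform `max_length` model invocations through the cycling network.
defmodule ConfigSketch do
  def expand(config) do
    num_runs = Map.get(config, "num_runs", 1)

    for prompt <- config["prompts"],
        run_index <- 1..num_runs do
      %{
        prompt: prompt,
        run_index: run_index,
        network: config["network"],
        embedding_models: config["embedding_models"],
        max_length: config["max_length"]
      }
    end
  end
end
```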
Then, to run the experiment:
```bash
# run an experiment
mise exec -- mix experiment.run config/my_experiment.json

# check the status of an experiment (by ID prefix)
mise exec -- mix experiment.status abc123

# list all experiments
mise exec -- mix experiment.list

# resume an interrupted experiment
mise exec -- mix experiment.resume abc123
```

| Type | Models |
|---|---|
| text-to-image | SD35Medium, Flux2Klein, Flux2Dev, ZImageTurbo, QwenImage, HunyuanImage, GLMImage |
| image-to-text | Moondream, Qwen25VL, Gemma3n, Pixtral, LLaMA32Vision, Florence2 |
| text embedding | STSBMpnet, STSBRoberta, STSBDistilRoberta, Nomic, JinaClip, Qwen3Embed |
| image embedding | NomicVision, JinaClipVision |
| dummy (testing) | DummyT2I, DummyI2T, DummyT2I2, DummyI2T2, DummyText, DummyText2, DummyVision, DummyVision2 |
Measured on a single NVIDIA RTX 4090 with NF4 quantisation where applicable. Times include model loading/swapping overhead.
| Model | Single | Batch of 3 (per image) |
|---|---|---|
| Text-to-image | | |
| SD35Medium | ~9s | ~3s |
| ZImageTurbo | ~8s | ~6s |
| Flux2Klein | ~20s | ~7s |
| GLMImage | ~44s | ~28s |
| QwenImage | ~46s | ~23s |
| Flux2Dev | ~100s | ~75s |
| HunyuanImage | ~124s | ~109s |
| Image-to-text | | |
| Moondream | ~4s | ~3s |
| Qwen25VL | ~12s | ~5s |
| Gemma3n | ~16s | ~6s |
| LLaMA32Vision | ~17s | ~8s |
| Pixtral | ~19s | ~8s |
| Florence2 | TBD | TBD |
The design space of models is vast, spanning both fundamentally different architectures and many different finetunes of the same base models. This project's goals involve asking questions about both: are different architectures more likely to diverge (in their long-term trajectories) than finetunes of the same base model, or is there no particular pattern there?
Tests use ExUnit with dummy models (no GPU required):

```bash
mise exec -- mix test
```

GPU smoke tests (all real model combinations) are tagged :gpu and excluded by default:

```bash
mise exec -- mix test --include gpu
```
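As an illustration of how the :gpu tag interacts with the default exclusion, here's a minimal ExUnit sketch (the test names and bodies are hypothetical, not taken from this repo's test suite):

```elixir
defmodule GpuSmokeSketchTest do
  use ExUnit.Case, async: false

  # Runs only with `mix test --include gpu`, assuming the test helper
  # excludes the :gpu tag by default (e.g. ExUnit.configure(exclude: [:gpu])).
  @tag :gpu
  test "real model invocation produces output" do
    # placeholder for a real GPU-backed model call
    assert {:ok, _output} = {:ok, :placeholder}
  end

  test "dummy models run without a GPU" do
    # dummy models need no hardware, so this runs in the default suite
    assert true
  end
end
```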
For further info, see the design doc.
At the School of Cybernetics we love thinking about the way that feedback loops (and the connections between things) define the behaviour of the systems in which we live, work and create. That interest sits behind the design of PANIC! as a tool for making (and breaking!) networks of hosted generative AI models.
Anyone who's played with (or watched others play with) PANIC! has probably had one of these questions cross their mind at some point.
One goal in building PANIC! is to provide answers to these questions that are both quantifiable and satisfying (i.e. answers that feel like they represent deeper truths about the process).
- was it predictable that it would end up here?
- how sensitive is it to the input, i.e. would it still have ended up here with a slightly different prompt?
- the text/images it's generating now seem to be "semantically stable"; will it ever move on to a different thing?
- can we predict in advance which initial prompts lead to a "stuck" trajectory?
- how similar is this run's trajectory to previous runs?
- what determines whether they'll be similar? the initial prompt, or something else?
- does a certain genAI model "dominate" the behaviour of the network? or is the prompt more important? or is it an emergent property of the interactions between all models in the network?
Ben Swift wrote the code, and Sunyeon Hong is the mastermind behind the TDA stuff.
MIT