GrowlErval
Big idea:
- Growler currently creates ground truths to evaluate the quality of its metadata, improving it in a loop
- Eval uses ground truths to evaluate Chat Plot
- These could be the same thing
A high-level view of the workflow:
- Human opens a GitHub issue and assigns it to Growler.
- Growler does its magic in a branch and opens a pull request.
- Eval is kicked off and runs against the data in this new branch.
- Growler can be prompted for changes until Eval is passing and we are satisfied with the metadata.
- Merge it and use it in Chat Plot
Eval
Do we need a better name?
- Lave is eval backwards.
- Valet because it’s almost an anagram for eval and it drives the car (Chat Plot) for us.
- Since the name Growler is now all about beer, I arrived at Cicerone, which I learned today is a sommelier of beer. That is also a lot of syllables and kinda highbrow for us. But it makes sense in that Growler brings beer (data) to the beer taster.
- Taster would be just a less fancy way to say cicerone.
- Judy, the world’s most famous judge.
- Another famous judge or driver? There are many.
Requirements for an ideal Eval setup:
- Eval happens on a VM and has a web frontend that lets us view and compare runs.
- We can inspect chats to see how Chat Plot or our metadata are failing.
- We can pick commits to evaluate from the data-catalog and chat-plot repos individually. So if we have a new commit to Chat Plot and a new commit to the Data Catalog, we can either run them together or in isolation as needed.
- Phase 2: It can be kicked off from GitHub using commands in a comment like /eval limit=20. It responds with results in a GitHub comment.
How does this work?
- Ground truths live in the data-catalog next to the data.
- Our CI/CD is already making docker images for every commit to Chat Plot, so Eval uses the container registry to pull them.
- It starts our databases in docker on the VM and asks the Chat Plot container to import the data needed for a given run by running the import script.
- It creates all the needed chats, records the results, and grades them. It stores the results in its own Postgres DB, including the chat sessions so we can inspect them.
- It stops the containers and is ready for another run.
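The run steps above amount to a fixed pipeline. As a sketch only: all the names below are made up for illustration, and the real actions would shell out to docker and the Chat Plot import script:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalRun:
    """One eval run against a pair of commits (field names are hypothetical)."""
    chat_plot_commit: str
    data_catalog_commit: str
    steps_done: list[str] = field(default_factory=list)

def run_eval(run: EvalRun, actions: dict[str, Callable[[], None]]) -> EvalRun:
    # The order mirrors the workflow described above: pull the image,
    # start the databases, import data, run and grade chats, persist,
    # then tear down so the VM is ready for the next run.
    for step in ("pull_image", "start_databases", "import_data",
                 "create_chats_and_grade", "store_results", "stop_containers"):
        actions[step]()
        run.steps_done.append(step)
    return run
```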
I should note that one blocker for this is the Gallup World Poll, since it can't be imported without a huge amount of RAM. We would need to partition it for this to work, but that's already on our to-do list.
Growler
The beauty of the above change to Eval is that it makes Growler simpler: it runs in a shorter loop with fewer steps, and evaluation happens outside the main agent's control in a more realistic setting (the actual Chat Plot tools and agent).
Growler would have a series of agents/prompts/loops that are good at different tasks:
- Researcher: Finds publications, information on the web, etc. and stores them as source documents next to the data.
- Claim Finder: Extracts claims for the ground truths.
- Writer: Uses source documents to write and expand metadata.
- Editor: Evaluates work on the metadata.
- Fact Checker: Checks the metadata to find errors.
The editor and writer are a natural loop that doesn’t stop until we’ve got solid work. Then the Fact Checker comes in
and verifies that changes were not hallucinated.
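The writer/editor loop with a final fact-check pass could look roughly like this. The three agent functions are stand-ins for real prompts, and their interfaces are assumptions:

```python
def improve_metadata(metadata, write, edit, fact_check, max_rounds=5):
    """Loop writer -> editor until the editor approves, then fact-check.

    Hypothetical interfaces:
      write(metadata, feedback) -> revised metadata
      edit(metadata) -> (approved, feedback)
      fact_check(metadata) -> list of problems (hallucinated changes, etc.)
    """
    feedback = None
    for _ in range(max_rounds):
        metadata = write(metadata, feedback)
        approved, feedback = edit(metadata)
        if approved:
            break
    # The Fact Checker runs once the writer/editor loop has settled.
    problems = fact_check(metadata)
    return metadata, problems
```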
Data Catalog
Our catalog expands to hold more things:
- Ground Truths: Facts that Eval will ask Chat Plot to replicate. Creating these becomes part of the process of onboarding new data, and our eval works right away.
- Sources: These are the publications, web pages, code artifacts, etc. This is all of the material we have gathered to create our metadata and ground truths. As I understand it, Growler currently throws this away after a run, but storing it in the catalog will make it available for future runs. We can curate our sources the same way we are curating our data and metadata.
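As a concrete illustration, a ground truth stored next to the data could be as simple as a question with an expected value and a tolerance, pointing back at the source it was extracted from. This schema is entirely hypothetical:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class GroundTruth:
    dataset: str       # catalog dataset the fact belongs to
    question: str      # what Eval asks Chat Plot
    expected: float    # value Chat Plot should reproduce
    tolerance: float   # acceptable relative error when grading
    source: str        # source document the claim was extracted from

gt = GroundTruth(
    dataset="example-dataset",
    question="What was the value of X in 2020?",
    expected=42.0,
    tolerance=0.01,
    source="sources/example-publication.pdf",
)
print(json.dumps(asdict(gt), indent=2))
```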
Chat Plot
Required changes to Chat Plot for this to work:
- Create a new environment called “eval” that disables authentication entirely (easy)
- Allow calling the import-data script with a specific commit (easy; already in progress)
- Allow calling the import-data script to only operate on one dataset at a time (easy)
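The last two changes boil down to two new options on the import-data script: pinning a specific commit (of the data catalog, I assume) and restricting the import to one dataset. A sketch of what that interface might look like; the flag names are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags for the two "easy" changes listed above.
    parser = argparse.ArgumentParser(prog="import-data")
    parser.add_argument("--commit", help="data-catalog commit to import from")
    parser.add_argument("--dataset", help="import only this one dataset")
    return parser

args = build_parser().parse_args(
    ["--commit", "abc123", "--dataset", "gallup-world-poll"]
)
print(args.commit, args.dataset)  # abc123 gallup-world-poll
```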