AI Investigation Overview

Much like everyone else in the tech industry as of late, I've been doing some investigation into various AI tools. Results have been...mixed. In this post I'll go over some of the results of my experiments and my thoughts on where these tools do and do not make sense to use today.

You may recall that I've written about AI before, although in that case my conclusions had less to do with AI/ML and more to do with the nature of learning a new skill and some aspects of it I had forgotten. In the intervening year and a half the state of AI technology has changed dramatically, and this time around I took a very different tack. Rather than trying to do foundational work on AI technology itself, I tried to use it to assist with the work I do day-to-day. This was much more successful, but it wasn't all sunshine and puppies either. Which is good, because "I tried AI and everything worked perfectly" would make for a very boring blog post.

Meeting Summaries

One of the first productive uses of AI that I engaged in was meeting summarization and transcription. This is one area where I'm actually quite a fan of AI. It's not perfect (more on that in a moment), but it works well enough that it can legitimately replace the role of designated note taker for a meeting. Since almost no one likes doing that, it's one of the rare-ish instances of AI being used to make everyone's life better. The now-common counter-example is "I don't want AI to replace artists, I want it to do my dishes". Note taking in meetings is the work equivalent of doing the dishes, so I like it for this purpose.

As I said, it's not perfect. It struggles with names, and technical terms can result in some fairly hilarious word salad as it tries to translate something like "NNCP from Kubernetes-NMState" into words it knows. In a vacuum this can be confusing, and I sometimes struggle with it when reading summaries of other teams' meetings, where I wasn't present and don't have context for what the AI was trying to say. However, as a quick reference for meetings I did attend, I haven't found this limitation to be as serious. So my advice would be: if you're trying to communicate the outcome of a meeting discussion to someone who wasn't in the meeting itself, maybe don't use the AI-generated summary (at least not verbatim). If you just want a record for the attendees to refer back to, then by all means let the AI handle it. Honestly, even with the limitations it's still better than a lot of the human-written meeting notes I've seen, and it doesn't pull the attention of an entire person away from the meeting (in my experience most people can't participate and take notes at the same time).

Slack Chat Bot

I did a preliminary post about this a few months back in which I covered exporting Slack channel history for use in training AI chat bots. Since then I actually tried using that data in NotebookLM (which was delightfully easy to use for this) to create a chat bot that could (potentially) answer some common or repetitive questions in our team's public Slack channel. This went less well and I currently have no plans to actually make the bot available to the public, but all is not lost because I do still think there may be a valid use for this system.
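I won't rehash the export process here, but for a sense of its shape: Slack's conversations.history endpoint is cursor-paginated, so an export is essentially a loop that follows next_cursor until it runs out. The sketch below is illustrative only; the fetch function is a stub standing in for a real client call (e.g. slack_sdk's WebClient.conversations_history), and the fake pages and channel ID are made up.

```python
# Hypothetical sketch of collecting Slack channel history for AI use.
# conversations.history is cursor-paginated; fetch_page stands in for a
# real API call like slack_sdk.WebClient.conversations_history(...).

def export_history(fetch_page, channel):
    """Collect all messages from a channel by following pagination cursors."""
    messages, cursor = [], None
    while True:
        page = fetch_page(channel=channel, cursor=cursor)
        messages.extend(page["messages"])
        cursor = page.get("response_metadata", {}).get("next_cursor")
        if not cursor:  # Slack signals the last page with a missing/empty cursor
            return messages

# Fake two-page response standing in for the real API, just to exercise the loop.
_pages = {
    None: {"messages": [{"text": "How do I configure NMState?"}],
           "response_metadata": {"next_cursor": "abc"}},
    "abc": {"messages": [{"text": "See the docs for NNCP examples."}]},
}

def fake_fetch(channel, cursor):
    return _pages[cursor]

history = export_history(fake_fetch, "C123")
print(len(history))  # 2 messages collected across two pages
```

The real version mostly adds rate-limit handling and thread expansion on top of this loop.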

First, let's talk about the results I found. Because I had collected the channel history a month or two before actually doing my test, I had a good opportunity to ask the bot some questions that had come up since the export. I also asked a few synthetic, very basic questions to see how it would handle fundamental queries about what our team does. Interestingly, as I pulled questions from the Slack channel I unintentionally went back past the export date, which let me see how it would handle a question whose answer was already in its training data. Spoiler: it did very well with that one.

The others? Not so much. Initially I was impressed, but as often seems to happen with generative AI, when I started looking a little closer at the answers some pretty significant cracks appeared. While I have very detailed notes including my queries, the responses from the bot, and my analysis of the responses, I unfortunately can't post them verbatim because this is not a public channel and I am not at liberty to share some of the discussions that happened there. However, I think I can still convey enough information by talking in generalities to be useful.

One common thing that happens in our channel is we get questions our team is not well-suited to answer. Because we have "networking" in our channel name, we get all manner of networking questions unrelated to on-prem host networking. It would actually be super helpful to have a bot that could accurately identify those questions and quickly redirect the asker. This bot...sort of did that. I asked it multiple questions that should not have come to our team, and although in some cases it correctly identified that the question was not in the right place, it also tended to continue answering the question, often with blatantly incorrect information. This makes sense given that it was trained on our channel and no others, so it doesn't have any context for answering SDN questions (for example), but the bot's inability to not answer questions was actually a major weakness. I don't want a chat bot to speculate wildly on an answer if it doesn't know. Here's one conclusion I have in my notes that I feel I can safely include here:

The correct answer was “Ask the SDN team”. The bot spewed 467 words that essentially boiled down to “Ask the SDN team”. No wonder AI is so resource-intensive. ;-)

In another instance, a question that was not relevant to our team was asked and the bot gave a completely nonsensical answer, which is actually worse than just being too wordy about redirecting people to another team. The fact that it doesn't know what it doesn't know makes it unfit for public consumption right off the bat.

However, that was not the end of my testing. There were also questions asked that were appropriate for our team, and the answers given were interesting. A couple of them were particularly good test cases because they were very representative of the types of questions we get asked a lot. One was a question we've gotten before in various forms, and the bot actually handled it pretty well. It seemed unaware of some recent developments in the area, so the answer was somewhat incomplete, but at least what was there was accurate.

I can't say the same for its answer to a completely new question related to one of our components. In that answer the bot made assertions that, judging by the references it claimed to have used, it could not justify. The chat history it cited was completely unrelated to the question and did not support the answer in any way. To be honest, I didn't know the answer to the question either, but the bot confidently made assertions that IMHO it should not have. It may even have been correct about some of it, but only in the way that you're going to be correct in predicting a coin flip about 50% of the time. I found this answer very concerning because it sounded plausible, but even I, as an SME, couldn't confirm its accuracy. Someone unfamiliar with the area could not be expected to make that determination.

Finally, I sent some synthetic queries that were not taken from the channel, but that I thought represented things people might want to ask about. For example, "Tell me about the on-prem internal DNS server." and "Can I get a review on https://github.com/openshift/baremetal-runtimecfg/pull/350 ?". It did okay describing the components we maintain, but there were still definitely some subtle (and not-so-subtle) issues. For example, it tended to dredge up very old conversations about the components that were no longer valid, like bugs from many years ago that have been fixed almost as long. It also didn't understand the boundaries between components, and started describing DNS behaviors in the loadbalancer answer (which might be valid if we were using DNS-based loadbalancing, but we don't). For the most part it wasn't wildly inaccurate in these answers (well, except one...), but it did struggle to stay on topic and limit itself to relevant details. I was looking for a high level overview, and I got an answer that went down some deep rabbit holes, some of which were dead ends.

And about that review request...yikes! I didn't really expect this to go well since it wasn't a dedicated code review bot, but it managed to exceed my expectations for how bad the answer would be. It proceeded to talk about completely the wrong PR, which made it difficult to even know whether its comments were valid since I wasn't sure which PR it was actually looking at. This answer was a trainwreck.

Conclusion

LLM chatbots need the ability to say "I don't know". Making up an answer when they have no data to back it up is unhelpful. Even where the bot could reasonably have answered the questions, the amount of incorrect information (some of it difficult to recognize as such) was concerning enough that I wouldn't be comfortable exposing it to visitors in our channel. Only one or two answers could, I think, stand on their own without correction or clarification from a human team member. That's not good enough.
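One way to approximate an "I don't know" in a retrieval-backed bot is to gate the answer on how relevant the retrieved material actually is: if nothing in the training data scores above a threshold, redirect the asker instead of generating. This is a toy sketch of that idea, not what NotebookLM does; the word-overlap score() is a crude stand-in for real embedding similarity, and the threshold and redirect text are invented for illustration.

```python
# Toy sketch of gating a retrieval-backed bot so it can decline to answer.
# score() is a stand-in for a real embedding-similarity lookup; the 0.5
# threshold and the redirect wording are illustrative, not tuned values.

def score(question, passage):
    """Crude word-overlap similarity standing in for embedding distance."""
    q, p = set(question.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def answer_or_redirect(question, passages, threshold=0.5):
    best = max((score(question, p) for p in passages), default=0.0)
    if best < threshold:
        # Refuse instead of speculating wildly about an unfamiliar topic.
        return "I don't know -- this may be a question for another team."
    return "ANSWER"  # i.e., hand off to the LLM with the retrieved context

passages = ["nmstate configures host networking on baremetal nodes"]
print(answer_or_redirect("how does sdn handle egress ip routing", passages))
print(answer_or_redirect("how does nmstate configure host networking", passages))
```

An off-topic SDN question falls below the threshold and gets the redirect, while the on-topic one proceeds to answering. A real bot would want far better relevance scoring than word overlap, but the principle is the same.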

I'm unsure whether it would be possible to improve the results. I considered training it on other related channels, but I'd be concerned that it would start answering questions that aren't appropriate for our channel and further muddy the waters. As I've mentioned, we get plenty of questions that aren't appropriate for us, and if we start having those discussions in our channel it will only get worse. Another possibility is to train it on the OCP docs, which might give it a better understanding of our components, although pulling in the entire doc set might cause the same cross-pollination issues. It's something to try in the future, though.

With all that said, I think there could still be some value to this tool. Although the answers were not necessarily worthwhile in isolation, they did often provide a good starting point for research. In some answers it was able to recall long past discussions that were relevant. If there's a question that comes in and we aren't sure of the answer off-hand, we could put it in the bot and see if it can come up with anything. Essentially, as a glorified search engine it's acceptable. As an authoritative source of truth, very much not.