I'm doing an experiment with training a chat bot on the history of our team Slack channel to see if it could be used to answer frequently asked questions. The first step is getting the channel history into a format I can feed into a model, and because I'm not an admin on the server I can't use the built-in tools to do it. Fortunately there are other ways, and in this post I'm going to go over the process I used.
First, I found the slackdump tool and attempted to run it against our internal Slack server. At first it looked like everything was working, but after a few seconds it failed with an authentication error. Oddly, logging in through the browser window it opened also logged me out of my standalone Slack app. This seems to be because we use SSO for Slack authentication, which can be problematic with the normal browser-based login.
Fortunately there's another way to provide authentication data. Note that if you use Firefox you can get the cookie value from the developer console, similar to how you got the token in the first step. You can then use the retrieved token and cookie values in the slackdump config to connect to the workspace.
Once the export is complete, you'll be left with a directory containing a sqlite database. It may be possible to directly use this to train a model, but for the purposes of this exercise I wanted to get all of the chat messages into a flat text file. I did that in two steps. First, I converted the database into a flat JSON file representing all of the messages:
./slackdump convert -f dump -o dump slackdump_20250509_142712/
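Before filtering, it can help to see which fields the converted file actually contains. The following is a minimal sketch rather than part of slackdump itself: it assumes only that the file has a top-level 'messages' list (as the script later in this post does), and the function name count_message_fields is my own.

```python
import json
from collections import Counter


def count_message_fields(export):
    """Count how often each top-level key appears across the exported messages."""
    counts = Counter()
    for message in export.get('messages', []):
        counts.update(message.keys())
    return counts


if __name__ == '__main__':
    # Path taken from this post; the channel ID will differ for other exports.
    with open('dump/CG6252WAY.json') as f:
        export = json.load(f)
    # Print the most common fields first, with the number of messages carrying each.
    for field, n in count_message_fields(export).most_common():
        print(f'{n:8d}  {field}')
```

The output lists each field next to the number of messages that carry it, which makes it easier to decide what to keep.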
However, the converted JSON had a lot of largely extraneous details in it (I don't care about avatars or images, at least not right now), so I decided to filter it further. To do that, I used the following Python script:
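#!/usr/bin/env python
import json

# Load the channel export produced by `slackdump convert`
with open('dump/CG6252WAY.json') as f:
    unfiltered = json.load(f)

sdtr = 'slackdump_thread_replies'
for message in unfiltered['messages']:
    if sdtr in message:
        # Print the text of every entry in the thread replies list
        for m in message[sdtr]:
            try:
                print(m['text'])
            except Exception:
                # Skip entries that have no usable text field
                pass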
I'm not certain this includes everything, although I did find that the slackdump_thread_replies list also includes the original message. Whether it is perfect or not, it gives me a substantial data set to train a model on, so it should be a good starting point. If it works out (or if it doesn't, for that matter) I can refine how I filter the dumped data and try to improve it.
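One possible refinement, sketched here under assumptions I have not verified against every export, is to fall back to a message's own text when it has no slackdump_thread_replies list, and to skip duplicates. The sketch assumes each message carries a 'ts' timestamp field that can serve as an identifier; the function name extract_texts and the output file messages.txt are hypothetical.

```python
import json


def extract_texts(export, replies_key='slackdump_thread_replies'):
    """Return message texts in export order, including messages without thread replies."""
    texts = []
    seen = set()
    for message in export.get('messages', []):
        # When the replies list is present it is assumed to already contain the
        # original message; otherwise fall back to the message itself.
        candidates = message.get(replies_key) or [message]
        for m in candidates:
            text = m.get('text')
            if not text:
                continue  # nothing useful to keep
            # Assumption: 'ts' identifies a message, so a repeated 'ts' is a duplicate.
            ts = m.get('ts')
            if ts is not None:
                if ts in seen:
                    continue
                seen.add(ts)
            texts.append(text)
    return texts


if __name__ == '__main__':
    with open('dump/CG6252WAY.json') as f:
        export = json.load(f)
    # Hypothetical output file name: one message per line in a flat text file.
    with open('messages.txt', 'w') as out:
        for text in extract_texts(export):
            out.write(text + '\n')
```

Messages whose text spans multiple lines will still occupy multiple lines in the output, so a different separator may be worth considering depending on how the model expects its input.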
That's as far as I've gotten with the experiment so far, but since this is a fairly common thing to do I thought I would make a standalone post about it rather than wait until I've had a chance to actually try training a model on this.