PONDERING ‘OPEN’ ARTIFICIAL INTELLIGENCE
Has the Open Source Initiative got it right? (Well, maybe.)
First, let’s talk about why this is HARD, HARD, HARD!
My open journey started a few years ago (well, quite a few years ago) when the wonderful Steve George took me on a date to see someone speak about open source software. I know, top date, what a guy!
But actually, what an eye-opener for the Big Law person I was at the time. And how far we have come in those past years - from open source being a hobbyist pursuit to it sitting at the core of most enterprise tech stacks.
Whilst open source isn’t easy, it is graspable - even if not everyone likes it (just look at the discussions around GNU licenses).
If we think back to the pre-open days, you got your binary code, you installed how the license said you could, and you spun it up. What you didn’t get to do was see under the hood or have any unilateral ability to make it better (or, if we are honest, fix the bugs no-one wanted to deal with!).
Fast forward to today and Git is our friend and a forest of (broadly) supportive licenses say: “hey you, feel free to dive in and fix it and make it better and do new amazing things off the back of it!”
And that’s the point really: with open source code you have the ability to “dive in” to the “source code”. You can actually look under the hood. You can see the lines of human readable code, and you can understand the libraries it pulls on, the logic flows, etc. That is the essence of the source being “open”. (The other good stuff, in the open source definition, flows from that.)
BUT WITH ARTIFICIAL INTELLIGENCE, YOU CANNOT LOOK UNDER THE HOOD IN THE SAME WAY - SO WHAT MIGHT OPEN REALLY MEAN?
And that is what makes it HARD. ‘Simple’ AI (e.g. expert systems) may well be capable of being open sourced. The code will tell the story. But for ‘trained’ models, that is not the case.
The way I think of the difference is like this:
Software stores ‘instructions’: if this, then that (if you are told that a car’s coming, you stay on the pavement)
Machine learning models store ‘experience’: if something is like this, my experience tells me to do that (if you see a car coming, you know crossing the road is not ideal)
That experience is the ‘learning’ in machine learning. The model is trained on experience examples and learns based on them. It doesn’t keep a copy of all the examples. The best way I’ve seen this described is it’s like looking at a photo, but not keeping a copy of it. You still remember what is in the photo.
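To make the instructions vs experience difference concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the function names, the two features (how close and how fast the car is) and the six training examples are my assumptions, and the "model" is a deliberately tiny nearest-average learner rather than anything a real AI system would use. The point is only that the rule can be read straight from the source, while the trained model keeps a handful of numbers and not the examples themselves.

```python
from statistics import mean


def should_wait_rule(told_car_is_coming: bool) -> bool:
    """Software: an explicit instruction you can read in the source."""
    return told_car_is_coming


def train(examples):
    """Learn one average point per label from the examples.

    Only these averages (the 'parameters') are returned; the examples
    themselves are not kept.
    """
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(parameters, features):
    """Pick the label whose learned average is closest to the new input."""
    def squared_distance(average):
        return sum((a - b) ** 2 for a, b in zip(average, features))

    return min(parameters, key=lambda label: squared_distance(parameters[label]))


# Hypothetical 'experience': (closeness of car, speed of car) -> what was done.
EXAMPLES = [
    ((0.9, 0.8), "wait"),
    ((0.8, 0.9), "wait"),
    ((0.7, 0.7), "wait"),
    ((0.1, 0.2), "cross"),
    ((0.2, 0.1), "cross"),
    ((0.3, 0.3), "cross"),
]

parameters = train(EXAMPLES)
print(parameters)                        # a few numbers, no examples
print(predict(parameters, (0.95, 0.9)))  # a close, fast car it has never seen
```

Reading `should_wait_rule` tells you exactly why it answers the way it does. Reading `parameters` tells you very little unless you also know what the examples were and how the training was done.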
If you don’t keep the source material, what can you then open up? THAT IS WHAT THE OSI HAS BEEN TRYING TO FIGURE OUT.
Second, let’s look at the Open Source Initiative’s definition
This page is the current draft (as at 24 October 2024 when I wrote this blog post).
Specifically, they say that open source AI is an AI system made available under terms and in a way that grant the freedoms (based on free software freedoms) to:
Use the system for any purpose and without having to ask for permission
Study how the system works and inspect its components
Modify the system for any purpose, including to change its output
Share the system for others to use with or without modifications, for any purpose
The OSI’s current draft goes into some detail on what you need to be given in order to modify a system (the “preferred form” for making modifications), but it is broadly this:
Sufficiently detailed information about the training data “so that a skilled person can build a substantially equivalent system”
Complete “source code used to train and run the system”
Model parameters “such as weights and other configuration settings”
This is A LOT of information and freedom. So I would say this is a good definition for something that’s more complicated than pulling down a code base.
The knowledge about training data and parameters is key: they are the things which affect how good or bad the model is when used on input data post training/tuning.
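One way to picture those three ingredients is as a checklist you could run against any model release. The sketch below is purely illustrative: the `Release` class, its field names and the example release are hypothetical, and this is not an OSI tool or an official test of compliance.

```python
from dataclasses import dataclass


@dataclass
class Release:
    """Hypothetical description of what a model release makes available."""
    name: str
    data_information: bool  # enough detail on training data to build an equivalent system
    source_code: bool       # complete code used to train and run the system
    parameters: bool        # weights and other configuration settings


def missing_components(release: Release) -> list[str]:
    """List which of the three ingredients named in the draft are absent."""
    checks = {
        "data information": release.data_information,
        "source code": release.source_code,
        "parameters": release.parameters,
    }
    return [label for label, present in checks.items() if not present]


# A made-up 'weights only' release, the kind often described as open.
weights_only = Release(
    name="example-model",
    data_information=False,
    source_code=False,
    parameters=True,
)

print(missing_components(weights_only))
```

A release that only ships weights would come back with two gaps, which is the sort of thing the definition is trying to make visible.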
This definition is a work in progress and has the right motivations behind it. Will it evolve over time? Of course.
Third, is this the beginning or the end?
I would say neither.
But for now, it’s a start - and it definitely goes some way towards addressing a little bit of open-washing in AI that’s not gone unnoticed.
Finally, should you even care?
The answer has got to be yes: this is open vs closed all over again, and if we don’t fight for open, we will never reap the huge benefits that open can bring:
Technical advantage & credibility by being better able to service, test, upgrade and maintain
Access to code and information to customise models to a business need
The more established/popular open projects tend to benefit from active communities of users and developers who contribute to innovation, offer support, share knowledge, and help troubleshoot issues. To some extent, this also creates the circumstances where model vulnerabilities (e.g. hallucinations and other errors) are flushed out and fixed rapidly too
Avoidance of vendor lock-in by not becoming overly dependent on a single vendor’s technology or pricing
So, a good start for all the right reasons!
PS: Of course, don’t forget about the Open Data Commons license!
THANKS FOR READING