Where is the magic AI-wand for data?
by A M Howcroft, SWARM CEO
Part 2 of 3
You have a clear definition of your problem, you just need to get the data, and then you can start building a solution. Great, your IT department says, we think we can get that for you in about six months. Arghhhh!
There is an alternative, though: the business department already has most of that data, in exports from transactional applications plus home-grown spreadsheets they manage themselves – perhaps you could use that? Ten minutes after starting to explore the business data, you will discover it is incomplete, riddled with duplicates, typos and other quality issues, and may be more than a year out of date. That’s bad news, but what's even worse is that the business uses this data every day to make critical decisions. Surely there must be a better way of getting data accessible, cleaned, and ready to drive a new solution?
There are various approaches and some complex tools to extract, transform, clean and load data, and a whole mature industry that has been working on this challenge for decades, but it still seems to be a major blockage in most organizations. What if AI has new answers: is there a magic-wand for data, which will eliminate this problem for us?
The multi-million-dollar lake
Several of the largest AI vendors have tackled the data problem, but perhaps not in the way you might imagine. Before you can run a machine learning algorithm you need training data, preferably a large amount, and well-labelled – which is a human-intensive process. The big AI vendors realised this early on, which is why their typical engagement model frequently starts with the construction of a data lake. It is also why these projects often run into years and have eye-watering budgets. Data lakes are not new, and neither are the technical challenges associated with them.
Gathering, cleaning, identifying and consolidating data into a structure that makes sense is tough. Maintaining it is even harder. You may need to incorporate a data dictionary, and make decisions about terminology – is a sale made when the customer signs a contract, or when they pay the first invoice? Do they still count as a customer if they have an exit clause up to 90 days after signing the contract? What happens if a customer stops paying us, is there a category of lapsed customer or ex-customer? When does a late paying customer become lapsed, or move to ex-customer status? Far too often, the business rules for these types of definitions vary by company departments, regions, or even individuals. Not to mention that data is dynamic, with new products being added, new people and sites, and we may acquire or divest whole companies or divisions, resulting in wholesale changes to the data.
A data lake can rapidly come to look more like Minnesota, the land of ten thousand lakes, with multiple areas of different size and shape, some stagnant and others churning with fresh data. It can become a swamp, or dissolve into a million puddles. Trying to build on top of this infrastructure, or even worse, having to wait until it is ‘complete’ before you can create a solution is frustrating, and ultimately damaging to the business.
How can AI help?
If AI can help produce pandemic-beating drugs, write essays that can pass SAT tests, and create decent poems and art, why can’t it clean up our data for us? Good question. I asked my new best friend, ChatGPT. This is the reply:
Despite the rapid improvements in Artificial Intelligence (AI) technology, there appear to be a few hurdles still stopping AI from completely taking over the gathering and cleaning of data. Firstly, most models of AI are not able to adjust on their own to deal with unexpected changes that can occur when collecting data which often leads to errors that humans must eventually sort out. Secondly, it is difficult for AI models to understand complex abstract concepts such as human emotion or cultural context when collecting data. Finally, although sophisticated robots can be used in physical tasks such as surveying certain environments, they are limited in terms of potential creative uses due to their reliance on much less flexible programming. For example, robots can't establish relationships with people that could provide natural insights into what is being collected. All these challenges suggest there is still room for humans when it comes to gathering and cleaning data - at least for now!
A better way to think of AI and data cleaning is to imagine it as an assistant, rather than a professional expert. It can’t complete the job on its own, but it can help you be more efficient. Increasing use of AI in data tools will make us faster at manipulating information, and in the near term humans will continue to be involved in training the AI so it keeps learning.
We worked with a company that had a column of data that contained the values CPK30, CPK45, CPK60, or nulls. We had to ask the human operators what the code meant, as there was no way to deduce it from the other data columns. It turned out they were packaging choices, with either 30, 45 or 60 elements in a tray. However, 45 units was a non-standard choice that not all customers wanted or would accept if delivered. An AI data tool was able to highlight that only one customer ever ordered trays of 45 elements. This was confirmed by the most experienced human operator. A business rule needed to be constructed that would ensure only that one customer could buy 45 tray packs. Without AI picking this anomaly up, there would have been potential for issues in a solution that might have tried assigning this package option to other customers. The combination of people + AI as a smart assistant worked really well in this case.
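For readers who like to see the idea in code, here is a minimal sketch of the kind of check that surfaces the tray-of-45 anomaly. It is not the tool we used on the project: the function name, the customer names and the sample orders are all made up for illustration, and a real data tool would do far more than this.

```python
from collections import defaultdict


def codes_with_single_customer(orders):
    """Return {pack_code: customer} for codes ordered by exactly one customer.

    `orders` is an iterable of (customer, pack_code) pairs; pack_code may be
    None, mirroring the nulls in the original column.
    """
    customers_by_code = defaultdict(set)
    for customer, code in orders:
        if code is None:  # nulls tell us nothing about packaging choices
            continue
        customers_by_code[code].add(customer)
    return {
        code: next(iter(customers))
        for code, customers in customers_by_code.items()
        if len(customers) == 1
    }


# Hypothetical sample data: customer names are invented for this sketch.
orders = [
    ("Acme", "CPK30"),
    ("Acme", "CPK60"),
    ("Birch", "CPK30"),
    ("Birch", None),
    ("Cedar", "CPK45"),
    ("Cedar", "CPK45"),
    ("Cedar", "CPK60"),
]

print(codes_with_single_customer(orders))  # {'CPK45': 'Cedar'}
```

The code only flags the pattern. Deciding whether it is a genuine business rule or an accident of the data is still a question for the experienced human operator.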
How can we help the AI?
Machine learning is a two-way street. Google has trained you to ask questions in a certain way, just as you have trained the Google algorithms on your preferences. For data management, we need to help AI learn, by understanding how computers comprehend data. For example, spreadsheets are quite easy for a computer to read if you lay out your data with a single table per page. However, very few business users work that way. Perhaps as Microsoft embeds ChatGPT into Excel, we might see the AI suggesting and recommending better ways for users to lay out data.
The other thing we can do is to increase data validation on entry: having typos automatically corrected, ensuring values are in reasonable ranges, and so on. Again, AI tools built into our data entry can help here significantly.
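Even without AI, a little validation at the point of entry goes a long way. The sketch below shows the two checks mentioned above, typo correction and range checking, using only Python's standard library. The list of regions, the quantity limits and the 0.8 similarity cutoff are assumptions chosen for the example, not recommendations.

```python
import difflib

# Hypothetical reference values, invented for this sketch.
KNOWN_REGIONS = ["central", "north", "south"]
QUANTITY_RANGE = (1, 10_000)


def validate_entry(region, quantity):
    """Return (cleaned_region, errors) for a single data-entry record."""
    errors = []
    cleaned = region.strip().lower()
    if cleaned not in KNOWN_REGIONS:
        # Fuzzy-match against known values to correct likely typos.
        matches = difflib.get_close_matches(
            cleaned, KNOWN_REGIONS, n=1, cutoff=0.8
        )
        if matches:
            cleaned = matches[0]
        else:
            errors.append(f"unknown region: {region!r}")
    low, high = QUANTITY_RANGE
    if not low <= quantity <= high:
        errors.append(f"quantity {quantity} outside {low}-{high}")
    return cleaned, errors


print(validate_entry("Centrl", 50))  # typo corrected, no errors
print(validate_entry("north", 0))    # region fine, quantity out of range
```

Automatic correction is only safe where the set of valid values is small and well separated. For codes where a near miss could be a different valid value, it is better to prompt the user than to guess.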
We should also encourage the venture community to invest more heavily in AI+data combinations. This can be hard, because it is not always easy for business-focused individuals at the venture firms to grasp what can be tricky technical concepts, but I’m sure any VC would see the potential market opportunity for a ChatGPT equivalent in data management.
What could/should AI do?
At SWARM we are looking at AI to engage the humans in a conversation to assist in learning (for both humans and machines – remember it’s a two-way street). Rather than our data scientists or salespeople quizzing customers with questions like ‘How much do you spend on outbound logistics in the central region?’ and then ploughing through reams of data to see if we can validate the executive’s answer, wouldn’t it be better if AI could perform the analytics and ask the executive for confirmation: ‘It looks like we spent approximately $55 million on outbound logistics in the central region, which is more than 20% higher than other regions. Does that sound correct?’
If that conversation were happening between the AI and the customer directly, the AI could take this a step further and suggest potential improvements. ‘Would you like me to ask the VP logistics to model the challenge here, and see if we can find a way to improve that?’
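To make the confirmation question concrete, here is a rough sketch of the analytics behind it: compare one region's spend with the average of the others and, if the gap exceeds a threshold, phrase the finding as a question. This is an illustration only, not a description of our product. The function name is hypothetical, the spend figures for the north and south regions are invented, and comparing against a simple average of the other regions is an assumption.

```python
def confirmation_question(spend_by_region, region, threshold=0.20):
    """Return a question for the executive, or None if nothing stands out.

    The baseline is the mean spend of all the other regions (an assumption
    made for this sketch; other baselines are equally plausible).
    """
    others = [v for r, v in spend_by_region.items() if r != region]
    baseline = sum(others) / len(others)
    spend = spend_by_region[region]
    uplift = (spend - baseline) / baseline
    if uplift <= threshold:
        return None
    return (
        f"It looks like we spent approximately ${spend / 1e6:.0f} million "
        f"on outbound logistics in the {region} region, which is more than "
        f"{threshold:.0%} higher than other regions. Does that sound correct?"
    )


# Hypothetical figures: only the $55 million central value comes from the text.
spend = {"central": 55_000_000, "north": 44_000_000, "south": 45_000_000}

print(confirmation_question(spend, "central"))
print(confirmation_question(spend, "north"))  # None: nothing unusual
```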
We are building a tool that will act as a virtual consultant or advisor for a C-level exec. It could engage in a conversation, answer questions about the business and offer suggestions – all while learning more about the data and the business.
This could happen at multiple levels in the organisation, which would rapidly aid the learning and therefore help with data management. However, there is still a requirement for well-managed data to feed the AI, so nobody in data management is likely to find themselves redundant in the immediate future.
We can see some exciting possibilities in the future, but going back to Einstein’s original ‘60 minutes to save the world’ thesis (see the first part of the blog series here), we think at least 15 minutes should currently be set aside for data management. If anyone does have a magic AI wand for data, we (and our customers) would love to hear about it.
Let’s see: 40 minutes to define the problem, 15 minutes to wrangle the data, that leaves only 5 minutes to build and deploy a solution. Impossible, right? Maybe not… stay tuned for the final part in this series next week!