How to use AI to map your organisation’s data

Rachael Greaves, co-founder and CEO of the AI cybersecurity SaaS business Castlepoint Systems, gives a masterclass
Rachael Greaves

In March 2023 alone, the records of 41,970,182 people were breached, and those are only the breaches we know about. Latitude Financial alone spilled 14 million records, including 8 million driver’s licences and 53,000 passport numbers. And they didn’t know their own data: they originally thought the compromised dataset affected only 300,000 people (a fraction of the actual scale). If your network was breached, would you be in a similar situation? Would you know what you had lost? And if you didn’t know, and couldn’t explain, what repercussions would you face?

It’s not just big financial institutions having these experiences. Last month, universities, health providers, school districts, car makers, phone providers, sports bodies, retailers, and even a gun auction company reported breaches. This last one is hugely concerning, as the data linked owners, and all their contact details and addresses, with the specific weapons they purchased, making them highly vulnerable to targeted burglaries.

Why you need to map your data

These numbers came from just 100 reported breaches. We know that around 75% of breaches are not reported. We also know that 87% of small businesses, which don’t have to report, hold customer data that could be compromised, and nearly half of all breaches happen to organisations with fewer than 1,000 staff. Whatever size your business is, and whatever vertical it is in, you are at risk.

Let’s consider a case study that really highlights why knowing your data is so important. Optus, Australia’s second-largest telco, suffered a breach in September 2022. Optus had to rally 120 staff to help find out what data was spilled. Their vendors couldn't help them. They had to build their own software application from scratch, just to try to map and understand what data was in the spill. Optus was strongly criticised by the government, journalists, and on social media for how long this took – and also for having so much data that they didn’t need to keep. They earmarked $140M to cover the immediate cost of the breach, but are now facing a class action in the billions.

If we don’t know what we have, we just can’t manage the risk. We can’t dispose of sensitive information we no longer need, or make sure it’s stored in secure systems, or track what happens to it. And if we don’t actually know what data we have, we certainly aren’t getting any value from it! So, hoarding this ‘dark data’ has significantly more risk than reward.

What data do we need to ‘know’?

High-risk information takes many forms and lurks in many places. It’s not just personal information or classified records that we need to worry about. Many information assets can cause harm if seen by people who shouldn’t see them, and a lot of the time, the people handling those records just aren’t aware of the risk.

The first thing governance teams in any business should do is map the Business Impact Level of their information assets. The process is fairly simple: to begin, make a list of the types of assets you probably have. This will include personal information, legal information, financial records, health records, intellectual property, audit records and so on: broad categories that will usually align with organisational functions. The next step is to consider the impact of a breach of those records, from a confidentiality perspective as well as integrity (if the data is corrupted) and availability (if the data is destroyed or encrypted). You can make your own risk matrix, or use an existing one: most governments have a Business Impact Assessment tool freely available online that helps you consider types of harm, and rank them.
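As a rough illustration of the mapping exercise, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the four-level scale, the sample asset types and their scores, and the rule that an asset's overall level is the worst of its confidentiality, integrity and availability impacts. Substitute the levels and harm categories from whichever risk matrix you adopt.

```python
from dataclasses import dataclass

# Hypothetical four-level scale; replace with the levels in your own matrix.
LEVELS = {1: "Low", 2: "Medium", 3: "High", 4: "Extreme"}


@dataclass
class AssetType:
    name: str
    confidentiality: int  # impact if seen by people who shouldn't see it
    integrity: int        # impact if the data is corrupted
    availability: int     # impact if the data is destroyed or encrypted

    def business_impact_level(self) -> int:
        # Assumption: the overall level is the worst of the three impacts.
        return max(self.confidentiality, self.integrity, self.availability)


def rank_assets(assets):
    """Return asset types ordered from highest to lowest impact level."""
    return sorted(assets, key=lambda a: a.business_impact_level(), reverse=True)


# Illustrative scores only; your own assessment will differ.
assets = [
    AssetType("Marketing brochures", 1, 1, 1),
    AssetType("Financial records", 3, 3, 2),
    AssetType("Health records", 4, 3, 3),
]

for asset in rank_assets(assets):
    print(f"{asset.name}: {LEVELS[asset.business_impact_level()]}")
```

Even a simple ranked list like this gives governance teams a starting point for deciding which stores of information to look at first.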

The second thing to do is understand your threat sources. Governments publish cyber threat reports regularly, and these will help you understand the players in the cybersphere. Who might want your data? How motivated are they likely to be? And how capable? If you have any data that a foreign state would want, you are immediately at risk, because state-backed hackers are highly sophisticated.

Third, know your secrecy obligations. There are secrecy provisions sitting quietly in dozens of pieces of legislation, and they will result in civil or criminal penalties (sometimes both) if breached. Certain, sometimes very specific, types of information can be time bombs waiting to go off – you can check with your legal team what legislation you need to abide by, and whether those Acts or Regulations have any secrecy provisions in them.

Finally, know your other obligations for your data, specifically, how long you have to keep it under law. Data is like uranium. It’s powerful, and valuable, but gets dangerous as it’s allowed to decay. You need to be destroying data that has risk, but no longer has value, as soon as you legally can.
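To make the retention idea concrete, here is a small Python sketch that works out the earliest date a record could be destroyed. The record classes and retention periods are hypothetical placeholders, not legal advice: the real periods come from the legislation that applies to you, so confirm them with your legal team.

```python
from datetime import date

# Hypothetical retention periods in years, for illustration only.
RETENTION_YEARS = {"financial": 7, "marketing": 2}


def earliest_disposal_date(record_class: str, closed_on: date) -> date:
    """Date on which the minimum retention period has elapsed."""
    target_year = closed_on.year + RETENTION_YEARS[record_class]
    try:
        return closed_on.replace(year=target_year)
    except ValueError:
        # Record closed on 29 February and the target year is not a leap year.
        return closed_on.replace(year=target_year, day=28)


def is_due_for_disposal(record_class: str, closed_on: date, today: date) -> bool:
    """True once the record has been kept for as long as the rule requires."""
    return today >= earliest_disposal_date(record_class, closed_on)


today = date(2023, 4, 17)
print(is_due_for_disposal("marketing", date(2020, 1, 1), today))
print(is_due_for_disposal("financial", date(2020, 1, 1), today))
```

The check itself is trivial. The hard part, as the rest of this article argues, is knowing which records you hold and which rule applies to each one.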

How does AI come into it?

By now we can see that we have:

- A lot of data

- A lot of threats

- A lot of inherent risk, and

- A lot of legal obligations.

And as well as a big volume problem, we also have a velocity problem. Our information stores are constantly growing and changing. So are our legal obligations, and so is the threat environment. How can we manage this huge scale, with all this complexity, when nothing stands still?

The answer is, we can’t. Not using traditional approaches. That’s why Optus still had so many records hanging around from people who weren’t even customers any more, and didn’t know where those records were sitting, and couldn’t tell that they’d been taken. Optus had actually pushed back on proposed government privacy reforms in 2020, saying that implementing systems to be able to destroy customer data on request would have ‘significant hurdles’ and ‘significant cost’.

That was before AI.

How can we use AI to know our own data?

Artificial Intelligence is a catch-all name for a range of technologies that aim to automate some aspect of work that people have traditionally had to do manually. Previously, to know where all our risky and valuable information was, we had to have people manage our file structures with a lot of rigour. We had to ask people to use metadata and naming conventions to clearly mark information. This worked well enough with paper files, but it all started to go out the window very early in the digital era, as the volume and velocity of data took off.

So, for the last 30 years or so, even though technology has become better and better, information control has become worse and worse. We recognised this in our role as auditors around ten years ago, and conceptualised and invented a new kind of AI to help solve this problem.

We knew we had to be able to read everything in the whole environment, and automatically classify it for known risk, rules, and reusability. AI was (and remains) the only way to do this at scale. For context, it would take you 130 years to read one terabyte of Word documents – even longer to try to manually match them to all the applicable rules. AI can do this in hours or days, without needing anyone to assign any metadata or follow a file plan.

When considering using AI to help you know your own information, think about Optus. Do you have 120 staff and $140M spare to start the process of knowing your own data after you have already been breached? If not, it pays to start now: understand what you have, where it is, and who is doing what to it; what risk and value it has; and what rules apply (and whether they are being met). These days, we need to assume that being breached is not a question of ‘if’ but ‘when’. We need to minimise the impact of any breach by protecting high-risk information, and destroying it in accordance with regulations, so that our exposure to bad actors is minimised. Transparent, ethical AI can help to do that, and keep you defensible in the event of a spill.

Written by Rachael Greaves
April 17, 2023