Could AI exfiltrate your data?

Richard Ford, CTO at Integrity360, reveals the hidden dangers of ChatGPT and its rivals

The commercialisation of Generative AI (GenAI) is now in full swing, with successive generations of tools launching pro versions and subscription-based services. The idea is to regulate the technology to avoid issues over copyright, for example, and to train the AI within the context of the organisation so that it becomes a more valuable tool. But we’re in uncharted territory here, and the truth is that adopting AI will entail a steep learning curve. Even enterprise-grade Large Language Models (LLMs) such as ChatGPT, Google’s PaLM and Gemini, and Meta’s Llama will need to be handled carefully.

There’s already evidence to suggest that data management is being disrupted by GenAI. The Cloud and Threat Report: AI Apps in the Enterprise found sensitive data being shared with these applications in large businesses on a daily basis. What’s more, it recorded 183 incidents per 10,000 users of sensitive data being sent to ChatGPT every month.

The most frequent type of data being shared was source code (posted by 22 out of every 10,000 users, resulting in 158 posts per month), which could also expose passwords and keys embedded in that code to GenAI. The next most common data type was regulated data such as financial, healthcare or personally identifiable information (PII), with 18 incidents per 10,000 users per month, followed by intellectual property (IP) at four incidents per 10,000 users per month.

Misconfiguration presents a massive risk

If the GenAI is properly configured, this data sharing does not necessarily present a threat. The problem comes if the app in use has not been correctly set up, at which point the sensitive data ingested by the AI could be presented to other members of the business who would not normally have access to that information.

There’s then the risk of data leakage via these staff, many of whom are now being recruited as malicious insiders by organised criminal gangs (OCGs): 52% of businesses have seen an increase in insider-related incidents due to the economic downturn, according to the ISC2 Cybersecurity Workforce Study 2023. Alternatively, guest or external users, who are commonly granted access in collaborative working apps, could act as the point of egress.

Varonis recently looked at the type of information that users could obtain via an unregulated Copilot installation. Worryingly, it determined that 'prompt hacking' could be used to obtain sensitive information, ranging from employee data containing PII, bonuses awarded to staff, and files housing credentials, APIs or access keys, to M&A activity and documents labelled as sensitive under the document management policy.

Urgent action required

Commercial GenAI providers are aware of these issues and have taken great pains to educate the market. Microsoft, for example, has stated of Copilot that “permission models in all available services, such as Sharepoint, [should be used] to help ensure the right users or groups have the right access”, indicating the need to make sure the business is AI-ready before committing to these models, and to manage that data on an ongoing basis.

Copilot is an invaluable tool that spans the Microsoft services suite and is integrated into everything from Microsoft Dynamics for ERP and CRM, to Glint for HR, and of course the ubiquitous Office 365. In the latter, it can enable the workforce to use corporate data to summarise email trails, draft responses, encapsulate the key points of a meeting and suggest action points in real time, for instance. But its pervasiveness makes it essential that the tool is properly deployed.

Copilot does, of course, have its own guardrails and security controls. Prompt responses take place within the Microsoft 365 Trust Boundary, so no data is shared with OpenAI or used to train the LLM, and Copilot makes use of existing user permissions to determine what it will analyse and surface. It also enforces two-factor authentication (2FA), compliance boundaries, and privacy protections and permissions to regulate access. However, the collaborative nature of the modern work environment can quickly erode the concept of least privilege, with sensitive data stored in personal folders across the Office 365 ecosystem, at which point it becomes accessible to others.

Making GenAI safe and secure

To securely configure GenAI such as Copilot, it’s therefore necessary to lay the groundwork beforehand. Key to this is determining what data you have, how and where it is stored and controlled, how it is labelled, and who has access to it. You’ll also need to know which third-party apps are in play and which data they can access. This is a huge undertaking that will necessitate the automated scanning of all identities, accounts and entitlements, files and folders, shared links and data labels, as well as flagging those instances where labelling is missing.

Data classification is a must, but it needs to be accurate to prevent it becoming unnecessarily restrictive or allowing data loss prevention (DLP) controls to be circumvented. Automatically scanning and labelling all data can ensure this doesn’t happen, as well as ensuring data is correctly relabelled when it changes. One thing to bear in mind, for instance, is that GenAI will generate new content based upon existing sensitive data; unless that new content is also labelled as sensitive, it effectively becomes invisible and can no longer be tracked by DLP.
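One simple way to think about keeping GenAI output trackable is label inheritance: content derived from sensitive inputs takes the most restrictive label among its sources, so DLP never loses sight of it. The label names, ranking and default below are assumptions for illustration, not any vendor's actual scheme.

```python
# Assumed label hierarchy, least to most restrictive.
LABEL_RANK = {"Public": 0, "Internal": 1, "Confidential": 2, "Restricted": 3}

def derive_label(source_labels):
    """Generated content inherits the most restrictive source label."""
    if not source_labels:
        return "Internal"  # assumed default for unlabelled sources
    return max(source_labels, key=lambda label: LABEL_RANK[label])

# A summary drawn from one public and one confidential document
# should itself be treated as confidential.
print(derive_label(["Public", "Confidential"]))  # Confidential
```

Without a rule of this kind, a Copilot-generated summary of a confidential file would land in the tenant unlabelled and slip straight past DLP.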

The next step is to apply the concept of least privilege so that only those who need it are granted access to certain data sets. According to the 2023 State of Cloud Permissions Risks Report, 50% of identities are super admins with unrestricted access, illustrating just how open most data is and the potential risk of compromise. Removing these open permissions and enforcing a least-privilege policy via an automated system such as Microsoft Purview, to align access privileges with roles, can help remedy this issue. But these access privileges and sharing rights should also be reviewed regularly, with old links or permissions removed, so that they remain relevant.
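The two halves of that step, aligning access with roles and pruning stale sharing, can be sketched as follows. The role-to-dataset map, the 90-day staleness threshold and the SharePoint-style URLs are illustrative assumptions; a real review would run against the tenant's actual permission and link data.

```python
from datetime import date

# Assumed role-to-dataset policy: each role may only touch its own data sets.
ROLE_ALLOWED = {
    "finance-analyst": {"finance"},
    "hr-partner": {"hr"},
}

def allowed(role, dataset):
    """Least-privilege check: is this dataset in the role's allow-list?"""
    return dataset in ROLE_ALLOWED.get(role, set())

def stale_links(links, today, max_age_days=90):
    """Flag sharing links unused for longer than the review window."""
    return [link["url"] for link in links
            if (today - link["last_used"]).days > max_age_days]

today = date(2024, 4, 22)
links = [
    {"url": "https://contoso.example/old-deck", "last_used": date(2023, 11, 1)},
    {"url": "https://contoso.example/roadmap", "last_used": date(2024, 4, 1)},
]
print(allowed("finance-analyst", "hr"))  # False
print(stale_links(links, today))         # ['https://contoso.example/old-deck']
```

Because Copilot surfaces whatever the signed-in user can already reach, shrinking the allow-lists and deleting the stale links directly shrinks what a prompt can pull out.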

Only when these steps have been performed and automated processes put in place can GenAI be safely deployed, and even then it will require ongoing maintenance in the form of monitoring. Data access, authentication, link creation and usage, and object changes related to the data, together with the context in which it is used, should all be monitored by the incident response team. In this way, unusual access patterns or the pulling down of sensitive data, such as in the prompt hacking examples cited above, can be detected and flagged.
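A toy version of that monitoring step is shown below: it counts each user's accesses to confidential material within a window and flags anyone exceeding a fixed threshold. The log format, label name and threshold are assumptions for illustration; production systems would baseline behaviour per user rather than use one static cut-off.

```python
from collections import Counter

def flag_unusual(access_log, threshold=3):
    """Flag users whose confidential-file accesses exceed the threshold."""
    counts = Counter(entry["user"] for entry in access_log
                     if entry["label"] == "Confidential")
    return sorted(user for user, n in counts.items() if n > threshold)

# Simulated access log: one user is pulling down far more sensitive
# material than normal, the pattern prompt hacking would produce.
log = [
    {"user": "mallory", "label": "Confidential"},
    {"user": "mallory", "label": "Confidential"},
    {"user": "mallory", "label": "Confidential"},
    {"user": "mallory", "label": "Confidential"},
    {"user": "alice", "label": "Confidential"},
    {"user": "alice", "label": "Public"},
]
print(flag_unusual(log))  # ['mallory']
```

The design point is that the alert keys off the classification labels applied earlier, which is why accurate labelling has to come before monitoring can work.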

There’s little doubt that GenAI will transform our relationship with data, with more information being created and shared than ever before. It signals the beginning of a whole new way of working and promises to increase the ease with which we access and use data, making us more productive in the process. But it also poses an enormous risk to that data. It's only by using automation to identify, classify and monitor access and usage that we can safely use GenAI and mitigate the threat of data exfiltration.

Written by
Richard Ford
CTO at Integrity360
April 22, 2024