GDPR data breach notification – Get a grip on the technicalities
One of the most hotly discussed requirements of the EU GDPR is the need to notify the supervisory authority within 72 hours of a data breach being detected (in the UK this is the ICO – www.ico.org.uk). This requirement is not confined to organisations based in the EU: GDPR is supra-national, so it applies to all organisations that process the data of people in the EU. Other countries also have, or are planning, similar mandatory rules. The UK will have to implement equivalent rules after Brexit in order to continue to exchange information with the EU, and countries like Australia have set out their own mandatory data breach notification requirements, which are similar to the EU's.
Data breach notification obligations – Meeting the 72-hour timeline
While not specifying a time limit within which to make the notification, the Australian regulator has used terms such as “promptly” and “expeditiously” in the OAIC rules.
Timescales notwithstanding, regulators expect certain information to be included in a notification, and these expectations trickle down to the technical systems involved in handling, storing and processing data and in responding to any breach. That information needs to be collected, retained and made accessible.
The clearest example of this is the need to inform data subjects of their exposure – this means having a way to identify and reliably contact them.
This blog post takes a deeper look at the difficulties in the data breach notification process as companies try to ascertain the true nature of a breach. It might take 20 minutes or so to read, but it is divided up by the questions businesses will need to answer:
- Who has the data breach affected and who stole the data?
- What data was affected by the breach?
- When did the data breach occur?
- Why was the data stolen by an attacker?
- Where is the stolen/leaked data and the affected users/customers now?
- How did the data breach happen?
Data breach notification: Tough questions to answer
The technological requirements of meeting the notification timelines are driven by the questions and expectations contained in the Regulations. The EU GDPR (in Article 33) asks for:
The notification … shall at least:
1) describe the nature of the personal data breach including where possible, the categories and approximate number of data subjects concerned and the categories and approximate number of personal data records concerned;
2) communicate the name and contact details of the data protection officer or other contact point where more information can be obtained;
3) describe the likely consequences of the personal data breach;
4) describe the measures taken or proposed to be taken by the controller to address the personal data breach, including, where appropriate, measures to mitigate its possible adverse effects.
This holds unless:
… the personal data breach is unlikely to result in a risk to the rights and freedoms of natural persons.
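To make the four quoted items concrete, here is a minimal sketch of how a draft notification could be tracked during an investigation. The class name, field names and example values are all hypothetical, chosen for illustration only; they are not taken from the Regulation or from any regulator's form.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BreachNotification:
    """Hypothetical record mirroring the four items quoted above."""
    nature_of_breach: Optional[str] = None            # item 1
    data_subject_categories: Optional[str] = None     # item 1, "where possible"
    approx_data_subjects: Optional[int] = None        # item 1, "where possible"
    approx_records: Optional[int] = None              # item 1, "where possible"
    contact_point: Optional[str] = None               # item 2
    likely_consequences: Optional[str] = None         # item 3
    measures_taken_or_proposed: Optional[str] = None  # item 4


def missing_items(notification: BreachNotification) -> list[str]:
    """Return the names of the fields that still have no answer."""
    return [f.name for f in fields(notification)
            if getattr(notification, f.name) is None]


# Early in an investigation only a couple of answers may be known.
draft = BreachNotification(
    nature_of_breach="Unauthorised read access to the customer database",
    contact_point="dpo@example.com",
)
print(missing_items(draft))
```

A list of unanswered fields like this is a simple way to see which questions the underlying logs and systems still need to answer before the deadline.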
The Australian equivalent (to continue the comparative example) insists that:
This notification must set out:
– the identity and contact details of the organisation;
– a description of the data breach;
– the kinds of information concerned and;
– recommendations about the steps individuals should take in response to the data breach.
Additionally in Australia, notifications are only required for an ‘eligible data breach’:
An eligible data breach arises when the following three criteria are satisfied:
1) there is unauthorised access to or unauthorised disclosure of personal information, or a loss of personal information, that an entity holds
2) this is likely to result in serious harm to one or more individuals, and
3) the entity has not been able to prevent the likely risk of serious harm with remedial action.
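The three criteria above combine as a simple logical test, which can be sketched as follows. The class and field names are hypothetical, and deciding each answer remains a matter of judgement for the organisation; the code only shows how the answers combine.

```python
from dataclasses import dataclass


@dataclass
class BreachAssessment:
    """Hypothetical answers to the three criteria quoted above."""
    unauthorised_access_disclosure_or_loss: bool  # criterion 1
    serious_harm_likely: bool                     # criterion 2
    harm_prevented_by_remediation: bool           # criterion 3, inverted


def is_eligible_data_breach(a: BreachAssessment) -> bool:
    """All three criteria must hold for the breach to be 'eligible'."""
    return (a.unauthorised_access_disclosure_or_loss
            and a.serious_harm_likely
            and not a.harm_prevented_by_remediation)


# Example: data was lost, but remedial action removed the risk of harm.
contained = BreachAssessment(True, True, harm_prevented_by_remediation=True)
# Example: data was posted publicly and nothing could be done to recall it.
exposed = BreachAssessment(True, True, harm_prevented_by_remediation=False)

print(is_eligible_data_breach(contained), is_eligible_data_breach(exposed))
```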
The challenge in the breach notification process comes when these questions are asked during the critical early detection/investigation stage of a breach because often the underlying technology might not provide easy, complete or accessible answers.
Take a simple case: a database with “x” million rows, holding columns for name, address, date of birth, password and identifying information. If this data is known to have been stolen then you might be in with a chance of describing the contents and identifying those affected. But even where a dump of the data is posted externally in a viewable form (e.g. on a site like pastebin), all you can see is what the attacker has posted, not what they actually managed to get away with.
Who has the data breach affected & who stole the data?
“Who?” has at least three meanings in the context of a data breach:
- Who was affected?
- Who accessed/stole/leaked the data?
- Who has access to it now?
There is quite possibly a fourth:
- Who allowed this to happen (whose fault is it)?
Starting with the people affected: if you have a record of the file or database contents that were accessed, you might be able to work out who they are from that, as in the example above. However, data sets are often made up of separate categories, files, sections or chunks – there might be a table of active customers in one system while another holds lapsed accounts. An extract from an application that is transmitted across the network might contain only a portion of the data, or a lost file may have been created as an extract on a particular date and hold records matching specific criteria as at that time.
In short you need good and detailed logs on database accesses, network session information, server activity records and original copies of specific files that were retrieved, disclosed or leaked in order to come up with a reliable list of who was affected.
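As a rough illustration of why those logs matter, the sketch below derives a list of affected customers from database access records. The log format, session identifiers and customer IDs are invented for the example; real audit logs will look different and will usually need joining with network and server records.

```python
from datetime import datetime

# Hypothetical database access log: one entry per customer record read.
access_log = [
    {"time": datetime(2018, 3, 1, 2, 14), "session": "s-901", "customer_id": 1001},
    {"time": datetime(2018, 3, 1, 2, 15), "session": "s-901", "customer_id": 1002},
    {"time": datetime(2018, 3, 1, 2, 15), "session": "s-901", "customer_id": 1001},
    {"time": datetime(2018, 3, 1, 9, 30), "session": "s-377", "customer_id": 2050},
    {"time": datetime(2018, 3, 2, 1, 5), "session": "s-901", "customer_id": 1003},
]


def affected_customers(log, suspect_sessions, start, end):
    """Distinct customer IDs read by the suspect sessions within the window."""
    return sorted({
        entry["customer_id"]
        for entry in log
        if entry["session"] in suspect_sessions and start <= entry["time"] <= end
    })


window_start = datetime(2018, 3, 1, 0, 0)
window_end = datetime(2018, 3, 1, 23, 59)
print(affected_customers(access_log, {"s-901"}, window_start, window_end))
```

Without per-record logging of this kind, the only defensible answer to “who was affected?” tends to be “potentially everyone in the table”.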
The challenge is best illustrated by the data breach at TalkTalk, where the initial answer to “How many people were affected?” was non-committal and uncertain. The media assumed the worst and reported a number based on the entire customer base, some 4 million people. The reality was closer to 150,000, and only 15,000 of those were “sensitive records” containing financial data. Had the initial response been “Only about 15,000”, the flurry of coverage might have been more muted.
The question of who accessed the data will often also prove difficult. If the data is leaked from within, there might be clues as to its origin or internal owner. The problem is that this might not be the person who caused or allowed its exposure. If data was accessed externally and then used, posted or shared, you might get as far as a pseudonym or hacker alias, but the individual concerned and their motivation might only become clear much later.
More worrying is considering who ends up with access to the stolen files or records. The initial breach might be due to carelessness, whistleblowing, or an attempt to highlight weak application security. However, once the data set is exposed, especially if it is indeed sensitive information and hence valuable to a fraudster or attacker, the motivations of those who might exploit it could range from conducting phishing attacks to blackmailing data subjects or committing fraud (including against other sites due to password sharing and reuse). If data is disclosed to, or accessed by, a single party, the motivation or risk can perhaps be presumed; if the entire Internet has access to it then the intentions of anyone accessing it are much harder to quantify.
What data was affected by the breach?
The number of records (i.e. the customers or data subjects) affected, or the volume of data, is one dimension of a data breach. The second is the nature of the fields/columns within the data set, or the extent of the data affected for those individuals.
In simple terms, what columns of the database were captured – just names/addresses, or are identity information, passwords, bank details, security questions, sensitive fields etc. involved?
The published dataset may give a hint, but again might not contain a complete picture of the information. If you are able to retrieve the queries used by an attacker from database log records it may be possible to replay the queries used to access data and hence to deduce the contents. If it is a file that has been leaked/posted/stolen then the contents will be clear as long as you have a reference copy of the file to refer to.
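A minimal sketch of the replay idea, using an in-memory SQLite database as a stand-in for a reference copy of the data. The schema, the rows and the logged queries are all hypothetical, and only read-only statements are replayed.

```python
import sqlite3

# Stand-in for a reference copy of the affected database (hypothetical schema).
ref = sqlite3.connect(":memory:")
ref.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, email TEXT, card_number TEXT)"
)
ref.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    (1, "A. Example", "a@example.com", "4111-xxxx"),
    (2, "B. Example", "b@example.com", "5500-xxxx"),
    (3, "C. Example", "c@example.com", "3400-xxxx"),
])

# Queries recovered from the (hypothetical) database log records.
logged_queries = [
    "SELECT name, email FROM customers WHERE id < 3",
    "SELECT card_number FROM customers WHERE id = 2",
]


def replay(conn, queries):
    """Re-run logged SELECTs and report which columns and how many rows they return."""
    exposed_columns = set()
    rows_returned = 0
    for query in queries:
        if not query.lstrip().upper().startswith("SELECT"):
            continue  # never replay statements that could modify data
        cursor = conn.execute(query)
        exposed_columns.update(col[0] for col in cursor.description)
        rows_returned += len(cursor.fetchall())
    return sorted(exposed_columns), rows_returned


print(replay(ref, logged_queries))
```

One caveat: a replay only reflects what the attacker saw if the reference copy matches the state of the data at the time of the breach, which is another reason to keep dated copies or backups.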
However, in the absence of detailed database query logs, server logs, or network session information, knowing how exposed the affected data subjects are could be a matter of assuming the worst, hoping for the best or applying some educated guesswork.
This is especially the case when sensitive fields such as financial information, card numbers or passwords might or might not have been encrypted – did the data contain encrypted versions or hashes of the actual data, and how well protected were these? In past breaches, the media and senior executives have had a poor track record of answering or reporting on this accurately.
Encrypted data is not the same as hashed data when it comes to the effects of a breach. Questions like “Were the decryption keys also affected?”, “Were the hashes salted?”, “Was the data encrypted or just encoded/obfuscated?” all get asked but are quite technical details that make a big difference. Affected organisations have been known to assume data is encrypted when actually it is either obfuscated or scrambled in some trivial way, or that the keys for decryption are easily discernible from the same systems where the data is held.
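The difference between encoding and hashing, and the effect of a salt, can be shown with the Python standard library alone. This is only a sketch: the password and the iteration count are arbitrary illustrative values, and encryption is left out because it needs a third-party library.

```python
import base64
import hashlib
import os

password = b"correct horse"

# Encoding/obfuscation: anyone can reverse it, no key is involved.
encoded = base64.b64encode(password)
decoded = base64.b64decode(encoded)

# Unsalted hash: the same password always gives the same digest, so one
# precomputed lookup table can be used against every record in a stolen dump.
unsalted_a = hashlib.sha256(password).hexdigest()
unsalted_b = hashlib.sha256(password).hexdigest()


def salted_hash(pw: bytes, salt: bytes) -> bytes:
    """Salted, deliberately slow hash (PBKDF2); iteration count is illustrative."""
    return hashlib.pbkdf2_hmac("sha256", pw, salt, 100_000)


# Salted hash: a random per-record salt means identical passwords produce
# different digests, so each record has to be attacked separately.
salt_a, salt_b = os.urandom(16), os.urandom(16)
salted_a = salted_hash(password, salt_a)
salted_b = salted_hash(password, salt_b)

print(decoded == password, unsalted_a == unsalted_b, salted_a == salted_b)
```

So “the passwords were protected” can mean anything from the first case, which is no protection at all, to the last, which buys affected users time to change their credentials.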
When did the data breach occur?
When did the breach occur, when was it noticed and when did the organisation start to recognise and escalate it? The gaps between these three times are often embarrassingly long, and they are a factor that is likely to influence fines when the authority starts to impose them.
Even a delay of a few days makes an affected organisation look slow off the mark, and many breaches are found to have happened long before they were detected, and then to have gone undisclosed for longer still.
If the 72-hour window achieves anything, it will shorten the time that elapses between a breach being detected and the publication of the initial assessment. However, the breach still has to have been detected for the 72-hour countdown to start ticking, and whether you receive a fine (and the size of it) may be affected by how long it goes undetected (i.e. how diligent and comprehensive are your detection systems?).
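The arithmetic behind those timings is simple but worth making explicit. The sketch below assumes the clock starts at detection, as described above; the function name and the dates are invented for the example.

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)


def breach_timeline(occurred: datetime, detected: datetime, now: datetime) -> dict:
    """Summarise how long a breach went unnoticed and how much time is left."""
    deadline = detected + NOTIFICATION_WINDOW
    return {
        "undetected_for": detected - occurred,
        "notify_by": deadline,
        "time_remaining": deadline - now,  # negative once the deadline has passed
    }


# Hypothetical incident: breached in January, noticed in March.
occurred = datetime(2018, 1, 10, 3, 0, tzinfo=timezone.utc)
detected = datetime(2018, 3, 1, 9, 0, tzinfo=timezone.utc)
now = datetime(2018, 3, 2, 9, 0, tzinfo=timezone.utc)

timeline = breach_timeline(occurred, detected, now)
print(timeline["undetected_for"], timeline["notify_by"], timeline["time_remaining"])
```

Keeping all timestamps in UTC avoids ambiguity when logs come from systems in different time zones.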
One challenge is that if a breach took place some time ago but is only detected now, the nature of the data stolen becomes very hard to diagnose. The configuration of systems, the content of files or data sets and the time at which the queries or access occurred are all likely to be uncertain. As a result, the time of exposure, how long accounts have been at risk and the extent of any usage or fraud committed only become evident through much more detailed investigation – especially where the delay has been months or, in past cases, years.
It is a common statistic that the time for breach detection and resolution is often several weeks if not months; in the future climate we ought to see these figures reduce dramatically as organisations are forced to lift their game in system monitoring, security analytics and threat response. This means:
- Comprehensive collection of system, network, application, database and user activity data;
- Real-time correlation and behavioural detection of external AND internal abuses of access to data;
- Efficient ways to process alerts and diagnose the nature of an attack, breach, theft or loss;
- Suitable tool sets for investigation, diagnosis, understanding, decision making and resolution;
- A team that has sufficient skills, resources and technology to do this job effectively.
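As a very small illustration of behavioural detection, the sketch below compares each user's activity today with their own historical baseline. The user names, counts and threshold are invented, and a real security analytics platform would use far richer signals than a single daily count.

```python
from statistics import mean, pstdev

# Hypothetical daily counts of customer records read by each user.
history = {
    "alice": [120, 135, 110, 128, 131],
    "bob": [40, 38, 45, 41, 39],
}
today = {"alice": 129, "bob": 5200}


def unusual_users(history, today, threshold=3.0):
    """Flag users whose activity today is far above their own baseline."""
    flagged = []
    for user, count in today.items():
        past = history.get(user, [])
        if len(past) < 2:
            flagged.append(user)  # no baseline yet: worth a manual look
            continue
        baseline, spread = mean(past), pstdev(past)
        # Use a floor of 1.0 on the spread so very steady users are not
        # flagged for trivial changes.
        if count > baseline + threshold * max(spread, 1.0):
            flagged.append(user)
    return flagged


print(unusual_users(history, today))
```

Because the comparison is per user, it applies equally to an external attacker using stolen credentials and to an insider abusing legitimate access.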
As a final consideration here, the 72-hour window might be what the regulator expects, but public and consumer expectations are likely to outstrip this and demand notifications in a much shorter window before taking to social media to start the speculation and complaining.
Why was the data stolen by an attacker?
Understanding why a breach occurred also has a number of dimensions… you may need to understand why your systems, users or data were attacked, but also what motivated the attacker. This could be a desire to cause you (or your customers) embarrassment, there may be a social or political goal, it could be a part of a wider fraud or account takeover attack, or the aim may be to extort money and commit blackmail.
The likely impact of the breach is intrinsically linked to the rationale for it, which may therefore colour the notification and the incident response process, both in what you do and in the priorities you assign to particular steps.
In cases where the information pertains to characteristics of the individual (such as surname, gender, date of birth or medical history) and therefore cannot be changed, reissued or reset, then the motivation and outcomes are even more key to the severity assessment – a medical fact or personal characteristic that is exposed will stay exposed; hence the rationale for its theft or disclosure becomes pertinent.
Motivation also affects the level and type of publicity; a small-scale breach with a narrow target could be highly significant for the individual but not have wider ramifications (e.g. the accessing of photographs that a celebrity would rather remain hidden, or a limited small-scale fraud against an individual). However, the breaching and disclosure of a few million credit card numbers by an attacker seeking to make a point about lax web application security could attract vastly different publicity because of the completely different scale of impact on the wider population – even though no individual ends up materially affected in the long term and no one has to feel embarrassed in any way.
Where is the stolen/leaked data and the affected users/customers now?
Expecting technology to answer questions with “Where” in them is often difficult. In the commonest case the disparity between the real world and the IT world becomes very evident.
Let’s consider some examples:
Where is this attacking IP Address?
Sometimes you will know what you think is the origin of an attack in terms of network address and it seems obvious to show that as a real world location. GeoIP databases are better than they used to be, but still can be inaccurate (just ask the Vogelman family – https://splinternews.com/how-an-internet-mapping-glitch-turned-a-random-kansas-f-1793856052).
You might get a town, street or building – but you might not. Also if the building turns out to be a coffee shop or hotel with free wireless it might not be much help. The result might also be the system nearest to your attacker’s servers, not the true origin of the attack – or possibly the user at the location indicated just happens to have a compromised system themselves.
However, this is still worth doing – knowing the location, even roughly, can give you some clues as to the provenance of an attack, the likely severity, the motives or other useful clues. Make sure your analytics and monitoring systems can access and retrieve this kind of location information on demand when alerts are raised.
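A rough sketch of an on-demand location lookup is shown below. The table is a hypothetical stand-in for a real GeoIP database, the address ranges are reserved documentation ranges, and the places and accuracy figures are invented; the point is that the answer should carry its own uncertainty.

```python
import ipaddress

# Hypothetical entries standing in for a real GeoIP database.
GEO_TABLE = {
    ipaddress.ip_network("203.0.113.0/24"): {"country": "AU", "city": "Sydney", "accuracy_km": 50},
    ipaddress.ip_network("198.51.100.0/24"): {"country": "US", "city": None, "accuracy_km": 1000},
}
# Hypothetical internal address space for the organisation.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def locate(ip_text: str):
    """Return a best-effort location for an address, or None if unknown."""
    ip = ipaddress.ip_address(ip_text)
    if any(ip in net for net in INTERNAL_NETWORKS):
        return {"internal": True}
    for network, place in GEO_TABLE.items():
        if ip in network:
            return {"internal": False, **place}
    return None


for address in ("10.1.2.3", "203.0.113.77", "198.51.100.9", "192.0.2.1"):
    print(address, locate(address))
```

A result with no city and an accuracy of hundreds of kilometres should be read as a country-level guess, not as a building to send someone to.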
Where has the stolen data gone?
You’ll have the same problem with a target/destination IP address if the information was noticed flowing out of your network somewhere. But once it’s gone it could be anywhere, possibly in lots of places. If stolen data was posted on an Internet forum then it could have been downloaded multiple times and been accessed by numerous people with diverse motivations.
Any use of this data (for example credit card fraud or identity theft) could similarly happen anywhere once the data is lost and published. The eventual destination may be beyond your control, but knowing the egress route, from your internal platforms to whatever cloud or Internet-based storage location it ended up on, might help with attribution and also gives you some ideas of things to monitor for, or block, in future. Do you really need to allow flows of several GB of data to cloud storage folders? If not, then filter that traffic or alert on that kind of use.
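The kind of alert suggested above could look roughly like this. The flow records, host names and the 2 GiB threshold are all hypothetical; real flow data would come from network gateways or a proxy.

```python
GIB = 1024 ** 3

# Hypothetical outbound flow records for one day.
flows = [
    {"src": "10.0.4.17", "dest_host": "files.example-cloud.test", "bytes_out": 1 * GIB},
    {"src": "10.0.4.17", "dest_host": "files.example-cloud.test", "bytes_out": 3 * GIB},
    {"src": "10.0.9.2", "dest_host": "files.example-cloud.test", "bytes_out": 20 * 1024 ** 2},
    {"src": "10.0.9.2", "dest_host": "www.example.org", "bytes_out": 5 * GIB},
]

CLOUD_STORAGE_SUFFIXES = (".example-cloud.test",)  # hypothetical provider domain
THRESHOLD_BYTES = 2 * GIB


def large_cloud_uploads(flows, suffixes, threshold):
    """Total outbound bytes per (source, cloud host) and keep those over the threshold."""
    totals = {}
    for flow in flows:
        if flow["dest_host"].endswith(suffixes):
            key = (flow["src"], flow["dest_host"])
            totals[key] = totals.get(key, 0) + flow["bytes_out"]
    return {key: total for key, total in totals.items() if total >= threshold}


print(large_cloud_uploads(flows, CLOUD_STORAGE_SUFFIXES, THRESHOLD_BYTES))
```

Summing per source rather than checking each flow individually means an upload split into several smaller transfers is still caught.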
Where are the affected systems or users?
Attacked servers are often easy to locate as they’ll be in defined locations and on known network addresses (unless they are in the cloud in which case they could be anywhere, and everywhere).
User workstations are harder to pinpoint – one organisation was confident that upon discovering a breach of policy or a malware alert they could narrow down the origin to the correct floor of the building, but only as long as they knew about it within a few days. That was the best they could manage!
Tracking down mobile users, laptop users in the building and remote third-party users is difficult. Finding the places where they have sent information, uploaded or shared it is even harder.
Data loss prevention and network management tools may give some of the answers, but if you have an IP address on a wireless network that looks to be part of a compromised botnet or is causing internal network issues due to lateral movement of an attacker, then the challenge is real. Having a view of network activity from the network infrastructure or gateways, as well as systems themselves, will be the only way to triangulate the location and do something about the affected system or user.
At this point full integration of security oversight processes with the identity and system management tools will be essential to quarantine a system, disable a user account or sever access in such a way that data is safe and the culprit (or innocent victim) can be located.
How did the data breach happen?
How did this happen and how do we stop it happening again?
The best outcome from any breach has to be to learn from it – if for no other reason than that getting hit in the same way twice is pretty hard to defend. Salvaging the reputation of the business, restoring the faith of customers and placating the regulator or supervisory authority means showing a degree of remorse, accepting culpability and acknowledging mistakes as points to learn from.
A forward-looking statement of improvements and fixes to technology and process, based on the diagnosis of a breach, is now more vital to produce than ever. This means truly getting under the skin of what went wrong, why a user had the access they had, why data ended up stored where it could be reached, and which of the countermeasures, processes, technology or recovery capabilities that were missing this time will be in place to defend the organisation in future.
Again, clarity over the diagnosis and method of loss is as useful at this stage as it has been shown to be in all the cases above. There is one very common theme in all breaches that is a lesson that can be learned immediately, without needing a catastrophic data loss to base the conclusions on:
The impact of a breach is always greater the longer the problem goes undiscovered and unresolved.
In extreme cases there is the breach at Yahoo, which wiped $350m off the price Verizon paid to acquire the company and which, as recently as October 2017, was revised from affecting one third of its users to all of them (3 billion accounts).
At the other end of the scale is where you want to be – assuming a breach does affect you at some point. That is a place where you are able to head off a network flow of sensitive data as it happens to limit the damage, intercept a user just as they access systems with malevolent intent, or find the source of a rapidly spreading malware instance while it can still be hemmed in and contained.
Knowing how it happened and how to prevent a future incident is therefore a very important question to be able to answer confidently.
The data breach notification is not the end of the process
Filling in a form or composing a data breach notification to the relevant authority and working out what to tell affected data subjects and customers is difficult in many ways.
The volume, quality and completeness of information – the continuous and comprehensive monitoring of networks, systems, applications and users – is vital not just to detect a cyber security breach but also to understand, recover and learn from it.
Answering questions in the high-pressure situation of a cyber incident response is much, much easier if you have a way to find out the answers: the technological capability to gather data, query and interrogate it, and analytics tools to help you understand what happened and why.