TL;DR – The CVE dataset does not allow you to determine how many vulnerabilities were disclosed in 2017.
I’ll try to keep this fairly short and to the point, but who am I kidding? Every year for a decade or more, we see the same thing over and over: companies that do not track or aggregate vulnerabilities decide to do their own review and analysis of disclosures for the prior year. Invariably, most do it based on the publicly available CVE/NVD data, and they do it without understanding what the dataset really represents. I know, it seems simple on the surface, but the CVE dataset is not easily understood. Even if you understand the individual contents of the export, you may not understand how it was created, what shortcomings there are, what is missing, and what statistical traps you face in digesting the data. Just doing the basic parsing and automated ‘analysis’ of that data via your tool of choice (be it grep or something fancier) means very little unless you can disclaim and properly explain your results. Either way, follow along with the advice below before you publish your ‘vulnerability stats for 2017’ please!
So let’s start with the basics of CVE data analysis. Begin by grabbing the latest CVE dump, a gzipped CSV file, that represents MITRE’s CVE dataset. Note, this is different than the exports NVD offers and welcome to the first hurdle. While the base vulnerability data is 100% equivalent between the two, NVD does additional analysis and creates metadata that is useful to many organizations. NVD provides CVSS scoring and CPE data for example. The relationship between CVE and NVD is interesting if you observe it over time, where it used to be a clear ‘MITRE publishes, a day later NVD publishes’ relationship. For the last year or two, NVD will sometimes open up a CVE ID before MITRE does for various reasons. This also gave way to Bill Ladd observing and writing about how the Chinese National Vulnerability Database (CNNVD) is actually opening up CVE IDs faster than both NVD and MITRE. Consider that for a minute and understand that the relationship between these three entities is not straightforward. Then consider the relationship between many other entities in the bigger picture, and it gets even more convoluted.
See? You start by grabbing a data dump, a paragraph later you have the start of disclaimers and oddities as pertains to the larger CVE ecosystem. Next, decompress the CVE dump so you have a CSV file to work with. Now, before you eagerly start to parse this data, stop for a moment. Did you do this same analysis last year? If so, great! Do you understand what has changed in the last 18 months with regards to CVE and more specifically MITRE? If you can’t quickly and readily answer that question definitively, the kind of changes that are the first in almost 19 years for the program, reconsider if you should be commenting on this data. In case you missed it, Steve Ragan published an article about MITRE / CVE’s shortcomings in September of 2016. The article pointed out that MITRE was severely deficient in vulnerability coverage, as it has been for a decade. Unlike other articles, or my repeated blogs, Ragan’s article along with additional pressure from the industry prompted the House Energy and Commerce Committee to write a letter to MITRE asking for answers on March 30, 2017. When a certain board member brought it up on the CVE Board list, and directly told MITRE that their response should be made public, MITRE did not respond to that mail in a meaningful manner and ultimately never shared their response to Congress with the CVE Board. It is important for you to understand that MITRE operates CVE as they wish and that any notion of oversight or ‘Board’ input is only as it is convenient to them. The board has little to no real influence over many aspects of MITRE’s operation of CVE other than when they set an official vote on a given policy. Additionally, if you point out how such a vote that impacts the industry is not adopted by certain entities such as CNAs, many years down the road? They don’t want to hear about that either. It’s up to the CNAs to actually care, and fortunately some of them care very much. Oh, you know what a CNA is, and why they matter, right? Good!
OK, so you have your data dump… you now better understand the state of CVE and that it is so deficient that Congress is on MITRE’s case. Now, as experienced vulnerability professionals, you know what this means! The rubber-band effect, where MITRE responds quickly and disproportionately to Congress breathing down their neck, and their response impacts the entire CVE ecosystem… and not necessarily in a good way. So welcome to the second half of 2017! Because it took roughly a year for the Congressional oversight and subsequent fallout to strongly influence MITRE. What was their response? It certainly wasn’t to use their abundant taxpayer-funded money to directly improve their own processes. That isn’t how MITRE works as far as I have seen in my career. Instead, MITRE decided to use their resources to better create / enhance what they call a “federated” CNA system.
First, spend a minute looking at the ‘federated’ term in relation to CVE, then look at the use of that term in the recently edited CNA Rules. Notice how the use of ‘federated’ in their context appears to have grown exponentially? Now check the definition of ‘federated’ [dictionary.com, The Free Dictionary, Merriam Webster]. While sufficiently vague, there is a common theme among these definitions. In so many words, “enlist others to do the work for you”. That is, quite simply, what the CNA model is. That is how the CNA model was meant to work from day one, but this has become the saving grace and the crutch of MITRE as well as the broader CVE ecosystem in the last few months. On the surface this seems like a good plan, as more organizations and even independent researchers can do their own assignments. On the downside, if they don’t follow the CNA rules, assignments can get messy and not as helpful to organizations that rely on CVE data. One thing that you may conclude is that any increase in CVE assignments this year may be due, in part, to the increase of CNAs. Of course, it may be interesting to you that at least two of these CNAs have not made a single assignment, and not disclosed any vulnerabilities in prior years either. Curious why they would be tapped to become a CNA.
OK, so you have your data dump… you know of one potential reason that there may be an increase in vulnerabilities this year over last, but you also know that it doesn’t necessarily mean there were actually more disclosures. You only know that there are more CVE IDs being assigned than prior years. Next, you have to consider the simple numbers game when it comes to vulnerability statistics. All CVE IDs are created equal, right? Of course not. MITRE has rules for abstracting when it comes to disclosures. Certain criteria will mean a single ID can cover multiple distinct vulnerabilities, and other VDBs may do it differently. It is easy to argue the merit of both approaches, so I don’t believe one is necessarily right or wrong. Instead, different abstraction rules tend to help different types of users. That said, you will typically see MITRE assign a single CVE ID to a group of vulnerabilities where a) it is the same product and b) it is the same type of vulnerability (e.g. XSS). You can see an example in CVE-2017-16881, which covers XSS vulnerabilities in six different Java files. That is how they typically abstract. Search around for a couple minutes and you will find where they break from that abstraction rule. This may be due to the requesting party filling out separate requests and MITRE not adhering to their own rules, such as CVE-2017-15568, CVE-2017-15569, CVE-2017-15570, and CVE-2017-15571. Then you have to consider that while MITRE will largely assign a single ID to multiple scripts vulnerable to one class (e.g. CSRF, SQLi, XSS), their CNAs do not always follow these rules. You can see examples of this with IBM (CVE-2017-1632, CVE-2017-1549) and Cisco (CVE-2017-12356, CVE-2017-12358) who consistently assign in such a manner. If you think these are outliers that have minimal impact on the overall statistics you generate, reconsider that. 
In keeping with their abstraction policy, IBM issued two advisories [#1, #2] covering a total of nine CVE IDs for unspecified XSS issues. If MITRE had assigned per their usual abstraction rules, that would have been a single ID.
OK, so you have your data dump… and now you are aware that parsing that dump means very little. MITRE doesn’t follow their own abstraction rules and their CNAs largely follow different rules. So many hundreds, likely a thousand or more of the IDs you are about to parse, don’t mean the same thing when it comes to the number of distinct vulnerabilities. That is around 10% of the total public CVE IDs issued for 2017! OK, forgetting about that for a minute, now you need to consider what the first part of a CVE ID means. CVE-2017-1234 means what exactly? You might think that 2017 is the year the vulnerability was disclosed, and the 1234 is the unique identifier for that year. Perhaps. Or does 2017 mean the year the vulnerability was found and an ID requested? The answer is yes, to both, sometimes. This is another aspect where historically, MITRE made an effort to assign based on when the vulnerability was discovered and/or disclosed to a vendor, not when it was published. Under the old guard, that was an important aspect of CVE as that standard meant more reliable statistics. Under the new guard, basically in the last two years, that standard has disappeared. Not only do they assign a 2017 ID for a vulnerability discovered and disclosed to a vendor in 2016 but published in 2017, but they also assign a 2017 ID for a vulnerability discovered and disclosed in 2017. Worse? They are also now assigning 2017 IDs to issues discovered and disclosed in previous years. If you need examples, here are MITRE-assigned (as opposed to CNAs that do the same sometimes) 2017 CVE IDs for vulnerabilities disclosed prior to this year: 2016, 2015, 2014, 2013, 2011, 2010, 2008, 2004, and 2002. Notice the missing years? Some of the CNAs cover those gaps! Note that there are over 200 cases like this, and that is important when you start your stats.
And we won’t even get into the problem of duplicate CVE assignments that haven’t been rejected, like the first two assignments here (both are invalid assignments and that CNA should know better).
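If you want to see how much the year in the ID diverges from the year of disclosure, here is a minimal Python sketch. It assumes you have done your own research and built a mapping of CVE ID to actual disclosure year, because the CVE dump will not hand you that. The function names and the sample IDs and years are made up for illustration; they are not real findings.

```python
import re

# CVE IDs look like CVE-YYYY-NNNN, where the sequence part has four or more digits.
CVE_ID_RE = re.compile(r"^CVE-(\d{4})-(\d{4,})$")


def cve_id_year(cve_id):
    """Return the year embedded in a CVE ID, e.g. 2017 for CVE-2017-1234."""
    match = CVE_ID_RE.match(cve_id.strip())
    if match is None:
        raise ValueError(f"not a CVE ID: {cve_id!r}")
    return int(match.group(1))


def mismatched_years(disclosure_years):
    """Return the IDs whose embedded year differs from the researched disclosure year.

    disclosure_years is a hypothetical dict of CVE ID -> year of disclosure,
    built from your own research (it is not part of the CVE export).
    The result maps each mismatched ID to (id_year, disclosure_year).
    """
    result = {}
    for cve_id, disclosed in disclosure_years.items():
        id_year = cve_id_year(cve_id)
        if id_year != disclosed:
            result[cve_id] = (id_year, disclosed)
    return result


# Illustrative, made-up input: one ID that matches its year, one that does not.
sample = {"CVE-2017-1234": 2017, "CVE-2017-99999": 2015}
print(mismatched_years(sample))
```

Every ID that lands in that result is one you should not count as a 2017 disclosure, no matter what the ID says.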
OK, so you have your data dump… you’re ready! Let loose the scripts and analysis! While you do that, I’ll save you some time and math. As of December 24, 2017, there are 18,251 CVE identifiers. 7,436 of them are in RESERVED status, and 133 are REJECTed. As mentioned above, 238 of them have a 2017 ID but were actually disclosed prior to 2017. So a quick bit of math means 18,251 – 7,436 – 133 – 238 = 10,444 entries with 2017 CVE IDs that were disclosed in 2017. This is an important number that will be a bit larger if you parse with Jan 1, 2018 data. This should be your starting point when you look to compare aggregated disclosures, as captured by CVE, to prior years. Based on all of the above, you also now have a considerable list of disclaimers that must be included and explained along with whatever statistics you generate. Because MITRE also stopped using (1) consistent (2) formatting to (3) designate (4) distinct (5) vulnerabilities in a CVE ID, you have no way to parse this data to actually count how many vulnerabilities are present. Finally, know that Risk Based Security’s VulnDB tracked 7,815 distinct vulnerabilities in 2017 that do not have CVE coverage.
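For those who would rather script it than trust my math, the subtraction above can be sketched in Python roughly like this. It assumes each row from the dump has been reduced to an (ID, description) pair, and that reserved and rejected entries are flagged with "** RESERVED **" or "** REJECT **" at the start of the description; verify that against the export you actually downloaded. The function names and the disclosed_earlier set (your own list of IDs disclosed before the ID year) are hypothetical.

```python
from collections import Counter

# Assumption: non-public entries carry one of these markers at the start
# of the description field. Check your copy of the export before relying on it.
RESERVED_MARK = "** RESERVED **"
REJECT_MARK = "** REJECT **"


def classify(description):
    """Label one entry as 'reserved', 'rejected', or 'public'."""
    text = description.lstrip()
    if text.startswith(RESERVED_MARK):
        return "reserved"
    if text.startswith(REJECT_MARK):
        return "rejected"
    return "public"


def tally(rows, year, disclosed_earlier):
    """Count the entries for one ID year, split by status.

    rows is an iterable of (cve_id, description) pairs.
    disclosed_earlier is a set of IDs you know were disclosed before `year`.
    """
    counts = Counter()
    prefix = f"CVE-{year}-"
    for cve_id, description in rows:
        if not cve_id.startswith(prefix):
            continue
        counts["total"] += 1
        status = classify(description)
        if status == "public" and cve_id in disclosed_earlier:
            status = "earlier"
        counts[status] += 1
    return counts


def disclosed_in_year(total, reserved, rejected, earlier):
    """Total IDs minus everything that is not a disclosure from that year."""
    return total - reserved - rejected - earlier


# The December 24, 2017 figures from the paragraph above.
print(disclosed_in_year(18251, 7436, 133, 238))  # 10444
```

Note that this only counts IDs. It does nothing about the abstraction problem, so the result is still a count of identifiers and not a count of distinct vulnerabilities.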
Cliff notes? The CVE dataset does not allow you to determine how many vulnerabilities were disclosed in 2017. Hopefully this information helps with your article on vulnerability statistics!