# Analysis of Vulnerability Data - BoF @GSVS
## Moderator
Brandon Lum
## Attendees
Shripad
Emily
Brandon
Josh Buker
Allan Friedman
Art Manion
Paul Scarrone
Jonathan Leitschuh
CRob
Jamie Magee
GH folks
Christopher Turner
## Discussion Areas
* Completeness of the data
* Usefulness (missing content)
* Any data science needs
## Summary
_a tl;dr of the discussion_
## Notes
What are we defining as vulnerability data? The CVSS score, packages, versions, how you fix it, the CWE; what all do you see as data around vulns?
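As a rough sketch of those data points bundled into one record (the field names here are illustrative for discussion, not any standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class VulnRecord:
    """Illustrative vulnerability record; field names are hypothetical."""
    id: str                                            # e.g. a CVE or OSV identifier
    cvss_score: float                                  # severity score
    cwe: str                                           # weakness class, e.g. "CWE-79"
    packages: list[str] = field(default_factory=list)  # affected packages
    versions: list[str] = field(default_factory=list)  # affected versions
    fix: str = ""                                      # how you fix it (patched version, workaround)
```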
We have a lot of derived data points that can be used for lots of analysis. How did those derived data points come to be, and how accurate are they? Many sources are black boxes that don't tell you.
What are some of the sources of bad data? Is it because CERTs are not doing their job, or are there biases coming from how the data is generated?
Are there systemic biases? Why don't we have all the fields that we want?
Are there biases, or are the tools insufficient to accept the data?
Everyone is driven to do this for different reasons, and everyone wants to jump to results.
Emily's question: What fields would allow us to better analyse the information presented? What kinds of things do we want out of the data?
At minimum: what is affected and which versions are affected, and both must be expressed in a machine-readable way. If we can tie vulnerability management to open source development workflows, and every vulnerability had automation to get the range from when the vuln was introduced to when it was fixed, there are a lot of ways to build automation on top of that.
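The OSV schema (https://ossf.github.io/osv-schema/) is one existing way to express "what is affected and which versions" machine-readably, as introduced/fixed events; a minimal sketch with invented values:

```python
# Minimal OSV-style "affected" entry; package name and versions are made up.
osv_affected = {
    "package": {"ecosystem": "PyPI", "name": "example-lib"},
    "ranges": [
        {
            "type": "SEMVER",
            "events": [
                {"introduced": "1.2.0"},  # version that introduced the vuln
                {"fixed": "1.4.1"},       # first version containing the fix
            ],
        }
    ],
}
```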
Idea: match the commit(s) in open source to the vuln to tie it to a clean fix.
If we run it on an SBOM, we will get a lot of information. VEX helps in this way; having that data would make it possible to automatically create these VEX-type documents.
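A hedged sketch of what such an automatically generated statement might look like, roughly following the OpenVEX shape (identifiers are invented; see https://openvex.dev for the actual spec):

```python
vex_doc = {
    "@context": "https://openvex.dev/ns/v0.2.0",
    "author": "Example Maintainers",  # hypothetical
    "statements": [
        {
            "vulnerability": {"name": "CVE-2023-12345"},          # invented ID
            "products": [{"@id": "pkg:pypi/example-lib@1.3.0"}],  # invented purl
            # one of: affected, not_affected, fixed, under_investigation
            "status": "affected",
        }
    ],
}
```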
In most cases, a CVE fix is not a single commit but multiple commits.
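As a sketch, listing every commit between the introducing and fixing points is one way to capture a multi-commit fix (the tags are invented; assumes a local clone of the repo):

```python
import subprocess

# Commits between the version that introduced the vuln and the fixed version.
log = subprocess.run(
    ["git", "log", "--oneline", "v1.2.0..v1.4.1"],  # invented tags
    capture_output=True, text=True, check=True,
)
print(log.stdout)  # a CVE fix often spans several of these commits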
Many reporters have no incentive to give us good data. This needs to change; how do we change it?
Thoughts on what matters in the data:
- Am I affected?
- How bad is it?
- How can I fix it?
- What kind of vuln is it, and is it easy to understand?
Adding financial incentives around good data: paying bounties for writing up good reports that include x, y, z information, to provide a monetary incentive.
People don't like the answer that a CVE exists but there is no patched version.
A lot of companies create new versions that still have the vuln, i.e. they deprecate the method but don't fix the vuln.
Part of it is incentives; another part is making the process easier. Making it easy is itself an incentive to submit data.
A possibly fundamental issue is creating a format, and it boils down to similar things.
CVE description fields: what is the machine-readable version of "local privilege escalation"? That language problem may not be totally solved.
Human experts can sort of understand it, but it is hard to encode. Thus, slow progress.
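One partly machine-readable encoding that already exists is the CVSS vector string, where "local privilege escalation" surfaces as AV:L plus high impact metrics, though it captures severity rather than the description itself. A small parsing sketch (the vector is illustrative, not taken from a real CVE):

```python
def parse_cvss_vector(vector: str) -> dict[str, str]:
    """Split a CVSS v3.x vector string into metric/value pairs."""
    parts = vector.split("/")
    return dict(p.split(":", 1) for p in parts[1:])  # parts[0] is "CVSS:3.1"

metrics = parse_cvss_vector("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H")
assert metrics["AV"] == "L"  # local attack vector
```

Even so, the vector encodes impact, not language; the description problem remains.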
With all the information: if we fill the gaps, what can we do with it?
Machines need to understand it.
There is a general lack of knowledge around this kind of information among those who want to give it. Maintainers in my community are well-intentioned and supportive, but don't know who to contact and what information to give. (GitHub)
Can we add call graph info into the IDs? Is this the right thing to do?
The data is encodable, but I'm not sure about the utility.
Data has to be very certain in order to be useful, and getting that certainty is very hard.
Can we express certainty to a given degree of confidence?
Getting the call graph accurate is difficult, but having it even 50% accurate would be helpful as well.
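As a sketch of what even a rough, name-based reachability check could look like, using Python's standard ast module (the vulnerable function name is hypothetical, and this deliberately ignores aliasing, methods, and dynamic calls, which is exactly where the uncertainty lives):

```python
import ast

def called_names(source: str) -> set[str]:
    """Collect names of functions called directly in a module (very rough)."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            names.add(node.func.id)
    return names

app_code = """
def handler(req):
    return vulnerable_parse(req.body)
"""
print("vulnerable_parse" in called_names(app_code))  # True -> possibly affected
```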
Sounds like templates of expected content are needed that can be easily converted to machine-readable formats. Perhaps doing this on the command line with a utility? (Just a note, thinking out loud.)
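Continuing to think out loud, a toy command-line utility along those lines might take the template fields as flags and emit machine-readable JSON (entirely hypothetical scaffolding, not an existing tool):

```python
import argparse
import json

parser = argparse.ArgumentParser(description="Emit a minimal machine-readable vuln report")
parser.add_argument("--id", required=True, help="vulnerability identifier")
parser.add_argument("--package", required=True, help="affected package")
parser.add_argument("--introduced", required=True, help="version that introduced the vuln")
parser.add_argument("--fixed", help="first fixed version, if known")
args = parser.parse_args()

print(json.dumps(vars(args), indent=2))
```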
Vuln: is it a vuln, what's fixed, what's affected?
Then the next level is which function is affected. But we can't even get the basic information right; better descriptions are good.