Deduplicate CVEs

fricklerhandwerk commented

2024-09-26 16:50:29 +00:00

(Migrated from github.com)

In our ingested data, many CVE numbers appear multiple times, apparently because they come from different sources.

For example, we could stage the ingested data before presenting it for triage, and deduplicate it automatically as far as possible:

- Some of the data is redundant but not identical, e.g. differing only in capitalisation; pick one variant consistently
- Some data fields are filled in one variant but not the other; merge
- Some data fields conflict; present for manual resolution
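The three cases above can be sketched in plain Python. This is only an illustration under assumptions: the record layout (dicts with a `cve_id` key and field names such as `title`, `severity`, `package`) and the `merge_records` helper are hypothetical and do not reflect the project's actual models.

```python
from collections import defaultdict


def merge_records(records):
    """Merge raw records that share a CVE ID (hypothetical record layout).

    Returns (merged, conflicts):
    merged maps CVE ID -> combined fields,
    conflicts maps CVE ID -> {field: sorted list of competing values}.
    """
    grouped = defaultdict(list)
    for record in records:
        # Treat CVE IDs as case-insensitive; upper case is the chosen variant.
        grouped[record["cve_id"].strip().upper()].append(record)

    merged, conflicts = {}, {}
    for cve_id, group in grouped.items():
        result = {"cve_id": cve_id}
        fields = sorted({key for record in group for key in record} - {"cve_id"})
        for field in fields:
            # Collect non-empty values. Variants that differ only in
            # capitalisation collapse to the first one seen.
            variants = {}
            for record in group:
                value = record.get(field)
                if value in (None, ""):
                    continue
                variants.setdefault(str(value).casefold(), value)
            if len(variants) == 1:
                # Filled in one variant only, or identical up to case: merge.
                result[field] = next(iter(variants.values()))
            elif len(variants) > 1:
                # Genuinely different values: leave for manual resolution.
                conflicts.setdefault(cve_id, {})[field] = sorted(
                    variants.values(), key=str
                )
        merged[cve_id] = result
    return merged, conflicts
```

Fields that end up in `conflicts` are deliberately left out of the merged record, so nothing is picked silently on behalf of the person doing triage.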


fricklerhandwerk commented

2024-10-30 11:05:30 +00:00

(Migrated from github.com)

Discussed with @RaitoBezarius and @erictapen:

- This issue is blocking UI development for #203
- Also relevant for #288
- Raised priority to "high"
- The main problem is that we don't want duplicate CVE IDs in the suggestions view
- We'll move towards a unidirectional processing pipeline, where merged CVEs are like a cache that can be re-created by a full bulk run
- Merged CVEs should refer to the raw data
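A minimal sketch of the last two points, assuming nothing about the real schema: `RawRecord`, `MergedCVE` and `rebuild_merged` are hypothetical names. The merged entries are derived purely from the raw records and keep back-references to them, so a full bulk run can discard and recreate them at any time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawRecord:
    raw_id: int  # hypothetical primary key of an ingested row
    cve_id: str
    source: str


@dataclass
class MergedCVE:
    cve_id: str
    raw_ids: list = field(default_factory=list)  # back-references to raw data


def rebuild_merged(raw_records):
    """Recreate all merged entries from scratch; raw data is never modified."""
    merged = {}
    for record in sorted(raw_records, key=lambda r: r.raw_id):
        entry = merged.setdefault(record.cve_id, MergedCVE(record.cve_id))
        entry.raw_ids.append(record.raw_id)
    return merged
```

Because the function only reads its input, running it twice on the same raw data yields equal results, which is the property that makes the merged data behave like a cache.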


fricklerhandwerk commented

2024-10-31 10:17:56 +00:00

(Migrated from github.com)

This seems to be a problem with the suggestion engine.

This condition apparently does not hold:

```python
if CVEDerivationClusterProposal.objects.filter(cve=container.cve).exists():
    logger.warning(
        "Proposals already exist for '%s', skipping linkage.", container.cve
    )
    return
```

RaitoBezarius commented

2024-11-14 12:55:15 +00:00

(Migrated from github.com)

The root cause is not duplicate proposals but duplicate ingested CVEs.

👀 1

erictapen commented

2024-11-15 11:43:26 +00:00

(Migrated from github.com)

I just found a weird thing which I think is related.

I *am* seeing duplicate proposals in the suggestions list.

![tmp 0vtBs2tcYd](https://github.com/user-attachments/assets/dcb2c20e-eb26-496e-90df-d753b5f07559)

Note how the `suggestion_id` is `864` in both cases. This directly represents the `pk` field in the ListView showing `CVEDerivationClusterProposal`s. I wonder how it is even possible that we get duplicate entries with the same primary key, which curiously even show different affected package names?

The code running is my branch `remove-package-selection`, though I doubt it's specific to my changes.

RaitoBezarius commented

2024-11-15 11:52:49 +00:00

(Migrated from github.com)

It can occur due to a JOIN that repeats some fields.
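To illustrate the mechanism, here is a small self-contained sketch using Python's built-in `sqlite3` module. The two tables and their columns are made up for the demonstration and are not the project's schema; only the ID `864` is borrowed from the report above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE proposal (id INTEGER PRIMARY KEY, cve TEXT);
    CREATE TABLE affected (
        id INTEGER PRIMARY KEY,
        proposal_id INTEGER,
        package_name TEXT
    );
    INSERT INTO proposal VALUES (864, 'CVE-2024-0001');
    INSERT INTO affected VALUES (1, 864, 'foo'), (2, 864, 'bar');
    """
)

# Without the join there is exactly one row per primary key.
plain = conn.execute("SELECT id FROM proposal").fetchall()

# Joining a one-to-many relation yields one row per related entry,
# so the same primary key shows up with different package names.
joined = conn.execute(
    """
    SELECT proposal.id, affected.package_name
    FROM proposal
    LEFT JOIN affected ON affected.proposal_id = proposal.id
    ORDER BY affected.id
    """
).fetchall()

print(plain)
print(joined)
```

The proposal row itself is not duplicated in the table; the join merely repeats it once per matching row on the other side.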

fricklerhandwerk commented

2024-11-20 09:29:59 +00:00

(Migrated from github.com)

Did some twiddling on my local instance after nuking all suggestions.

`queryset` is what we want to operate on.
`queryset2` is what the view is currently doing.

```
In [28]: queryset = (
    ...:     CVEDerivationClusterProposal.objects.select_related("cve")
    ...:     .filter(status=CVEDerivationClusterProposal.Status.PENDING)
    ...:     .prefetch_related("derivations", "derivations__parent_evaluation")
    ...: )

In [29]: queryset2 = queryset.annotate(
    ...:     package_name=F("cve__container__affected__package_name"),
    ...:     base_severity=Coalesce(
    ...:         F("cve__container__metrics__base_severity"), Value(Severity.NONE)
    ...:     ),
    ...:     title=F("cve__container__title"),
    ...:     description=F("cve__container__descriptions__value"),
    ...: )

In [30]: queryset.count()
Out[30]: 8

In [31]: queryset2.count()
Out[31]: 66881
```

fricklerhandwerk commented

2024-11-20 11:33:49 +00:00

(Migrated from github.com)

The simplest way to both solve this issue and keep query time down, even though a suggestion contains a lot of information, seems to be caching the entire suggestion (including all related models, filtered down to what we need) by ID at creation time. Suggestions only change once we touch them, and even then we could cache everything except the moving parts (currently the list of linked derivations).

Then the view query only needs to fetch the item IDs, and the rest is taken directly from the cache. Opened #377

@alejandrosame btw. we may want to do the same for activity logs: #366


fricklerhandwerk commented

2024-11-20 12:07:05 +00:00

(Migrated from github.com)

Actually it doesn't really matter *how* we cache those aggregate data; it could just as well be in a helper table.

We discussed 2-3 weeks ago with @RaitoBezarius and @erictapen that we'd probably want the data to flow across intermediate tables in a way that can be recomputed if needed, but accessed quickly when it's available.

fricklerhandwerk commented

2024-11-25 11:53:53 +00:00

(Migrated from github.com)

I think this is resolved now with #378, but I need to try it out.

fricklerhandwerk commented

2024-11-28 16:31:54 +00:00

(Migrated from github.com)

This seems fixed.


Deduplicate CVEs #201