Deduplicate CVEs #201

Closed
opened 2024-09-26 16:50:29 +00:00 by fricklerhandwerk · 10 comments
fricklerhandwerk commented 2024-09-26 16:50:29 +00:00 (Migrated from github.com)

In our ingested data, many CVE numbers appear multiple times since they seem to come from different sources.

One possible approach: stage the ingested data before presenting it for triage, and deduplicate it automatically as far as possible:

  • Some of the data is redundant but not identical, e.g. differing only in capitalisation; pick one variant consistently
  • Some data fields are filled in one variant but not the other; merge
  • Some data fields conflict; present for manual resolution
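
A minimal sketch of the three rules above, assuming each ingested variant is available as a flat dictionary. The function names and the sample field names are hypothetical and not taken from the tracker's code base; "pick one variant consistently" is implemented here as "keep the first argument's spelling", which is only consistent if sources are always merged in a fixed order.

```python
from typing import Any


def normalise(value: Any) -> Any:
    """Fold case and surrounding whitespace for strings; leave other values alone."""
    if isinstance(value, str):
        return value.strip().casefold()
    return value


def merge_records(
    a: dict[str, Any], b: dict[str, Any]
) -> tuple[dict[str, Any], dict[str, tuple[Any, Any]]]:
    """Merge two variants of the same CVE record.

    Returns (merged, conflicts). Conflicting fields are left out of `merged`
    and reported in `conflicts` so that a human can resolve them.
    """
    merged: dict[str, Any] = {}
    conflicts: dict[str, tuple[Any, Any]] = {}
    for key in sorted(a.keys() | b.keys()):
        left, right = a.get(key), b.get(key)
        if left in (None, ""):
            # Field only filled in one variant (or in neither): take the other side.
            merged[key] = right
        elif right in (None, ""):
            merged[key] = left
        elif normalise(left) == normalise(right):
            # Redundant but not identical: keep the first variant's spelling.
            merged[key] = left
        else:
            # Genuine disagreement: present for manual resolution.
            conflicts[key] = (left, right)
    return merged, conflicts


# Made-up sample data, for illustration only.
variant_a = {"cve_id": "CVE-2024-0001", "title": "Buffer Overflow", "vendor": "", "severity": "HIGH"}
variant_b = {"cve_id": "cve-2024-0001", "title": "buffer overflow", "vendor": "example", "severity": "LOW"}
merged, conflicts = merge_records(variant_a, variant_b)
```

Here `merged` keeps the first spelling of `cve_id` and `title` and takes `vendor` from the second variant, while `severity` ends up in `conflicts`.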
fricklerhandwerk commented 2024-10-30 11:05:30 +00:00 (Migrated from github.com)

Discussed with @RaitoBezarius and @erictapen:

  • This issue is blocking UI development for #203
  • Also relevant for #288
  • Raised priority to "high"
  • The main problem is that we don't want duplicate CVE IDs in the suggestions view
  • We'll move towards a unidirectional processing pipeline, where merged CVEs are like a cache that can be re-created by a full bulk run
  • Merged CVEs should refer to the raw data
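
To illustrate the last two points, here is a rough sketch of a one-way step from raw records to merged CVEs. It assumes raw records carry a numeric ID and a CVE ID; `RawRecord`, `MergedCVE`, and `rebuild_merged` are hypothetical names, not the project's models. The merged entries hold only references to raw data, so they can be thrown away and recomputed at any time.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawRecord:
    """One ingested record, kept untouched (hypothetical shape)."""
    raw_id: int
    cve_id: str
    source: str


@dataclass
class MergedCVE:
    """Derived entry: one per CVE ID, pointing back at its raw records."""
    cve_id: str
    raw_ids: list[int] = field(default_factory=list)


def rebuild_merged(raw_records: list[RawRecord]) -> dict[str, MergedCVE]:
    """Full bulk run: derive all merged CVEs from raw data only."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in raw_records:
        # Group case-insensitively so 'cve-…' and 'CVE-…' end up together.
        groups[record.cve_id.upper()].append(record.raw_id)
    return {cve_id: MergedCVE(cve_id, sorted(ids)) for cve_id, ids in groups.items()}


# Made-up sample data, for illustration only.
raw = [
    RawRecord(1, "CVE-2024-0001", "source-a"),
    RawRecord(2, "cve-2024-0001", "source-b"),
    RawRecord(3, "CVE-2024-0002", "source-a"),
]
merged_cves = rebuild_merged(raw)
```

Because `rebuild_merged` reads nothing but the raw records, running it again yields the same result, which is what makes the merged table behave like a cache.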
fricklerhandwerk commented 2024-10-31 10:17:56 +00:00 (Migrated from github.com)

This seems to be a problem with the suggestion engine.

This condition apparently does not hold:

    if CVEDerivationClusterProposal.objects.filter(cve=container.cve).exists():
        logger.warning(
            "Proposals already exist for '%s', skipping linkage.", container.cve
        )
        return
RaitoBezarius commented 2024-11-14 12:55:15 +00:00 (Migrated from github.com)

The root cause is not duplicate proposals but duplicate ingested CVEs.

erictapen commented 2024-11-15 11:43:26 +00:00 (Migrated from github.com)

I just found a weird thing which I think is related.

I am seeing duplicate proposals in the suggestions list.

[Screenshot: suggestions list showing two entries with the same suggestion_id]

Note how the suggestion_id is 864 in both cases. This directly corresponds to the pk field of the CVEDerivationClusterProposal objects shown in the ListView. I wonder how it is even possible to get duplicate entries with the same primary key, which curiously even show different affected package names.

The code running is my branch remove-package-selection, though I doubt it's specific to my changes.

RaitoBezarius commented 2024-11-15 11:52:49 +00:00 (Migrated from github.com)

It can occur due to a JOIN that repeats some fields.
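
A small self-contained demonstration of that effect, using the standard library's `sqlite3` and a made-up two-table schema (the table and column names are placeholders, not the project's actual schema). One proposal row joined against two affected-package rows comes back twice, with the same primary key and different package names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE proposal (id INTEGER PRIMARY KEY, cve_id TEXT);
    CREATE TABLE affected (cve_id TEXT, package_name TEXT);
    -- One proposal, but two affected packages for the same CVE (placeholder values).
    INSERT INTO proposal VALUES (864, 'CVE-0000-0000');
    INSERT INTO affected VALUES ('CVE-0000-0000', 'foo');
    INSERT INTO affected VALUES ('CVE-0000-0000', 'bar');
    """
)

# The join emits one output row per matching affected row,
# so the proposal's primary key is repeated.
rows = conn.execute(
    """
    SELECT proposal.id, affected.package_name
    FROM proposal
    JOIN affected ON affected.cve_id = proposal.cve_id
    ORDER BY affected.package_name
    """
).fetchall()

for proposal_id, package_name in rows:
    print(proposal_id, package_name)
```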

fricklerhandwerk commented 2024-11-20 09:29:59 +00:00 (Migrated from github.com)

Did some twiddling on my local instance after nuking all suggestions.

queryset is what we want to operate on.
queryset2 is what the view is currently doing.

In [28]: queryset = (
    ...:     CVEDerivationClusterProposal.objects.select_related("cve")
    ...:     .filter(status=CVEDerivationClusterProposal.Status.PENDING)
    ...:     .prefetch_related("derivations", "derivations__parent_evaluation")
    ...: )

In [29]: queryset2 = queryset.annotate(
    ...:     package_name=F("cve__container__affected__package_name"),
    ...:     base_severity=Coalesce(
    ...:         F("cve__container__metrics__base_severity"), Value(Severity.NONE)
    ...:     ),
    ...:     title=F("cve__container__title"),
    ...:     description=F("cve__container__descriptions__value"),
    ...: )

In [30]: queryset.count()
Out[30]: 8

In [31]: queryset2.count()
Out[31]: 66881
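
A back-of-the-envelope sketch of why the annotated query can grow so much: each to-many relation traversed in an annotation multiplies the rows for a proposal. The helper below is hypothetical and assumes left joins, so an empty relation still contributes one row; the fan-out numbers are invented and are not an attempt to reproduce the 66881 above.

```python
from math import prod


def joined_row_count(fan_outs_per_proposal: list[tuple[int, ...]]) -> int:
    """Estimate result rows when every to-many relation is joined at once.

    Each tuple holds, for one proposal, the number of related rows per
    joined relation (e.g. affected packages, metrics, descriptions).
    """
    return sum(
        # With a left join, zero related rows still produce one output row.
        prod(max(count, 1) for count in fan_outs)
        for fan_outs in fan_outs_per_proposal
    )


# Invented example: 8 proposals, each with 3 affected packages,
# 2 metrics entries and 4 descriptions.
estimate = joined_row_count([(3, 2, 4)] * 8)
```

With these made-up numbers, 8 proposals already turn into 8 × 3 × 2 × 4 rows; duplicate ingested containers for the same CVE would multiply this further.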
fricklerhandwerk commented 2024-11-20 11:33:49 +00:00 (Migrated from github.com)

It seems the simplest way to both solve this issue and keep the query time down, even though a suggestion contains lots of information, is to cache the entire suggestion (including all related models, filtered down to what we need) by ID, already at creation time. Suggestions only change once we touch them, and even then we could cache everything except the moving parts (currently the list of linked derivations).

Then the view query only needs to fetch the item IDs, and the rest is taken directly from the cache. Opened #377

@alejandrosame btw. we may want to do the same for activity logs: #366
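
The caching idea above could look roughly like this. It is a sketch under assumptions: `SuggestionCache`, `build_payload`, and `render_list` are hypothetical names, the cache is an in-memory dictionary rather than whatever storage is eventually chosen, and invalidation of the moving parts is left out.

```python
from typing import Any, Callable


class SuggestionCache:
    """Stores the fully assembled suggestion payload, keyed by suggestion ID."""

    def __init__(self, build: Callable[[int], dict[str, Any]]) -> None:
        self._build = build  # expensive function that gathers all related data
        self._entries: dict[int, dict[str, Any]] = {}

    def store(self, suggestion_id: int) -> None:
        """Call once at creation time (and again when the suggestion is touched)."""
        self._entries[suggestion_id] = self._build(suggestion_id)

    def get(self, suggestion_id: int) -> dict[str, Any]:
        """Return the cached payload, building it only on a cache miss."""
        if suggestion_id not in self._entries:
            self.store(suggestion_id)
        return self._entries[suggestion_id]


def render_list(suggestion_ids: list[int], cache: SuggestionCache) -> list[dict[str, Any]]:
    """The view only needs the IDs; everything else comes from the cache."""
    return [cache.get(suggestion_id) for suggestion_id in suggestion_ids]


# Stand-in for the expensive multi-join lookup; counts how often it runs.
build_calls: list[int] = []


def build_payload(suggestion_id: int) -> dict[str, Any]:
    build_calls.append(suggestion_id)
    return {"id": suggestion_id, "title": f"suggestion {suggestion_id}"}


cache = SuggestionCache(build_payload)
cache.store(1)  # at creation time
first = render_list([1, 2], cache)
second = render_list([1, 2], cache)  # served entirely from the cache
```

Since each ID maps to exactly one payload, the list can no longer contain the same suggestion twice, regardless of how many related rows exist.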

fricklerhandwerk commented 2024-11-20 12:07:05 +00:00 (Migrated from github.com)

Actually it doesn't really matter how we cache that aggregate data; it could just as well be in a helper table.

We discussed 2-3 weeks ago with @RaitoBezarius and @erictapen that we'd probably want the data to flow across intermediate tables in a way that can be recomputed if needed, but accessed quickly when it's available.

fricklerhandwerk commented 2024-11-25 11:53:53 +00:00 (Migrated from github.com)

I think this is resolved now with #378, but I need to try it out.

fricklerhandwerk commented 2024-11-28 16:31:54 +00:00 (Migrated from github.com)

This seems fixed.
