Deduplicate CVEs #201
Reference: lix-community/nix-security-tracker#201
In our ingested data, many CVE numbers appear multiple times, apparently because they come from different sources.
For example, stage the ingested data before presenting it for triage, and deduplicate it automatically as far as possible:
Discussed with @RaitoBezarius and @erictapen:
This seems to be a problem with the suggestion engine.
This condition apparently does not hold:
The root cause is not duplicate proposals but duplicate ingested CVEs.
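To illustrate the staging idea, here is a minimal sketch of merging ingested records that share a CVE number before they reach triage. The record layout (`cve_id`, `source`), the placeholder CVE numbers, and the function name are assumptions made for illustration, not the tracker's actual schema.

```python
def deduplicate_by_cve_id(records):
    """Merge records that share a CVE ID, keeping first-seen order.

    Each input record is a dict with hypothetical keys "cve_id" and "source".
    """
    merged = {}
    for record in records:
        entry = merged.setdefault(
            record["cve_id"], {"cve_id": record["cve_id"], "sources": []}
        )
        # Remember every source that reported this CVE, without repeats.
        if record["source"] not in entry["sources"]:
            entry["sources"].append(record["source"])
    return list(merged.values())


# Placeholder data: the same CVE number arrives from two sources.
ingested = [
    {"cve_id": "CVE-0000-0001", "source": "source-a"},
    {"cve_id": "CVE-0000-0001", "source": "source-b"},
    {"cve_id": "CVE-0000-0002", "source": "source-a"},
]
staged = deduplicate_by_cve_id(ingested)
```

With this shape, triage would see one entry per CVE number while still knowing which sources reported it.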
I just found a weird thing which I think is related.
I am seeing duplicate proposals in the suggestions list.
Note how the suggestion_id is 864 in both cases. This directly represents the pk field in the ListView, showing CVEDerivationClusterProposals. I wonder how this is even possible, that we get duplicated entries with the same primary key? Which curiously even show different affected package names?
The code running is my branch remove-package-selection, though I doubt it's specific to my changes.
It can occur due to a JOIN that repeats some fields.
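The JOIN effect can be reproduced in isolation. The following is a sketch using Python's built-in sqlite3 module with a made-up two-table schema (`proposal`, `affected_package`); the table and column names are assumptions and do not reflect the tracker's real models. One proposal row joined against two related rows comes back twice, with the same primary key and different package names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE proposal (id INTEGER PRIMARY KEY);
    CREATE TABLE affected_package (
        id INTEGER PRIMARY KEY,
        proposal_id INTEGER REFERENCES proposal(id),
        name TEXT
    );
    INSERT INTO proposal (id) VALUES (864);
    INSERT INTO affected_package (proposal_id, name)
        VALUES (864, 'pkg-a'), (864, 'pkg-b');
    """
)

# Joining repeats the proposal row once per related package.
joined = conn.execute(
    "SELECT p.id, a.name FROM proposal p "
    "JOIN affected_package a ON a.proposal_id = p.id "
    "ORDER BY a.name"
).fetchall()

# Selecting only the proposal key with DISTINCT collapses the repeats.
distinct_ids = conn.execute(
    "SELECT DISTINCT p.id FROM proposal p "
    "JOIN affected_package a ON a.proposal_id = p.id"
).fetchall()
```

Here `joined` holds two rows for proposal 864 and `distinct_ids` holds one. In Django, the analogous step would presumably be calling `.distinct()` on the queryset, or fetching IDs first and loading related data separately.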
Did some twiddling on my local instance after nuking all suggestions.
queryset is what we want to operate on. queryset2 is what the view is currently doing.
It seems the simplest way to both solve this issue and keep the query time down despite a suggestion containing lots of information is caching the entire suggestion (including all related models, filtered down to what we need) by ID, already at creation time. Suggestions only change once we touch them, and even then we could cache everything except the moving parts (currently the list of linked derivations).
Then the view query only needs to fetch the item IDs, and the rest is taken directly from the cache. Opened #377
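A rough sketch of that caching idea, assuming a plain in-memory dict as the cache and invented field names (`cve_id`, `package_names`, `derivations`). Neither is taken from the actual code, and #377 may well look different.

```python
# Hypothetical in-memory cache keyed by suggestion ID; a real deployment
# would more likely use a cache backend or a database table.
suggestion_cache = {}


def cache_suggestion(suggestion_id, cve_id, package_names):
    """Snapshot the stable parts of a suggestion at creation time."""
    suggestion_cache[suggestion_id] = {
        "id": suggestion_id,
        "cve_id": cve_id,
        "package_names": tuple(package_names),
    }


def render_suggestions(suggestion_ids, linked_derivations):
    """Build list entries from the cache plus the parts that still change."""
    return [
        {
            **suggestion_cache[sid],
            # The moving part is looked up fresh instead of being cached.
            "derivations": list(linked_derivations.get(sid, [])),
        }
        for sid in suggestion_ids
    ]


cache_suggestion(864, "CVE-0000-0001", ["pkg-a", "pkg-b"])
page = render_suggestions([864], {864: ["drv-1"]})
```

Since the list view only asks for IDs, there is no JOIN left that could repeat a suggestion, and each ID maps to exactly one cached entry.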
@alejandrosame btw. we may want to do the same for activity logs: #366
Actually it doesn't really matter how we cache those aggregate data, it could just as well be in a helper table.
We discussed 2-3 weeks ago with @RaitoBezarius and @erictapen that we'd probably want the data to flow across intermediate tables in a way that can be recomputed if needed, but accessed quickly when it's available.
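As a sketch of what a recomputable helper table could look like, again with sqlite3 and an invented schema (`ingested_cve` as the source, `cve_summary` as the derived table; both names are assumptions): the helper table is dropped and rebuilt from the source rows in one transaction, so it can be thrown away and regenerated at any time, while reads stay cheap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ingested_cve (cve_id TEXT, source TEXT);
    CREATE TABLE cve_summary (cve_id TEXT PRIMARY KEY, source_count INTEGER);
    INSERT INTO ingested_cve VALUES
        ('CVE-0000-0001', 'source-a'),
        ('CVE-0000-0001', 'source-b'),
        ('CVE-0000-0002', 'source-a');
    """
)


def recompute_summary(connection):
    """Rebuild the helper table from the source table in one transaction."""
    with connection:  # commits on success, rolls back on error
        connection.execute("DELETE FROM cve_summary")
        connection.execute(
            "INSERT INTO cve_summary (cve_id, source_count) "
            "SELECT cve_id, COUNT(*) FROM ingested_cve GROUP BY cve_id"
        )


recompute_summary(conn)
summary = conn.execute(
    "SELECT cve_id, source_count FROM cve_summary ORDER BY cve_id"
).fetchall()
```

Running `recompute_summary` again after new data is ingested yields the same kind of result, which is the property we'd want from intermediate tables.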
I think this is resolved now with #378, but I need to try it out.
This seems fixed.