CVE fetchers should work using a bulk saving context or perform bulk_create #16

Open
opened 2023-11-12 18:39:53 +00:00 by RaitoBezarius · 0 comments
RaitoBezarius commented 2023-11-12 18:39:53 +00:00 (Migrated from github.com)

Currently, ingesting the initial 230K CVEs takes around 25 minutes on a very fast CPU (130-200 CVE/s).

In practice, SQLite can do much more than that (96K inserts/s is easily achievable).

Because we have many-to-many (M2M) relations all over the place, this is not trivial with the "nice" ORM API.

Two solutions:

Using Django vanilla API

The classical way to solve this is to use the through model of each M2M field (Model.field.through), which is an automatically generated table composed of id, $containing_id, $contained_id fields and can be used to bulk create the M2M rows.

Therefore, all the fetcher code should be reworked to always take lists (a single item is just [x]) and return a list of models (not yet saved!), which are then all bulk created at the call site.

The topological sort has to be done manually. Usually, we:

  • create isolated elements with empty M2M
  • create the M2M relations

Using a bulk saver context

We can also remove all references to save and use a custom API à la https://gist.github.com/crucialfelix/7fa53265ed11e6761531f1b2e0d1f36a to coalesce whatever operations we need.

It's unclear whether this would improve performance as-is.
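To make the idea concrete, here is a rough sketch of what such a context could look like. It is not the API of the linked gist: `BulkSaver`, `bulk_saver`, and the `flush` callable are hypothetical names. The callable is injected so the sketch runs without a database; with Django it could be something like `lambda cls, objs: cls.objects.bulk_create(objs)`.

```python
from collections import defaultdict
from contextlib import contextmanager


class BulkSaver:
    """Queue unsaved objects per class and flush them in batches (hypothetical helper)."""

    def __init__(self, flush, batch_size=1000):
        self._flush = flush            # callable(cls, list_of_objects)
        self._batch_size = batch_size
        self._pending = defaultdict(list)

    def add(self, obj):
        """Queue obj instead of saving it; flush its class once the batch is full."""
        queue = self._pending[type(obj)]
        queue.append(obj)
        if len(queue) >= self._batch_size:
            self.flush(type(obj))

    def flush(self, cls=None):
        """Flush one class, or every pending class in first-queued order."""
        classes = [cls] if cls is not None else list(self._pending)
        for c in classes:
            queue = self._pending.pop(c, [])
            if queue:
                self._flush(c, queue)


@contextmanager
def bulk_saver(flush, batch_size=1000):
    saver = BulkSaver(flush, batch_size)
    yield saver
    saver.flush()  # reached only if the block exits without an exception


# Usage with a stand-in flush function that just records each batch.
calls = []
with bulk_saver(lambda cls, objs: calls.append((cls.__name__, len(objs))), batch_size=2) as saver:
    for item in ["a", "b", "c"]:
        saver.add(item)
    saver.add(1)
```

This sketch does not order flushes by dependency: objects that point at not-yet-saved rows would still need their targets flushed first, so the manual topological sort does not go away.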
