we now keep not a table of all positions, but a table of all origins and
their sizes. position indices are now direct pointers into the virtual
concatenation of all parsed contents. this slightly reduces memory usage
and time spent in the parser, at the cost of not being able to report
positions if the total input size exceeds 4GiB. this limit is not unique
to nix though, rustc and clang also limit their input to 4GiB (although
at least clang refuses to process inputs that are larger, we will not).
this new 4GiB limit probably will not cause any problems for quite a
while, all of nixpkgs together is less than 100MiB in size and already
needs over 700MiB of memory and multiple seconds just to parse. 4GiB
worth of input will easily take multiple minutes and over 30GiB of
memory without even evaluating anything. if problems *do* arise we can
probably recover the old table-based system by adding some tracking to
Pos::Origin (or increasing the size of PosIdx outright), but for time
being this looks like more complexity than it's worth.
since we now need to read the entire input again to determine the
line/column of a position we'll make unsafeGetAttrPos slightly lazy:
mostly the set it returns is only used to determine the file of origin
of an attribute, not its exact location. the thunks do not add
measurable runtime overhead.
notably this change is necessary to allow changing the parser since
apparently nothing supports nix's very idiosyncratic line ending choice
of "anything goes", making it very hard to calculate line/column
positions in the parser (while byte offsets are very easy).
(cherry picked from commit 5d9fdab3de0ee17c71369ad05806b9ea06dfceda)
Change-Id: Ie0b2430cb120c09097afa8c0101884d94f4bbf34
While preparing PRs like #9753, I've had to change error messages in
dozens of code paths. It would be nice if instead of
EvalError("expected 'boolean' but found '%1%'", showType(v))
we could write
TypeError(v, "boolean")
or similar. Then, changing the error message could be a mechanical
refactor with the compiler pointing out places the constructor needs to
be changed, rather than the error-prone process of grepping through the
codebase. Structured errors would also help prevent the "same" error
from having multiple slightly different messages, and could be a first
step towards error codes / an error index.
This PR reworks the exception infrastructure in `libexpr` to
support exception types with different constructor signatures than
`BaseError`. Actually refactoring the exceptions to use structured data
will come in a future PR (this one is big enough already, as it has to
touch every exception in `libexpr`).
The core design is in `eval-error.hh`. Generally, errors like this:
state.error("'%s' is not a string", getAttrPathStr())
.debugThrow<TypeError>()
are transformed like this:
state.error<TypeError>("'%s' is not a string", getAttrPathStr())
.debugThrow()
The type annotation has moved from `ErrorBuilder::debugThrow` to
`EvalState::error`.
(cherry picked from commit c6a89c1a1659b31694c0fbcd21d78a6dd521c732)
Change-Id: Iced91ba4e00ca9e801518071fb43798936cbd05a
there's no reason the parser itself should be doing semantic analysis
like bindVars. split this bit apart (retaining the previous name in
EvalState) and have the parser really do *only* parsing, decoupled from
EvalState.
(cherry picked from commit b596cc9e7960b9256bcd557334d81e9d555be5a2)
Change-Id: I481a7623afc783e9d28a6eb4627552cf8a780986
most instances of this being used do not refer to the "current"
position, sometimes not even to one reasonably close by. it could also
be called `makePos` instead, but `at` seems clear in context.
(cherry picked from commit 835a6c7bcfd0b22acc16f31de5fc7bb650d52017)
Change-Id: I17cab8a6cc14cac5b64624431957bfcf04140809
ParserState better describes what this struct really is. the parser
really does modify its state (most notably position and symbol tables),
so calling it that rather than obliquely "data" (which implies being
input only) makes sense.
(cherry picked from commit 007605616477f4f0d8a0064c375b1d3cf6188ac5)
Change-Id: I92feaec796530e1d4d0f7d4fba924229591cea95
If we call `adjustLoc`, the global variable `prev_yylloc` is shared
between threads and racy.
Currently, nix itself does not concurrently parsing files, but this is
helpful for libexpr users. (The parser is thread-safe except this.)
Pos objects are somewhat wasteful as they duplicate the origin file name and
input type for each object. on files that produce more than one Pos when parsed
this a sizeable waste of memory (one pointer per Pos). the same goes for
ptr<Pos> on 64 bit machines: parsing enough source to require 8 bytes to locate
a position would need at least 8GB of input and 64GB of expression memory. it's
not likely that we'll hit that any time soon, so we can use a uint32_t index to
locate positions instead.
Before the change lexter errors did not report the location:
$ nix build -f. mc
error: path has a trailing slash
(use '--show-trace' to show detailed location information)
Note that it's not clear what file generates the error.
After the change location is reported:
$ src/nix/nix --extra-experimental-features nix-command build -f ~/nm mc
error: path has a trailing slash
at .../pkgs/development/libraries/glib/default.nix:54:18:
53| };
54| src = /tmp/foo/;
| ^
55|
(use '--show-trace' to show detailed location information)
Here we see both problematic file and the string itself.
it can be replaced with StringToken if we add another bit if information to
StringToken, namely whether this string should take part in indentation scanning
or not. since all escaping terminates indentation scanning we need to set this
bit only for the non-escaped IND_STRING rule.
this improves performance by about 1%.
before
nix search --no-eval-cache --offline ../nixpkgs hello
Time (mean ± σ): 8.880 s ± 0.048 s [User: 6.809 s, System: 1.643 s]
Range (min … max): 8.781 s … 8.993 s 20 runs
nix eval -f ../nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix
Time (mean ± σ): 375.0 ms ± 2.2 ms [User: 339.8 ms, System: 35.2 ms]
Range (min … max): 371.5 ms … 379.3 ms 20 runs
nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 2.831 s ± 0.040 s [User: 2.536 s, System: 0.225 s]
Range (min … max): 2.769 s … 2.912 s 20 runs
after
nix search --no-eval-cache --offline ../nixpkgs hello
Time (mean ± σ): 8.832 s ± 0.048 s [User: 6.757 s, System: 1.657 s]
Range (min … max): 8.743 s … 8.921 s 20 runs
nix eval -f ../nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix
Time (mean ± σ): 367.4 ms ± 3.2 ms [User: 332.7 ms, System: 34.7 ms]
Range (min … max): 364.6 ms … 374.6 ms 20 runs
nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 2.810 s ± 0.030 s [User: 2.517 s, System: 0.225 s]
Range (min … max): 2.742 s … 2.854 s 20 runs
mainly to avoid an allocation and a copy of a string that can be
modified in place (ever since EvalState holds on to the buffer, not the
generated parser itself).
# before
Benchmark 1: nix search --offline nixpkgs hello
Time (mean ± σ): 571.7 ms ± 2.4 ms [User: 563.3 ms, System: 8.0 ms]
Range (min … max): 566.7 ms … 579.7 ms 50 runs
Benchmark 2: nix eval -f ../nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix
Time (mean ± σ): 376.6 ms ± 1.0 ms [User: 345.8 ms, System: 30.5 ms]
Range (min … max): 374.5 ms … 379.1 ms 50 runs
Benchmark 3: nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 2.922 s ± 0.006 s [User: 2.707 s, System: 0.215 s]
Range (min … max): 2.906 s … 2.934 s 50 runs
# after
Benchmark 1: nix search --offline nixpkgs hello
Time (mean ± σ): 570.4 ms ± 2.8 ms [User: 561.3 ms, System: 8.6 ms]
Range (min … max): 564.6 ms … 578.1 ms 50 runs
Benchmark 2: nix eval -f ../nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix
Time (mean ± σ): 375.4 ms ± 1.3 ms [User: 343.2 ms, System: 31.7 ms]
Range (min … max): 373.4 ms … 378.2 ms 50 runs
Benchmark 3: nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 2.925 s ± 0.006 s [User: 2.704 s, System: 0.219 s]
Range (min … max): 2.910 s … 2.942 s 50 runs
every stringy token the lexer returns is turned into a Symbol and not
used further, so we don't have to strdup. using a string_view is
sufficient, but due to limitations of the current parser we have to use
a POD type that holds the same information.
gives ~2% on system build, 6% on search, 8% on parsing alone
# before
Benchmark 1: nix search --offline nixpkgs hello
Time (mean ± σ): 610.6 ms ± 2.4 ms [User: 602.5 ms, System: 7.8 ms]
Range (min … max): 606.6 ms … 617.3 ms 50 runs
Benchmark 2: nix eval -f hackage-packages.nix
Time (mean ± σ): 430.1 ms ± 1.4 ms [User: 393.1 ms, System: 36.7 ms]
Range (min … max): 428.2 ms … 434.2 ms 50 runs
Benchmark 3: nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 3.032 s ± 0.005 s [User: 2.808 s, System: 0.223 s]
Range (min … max): 3.023 s … 3.041 s 50 runs
# after
Benchmark 1: nix search --offline nixpkgs hello
Time (mean ± σ): 574.7 ms ± 2.8 ms [User: 566.3 ms, System: 8.0 ms]
Range (min … max): 569.2 ms … 580.7 ms 50 runs
Benchmark 2: nix eval -f hackage-packages.nix
Time (mean ± σ): 394.4 ms ± 0.8 ms [User: 361.8 ms, System: 32.3 ms]
Range (min … max): 392.7 ms … 395.7 ms 50 runs
Benchmark 3: nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 2.976 s ± 0.005 s [User: 2.757 s, System: 0.218 s]
Range (min … max): 2.966 s … 2.990 s 50 runs
We now parse function applications as a vector of arguments rather
than as a chain of binary applications, e.g. 'substring 1 2 "foo"' is
parsed as
ExprCall { .fun = <substring>, .args = [ <1>, <2>, <"foo"> ] }
rather than
ExprApp (ExprApp (ExprApp <substring> <1>) <2>) <"foo">
This allows primops to be called immediately (if enough arguments are
supplied) without having to allocate intermediate tPrimOpApp values.
On
$ nix-instantiate --dry-run '<nixpkgs/nixos/release-combined.nix>' -A nixos.tests.simple.x86_64-linux
this gives a substantial performance improvement:
user CPU time: median = 0.9209 mean = 0.9218 stddev = 0.0073 min = 0.9086 max = 0.9340 [rejected, p=0.00000, Δ=-0.21433±0.00677]
elapsed time: median = 1.0585 mean = 1.0584 stddev = 0.0024 min = 1.0523 max = 1.0623 [rejected, p=0.00000, Δ=-0.20594±0.00236]
because it reduces the number of tPrimOpApp allocations from 551990 to
42534 (i.e. only small minority of primop calls are partially
applied) which in turn reduces time spent in the garbage collector.
Using a 64bit integer on 32bit systems will come with a bit of a
performance overhead, but given that Nix doesn't use a lot of integers
compared to other types, I think the overhead is negligible also
considering that 32bit systems are in decline.
The biggest advantage however is that when we use a consistent integer
size across all platforms it's less likely that we miss things that we
break due to that. One example would be:
https://github.com/NixOS/nixpkgs/pull/44233
On Hydra it will evaluate, because the evaluator runs on a 64bit
machine, but when evaluating the same on a 32bit machine it will fail,
so using 64bit integers should make that consistent.
While the change of the type in value.hh is rather easy to do, we have a
few more options available for doing the conversion in the lexer:
* Via an #ifdef on the architecture and using strtol() or strtoll()
accordingly depending on which architecture we are. For the #ifdef
we would need another AX_COMPILE_CHECK_SIZEOF in configure.ac.
* Using istringstream, which would involve copying the value.
* As we're already using boost, lexical_cast might be a good idea.
Spoiler: I went for the latter, first of all because lexical_cast does
have an overload for const char* and second of all, because it doesn't
involve copying around the input string. Also, because istringstream
seems to come with a bigger overhead than boost::lexical_cast:
https://www.boost.org/doc/libs/release/doc/html/boost_lexical_cast/performance.html
The first method (still using strtol/strtoll) also wasn't something I
pursued further, because it is also locale-aware which I doubt is what
we want, given that the regex for int is [0-9]+.
Signed-off-by: aszlig <aszlig@nix.build>
Fixes: #2339
Flex's regexes have an annoying feature: the dot matches everything
except a newline. This causes problems for expressions like:
"${0}\
"
where the backslash-newline combination matches this rule instead of the
intended one mentioned in the comment:
<STRING>\$|\\|\$\\ {
/* This can only occur when we reach EOF, otherwise the above
(...|\$[^\{\"\\]|\\.|\$\\.)+ would have triggered.
This is technically invalid, but we leave the problem to the
parser who fails with exact location. */
return STR;
}
However, the parser actually accepts the resulting token sequence
('"' DOLLAR_CURLY 0 '}' STR '"'), which is a problem because the lexer
rule didn't assign anything to yylval. Ultimately this leads to a crash
when dereferencing a NULL pointer in ExprConcatStrings::bindVars().
The fix does change the syntax of the language in some corner cases
but I think it's only turning previously invalid (or crashing) syntax
to valid syntax. E.g.
"a\
b"
and
''a''\
b''
were previously syntax errors but now both result in "a\nb".
Found by afl-fuzz.
With catch-all rules, we hide potential errors.
It turns out that a4744254 made one cath-all useless. Flex detected that
is was impossible to reach.
The other is more subtle, as it can only trigger on unfinished escapes
in unfinished strings, which only occurs at EOF.