Parser generators in Rust:

I'm convinced nobody actually uses parser generators in Rust:

  • pomelo can't generate a lexer (understandable, as it is a port of lemon and lemon can't do this either)
  • pest can't do parse actions, you have to convert your parse tree to ast manually
  • lalrpop can't do comments
    • and the day I wrote the line above, it can
    • reports parsing errors as byte offsets (see the line/column sketch at the end of this list)
    • if you want to skip parsing one of the alternatives, functional design gets quite awkward
  • antlr4rust is untried and requires Java to build
  • no library supports island grammars
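  • Whichever parser ends up being used, byte offsets (as lalrpop reports them) have to be mapped back to line/column positions for readable error messages. A minimal, std-only sketch of that helper (names are illustrative, not actual ZLUDA code):
    /// Map a byte offset into `src` to a 1-based (line, column) pair.
    fn offset_to_line_col(src: &str, offset: usize) -> (usize, usize) {
        let mut line = 1;
        let mut line_start = 0;
        for (idx, ch) in src.char_indices() {
            if idx >= offset {
                break;
            }
            if ch == '\n' {
                line += 1;
                line_start = idx + 1;
            }
        }
        (line, offset - line_start + 1)
    }

    fn main() {
        let src = ".visible .entry foo(\n.param .u64 bar\n)";
        assert_eq!(offset_to_line_col(src, 21), (2, 1));
    }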

What to emit?

  • SPIR-V
    • Better library support, easier to emit
    • Can be optimized by IGC
    • Can't do some things (not sure what exactly yet)
      • But we can work around with inline VISA
  • VISA
    • Quicker compilation

A64 vs BTS

  • How to force A64: -cl-intel-greater-than-4GB-buffer-required
  • PTX made a baffling design choice: global pointers are represented as untyped 64-bit integers
  • Consequently, there's no 100% certain way to know which argument is a surface and which is a scalar
    • It seems that NVidia guys realized what a horrible idea that was and emit cvta.to.global as a marker for global pointers?
      • But it's only emitted in a recent release build, can't rely on it
      • Maybe debug builds emit debug metadata to detect surfaces?
      • Might add this as an optimization later
    • cuLaunchKernel docs say this: "The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image"; note the wording: "offsets and sizes", not types
    • Wait, you can mark an argument as a pointer with .ptr: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr, but it's useless with NV compiler not emitting it
  • Potential solution: compile only during the dispatch, when type of arguments is known?
    • Can't do, the set of arguments passed to cuLaunchKernel is untyped
  • Solution: treat all arguments as untyped integers and say goodbye to BTS access
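  • A minimal sketch of what that could look like at dispatch time, assuming the module loader has already recovered per-argument offsets and sizes from the kernel image (KernelArgLayout and pack_kernel_args are hypothetical names):
    use std::os::raw::c_void;

    /// Per-argument metadata recovered from the kernel image: offsets and
    /// sizes only, no types, exactly what cuLaunchKernel needs.
    struct KernelArgLayout {
        offset: usize,
        size: usize,
    }

    /// Copy the untyped values behind `kernel_params` (an array of pointers
    /// to the individual arguments) into one flat buffer with the layout the
    /// kernel expects. Every argument is an opaque blob here; a 64-bit
    /// integer and a global pointer are indistinguishable, which is why
    /// BTS-style typed surface access is off the table.
    unsafe fn pack_kernel_args(
        layout: &[KernelArgLayout],
        kernel_params: *const *const c_void,
    ) -> Vec<u8> {
        let total = layout.iter().map(|a| a.offset + a.size).max().unwrap_or(0);
        let mut buf = vec![0u8; total];
        for (i, arg) in layout.iter().enumerate() {
            let src = *kernel_params.add(i) as *const u8;
            std::ptr::copy_nonoverlapping(src, buf[arg.offset..].as_mut_ptr(), arg.size);
        }
        buf
    }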

Implicit conversions

  • PTX support for implicit conversions is completely degenerate, docs say:
    For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes
    Which is sensible, but completely untrue. In reality ptxas compiles silly code like this:
    param.f32       param_1
    ...
    .reg.s32        %r1
    ld.param.b16 	%r1, [param_1];
    
  • Surprise, surprise, there are two kinds of implicit conversions at play in the example above:
    • "Relaxed type-checking rules": this is the conversion of b16 operation type to s32 dst register
    • Undocumented type coercion when dereferencing param_1. The PTX behaviour is to coerce every type. It's something to the effect of [param_1] = *(b16*)param_1
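  • One plausible way for a translator to make the first of these ("relaxed type-checking") explicit: load at the instruction width into a scratch register, then widen with an explicit extension chosen by the destination's signedness. ScalarType, Op and lower_widening_load below are illustrative, not actual ZLUDA code:
    /// Only width and signedness matter for this sketch.
    #[derive(Clone, Copy)]
    enum ScalarType {
        B16,
        S32,
        U32,
    }

    impl ScalarType {
        fn width_bits(self) -> u32 {
            match self {
                ScalarType::B16 => 16,
                ScalarType::S32 | ScalarType::U32 => 32,
            }
        }
        fn is_signed(self) -> bool {
            matches!(self, ScalarType::S32)
        }
    }

    /// Simplified IR ops emitted by the translator.
    enum Op {
        Load { ty: ScalarType, dst: u32, src: u32 },
        SignExtend { dst: u32, src: u32 },
        ZeroExtend { dst: u32, src: u32 },
    }

    /// `ld.param.b16 %r1, [param_1]` with `%r1` declared as .s32 becomes a
    /// 16-bit load into a scratch register plus an explicit extension into
    /// the real destination, so the widening is visible in the IR instead
    /// of implicit.
    fn lower_widening_load(
        op_ty: ScalarType,
        dst_ty: ScalarType,
        dst: u32,
        src: u32,
        scratch: u32,
    ) -> Vec<Op> {
        if op_ty.width_bits() >= dst_ty.width_bits() {
            return vec![Op::Load { ty: op_ty, dst, src }];
        }
        let ext = if dst_ty.is_signed() {
            Op::SignExtend { dst, src: scratch }
        } else {
            Op::ZeroExtend { dst, src: scratch }
        };
        vec![Op::Load { ty: op_ty, dst: scratch, src }, ext]
    }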

PTX grammar

  • PTX grammar rules are atrocious: keywords can be freely reused as ids without escaping (see the sketch below)
  • Modifiers can be applied to instructions in any arbitrary order. We don't support it and hope we will never have to
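  • The keyword reuse alone forces the "expect an identifier" path to also accept keyword tokens and treat their spelling as the name. A sketch with made-up token types:
    /// Illustrative token kinds for a hand-written PTX front end.
    enum Token {
        Ident(String),
        Keyword(&'static str),
        LBracket,
    }

    /// PTX allows keywords to be reused as plain identifiers without any
    /// escaping, so a register or variable may literally be called `mov`.
    fn expect_ident(tok: &Token) -> Option<String> {
        match tok {
            Token::Ident(name) => Some(name.clone()),
            Token::Keyword(kw) => Some((*kw).to_string()),
            _ => None,
        }
    }

    fn main() {
        assert_eq!(expect_ident(&Token::Ident("foo".to_string())), Some("foo".to_string()));
        assert_eq!(expect_ident(&Token::Keyword("mov")), Some("mov".to_string()));
        assert_eq!(expect_ident(&Token::LBracket), None);
    }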

Rust debugging

  • Nothing works 100% well on vscode/Windows:
    • MSVC/lldb - always garbage (simple enums are fubar)
    • MSVC/cppvsdbg - sometimes garbage (nested enums are fubar)
    • GNU/lldb - mostly fine, but can't follow child processes
    • GNU/gdb - always garbage (I don't have the patience to manually QA rust-gdb on Windows) and doesn't quite understand file paths for breakpoints
  • Neither on vscode/Linux:
    • lldb - mostly fine, but can't follow child processes
    • gdb - visualizes variables somewhat awkwardly (shows all possible variants of an enum)
  • CLion could be the solution, but intellij-rust can't load this project

CUDA <-> L0

  • device ~= device
  • stream ~= command queue
  • context ~= context (1.0+)
  • graph ~= command list
  • module ~= module
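  • A sketch of how that mapping could look as types, with hand-written stand-ins for the ze_api.h handles (in a real build these would come from FFI bindings; all names below are illustrative):
    use std::os::raw::c_void;

    // Stand-ins for the opaque Level Zero handles (ze_device_handle_t,
    // ze_context_handle_t, ze_command_queue_handle_t, ...).
    type ZeDeviceHandle = *mut c_void;
    type ZeContextHandle = *mut c_void;
    type ZeCommandQueueHandle = *mut c_void;
    type ZeCommandListHandle = *mut c_void;
    type ZeModuleHandle = *mut c_void;

    // Each CUDA-level object owns the Level Zero object it approximates,
    // following the list above.
    struct CudaDevice { l0: ZeDeviceHandle }        // device  ~= device
    struct CudaContext { l0: ZeContextHandle }      // context ~= context (1.0+)
    struct CudaStream { l0: ZeCommandQueueHandle }  // stream  ~= command queue
    struct CudaGraph { l0: ZeCommandListHandle }    // graph   ~= command list
    struct CudaModule { l0: ZeModuleHandle }        // module  ~= module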

IGC

  • IGC is extremely brittle and segfaults on fairly innocent code:
    • OpBitcast of pointer to uint
    • OpCopyMemory of alloca'd variable