Parser generators in Rust:

I'm convinced nobody actually uses parser generators in Rust:

  • pomelo can't generate a lexer (understandable, as it is a port of lemon and lemon can't do this either)
  • pest can't do parse actions, you have to convert your parse tree to ast manually
  • lalrpop can't do comments
    • and the day I wrote the line above, it can
    • reports parsing errors as byte offsets (see the line/column sketch at the end of this list)
    • if you want to skip parsing one of the alternatives, functional design gets quite awkward
  • antlr4rust is untried and requires Java to build
  • no library supports island grammars
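  • Whichever parser ends up being used, byte offsets (as lalrpop reports them) have to be mapped back to line/column positions for readable error messages. A minimal, std-only sketch of that helper (names are illustrative, not actual ZLUDA code):
    /// Map a byte offset into `src` to a 1-based (line, column) pair.
    fn offset_to_line_col(src: &str, offset: usize) -> (usize, usize) {
        let mut line = 1;
        let mut line_start = 0;
        for (idx, ch) in src.char_indices() {
            if idx >= offset {
                break;
            }
            if ch == '\n' {
                line += 1;
                line_start = idx + 1;
            }
        }
        (line, offset - line_start + 1)
    }

    fn main() {
        let src = ".visible .entry foo(\n.param .u64 bar\n)";
        assert_eq!(offset_to_line_col(src, 21), (2, 1));
    }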

What to emit?

  • SPIR-V
    • Better library support, easier to emit
    • Can be optimized by IGC
    • Can't do some things (not sure what exactly yet)
      • But we can work around with inline VISA
  • VISA
    • Quicker compilation

A64 vs BTS

  • How to force A64: -cl-intel-greater-than-4GB-buffer-required
  • PTX made a baffling design choice: global pointers are represented as untyped 64-bit integers
  • Consequently, there's no 100% certain way to know which argument is a surface and which is a scalar
    • It seems that NVidia guys realized what a horrible idea that was and emit cvta.to.global as a marker for global pointers?
      • But it's only emitted in a recent release build, can't rely on it
      • Maybe debug builds emit debug metadata to detect surfaces?
      • Might add this as an optimization later
    • cuLaunchKernel docs say this: "The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image"; note the wording: "offsets and sizes", not types
    • Wait, you can mark an argument as a pointer with .ptr: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr, but it's useless with NV compiler not emitting it
  • Potential solution: compile only during the dispatch, when type of arguments is known?
    • Can't do, the set of arguments passed to cuLaunchKernel is untyped
  • Solution: treat all arguments as untyped integers and say goodbye to BTS access
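  • A minimal sketch of what that could look like at dispatch time, assuming the module loader has already recovered per-argument offsets and sizes from the kernel image (KernelArgLayout and pack_kernel_args are hypothetical names):
    use std::os::raw::c_void;

    /// Per-argument metadata recovered from the kernel image: offsets and
    /// sizes only, no types, exactly what cuLaunchKernel needs.
    struct KernelArgLayout {
        offset: usize,
        size: usize,
    }

    /// Copy the untyped values behind `kernel_params` (an array of pointers
    /// to the individual arguments) into one flat buffer with the layout the
    /// kernel expects. Every argument is an opaque blob here; a 64-bit
    /// integer and a global pointer are indistinguishable, which is why
    /// BTS-style typed surface access is off the table.
    unsafe fn pack_kernel_args(
        layout: &[KernelArgLayout],
        kernel_params: *const *const c_void,
    ) -> Vec<u8> {
        let total = layout.iter().map(|a| a.offset + a.size).max().unwrap_or(0);
        let mut buf = vec![0u8; total];
        for (i, arg) in layout.iter().enumerate() {
            let src = *kernel_params.add(i) as *const u8;
            std::ptr::copy_nonoverlapping(src, buf[arg.offset..].as_mut_ptr(), arg.size);
        }
        buf
    }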

Implicit conversions

  • PTX support for implicit conversions is completely degenerate, docs say:
    For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes
    Which is sensible, but completely untrue. In reality ptxas compiles silly code like this:
    param.f32       param_1
    ...
    .reg.s32        %r1
    ld.param.b16 	%r1, [param_1];
    
  • Surprise, surprise, there are two kinds of implicit conversions at play in the example above:
    • "Relaxed type-checking rules": this is the conversion of b16 operation type to s32 dst register
    • Undocumented type coercion when dereferencing param_1. The PTX behaviour is to coerce every type. It's something to the effect of [param_1] = *(b16*)param_1
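  • One plausible way for a translator to make the first of these ("relaxed type-checking") explicit: load at the instruction width into a scratch register, then widen with an explicit extension chosen by the destination's signedness. ScalarType, Op and lower_widening_load below are illustrative, not actual ZLUDA code:
    /// Only width and signedness matter for this sketch.
    #[derive(Clone, Copy)]
    enum ScalarType {
        B16,
        S32,
        U32,
    }

    impl ScalarType {
        fn width_bits(self) -> u32 {
            match self {
                ScalarType::B16 => 16,
                ScalarType::S32 | ScalarType::U32 => 32,
            }
        }
        fn is_signed(self) -> bool {
            matches!(self, ScalarType::S32)
        }
    }

    /// Simplified IR ops emitted by the translator.
    enum Op {
        Load { ty: ScalarType, dst: u32, src: u32 },
        SignExtend { dst: u32, src: u32 },
        ZeroExtend { dst: u32, src: u32 },
    }

    /// `ld.param.b16 %r1, [param_1]` with `%r1` declared as .s32 becomes a
    /// 16-bit load into a scratch register plus an explicit extension into
    /// the real destination, so the widening is visible in the IR instead
    /// of implicit.
    fn lower_widening_load(
        op_ty: ScalarType,
        dst_ty: ScalarType,
        dst: u32,
        src: u32,
        scratch: u32,
    ) -> Vec<Op> {
        if op_ty.width_bits() >= dst_ty.width_bits() {
            return vec![Op::Load { ty: op_ty, dst, src }];
        }
        let ext = if dst_ty.is_signed() {
            Op::SignExtend { dst, src: scratch }
        } else {
            Op::ZeroExtend { dst, src: scratch }
        };
        vec![Op::Load { ty: op_ty, dst: scratch, src }, ext]
    }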

PTX grammar

  • PTX grammar rules are atrocious: keywords can be freely reused as ids without escaping (see the sketch below)
  • Modifiers can be applied to instructions in any arbitrary order. We don't support it and hope we will never have to
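  • The keyword reuse alone forces the "expect an identifier" path to also accept keyword tokens and treat their spelling as the name. A sketch with made-up token types:
    /// Illustrative token kinds for a hand-written PTX front end.
    enum Token {
        Ident(String),
        Keyword(&'static str),
        LBracket,
    }

    /// PTX allows keywords to be reused as plain identifiers without any
    /// escaping, so a register or variable may literally be called `mov`.
    fn expect_ident(tok: &Token) -> Option<String> {
        match tok {
            Token::Ident(name) => Some(name.clone()),
            Token::Keyword(kw) => Some((*kw).to_string()),
            _ => None,
        }
    }

    fn main() {
        assert_eq!(expect_ident(&Token::Ident("foo".to_string())), Some("foo".to_string()));
        assert_eq!(expect_ident(&Token::Keyword("mov")), Some("mov".to_string()));
        assert_eq!(expect_ident(&Token::LBracket), None);
    }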

Rust debugging

  • Nothing works 100% well on vscode/Windows:
    • MSVC/lldb - always garbage (simple enums are fubar)
    • MSVC/cppvsdbg - sometimes garbage (nested enums are fubar)
    • GNU/lldb - mostly fine, but can't follow child processes
    • GNU/gdb - always garbage (I don't have the patience to manually QA rust-gdb on Windows) and doesn't quite understand file paths for breakpoints
  • Neither on vscode/Linux:
    • lldb - mostly fine, but can't follow child processes
    • gdb - visualizes variables somewhat awkwardly (shows all possible variants of an enum)
  • CLion could be the solution, but intellij-rust can't load this project

CUDA <-> L0

  • device ~= device
  • stream ~= command queue
  • context ~= context (1.0+)
  • graph ~= command list
  • module ~= module
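  • A sketch of how that mapping could look as types, with hand-written stand-ins for the ze_api.h handles (in a real build these would come from FFI bindings; all names below are illustrative):
    use std::os::raw::c_void;

    // Stand-ins for the opaque Level Zero handles (ze_device_handle_t,
    // ze_context_handle_t, ze_command_queue_handle_t, ...).
    type ZeDeviceHandle = *mut c_void;
    type ZeContextHandle = *mut c_void;
    type ZeCommandQueueHandle = *mut c_void;
    type ZeCommandListHandle = *mut c_void;
    type ZeModuleHandle = *mut c_void;

    // Each CUDA-level object owns the Level Zero object it approximates,
    // following the list above.
    struct CudaDevice { l0: ZeDeviceHandle }        // device  ~= device
    struct CudaContext { l0: ZeContextHandle }      // context ~= context (1.0+)
    struct CudaStream { l0: ZeCommandQueueHandle }  // stream  ~= command queue
    struct CudaGraph { l0: ZeCommandListHandle }    // graph   ~= command list
    struct CudaModule { l0: ZeModuleHandle }        // module  ~= module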

IGC

  • IGC is extremely brittle and segfaults on fairly innocent code:
    • OpBitcast of pointer to uint
    • OpCopyMemory of alloca'd variable