Very interesting. Does this mean certain optimizations run before the monomorphization step? Do you know why? Compilation performance is the obvious thing that comes to mind.
Yes, there are optimizations that run before monomorphization; this one happens during macro expansion.
It's a bit of a chicken-and-egg problem. To monomorphize clone(), someone must emit an implementation first, but an optimal implementation requires analyses that aren't available until later in the pipeline. Here the optimization kicks in for types that also derive Copy, but a single generic parameter is enough to defeat it.
IIRC, that "optimization" mostly avoids wasting time compiling a complex `Clone` implementation, when simply returning `*self` suffices (there are some crates with a lot of `#[derive(Copy, Clone)]` types). We try to avoid having a lot of logic like that too early, for precisely the reasons you mention.
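For concreteness, here is roughly what the two cases boil down to. This is a hand-written sketch of what `#[derive(Clone)]` effectively produces, not its literal expansion, and the `Concrete`/`Generic` names are made up:

    // Concrete type that also derives Copy: the derive can emit a trivial
    // shallow clone, effectively just `*self`.
    #[derive(Copy)]
    struct Concrete(u8, u32);

    impl Clone for Concrete {
        fn clone(&self) -> Self {
            *self
        }
    }

    // Add a type parameter and the derive falls back to the general
    // field-by-field implementation with a `T: Clone` bound, even if every
    // actual instantiation ends up being Copy.
    #[derive(Copy)]
    struct Generic<T>(T, u32);

    impl<T: Clone> Clone for Generic<T> {
        fn clone(&self) -> Self {
            Generic(self.0.clone(), self.1.clone())
        }
    }

    fn main() {
        let c = Concrete(1, 2).clone();
        let g = Generic(3u8, 4).clone();
        assert_eq!((c.0, g.0), (1, 3));
        assert_eq!(c.1 + g.1, 6);
    }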
I'd be interested in an example where LLVM can't optimize the general version, as it means we might want to do this through MIR shims instead (which can be generated when collecting the monomorphic instances to codegen; this is what happens when you clone a tuple or closure, for example).
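To make the tuple/closure case concrete: neither Clone implementation used below is written anywhere in source or produced by a derive; per the above, the compiler generates them as shims when it collects the monomorphic instances.

    fn main() {
        // The Clone impl for this tuple type is a compiler-generated shim.
        let t = (1u8, String::from("hi"));
        let t2 = t.clone();

        // Same for the closure: it is Clone because its capture (a String)
        // is, and the impl is again a generated shim, not library code.
        let s = String::from("captured");
        let f = move || s.len();
        let f2 = f.clone();

        assert_eq!(t2, (1u8, String::from("hi")));
        assert_eq!(f2(), 8);
    }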
The behavior differs between the current nightly and rustc 1.45.2; the nightly available when this link was posted matched the 1.45.2 behavior.
The output with 1.45.2 is as follows:
example::clone_concrete:
        mov eax, edi
        ret

example::clone_abstract:
        mov ecx, edi
        and ecx, -256
        xor eax, eax
        xor edx, edx
        cmp dil, 1
        sete dl
        cmove eax, ecx
        or eax, edx
        ret
Fascinating coincidence! It was probably the LLVM upgrade (https://github.com/rust-lang/rust/pull/73526) landing, most likely before the comment was even posted (though a nightly with the upgraded LLVM would only show up the next day).
For example, see how switching from a concrete to a generic type increases the size of clone() by 4x: https://rust.godbolt.org/z/qbYr3v
This is because, at the point where the compiler synthesizes clone(), it has less information about the type.
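For reference, the comparison has roughly this shape. The function names match the symbols in the assembly above, but the `Concrete`/`Generic` enums are a hypothetical stand-in, not the actual code behind the godbolt link:

    // Two versions of the same enum; the only difference is the type parameter.
    #[derive(Copy, Clone)]
    pub enum Concrete {
        A,
        B(u8),
    }

    #[derive(Copy, Clone)]
    pub enum Generic<T> {
        A,
        B(T),
    }

    // Clone here is derived as a shallow `*self`, so this is just a move.
    pub fn clone_concrete(x: Concrete) -> Concrete {
        x.clone()
    }

    // Clone here goes through the general derived implementation, which the
    // backend then has to optimize back down (or not, as the 1.45.2 output
    // above shows).
    pub fn clone_abstract(x: Generic<u8>) -> Generic<u8> {
        x.clone()
    }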