This surprised me at first, too. I later discussed it with a colleague, who suggested that the C implementation behind UFunc could be taking advantage of the CPU cache.
My understanding is that the result only seems contradictory, because the C code I appended does not apply any optimization. After all, different algorithms can take very different amounts of time to reach the same goal, as with sorting.
My point in this article is to demonstrate how much faster code is with UFunc than without it. C is no doubt the faster language when the best algorithm and optimization techniques are applied.
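To make the "with UFunc versus without it" comparison concrete, here is a minimal sketch of how such a measurement could be set up. It is not the benchmark from the article: the choice of `np.sin`, the array size, and the helper names `sin_loop`, `sin_ufunc`, and `compare` are all assumptions made for illustration.

```python
import math
import timeit

import numpy as np


def sin_loop(values):
    """Apply math.sin element by element in a plain Python loop."""
    return [math.sin(v) for v in values]


def sin_ufunc(values):
    """Apply the NumPy ufunc np.sin to the whole array in one call."""
    return np.sin(values)


def compare(n=100_000, repeats=5):
    """Return (loop_seconds, ufunc_seconds), each the best of `repeats` runs.

    n and repeats are illustrative defaults, not values from the article.
    """
    data = np.linspace(0.0, 10.0, n)
    loop_t = min(timeit.repeat(lambda: sin_loop(data), number=1, repeat=repeats))
    ufunc_t = min(timeit.repeat(lambda: sin_ufunc(data), number=1, repeat=repeats))
    return loop_t, ufunc_t


if __name__ == "__main__":
    loop_t, ufunc_t = compare()
    print(f"loop:  {loop_t:.6f} s")
    print(f"ufunc: {ufunc_t:.6f} s")
```

The actual numbers depend on the machine, the NumPy build, and the array size, so I would treat the ratio between the two timings as a rough indication rather than a fixed figure.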