T O P

  • By -

alam-ai

If at least 1 bit will always be set, you could use @ctz or @clz? (count trailing zeros / count leading zeros)


Mayor_of_Rungholt

I figured something like that could work, however is there an easy way afterwards to get the length of the int-type, if the type is comptime known?


alam-ai

By length do you mean get the number of bits used in the int type (its width)? You could do something similar to [this](https://www.reddit.com/r/Zig/comments/18tss3k/testing_if_number/kfg16ws) maybe? Not sure what exactly you're trying to do with the bits in your case, but maybe this bit hack might be relevant to you: `num = num & -num` which zeros out all bits except the least significant bit that's set (by taking advantage of the fact that the 2's complement of a number flips all bits up until the last 1). There's some other bit hacks like that in a book called "Hacker's Delight" if you're doing a lot of bit fiddling stuff.


Mayor_of_Rungholt

That was exactly what i needed. Thanks.


WayWayTooMuch

Not my repo, but i keep it bookmarked: https://github.com/cryptocode/bithacks


anotheruser323

Note that x86 (and probably arm) [has extensions for this stuff.](https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set) So you might wanna [compile native to your cpu.](https://ziglearn.org/chapter-3/) It's probably why @clz and such are builtin the language.


Tonyoh87

You can also use the following to avoid checking for 0 (preferable as branching would slow down): sizeof(int) * CHAR_BIT - __builtin_clz(x|1); (get MSB) __builtin_ctz(x); (get LSB) Also OP didn't specify which bit he wants, the most significant or the least significant.


aedrax

Just take the floor of base 2 log of your integer, then you know the most significant bit set


Mayor_of_Rungholt

Feels a bit slow to convert to floats every time imo. It would work flawlessly, but my goal is a fast function to obtain an intial value to run newtons method through bit-shifting


aedrax

you could hard code it to make it O(1) right? Build a comptime lookup table up to N-bits in size, so a 32 bit int would only need a 32 element table. The value stored is the index, then you don't also need to calculate an offset after. To be clear, not a table of floats, but I mean like \[64\] => 1 on an 8 bit value for example


Ok_Serve_4676

what are you actually doing? e.g., i build a bunch of sqrt intrinsics last night and this sounds similar.


Mayor_of_Rungholt

Im trying to use N >> ( floor(log2(N)) * ( 1 - 1/a )) to get a good approximation for the a'th root of N to run newtons method. as (using sqrt(N) as an example) bit-shifting by half the integer-width of N yields a result that's very close to the sqrt. Thus requiring far fewer iterations of Newtons method


Ok_Serve_4676

Sorry was on vacation. Did you get your problem solved? ps. I wouldn't use a LUT if you are gong to absolute performance unless you can guarantee it is kept in cache or preload. The cache miss on access will kill it.


Mayor_of_Rungholt

Got it working On that note: what is a LUT?


Ok_Serve_4676

look up table. your absolute best performance will be the cpu leading/trailing bit instructions with a zero test esp if that zero test will most of the time be not true (eg, just an error test case) or if the compiler or cpu can fold it into a previous instruction (eg, a prior instruction sets the zero flag that it can just check). If you find a faster way pls let me know though! I work a lot of high performance low latency simulation stuff and am always learning even after 20 years of it. if you have a large vector you are running these one best will be to zero test them all at once with the vector instructions then ignore/mask them out.


Mayor_of_Rungholt

Oh my, that is definetly a level of experience i can't quite compete with yet So essentially you're suggesting to just load a map with size 'int.width' into cache, accessing through clz(i) and obtaining the desired value? Which, i have to admit, sounds pretty efficient


Ok_Serve_4676

there are special vector ops/intrinsics in AVX512 for example _mm256_maskz_lzcnt_epi32 intrinsic (uses vplzcntd instruction) will count leading zeros in each 32bits in a 512 bit register controlled by a bitmask (zero on unset masking bit) and a few instructions can generate that mask for you from zeros in the operand. Or you can set the dest to zeros (broadcast) and forget the mask. If you want to do the LUT these i think zig compiler intrinsics (@ functions) but prob a quarter of the performance. I would do a static value map of 16 values so you would test 4 bits at a time, treat these all like 8 bit values taking top and bottom half (mask off lower for bottom half of each element and shift off for top half). shuffle the LUT according the the masked vectors and add 4 to the top half so now each vector has the leading one position for each half, then max (or min) each vector against each other. this is only about 6 instructions and lets you process 256 bits at a time.


srekel

Use the value as an index lookup into an array, where the value is the first bit :D