On Mon, Dec 11, 2023 at 06:27:14PM -0800, Yury Norov wrote:
Add helpers around test_and_{set,clear}_bit() that allow searching for clear or set bits and flipping them atomically.
The target patterns may look like this:
	for (idx = 0; idx < nbits; idx++)
		if (test_and_clear_bit(idx, bitmap))
			do_something(idx);
Or like this:
	do {
		bit = find_first_bit(bitmap, nbits);
		if (bit >= nbits)
			return nbits;
	} while (!test_and_clear_bit(bit, bitmap));

	return bit;
In both cases, the open-coded loop may be converted to a single function or iterator call. Correspondingly:
	for_each_test_and_clear_bit(idx, bitmap, nbits)
		do_something(idx);
Or:

	return find_and_clear_bit(bitmap, nbits);
Obviously, the less routine code people have to write themselves, the lower the chance of making a mistake.
These are not only handy helpers; they also resolve a non-trivial issue of using the non-atomic find_bit() together with the atomic test_and_{set,clear}_bit().
The trick is that find_bit() implies that the bitmap is a regular non-volatile piece of memory, so the compiler is allowed to use optimization techniques like re-fetching memory instead of caching it.
For example, find_first_bit() is implemented like this:
	for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {
		val = addr[idx];
		if (val) {
			sz = min(idx * BITS_PER_LONG + __ffs(val), sz);
			break;
		}
	}
On register-memory architectures like x86, the compiler may decide to access the memory twice: first to compare the word against 0, and then to fetch its value again to pass it to __ffs().
When find_first_bit() runs on volatile memory, the memory may change between the two accesses, which may, for instance, lead to passing 0 to __ffs(), whose behavior is undefined for a zero argument. This is a potentially dangerous call.
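For illustration, here is a minimal sketch of one way the re-fetch may be avoided, assuming a READ_ONCE()-style annotation (an illustration of the idea, not the exact code from the series):

	/*
	 * Even though the source reads addr[idx] into 'val' only once,
	 * the compiler is allowed to re-fetch the non-volatile memory
	 * for the __ffs() call. READ_ONCE() forbids that, so the value
	 * tested against 0 and the value passed to __ffs() are
	 * guaranteed to be the same.
	 */
	val = READ_ONCE(addr[idx]);
	if (val) {
		sz = min(idx * BITS_PER_LONG + __ffs(val), sz);
		break;
	}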
find_and_clear_bit(), as a wrapper around test_and_clear_bit(), naturally treats the underlying bitmap as volatile memory and prevents the compiler from applying such optimizations.
Now that KCSAN catches exactly this type of situation and warns about concurrent memory modifications, we can use it to reveal improper usage of find_bit() and convert such users to the atomic find_and_*_bit(), as appropriate.
In some cases concurrent operations with plain find_bit() are acceptable. For example:
- two threads running find_*_bit(): safe with respect to __ffs(0) and returns a correct value, because the underlying bitmap is unchanged;
- find_next_bit() in parallel with set_bit() or clear_bit(), when the modified bit is before the start of the search: safe and correct;
- find_first_bit() in parallel with set_bit(): safe, but may return wrong bit number;
- find_first_zero_bit() in parallel with clear_bit(): same as above.
In the last two cases, find_bit() may not return the correct bit number, but that may be OK if the caller requires any (not necessarily the first) set or clear bit, respectively.
In such cases, KCSAN may be safely silenced with data_race(). But in most cases where KCSAN detects concurrency, people should carefully review their code and likely protect critical sections or switch to the atomic find_and_bit(), as appropriate.
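For example, a hypothetical caller that is happy with any set bit could annotate the benign race like this (an illustrative sketch, not code from the series):

	/*
	 * Any set bit will do here, so the racy read is intentionally
	 * tolerated; data_race() documents that and silences KCSAN.
	 */
	bit = data_race(find_first_bit(bitmap, nbits));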
The 1st patch of the series adds the following atomic primitives:
	find_and_set_bit(addr, nbits);
	find_and_set_next_bit(addr, nbits, start);
	...
Here, the find_and_{set,clear} part refers to the corresponding test_and_{set,clear}_bit() function. Suffixes like _wrap or _lock derive their semantics from the corresponding find() or test() functions.
For brevity, the naming omits the fact that we search for a zero bit in the find_and_set() functions, and correspondingly for a set bit in the find_and_clear() functions.
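For example, atomically claiming a free slot from a bitmap may look like this (a hypothetical usage sketch; 'slots' and 'nr_slots' are made-up names):

	/*
	 * Find a zero bit and set it atomically; threads racing on the
	 * same bitmap can never claim the same slot.
	 */
	idx = find_and_set_bit(slots, nr_slots);
	if (idx >= nr_slots)
		return -ENOSPC;		/* no free slot */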
The patch also adds iterators with atomic semantics, like for_each_test_and_set_bit(). Here, the naming rule is to simply prefix the corresponding atomic operation with 'for_each'.
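A hypothetical sketch of the set flavor ('map' and 'init_slot()' are made-up names):

	/*
	 * Visit each clear bit, atomically setting it first, so two
	 * threads running this loop never initialize the same slot.
	 */
	for_each_test_and_set_bit(idx, map, nbits)
		init_slot(idx);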
In [1], Jan reported a 2% slowdown in a single-threaded search test when switching the find_bit() functions to treat bitmaps as volatile arrays. On the other hand, the kernel test robot in the same thread reported a +3.7% improvement in the will-it-scale.per_thread_ops test.
Assuming that our compilers are sane and generate better code for properly annotated data, the above discrepancy isn't surprising: when running on non-volatile bitmaps, plain find_bit() outperforms the atomic find_and_bit(), and vice versa.
So, all users of the find_bit() API where heavy concurrency is expected are encouraged to switch to the atomic find_and_bit(), as appropriate.
The 1st patch of this series adds the atomic find_and_bit() API, the 2nd adds a basic test for the new API, and the following patches spread it over the kernel.
They can be applied separately from each other on a per-subsystem basis, or I can take them through the bitmap tree, as appropriate.
[1] https://lore.kernel.org/lkml/634f5fdf-e236-42cf-be8d-48a581c21660@alu.unizg....
Thank you all for reviews and comments. Now moving the series to bitmap-for-next for testing.
Thanks, Yury