OSX threading #1

gaming-hacker · 2016-02-09T20:35:16Z

I'm trying to integrate your fixes for OSX into xiany/branch, but for some reason the new threading model and the OSX thread_barrier don't like each other.
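For context, macOS has historically not provided the optional POSIX pthread_barrier_t, so a port has to supply its own barrier. The sketch below is one common mutex-plus-condition-variable construction. Every demo_* name is hypothetical, and this is not claimed to be the barrier used in BLIS or in this branch.

```c
#include <pthread.h>

/* Hypothetical barrier type; not the BLIS threading API. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_arrived;
    unsigned        n_threads;   /* threads that must arrive before release */
    unsigned        n_waiting;   /* threads that have arrived so far */
    unsigned        generation;  /* incremented each time the barrier opens */
} demo_barrier_t;

/* Returns 0 on success, -1 on failure. */
int demo_barrier_init(demo_barrier_t *b, unsigned n_threads)
{
    if (n_threads == 0) return -1;
    if (pthread_mutex_init(&b->lock, NULL) != 0) return -1;
    if (pthread_cond_init(&b->all_arrived, NULL) != 0) {
        pthread_mutex_destroy(&b->lock);
        return -1;
    }
    b->n_threads  = n_threads;
    b->n_waiting  = 0;
    b->generation = 0;
    return 0;
}

/* Blocks until n_threads callers have arrived.
   Returns 1 in the last thread to arrive and 0 in all others. */
int demo_barrier_wait(demo_barrier_t *b)
{
    int is_last = 0;

    pthread_mutex_lock(&b->lock);
    unsigned my_generation = b->generation;

    if (++b->n_waiting == b->n_threads) {
        /* Last arrival: reset the count and release everyone. */
        b->n_waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
        is_last = 1;
    } else {
        /* The generation check guards against spurious wakeups. */
        while (my_generation == b->generation)
            pthread_cond_wait(&b->all_arrived, &b->lock);
    }

    pthread_mutex_unlock(&b->lock);
    return is_last;
}

void demo_barrier_destroy(demo_barrier_t *b)
{
    pthread_cond_destroy(&b->all_arrived);
    pthread_mutex_destroy(&b->lock);
}
```

The generation counter is what makes the barrier reusable: a thread that re-enters the barrier quickly cannot be confused with a thread still waiting on the previous round.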

Details: - Reverted most changes applied during commit ec25807.

Details:
- Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex was used to lock access to initialization/finalization actions. But everything worked out okay as long as bli_init() was called by single-threaded code.
- Changed to static initialization for memory allocator mutex in bli_mem.c, and moved mutex to that file (from bli_init.c).
- Fixed some type mismatches in bli_threading_pthreads.c that resulted in compiler warnings.
- Fixed a small memory leak with allocated-but-never-freed (and unused) pthread_attr_t objects.
- Whitespace changes to bli_init.c and bli_mem.c.
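As an illustration of the static-initialization change, here is a minimal sketch of a file-local mutex that needs no runtime pthread_mutex_init() call. The demo_* names and the counter are hypothetical stand-ins, not the contents of bli_mem.c.

```c
#include <pthread.h>
#include <stddef.h>

/* Statically initialized: the lock is valid before any init routine runs,
   so there is no window in which it could be used uninitialized. */
static pthread_mutex_t demo_mem_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared allocator state protected by the mutex. */
static size_t demo_blocks_in_use = 0;

/* Marks one block as checked out; returns the new in-use count. */
size_t demo_mem_acquire(void)
{
    pthread_mutex_lock(&demo_mem_mutex);
    size_t n = ++demo_blocks_in_use;
    pthread_mutex_unlock(&demo_mem_mutex);
    return n;
}

/* Marks one block as returned; returns the new in-use count. */
size_t demo_mem_release(void)
{
    pthread_mutex_lock(&demo_mem_mutex);
    if (demo_blocks_in_use > 0) demo_blocks_in_use--;
    size_t n = demo_blocks_in_use;
    pthread_mutex_unlock(&demo_mem_mutex);
    return n;
}
```

Keeping the mutex static to the file that owns the protected state also makes it harder to lock the wrong mutex by accident, which is the kind of mix-up the first item describes.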

Details:
- Spun-off initialization of global scalar constants to bli_const_init() and of threading stuff to bli_thread_init().
- Added some missing _finalize() functions, even when there is nothing to do.

Details: - Removed some stale script code that should have been removed during 590bb3b.

Details:
- Fixed some bugs that only manifested in multithreaded instances of some (non-gemm) level-3 operations. The bugs were related to invalid allocation of "edge" cases to thread subpartitions. (Here, we define an "edge" case to be one where the dimension being partitioned for parallelism is not a whole multiple of whatever register blocksize is needed in that dimension.) In BLIS, we always require edge cases to be part of the bottom, right, or bottom-right subpartitions. (This is so that zero-padding only has to happen at the bottom, right, or bottom-right edges of micro-panels.) The previous implementations of bli_get_range() and _get_range_weighted() did not adhere to this implicit policy and thus produced bad ranges for some combinations of operation, parameter cases, problem sizes, and n-way parallelism.
- As part of the above fix, the functions bli_get_range() and _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b, and _b2t suffixes, similar to the partitioning functions. This is an easy way to make sure that the variants are calling the right version of each function. The function signatures have also been changed slightly.
- Comment/whitespace updates.
- Removed unnecessary '/' from macros in bli_obj_macro_defs.h.
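The edge-case policy can be sketched as follows for a left-to-right (or top-to-bottom) split: whole blocks are dealt out to the threads, and any leftover partial block is attached to the last range only. This is a simplified, unweighted illustration with hypothetical names and signature, not the actual bli_get_range_l2r().

```c
#include <assert.h>
#include <stddef.h>

/* Computes the half-open range [*start, *end) owned by thread `id` when a
   dimension of size n is split n_way ways in units of blocksize bf.
   Hypothetical helper for illustration only. */
void demo_get_range_l2r(size_t id, size_t n_way, size_t n, size_t bf,
                        size_t *start, size_t *end)
{
    assert(n_way > 0 && bf > 0 && id < n_way);

    size_t n_whole  = n / bf;           /* number of full blocks            */
    size_t n_edge   = n % bf;           /* size of the partial (edge) block */
    size_t n_each   = n_whole / n_way;  /* full blocks every thread gets    */
    size_t n_extra  = n_whole % n_way;  /* first n_extra threads get one more */

    size_t my_blocks     = n_each + (id < n_extra ? 1 : 0);
    size_t blocks_before = id * n_each + (id < n_extra ? id : n_extra);

    *start = blocks_before * bf;
    *end   = *start + my_blocks * bf;

    /* The edge case always lands in the last (right/bottom) range. */
    if (id == n_way - 1) *end += n_edge;
}
```

For example, with n = 22, bf = 4 and three threads, the ranges come out as [0, 8), [8, 16) and [16, 22): only the final range has a size that is not a multiple of the blocksize. A right-to-left variant would mirror this so that the edge still sits at the right or bottom of the matrix.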

Details:
- Added API-level initialization state to _const, _error, _mem, _thread, _ind, and _cntl APIs. While this functionality will mostly go unused, adding minuscule overhead at init-time, there will be at least one instance in the near future where, in order to avoid an infinite loop, a certain portion of the initialization will call a query function that itself attempts to call bli_init(). API-level initialization will allow this later stage to verify that an earlier stage of initialization has completed, even if the overall call to bli_init() has not yet returned.
- Added _is_initialized() functions for each API, setting the underlying bool_t during _init() and unsetting it during _finalize().
- Comment, whitespace changes.

Details: - Added conditional code that returns early from the API-level _init() routines if the API is already initialized. Actually meant for this to be included in 5f93cbe.
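A minimal sketch of how per-API initialization state with an early return avoids re-entering the top-level initializer. All demo_* names, the two stages shown, and the value 64 are hypothetical and only illustrate the pattern.

```c
#include <stdbool.h>

static bool demo_cntl_ready      = false;
static bool demo_mem_ready       = false;
static int  demo_cntl_init_runs  = 0;   /* counts real initializations */
static int  demo_mem_block_size  = 0;

bool demo_cntl_is_initialized(void) { return demo_cntl_ready; }

void demo_cntl_init(void)
{
    /* Early return: a second call is a harmless no-op. */
    if (demo_cntl_is_initialized()) return;
    demo_cntl_init_runs++;
    demo_cntl_ready = true;
}

/* A query routine that needs the cntl stage. It initializes only that
   stage instead of calling the top-level demo_init(), so it is safe to
   call from inside a later initialization stage. */
int demo_query_block_size(void)
{
    demo_cntl_init();
    return 64;  /* placeholder value */
}

void demo_mem_init(void)
{
    if (demo_mem_ready) return;
    /* Called while demo_init() has not yet returned. */
    demo_mem_block_size = demo_query_block_size();
    demo_mem_ready = true;
}

void demo_init(void)
{
    demo_cntl_init();
    demo_mem_init();
}
```

Because the query checks a stage-level flag, the later stage can rely on the earlier one having completed even though the overall initializer is still running.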

Details:
- Replaced the old memory allocator, which was based on statically-allocated arrays, with one based on a new internal pool_t type, which, combined with a new bli_pool_*() API, provides a new abstract data type that implements the same memory pool functionality but with blocks from the heap (i.e., malloc() or equivalent). Hiding the details of the pool in a separate API also allows for a much simpler bli_mem.c family of functions.
- Added a new internal header, bli_config_macro_defs.h, which enables sane defaults for the values previously found in bli_config. Those values can be overridden by #defining them in bli_config.h the same way kernel defaults can be overridden in bli_kernel.h. This file most resembles what was previously a typical configuration's bli_config.h.
- Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which defaults to BLIS_PAGE_SIZE, to specify the alignment of individual blocks in the memory pool. Also added a corresponding query routine to the bli_info API.
- Deprecated (once again) the micro-panel alignment feature. Upon further reflection, it seems that the goal of more predictable L1 cache replacement behavior is outweighed by the harm caused by non-contiguous micro-panels when k % kc != 0. I honestly don't think anyone will even miss this feature.
- Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call bli_cntl_init() instead of bli_init().
- Removed query functions from bli_info.c that are no longer applicable given the dynamic memory allocator.
- Removed unnecessary definitions from configurations' bli_config.h files, which are now pleasantly sparse.
- Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite modules. Thanks to Devangi Parikh for pointing out these miscalculations.
- Comment, whitespace changes.

Details: - Added a new configuration for AMD Excavator-based hardware also known as Carrizo when referring to the entire APU. This configuration uses the same micro-kernels as the piledriver, but with different cache blocksizes.

Details:
- Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors and FMA instructions. (Complex support is currently provided by default induced method, 4m1a.)
- Added a 'haswell' configuration, which uses the aforementioned kernels.
- Inserted auto-detection support for haswell configuration in build/auto-detect/cpuid_x86.c.
- Modified configure script to explicitly echo when automatic or manual configuration is in progress.
- Changed beta scalar in test_gemm.c module of test suite from -1.0 to 0.9.

Details:
- Fixed a typecasting ambiguity in bli_pool_alloc_block() in which pointer arithmetic was performed on a void* as if it were a byte pointer (such as char*). Some compilers may have already been interpreting this situation as intended, despite the sloppiness. Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of siz_t.
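To make the two points concrete: standard C does not define arithmetic on void*, so a byte offset should go through char*, and address alignment is best computed on uintptr_t, which is the integer type intended to hold a converted pointer. The helpers below are hypothetical illustrations, not the BLIS macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Advances p by nbytes. The cast to char* makes the byte granularity
   explicit instead of relying on a compiler extension for void*. */
void *demo_byte_offset(void *p, size_t nbytes)
{
    return (char *)p + nbytes;
}

/* Rounds p up to the next multiple of align, which must be a power of two. */
void *demo_align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)align - 1;
    return (void *)((addr + mask) & ~mask);
}
```

When carving an aligned block out of a larger malloc() buffer, the buffer needs up to align - 1 spare bytes so that the rounded-up address still leaves room for the block.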

Details:
- Expanded/updated interface for bli_get_range_weighted() and bli_get_range() so that the direction of movement is specified in the function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b()) and also so that the object being partitioned is passed instead of an uplo parameter. Updated invocations in level-3 blocked variants, as appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to carefully take into account the location of the diagonal when computing ranges so that the area of each subpartition (which, in all present level-3 operations, is proportional to the amount of computation engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked variants: bli_<oper>_prune_unref_mparts_[mnk]() where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which dimension is being partitioned. These routines call a more basic routine, bli_prune_unref_mparts(), to prune unreferenced/unstored regions from matrices and simultaneously adjust other matrices which share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the new pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new bli_get_range_*() and bli_get_range_weighted_*() under a range of conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim" array field so that dimensions could be queried via index constant (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(), and bli_round_to_mult(), which rounds a value to the nearest multiple of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to bli_obj_row_off(), bli_obj_col_off().
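As a small illustration of rounding to the nearest multiple, here is an integer-only sketch. It is an assumption made for the example: the commit says bli_round() is built on round(), and how bli_round_to_mult() is actually written is not shown here. This version handles non-negative values and rounds ties upward.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds val to the nearest multiple of mult (ties round up).
   Hypothetical helper; not the BLIS macro. */
size_t demo_round_to_mult(size_t val, size_t mult)
{
    assert(mult > 0);
    return ((val + mult / 2) / mult) * mult;
}
```

In a partitioning context this kind of helper lets a computed range width be snapped to a whole number of register blocks.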

Details: - Replaced the old (and short) README file with a much more comprehensive version written in github-flavored markdown. The new file is based on content taken from the old Google Code homepage.

Details:
- Fixed typos in README.md.
- Fixed column heading alignment for testsuite when matlab output is enabled.
- Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.

Details: - Added section to README.md file containing links to wikis with brief descriptions.

Details: - Removed the optional flop-counting feature introduced in commit 7574c99.

Enable Travis CI

Fixed incomplete code in the double precision ARMv8 microkernel.

Details:
- Fixed a family of bugs in the triangular level-3 operations for certain complex implementations (3m1 and 4m1a) that only manifest if one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
  - Fixed incorrect imaginary stride computation in bli_packm_blk_var2() for the triangular case.
  - Fixed the incorrect computation of imaginary stride, as stored in the auxinfo_t struct in trmm and trsm macro-kernels.
  - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the cases where the register blocksize for the triangular matrix is odd. Introduced a new byte-granular pointer arithmetic macro, bli_ptr_add(), that computes the correct value.
- Added cpp macro to bli_macro_defs.h for typeof() operator, defined in terms of __typeof__, which is used by bli_ptr_add() macro.
- Disabled the row- vs. column-storage optimization in bli_trmm_front() for singleton problems because the inherent ambiguity of whether a scalar is row-stored or column-stored causes the wrong parameter combination code to be executed (by dumb luck of our checking for row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference micro-kernels, and trsm_ll macro-kernel.

Details:
- Changed bli_pool_finalize() so that the freeing begins with the block at top_index instead of block 0. This allows us to use the function for terminal finalization as well as temporary cleanup prior to reinitialization. Also, clear the pool_t struct upon _pool_finalize() in case it is called in the terminal case with some blocks still checked out to threads (in which case the threads will see the new block size as 0 and thus release the block as intended).
- Added bli_pool_reinit(), which calls _pool_finalize() followed by _pool_init() with new parameters.
- Added bli_mem_reinit(), which is based on bli_pool_reinit().
- Added new wrapper, _mem_compute_pool_block_sizes(), which calls _mem_compute_pool_block_sizes_dt().
- Updated bli_mem_release() so that the pblk_t is freed, via _pool_free_block(), if the block size recorded in the mem_t at the time the pblk_t was acquired is now different from the value in the pool_t.
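The finalize-from-top_index behaviour and the stale-block check can be sketched with a toy pool. The struct layout, names, and error handling below are hypothetical and much simpler than the real pool_t. The sketch also assumes num_blocks > 0 and does no locking.

```c
#include <stdlib.h>

/* Toy pool: blocks[top_index .. num_blocks-1] are available,
   blocks[0 .. top_index-1] are checked out. */
typedef struct {
    void  **blocks;
    size_t  num_blocks;
    size_t  top_index;
    size_t  block_size;
} demo_pool_t;

int demo_pool_init(demo_pool_t *p, size_t num_blocks, size_t block_size)
{
    p->blocks = malloc(num_blocks * sizeof *p->blocks);
    if (p->blocks == NULL) return -1;
    for (size_t i = 0; i < num_blocks; i++) {
        p->blocks[i] = malloc(block_size);
        if (p->blocks[i] == NULL) {
            while (i--) free(p->blocks[i]);
            free(p->blocks);
            return -1;
        }
    }
    p->num_blocks = num_blocks;
    p->top_index  = 0;
    p->block_size = block_size;
    return 0;
}

void *demo_pool_checkout(demo_pool_t *p)
{
    if (p->top_index == p->num_blocks) return NULL;  /* pool exhausted */
    void *block = p->blocks[p->top_index];
    p->blocks[p->top_index++] = NULL;
    return block;
}

/* acquired_size is the pool's block size at the time of checkout. */
void demo_pool_checkin(demo_pool_t *p, void *block, size_t acquired_size)
{
    /* Stale block (pool finalized or resized since checkout): free it. */
    if (acquired_size != p->block_size || p->top_index == 0) {
        free(block);
        return;
    }
    p->blocks[--p->top_index] = block;
}

void demo_pool_finalize(demo_pool_t *p)
{
    /* Start at top_index: earlier slots belong to blocks still checked out. */
    for (size_t i = p->top_index; i < p->num_blocks; i++) free(p->blocks[i]);
    free(p->blocks);
    /* Clear the struct so outstanding blocks see a block size of 0. */
    p->blocks = NULL;
    p->num_blocks = p->top_index = p->block_size = 0;
}
```

A reinit in this model is simply demo_pool_finalize() followed by demo_pool_init() with the new sizes. Any block checked out before the reinit is freed on check-in because its recorded size no longer matches.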

Details: - Fixed a bug in the relatively new quadratic partitioning code that, under the right conditions, would perform sqrt() on a negative value. If the solution is imaginary, we discard it and use an alternate partition width that assumes no diagonal intersection. That alternate width is actually already computed, so, the fix was quite simple. Thanks to Devangi Parikh for reporting this bug.

Details: - Minor change to quadratic equation solution code that avoids recomputation of the sqrt() parameter when the compiler is not smart enough to perform this optimization automatically.

Details:
- Separated bli_adjust_strides() into _alloc() and _attach() flavors so that the latter can avoid a test performed by the former, in which the rs and cs are overridden and set to zero if either matrix dimension is zero. Actually, we also disable this overriding behavior, even for the _alloc() case, since keeping the original strides (probably) does not hurt anything. The original code has been kept commented-out, though, in case an unintended consequence is later discovered.
- Fixed a typo in an error check for general stride cases where rs == cs.

Details: - Implemented the "beta == 0" case for general stride output for the dunnington sgemm micro-kernel. This case had been, up until now, identical to the "beta != 0" case, which does not work when the output matrix has nan's and inf's. It had manifested as nan residuals in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin Matthews for reporting this bug.
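The reason the two cases must differ is that 0 * NaN is NaN under IEEE arithmetic, so a beta of zero has to overwrite C rather than scale it. The scalar loop below sketches that distinction for general-stride output. The function name and argument layout are hypothetical and bear no relation to the actual dunnington kernel.

```c
#include <stddef.h>

/* Updates an m x n tile of C with row stride rs and column stride cs:
   C := beta * C + AB, where ab holds the already-computed (alpha-scaled)
   product in column-major order with leading dimension m. */
void demo_update_c(float *c, size_t m, size_t n, size_t rs, size_t cs,
                   float beta, const float *ab)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            float *cij = c + i * rs + j * cs;
            if (beta == 0.0f) {
                /* Overwrite: never read C, which may hold NaN or Inf. */
                *cij = ab[i + j * m];
            } else {
                *cij = beta * (*cij) + ab[i + j * m];
            }
        }
    }
}
```

Calling this with beta == 0.0f on a tile full of NaNs yields exactly the values in ab, whereas the unconditional scaling form would leave every entry NaN.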

Details:
- Applied a patch submitted by Devin Matthews that:
  - implements subtle changes to handling of somewhat unusual cases of row and column strides to accommodate certain tensor cases, which includes adding dimension parameters to _is_col_tilted() and _is_row_tilted() macros,
  - simplifies how buffers are sized when requesting BLIS-allocated objects,
  - re-consolidates bli_adjust_strides_*() into one function, and
  - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99 environments.

Details:
- Consolidated the two blocked variants for packm into a single implementation (packm_blk_var1) and removed the other variant.
- Updated all induced method _cntl_init() functions in frame/cntl/ind/ to use the new blocked variant 1.
- Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(), to detect pack_t schemas for induced methods and native execution, respectively.

Use unaligned vmovups for accessing matrix C.

fgvanzee and others added 30 commits May 24, 2015 16:02

Backed-out adjusted dim changes to test/3m4m.

590bb3b

Details: - Reverted most changes applied during commit ec25807.

Minor cleanup to bli_init() and friends.

b6ee82a

Details:
- Spun-off initialization of global scalar constants to bli_const_init() and of threading stuff to bli_thread_init().
- Added some missing _finalize() functions, even when there is nothing to do.

Minor update to test/3m4m/runme.sh.

d62ceec

Details: - Removed some stale script code that should have been removed during 590bb3b.

Minor updates to test/3m4m files.

9135dfd

Added early return to API-level _init() routines.

9848f25

Details: - Added conditional code that returns early from the API-level _init() routines if the API is already initialized. Actually meant for this to be included in 5f93cbe.

Version file update (0.1.7)

267253d

CHANGELOG update (0.1.7)

0b7255a

Added 'carrizo' configuration.

d4b8913

Details: - Added a new configuration for AMD Excavator-based hardware also known as Carrizo when referring to the entire APU. This configuration uses the same micro-kernels as the piledriver, but with different cache blocksizes.

Merge branch 'master' of github.com:flame/blis

ef0fbbb

Version file update (0.1.8)

47caa33

CHANGELOG update (0.1.8)

ecc3ebb

Add Travis CI.

12ffd56

Try to fix the compiling bug on travis.

efa641e

Merge branch 'upstream_master'

fe3e355

Replaced README with README.md.

bbebdb5

Details: - Replaced the old (and short) README file with a much more comprehensive version written in github-flavored markdown. The new file is based on content taken from the old Google Code homepage.

Minor edits to README.md, testsuite.

5532990

Details:
- Fixed typos in README.md.
- Fixed column heading alignment for testsuite when matlab output is enabled.
- Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.

Minor updates to CREDITS, README files.

e7e1f2f

Added "Getting Started" section to README.md.

d170574

Details: - Added section to README.md file containing links to wikis with brief descriptions.

Minor formatting change to README.md.

276da36

Removed flop-counting mechanism.

77ddb0b

Details: - Removed the optional flop-counting feature introduced in commit 7574c99.

Merge branch 'upstream_master'

4b0ac1a

Detect Intel Broadwell (using Haswell config).

4f88c29

Merge pull request flame#33 from xianyi/master

7e03e45

Enable Travis CI

fgvanzee and others added 16 commits October 21, 2015 14:53

Use vzeroall in haswell micro-kernels.

b489152

Merge branch 'master' of github.com:flame/blis

d3159c5

Fixed incomplete code in the double precision ARMv8 microkernel.

a0a7b85

Merge pull request flame#35 from figual/master

46294d8

Fixed incomplete code in the double precision ARMv8 microkernel.

add Travis CI build status icon to the README

33557ec

Merge branch 'master' of github.com:flame/blis

0694b72

Minor re-expression in quadratic partitioning code.

3e6dd11

Details: - Minor change to quadratic equation solution code that avoids recomputation of the sqrt() parameter when the compiler is not smart enough to perform this optimization automatically.

Merge branch 'upstream_master'

fcdd9d1

Relax the condition for dgemm AVX micro kernel column store branch.

3b358fa

Use unaligned vmovups for accessing matrix C.
