forked from flame/blis
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OSX threading #1
Open
gaming-hacker
wants to merge
46
commits into
scibuilder:master
Choose a base branch
from
xianyi:master
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Details: - Reverted most changes applied during commit ec25807.
Details: - Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex was used to lock access to initialization/finalization actions. But everything worked out okay as long as bli_init() was called by single-threaded code. - Changed to static initialization for memory allocator mutex in bli_mem.c, and moved mutex to that file (from bli_init.c). - Fixed some type mismatches in bli_threading_pthreads.c that resulted in compiler warnings. - Fixed a small memory leak with allocated-but-never-freed (and unused) pthread_attr_t objects. - Whitespace changes to bli_init.c and bli_mem.c.
Details: - Spun-off initialization of global scalar constants to bli_const_init() and of threading stuff to bli_thread_init(). - Added some missing _finalize() functions, even when there is nothing to do.
Details: - Removed some stale script code that should have been removed during 590bb3b.
Details: - Fixed some bugs that only manifested in multithreaded instances of some (non-gemm) level-3 operations. The bugs were related to invalid allocation of "edge" cases to thread subpartitions. (Here, we define an "edge" case to be one where the dimension being partitioned for parallelism is not a whole multiple of whatever register blocksize is needed in that dimension.) In BLIS, we always require edge cases to be part of the bottom, right, or bottom-right subpartitions. (This is so that zero-padding only has to happen at the bottom, right, or bottom-right edges of micro-panels.) The previous implementations of bli_get_range() and _get_range_weighted() did not adhere to this implicit policy and thus produced bad ranges for some combinations of operation, parameter cases, problem sizes, and n-way parallelism. - As part of the above fix, the functions bli_get_range() and _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b, and _b2t suffixes, similar to the partitioning functions. This is an easy way to make sure that the variants are calling the right version of each function. The function signatures have also been changed slightly. - Comment/whitespace updates. - Removed unnecessary '/' from macros in bli_obj_macro_defs.h.
Details: - Added API-level initialization state to _const, _error, _mem, _thread, _ind, and _cntl APIs. While this functionality will mostly go unused, adding miniscule overhead at init-time, there will be at least once instance in the near future where, in order to avoid an infinite loop, a certain portion of the initialization will call a query function that itself attempts to call bli_init(). API-level initialization will allow this later stage to verify that an earlier stage of initialization has completed, even if the overall call to bli_init() has not yet returned. - Added _is_initialized() functions for each API, setting the underlying bool_t during _init() and unsetting it during _finalize(). - Comment, whitespace changes.
Details: - Added conditional code that returns early from the API-level _init() routines if the API is already initialized. Actually meant for this to be included in 5f93cbe.
Details: - Replaced the old memory allocator, which was based on statically- allocated arrays, with one based on a new internal pool_t type, which, combined with a new bli_pool_*() API, provides a new abstract data type that implements the same memory pool functionality but with blocks from the heap (ie: malloc() or equivalent). Hiding the details of the pool in a separate API also allows for a much simpler bli_mem.c family of functions. - Added a new internal header, bli_config_macro_defs.h, which enables sane defaults for the values previously found in bli_config. Those values can be overridden by #defining them in bli_config.h the same way kernel defaults can be overridden in bli_kernel.h. This file most resembles what was previously a typical configuration's bli_config.h. - Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which defaults to BLIS_PAGE_SIZE, to specify the alignment of individual blocks in the memory pool. Also added a corresponding query routine to the bli_info API. - Deprecated (once again) the micro-panel alignment feature. Upon further reflection, it seems that the goal of more predictable L1 cache replacement behavior is outweighed by the harm caused by non-contiguous micro-panels when k % kc != 0. I honestly don't think anyone will even miss this feature. - Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call bli_cntl_init() instead of bli_init(). - Removed query functions from bli_info.c that are no longer applicable given the dynamic memory allocator. - Removed unnecessary definitions from configurations' bli_config.h files, which are now pleasantly sparse. - Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite modules. Thanks to Devangi Parikh for pointing out these miscalculations. - Comment, whitespace changes.
Details: - Added a new configuration for AMD Excavator-based hardware also known as Carrizo when referring to the entire APU. This configuration uses the same micro-kernels as the piledriver, but with different cache blocksizes.
Details: - Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors and FMA instructions. (Complex support is currently provided by default induced method, 4m1a.) - Added a 'haswell' configuration, which uses the aforementioned kernels. - Inserted auto-detection support for haswell configuration in build/auto-detect/cpuid_x86.c. - Modified configure script to explicitly echo when automatic or manual configuration is in progress. - Changed beta scalar in test_gemm.c module of test suite to -1.0 to 0.9.
Details: - Fixed a typecasting ambiguity in bli_pool_alloc_block() in which pointer arithmetic was performed on a void* as if it were a byte pointer (such as char*). Some compilers may have already been interpreting this situation as intended, despite the sloppiness. Thanks to Aleksei Rechinskii for reporting this issue. - Redefined pointer alignment macros to typecast to uintptr_t instead of siz_t.
Details: - Expanded/updated interface for bli_get_range_weighted() and bli_get_range() so that the direction of movement is specified in the function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b()) and also so that the object being partitioned is passed instead of an uplo parameter. Updated invocations in level-3 blocked variants, as appropriate. - (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to carefully take into account the location of the diagonal when computing ranges so that the area of each subpartition (which, in all present level-3 operations, is proportional to the amount of computation engendered) is as equal as possible. - Added calls to a new class of routines to all non-gemm level-3 blocked variants: bli_<oper>_prune_unref_mparts_[mnk]() where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which dimension is being partitioned. These routines call a more basic routine, bli_prune_unref_mparts(), to prune unreferenced/unstored regions from matrices and simultaneously adjust other matrices which share the same dimension accordingly. - Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of more the new pruning routines. - Fixed incorrect blocking factors passed into bli_get_range_*() in bli_trsm_blk_var[12][fb].c - Added a new test driver in test/thread_ranges that can exercise the new bli_get_range_*() and bli_get_range_weighted_*() under a range of conditions. - Reimplemented m and n fields of obj_t as elements in a "dim" array field so that dimensions could be queried via index constant (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification macros accordingly. - Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values. - Added bli_round() macro, which calls C math library function round(), and bli_round_to_mult(), which rounds a value to the nearest multiple of some other value. - Added miscellaneous pruning- and mdim_t-related macros. - Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to bli_obj_row_off(), bli_obj_col_off().
Details: - Replaced the old (and short) README file with a much more comprehensive version written in github-flavored markdown. The new file is based on content taken from the old Google Code homepage.
Details: - Fixed typos in README.md. - Fixed column heading alignment for testsuite when matlab output is enabled. - Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.
Details: - Added section to README.md file containing links to wikis with brief descriptions.
Details: - Removed the optional flop-counting feature introduced in commit 7574c99.
Enable Travis CI
Fixed incomplete code in the double precision ARMv8 microkernel.
Details: - Fixed a family of bugs in the triangular level-3 operations for certain complex implementations (3m1 and 4m1a) that only manifest if one of the register blocksizes (PACKMR/PACKNR, actually) is odd: - Fixed incorrect imaginary stride computation in bli_packm_blk_var2() for the triangular case. - Fixed the incorrect computation of imaginary stride, as stored in the auxinfo_t struct in trmm and trsm macro-kernels. - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the cases where the the register blocksize for the triangular matrix is odd. Introduced a new byte-granular pointer arithmetic macro, bli_ptr_add(), that computes the correct value. - Added cpp macro to bli_macro_defs.h for typeof() operator, defined in terms of __typeof__, which is used by bli_ptr_add() macro. - Disabled the row- vs. column-storage optimization in bli_trmm_front() for singleton problems because the inherent ambiguity of whether a scalar is row-stored or column-stored causes the wrong parameter combination code to be executed (by dumb luck of our checking for row storage first). - Added commented-out debugging lines to 3m1/4m1a and reference micro-kernels, and trsm_ll macro-kernel.
Details: - Changed bli_pool_finalize() so that the freeing begins with the block at top_index instead of block 0. This allows us to use the function for terminal finalization as well as temporary cleanup prior to reinitialization. Also, clear the pool_t struct upon _pool_finalize() in case it is called in the terminal case with some blocks still checked out to threads (in which case the threads will see the new block size as 0 and thus release the block as intended). - Added bli_pool_reinit(), which calls _pool_finalize() followed by _pool_init() with new parameters. - Added bli_mem_reinit(), which is based on bli_pool_reinit(). - Added new wrapper, _mem_compute_pool_block_sizes(), which calls _mem_compute_pool_block_sizes_dt(). - Updated bli_mem_release() so that the pblk_t is freed, via _pool_free_block(), if the block size recorded in the mem_t at the time the pblk_t was acquired is now different from the value in the pool_t.
Details: - Fixed a bug in the relatively new quadratic partitioning code that, under the right conditions, would perform sqrt() on a negative value. If the solution is imaginary, we discard it and use an alternate partition width that assumes no diagonal intersection. That alternate width is actually already computed, so, the fix was quite simple. Thanks to Devangi Parikh for reporting this bug.
Details: - Minor change to quadratic equation solution code that avoids recomputation of the sqrt() parameter when the compiler is not smart enough to perform this optimization automatically.
Details: - Separated bli_adjust_strides() into _alloc() and _attach() flavors so that the latter can avoid a test performed by the former, in which the rs and cs are overridden and set to zero if either matrix dimension is zero. Actually, we also disable this overridding behavior, even for the _alloc() case, since keeping the original strides (probably) does not hurt anything. The original code has been kept commented-out, though, in case an unintended consequence is later discovered. - Fixed a typo in an error check for general stride cases where rs == cs.
Details: - Implemented the "beta == 0" case for general stride output for the dunnington sgemm micro-kernel. This case had been, up until now, identical to the "beta != 0" case, which does not work when the output matrix has nan's and inf's. It had manifested as nan residuals in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin Matthews for reporting this bug.
Details: - Applied a patch submitted by Devin Matthews that: - implements subtle changes to handling of somewhat unusual cases of row and column strides to accommodate certail tensor cases, which includes adding dimension parameters to _is_col_tilted() and _is_row_tilted() macros, - simplifies how buffers are sized when requested BLIS-allocated objects, - re-consolidates bli_adjust_strides_*() into one function, and - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99 environments.
Details: - Consolidated the two blocked variants for packm into a single implementation (packm_blk_var1) and removed the other variant. - Updated all induced method _cntl_init() functions in frame/cntl/ind/ to use the new blocked variant 1. - Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(), to detect pack_t schemas for induced methods and native execution, respectively.
Use unaligned vmovups for accessing matrix C.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I'm trying to integrate yours fixes for OSX into xiany/branch but for some reason the new threading model and the OSX thread_barrier don't like each other.