133 Commits

Author SHA1 Message Date
Tomasz Sobczyk fafb9557a8 Get train loss from update_parameters. 2020-12-02 08:56:20 +09:00
Tomasz Sobczyk 256c4b55ec Properly apply gradient norm clipping after it's scaled in the update_parameters. 2020-12-02 08:56:20 +09:00
Tomasz Sobczyk 539bd2d1c8 Replace the old loss/grad calculation completely. 2020-12-02 08:56:20 +09:00
Tomasz Sobczyk b71d1e8620 Pass the new loss function to update_parameters 2020-12-02 08:56:20 +09:00
Tomasz Sobczyk 1322a9a5fd Prevent false sharing of num_calls counter in the shared input trainer. Fix current_operation not being local to the executing thread. 2020-11-30 08:54:53 +09:00
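
The false-sharing fix named in the commit above can be illustrated with a minimal sketch. It is not the trainer's actual code: `PaddedCounter`, `kCacheLineSize`, `bump`, and the `current_operation()` accessor are hypothetical names, and the 64-byte cache line is an assumption that holds on most x86-64 CPUs. The idea is that each thread's counter sits on its own cache line, so a write by one thread does not invalidate a line another thread is using, and that per-thread state is declared `thread_local`.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical per-thread counter. alignas pads the struct to a full cache
// line, so neighbouring array elements never share a line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "each counter should occupy exactly one cache line");

// Each thread increments only its own slot.
inline void bump(PaddedCounter* counters, std::size_t thread_index) {
    ++counters[thread_index].value;
}

// Hypothetical stand-in for state that must be local to the executing thread:
// every thread sees its own copy of `op`.
inline int& current_operation() {
    thread_local int op = 0;
    return op;
}
```

With an array such as `PaddedCounter counters[num_threads]`, thread `i` would call `bump(counters, i)`. The cost is memory: each 8-byte counter takes 64 bytes.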
Tomasz Sobczyk 2aa7f5290e Fix comparison of integers with different signedness. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk a97b65eaef Fix compilation error with USE_BLAS 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 622e0b14c2 Remove superfluous example shuffling. Shuffling now only happens on reading. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 34510dd08a Remove used examples asynchronously. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 0bee8fef64 Don't unnecessarily copy the batch part. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk e954b14196 Prefetch weights for feature transformer backprop to shared cache. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 49b2dcb1f3 Preallocate memory for unique_features. Keep the training_features temporary buffer as a thread_local so we reuse the storage. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 1c8495b54b Remove handwritten saxpy because compilers optimize the second loop anyway. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 15c528ca7b Prepare feature transformer learner. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk a3c78691a2 Prepare input slice trainer. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 401fc0fbab Prepare clipped relu trainer. 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk cc11375f6d Skeleton for new evaluate learner 2020-11-30 08:54:53 +09:00
Tomasz Sobczyk 0d4b803b08 Prepare trainer affine transform. 2020-11-30 08:54:53 +09:00
noobpwnftw 0b2ae6cb64 Merge remote-tracking branch 'remotes/official/master' into merge 2020-11-28 06:47:04 +08:00
MaximMolchanov 7615e3485e Calculate sum from first elements
in affine transform for AVX512/AVX2/SSSE3

The idea is to initialize sum with the first element instead of zero.
This saves one add_epi32 and one set_zero SIMD instruction for each output dimension.

sum = 0; for i = 1 to n sum += a[i] ->
sum = a[1]; for i = 2 to n sum += a[i]

STC:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 69048 W: 7024 L: 6799 D: 55225
Ptnml(0-2): 260, 5175, 23458, 5342, 289
https://tests.stockfishchess.org/tests/view/5faf2cf467cbf42301d6aa06

closes https://github.com/official-stockfish/Stockfish/pull/3227

No functional change.
2020-11-25 21:10:13 +01:00
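
A scalar sketch of the idea in the commit above, for illustration only. The actual change operates on SIMD registers in the affine transform; the function names `sum_from_zero` and `sum_from_first` are hypothetical. Seeding the accumulator with the first element removes the zeroing step and one addition, and gives the same result whenever there is at least one element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Baseline: zero the accumulator, then perform n additions.
inline std::int32_t sum_from_zero(const std::int32_t* a, std::size_t n) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

// Variant: seed the accumulator with the first element, then perform
// n - 1 additions. Requires n >= 1.
inline std::int32_t sum_from_first(const std::int32_t* a, std::size_t n) {
    assert(n >= 1);
    std::int32_t sum = a[0];
    for (std::size_t i = 1; i < n; ++i)
        sum += a[i];
    return sum;
}
```

The saving is a constant per output dimension, so it matters most for the small layers where the loop body runs only a few times.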
Stéphane Nicolet 027626db1e Small cleanups 13
No functional change
2020-11-23 22:20:32 +01:00
noobpwnftw c29554a120 Merge remote-tracking branch 'remotes/official/master' into master
Bench: 3597730
2020-11-23 04:27:12 +08:00
JWmer 3975fc9c0d Update half_relative_ka.cpp 2020-11-22 07:45:39 +09:00
JWmer b0429237a8 Update half_ka.cpp 2020-11-22 07:45:39 +09:00
JWmer ea70e378cd Update a.cpp 2020-11-22 07:45:39 +09:00
JWmer be4cd56146 Update half_kp.cpp 2020-11-22 07:45:39 +09:00
JWmer 021f47b00e Update half_relative_kp.cpp 2020-11-22 07:45:39 +09:00
JWmer 36c801699f Update k.cpp 2020-11-22 07:45:39 +09:00
JWmer 5b3e9b0eb3 Update p.cpp 2020-11-22 07:45:39 +09:00
JWmer c04c5b6658 Update nnue_common.h 2020-11-22 07:45:39 +09:00
JWmer b27c51b5cf Delete k-p-cr-ep_256x2-32-32.h 2020-11-22 07:45:39 +09:00
JWmer 72fee2f7a4 Delete k-p-cr_256x2-32-32.h 2020-11-22 07:45:39 +09:00
JWmer d9dcdc2b73 Delete k-p_256x2-32-32.h 2020-11-22 07:45:39 +09:00
Tomasz Sobczyk 691da3bdad Add more information for factorizers at the start of training. 2020-11-14 18:47:22 +09:00
Tomasz Sobczyk 4e1653d53a Fix reliance on transitive includes for factorizers in trainer feature transformer. Add a file that includes all factorizers. 2020-11-14 12:35:12 +09:00
Tomasz Sobczyk ba35c88ab8 AVX-512 for smaller affine and feature transforms.
For the feature transformer the code is analogous to AVX2, since there was room for easy adaptation to wider SIMD registers.

For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.

These changes provide about 1.5% speedup for AVX-512 builds.

Closes https://github.com/official-stockfish/Stockfish/pull/3218

No functional change.
2020-11-07 16:49:49 +01:00
Tomasz Sobczyk 3f6451eff7 Manually align arrays on the stack
as a workaround for issues with overaligned alignas() on stack variables in GCC < 9.3 on Windows.

closes https://github.com/official-stockfish/Stockfish/pull/3217

fixes #3216

No functional change
2020-11-04 19:52:42 +01:00
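
The workaround described in the commit above can be sketched as follows. This is an illustration under assumptions, not the committed code: the helper `align_ptr_up` and the function `demo_aligned_buffer` are hypothetical names. Instead of relying on `alignas()` for a stack array, the array is over-allocated by one alignment's worth of bytes and a pointer into it is rounded up to the next aligned address.

```cpp
#include <cstddef>
#include <cstdint>

// Round a pointer up to the next multiple of Alignment (a power of two).
template <std::uintptr_t Alignment, typename T>
inline T* align_ptr_up(T* ptr) {
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
    return reinterpret_cast<T*>((addr + (Alignment - 1)) & ~(Alignment - 1));
}

// Usage: obtain a 64-byte-aligned view of kCount integers inside a plain
// stack array that carries no alignas() specifier.
inline bool demo_aligned_buffer() {
    constexpr std::uintptr_t kAlignment = 64;
    constexpr std::size_t kCount = 32;
    constexpr std::size_t kExtra = kAlignment / sizeof(std::int32_t);

    std::int32_t raw[kCount + kExtra];  // over-allocated backing storage
    std::int32_t* buf = align_ptr_up<kAlignment>(raw);

    for (std::size_t i = 0; i < kCount; ++i)
        buf[i] = static_cast<std::int32_t>(i);

    const bool aligned = reinterpret_cast<std::uintptr_t>(buf) % kAlignment == 0;
    const bool in_bounds = buf + kCount <= raw + kCount + kExtra;
    return aligned && in_bounds && buf[kCount - 1] == 31;
}
```

The trade-off is a few wasted bytes of stack per array in exchange for not depending on the compiler honouring over-aligned stack variables.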
Tomasz Sobczyk 75e06a1c89 Optimize affine transform for SSSE3 and higher targets.
A non-functional speedup. Unroll the loops going over
the output dimensions in the affine transform layers by
a factor of 4 and perform 4 horizontal additions at a time.
Instead of doing naive horizontal additions on each vector
separately, use hadd and shuffling between vectors to reduce
the number of instructions by using all lanes for all stages
of the horizontal adds.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

closes https://github.com/official-stockfish/Stockfish/pull/3203

No functional change
2020-11-02 19:41:17 +01:00
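
The combined horizontal addition described in the commit above can be modelled in portable scalar code. This is a sketch of the 128-bit pattern only, not the committed implementation: `Vec4`, `hadd`, and `hadd_4x` are hypothetical names, and `hadd` is written to mimic the lane behaviour of the SSSE3 intrinsic `_mm_hadd_epi32`. Reducing four vectors one at a time would leave most lanes idle in each step; pairing them keeps every lane busy, so three hadd steps yield all four sums.

```cpp
#include <array>
#include <cstdint>

// Scalar stand-in for a 128-bit register holding four 32-bit integers.
using Vec4 = std::array<std::int32_t, 4>;

// Models _mm_hadd_epi32(a, b): adjacent pairs of a fill the low two lanes,
// adjacent pairs of b fill the high two lanes.
inline Vec4 hadd(const Vec4& a, const Vec4& b) {
    return Vec4{{a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]}};
}

// Reduces four accumulators at once. Lane i of the result is the sum of
// all four lanes of s_i.
inline Vec4 hadd_4x(const Vec4& s0, const Vec4& s1,
                    const Vec4& s2, const Vec4& s3) {
    const Vec4 t01 = hadd(s0, s1);  // partial sums of s0 and s1
    const Vec4 t23 = hadd(s2, s3);  // partial sums of s2 and s3
    return hadd(t01, t23);          // {sum(s0), sum(s1), sum(s2), sum(s3)}
}
```

The result lands directly in the layout needed to add four biases and store four outputs, which fits the commit's choice to unroll the output loop by a factor of 4.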
Tomasz Sobczyk 987b6c98d4 Move the observed feature collection to the threaded part now that it can be done safely. 2020-11-01 11:02:44 +09:00
Tomasz Sobczyk e8907bcfc4 Replace omp in trainer_feature_transformer 2020-10-31 11:54:03 +09:00
Tomasz Sobczyk db1b33d4ac Optimize trainer clipped relu propagate 2020-10-31 11:52:51 +09:00
Tomasz Sobczyk b5714c4084 Parallelize input slice trainer backprop. 2020-10-31 11:52:26 +09:00
Tomasz Sobczyk 941897ff2c Optimize trainer clipped relu backpropagate. 2020-10-31 11:50:12 +09:00
Tomasz Sobczyk c96743c5bd Optimize feature transformer backpropagation stats. 2020-10-31 11:49:29 +09:00
Tomasz Sobczyk 2c10b1babc Optimize feature transformer clipped relu. 2020-10-31 11:48:02 +09:00
Tomasz Sobczyk a56d8124d8 Replace non-blas parts of trainers with our own blas-like routines. 2020-10-31 08:36:58 +09:00
Tomasz Sobczyk ee0917a345 Pass ThreadPool to update_parameters, propagate, and backpropagate. 2020-10-29 09:21:19 +09:00
Tomasz Sobczyk f1e96cab55 Align trainer arrays to cache line. 2020-10-29 09:12:50 +09:00
Tomasz Sobczyk ec9e49e875 Add a HalfKA architecture (a product of K - king, and A - any piece) along with all required infrastructure. HalfKA doesn't discriminate kings compared to HalfKP. Keep old architecture as the default one. 2020-10-29 09:10:01 +09:00
Tomasz Sobczyk 317fda2516 Cleanup eval saving and lr scheduling. 2020-10-28 23:08:05 +09:00