http://gcc.gnu.org/bugzilla/show_bug.cgi?id=61121
Bug ID: 61121
Summary: -ftree-parallelize-loops=n (n as value) not accepted in 4.9.0
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jmichae3 at yahoo dot com

https://groups.google.com/forum/#!topic/gnu.gcc.help/T1guYK8-z70 just says that -O2 is needed for things like -floop-parallelize-all, -ftree-parallelize-loops=4 and probably similar functionality.

I have this on my command line to multiply the speed of my code (I should do some runtime tests):

-ftree-parallelize-loops=12 -floop-parallelize-all -ftree-loop-vectorize -ftree-slp-vectorize -O2

What is unclear to me from the manual is whether these switches can be combined this way (-ftree-parallelize-loops=12 with -floop-parallelize-all, and -ftree-loop-vectorize with -ftree-slp-vectorize). I wanted all the parallelization I could get. The wiki at http://gcc.gnu.org/wiki/AutoParInGCC doesn't say much.

http://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Optimize-Options.html#Optimize-Options (the gcc optimization options manual) doesn't completely match what 4.9.0 is doing: it does not take -ftree-parallelize-loops=n (n as a literal value), but it does take 12 as a value. My machine has 12 threads. Other people's machines might have only 1 thread/core. Some HPC servers have 80+ threads and 1-4 procs or plenty more (fridge boxes).

The manual should say that you need -O2 to use the options that require it, because joe dev can't figure this out. I wish the -O2 were -Ofast. I desperately need my code to be as fast as possible, as it has long runtimes (but I would understand if it's because too much optimization would remove that kind of optimization). I have plenty of RAM; most of us do. That's for personal use. For general public use I would drop the number of threads down to 4, maybe even 2, just using numbers.
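For reference, here is a minimal sketch of the kind of loop these switches are aimed at. The GCC manual describes -ftree-parallelize-loops as applying only to loops whose iterations are independent and can be reordered. The function name scaled_sum and the file name example.cpp are made up for illustration, and whether a given GCC build actually parallelizes or vectorizes this loop depends on the version and its cost model, so nothing here is a measured result.

```cpp
// Build line copied from the report above (example.cpp is a hypothetical file name):
//   g++ -ftree-parallelize-loops=12 -floop-parallelize-all \
//       -ftree-loop-vectorize -ftree-slp-vectorize -O2 example.cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example function: computes out[i] = k * a[i] + b[i].
// Each iteration reads only a[i] and b[i] and writes only out[i], so no
// iteration depends on the result of another one.
std::vector<double> scaled_sum(const std::vector<double>& a,
                               const std::vector<double>& b,
                               double k) {
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = k * a[i] + b[i];
    }
    return out;
}
```

A loop that carried a value from one iteration to the next (a running total stored back into the array, say) would not have this independence property.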
I would very much like -ftree-parallelize-loops=n or -ftree-parallelize-loops=0 to work just as well as -ftree-parallelize-loops=12 does, the way the documentation shows, but they do not.

It is possible to query the OS for the number of processor threads and then allocate that many, or a percentage of the CPU threads/cores reported by the OS, for example leaving 1 or 2 behind for the system or for general performance purposes. Use all of the threads and the system might not let you do much else very well (it's workable, just slow). But allocating a fixed number of threads presumes you know how many threads are on every user's box, and you can't assume that. Some CPUs have 1, 2, 3, 4, 6, 8, 12, 20, or 30 threads, and HPC servers with multiple CPUs could have 120+ threads with 4+ procs.

So please put in support not just for numbers (which have their uses, e.g. for the 80-thread box) but also for what the manual says: n or 0 does auto-sizing. If you are using atoi() to get the number of threads, you can use the value 0 as an "auto" if it needs to be only a number. But that 0 or n as an actual value needs to be documented, with 2 examples: one using n and one using 4, or one using 0 and one using 4.
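The request above can be sketched roughly as follows. This is not GCC's option-handling code; resolve_thread_count, resolve_for_this_machine and the reserve parameter are hypothetical names invented for this illustration. It also swaps atoi() for strtoul(), because atoi() returns 0 for unparsable text and so could not tell an intentional "0" from a typo. The hardware query uses std::thread::hardware_concurrency(), which is allowed to return 0 when the count is unknown.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <cstring>
#include <thread>

// Hypothetical helper: turn the text after "=" into a thread count.
//   "0" or "n"  -> auto: detected threads minus `reserve`, but at least 1
//   "4", "12"   -> that exact number
//   other text  -> 0, meaning "invalid argument"
// `detected` is passed in so the logic can be checked without real hardware.
unsigned resolve_thread_count(const char* arg, unsigned detected, unsigned reserve) {
    assert(arg != nullptr);
    unsigned long requested = 0;  // 0 stands for "auto"
    if (std::strcmp(arg, "n") != 0) {
        // Require a leading digit so that "", " 4" and "-3" are rejected.
        if (!std::isdigit(static_cast<unsigned char>(arg[0]))) return 0;
        char* end = nullptr;
        errno = 0;
        requested = std::strtoul(arg, &end, 10);
        if (*end != '\0' || errno == ERANGE || requested > UINT_MAX) return 0;
    }
    if (requested != 0) return static_cast<unsigned>(requested);
    if (detected == 0) return 1;  // thread count unknown: fall back to one thread
    return detected > reserve ? detected - reserve : 1;
}

// Same as above, but asks the C++ runtime how many hardware threads exist.
// By default one thread is left behind for the rest of the system.
unsigned resolve_for_this_machine(const char* arg, unsigned reserve = 1) {
    return resolve_thread_count(arg, std::thread::hardware_concurrency(), reserve);
}
```

On a 12-thread box with one thread held back, "0" and "n" would both resolve to 11 under this scheme, while "4" stays 4 everywhere, which is the pair of cases the documentation examples would need to cover.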