[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028105#comment-17028105 ]
Rinka Singh edited comment on LUCENE-7745 at 2/1/20 6:13 PM:
-------------------------------------------------------------

Another update. Sorry about the delay. It took me significantly longer than I anticipated (multiple race conditions, debugging on the GPU, etc.), but I think I have the histogram running:
* Created a sorted histogram with word counts. The largest file tested is about 5 MB (~436K words).
* The histogram also does some math (median, mean, standard deviation, etc.), but I haven't optimized that part and it is horrendously slow on large data, so I just commented that code out; it isn't important anyway.
* Applied stop words. I have a stop-word list of ~4.4K words.
* Performance (this is a debug build, and the numbers are from runs under gdb):
** Quadro 2000 GPU (192 cores) + Intel Dual Core + Kubuntu 14.04, CUDA 7.5: 765 sec
** GeForce GTX 780 (2304 cores) on an i5, Kubuntu 16.04, CUDA 8.0: 640 sec
* I have done some performance optimization (registers and shared memory), but there's a lot more that can be done. I suspect I can improve the speed by at least 5x, if not more.
** Applying the stop words can be optimized further, but I assume that is not so critical since the index will be updated infrequently. At this point it is in the code path and contributes to the 644 sec.
** Algorithm optimization can give quite a bit of bang for the buck.
** I :) "invented" my own sort (a parallel version of selection sort, within and across multiple chunks). I'd need to do more experimenting here.
** :) I have yet to compile and test a production build. Testing was done exclusively with debug builds running under gdb.
** The whole thing is sequential at a high level: read the file into the GPU, break it into chunks, apply stop words, sort. This can be parallelized significantly; reading data and sorting can be overlapped.
* I also don't have access to high-end GPUs (such as a V100: [https://www.nvidia.com/en-us/data-center/v100/]). A high-end GPU should give another significant performance boost on top of the optimization I can do. To give you an idea: I remember testing a very old version of this about a year ago on a K80-based machine and seeing something like a 2x boost.
* Going forward, I will find it very difficult to commit to timelines, as everything seems to take me something like 7-10x the time it would take me on a CPU. The reasons are many:
** GPU development is inherently much, much slower; thinking the design through takes at least 3-4x more time. I dumped many alternative designs halfway through development, something that would never happen to me in CPU-based development.
** Debugging is significantly slower, despite the CUDA tools.
** Race conditions have bitten me multiple times, and each time cost me weeks or months.
** And finally, of course, there is my own limitation: transitioning back into being a developer. It took me quite a while, and I am still not at the level I was at as a developer with 10 years of experience (so long ago).

I'll release the code on my GitHub in a day or two, with instructions on how to compile and run it (I'll include both the data and the stop-word files), and I'll put the link here.

*I'd love to hear how these numbers compare to running an equivalent histogram on a CPU cluster. Can someone please run this and let me know?* Also, *if someone can provide me with a V100-based instance* (even an AWS instance is fine), I can run it there as-is and generate some numbers.

Underlying assumption: the code is working correctly. This can be a bad assumption, since I have done just enough testing to process one file; the code works with a small set of boundary conditions, and this is not something I would deploy at this point. I was more focused on getting it out than on extensive testing.
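For anyone who wants to produce the CPU comparison I'm asking for, here is a minimal single-threaded sketch of the equivalent computation (sorted word-count histogram with stop-word filtering, plus the mean/median/std. deviation math). This is illustrative only; it is not the CUDA code, and the tokenizer and stop-word set are placeholders:

```python
import re
import statistics
from collections import Counter

def word_histogram(text, stopwords):
    """Tokenize, drop stop words, and return (word, count) pairs
    sorted by count, highest first -- a sorted histogram."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common()

def histogram_stats(hist):
    """Median, mean, and std. deviation over the per-word counts."""
    counts = [c for _, c in hist]
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "stdev": statistics.stdev(counts) if len(counts) > 1 else 0.0,
    }

if __name__ == "__main__":
    stop = {"the", "a", "and"}  # placeholder; real list is ~4.4K words
    hist = word_histogram("the cat and the other cat", stop)
    print(hist)  # [('cat', 2), ('other', 1)]
```

Timing this over the same ~5 MB input (e.g. with `time python3 histogram.py`) would give a rough single-core baseline to put the GPU numbers in context.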
Next steps. I'll start updating the code to:
* Put this on GitHub and do some more measurements.
* Implement an inverted index for a single file.
* Then for multiple files.
* Finally, set it up so that you can send queries to this inverted index running on the GPU...
* :) Testing, of course.

*But I'd like feedback on this* before going down this rabbit hole. Also, once I have an initial inverted index in place, I'd love to have contributors. There is a lot that even a CPU developer can contribute; one look at the code will tell you. *If you can help, please do reach out to me.* One major advantage for a contributor is learning GPU programming, and trust me on this, that's a completely different ball game altogether.
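To make the "inverted index for a single file" step concrete, a minimal CPU-side sketch of what it could record is below (term → token positions within the file, with a lookup function). The names and the exact structure are my assumptions for illustration, not the actual GPU implementation:

```python
import re
from collections import defaultdict

def build_inverted_index(text, stopwords):
    """Map each non-stop-word term to the list of token
    positions at which it occurs in the file."""
    index = defaultdict(list)
    for pos, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        if word not in stopwords:
            index[word].append(pos)
    return dict(index)

def query(index, term):
    """Return the positions of `term`, or an empty list if absent."""
    return index.get(term.lower(), [])
```

A multi-file version would extend the posting lists to (file-id, position) pairs; serving queries against the GPU-resident version is then a matter of shipping the term lookup to the device.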
> Explore GPU acceleration
> ------------------------
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ishan Chattopadhyaya
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
> There are parts of Lucene that can potentially be sped up if computations
> were offloaded from the CPU to the GPU(s). With commodity GPUs having as
> much as 12GB of high-bandwidth RAM, we might be able to leverage GPUs to
> speed up parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known
> to be a good candidate for GPU-based speedup (especially when complex
> polygons are involved). In the past, Mike McCandless has mentioned that
> "both initial indexing and merging are CPU/IO intensive, but they are very
> amenable to soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project.
> I volunteer to mentor any GSoC student willing to work on this this summer.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)