[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028105#comment-17028105 ]

Rinka Singh edited comment on LUCENE-7745 at 2/1/20 6:13 PM:
-------------------------------------------------------------

Another update.

Sorry about the delay.  It took me SIGNIFICANTLY longer than I anticipated (multiple race conditions, debugging on the GPU, etc.), but I think I have the histogram running:
 * Created a sorted word-count histogram; the largest file tested so far is about 5 MB (~436K words).  A simplified sketch of this kind of counting kernel (including the stop-word filtering) follows this list.
 * The histogram also computes some statistics (median, mean, standard deviation, etc.), but I haven't optimized that part and it is horrendously slow on large data, so I commented it out for now since it isn't essential.
 * Applied stop words - I use a stop-word list of ~4.4K words.
 * Performance (these are debug builds, and the numbers below were measured while running under gdb):
 ** Quadro 2000 (192 cores), Intel dual-core CPU, Kubuntu 14.04, CUDA 7.5: 765 sec
 ** GeForce GTX 780 (2304 cores), Intel i5, Kubuntu 16.04, CUDA 8.0: 640 sec
 * I have done some performance optimization (registers and shared memory), but there's a lot more that can be done.  I suspect I can improve the speed by at least 5x, if not more.
 ** Applying the stop words can be optimized further, but I assume that is not critical since the index will be updated infrequently.  At this point it is in the code path and contributes to the timings above.
 ** Algorithm optimization can give quite a bit of bang for the buck.
 ** I :) "invented" my own sort (a parallel version of selection sort, applied within and across chunks).  I'd need to do more experimenting here; for comparison, a Thrust-based baseline for the same sorting step is sketched after this list.
 ** :) I have yet to build and test a release (production) binary.  Testing has been exclusively with debug builds running under gdb.
 ** The whole thing is sequential at a high level: read the file into the GPU, break it into chunks, apply stop words, sort...  This can be parallelized significantly - for example, reading data can be overlapped with processing and sorting (see the streams sketch after this list).
 * I also don't have access to high-end GPUs (a V100, for example: [https://www.nvidia.com/en-us/data-center/v100/]).  High-end GPUs should give another significant performance boost on top of the optimization I can do.  To give you an idea, about a year ago I tested a very old version of this on a K80-based machine and saw something like a 2x boost.
 * Going forward, I will find it VERY difficult to commit to timelines, as GPU work seems to take me something like 7-10x the time the equivalent CPU work would.  The reasons are many:
 ** GPU development is inherently much, much slower - thinking the design through takes at least 3-4x more time.  I abandoned many alternative designs halfway through development (something that would never happen to me in CPU-based development).
 ** Debugging is SIGNIFICANTLY slower, despite the CUDA tooling.
 ** Race conditions have bitten me multiple times, and each time I lost weeks or even months.
 ** And finally, of course, there is my own limitation: transitioning back into being a developer.  It took me quite a while, and I am still not at the level I was at as a developer with ten years of experience (so long ago).
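
To give a concrete feel for the counting step, here is a simplified, self-contained sketch of a stop-word-aware word-count kernel.  This is illustrative only, not the actual code from this work: it assumes the text has already been tokenized into 32-bit word IDs and that stop words are marked in a flag array indexed by word ID; all names and the toy input are made up.

{code}
// Illustrative sketch only (not the actual patch code): count word occurrences
// on the GPU while skipping stop words.  Assumes the text was already
// tokenized into 32-bit word IDs; isStopWord[id] != 0 marks a stop word.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void countWordsKernel(const unsigned int *wordIds, size_t numWords,
                                 const unsigned char *isStopWord,
                                 unsigned int *counts /* one bin per word ID */)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < numWords; i += stride) {      // grid-stride loop over all tokens
        unsigned int id = wordIds[i];
        if (!isStopWord[id])                 // drop stop words before counting
            atomicAdd(&counts[id], 1u);      // increment this word's bin
    }
}

int main()
{
    const unsigned int vocabSize = 4;
    // Toy input: word IDs 0..3, with ID 2 marked as a stop word.
    std::vector<unsigned int> wordIds = {0, 1, 2, 1, 3, 2, 1};
    std::vector<unsigned char> isStop = {0, 0, 1, 0};

    unsigned int *d_ids, *d_counts;
    unsigned char *d_stop;
    cudaMalloc(&d_ids, wordIds.size() * sizeof(unsigned int));
    cudaMalloc(&d_stop, isStop.size());
    cudaMalloc(&d_counts, vocabSize * sizeof(unsigned int));
    cudaMemcpy(d_ids, wordIds.data(), wordIds.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_stop, isStop.data(), isStop.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, vocabSize * sizeof(unsigned int));

    countWordsKernel<<<4, 128>>>(d_ids, wordIds.size(), d_stop, d_counts);

    std::vector<unsigned int> counts(vocabSize);
    cudaMemcpy(counts.data(), d_counts, vocabSize * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    for (unsigned int w = 0; w < vocabSize; ++w)
        printf("word %u -> %u\n", w, counts[w]);

    cudaFree(d_ids); cudaFree(d_stop); cudaFree(d_counts);
    return 0;
}
{code}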
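
For the sorting step, a useful correctness and performance reference is Thrust's built-in device sort.  The sketch below is a library baseline for comparison only, not my custom chunked selection sort; the toy data and names are illustrative.

{code}
// Baseline for comparison: sort the finished histogram by descending count
// with Thrust's device sort (not the custom chunked selection sort).
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    // Toy histogram: counts[i] is the occurrence count of word ID ids[i].
    unsigned int h_counts[] = {3, 7, 1, 5};
    unsigned int h_ids[]    = {0, 1, 2, 3};
    thrust::device_vector<unsigned int> counts(h_counts, h_counts + 4);
    thrust::device_vector<unsigned int> ids(h_ids, h_ids + 4);

    // Sort word IDs by descending count so the most frequent words come first.
    thrust::sort_by_key(counts.begin(), counts.end(), ids.begin(),
                        thrust::greater<unsigned int>());

    // Element access on a device_vector copies back to the host (fine for a toy).
    for (size_t i = 0; i < ids.size(); ++i)
        std::printf("id %u: %u\n", (unsigned)ids[i], (unsigned)counts[i]);
    return 0;
}
{code}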
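
And here is a minimal sketch of the kind of stream-based double buffering I have in mind when I say reading and processing can be overlapped.  The chunk sizes, the placeholder kernel, and the two-buffer scheme are assumptions for illustration; the real pipeline would plug the tokenize/count kernels in where the placeholder runs.

{code}
// Illustrative sketch: overlap host->device copies with kernel work using two
// CUDA streams and two device buffers (double buffering).
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

__global__ void processChunk(const char *text, size_t len)
{
    // Placeholder for the real tokenize/count work: just read each byte.
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len && text[i] == '\0')
        printf("unexpected NUL at %llu\n", (unsigned long long)i);
}

int main()
{
    const size_t totalBytes = 1 << 20;   // stand-in for the file contents
    const size_t chunkBytes = 1 << 16;

    char *h_text;
    cudaMallocHost(&h_text, totalBytes);         // pinned memory so copies can be async
    std::memset(h_text, 'a', totalBytes);

    char *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_buf[i], chunkBytes);
        cudaStreamCreate(&stream[i]);
    }

    // Alternate streams: while one chunk's kernel runs in stream[0], the next
    // chunk's copy proceeds in stream[1], and so on.
    int slot = 0;
    for (size_t off = 0; off < totalBytes; off += chunkBytes, slot ^= 1) {
        size_t len = (off + chunkBytes <= totalBytes) ? chunkBytes : totalBytes - off;
        cudaMemcpyAsync(d_buf[slot], h_text + off, len,
                        cudaMemcpyHostToDevice, stream[slot]);
        processChunk<<<(unsigned)((len + 255) / 256), 256, 0, stream[slot]>>>(d_buf[slot], len);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
    cudaFreeHost(h_text);
    std::printf("processed %zu bytes in %zu-byte chunks\n", totalBytes, chunkBytes);
    return 0;
}
{code}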

I'll release the code on my GitHub in a day or two, along with instructions on how to compile and run it (I'll include both the test data and the stop-word files), and I'll post the link here.

*I'd love to hear how these numbers compare to running an equivalent histogram on a CPU cluster - could someone please run this and let me know?*  Also, *if someone can provide me with a V100-based instance* (even an AWS instance is fine), I can run it there (as is) and generate some numbers.

Underlying assumption: the code is working correctly.  That may be a bad assumption, since I have done just enough testing to process one file - the code handles only a small set of boundary conditions, and it is not something I would deploy at this point.  I was more focused on getting it out than on extensive testing.

Next steps:

I'll start updating the code to:
 * Put this on GitHub and do some more measurements.
 * Implement an inverted index for a single file (a rough host-side sketch of the target layout follows this list).
 * Extend it to multiple files.
 * Finally, set it up so that queries can be sent to the inverted index running on the GPU...
 * :) Testing, of course.
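
To make the single-file inverted index step concrete, here is a rough host-side sketch of the data layout I'm aiming for (term -> list of token positions).  This is plain C++ purely for illustration; in the GPU version the postings would live in device memory, and all names here are hypothetical.

{code}
// Rough host-side sketch of a single-file inverted index: term -> positions.
#include <string>
#include <vector>
#include <unordered_map>
#include <iostream>

using Postings = std::vector<size_t>;        // token positions within the file

std::unordered_map<std::string, Postings>
buildInvertedIndex(const std::vector<std::string> &tokens)
{
    std::unordered_map<std::string, Postings> index;
    for (size_t pos = 0; pos < tokens.size(); ++pos)
        index[tokens[pos]].push_back(pos);   // append this position to the term's postings
    return index;
}

int main()
{
    std::vector<std::string> tokens = {"gpu", "index", "gpu", "lucene"};
    auto index = buildInvertedIndex(tokens);
    for (const auto &e : index)
        std::cout << e.first << ": " << e.second.size() << " occurrence(s)\n";
    return 0;
}
{code}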

*But I'd like feedback on this* before going down this rabbit hole.

Also, once I have an initial inverted index in place, I'd love to have contributors.  There is a lot that even a CPU developer can contribute - one look at the code will tell you.  *If you can help, please do reach out to me.*  One major advantage for contributors will be learning GPU programming - and trust me, that's a completely different ball game.



> Explore GPU acceleration
> ------------------------
>
>                 Key: LUCENE-7745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7745
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>              Labels: gsoc2017, mentor
>         Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations 
> were to be offloaded from CPU to the GPU(s). With commodity GPUs having as 
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to 
> speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known 
> to be a good candidate for GPU based speedup (esp. when complex polygons are 
> involved). In the past, Mike McCandless has mentioned that "both initial 
> indexing and merging are CPU/IO intensive, but they are very amenable to 
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I 
> volunteer to mentor any GSoC student willing to work on this this summer.


