Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-12 Thread via GitHub
uschindler commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1938448818

> > How can that be done?

This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of …

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-08 Thread via GitHub
chatman commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1934843930

How about something with the source maintained in the sandbox dir (along with instructions to build), but no corresponding official release artifact?

On Fri, 9 Feb, 2024, 1: …

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-08 Thread via GitHub
lmessinger commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1933785298

Hi, got it. Pointing to the project from the documentation would actually be very valuable to the Hebrew community. How can that be done? Is the documentation also on …

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-06 Thread via GitHub
dweiss commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1930667697

It will be a major headache to maintain native bindings for all major platforms. I think such an analyzer should be a downstream project (then you can restrict the platforms on which …

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-06 Thread via GitHub
lmessinger commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1929981311

Hi, in Hebrew and other Semitic languages, lemmas are context-dependent. E.g. שמן could be interpreted as "fat", "oil", "their name", or "from", all depending on the context s…
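Where the model (or a dictionary fallback) cannot commit to a single reading, Lucene's usual device is to stack all candidate lemmas at the same token position, exactly as its synonym filters do. A minimal sketch of that pattern, with a hypothetical in-memory lemma table standing in for the model's context-dependent prediction:

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Sketch: stack every candidate lemma at the same position, synonym-style. */
public final class AmbiguousLemmaFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final Map<String, List<String>> lemmas; // surface form -> candidates (hypothetical)
  private final Deque<String> pending = new ArrayDeque<>();
  private State current;

  public AmbiguousLemmaFilter(TokenStream input, Map<String, List<String>> lemmas) {
    super(input);
    this.lemmas = lemmas;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // Emit a queued candidate at the same position as the previous one.
      restoreState(current);
      termAtt.setEmpty().append(pending.pop());
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    List<String> candidates = lemmas.get(termAtt.toString());
    if (candidates != null && !candidates.isEmpty()) {
      // Emit the first candidate now; queue the rest at position increment 0.
      pending.addAll(candidates.subList(1, candidates.size()));
      current = captureState();
      termAtt.setEmpty().append(candidates.get(0));
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}
```

Queries for any of the stacked candidates then match the document at that position, and phrase queries keep working because the extra lemmas carry a position increment of zero.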

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1929933564

@lmessinger I don't see why text tokenization would need any native code. WordPiece is pretty simple and just a dictionary lookup. Do y'all not have a Java one? O…
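For reference, the greedy longest-match-first lookup benwtrent describes is only a few lines of Java. A sketch, assuming the vocabulary (e.g. a BERT vocab.txt) has already been loaded into a Set; this is illustrative, not the tokenizer discussed in the issue:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Minimal greedy longest-match-first WordPiece tokenizer (illustrative only). */
public final class SimpleWordPiece {
  private static final String CONTINUATION = "##";
  private static final String UNKNOWN = "[UNK]";
  private final Set<String> vocab; // pre-loaded vocabulary, e.g. from vocab.txt

  public SimpleWordPiece(Set<String> vocab) {
    this.vocab = vocab;
  }

  /** Splits one whitespace-delimited word into word pieces. */
  public List<String> tokenize(String word) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Try the longest possible substring first, shrinking until a vocab hit.
      while (start < end) {
        String candidate = word.substring(start, end);
        if (start > 0) {
          candidate = CONTINUATION + candidate; // mark non-initial pieces
        }
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        return List.of(UNKNOWN); // no piece matched: the whole word is unknown
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }
}
```

Input here is assumed to be pre-split on whitespace and punctuation; real WordPiece implementations also cap word length and normalize text first, which this sketch omits.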

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-04 Thread via GitHub
lmessinger commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1925739977

I mean, create just the tokens - the lemmas / wordpieces.

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-01 Thread via GitHub
benwtrent commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1921775426

For the analyzer, are you meaning something that tokenizes into an embedding? Or just creates the tokens (wordpiece + dictionary)?
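The two readings of the question map to two different index shapes in Lucene. A sketch using current (9.x) field types; the field names and the 768-dim size are illustrative assumptions:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.TextField;

/** Sketch: the two indexing shapes behind the question above. */
public final class TwoPaths {
  /** "Just creates the tokens": an analyzed text field; the analyzer
   *  (wordpiece + dictionary, lemmatizer, etc.) runs at index time. */
  static Document tokenDoc(String text) {
    Document doc = new Document();
    doc.add(new TextField("body", text, Field.Store.NO));
    return doc;
  }

  /** "Tokenizes into an embedding": the model's output vector is indexed
   *  for KNN search instead; 768 dims is a common BERT output size. */
  static Document vectorDoc(float[] embedding) {
    Document doc = new Document();
    doc.add(new KnnFloatVectorField("body_vector", embedding));
    return doc;
  }
}
```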

[I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-01 Thread via GitHub
lmessinger opened a new issue, #13065:
URL: https://github.com/apache/lucene/issues/13065

### Description

Hi, we are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stopwords), based on a BERT model. We'd like to contribute it to this repository. How ca…
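The issue body is truncated in the archive, but for orientation: an analyzer of the kind described usually plugs into Lucene as an Analyzer whose token stream ends in a model-backed filter. A rough sketch under that assumption; BertLemmaFilter is hypothetical and stands in for whatever the project's lemmatizer exposes:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

/** Sketch: a custom analyzer whose chain ends in a model-backed lemma filter. */
public final class HebrewBertAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    return new TokenStreamComponents(source, new BertLemmaFilter(source));
  }

  /** Hypothetical stand-in: a real version would rewrite each surface form
   *  to the lemma the BERT model predicts for it in context. */
  static final class BertLemmaFilter extends TokenFilter {
    BertLemmaFilter(TokenStream in) {
      super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
      return input.incrementToken(); // pass-through in this sketch
    }
  }
}
```

Keeping the model behind a TokenFilter is what makes the later discussion in this thread possible: the Analyzer API stays pure Java, and only the filter's backing implementation has to care about native inference code.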