uschindler commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1938448818
> > How can that be done?
>
> This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of
chatman commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1934843930
How about something with the source maintained in the sandbox dir (along with instructions to build), but no corresponding official release artifact?
lmessinger commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1933785298
Hi,
Got it. Pointing to the project from the documentation would actually be very valuable to the Hebrew community. How can that be done? Is the documentation also on
dweiss commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1930667697
It will be a major headache to maintain native bindings for all major platforms. I think such an analyzer should be a downstream project (then you can restrict the platforms on which
lmessinger commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1929981311
Hi,
In Hebrew and other Semitic languages, lemmas are context-dependent. E.g. שמן could be interpreted as "fat", "oil", "their name", or "from", all depending on the context.
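For readers following along: a lemmatizer like the one described here would normally be exposed to Lucene as a TokenFilter in an analysis chain. The sketch below only illustrates that plumbing and is not the project's code; it rewrites each surface form from a plain in-memory map, which is exactly what cannot disambiguate a form like שמן, since the real component would consult the BERT model for the context-dependent choice. The class name and the map are hypothetical.

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical shape of a lemmatizing TokenFilter. A real Hebrew/Arabic
// lemmatizer would replace the map lookup with a context-aware model call.
public final class LemmaFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final Map<String, String> lemmas; // surface form -> lemma (illustrative only)

  public LemmaFilter(TokenStream input, Map<String, String> lemmas) {
    super(input);
    this.lemmas = lemmas;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of stream
    }
    String lemma = lemmas.get(termAtt.toString());
    if (lemma != null) {
      termAtt.setEmpty().append(lemma); // rewrite the token to its lemma
    }
    return true;
  }
}
```

Such a filter would typically be wired into a custom Analyzer's createComponents(), after the Tokenizer and before stopword removal.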
benwtrent commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1929933564
@lmessinger I don't see why text tokenization would need any native code.
Word piece is pretty simple and just a dictionary look up.
Do y'all not have a Java one?
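For reference, the "dictionary look up" flavour of WordPiece can indeed be written in a few lines of plain Java with no native code. The sketch below is only an illustration with a made-up vocabulary, not the tokenizer of any particular model: it performs the standard greedy longest-match-first lookup within a single word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Greedy longest-match-first WordPiece lookup over a plain vocabulary set.
public class WordPieceSketch {

  private static final String UNKNOWN = "[UNK]";
  private static final String CONTINUATION = "##";

  // Splits a single whitespace-free word into WordPiece sub-tokens.
  static List<String> tokenizeWord(String word, Set<String> vocab) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Try the longest possible substring first, then shrink from the right.
      while (start < end) {
        String candidate = word.substring(start, end);
        if (start > 0) {
          candidate = CONTINUATION + candidate; // mark non-initial pieces
        }
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        return List.of(UNKNOWN); // no piece matched: the whole word is unknown
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  public static void main(String[] args) {
    // Illustrative vocabulary only.
    Set<String> vocab = Set.of("un", "##aff", "##able", "play", "##ing");
    System.out.println(tokenizeWord("unaffable", vocab)); // [un, ##aff, ##able]
    System.out.println(tokenizeWord("playing", vocab));   // [play, ##ing]
  }
}
```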
lmessinger commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1925739977
I mean, create just the tokens - the lemmas / wordpieces
benwtrent commented on issue #13065:
URL: https://github.com/apache/lucene/issues/13065#issuecomment-1921775426
For the analyzer, do you mean something that tokenizes into an embedding?
Or just something that creates the tokens (wordpiece + dictionary)?
lmessinger opened a new issue, #13065:
URL: https://github.com/apache/lucene/issues/13065
### Description
Hi,
We are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stopwords), based on a BERT model. We'd like to contribute it to this repository. How can we do that?