cangkuren opened a new issue, #14089: URL: https://github.com/apache/lucene/issues/14089
### Description

```java
public AnalyzerResult analyze(String text) throws IOException {
    // text = HtmlExtractor.extractTextFromHtml(text);
    List<String> tokens = new ArrayList<>();
    List<String> originalTexts = new ArrayList<>();
    try (TokenStream stream = tokenStream("*", text)) {
        stream.reset();
        CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
        while (stream.incrementToken()) {
            tokens.add(charTermAttribute.toString());
            // Slice the raw input with the token's start and end offsets.
            originalTexts.add(text.substring(offsetAttribute.startOffset(), offsetAttribute.endOffset()));
        }
    }
    return AnalyzerResult.builder().tokens(tokens).originalTexts(originalTexts).build();
}
```

```java
public static String extractTextFromHtml(String content) {
    Document document = Jsoup.parseBodyFragment(content);
    return document.body().text().replace(" ", "").trim();
}
```

```java
protected Reader initReader(String fieldName, Reader reader) {
    reader = new HTMLStripCharFilter(reader);
    reader = new JapaneseIterationMarkCharFilter(reader);
    return reader;
}
```

```java
@PostConstruct
@Scheduled(cron = "0 0 0 * * *")
public void init() {
    Logged.L.info("load Japanese config.");
    String dict = japaneseDictConfig.getDict();
    UserDictionary userDictionary = null;
    try {
        userDictionary = UserDictionary.open(new StringReader(dict));
    } catch (Exception e) {
        Logged.L.error("load japanese dict error", e);
    }
    List<String> stopWords = stopWordConfig.getStopWords();
    CharArraySet stopSet = new CharArraySet(stopWords, true);
    stopSet.add(getDefaultStopSet());
    Tokenizer tokenizer = new JapaneseTokenizer(userDictionary, true, false, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, getDefaultStopTags());
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopSet);
    stream = new JapaneseReadingFormFilter(stream);
    // stream = new JapaneseKatakanaStemFilter(stream);
    stream = new JapaneseNumberFilter(stream);
    stream = new LowerCaseFilter(stream);
    this.tokenStreamComponents = new TokenStreamComponents(tokenizer, stream);
}
```

```java
@Test
public void test() throws SQLException, IOException {
    String ss = "<span style=\"font-weight: bolder; color: var(--theme-color-black) !important; background: var(--module-background) !important;\">背景</span>";
    MultiLanguageAnalyzer.AnalyzerResult analyzerResult = analyzer.analyze(ss);
    System.out.println(analyzerResult.getOriginalTexts());
}
```

This is the code. I use lucene-analyzers-kuromoji 8.11.4.

If I do not strip the HTML with jsoup first, the output `originalTexts` is `背景</span>`: the closing tag is still present. Is this the expected result?

If I strip the input with jsoup first, the output is `背景`.

### Version and environment details

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-kuromoji</artifactId>
    <version>8.11.4</version>
</dependency>
```
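Because `analyze` in the description builds `originalTexts` with `text.substring(startOffset, endOffset)`, the reported value `背景</span>` can be read backward to see where the offsets pointed. The sketch below uses only the Java standard library and does not call Lucene. It assumes the analyzer emitted a single token for `背景` and uses a shortened version of the test input. `OffsetSlices`, `slice`, and `offsetsFor` are hypothetical names introduced for illustration.

```java
class OffsetSlices {
    // Mirrors the slice taken in analyze(): the raw input between two offsets.
    public static String slice(String text, int start, int end) {
        return text.substring(start, end);
    }

    // Works backward from a reported slice to the [start, end) offsets that
    // would produce it. Returns null if the slice does not occur in the text.
    public static int[] offsetsFor(String text, String reportedSlice) {
        int start = text.indexOf(reportedSlice);
        if (start < 0) {
            return null;
        }
        return new int[] {start, start + reportedSlice.length()};
    }

    public static void main(String[] args) {
        // Shortened stand-in for the input in the report (assumption: the
        // style attribute does not affect the offsets being compared).
        String text = "<span style=\"font-weight: bolder;\">背景</span>";

        // Offsets implied by the reported output.
        int[] reported = offsetsFor(text, "背景</span>");
        System.out.println("end offset equals input length: "
                + (reported[1] == text.length()));

        // Offsets that would cover only the visible characters.
        int tightEnd = reported[0] + "背景".length();
        System.out.println("slice with tight end offset: "
                + slice(text, reported[0], tightEnd));
    }
}
```

Under these assumptions the implied end offset sits after the closing tag instead of directly after `景`. Whether that placement is the intended offset correction of `HTMLStripCharFilter` is the question this issue raises.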