[I] HTMLStripCharFilter [lucene]

via GitHub Sun, 29 Dec 2024 19:18:07 -0800

cangkuren opened a new issue, #14089:
URL: https://github.com/apache/lucene/issues/14089

   ### Description
   
   `public AnalyzerResult analyze(String text) throws IOException {
   //        text = HtmlExtractor.extractTextFromHtml(text);
           List<String> tokens = new ArrayList<>();
           List<String> originalTexts = new ArrayList<>();
           try (TokenStream stream = tokenStream("*", text)) {
               stream.reset();
               CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
               OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
               while (stream.incrementToken()) {
                   tokens.add(charTermAttribute.toString());
                   originalTexts.add(text.substring(offsetAttribute.startOffset(), offsetAttribute.endOffset()));
               }
           }
           return AnalyzerResult.builder().tokens(tokens).originalTexts(originalTexts).build();
       }`
   `public static String extractTextFromHtml(String content) {
           Document document = Jsoup.parseBodyFragment(content);
           return document.body().text().replace(" ", "").trim();
       }`
   
   `protected Reader initReader(String fieldName, Reader reader) {
           reader = new HTMLStripCharFilter(reader);
           reader = new JapaneseIterationMarkCharFilter(reader);
           return reader;
       }`
   
   `@PostConstruct
       @Scheduled(cron = "0 0 0 * * *")
       public void init() {
           Logged.L.info("load Japanese config.");
           String dict = japaneseDictConfig.getDict();
           UserDictionary userDictionary = null;
           try {
               userDictionary = UserDictionary.open(new StringReader(dict));
           } catch (Exception e) {
               Logged.L.error("load japanese dict error", e);
           }
   
           List<String> stopWords = stopWordConfig.getStopWords();
           CharArraySet stopSet = new CharArraySet(stopWords, true);
           stopSet.add(getDefaultStopSet());
   
           Tokenizer tokenizer = new JapaneseTokenizer(userDictionary, true, false, JapaneseTokenizer.Mode.SEARCH);
   
           TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
           stream = new JapanesePartOfSpeechStopFilter(stream, getDefaultStopTags());
           stream = new CJKWidthFilter(stream);
           stream = new StopFilter(stream, stopSet);
           stream = new JapaneseReadingFormFilter(stream);
   //        stream = new JapaneseKatakanaStemFilter(stream);
           stream = new JapaneseNumberFilter(stream);
           stream = new LowerCaseFilter(stream);
           this.tokenStreamComponents = new TokenStreamComponents(tokenizer, stream);
       }`
   `@Test
       public void test() throws SQLException, IOException {
           String ss = "<span style=\"font-weight: bolder; color: var(--theme-color-black) !important; background: var(--module-background) !important;\">背景</span>";
           MultiLanguageAnalyzer.AnalyzerResult analyzerResult = analyzer.analyze(ss);
           System.out.println(analyzerResult.getOriginalTexts());
   
       }`
   This is the code. I am using lucene-analyzers-kuromoji 8.11.4.
   
   If I do not strip the HTML with jsoup first, the `originalTexts` output is `背景</span>`: the closing tag is still present even though `HTMLStripCharFilter` is applied in `initReader`. Is this the expected behavior?
   
   If I strip the HTML with jsoup before analysis, the output is `背景`.
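
One possible explanation for the `背景</span>` result, sketched under assumptions: a char filter reports offsets into the original input by adding a cumulative correction to each offset in the stripped text. If the correction recorded for the closing tag is keyed at the same stripped-text position as the token's end offset, the corrected end offset lands after `</span>`. The class below is a simplified, hypothetical model of that bookkeeping in plain Java. `OffsetCorrectionSketch` and its methods are illustrative names, not Lucene APIs, and the tag scan is far cruder than `HTMLStripCharFilter`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of char-filter offset correction.
// It is NOT Lucene code; it only illustrates cumulative offset bookkeeping.
final class OffsetCorrectionSketch {
    private final StringBuilder stripped = new StringBuilder();
    // Each entry is {offset in stripped text, cumulative number of removed chars}.
    private final List<int[]> corrections = new ArrayList<>();

    OffsetCorrectionSketch(String input) {
        int cumulativeDiff = 0;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // Naive tag detection: everything from '<' to the next '>' is removed.
            int close = (c == '<') ? input.indexOf('>', i) : -1;
            if (close >= 0) {
                cumulativeDiff += close - i + 1;
                corrections.add(new int[] {stripped.length(), cumulativeDiff});
                i = close + 1;
            } else {
                stripped.append(c);
                i++;
            }
        }
    }

    String stripped() {
        return stripped.toString();
    }

    // Maps an offset in the stripped text back to an offset in the original input,
    // using the last correction recorded at or before that offset.
    int correctOffset(int offset) {
        int diff = 0;
        for (int[] entry : corrections) {
            if (entry[0] <= offset) {
                diff = entry[1];
            } else {
                break;
            }
        }
        return offset + diff;
    }

    public static void main(String[] args) {
        String input = "<span>背景</span>";
        OffsetCorrectionSketch sketch = new OffsetCorrectionSketch(input);
        int start = sketch.correctOffset(0);
        int end = sketch.correctOffset(sketch.stripped().length());
        System.out.println(input.substring(start, end)); // prints 背景</span>
    }
}
```

In this model the stripped text is `背景`; only the substring taken from the original input with the corrected end offset includes the closing tag. Whether this matches the real filter in 8.11.4 is an assumption that would need to be checked against the Lucene source.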
   
   
   
   ### Version and environment details
   
   <dependency>
       <groupId>org.apache.lucene</groupId>
       <artifactId>lucene-analyzers-kuromoji</artifactId>
       <version>8.11.4</version>
   </dependency>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


