Hi Ahmet,

Thanks a lot for the post, the details, and the info.
I've started trying out all the options that you suggested. And... I must say that I am not able to reproduce my error. Which means that even the code that I posted works with flying colors. I am puzzled.

Cheers,
Arturas

On Thu, Jul 5, 2018 at 5:25 PM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:

> Hi Arturas,
>
> Here are some things to try:
>
> 1) HTMLStripCharFilter stripper = new HTMLStripCharFilter(
>        strReader.markSupported() ? strReader : new BufferedReader(strReader))
>
> 2) Consider using HTML Strip update processor factory.
>
> 3) Create a custom Lucene analyzer using html strip char filter and white
> space tokenizer. Use the "invoking the analyzer" example given in
> http://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html
>
> Ahmet
>
> On Thursday, July 5, 2018, 9:53:58 AM GMT+3, Arturas Mazeika <maze...@gmail.com> wrote:
>
> Hi Solr Folk,
>
> What would be the easiest way to use some of the Solr and Lucene components
> in SolrJ?
>
> I am pretty amazed how much thought and careful engineering went into some
> individual components to cover the wild real world effectively. And I
> wonder whether one could re-use some of them in other context.
>
> At the bottom, I wanted to strip the HTML code and store the output in solr
> (with different reasons behind [0]). I approached the problem
> pragmatically: googled with "HTMLStripCharFilter and example", got to [1].
> checked which jar I need for that (solr-core), googled for pom dependencies
> [2].
> and integrated this into my solrj app:
>
>         StringReader strReader = new StringReader(content);
>         HTMLStripCharFilter stripper = new HTMLStripCharFilter(new BufferedReader(strReader));
>         StringBuilder o = new StringBuilder();
>         char[] cbuf = new char[1024 * 10];
>         while (true) {
>             int count = stripper.read(cbuf);
>             if (count == -1)
>                 break; // end of stream mark is -1
>             if (count > 0)
>                 o.append(cbuf, 0, count);
>         }
>         stripper.close();
>         doc.addField("content_stripped", o.toString());
>
> Dependencies were downloaded [3], and if I start the program nothing
> happens (I have a feeling that a web server is being started).
>
> Comments?
>
> Cheers,
> Arturas
>
> References
>
> [0] Reasons may vary from optimizing highlighting of the text for the end
> user to exposing oneself to individual components of solr at the deepest
> level, analysis of impact to algorithms like machine learning or data
> management
>
> [1] https://www.programcreek.com/java-api-examples/index.php?api=org.apache.lucene.analysis.charfilter.HTMLStripCharFilter
>
> [2] pom.xml:
>
>     <dependencies>
>         <dependency>
>             <groupId>org.apache.solr</groupId>
>             <artifactId>solr-solrj</artifactId>
>             <version>7.3.0</version>
>         </dependency>
>
>         <dependency>
>             <groupId>org.apache.solr</groupId>
>             <artifactId>solr-core</artifactId>
>             <version>7.3.0</version>
>         </dependency>
>     </dependencies>
>
> [3] Included Jars:
> hppc-0.7.3.jar already exists in destination.
> jackson-annotations-2.5.4.jar already exists in destination.
> jackson-core-2.5.4.jar already exists in destination.
> jackson-databind-2.5.4.jar already exists in destination.
> jackson-dataformat-smile-2.5.4.jar already exists in destination.
> caffeine-2.4.0.jar already exists in destination.
> guava-14.0.1.jar already exists in destination.
> protobuf-java-3.1.0.jar already exists in destination.
> t-digest-3.1.jar already exists in destination.
> commons-cli-1.2.jar already exists in destination.
> commons-codec-1.10.jar already exists in destination.
> commons-collections-3.2.2.jar already exists in destination.
> commons-configuration-1.6.jar already exists in destination.
> commons-fileupload-1.3.2.jar already exists in destination.
> commons-io-2.5.jar already exists in destination.
> commons-lang-2.6.jar already exists in destination.
> dom4j-1.6.1.jar already exists in destination.
> gmetric4j-1.0.7.jar already exists in destination.
> metrics-core-3.2.2.jar already exists in destination.
> metrics-ganglia-3.2.2.jar already exists in destination.
> metrics-graphite-3.2.2.jar already exists in destination.
> metrics-jetty9-3.2.2.jar already exists in destination.
> metrics-jvm-3.2.2.jar already exists in destination.
> javax.servlet-api-3.1.0.jar already exists in destination.
> tools.jar already exists in destination.
> joda-time-2.2.jar already exists in destination.
> log4j-1.2.17.jar already exists in destination.
> eigenbase-properties-1.1.5.jar already exists in destination.
> antlr4-runtime-4.5.1-1.jar already exists in destination.
> calcite-core-1.13.0.jar already exists in destination.
> calcite-linq4j-1.13.0.jar already exists in destination.
> avatica-core-1.10.0.jar already exists in destination.
> commons-exec-1.3.jar already exists in destination.
> commons-lang3-3.6.jar already exists in destination.
> commons-math3-3.6.1.jar already exists in destination.
> curator-client-2.8.0.jar already exists in destination.
> curator-framework-2.8.0.jar already exists in destination.
> curator-recipes-2.8.0.jar already exists in destination.
> hadoop-annotations-2.7.4.jar already exists in destination.
> hadoop-auth-2.7.4.jar already exists in destination.
> hadoop-common-2.7.4.jar already exists in destination.
> hadoop-hdfs-2.7.4.jar already exists in destination.
> htrace-core-3.2.0-incubating.jar already exists in destination.
> httpclient-4.5.3.jar already exists in destination.
> httpcore-4.4.6.jar already exists in destination.
> httpmime-4.5.3.jar already exists in destination.
> lucene-analyzers-common-7.3.0.jar already exists in destination.
> lucene-analyzers-kuromoji-7.3.0.jar already exists in destination.
> lucene-analyzers-phonetic-7.3.0.jar already exists in destination.
> lucene-backward-codecs-7.3.0.jar already exists in destination.
> lucene-classification-7.3.0.jar already exists in destination.
> lucene-codecs-7.3.0.jar already exists in destination.
> lucene-core-7.3.0.jar already exists in destination.
> lucene-expressions-7.3.0.jar already exists in destination.
> lucene-grouping-7.3.0.jar already exists in destination.
> lucene-highlighter-7.3.0.jar already exists in destination.
> lucene-join-7.3.0.jar already exists in destination.
> lucene-memory-7.3.0.jar already exists in destination.
> lucene-misc-7.3.0.jar already exists in destination.
> lucene-queries-7.3.0.jar already exists in destination.
> lucene-queryparser-7.3.0.jar already exists in destination.
> lucene-sandbox-7.3.0.jar already exists in destination.
> lucene-spatial-extras-7.3.0.jar already exists in destination.
> lucene-spatial3d-7.3.0.jar already exists in destination.
> lucene-suggest-7.3.0.jar already exists in destination.
> solr-core-7.3.0.jar already exists in destination.
> solr-solrj-7.3.0.jar already exists in destination.
> zookeeper-3.4.11.jar already exists in destination.
> jackson-core-asl-1.9.13.jar already exists in destination.
> jackson-mapper-asl-1.9.13.jar already exists in destination.
> commons-compiler-2.7.6.jar already exists in destination.
> janino-2.7.6.jar already exists in destination.
> stax2-api-3.1.4.jar already exists in destination.
> woodstox-core-asl-4.4.1.jar already exists in destination.
> jetty-continuation-9.4.8.v20171121.jar already exists in destination.
> jetty-deploy-9.4.8.v20171121.jar already exists in destination.
> jetty-http-9.4.8.v20171121.jar already exists in destination.
> jetty-io-9.4.8.v20171121.jar already exists in destination.
> jetty-jmx-9.4.8.v20171121.jar already exists in destination.
> jetty-rewrite-9.4.8.v20171121.jar already exists in destination.
> jetty-security-9.4.8.v20171121.jar already exists in destination.
> jetty-server-9.4.8.v20171121.jar already exists in destination.
> jetty-servlet-9.4.8.v20171121.jar already exists in destination.
> jetty-servlets-9.4.8.v20171121.jar already exists in destination.
> jetty-util-9.4.8.v20171121.jar already exists in destination.
> jetty-webapp-9.4.8.v20171121.jar already exists in destination.
> jetty-xml-9.4.8.v20171121.jar already exists in destination.
> spatial4j-0.7.jar already exists in destination.
> noggit-0.8.jar already exists in destination.
> asm-5.1.jar already exists in destination.
> asm-commons-5.1.jar already exists in destination.
> org.restlet-2.3.0.jar already exists in destination.
> org.restlet.ext.servlet-2.3.0.jar already exists in destination.
> jcl-over-slf4j-1.7.24.jar already exists in destination.
> slf4j-api-1.7.24.jar already exists in destination.
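
For reference, a minimal sketch of the read loop quoted above combined with suggestion 1, using only java.io so that it runs without the Solr or Lucene jars. The class name ReaderDrain and the helpers markable and drain are hypothetical names chosen for illustration. A StringReader stands in for the filter here; since HTMLStripCharFilter is itself a java.io.Reader, passing new HTMLStripCharFilter(markable(...)) to drain should yield the stripped text, but that step is an assumption and is not exercised below.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReaderDrain {

    // Suggestion 1: add a BufferedReader only when the source cannot mark/reset.
    static Reader markable(Reader source) {
        return source.markSupported() ? source : new BufferedReader(source);
    }

    // Reads until read() reports end of stream (-1), then closes the reader
    // and returns everything that was read.
    static String drain(Reader in) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024 * 10];
        try (Reader r = in) {
            int count;
            while ((count = r.read(buf)) != -1) {
                out.append(buf, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: with Lucene on the classpath, the reader handed to drain
        // would be new HTMLStripCharFilter(markable(new StringReader(content))).
        Reader source = markable(new StringReader("plain text"));
        System.out.println(drain(source)); // prints "plain text"
    }
}
```

The try-with-resources block closes the reader even if read() throws, which the original loop with a trailing stripper.close() does not guarantee.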