Hi Ahmet,

Thanks a lot for the post, the details, and the information.

I've started trying out all the options that you suggested and, I must
say, I am no longer able to reproduce my error. Even the code that I
posted now works with flying colors.

I am puzzled.

Cheers,
Arturas


On Thu, Jul 5, 2018 at 5:25 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi Arturas,
>
> Here are some things to try :
>
> 1) HTMLStripCharFilter stripper = new HTMLStripCharFilter(
>        strReader.markSupported() ? strReader : new BufferedReader(strReader));
>
> 2) Consider using the HTML strip update processor factory
> (HTMLStripFieldUpdateProcessorFactory).
>
> 3) Create a custom Lucene analyzer using the HTML strip char filter and the
> whitespace tokenizer. Use the "invoking the analyzer" example given in
> http://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html
>
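Suggestion 1 can be tried without any Lucene classes, because the check is made on the plain java.io.Reader. Below is a minimal sketch, assuming a hypothetical helper named ensureMarkSupported (it is not part of Lucene or Solr). It returns the reader unchanged when it already supports mark/reset and wraps it in a BufferedReader otherwise.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;

public class MarkSupportDemo {
    // Hypothetical helper (not a Lucene/Solr API): wrap the reader only
    // when it lacks mark/reset support.
    static Reader ensureMarkSupported(Reader in) {
        return in.markSupported() ? in : new BufferedReader(in);
    }

    public static void main(String[] args) {
        Reader plain = new StringReader("<p>hello</p>");
        // StringReader supports mark/reset, so no wrapper is added.
        Reader result = ensureMarkSupported(plain);
        System.out.println("same instance: " + (result == plain));
    }
}
```

The returned reader is what would then be handed to the HTMLStripCharFilter constructor, as in the one-liner above.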
> Ahmet
>
>
>
> On Thursday, July 5, 2018, 9:53:58 AM GMT+3, Arturas Mazeika <
> maze...@gmail.com> wrote:
>
>
>
>
>
> Hi Solr Folk,
>
> What would be the easiest way to use some of the Solr and Lucene components
> in SolrJ?
>
> I am pretty amazed at how much thought and careful engineering went into some
> individual components to cover the wild real world effectively, and I
> wonder whether one could reuse some of them in other contexts.
>
> At bottom, I wanted to strip the HTML markup and store the output in Solr
> (for various reasons [0]). I approached the problem pragmatically: I
> googled "HTMLStripCharFilter and example", got to [1], checked which jar I
> need for that (solr-core), googled for the pom dependencies [2], and
> integrated this into my SolrJ app:
>
>     StringReader strReader = new StringReader(content);
>     HTMLStripCharFilter stripper =
>             new HTMLStripCharFilter(new BufferedReader(strReader));
>     StringBuilder o = new StringBuilder();
>     char[] cbuf = new char[1024 * 10];
>     while (true) {
>         int count = stripper.read(cbuf);
>         if (count == -1)
>             break; // -1 marks the end of the stream
>         if (count > 0)
>             o.append(cbuf, 0, count);
>     }
>     stripper.close();
>     doc.addField("content_stripped", o.toString());
>
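The read loop above is ordinary java.io.Reader handling, so it can be checked in isolation before HTMLStripCharFilter is involved. Below is a minimal sketch, assuming a hypothetical helper named readFully and a StringReader standing in for the char filter. It does not strip any HTML; it only shows that the loop terminates and returns the full text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderDrain {
    // Hypothetical helper: read everything from a Reader into a String
    // and close the reader afterwards.
    static String readFully(Reader in) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024 * 10];
        try (Reader r = in) {                      // closed even on failure
            int count;
            while ((count = r.read(buf)) != -1) {  // -1 marks end of stream
                out.append(buf, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the char filter in this sketch.
        System.out.println(readFully(new StringReader("plain text")));
    }
}
```

Since HTMLStripCharFilter is itself a Reader, the same helper should accept it in place of the StringReader.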
>
> The dependencies were downloaded [3], but when I start the program nothing
> happens (I have a feeling that a web server is being started).
>
> Comments?
>
> Cheers,
> Arturas
>
> References
>
> [0] Reasons range from optimizing the highlighting of text for the end
> user, to getting exposure to individual Solr components at the deepest
> level, to analyzing the impact on algorithms such as machine learning or
> data management.
>
> [1]
> https://www.programcreek.com/java-api-examples/index.php?api=org.apache.lucene.analysis.charfilter.HTMLStripCharFilter
>
> [2] pom.xml:
>
>     <dependencies>
>         <dependency>
>             <groupId>org.apache.solr</groupId>
>             <artifactId>solr-solrj</artifactId>
>             <version>7.3.0</version>
>         </dependency>
>
>         <dependency>
>             <groupId>org.apache.solr</groupId>
>             <artifactId>solr-core</artifactId>
>             <version>7.3.0</version>
>         </dependency>
>     </dependencies>
>
> [3] Included jars:
> hppc-0.7.3.jar already exists in destination.
> jackson-annotations-2.5.4.jar already exists in destination.
> jackson-core-2.5.4.jar already exists in destination.
> jackson-databind-2.5.4.jar already exists in destination.
> jackson-dataformat-smile-2.5.4.jar already exists in destination.
> caffeine-2.4.0.jar already exists in destination.
> guava-14.0.1.jar already exists in destination.
> protobuf-java-3.1.0.jar already exists in destination.
> t-digest-3.1.jar already exists in destination.
> commons-cli-1.2.jar already exists in destination.
> commons-codec-1.10.jar already exists in destination.
> commons-collections-3.2.2.jar already exists in destination.
> commons-configuration-1.6.jar already exists in destination.
> commons-fileupload-1.3.2.jar already exists in destination.
> commons-io-2.5.jar already exists in destination.
> commons-lang-2.6.jar already exists in destination.
> dom4j-1.6.1.jar already exists in destination.
> gmetric4j-1.0.7.jar already exists in destination.
> metrics-core-3.2.2.jar already exists in destination.
> metrics-ganglia-3.2.2.jar already exists in destination.
> metrics-graphite-3.2.2.jar already exists in destination.
> metrics-jetty9-3.2.2.jar already exists in destination.
> metrics-jvm-3.2.2.jar already exists in destination.
> javax.servlet-api-3.1.0.jar already exists in destination.
> tools.jar already exists in destination.
> joda-time-2.2.jar already exists in destination.
> log4j-1.2.17.jar already exists in destination.
> eigenbase-properties-1.1.5.jar already exists in destination.
> antlr4-runtime-4.5.1-1.jar already exists in destination.
> calcite-core-1.13.0.jar already exists in destination.
> calcite-linq4j-1.13.0.jar already exists in destination.
> avatica-core-1.10.0.jar already exists in destination.
> commons-exec-1.3.jar already exists in destination.
> commons-lang3-3.6.jar already exists in destination.
> commons-math3-3.6.1.jar already exists in destination.
> curator-client-2.8.0.jar already exists in destination.
> curator-framework-2.8.0.jar already exists in destination.
> curator-recipes-2.8.0.jar already exists in destination.
> hadoop-annotations-2.7.4.jar already exists in destination.
> hadoop-auth-2.7.4.jar already exists in destination.
> hadoop-common-2.7.4.jar already exists in destination.
> hadoop-hdfs-2.7.4.jar already exists in destination.
> htrace-core-3.2.0-incubating.jar already exists in destination.
> httpclient-4.5.3.jar already exists in destination.
> httpcore-4.4.6.jar already exists in destination.
> httpmime-4.5.3.jar already exists in destination.
> lucene-analyzers-common-7.3.0.jar already exists in destination.
> lucene-analyzers-kuromoji-7.3.0.jar already exists in destination.
> lucene-analyzers-phonetic-7.3.0.jar already exists in destination.
> lucene-backward-codecs-7.3.0.jar already exists in destination.
> lucene-classification-7.3.0.jar already exists in destination.
> lucene-codecs-7.3.0.jar already exists in destination.
> lucene-core-7.3.0.jar already exists in destination.
> lucene-expressions-7.3.0.jar already exists in destination.
> lucene-grouping-7.3.0.jar already exists in destination.
> lucene-highlighter-7.3.0.jar already exists in destination.
> lucene-join-7.3.0.jar already exists in destination.
> lucene-memory-7.3.0.jar already exists in destination.
> lucene-misc-7.3.0.jar already exists in destination.
> lucene-queries-7.3.0.jar already exists in destination.
> lucene-queryparser-7.3.0.jar already exists in destination.
> lucene-sandbox-7.3.0.jar already exists in destination.
> lucene-spatial-extras-7.3.0.jar already exists in destination.
> lucene-spatial3d-7.3.0.jar already exists in destination.
> lucene-suggest-7.3.0.jar already exists in destination.
> solr-core-7.3.0.jar already exists in destination.
> solr-solrj-7.3.0.jar already exists in destination.
> zookeeper-3.4.11.jar already exists in destination.
> jackson-core-asl-1.9.13.jar already exists in destination.
> jackson-mapper-asl-1.9.13.jar already exists in destination.
> commons-compiler-2.7.6.jar already exists in destination.
> janino-2.7.6.jar already exists in destination.
> stax2-api-3.1.4.jar already exists in destination.
> woodstox-core-asl-4.4.1.jar already exists in destination.
> jetty-continuation-9.4.8.v20171121.jar already exists in destination.
> jetty-deploy-9.4.8.v20171121.jar already exists in destination.
> jetty-http-9.4.8.v20171121.jar already exists in destination.
> jetty-io-9.4.8.v20171121.jar already exists in destination.
> jetty-jmx-9.4.8.v20171121.jar already exists in destination.
> jetty-rewrite-9.4.8.v20171121.jar already exists in destination.
> jetty-security-9.4.8.v20171121.jar already exists in destination.
> jetty-server-9.4.8.v20171121.jar already exists in destination.
> jetty-servlet-9.4.8.v20171121.jar already exists in destination.
> jetty-servlets-9.4.8.v20171121.jar already exists in destination.
> jetty-util-9.4.8.v20171121.jar already exists in destination.
> jetty-webapp-9.4.8.v20171121.jar already exists in destination.
> jetty-xml-9.4.8.v20171121.jar already exists in destination.
> spatial4j-0.7.jar already exists in destination.
> noggit-0.8.jar already exists in destination.
> asm-5.1.jar already exists in destination.
> asm-commons-5.1.jar already exists in destination.
> org.restlet-2.3.0.jar already exists in destination.
> org.restlet.ext.servlet-2.3.0.jar already exists in destination.
> jcl-over-slf4j-1.7.24.jar already exists in destination.
> slf4j-api-1.7.24.jar already exists in destination.
>
