Ah! Only supported type is: text/html; encoding=utf-8
I am not confident of this either :) but this should work.

See the code-snippet below:

......
        if(res.httpStatus == 200) {
          // Raw content type of form "text/html; encoding=utf-8"
          String rawContentType = conn.getContentType();
          String type = rawContentType.split(";")[0];
          if(typeSupported(type) || "*".equals(fileTypes)) {
            String encoding = conn.getContentEncoding();
....

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <la...@franz.com> wrote:

> Amrit Sarkar wrote:
>
> >> Strange,
> >>
> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
> >> Content-Type. Let's see what it says now.
>
> Same thing. Verified Content-Type:
>
> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>   Content-Type: text/html;charset=utf-8
> quadra[git:master]$ ]
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> 0 web pages indexed.
> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> Time spent: 0:00:00.531
> quadra[git:master]$
>
> Kevin
>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <la...@franz.com> wrote:
> >>
> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
> >> >
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >
> >> > What is it expecting?
> >> >
> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> > SimplePostTool version 5.0.0
> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> >> > Entering recursive mode, depth=10, delay=0s
> >> > Entering crawl at level 0 (1 links total, 1 new)
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> >> > 0 web pages indexed.
> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> >> > Time spent: 0:00:03.882
> >> > $
> >> >
> >> > Thanks.
> >> >
> >> > Kevin
> >> >
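
PS: a rough, standalone sketch of the check in the snippet at the top of this mail, in case it helps to experiment outside Solr. Only the split(";")[0] step and the `typeSupported(type) || "*".equals(fileTypes)` condition come from that snippet. The class name, the small extension-to-MIME table and the body of typeSupported() are my assumptions for illustration, not the actual SimplePostTool code, and the trim() is an addition the snippet does not have.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a stand-in for the content-type check, not SimplePostTool itself.
class ContentTypeSketch {

    // Hypothetical extension-to-MIME table (assumption, not copied from Solr).
    static final Map<String, String> MIME_BY_EXTENSION = new LinkedHashMap<>();
    static {
        MIME_BY_EXTENSION.put("html", "text/html");
        MIME_BY_EXTENSION.put("htm", "text/html");
        MIME_BY_EXTENSION.put("txt", "text/plain");
        MIME_BY_EXTENSION.put("xml", "application/xml");
    }

    // Same idea as the snippet: keep only the part before the first ';'.
    // trim() is an addition so that "text/html ; x=y" also reduces to "text/html".
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0].trim();
    }

    // Hypothetical typeSupported(): a type counts as supported when one of the
    // requested file endings maps to it in the table above.
    static boolean typeSupported(String type, String fileTypes) {
        for (String ext : fileTypes.split(",")) {
            if (type.equals(MIME_BY_EXTENSION.get(ext.trim()))) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the condition in the snippet: supported type, or "*" as file types.
    static boolean accepted(String rawContentType, String fileTypes) {
        String type = baseType(rawContentType);
        return typeSupported(type, fileTypes) || "*".equals(fileTypes);
    }

    public static void main(String[] args) {
        String[] headers = {"text/html;charset=utf-8", "text/html; encoding=utf-8"};
        for (String header : headers) {
            System.out.println(header + " -> " + baseType(header));
        }
        for (String fileTypes : new String[] {"md", "html", "*"}) {
            System.out.println("filetypes=" + fileTypes + " accepted="
                    + accepted("text/html;charset=utf-8", fileTypes));
        }
    }
}
```

Both header spellings reduce to the same base type here, so under these assumptions the charset/encoding parameter would not be what decides the outcome; the file types passed in would. Whether the real tool behaves the same way for -filetypes md is not something this sketch can confirm.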