Reference to the code:

.....

String rawContentType = conn.getContentType();
String type = rawContentType.split(";")[0];
if(typeSupported(type) || "*".equals(fileTypes)) {
  String encoding = conn.getContentEncoding();

.....

protected boolean typeSupported(String type) {
  for(String key : mimeMap.keySet()) {
    if(mimeMap.get(key).equals(type)) {
      if(fileTypes.contains(key))
        return true;
    }
  }
  return false;
}

.....

It has another check, against fileTypes. I can see the page you are indexing ends with .md and not .html. Let's hope this is not the issue now.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:

> Kevin,
>
> Just put "html" too and give it a shot. These are the types it is
> expecting:
>
> mimeMap = new HashMap<>();
> mimeMap.put("xml", "application/xml");
> mimeMap.put("csv", "text/csv");
> mimeMap.put("json", "application/json");
> mimeMap.put("jsonl", "application/json");
> mimeMap.put("pdf", "application/pdf");
> mimeMap.put("rtf", "text/rtf");
> mimeMap.put("html", "text/html");
> mimeMap.put("htm", "text/html");
> mimeMap.put("doc", "application/msword");
> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
> mimeMap.put("xls", "application/vnd.ms-excel");
> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("txt", "text/plain");
> mimeMap.put("log", "text/plain");
>
> The keys are the types supported.
>
> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramr...@gmail.com>
> wrote:
>
>> Ah!
>>
>> Only supported type is: text/html; encoding=utf-8
>>
>> I am not confident of this either :) but this should work.
>>
>> See the code-snippet below:
>>
>> ......
>>
>> if(res.httpStatus == 200) {
>>   // Raw content type of form "text/html; encoding=utf-8"
>>   String rawContentType = conn.getContentType();
>>   String type = rawContentType.split(";")[0];
>>   if(typeSupported(type) || "*".equals(fileTypes)) {
>>     String encoding = conn.getContentEncoding();
>>
>> ....
>>
>> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <la...@franz.com> wrote:
>>
>>> Amrit Sarkar wrote:
>>>
>>> >> Strange,
>>> >>
>>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>>> >> Content-Type. Let's see what it says now.
>>>
>>> Same thing.
>>> Verified Content-Type:
>>>
>>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>>> Content-Type: text/html;charset=utf-8
>>> quadra[git:master]$ ]
>>>
>>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>>> SimplePostTool version 5.0.0
>>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>>> Entering recursive mode, depth=10, delay=0s
>>> Entering crawl at level 0 (1 links total, 1 new)
>>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>>> 0 web pages indexed.
>>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>>> Time spent: 0:00:00.531
>>> quadra[git:master]$
>>>
>>> Kevin
>>>
>>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <la...@franz.com> wrote:
>>> >>
>>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>>> >> >
>>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> >> >
>>> >> > What is it expecting?
>>> >> >
>>> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>>> >> > SimplePostTool version 5.0.0
>>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>>> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
>>> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>>> >> > Entering recursive mode, depth=10, delay=0s
>>> >> > Entering crawl at level 0 (1 links total, 1 new)
>>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>>> >> > 0 web pages indexed.
>>> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>>> >> > Time spent: 0:00:03.882
>>> >> > $
>>> >> >
>>> >> > Thanks.
>>> >> >
>>> >> > Kevin
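
For anyone following along, the type check quoted at the top of this thread can be tried in isolation. What follows is only a minimal stand-alone sketch, not the SimplePostTool source: the class name ContentTypeCheck and the helper accepted() are made up for illustration, the map holds just a few of the entries listed in the thread, and fileTypes is assumed to be a plain String, as the "*".equals(fileTypes) comparison suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone class; it only mirrors the logic quoted in the thread.
class ContentTypeCheck {
    // A few of the extension -> MIME type pairs quoted in the thread.
    static final Map<String, String> MIME_MAP = new LinkedHashMap<>();
    static {
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("pdf", "application/pdf");
    }

    // True only if some key maps to this MIME type AND that key appears
    // in the -filetypes value (assumed here to be a String).
    static boolean typeSupported(String type, String fileTypes) {
        for (Map.Entry<String, String> entry : MIME_MAP.entrySet()) {
            if (entry.getValue().equals(type) && fileTypes.contains(entry.getKey())) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: strips parameters such as ";charset=utf-8"
    // and applies the same condition as the quoted if-statement.
    static boolean accepted(String rawContentType, String fileTypes) {
        String type = rawContentType.split(";")[0];
        return typeSupported(type, fileTypes) || "*".equals(fileTypes);
    }

    public static void main(String[] args) {
        String raw = "text/html;charset=utf-8";
        System.out.println(accepted(raw, "md"));      // false
        System.out.println(accepted(raw, "md,html")); // true
        System.out.println(accepted(raw, "*"));       // true
    }
}
```

In this sketch, -filetypes md can never match a text/html response because no key in the map is "md", which is consistent with the "Skipping URL with unsupported type text/html" warning above. Adding html to the list, or passing *, gets past this particular check; whether the page is then indexed by the real tool is not something the sketch can show.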