Amrit Sarkar wrote:

>> Reference to the code:
>>
>> .....
>>
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if(typeSupported(type) || "*".equals(fileTypes)) {
>>   String encoding = conn.getContentEncoding();
>>
>> .....
>>
>> protected boolean typeSupported(String type) {
>>   for(String key : mimeMap.keySet()) {
>>     if(mimeMap.get(key).equals(type)) {
>>       if(fileTypes.contains(key))
>>         return true;
>>     }
>>   }
>>   return false;
>> }
>>
>> .....
>>
>> It has another check for fileTypes, I can see the page ending with .md
>> (which you are indexing) and not .html. Let's hope now this is not the
>> issue.

Did you see the "-filetypes md" at the end of the post command line?
Shouldn't that handle it?

Kevin

>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>>
>> > Kevin,
>> >
>> > Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > mimeMap = new HashMap<>();
>> > mimeMap.put("xml", "application/xml");
>> > mimeMap.put("csv", "text/csv");
>> > mimeMap.put("json", "application/json");
>> > mimeMap.put("jsonl", "application/json");
>> > mimeMap.put("pdf", "application/pdf");
>> > mimeMap.put("rtf", "text/rtf");
>> > mimeMap.put("html", "text/html");
>> > mimeMap.put("htm", "text/html");
>> > mimeMap.put("doc", "application/msword");
>> > mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> > mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> > mimeMap.put("xls", "application/vnd.ms-excel");
>> > mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("txt", "text/plain");
>> > mimeMap.put("log", "text/plain");
>> >
>> > The keys are the types supported.
>> >
>> >
>> > Amrit Sarkar
>> > Search Engineer
>> > Lucidworks, Inc.
>> > 415-589-9269
>> > www.lucidworks.com
>> > Twitter http://twitter.com/lucidworks
>> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >
>> > On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>> >
>> >> Ah!
>> >>
>> >> Only supported type is: text/html; encoding=utf-8
>> >>
>> >> I am not confident of this either :) but this should work.
>> >>
>> >> See the code-snippet below:
>> >>
>> >> ......
>> >>
>> >> if(res.httpStatus == 200) {
>> >>   // Raw content type of form "text/html; encoding=utf-8"
>> >>   String rawContentType = conn.getContentType();
>> >>   String type = rawContentType.split(";")[0];
>> >>   if(typeSupported(type) || "*".equals(fileTypes)) {
>> >>     String encoding = conn.getContentEncoding();
>> >>
>> >> ....
>> >>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <la...@franz.com> wrote:
>> >>
>> >>> Amrit Sarkar wrote:
>> >>>
>> >>> >> Strange,
>> >>> >>
>> >>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >>> >> Content-Type. Let's see what it says now.
>> >>>
>> >>> Same thing.
>> >>> Verified Content-Type:
>> >>>
>> >>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>> >>> Content-Type: text/html;charset=utf-8
>> >>> quadra[git:master]$ ]
>> >>>
>> >>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >>> SimplePostTool version 5.0.0
>> >>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> >>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> >>> Entering recursive mode, depth=10, delay=0s
>> >>> Entering crawl at level 0 (1 links total, 1 new)
>> >>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> >>> 0 web pages indexed.
>> >>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >>> Time spent: 0:00:00.531
>> >>> quadra[git:master]$
>> >>>
>> >>> Kevin
>> >>>
>> >>>
>> >>> >>
>> >>> >> Amrit Sarkar
>> >>> >> Search Engineer
>> >>> >> Lucidworks, Inc.
>> >>> >> 415-589-9269
>> >>> >> www.lucidworks.com
>> >>> >> Twitter http://twitter.com/lucidworks
>> >>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>> >>
>> >>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <la...@franz.com> wrote:
>> >>> >>
>> >>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >>> >> >
>> >>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >>> >> >
>> >>> >> > What is it expecting?
>> >>> >> >
>> >>> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >>> >> > SimplePostTool version 5.0.0
>> >>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >>> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> >>> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> >>> >> > Entering recursive mode, depth=10, delay=0s
>> >>> >> > Entering crawl at level 0 (1 links total, 1 new)
>> >>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >>> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> >>> >> > 0 web pages indexed.
>> >>> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >>> >> > Time spent: 0:00:03.882
>> >>> >> > $
>> >>> >> >
>> >>> >> > Thanks.
>> >>> >> >
>> >>> >> > Kevin
>> >>> >> >
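
For what it's worth, the check quoted in this thread can be tried in isolation. The following is only a minimal sketch, not the actual SimplePostTool: the class name TypeCheckSketch and the helper baseType are made up for illustration, the map holds just a few of the entries listed above, and it assumes fileTypes is the raw string passed via -filetypes. Under those assumptions it suggests why "-filetypes md" alone may not be enough: the loop looks up the keys whose MIME type matches the response (here "html" and "htm" for text/html) and then asks whether fileTypes contains one of those keys, and "md" contains neither.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch of the quoted typeSupported() check; NOT the real SimplePostTool.
class TypeCheckSketch {
    // A few of the extension -> MIME type entries listed in the thread.
    static final Map<String, String> MIME_MAP = new LinkedHashMap<>();
    static {
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("pdf", "application/pdf");
    }

    // Same parsing step as the quoted code: keep only the part before ';'.
    // (baseType is a hypothetical helper name.)
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0];
    }

    // Mirrors the quoted loop. Assumption: fileTypes is the raw -filetypes
    // string, so contains() is a plain substring test.
    static boolean typeSupported(String type, String fileTypes) {
        for (String key : MIME_MAP.keySet()) {
            if (MIME_MAP.get(key).equals(type) && fileTypes.contains(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String type = baseType("text/html;charset=utf-8");
        System.out.println(type);                           // text/html
        System.out.println(typeSupported(type, "md"));      // false
        System.out.println(typeSupported(type, "md,html")); // true
    }
}
```

If the real tool behaves like this sketch, a text/html response would only pass the check when the -filetypes value includes "html" or "htm" (or is "*"), which lines up with the suggestion above to put "html" too. Whether that holds for the 7.0.1 jar in the log has not been verified here.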