Hi Kevin,

Can you post the Solr log in this mail thread? At first glance at the code,
I don't think it handles .md on its own.
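
As a rough sketch of how I read the check (a simplified stand-in, not the
actual SimplePostTool source; the class name TypeCheckSketch, the helper
baseType, and the two-argument typeSupported are hypothetical, and the map
is only a subset of the one quoted below): the Content-Type is cut at the
first ";" and must then match a MIME type whose extension key appears in
-filetypes. Under that assumption "md" can never match, because no key
maps to it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TypeCheckSketch {
    // Subset of the extension -> MIME type map quoted later in this thread.
    static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("pdf", "application/pdf");
    }

    // Drop parameters such as ";charset=utf-8" from a raw Content-Type value.
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0].trim();
    }

    // Assumed rule: a MIME type is accepted only if some extension that maps
    // to it was listed in the comma-separated fileTypes argument.
    static boolean typeSupported(String type, String fileTypes) {
        Set<String> requested = new HashSet<>(Arrays.asList(fileTypes.split(",")));
        for (Map.Entry<String, String> entry : MIME_MAP.entrySet()) {
            if (entry.getValue().equals(type) && requested.contains(entry.getKey())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String type = baseType("text/html;charset=utf-8");
        System.out.println(typeSupported(type, "md"));      // false: "md" is not a key
        System.out.println(typeSupported(type, "md,html")); // true: "html" maps to text/html
    }
}
```

If the real check behaves like this, listing html in -filetypes ought to be
enough, so the fact that it still fails is why the log would help.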

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer <la...@franz.com> wrote:

> Amrit Sarkar wrote:
>
> >> Kevin,
> >>
> >> Just put "html" too and give it a shot. These are the types it is
> >> expecting:
>
> Same thing.
>
> >>
> >> mimeMap = new HashMap<>();
> >> mimeMap.put("xml", "application/xml");
> >> mimeMap.put("csv", "text/csv");
> >> mimeMap.put("json", "application/json");
> >> mimeMap.put("jsonl", "application/json");
> >> mimeMap.put("pdf", "application/pdf");
> >> mimeMap.put("rtf", "text/rtf");
> >> mimeMap.put("html", "text/html");
> >> mimeMap.put("htm", "text/html");
> >> mimeMap.put("doc", "application/msword");
> >> mimeMap.put("docx",
> >> "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
> >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> >> mimeMap.put("pptx",
> >> "application/vnd.openxmlformats-officedocument.presentationml.presentation");
> >> mimeMap.put("xls", "application/vnd.ms-excel");
> >> mimeMap.put("xlsx",
> >> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> >> mimeMap.put("txt", "text/plain");
> >> mimeMap.put("log", "text/plain");
> >>
> >> The keys are the types supported.
> >>
> >>
> >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramr...@gmail.com>
> >> wrote:
> >>
> >> > Ah!
> >> >
> >> > Only supported type is: text/html; encoding=utf-8
> >> >
> >> > I am not confident of this either :) but this should work.
> >> >
> >> > See the code-snippet below:
> >> >
> >> > ......
> >> >
> >> > if(res.httpStatus == 200) {
> >> >   // Raw content type of form "text/html; encoding=utf-8"
> >> >   String rawContentType = conn.getContentType();
> >> >   String type = rawContentType.split(";")[0];
> >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
> >> >     String encoding = conn.getContentEncoding();
> >> >
> >> > ....
> >> >
> >> >
> >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <la...@franz.com> wrote:
> >> >
> >> >> Amrit Sarkar wrote:
> >> >>
> >> >> >> Strange,
> >> >> >>
> >> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
> >> >> >> page's Content-Type. Let's see what it says now.
> >> >>
> >> >> Same thing.  Verified Content-Type:
> >> >>
> >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
> >> >>   Content-Type: text/html;charset=utf-8
> >> >> quadra[git:master]$
> >> >>
> >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> >> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> >> SimplePostTool version 5.0.0
> >> >> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> >> >> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> >> >> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> >> >> Entering recursive mode, depth=10, delay=0s
> >> >> Entering crawl at level 0 (1 links total, 1 new)
> >> >> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> >> >> 0 web pages indexed.
> >> >> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> >> >> Time spent: 0:00:00.531
> >> >> quadra[git:master]$
> >> >>
> >> >> Kevin
> >> >>
> >> >> >>
> >> >> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <la...@franz.com> wrote:
> >> >> >>
> >> >> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
> >> >> >> >
> >> >> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >> >> >
> >> >> >> > What is it expecting?
> >> >> >> >
> >> >> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> >> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> >> >> > SimplePostTool version 5.0.0
> >> >> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> >> >> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
> >> >> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> >> >> >> > Entering recursive mode, depth=10, delay=0s
> >> >> >> > Entering crawl at level 0 (1 links total, 1 new)
> >> >> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> >> >> >> > 0 web pages indexed.
> >> >> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> >> >> >> > Time spent: 0:00:03.882
> >> >> >> > $
> >> >> >> >
> >> >> >> > Thanks.
> >> >> >> >
> >> >> >> > Kevin
> >> >> >> >
> >> >>
> >> >
> >> >
>
