Re: solr 7.0.1: exception running post to crawl simple website

Kevin Layer Fri, 13 Oct 2017 07:11:36 -0700

Amrit Sarkar wrote:

>> Reference to the code:
>> 
>> .....
>> 
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if(typeSupported(type) || "*".equals(fileTypes)) {
>>   String encoding = conn.getContentEncoding();
>> 
>> .....
>> 
>> protected boolean typeSupported(String type) {
>>   for(String key : mimeMap.keySet()) {
>>     if(mimeMap.get(key).equals(type)) {
>>       if(fileTypes.contains(key))
>>         return true;
>>     }
>>   }
>>   return false;
>> }
>> 
>> .....
>> 
>> It has another check for fileTypes, I can see the page ending with .md
>> (which you are indexing) and not .html. Let's hope now this is not the
>> issue.


Did you see the "-filetypes md" at the end of the post command line?
Shouldn't that handle it?

Kevin
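
For reference, the check quoted above can be tried in isolation. The sketch below is not the SimplePostTool source: the class name TypeCheckSketch, the helper baseType, and passing fileTypes in as a parameter are all made up for illustration, and it assumes fileTypes holds the raw -filetypes string (here "md"). Only four entries of the quoted mimeMap are copied. Under those assumptions, "md" never matches, because "md" is not a key in mimeMap, so no key mapped to text/html is contained in the fileTypes string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TypeCheckSketch {
    // Subset of the mimeMap quoted in this thread (file ending -> MIME type).
    // Note that there is no "md" key.
    static final Map<String, String> MIME_MAP = new LinkedHashMap<>();
    static {
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("log", "text/plain");
    }

    // Mirrors the quoted split: keep only the part before the first ';'.
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0];
    }

    // Same logic as the quoted typeSupported(), except that fileTypes is a
    // parameter here (assumption: it is the -filetypes value as a String).
    static boolean typeSupported(String type, String fileTypes) {
        for (String key : MIME_MAP.keySet()) {
            if (MIME_MAP.get(key).equals(type) && fileTypes.contains(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String type = baseType("text/html;charset=utf-8");
        System.out.println(type);                            // text/html
        System.out.println(typeSupported(type, "md"));       // false
        System.out.println(typeSupported(type, "md,html"));  // true
    }
}
```

If the real tool behaves like this sketch, "-filetypes md" alone would be rejected for a text/html response, while a value that also contains "html" (or "*", per the quoted condition) would pass the check.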

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar <sarkaramr...@gmail.com>
>> wrote:
>> 
>> > Kevin,
>> >
>> > Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > mimeMap = new HashMap<>();
>> > mimeMap.put("xml", "application/xml");
>> > mimeMap.put("csv", "text/csv");
>> > mimeMap.put("json", "application/json");
>> > mimeMap.put("jsonl", "application/json");
>> > mimeMap.put("pdf", "application/pdf");
>> > mimeMap.put("rtf", "text/rtf");
>> > mimeMap.put("html", "text/html");
>> > mimeMap.put("htm", "text/html");
>> > mimeMap.put("doc", "application/msword");
>> > mimeMap.put("docx", 
>> > "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> > mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > mimeMap.put("pptx", 
>> > "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> > mimeMap.put("xls", "application/vnd.ms-excel");
>> > mimeMap.put("xlsx", 
>> > "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("txt", "text/plain");
>> > mimeMap.put("log", "text/plain");
>> >
>> > The keys are the types supported.
>> >
>> >
>> > Amrit Sarkar
>> > Search Engineer
>> > Lucidworks, Inc.
>> > 415-589-9269
>> > www.lucidworks.com
>> > Twitter http://twitter.com/lucidworks
>> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >
>> > On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramr...@gmail.com>
>> > wrote:
>> >
>> >> Ah!
>> >>
>> >> Only supported type is: text/html; encoding=utf-8
>> >>
>> >> I am not confident of this either :) but this should work.
>> >>
>> >> See the code-snippet below:
>> >>
>> >> ......
>> >>
>> >> if(res.httpStatus == 200) {
>> >>   // Raw content type of form "text/html; encoding=utf-8"
>> >>   String rawContentType = conn.getContentType();
>> >>   String type = rawContentType.split(";")[0];
>> >>   if(typeSupported(type) || "*".equals(fileTypes)) {
>> >>     String encoding = conn.getContentEncoding();
>> >>
>> >> ....
>> >>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <la...@franz.com> wrote:
>> >>
>> >>> Amrit Sarkar wrote:
>> >>>
>> >>> >> Strange,
>> >>> >>
>> >>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >>> >> Content-Type. Let's see what it says now.
>> >>>
>> >>> Same thing.  Verified Content-Type:
>> >>>
>> >>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>> >>>   Content-Type: text/html;charset=utf-8
>> >>> quadra[git:master]$ ]
>> >>>
>> >>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >>> SimplePostTool version 5.0.0
>> >>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> >>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> >>> Entering recursive mode, depth=10, delay=0s
>> >>> Entering crawl at level 0 (1 links total, 1 new)
>> >>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> >>> 0 web pages indexed.
>> >>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >>> Time spent: 0:00:00.531
>> >>> quadra[git:master]$
>> >>>
>> >>> Kevin
>> >>>
>> >>> >>
>> >>> >> Amrit Sarkar
>> >>> >> Search Engineer
>> >>> >> Lucidworks, Inc.
>> >>> >> 415-589-9269
>> >>> >> www.lucidworks.com
>> >>> >> Twitter http://twitter.com/lucidworks
>> >>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>> >>
>> >>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <la...@franz.com> wrote:
>> >>> >>
>> >>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >>> >> >
>> >>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >>> >> >
>> >>> >> > What is it expecting?
>> >>> >> >
>> >>> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >>> >> > SimplePostTool version 5.0.0
>> >>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >>> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> >>> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> >>> >> > Entering recursive mode, depth=10, delay=0s
>> >>> >> > Entering crawl at level 0 (1 links total, 1 new)
>> >>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >>> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> >>> >> > 0 web pages indexed.
>> >>> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >>> >> > Time spent: 0:00:03.882
>> >>> >> > $
>> >>> >> >
>> >>> >> > Thanks.
>> >>> >> >
>> >>> >> > Kevin
>> >>> >> >
>> >>>
>> >>
>> >>
>> >
