RE: Solr 6.4. Can't index MS Visio vsdx files
Great Tim. What do I need to do to integrate it into my current installation?

On May 31, 2017 16:24, "Allison, Timothy B." wrote:

Apache Tika 1.15 is now available.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 7:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Probably better to ask on the Tika list.

We'll push the release asap after PDFBox 2.0.6 is out. Andreas plans to cut the release candidate for PDFBox this Friday. Tika will probably have an RC by Monday 5/15, with the release happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,
Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and 2.0.6-SNAPSHOT on ~500k pdfs, see: http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz

-----Original Message-----
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Is there any news regarding Tika 1.15? Maybe it's already available for download somewhere?

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. wrote:

> The release candidate for POI was just cut...unfortunately, I think after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening that!
>
> That'll be done within a week unless there are surprises. Once that's out, I have to update a few things, but I'd think we'd have a candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the latest/latest running, and thank you, again, for opening the issue on POI's Bugzilla.
>
> Best,
>
> Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> When will 1.15 be released? Maybe you have some beta version and I could test it :)
>
> SAX sounds interesting, and from the info that I found on Google it could solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. wrote:
>
> > It depends. We've been trying to make parsers more, erm, flexible, but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer. :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get people up and running with Solr easily but it is not really a great idea for production. See Erick's gem: https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause the ingesting process to crash. At most, it should fail at the file level and not cause greater havoc. In practice, if you're processing millions of files from the wild, you'll run into bad behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file, Tika should catch it and keep going with the rest of the file. If this doesn't happen let us know! We are aware that some types of embedded file stream problems were causing parse failures on the entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't let them percolate up through the parent file (they're reported in the metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I thought we used to catch those in docx and log them. I haven't been able to track this down, though. I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 'PolylineTo' ", this problem might go away if we implemented a pure SAX parser for vsdx. We just did this for docx and pptx (coming in 1.15) and these are more robust to variation because they aren't requiring a match with the ooxml schema. I haven't looked much at vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would require contributions. However, I agree that POI shouldn't throw a Runtime exception. Perhaps open an issue in POI, or maybe we should catch this special example at the Tika level?
> >
> > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team _might_ be able to modify the parser to ignore a stream if there's an exception, but that's often a sign that something needs to be fixed with the parser. In short, the solution will come from POI.
> >
> > Best,
> >
> > Tim
> >
> > -----Original Message-----
> > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > Sent: Tuesd
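
Tim's advice earlier in this thread, to defend against permanent hangs and OOM when parsing files from the wild, could be approached by running each extraction in its own process under a wall-clock limit. The following is only a minimal sketch, not anything prescribed in the thread: the function name parse_with_limit is hypothetical, it assumes GNU coreutils `timeout` is installed, and the jar path in the usage comment is an assumption about where tika-app lives.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Tika or Solr): run an extractor command
# on one file under a wall-clock limit, so a hung parser cannot stall a batch.
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit is hit.
parse_with_limit() {
  local limit="$1" file="$2" status
  shift 2
  # "$@" is the extractor command; the file path is appended as its last argument.
  if timeout "$limit" "$@" "$file"; then
    echo "ok: $file"
  else
    status=$?
    if [ "$status" -eq 124 ]; then
      echo "timeout: $file" >&2
    else
      echo "failed ($status): $file" >&2
    fi
    return "$status"
  fi
}

# Example call (jar location is an assumption; -Xmx caps the heap of the
# child JVM so an OOM stays confined to that one file):
#   parse_with_limit 60 drawing.vsdx java -Xmx512m -jar tika-app-1.15.jar --text
```

Because each file gets its own short-lived process, a crash or hang is reported at the file level and the surrounding loop can carry on with the next file.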
can't start node in cloud mode
Hi

Running bin/solr start does not start up in cloud mode despite having ZK_HOST set in /etc/default/solr.in.sh.

Running:
openjdk 1.8
solr 6.5.1 on aws linux
zookeeper 3.4.6 on aws linux (3 node ensemble)

Logs look clean in both zookeeper and solr.

Running bin/solr zk ls / returns

Connecting to ZooKeeper at ec2-xxx:2181,ec2-xxx:2181,ec2-xxx:2181/solr ...
Getting listing for Zookeeper node / from ZooKeeper at ec2-34-196-159-23.compute-1.amazonaws.com:2181,ec2-34-196-204-202.compute-1.amazonaws.com:2181,ec2-34-196-212-108.compute-1.amazonaws.com:2181/solr recurse: false
configs
solr.xml

This suggests that ZK_HOST is being picked up and the connection with the zk ensemble looks ok.

Running bin/solr start returns

Waiting up to 180 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=30467). Happy searching!

but the status suggests we are in standalone core mode

bin/solr status

Found 1 Solr nodes:

Solr process 30467 running on port 8983
{
"solr_home":"/var/solr/data",
"version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - jimczi - 2017-04-21 12:23:42",
"startTime":"2017-06-03T15:02:53.995Z",
"uptime":"0 days, 0 hours, 1 minutes, 5 seconds",
"memory":"23 MB (%4.7) of 490.7 MB"}

and indeed trying to create a collection fails

bin/solr create_collection -c destinations -n destinations

ERROR: Solr at http://localhost:8983/solr is running in standalone server mode, please use the create_core command instead;
create_collection can only be used when running in SolrCloud mode.

Forcing the issue with bin/solr start -c

or even
bin/solr start-c -z xxx/solr makes no difference.

I'm sure I'm missing something obvious in the solr.in.sh but can't find anything so far.

Lee C
Re: can't start node in cloud mode
Specifying ZK_HOST in the solr.in.sh file works fine for me.

bq: running bin/solr start does not start up in cloud mode despite having ZK_HOST set in /etc/default/solr.in.sh.

I don't think the startup script would even look in /etc/default for solr.in.sh unless you've defined a "HOME" env variable pointing there. And even then it needs to be a hidden file (the "." in $HOME/.solr.in.sh). See the "CONTROLLING STARTUP" section in bin/solr:

# the following locations are searched in this order:
#
# ./
# $HOME/.solr.in.sh
# /usr/share/solr
# /usr/local/share/solr
# /var/solr/
# /opt/solr

(plus some other options...)

FYI:

bq: "bin/solr start-c -z xxx/solr makes no difference."

Unless you've made a typo the -z parameter requires the ZK ensemble, not a Solr node. The -c is unnecessary when -z is specified BTW. The -c will start an _internal_ zookeeper in the absence of a -z parameter.

Best,
Erick
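
The "first match wins" search order Erick quotes from bin/solr can be illustrated with a small sketch. This is not the actual logic of the startup script: the function name find_solr_include is hypothetical, and the assumption that each listed directory is checked for a file named solr.in.sh is mine, not something stated in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from bin/solr): print the first readable
# include file among the candidate paths given as arguments.
find_solr_include() {
  local candidate
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidates in the order of the comment block quoted above. The solr.in.sh
# file name inside each directory is an assumption made for this sketch.
find_solr_include \
  "./solr.in.sh" \
  "$HOME/.solr.in.sh" \
  "/usr/share/solr/solr.in.sh" \
  "/usr/local/share/solr/solr.in.sh" \
  "/var/solr/solr.in.sh" \
  "/opt/solr/solr.in.sh" \
  || echo "no include file found in the listed locations"
```

Running something like this on the node shows which include file, if any, would shadow the others, which helps when ZK_HOST is set in one file but a different one is actually read.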
Re: can't start node in cloud mode
Thanks for your response, Erick. I've found the issue. The setup was fine. I had a dirty solr.xml in zookeeper. Once corrected, everything is fine.

Cheers,
Lee C