Simple piece of code. Had been working earlier (though against a 6.4.2
instance).
ConcurrentUpdateSolrClient solr =
    new ConcurrentUpdateSolrClient("http://myhost:8983/solr", 10, 2);
try {
    solr.deleteByQuery("*:*");
    solr.commit();
} catch (SolrServerException
You need to include the core name in the path.
Tomás
Sent from my iPhone
> On Jun 5, 2017, at 9:08 PM, Phil Scadden wrote:
>
> Simple piece of code. Had been working earlier (though against a 6.4.2
> instance).
>
> ConcurrentUpdateSolrClient solr = new
> ConcurrentUpdateSolrClient("htt
We have important entities referenced in indexed documents which have a naming
convention of geographicname-number, e.g. Wainui-8.
I want the tokenizer to treat it as Wainui-8 when indexing, and when I search I
want a q of Wainui-8 (must it be specified as Wainui\-8?) to return docs
with Wainui-8.
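On the query side, SolrJ can do the escaping rather than hand-writing Wainui\-8;
a small sketch (this does not change how the field is tokenized at index time):
    import org.apache.solr.client.solrj.util.ClientUtils;

    String escaped = ClientUtils.escapeQueryChars("Wainui-8"); // yields Wainui\-8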
Do a search with:
fl=id,title,datasource&hl=true&hl.method=unified&limit=50&page=1&q=pressure+AND+testing&rows=50&start=0&wt=json
and I get back a good list of documents. However, some documents are returning
empty fields in the highlighter, e.g. in the highlight array I have:
"W:\\Reports\\OCR\\427
want to store that information anyway since it's usually the
destination of copyField directives and you'd highlight _those_ fields.
Best,
Erick
On Thu, Jun 8, 2017 at 8:37 PM, Phil Scadden wrote:
> Do a search with:
> fl=id,title,datasource&hl=true&hl.method=unified&li
field doesn't. At least in the default
schemas, stored is set to false for the catch-all field.
And you don't want to store that information anyway since it's usually the
destination of copyField directives and you'd highlight _those_ fields.
Best,
Erick
On Thu, Jun 8, 2017 a
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Saturday, 10 June 2017 12:43 a.m.
To: solr-user@lucene.apache.org
Subject: Re: including a minus sign "-" in the token
On 6/8/2017 8:39 PM, Phil Scadden wrote:
> We have important entities referenced in indexed documents which have
> convention
.
To: Phil Scadden
Subject: Re: including a minus sign "-" in the token
On 6/9/2017 8:12 PM, Phil Scadden wrote:
> So, the field I am using for search has type of:
>positionIncrementGap="100" multiValued="true">
, 2017 at 9:58 PM Phil Scadden wrote:
> Tried hard to find difference between pdfs returning no highlighter
> and ones that do for same search term. Includes pdfs that have been
> OCRed and ones that were text to begin with. Head scratching to me.
>
> -Original Message-
> Fr
If I try
/getsolr?fl=id,title,datasource,score&hl=true&hl.maxAnalyzedChars=9000&hl.method=unified&q=Wainui-1&q.op=AND&wt=csv
The response I get is:
id,title,datasource,score
W:\PR_Reports\OCR\PR869.pdf,,Petroleum Reports,8.233313
W:\PR_Reports\OCR\PR3440.pdf,,Petroleum Reports,8.217836
W:\PR_
Just had a similar issue - works for some, not others. The first thing to look
at is hl.maxAnalyzedChars in the query. The default is quite small.
Since many of my documents are large PDF files, I opted to use
storeOffsetsWithPositions="true" termVectors="true" on the field I was
searching on.
This ce
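A sketch of raising that cap from SolrJ; hl.maxAnalyzedChars defaults to 51200
characters, which a large OCRed PDF easily exceeds (the value below is just
illustrative):
    SolrQuery q = new SolrQuery("pressure AND testing");
    q.setHighlight(true);
    q.set("hl.method", "unified");
    q.set("hl.maxAnalyzedChars", "500000");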
output? What do you get going directly to
Solr's endpoint?
Erik
> On Jun 14, 2017, at 22:13, Phil Scadden wrote:
>
> If I try
> /getsolr?
> fl=id,title,datasource,score&hl=true&hl.maxAnalyzedChars=9000&hl.method=unified&q=Wainui-1&q.op=AND&
http - however, the big advantage of doing your indexing on a different machine
is that the heavy lifting that Tika does in extracting text from documents,
finding metadata etc. is not happening on the server. If the indexer crashes, it
doesn't affect Solr either.
-Original Message-
From:
The simplest suggestion is get rid of the stop word filter. I've seen people
here comment that it is not worth it for the amount of space it saves.
-Original Message-
From: shamik [mailto:sham...@gmail.com]
Sent: Friday, 21 July 2017 9:49 a.m.
To: solr-user@lucene.apache.org
Subject: Re:
Am I correct in assuming that you have the problem searching only when there is
a hyphen in your indexed text? If so, then it would suggest that you need to
use a different tokenizer when indexing - it looks like the hyphen is removed
and the words on each side are concatenated - hence the need for both terms
Further to that. What results do you get when you put those indexed terms into
the Analysis tool on the Solr UI?
-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Tuesday, 1 August 2017 9:06 a.m.
To: solr-user@lucene.apache.org
Subject: RE: Arabic words search in
:58 a.m.
To: solr-user@lucene.apache.org
Subject: RE: Arabic words search in solr
Hi Phil Scadden,
Thank you for your reply,
We tried your suggested solution of removing the hyphen while indexing, but it
was giving wrong results. I was searching for "شرطة ازكي" and it was showing me
When I am putting PDF documents and rows from a table into the same index, I
create a "dataSource" field to identify the source, and I don't copy database
fields - only index them - apart from the unique key, which is stored as
"document". On search, you process the output before passing it to the user. If
Perhaps there is potential to optimize with some PL/SQL functions on the Oracle
side to do as much work within the database as possible and have the text
indexers only access a view referencing that function. Also, the obvious
optimization is a record-updated timestamp so that every time the indexer runs, on
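A sketch of the timestamp idea, with hypothetical table and column names - each
indexer run only touches rows changed since the previous run:
    // lastRun is persisted from the previous indexer run (hypothetical)
    PreparedStatement ps = conn.prepareStatement(
            "SELECT id, body FROM docs WHERE updated_at > ?");
    ps.setTimestamp(1, lastRun);
    ResultSet changed = ps.executeQuery();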
I am slowly moving 6.5.1 from development to production. After installing Solr
on the final test machine, I tried to supply a core by zipping up the data
directory on development and unzipping it on test.
When I go to admin I get:
[inline screenshot of the admin error omitted]
Write.lock obviously causing a
SOLR_HOME is /var/www/solr/data
The zip was actually the entire data directory which also included configsets.
And yes, core.properties is in /var/www/solr/data/prindex (it just has the
single line name=prindex in it). No other cores are present.
The data directory should have been unzipped before the so
5 seems a reasonable limit to me. After that, revert to slow.
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 2 September 2017 12:01 p.m.
To: solr-user
Subject: Re: query with wild card with AND taking lot of time
How far would you take that? Say y
mind you.
Best,
Erick
On Thu, Aug 24, 2017 at 9:02 PM, Phil Scadden wrote:
> SOLR_HOME is /var/www/solr/data
> The zip was actually the entire data directory which also included
> configsets. And yes core.properties is in var/www/solr/data/prindex (just has
> single line name=prind
I attempted to redo an index job. The delete query worked fine, but on
reindex I get this:
09:42:51,061 ERROR ConcurrentUpdateSolrClient:463 - error
org.apache.solr.common.SolrException: Bad Request
request: http://online-uat:8983/solr/prindex/update?wt=javabin&version=2
at
C for field". Beats me
what it expects for values in document.addField(...), but changing the field
type from Long to Int fixed it.
-Original Message-----
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Sunday, 24 September 2017 4:35 p.m.
To: solr-user@lucene.apache.org
Subject: S
I ran into a problem with indexing documents which I worked around by changing
the data type, but I am curious as to how the setup could be made to work.
Solr 6.5.1 - Field type Long, multivalued false, DocValues.
In indexing with Solr, I set the value of field with:
Long accessLevel
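A hedged reconstruction of the failing pattern (the message is cut off above);
the helper is hypothetical, but the field setup matches the description:
    Long accessLevel = getAccessLevel(file); // hypothetical helper
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("accessLevel", accessLevel); // rejected with the Long field type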
changed after some
documents being indexed.
Thanks,
Emir
> On 25 Sep 2017, at 23:42, Phil Scadden wrote:
>
> I ran into a problem with indexing documents which I worked around by
> changing data type, but I am curious as to how the setup could be made to
> work.
>
> Solr 6.
} catch (Exception ex) {
}
// start the index rebuild
-Original Message-----
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Wednesday, 27 September 2017 10:04 a.m.
To: solr-user@lucene.apache.org
Subject: RE: DocValues, Long and SolrJ
I get it after I have deleted the index with a delete query and start
Now that I have a big chunk of documents indexed with Solr, I am looking to
see whether I can try some machine learning tools to try to extract
bibliographic references out of the documents. Anyone got some recommendations
about which kits might be good to play with for something like this?
No
While Solr is behind a firewall, I want to now move to a secured Solr
environment. I had been hoping to keep SolrJ out of the picture and just use
HttpURLConnection. However, I also don't want to maintain session state,
preferring to send authentication with every request. Is this possible wit
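A sketch of what is being asked for, assuming plain Basic auth is enabled on the
server: credentials sent preemptively on every request, no session state (URL
and credentials are placeholders):
    URL url = new URL("http://localhost:8983/solr/prindex/select?q=*:*&wt=json");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String auth = Base64.getEncoder()
            .encodeToString("user:pass".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + auth);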
Subject: Re: Stateless queries to secured SOLR server.
On 10/29/2017 6:13 PM, Phil Scadden wrote:
> While SOLR is behind a firewall, I want to now move to a secured SOLR
> environment. I had been hoping to keep SOLRJ out of the picture and just
> using httpURLConnection. However, I also don't wa
To: solr-user@lucene.apache.org
Subject: Re: Stateless queries to secured SOLR server.
On 10/31/2017 2:08 PM, Phil Scadden wrote:
> Thanks Shawn. I have done it with SolrJ. Apart from needing the
> NoopResponseParser to handle the wt=, it was pretty painless.
This is confusing to me, because with
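A sketch of the NoOpResponseParser trick mentioned above: it hands back the raw
response body so a proxy can forward Solr's output untouched instead of
re-serialising SolrJ objects:
    QueryRequest req = new QueryRequest(params);
    NoOpResponseParser rawParser = new NoOpResponseParser();
    rawParser.setWriterType("json");
    req.setResponseParser(rawParser);
    NamedList<Object> result = client.request(req);
    String rawJson = (String) result.get("response");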
The SolrJ QueryRequest object has a method to set basic authorization
username/password, but what is the equivalent way to pass authorization when
you are adding new documents to an index?
ConcurrentUpdateSolrClient solr = new
ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2);
...
45)
14:52:46,224 DEBUG ConcurrentUpdateSolrClient:210 - finished:
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6eeba4a
Even more puzzling. Authentication is set. What is the invalid version bit?? I
think my solrj is 6.4.1; the server is 6.6.2. Do these have to match exactly??
-Original Message-
Fr
t;name":"read",
"role":"guest"}],
"user-role":{"solrAdmin":["admin","guest"],"solrGuest":"guest"}}}
It looks like I should be able to add.
this one worked to delete the entire index:
UpdateRequest up = ne
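A hedged completion of that delete (the message is truncated above), with the
credentials attached per request - the user name is from the security.json
fragment, the password is a placeholder:
    UpdateRequest up = new UpdateRequest();
    up.deleteByQuery("*:*");
    up.setBasicAuthCredentials("solrAdmin", "password");
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    up.process(solr);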
]
Sent: Thursday, 2 November 2017 3:13 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Stateless queries to secured SOLR server.
On 11/1/2017 4:22 PM, Phil Scadden wrote:
> Except that I am using solrj in an intermediary proxy and passing the
> response directly to a javascript client. It is
Subject: Re: adding documents to a secured solr server.
On 11/1/2017 8:13 PM, Phil Scadden wrote:
> 14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner:
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6e
> eba4a
> 14:52:46,224 WARN ConcurrentUpdateSolrClient:343
access to the server.
This is a frustrating problem.
-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, 2 November 2017 3:55 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.
On 11/1/2017 8:13 PM, Phil Scadden wrot
Requested a reload and now it indexes with the secure server using HttpSolrClient.
Phew. I will now look to see if I can optimize and get ConcurrentUpdateSolrClient
to work.
At least I can get the index back now.
-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Thursday, 2
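A sketch of the combination that worked here, as described above: a plain
HttpSolrClient with credentials attached to each request (URL and password are
placeholders):
    HttpSolrClient solr =
            new HttpSolrClient.Builder("http://online-uat:8983/solr/prindex").build();
    UpdateRequest up = new UpdateRequest();
    up.add(docs);
    up.setBasicAuthCredentials("solrAdmin", "password");
    up.process(solr);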
Yes, that worked.
-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, 2 November 2017 6:14 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.
On 11/1/2017 10:04 PM, Phil Scadden wrote:
> For testing, I changed
I have two different document stores that I want to index. Both are quite small
(<50,000 documents, though documents can be quite large). They are quite capable
of using the same schema, but you would not want to search both simultaneously.
I can see two approaches to handling this case.
1/ Create a separate core for each store. 2/ Use a single core with a field to
distinguish the stores, filtered at query time (see the sketch below).
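A sketch of approach 2/, reusing the dataSource-style discriminator field
mentioned earlier in this digest (the field name and value are assumptions):
    SolrQuery q = new SolrQuery("pressure AND testing");
    q.addFilterQuery("datasource:\"Petroleum Reports\"");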
>You'll have a few economies of scale I think with a single core, but frankly I
>don't know if they'd be enough to measure. You say the docs are "quite large"
>though, are you talking books? Magazine articles? Is 20K large, or are they 20M?
Technical reports. Sometimes up to 200MB PDFs, but that
If you are using bare NOW in your clauses for, say, ranges,
one common construct is fq=date:[NOW-1DAY TO NOW]. Here's another blog on the
subject:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
Best,
Erick
On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden wrote:
>>Y
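A sketch of the cache-friendly variant from the linked article: rounding NOW
(e.g. to the day) keeps the filter string identical across requests, so Solr can
reuse it from the filter cache instead of recomputing it:
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("date:[NOW/DAY-1DAY TO NOW/DAY+1DAY]");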
get
some advantage from having more data points about the “text” and “title” fields.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Dec 4, 2017, at 7:17 PM, Phil Scadden wrote:
>
> Thanks Eric. I have already followed the solrj indexing ver
I am indexing PDFs, and a separate process has converted any image PDFs to
searchable PDFs before Solr gets near them. I notice that Tika is very slow at
parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika down),
just the text. Has anyone used an alternative PDF text extraction
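One alternative worth trying - an assumption on my part, not something settled
in this thread - is calling PDFBox (the library Tika uses for PDFs) directly and
skipping the metadata machinery:
    try (PDDocument pdf = PDDocument.load(new File("report.pdf"))) {
        String text = new PDFTextStripper().getText(pdf);
    }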
between
> letters in words (in the body text) should be allowed and still
> consider it a single word. I'm not quite sure how to prove that, but
> I'd be willing to make a bet ;)
>
> Erick
>
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden wrote:
>> I am indexin
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solr.request(up);
All the logging is generated by the last line. I don't have any httpclient.wire
lines in my log4j.properties (I presume these come from httpclient.wire). What
do I do to turn this off?
Phil Scadden,
The logging is coming from my application, which is running in Tomcat. Solr
itself is running in the embedded Jetty.
And yes, another look at log4j and I see that rootLogger is set to DEBUG.
I've changed that.
>On the Solr server side, the 6.4.x versions have a bug that causes extremely
>high
>Another side issue: Using the extracting handler for handling rich documents
>is discouraged. Tika (which is what is used by the extracting
>handler) is pretty amazing software, but it has a habit of crashing or
>consuming all the heap memory when it encounters a document that it doesn't
>>k
Belay that. I found out why the parser was just returning empty data - I didn't
have the right artefact in Maven. In case anyone else trips on this:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.12</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
Got it all working with Tika and SolrJ. (Got the correct artifacts). Much
faster now too which is good. Thanks very much for your help.
Given the known issues with 6.4.1 and no release date for 6.4.2, is 6.3.0 the
best recommendation for a production version of Solr? Hoping to go to
production in the first week of April.
I would second that the guide could be clearer on that. I read and reread it
several times trying to get my head around the schema.xml/managed-schema bit. I
came away from a first cursory reading with the idea that managed-schema was
mostly for schemaless mode, and only after some stuff-ups and puzzling
>The first advise is NOT to expose your Solr directly to the public.
>Anyone that can hit /search, can also hit /update and wipe out your index.
I would second that too. We have never exposed Solr and I also sanitise queries
in the proxy.
What we are suggesting is that your browser does NOT access Solr directly at
all. In fact, configure the firewall so that Solr is unreachable from outside
the server. Instead, you write a proxy in your site application which calls
Solr. I.e. a server-to-server call instead of a browser-to-server one. This
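A minimal sketch of such a proxy endpoint - names, path and fields are
placeholders, and real code would sanitise q and check the user's authorities as
described above:
    @WebServlet("/getsolr")
    public class SolrProxy extends HttpServlet {
        private final SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/prindex").build();

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            SolrQuery q = new SolrQuery(req.getParameter("q")); // sanitise first!
            try {
                resp.setContentType("application/json");
                // serialise properly in real code; this is just for the sketch
                resp.getWriter().print(solr.query(q).getResponse());
            } catch (SolrServerException e) {
                resp.sendError(502, "Solr unavailable");
            }
        }
    }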
I have added a signature field to the schema and set up the dedupe handler in
solrconfig.xml as per the docs; however, the docs say:
"Be sure to change your update handlers to use the defined chain, as below:"
Umm, WHERE do you change the update handler to use the defined chain? Is this
in one of the config XMLs or
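The docs mean adding an update.chain default to the /update request handler in
solrconfig.xml. If you index through SolrJ, one way to avoid touching the
handler at all is to select the chain per request ("dedupe" assumed to be the
name you gave your chain):
    UpdateRequest up = new UpdateRequest();
    up.setParam("update.chain", "dedupe");
    up.add(doc);
    up.process(solr);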
The admin GUI displays the time of the last commit to a core, but how can this
be queried from within SolrJ?
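One way to get at it (an assumption, not confirmed in this thread): the Luke
handler reports index information including a lastModified timestamp, which is
what the admin UI shows:
    LukeRequest luke = new LukeRequest();
    LukeResponse rsp = luke.process(solr);
    Object lastModified = rsp.getIndexInfo().get("lastModified");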
While building directly into Solr might be appealing, I would argue that it is
best to use OCR software first, outside of SOLR, to convert the PDF into
"searchable" PDF format. That way when the document is retrieved, it is a lot
more useful to the searcher - making it easy to find the text with
Only by 10? You must have quite small documents. OCR is an extremely expensive
process; indexing is trivial by comparison. For the quite large documents I am
working with, OCR can be 100 times slower than indexing a PDF that is searchable
(text extractable without OCR).
-Original Message-
From:
Well, I haven't had to deal with a problem that size, but it seems to me that
you have little alternative except to throw more computer hardware at it. For
the job I did, I OCRed to convert PDFs to searchable PDFs outside the indexing
workflow. I used the pdftotext utility to extract text from PDFs. If t
Yes, that would seem an accurate assessment of the problem.
-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Thursday, 30 March 2017 4:53 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significantly with OCR
Thanks for your reply.
Look up highlighting. http://wiki.apache.org/solr/HighlightingParameters
java.lang.Thread.run(Thread.java:619)
--
Phil Scadden, Senior Scientist GNS Science Ltd 764 Cumberland St,
Private Bag 1930, Dunedin, New Zealand Ph +64 3 4799663, fax +64 3 477 5232
I am a new user and I have Solr installed. I can use the admin page and
query the example data.
However, I was using Nutch to load the index with intranet web pages, and I
got this message.
SolrIndexer: starting at 2011-08-12 16:52:44
org.apache.solr.client.solrj.SolrServerException:
java.net.ConnectE
I will second the SolrJ method. You don't want to be doing this on your Solr
instance. One question is whether your PDFs are scanned or already
searchable. I use Tesseract offline to convert all scanned PDFs into searchable
PDFs, so I don't want Tika doing that. The core of my code is:
I would strongly consider OCR offline, BEFORE loading the documents into Solr.
The advantage of this is that you convert your OCRed PDF into searchable PDF.
Consider someone using Solr and they have found a document that matches their
search criteria. Once they retrieve the document, they will
As per Erick's advice, I would strongly recommend that you do anything
Tika-related in a separate SolrJ program. You do not want to have your Solr
instance processing via Tika.
-Original Message-
From: Tannen, Lev (USAEO) [Contractor]
Sent: Wednesday, 20 March 2019 08:17
To: solr-user@lucene.a
do not want to have your Solr instance
processing via Tika”? If that’s a bad design choice please elaborate.
Thanks,
Geoff
> On Mar 19, 2019, at 5:15 PM, Phil Scadden wrote:
>
> As per Erick advice, I would strongly recommend that you do anything tika in
> a separate solrj prog
I always filter Solr requests via a proxy (so Solr itself is not exposed
directly to the web). In that proxy, the query parameters can be broken down
and filtered as desired (I examine the authorities granted to a session to
control even which indexes are being searched) before passing the modified u
I would also second the proxy approach. Besides keeping your Solr instance
behind a firewall and not directly exposed, you can do a lot in a proxy:
per-user control over which indexes they access, filtering of queries, etc.
-Original Message-
From: Emir Arnautović [mailto:emir.arnauto..
First off, use basic authentication to at least partially lock it down. Only
the application server has access to the password. Second, our IT people
thought Solr security insufficient to even remotely consider exposing it to the
external web. It lives behind a firewall, so we do a kind of proxy. External qu
Code for SolrJ is going to be very dependent on your needs, but the beating
heart of my code is below (note that I do OCR as a separate step before feeding
files into the indexer). The SolrJ and Tika docs should help.
File f = new File(filename);
ContentHandler textHandler = new
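A hedged reconstruction of the rest of that "beating heart" (the message is
truncated above): a plain Tika parse followed by a SolrJ add. Field names are
assumptions:
    File f = new File(filename);
    ContentHandler textHandler = new BodyContentHandler(-1); // -1 = no size limit
    Metadata metadata = new Metadata();
    AutoDetectParser parser = new AutoDetectParser();
    try (InputStream is = new FileInputStream(f)) {
        parser.parse(is, textHandler, metadata, new ParseContext());
    }
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", f.getCanonicalPath());
    doc.addField("_text_", textHandler.toString());
    solr.add(doc);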