I did a search and noticed pages were executed through aspx. Are you using
.net to parse the xml results from SOLR? Nice site, just trying to figure
out where SOLR fits into this.
On 5/16/07, Mike Austin <[EMAIL PROTECTED]> wrote:
I just wanted to say thanks to everyone for the creation of Solr
I am not sure how Solr is exactly set up currently, much less on any
specific system. But, for operations which are largely reading, *maybe*
like a query, you might be able to run on a read-only partition.
A firewall is a lot less work and a good start; it covers maybe 90% of the problem.
To do this, you brin
If you haven't already, you might want to check out maximal marginal
relevance; the original paper is by Carbonell and Goldstein.
On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein wrote:
> Yeah, I think your plan sounds fine.
>
> Do you have a specific use case for diversity of results? I've been
> wondering i
This is probably caused by an encoding detection problem in Nutch and/or
Tika. If you can share the file on the Tika user’s list, I can take a look.
On Fri, Oct 5, 2018 at 7:11 AM UMA MAHESWAR
wrote:
> Hi all,
>
> While I am using Nutch for crawling and indexing into Solr, while storing
> data I
is a set of probable languages. From there,
you can pivot the results based on the user expectations.
tim
On Mon, Oct 22, 2018 at 11:18 AM Alexandre Rafalovitch
wrote:
> Additional possibilities:
> 1) omitNorms and maybe omitTermFreqAndPositions for the fields to
> avoid frequen
To follow up on Erick's point, there are a bunch of transitive dependencies
from tika-parsers. If you aren't using Maven or a similar build system to
grab the dependencies, it can be tricky to get right. If you aren't
using Maven, and you can afford the risks of jar hell, consider using
tika-app or
If you're processing actual msg (not eml), you'll also need poi and
poi-scratchpad and their dependencies, but then those msgs could have
attachments, at which point, you may as well just add tika-app. :D
On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ)
wrote:
> Hi Erick and Tim,
>
>
If you're wondering why you might upgrade to 1.19.1, look no further than:
https://tika.apache.org/security.html
On Fri, Oct 26, 2018 at 4:14 AM Martin Frank Hansen (MHQ)
wrote:
> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and now
> it works 😊 But
ion:
> https://wiki.apache.org/tika/RecursiveMetadata
>
> But thanks again for all your help!
>
> -Original Message-
> From: Martin Frank Hansen (MHQ)
> Sent: 26. oktober 2018 10:14
> To: solr-user@lucene.apache.org
> Subject: RE: Reading data using Tika to Sol
Tika relies on you to install tesseract and all the language libraries
you'll need.
If you can successfully call `tesseract testing/eurotext.png
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
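For reference, the OCR language can also be pinned in tika-config.xml; this is a sketch assuming a Tika 1.x release that supports per-parser params for TesseractOCRParser (the param name mirrors TesseractOCRConfig.setLanguage):

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- must match a trained language pack installed for tesseract -->
        <param name="language" type="string">dan</param>
      </params>
    </parser>
  </parsers>
</properties>
```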
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) wr
Martin,
Let’s move this over to user@tika.
Rohan,
Is there something about Tika’s use of tesseract for image files that can
be improved?
Best,
Tim
On Sat, Oct 27, 2018 at 3:40 AM Rohan Kasat wrote:
> I used tess4j for image formats and Tika for scanned PDFs and images wit
OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr! We
have an open ticket to make it "just work", but we aren't there yet
(TIKA-2749).
You have to tell Tika how you want to process images from PDFs via the
tika-config.xml file.
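As a sketch, the relevant tika-config.xml fragment might look like this (the ocrStrategy values come from PDFParserConfig; verify the exact names against your Tika version):

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- run OCR on rendered pages in addition to normal text extraction -->
        <param name="ocrStrategy" type="string">ocr_and_text_extraction</param>
      </params>
    </parser>
  </parsers>
</properties>
```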
You've seen this link in the links you mentioned:
ht
to ding Nuance (or tesseract), I just wish to point out that
> what to OCR is important, because OCR works well when it has good input.
>
> > -Original Message-
> > From: Tim Allison
> > Sent: Friday, November 2, 2018 11:03 AM
> > To: solr-user@lucene.apach
explicitly delete the
parent and child documents. There are a number of JIRA tickets floating
around relating to cleaning up the user experience for this.
-Tim
[1]
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-NestedChildDocuments
[2]
http
Yes, I tracked this down within Solr. This is a feature, not a bug. I
found a solution (set captureAttr to true):
https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263
Please, though,
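For anyone landing here later, a sketch of the solrconfig.xml change being described (handler name and field mapping are illustrative, not from the thread):

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- index XHTML attribute values (e.g. alt text, hrefs) into fields -->
    <str name="captureAttr">true</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
```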
manually checked that the jars and poms for the
artifacts that maven wasn't able to pull were in fact there.
Is this user error or something wrong with the poms or something else?
Thank you.
Best,
Tim
[1]
apache-snapshot
User error... please ignore.
On Thu, Jan 17, 2019 at 4:36 PM Tim Allison wrote:
>
> All,
> I recently tried to upgrade a project that relies on the snapshot
> repos[1], but maven wasn't able to pull lucene-highlighter,
> lucene-test-framework, lucene-memory, among a
All,
I don't know if this change was intended, but it feels like a bug to me...
TokenFilterFactory[] filters = new TokenFilterFactory[2];
filters[0] = new LowerCaseFilterFactory(Collections.EMPTY_MAP);
filters[1] = new ASCIIFoldingFilterFactory(Collections.EMPTY_MAP);
TokenizerChain chain = new
>At the end of the day it would be a much better architecture to parse the
> PDFs using plain standalone TikaServer
+1
Also, note that we added a -spawnChild switch to tika-server that will
run the server in a child process and kill+restart the child process
if there is an infinite loop/oom/segfa
Haha, looks like Jörn just answered this... onError="skip|continue"
>greatly preferable if the indexing process could ignore exceptions
Please, no. I'm 100% behind the sentiment that DIH should gracefully
handle Tika exceptions, but the better option is to log the
exceptions, store the stacktrace
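For context, a sketch of the DIH entity attribute under discussion (entity name, datasource, and field mapping are hypothetical):

```xml
<entity name="docs"
        processor="TikaEntityProcessor"
        dataSource="binaryFile"
        url="${files.fileAbsolutePath}"
        format="text"
        onError="skip">
  <field column="text" name="_text_"/>
</entity>
```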
4.x...y, I know...
What am I doing wrong? How can I fix this?
Thank you.
Best,
Tim
We are successfully running Solr 7.6.0 (and 7.5.0 before it) on OpenJDK 11
without problems. We are also using G1. We do not use Solr Cloud but do
rely on the legacy replication.
-Tim
On Sat, Mar 23, 2019 at 10:13 AM Erick Erickson
wrote:
> I am, in fact, trying to get a summary of all t
/index.html
-Tim
On Mon, Mar 25, 2019 at 10:51 AM Jay Potharaju
wrote:
> I just learnt that java 11 is . Is anyone using open jdk11 in
> production?
> Thanks
>
>
> > On Mar 23, 2019, at 5:15 PM, Jay Potharaju
> wrote:
> >
> > I have not kept up with jdk vers
? The reason I want to keep
the fields as two separate ones is that I want to be able to export from Solr
back to the exact same Excel file structure, i.e. Solr fields map exactly to
Excel columns.
I'm using Solr 7. Any thoughts or suggestions would be appreciated.
Regards
Tim
TextField is a classname. Look in managed-schema and pick a field type by
name, e.g. text_general
On Sat, Apr 6, 2019 at 9:00 AM Dave Beckstrom
wrote:
> Hi Everyone,
>
> I'm really hating SOLR. All I want is to define a text field that data
> can be indexed into and which is searchable. Should
For smaller documents, TFIDFSimilarity will weight toward shorter
documents. Another way to say this: if your documents are 5-10 terms, the
5-term documents are going to win.
You might think about having a per-token, or per-token-pair, weight. I would
be surprised if there was not something similar out t
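To illustrate the point, here is a toy sketch (not Lucene source) of the classic TFIDFSimilarity length norm, 1/sqrt(fieldLength), showing why the shorter document wins for the same single term match:

```java
public class LengthNormDemo {
    // Classic TF-IDF style score for one matching term:
    // sqrt(tf) * idf * lengthNorm, with lengthNorm = 1/sqrt(fieldLength).
    static double score(int tf, double idf, int fieldLength) {
        return Math.sqrt(tf) * idf / Math.sqrt(fieldLength);
    }

    public static void main(String[] args) {
        double idf = 2.0; // same term, so same idf in both docs
        double shortDoc = score(1, idf, 5);  // 5-term document
        double longDoc  = score(1, idf, 10); // 10-term document
        // The shorter document scores higher for the same single match.
        System.out.println(shortDoc > longDoc); // prints "true"
    }
}
```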
date range, when the source material
has date ranges built into it, is kinda odd. But it occurs. If you query
from noon-1p, does that include meeting notes which started at 11:30a but
went for an hour? You have to choose what to do.
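The choice above is just interval semantics; a minimal sketch of the "intersects" option (counting any overlap as a hit):

```java
import java.time.LocalTime;

public class RangeOverlapDemo {
    // Half-open intervals [start, end): two ranges overlap if each one
    // starts before the other ends ("intersects" semantics).
    static boolean intersects(LocalTime aStart, LocalTime aEnd,
                              LocalTime bStart, LocalTime bEnd) {
        return aStart.isBefore(bEnd) && bStart.isBefore(aEnd);
    }

    public static void main(String[] args) {
        // Meeting 11:30-12:30 vs. query window 12:00-13:00.
        boolean hit = intersects(LocalTime.of(11, 30), LocalTime.of(12, 30),
                                 LocalTime.of(12, 0), LocalTime.of(13, 0));
        System.out.println(hit); // prints "true": the meeting is included
    }
}
```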
tim
On Thu, May 17, 2018 at 6:11 AM, Terry Steichen wrote:
>
We have 3.4.10 and have *tested* it with 6.6.2 at a functional level. So far
it works. We have not done any stress/load testing, but we would have to do
that prior to release.
On Tue, May 22, 2018 at 9:44 AM, Walter Underwood
wrote:
> Is anybody running Zookeeper 3.4.12 with Solr 6.6.2? Is that a recomme
You’ll need to provide a PasswordProvider in the ParseContext. I don’t
think that is currently possible in the Solr integration. Please open a
ticket if SolrJ doesn’t meet your needs.
On Thu, May 24, 2018 at 1:03 PM Alexandre Rafalovitch
wrote:
> Hmm. If it works, then it is Tika magic. Which m
...@mail.gmail.com%3e
On Sat, May 26, 2018 at 6:34 AM Tim Allison wrote:
> You’ll need to provide a PasswordProvider in the ParseContext. I don’t
> think that is currently possible in the Solr integration. Please open a
> ticket if SolrJ doesn’t meet your needs.
>
> On Thu, May 24,
standing by on the user list for Tika when you have
questions. :)
Cheers,
Tim
On Fri, May 25, 2018 at 11:10 AM Erick Erickson
wrote:
> I'd consider using a separate Java program that uses Tika directly, or
> one of various services. Then you can assemble whatever you please
>
W00t! Thank you, Shawn!
The "don't use ERH in production" response comes up frequently enough
> that I have created a wiki page we can use for responses:
>
> https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>
> Tim, you are extremely well-qualified t
> > the info is in our "official" place but the real story is in another
> > place,
> > > one we alternately tell people to sometimes ignore but sometimes keep
> up
> > to
> > > date? Even I'm confused.
> > >
> > > On Sat, May 26, 20
Deepti,
I am going to guess the analyzer part of the .NET application is cutting
off the last token.
If you try the queries on the console of the running Solr cluster, what do
you get? If you dump that specific field for all the docs, can you find it
with grep?
tim
On Fri, Jul 20, 2018 at 10
+1 to Shawn's and Erick's points about isolating Tika in a separate jvm.
Y, please do let us know: u...@tika.apache.org We might be able to
help out, and you, in turn, can help the community figure out what's
going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703
On Sun, Aug 5, 2018
Walter,
When you do the query, what is the sort of the results?
tim
On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood
wrote:
> I’ll back up a bit, since it is sort of an X/Y problem.
>
> I have an index with four shards and 17 million documents. I want to dump
> all the docs in
you would do to tune Solr for
large amounts of dynamic fields?
Does anyone have a guess on what the single high CPU node is doing (some
kind of metrics aggregation maybe?).
Thank you all,
Tim
[1]
f fields and/or many
> rows, this shouldn’t run “for many minutes”, but it’s something to look for.
>
> When this happens, what is your query response time like? I’m assuming
> it’s very slow.
>
> But these are all shots in the dark, some thread dumps would be where I’d
> start.
>
> Erick
>
> On Wed, Mar 18, 2020, 12:04 Edward Ribeiro
> wrote:
>
> > What are your hard and soft commit settings? This can have a large
> > impact on the writing throughput.
> >
> > Best,
> > Edward
> >
> > On Wed, Mar 18, 2020 at 11:43 AM Tim Ro
, you can build the
symbol space from bigrams.
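One common reading of building the "symbol space from bigrams" is sliding a two-character window over the text; a toy sketch (the character-level tokenization choice is an assumption, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Slide a 2-char window over the text to build the bigram vocabulary.
    static List<String> charBigrams(String text) {
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            bigrams.add(text.substring(i, i + 2));
        }
        return bigrams;
    }

    public static void main(String[] args) {
        System.out.println(charBigrams("the")); // prints "[th, he]"
    }
}
```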
If I ever write a book the title is going to be "The The". I hope it has
multi-lingual translations. Although, at this point, it is a very short
book :/
tim
On Fri, May 15, 2020 at 11:43 AM Walter Underwood
wrote:
> Right. I might us
better to have an honest index and allow the post-analysis to change. This
way you can change it 10 times a day and no one will care.
If you are interested in a word cloud, I would suspect people have done a
reasonable job around this using a Solr index already.
tim
On Fri, May 15, 2020 at 1:48 PM A
tokens in a time
field, so you don't get names of people ('June') while searching for 'jun',
for instance.
tim
On Thu, Sep 10, 2020 at 10:08 AM Walter Underwood
wrote:
> It is very common for us to do more processing in the index analysis
> chain. In general, we do that
Related?
https://issues.apache.org/jira/plugins/servlet/mobile#issue/TIKA-2861
On Wed, May 1, 2019 at 8:09 AM Alexandre Rafalovitch
wrote:
> What happens when you run it against a standalone Tika (recommended option
> anyway)? Do you see the relevant fields?
>
> Not every Tika field is capture
I just pushed a fix for TIKA-2861. If you can either build locally or
wait a few hours for Jenkins to build #182, let me know if that works
with straight tika-app.jar.
On Thu, May 2, 2019 at 5:00 AM Where is Where wrote:
>
> Thank you Alex and Tim.
> I have looked at the solrconfig.xm
Sorry build #182: https://builds.apache.org/job/tika-branch-1x/
On Thu, May 2, 2019 at 12:01 PM Tim Allison wrote:
>
> I just pushed a fix for TIKA-2861. If you can either build locally or
> wait a few hours for Jenkins to build #182, let me know if that works
> with straight
de Solr as soon as Tika is out (I also mean it this time).
*TM by Erick Erickson
On Fri, May 3, 2019 at 3:44 AM Where is Where wrote:
>
> Thank you very much Tim, I wonder how to make the Tika change apply to
> Solr? I saw Tika core, parse and xml jar files tika-core.jar
> tika-parse
if need be. (Be wary of over-generation if
one of the categories turns out to be 'thin'.)
Then in the filter query you can query over a category, or simply require a
category:thing to be in the query.
tim
On Thu, May 30, 2019 at 3:33 PM Shawn Heisey wrote:
> On 5/30/2019 4:13 PM, V
My two cents' worth of comment:
For our local Lucene indexes we use AES encryption. We encrypt the blocks
on the way out, decrypt on the way in.
We are using a C version of Lucene, not the Java version. But I suspect
the same methodology could be applied. This assumes the data at rest is
the at
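A minimal JDK-only sketch of the encrypt-on-write/decrypt-on-read idea (uses the default AES/ECB transformation for brevity; a real deployment would want an authenticated mode such as AES/GCM with a per-block IV):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BlockCryptoDemo {
    // Run a block of index bytes through AES in the given direction.
    static byte[] crypt(int mode, SecretKey key, byte[] data) throws Exception {
        Cipher cipher = Cipher.getInstance("AES"); // AES/ECB/PKCS5Padding
        cipher.init(mode, key);
        return cipher.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        byte[] block = "segment bytes".getBytes(StandardCharsets.UTF_8);
        byte[] onDisk   = crypt(Cipher.ENCRYPT_MODE, key, block);  // way out
        byte[] inMemory = crypt(Cipher.DECRYPT_MODE, key, onDisk); // way in
        System.out.println(Arrays.equals(block, inMemory)); // prints "true"
    }
}
```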
I'd strongly recommend rolling your own ingest code. See Erick's
superb: https://lucidworks.com/post/indexing-with-solrj/
You can easily get attachments via the RecursiveParserWrapper, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParse
https://stackoverflow.com/questions/48348312/solr-7-how-to-do-full-text-search-w-geo-spatial-search
On Mon, Sep 30, 2019 at 10:31 AM Anushka Gupta <
anushka_gu...@external.mckinsey.com> wrote:
> Hi,
>
> I want to be able to filter on different cities and also sort the results
> based on geoproxi
particularly short messages. So I would expect a small set of side fields
marking this. This would allow you to carry the measures along with the
data.
tim
On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch
wrote:
> Is the 100 words a hard boundary or a soft one?
>
> If it is a
c segments.
I think you will find the last N tokens of a document have some odd
categories within the search results. I might guess you have a different
purpose in mind. Either way, you would likely do better to segment what
you are searching.
tim
On Mon, Oct 14, 2019 at 11:25 PM Kaminski,
rrect me if I'm
wrong.
Thanks,
Tim
I'm currently running into a ConcurrentModificationException ingesting data
as we attempt to upgrade from Solr 8.1 to 8.2. It's not every document, but
it definitely appears regularly in our logs. We didn't run into this
problem in 8.1, so I'm not sure what might have changed. I feel like this
is p
Nevermind my comment on not having this problem in 8.1. We do have it there
as well, I just didn't look far enough back in our logs on my initial
search. Would still appreciate whatever thoughts anyone might have on the
exception.
On Wed, Nov 6, 2019 at 10:17 AM Tim Swetland wrote:
ClassCastException (java.lang.String cannot be cast to
java.util.Map) on the replica as in issue SOLR-13471
<https://issues.apache.org/jira/browse/SOLR-13471>.
Anyway, thanks for the insight everyone,
Tim
On Fri, Nov 8, 2019 at 12:26 AM Shawn Heisey wrote:
> On 11/6/2019 8:17 AM, Tim Swetland wrot