Re: PriceJunkie.com using solr!

2007-05-17 Thread Tim Archambault
I did a search and noticed pages were executed through aspx. Are you using .net to parse the xml results from SOLR? Nice site, just trying to figure out where SOLR fits into this. On 5/16/07, Mike Austin <[EMAIL PROTECTED]> wrote: I just wanted to say thanks to everyone for the creation of sol

Re: solr crypto mining hack...

2018-08-25 Thread Tim Casey
I am not sure how solr is exactly set up currently, much less on any specific system. But, for operations which are largely reading, *maybe* like a query, you might be able to run on a read-only partition. A firewall is a lot less work and a good start, like 90% of the problem. To do this, you brin

Re: solr and diversification

2018-09-28 Thread Tim Allison
If you haven’t already, might want to check out maximal marginal relevance...original paper: Carbonell and Goldstein. On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein wrote: > Yeah, I think your plan sounds fine. > > Do you have a specific use case for diversity of results. I've been > wondering i
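
For readers who have not met maximal marginal relevance before, here is a minimal sketch of the greedy selection step from Carbonell and Goldstein. It is not Solr code: the class name `MmrSketch`, the precomputed `relevance` and `similarity` arrays, and the `lambda` trade-off value are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java sketch of greedy MMR, not a Solr or Lucene API.
final class MmrSketch {

    /**
     * Repeatedly picks the unselected candidate that maximizes
     * lambda * relevance - (1 - lambda) * (highest similarity to anything already picked).
     * Returns the chosen candidate indexes in selection order.
     */
    static List<Integer> select(double[] relevance, double[][] similarity, double lambda, int k) {
        List<Integer> selected = new ArrayList<>();
        boolean[] used = new boolean[relevance.length];
        while (selected.size() < k && selected.size() < relevance.length) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < relevance.length; i++) {
                if (used[i]) {
                    continue;
                }
                // Redundancy penalty: similarity to the closest already-selected result.
                double maxSim = 0.0;
                for (int j : selected) {
                    maxSim = Math.max(maxSim, similarity[i][j]);
                }
                double score = lambda * relevance[i] - (1.0 - lambda) * maxSim;
                if (score > bestScore) {
                    bestScore = score;
                    best = i;
                }
            }
            used[best] = true;
            selected.add(best);
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical scores: documents 0 and 1 are near-duplicates, document 2 is different.
        double[] relevance = {0.9, 0.85, 0.5};
        double[][] similarity = {
            {1.0, 0.95, 0.1},
            {0.95, 1.0, 0.1},
            {0.1, 0.1, 1.0}
        };
        System.out.println(select(relevance, similarity, 0.5, 2)); // prints [0, 2]
    }
}
```

With `lambda` at 1.0 the selection falls back to plain relevance order; lowering it pushes near-duplicates down the list.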

Re: Encoding issue in solr

2018-10-05 Thread Tim Allison
This is probably caused by an encoding detection problem in Nutch and/or Tika. If you can share the file on the Tika user’s list, I can take a look. On Fri, Oct 5, 2018 at 7:11 AM UMA MAHESWAR wrote: > HI ALL, > > while i am using nutch for crawling and indexing in to solr,while storing > data i

Re: Help with multi-lang searches

2018-10-22 Thread Tim Casey
is a set of probable languages. From there, you can pivot the results based on the user expectations. tim On Mon, Oct 22, 2018 at 11:18 AM Alexandre Rafalovitch wrote: > Additional possibilities: > 1) omitNorms and maybe omitTermFreqAndPositions for the fields to > avoid frequen

Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison
To follow up w Erick’s point, there are a bunch of transitive dependencies from tika-parsers. If you aren’t using maven or similar build system to grab the dependencies, it can be tricky to get it right. If you aren’t using maven, and you can afford the risks of jar hell, consider using tika-app or

Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison
If you’re processing actual msg (not eml), you’ll also need poi and poi-scratchpad and their dependencies, but then those msgs could have attachments, at which point, you may as well just add tika-app. :D On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) wrote: > Hi Erick and Tim, > &g

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
you’re wondering why you might upgrade to 1.19.1, look no further than: https://tika.apache.org/security.html On Fri, Oct 26, 2018 at 4:14 AM Martin Frank Hansen (MHQ) wrote: > Hi Tim, > > It is msg files and I added tika-app-1.14.jar to the build path - and now > it works 😊 But

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
ion: > https://wiki.apache.org/tika/RecursiveMetadata > > But thanks again for all your help! > > -Original Message- > From: Martin Frank Hansen (MHQ) > Sent: 26. oktober 2018 10:14 > To: solr-user@lucene.apache.org > Subject: RE: Reading data using Tika to Sol

Re: Tesseract language

2018-10-26 Thread Tim Allison
Tika relies on you to install tesseract and all the language libraries you'll need. If you can successfully call `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan" with your code above. On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) wr

Re: Tesseract language

2018-10-27 Thread Tim Allison
Martin, Let’s move this over to user@tika. Rohan, Is there something about Tika’s use of tesseract for image files that can be improved? Best, Tim On Sat, Oct 27, 2018 at 3:40 AM Rohan Kasat wrote: > I used tess4j for image formats and Tika for scanned PDFs and images wit

Re: Solr OCR Support

2018-11-02 Thread Tim Allison
OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr! We have an open ticket to make it "just work", but we aren't there yet (TIKA-2749). You have to tell Tika how you want to process images from PDFs via the tika-config.xml file. You've seen this link in the links you mentioned: ht

Re: Solr OCR Support

2018-11-02 Thread Tim Allison
to ding Nuance (or tesseract), I just wish to point out that > what to OCR is important, because OCR works well when it has good input. > > > -Original Message- > > From: Tim Allison > > Sent: Friday, November 2, 2018 11:03 AM > > To: solr-user@lucene.apach

Re: How to handle List in Solr 6.6

2018-11-06 Thread Tim Underwood
itly delete the parent and child documents. There are a number of JIRA tickets floating around relating to cleaning up the user experience for this. -Tim [1] https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-NestedChildDocuments [2] http

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-17 Thread Tim Allison
Y, I tracked this down within Solr. This is a feature, not a bug. I found a solution (set {{captureAttr}} to {{true}}): https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263 Please, though,

8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison
ually checked that the jars and poms for the artifacts that maven wasn't able to pull were in fact there. Is this user error or something wrong with the poms or something else? Thank you. Best, Tim [1] apache-snapshot

Re: 8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison
User error..please ignore. On Thu, Jan 17, 2019 at 4:36 PM Tim Allison wrote: > > All, > I recently tried to upgrade a project that relies on the snapshot > repos[1], but maven wasn't able to pull lucene-highlighter, > lucene-test-framework, lucene-memory, among a

TokenizerChain.getMultiTermAnalyzer().normalize() no longer normalizes multiterms in 8.x?!

2019-01-25 Thread Tim Allison
All, I don't know if this change was intended, but it feels like a bug to me... TokenFilterFactory[] filters = new TokenFilterFactory[2]; filters[0] = new LowerCaseFilterFactory(Collections.EMPTY_MAP); filters[1] = new ASCIIFoldingFilterFactory(Collections.EMPTY_MAP); TokenizerChain chain = new

Re: by: java.util.zip.DataFormatException: invalid distance too far back reported by Solr API

2019-02-05 Thread Tim Allison
>At the end of the day it would be a much better architecture to parse the > PDFs using plain standalone TikaServer +1 Also, note that we added a -spawnChild switch to tika-server that will run the server in a child process and kill+restart the child process if there is an infinite loop/oom/segfa

Re: Help with a DIH config file

2019-03-15 Thread Tim Allison
Haha, looks like Jörn just answered this... onError="skip|continue" >greatly preferable if the indexing process could ignore exceptions Please, no. I'm 100% behind the sentiment that DIH should gracefully handle Tika exceptions, but the better option is to log the exceptions, store the stacktrace

Why is elevate not working when I convert a request to local parameters?

2019-03-22 Thread Tim Allison
4.x...y, I know... What am I doing wrong? How can I fix this? Thank you. Best, Tim

Re: Java 9 & solr 7.7.0

2019-03-23 Thread Tim Underwood
We are successfully running Solr 7.6.0 (and 7.5.0 before it) on OpenJDK 11 without problems. We are also using G1. We do not use Solr Cloud but do rely on the legacy replication. -Tim On Sat, Mar 23, 2019 at 10:13 AM Erick Erickson wrote: > I am, in fact, trying to get a summary of all t

Re: Java 9 & solr 7.7.0

2019-03-25 Thread Tim Underwood
/index.html -Tim On Mon, Mar 25, 2019 at 10:51 AM Jay Potharaju wrote: > I just learnt that java 11 is . Is anyone using open jdk11 in > production? > Thanks > > > > On Mar 23, 2019, at 5:15 PM, Jay Potharaju > wrote: > > > > I have not kept up with jdk vers

Spatial Search using two separate fields for lat and long

2019-04-03 Thread Tim Hedlund
? The reason I want to keep the fields as two separate ones is that I want to be able to export from solr back to exact same excel file structure, i.e. solr fields maps exactly to excel columns. I'm using solr 7. Any thoughts or suggestions would be appreciated. Regards Tim

Re: SOLR Text Field

2019-04-06 Thread Tim Allison
TextField is a classname. Look in managed-schema and pick a field type by name, e.g. text_general On Sat, Apr 6, 2019 at 9:00 AM Dave Beckstrom wrote: > Hi Everyone, > > I'm really hating SOLR. All I want is to define a text field that data > can be indexed into and which is searchable. Should

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Tim Casey
For smaller length documents TFIDFSimilarity will weight towards shorter documents. Another way to say this, if your documents are 5-10 terms, the 5 terms are going to win. You might think about having per token, or token pair, weight. I would be surprised if there was not something similar out t
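
The effect described above can be seen with a little arithmetic. The sketch below assumes the classic TF-IDF length normalization of 1/sqrt(number of terms) and ignores the lossy encoding Lucene applies when it stores norms, so treat the numbers as approximate. The class and method names are made up for illustration.

```java
// Illustrative only: approximates the classic TF-IDF length norm, 1 / sqrt(numTerms).
final class LengthNormSketch {

    /** Length normalization factor for a field containing numTerms tokens. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // A 5-term title gets a noticeably larger factor than a 10-term title,
        // so with equal term matches the shorter title scores higher.
        System.out.printf("5 terms:  %.3f%n", lengthNorm(5));
        System.out.printf("10 terms: %.3f%n", lengthNorm(10));
    }
}
```

Under these assumptions the 5-term field comes out at roughly 0.447 and the 10-term field at roughly 0.316, which is why the shorter document tends to win when the matches are otherwise equal.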

Re: Date Query Confusion

2018-05-17 Thread Tim Casey
date range, when the source material has date ranges built into it is kinda odd. But it occurs. If you query from noon-1p does that include meeting notes which started at 1130a, but went for an hour? You have to choose what to do. tim On Thu, May 17, 2018 at 6:11 AM, Terry Steichen wrote: >
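
The choice mentioned above can be made concrete. This is a minimal sketch of two possible matching policies for the noon-to-1pm example; the class name, method names, and the half-open interval convention are assumptions, not anything taken from Solr.

```java
import java.time.LocalTime;

// Illustrative only: two ways to decide whether a document's time range matches a query window.
final class RangeMatchSketch {

    /** Policy 1: match when the document range overlaps the query window at all ([start, end) intervals). */
    static boolean overlaps(LocalTime qStart, LocalTime qEnd, LocalTime dStart, LocalTime dEnd) {
        return dStart.isBefore(qEnd) && qStart.isBefore(dEnd);
    }

    /** Policy 2: match only when the document starts inside the query window. */
    static boolean startsWithin(LocalTime qStart, LocalTime qEnd, LocalTime dStart) {
        return !dStart.isBefore(qStart) && dStart.isBefore(qEnd);
    }

    public static void main(String[] args) {
        LocalTime noon = LocalTime.of(12, 0);
        LocalTime onePm = LocalTime.of(13, 0);
        LocalTime meetingStart = LocalTime.of(11, 30);
        LocalTime meetingEnd = LocalTime.of(12, 30);

        // The 11:30 meeting that ran for an hour matches under policy 1 but not policy 2.
        System.out.println(overlaps(noon, onePm, meetingStart, meetingEnd)); // prints true
        System.out.println(startsWithin(noon, onePm, meetingStart));         // prints false
    }
}
```

Neither policy is the correct one in general; it depends on what users expect a range query to mean.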

Re: Zookeeper 3.4.12 with Solr 6.6.2?

2018-05-22 Thread Tim Casey
We have 3.4.10 and have *tested* at a functional level 6.6.2. So far it works. We have not done any stress/load testing. But would have to do this prior to release. On Tue, May 22, 2018 at 9:44 AM, Walter Underwood wrote: > Is anybody running Zookeeper 3.4.12 with Solr 6.6.2? Is that a recomme

Re: Index protected zip

2018-05-26 Thread Tim Allison
You’ll need to provide a PasswordProvider in the ParseContext. I don’t think that is currently possible in the Solr integration. Please open a ticket if SolrJ doesn’t meet your needs. On Thu, May 24, 2018 at 1:03 PM Alexandre Rafalovitch wrote: > Hmm. If it works, then it is Tika magic. Which m

Re: Index protected zip

2018-05-26 Thread Tim Allison
...@mail.gmail.com%3e On Sat, May 26, 2018 at 6:34 AM Tim Allison wrote: > You’ll need to provide a PasswordProvider in the ParseContext. I don’t > think that is currently possible in the Solr integration. Please open a > ticket if SolrJ doesn’t meet your needs. > > On Thu, May 24,

Re: simple enrich uploaded binary documents with sha256 hashes

2018-05-26 Thread Tim Allison
standing by on the user list for Tika when you have questions. :) Cheers, Tim On Fri, May 25, 2018 at 11:10 AM Erick Erickson wrote: > I'd consider using a separate Java program that uses Tika directly, or > one of various services. Then you can assemble whatever you please >
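
If the hashing is done in a separate Java program, as suggested in the quoted reply, the digest itself needs only the JDK. A minimal sketch follows; the class name is hypothetical, and how the resulting string is attached to a Solr document (for example in a field such as `content_sha256`) is left as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: computes a lowercase hex SHA-256 digest with the JDK's MessageDigest.
final class Sha256Sketch {

    /** Returns the SHA-256 digest of the given bytes as 64 lowercase hex characters. */
    static String sha256Hex(byte[] content) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // In a real ingest program the bytes would come from the uploaded file.
        System.out.println(sha256Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```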

Re: Index protected zip

2018-05-26 Thread Tim Allison
W00t! Thank you, Shawn! The "don't use ERH in production" response comes up frequently enough > that I have created a wiki page we can use for responses: > > https://wiki.apache.org/solr/RecommendCustomIndexingWithTika > > Tim, you are extremely well-qualified t

Re: Index protected zip

2018-05-29 Thread Tim Allison
t; > the info is in our "official" place but the real story is in another > > place, > > > one we alternately tell people to sometimes ignore but sometimes keep > up > > to > > > date? Even I'm confused. > > > > > > On Sat, May 26, 20

Re: Exact Phrase search not returning results.

2018-07-20 Thread Tim Casey
Deepti, I am going to guess the analyzer part of the .net application is cutting off the last token. If you try the queries on the console of the running solr cluster, what do you get? If you dump that specific field for all the docs, can you find it with grep? tim On Fri, Jul 20, 2018 at 10

Re: Memory Leak in 7.3 to 7.4

2018-08-06 Thread Tim Allison
+1 to Shawn's and Erick's points about isolating Tika in a separate jvm. Y, please do let us know: u...@tika.apache.org We might be able to help out, and you, in turn, can help the community figure out what's going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703 On Sun, Aug 5, 2018

Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Tim Casey
Walter, When you do the query, what is the sort of the results? tim On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood wrote: > I’ll back up a bit, since it is sort of an X/Y problem. > > I have an index with four shards and 17 million documents. I want to dump > all the docs in

Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
you would do to tune Solr for large amounts of dynamic fields? Does anyone have a guess on what the single high CPU node is doing (some kind of metrics aggregation maybe?). Thank you all, Tim [1] [image: image.png]

Re: Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
f fields and/or many > rows, this shouldn’t run “for many minutes”, but it’s something to look for. > > When this happens, what is your query response time like? I’m assuming > it’s very slow. > > But these are all shots in the dark, some thread dumps would be where I’d > start. &g

Re: Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
> Erick > > On Wed, Mar 18, 2020, 12:04 Edward Ribeiro > wrote: > > > What are your hard and soft commit settings? This can have a large > > impact on the writing throughput. > > > > Best, > > Edward > > > > On Wed, Mar 18, 2020 at 11:43 AM Tim Ro

Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
, you can build the symbol space from bigrams. If I ever write a book the title is going to be "The The". I hope it has multi-lingual translations. Although, at this point, it is a very short book :/ tim On Fri, May 15, 2020 at 11:43 AM Walter Underwood wrote: > Right. I might us

Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
er to have an honest index and allow the post analysis to change. This way you can change it 10 times a day and no one will care. If you are interested in a word cloud I would suspect people have done a reasonable job around this using a solr index already. tim On Fri, May 15, 2020 at 1:48 PM A

Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Tim Casey
okens in a time field, so you don't get names of people 'june' while searching for 'jun', for instance. tim On Thu, Sep 10, 2020 at 10:08 AM Walter Underwood wrote: > It is very common for us to do more processing in the index analysis > chain. In general, we do that

Re: problem indexing GPS metadata for video upload

2019-05-01 Thread Tim Allison
Related? https://issues.apache.org/jira/plugins/servlet/mobile#issue/TIKA-2861 On Wed, May 1, 2019 at 8:09 AM Alexandre Rafalovitch wrote: > What happens when you run it against a standalone Tika (recommended option > anyway)? Do you see the relevant fields? > > Not every Tika field is capture

Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison
I just pushed a fix for TIKA-2861. If you can either build locally or wait a few hours for Jenkins to build #182, let me know if that works with straight tika-app.jar. On Thu, May 2, 2019 at 5:00 AM Where is Where wrote: > > Thank you Alex and Tim. > I have looked at the solrconfig.xm

Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison
Sorry build #182: https://builds.apache.org/job/tika-branch-1x/ On Thu, May 2, 2019 at 12:01 PM Tim Allison wrote: > > I just pushed a fix for TIKA-2861. If you can either build locally or > wait a few hours for Jenkins to build #182, let me know if that works > with straight

Re: problem indexing GPS metadata for video upload

2019-05-10 Thread Tim Allison
de Solr as soon as Tika is out (I also mean it this time). *TM by Erick Erickson On Fri, May 3, 2019 at 3:44 AM Where is Where wrote: > > Thank you very much Tim, I wonder how to make the Tika change apply to > Solr? I saw Tika core, parse and xml jar files tika-core.jar > tika-parse

Re: Solr query with long query

2019-05-30 Thread Tim Casey
if need be. (Be wary of over generation if one of the categories turns out to be 'thin'). Then in the filter query you can query over a category, or simply require a category:thing to be in the query. tim On Thu, May 30, 2019 at 3:33 PM Shawn Heisey wrote: > On 5/30/2019 4:13 PM, V

Re: Encrypting Solr Index

2019-06-25 Thread Tim Casey
My two cents worth of comment, For our local lucene indexes we use AES encryption. We encrypt the blocks on the way out, decrypt on the way in. We are using a C version of lucene, not the java version. But, I suspect the same methodology could be applied. This assumes the data at rest is the at
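
The message describes a C implementation, so the following is only a guess at how the same "encrypt on the way out, decrypt on the way in" idea might look in Java. The cipher mode (AES-GCM), the IV-prefixed block layout, and all class and method names are assumptions made for illustration; key management is not covered.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustrative only: per-block AES-GCM with a random IV stored in front of the ciphertext.
final class BlockCryptoSketch {
    private static final int IV_BYTES = 12;   // IV length commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Generates a fresh 128-bit AES key (a real system would load one from a key store). */
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts one block on its way to disk; output is IV followed by ciphertext and tag. */
    static byte[] encryptBlock(SecretKey key, byte[] plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] sealed = cipher.doFinal(plain);
            byte[] stored = Arrays.copyOf(iv, IV_BYTES + sealed.length);
            System.arraycopy(sealed, 0, stored, IV_BYTES, sealed.length);
            return stored;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decrypts one stored block on its way back in; fails if the block was tampered with. */
    static byte[] decryptBlock(SecretKey key, byte[] stored) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
            return cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        byte[] stored = encryptBlock(key, "index block".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decryptBlock(key, stored), StandardCharsets.UTF_8));
    }
}
```

Each stored block is 28 bytes longer than its plaintext in this layout (12-byte IV plus 16-byte tag), which matters if the index format assumes fixed block sizes.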

Re: Indexing information on number of attachments and their names in EML file

2019-08-02 Thread Tim Allison
I'd strongly recommend rolling your own ingest code. See Erick's superb: https://lucidworks.com/post/indexing-with-solrj/ You can easily get attachments via the RecursiveParserWrapper, e.g. https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParse

Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-09-30 Thread Tim Casey
https://stackoverflow.com/questions/48348312/solr-7-how-to-do-full-text-search-w-geo-spatial-search On Mon, Sep 30, 2019 at 10:31 AM Anushka Gupta < anushka_gu...@external.mckinsey.com> wrote: > Hi, > > I want to be able to filter on different cities and also sort the results > based on geoproxi

Re: Position search

2019-10-15 Thread Tim Casey
particularly short messages. So I would expect a small set of side fields remarking this. This would allow you to carry the measures along with the data. tim On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch wrote: > Is the 100 words a hard boundary or a soft one? > > If it is a

Re: Position search

2019-10-16 Thread Tim Casey
c segments. I think you will find the last N tokens of a document have some odd categories within the search results. I might guess you have a different purpose in mind. Either way, you would likely do better to segment what you are searching. tim On Mon, Oct 14, 2019 at 11:25 PM Kaminski,

BlobRepository "runtme.lib.size"

2019-11-04 Thread Tim Swetland
rrect me if I'm wrong. Thanks, Tim

ConcurrentModificationException in SolrInputDocument writeMap

2019-11-06 Thread Tim Swetland
I'm currently running into a ConcurrentModificationException ingesting data as we attempt to upgrade from Solr 8.1 to 8.2. It's not every document, but it definitely appears regularly in our logs. We didn't run into this problem in 8.1, so I'm not sure what might have changed. I feel like this is p

Re: ConcurrentModificationException in SolrInputDocument writeMap

2019-11-06 Thread Tim Swetland
Never mind my comment on not having this problem in 8.1. We do have it there as well; I just didn't look far enough back in our logs on my initial search. Would still appreciate whatever thoughts anyone might have on the exception. On Wed, Nov 6, 2019 at 10:17 AM Tim Swetland wrote:

Re: ConcurrentModificationException in SolrInputDocument writeMap

2019-11-18 Thread Tim Swetland
stException (java.lang.String cannot be cast to java.util.Map) on the replica as in issue SOLR-13471 <https://issues.apache.org/jira/browse/SOLR-13471>. Anyway, thanks for the insight everyone, Tim On Fri, Nov 8, 2019 at 12:26 AM Shawn Heisey wrote: > On 11/6/2019 8:17 AM, Tim Swetland wrot
