Re: Solr Matched Terms

2015-08-18 Thread simon
Check out https://issues.apache.org/jira/browse/SOLR-4722, which will
return matching terms (and their offsets). The patch applies cleanly to
Solr 4; it doesn't appear to have been tried with Solr 5.
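
For the plain highlighting route discussed below, a request along these lines
will at least surface the matched terms per document (core and field names
here are only illustrative):

  http://localhost:8983/solr/collection1/select?q=text:cancer~1&hl=true&hl.fl=text&hl.snippets=5&hl.fragsize=0

Setting hl.fragsize=0 makes the highlighter return the whole field value with
the matches marked up, which an application can parse to pull out the matched
terms.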

-Simon

On Tue, Aug 18, 2015 at 11:30 AM, Jack Krupansky 
wrote:

> Maybe a specialized highlighter could be produced that simply lists the
> matched terms in a form that apps can easily consume.
>
> -- Jack Krupansky
>
> On Tue, Aug 18, 2015 at 11:11 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > Hello,
> >
> > I just wonder what's wrong with highlighting?
> >
> > On Tue, Aug 18, 2015 at 4:19 PM, Basheer Shaik 
> > wrote:
> >
> > > Hi,
> > > I am new to Solr. We have a requirement to carry out fuzzy search. I am
> > > able
> > > to do this and figure out the documents that meet the fuzzy search
> > > criteria.
> > > Is there a way to find out the list of terms from each selected
> document
> > > that matched this search criteria?
> > > Appreciate any help on this.
> > > Thanks,
> > > Basheer Shaik.
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > 
> >
>


Re: how to index document with multiple words (phrases) and words permutation?

2015-08-25 Thread simon
What you want to do is basically named entity recognition. We have quite a
similar use case (medical/scientific documents where we need to look for
disease names, drug names, MeSH terms, etc.).

Take a look at David Smiley's Solr Text Tagger (
https://github.com/OpenSextant/SolrTextTagger ) which we've been using with
some success for this task.
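
As a rough illustration, once the dictionary has been indexed into a tagger
core, tagging a piece of text is a single POST (the handler path, core name
and parameters below are just the ones we would typically use; adjust them to
your setup):

  curl -X POST \
    'http://localhost:8983/solr/tagger/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id,name&wt=json' \
    -H 'Content-Type: text/plain' \
    --data-binary 'Patient reports back pain and a family history of breast cancer.'

The response lists each tag with its start/end offsets in the posted text plus
the matching dictionary entries, so multi-word terms like "back pain" and
"breast cancer" come back in one pass.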

best

-Simon

On Mon, Aug 24, 2015 at 2:13 PM, afrooz  wrote:

> Thanks Erick,
> I will explain the detailed scenario so you might suggest a solution:
> I want to annotate a medical document based only on a medical dictionary. I
> don't need to annotate the non-medical words of the document at all.
> The medical dictionary contains terms made up of multiple words, and these
> terms taken together have a specific medical meaning. For example, "back
> pain": "back" and "pain" are two separate words, but together they have
> another meaning. These terms might be used in different orders in a
> sentence but all with the same meaning, e.g. "breast cancer" or "cancer in
> breast" should be considered the same...
> We even have terms of more than 6 words.
>
> So the question is: "I have a document with around 700 words and I need to
> annotate this document based on a medical terminology of 3 million
> records."
> Any idea how to do this?
>
>
>
>


Re: Detect term occurrences

2015-09-11 Thread simon
+1 on Sujit's recommendation: we have a similar use case (detecting drug
names / disease entities /MeSH terms ) and have been using the
SolrTextTagger with great success.

We run a separate Solr instance as a tagging  service and add the detected
tags as metadata fields to a document before it is ingested into our main
Solr collection.

How many documents/product leaflets do you have? The tagger is very fast
at the Solr level, but I'm seeing quite a bit of HTTP overhead.
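
As an aside on the KeepWordFilterFactory approach Alexandre sketches below,
the schema side would look roughly like this (field, type and file names are
placeholders, not our actual configuration):

  <fieldType name="dict_terms" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- keep only tokens that appear in the (normalized) dictionary file -->
      <filter class="solr.KeepWordFilterFactory" words="medical_terms.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

  <field name="text_dict" type="dict_terms" indexed="true" stored="false"/>
  <copyField source="text" dest="text_dict"/>

One caveat: KeepWordFilterFactory operates on single tokens, so multi-word
dictionary entries would still need to be handled separately (which is where
the tagger comes in).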

best

-Simon

On Fri, Sep 11, 2015 at 1:39 PM, Sujit Pal  wrote:

> Hi Francisco,
>
> >> I have many drug products leaflets, each corresponding to 1 product. In
> the
> other hand we have a medical dictionary with about 10^5 terms.
> I want to detect all the occurrences of those terms for any leaflet
> document.
> Take a look at SolrTextTagger for this use case.
> https://github.com/OpenSextant/SolrTextTagger
>
> 10^5 entries are not that large, I am using it for much larger dictionaries
> at the moment with very good results.
>
> It's a project built (at least originally) by David Smiley, who is also
> quite active in this group.
>
> -sujit
>
>
> On Fri, Sep 11, 2015 at 7:29 AM, Alexandre Rafalovitch  >
> wrote:
>
> > Assuming the medical dictionary is constant, I would do a copyField of
> > text into a separate field and have that separate field use:
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.html
> > with words coming from the dictionary (normalized).
> >
> > That way that new field will ONLY have your dictionary terms from the
> > text. Then you can do facet against that field or anything else. Or
> > even search and just be a lot more efficient.
> >
> > The main issue would be a gigantic filter, which may mean speed and/or
> > memory issues. Solr has some ways to deal with such large set matches
> > by compiling them into a state machine (used for auto-complete), but I
> > don't know if that's exposed for your purpose.
> >
> > But could make a fun custom filter to build.
> >
> > Regards,
> >Alex.
> > 
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> >
> > On 10 September 2015 at 22:21, Francisco Andrés Fernández
> >  wrote:
> > > Yes.
> > > I have many drug products leaflets, each corresponding to 1 product. In
> > the
> > > other hand we have a medical dictionary with about 10^5 terms.
> > > I want to detect all the occurrences of those terms for any leaflet
> > > document.
> > > Could you give me a clue about how is the best way to perform it?
> > > Perhaps, the best way is (as Walter suggests) to do all the queries
> every
> > > time, as needed.
> > > Regards,
> > >
> > > Francisco
> > >
> > > El jue., 10 de sept. de 2015 a la(s) 11:14 a. m., Alexandre
> Rafalovitch <
> > > arafa...@gmail.com> escribió:
> > >
> > >> Can you tell us a bit more about the business case? Not the current
> > >> technical one. Because it is entirely possible Solr can solve the
> > >> higher level problem out of the box without you doing manual term
> > >> comparisons. In which case, your problem scope is not quite right.
> > >>
> > >> Regards,
> > >>Alex.
> > >> 
> > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > >> http://www.solr-start.com/
> > >>
> > >>
> > >> On 10 September 2015 at 09:58, Francisco Andrés Fernández
> > >>  wrote:
> > >> > Hi all, I'm new to Solr.
> > >> > I want to detect all ocurrences of terms existing in a thesaurus
> into
> > 1
> > >> or
> > >> > more documents.
> > >> > What´s the best strategy to make it?
> > >> > Doing a query for each term doesn't seem to be the best way.
> > >> > Many thanks,
> > >> >
> > >> > Francisco
> > >>
> >
>


Re: OpenNLP plugin or similar NER software for Solr ??? !!!

2015-11-09 Thread simon
https://github.com/OpenSextant/SolrTextTagger/

We're using it for country tagging successfully.

On Wed, Nov 4, 2015 at 3:10 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> David Smiley had a place name and general tagging engine that for the life
> of me I can't find.
>
> It didn't do NER for you (I'm not sure you want to do this in the search
> engine) but it helps you tag entities in a search engine based on a
> predefined list. At least that's what I remember.
>
> On Wed, Nov 4, 2015 at 3:05 PM,  wrote:
>
> > Hi everyone,
> >
> > I need to install a plugin to extract Location (Country/State/City) from
> > free text documents - any professional advice?!? Does OpenNLP really do
> > the job? Is it English only? US only? Or does it cover worldwide place
> > names?
> > Could someone help me with this job - installation, configuration,
> > model training, etc.?
> >
> > Please help. Kind regards, Christian
> >  Christian Fotache Tel: 0728.297.207 Fax: 0351.411.570
> >
> >
> >  From: Upayavira 
> >  To: solr-user@lucene.apache.org
> >  Sent: Tuesday, November 3, 2015 12:13 PM
> >  Subject: Re: language plugin
> >
> > Looking at the code, this is not going to work without modifications to
> > Solr (or at least a custom component).
> >
> > The atomic update code is closely embedded into the Solr
> > DistributedUpdateProcessor, which expands the atomic update into a full
> > document and then posts it to the shards.
> >
> > You need to do the update expansion before your lang detect processor,
> > but there is no gap between them.
> >
> > From my reading of the code, you could create an AtomicUpdateProcessor
> > that simply expands updates, and insert that before the
> > LangDetectUpdateProcessor.
> >
> > Upayavira
> >
> > On Tue, Nov 3, 2015, at 06:38 AM, Chaushu, Shani wrote:
> > > Hi
> > > When I make an atomic update (set field) on the content field and also on
> > > another field, the language field becomes generic. Meaning, it doesn't
> > > work on a set-field update, only on the first insert. Even if the language
> > > was detected the first time, it just becomes generic after the update.
> > > Any idea?
> > >
> > > The chain is
> > >
> > > <updateRequestProcessorChain>
> > >   <processor
> > > class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
> > >     <str name="langid.fl">title,content,text</str>
> > >     <str name="langid.langField">language_t</str>
> > >     <str name="langid.langsField">language_all_t</str>
> > >     <str name="langid.fallback">generic</str>
> > >     <bool name="langid.overwrite">false</bool>
> > >     <float name="langid.threshold">0.8</float>
> > >   </processor>
> > >   <processor class="solr.LogUpdateProcessorFactory"/>
> > >   <processor class="solr.RunUpdateProcessorFactory"/>
> > > </updateRequestProcessorChain>
> > >
> > >
> > > Thanks,
> > > Shani
> > >
> > >
> > >
> > >
> > > -Original Message-
> > > From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> > > Sent: Thursday, October 29, 2015 17:04
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: language plugin
> > >
> > > Are you trying to do an atomic update without the content field? If so,
> > > it sounds like Solr needs an enhancement (bug fix?) so that language
> > > detection would be skipped if the input field is not present. Or maybe
> > > that could be an option.
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Thu, Oct 29, 2015 at 3:25 AM, Chaushu, Shani <
> shani.chau...@intel.com
> > >
> > > wrote:
> > >
> > > > Hi,
> > > > I'm using the Solr language detection plugin on the field named "content"
> > > > (Solr 4.10, plugin LangDetectLanguageIdentifierUpdateProcessorFactory).
> > > > When I'm indexing for the first time it works fine, but if I want to
> > > > set one field again (regardless of whether it's the content or not) it
> > > > goes to its default language. If I'm setting another field I would like
> > > > the language to stay the way it was before, and I don't want to insert
> > > > all the content again. Is there an option to set the plugin so that it
> > > > won't calculate the language again? (Putting langid.overwrite to false
> > > > didn't work.)
> > > >
> > > > Thanks,
> > > > Shani
> > > >
> > > >
> >
> >
> >
> >
>
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 

Re: fl=value equals?

2015-11-13 Thread simon
Please do push your script to github - I (re)-compile custom code
infrequently and never remember how to setup the environment.
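
For reference, the map()-based workaround quoted below can also be used
directly as a pseudo-field rather than going through a separate parameter
(sfield/pt values are illustrative; geodist() needs them):

  http://localhost:8983/solr/select?q=*:*&sfield=store&pt=39.7,-104.9&radius=1&fl=*,dist:map($radius,1,1,0,geodist())

That returns 0 in the dist pseudo-field when radius=1 (the "national" case)
and the real geodist() value otherwise, without any custom code.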

On Thu, Nov 12, 2015 at 5:14 AM, Upayavira  wrote:

> Okay, makes sense. As to your question - making a new ValueSourceParser
> that handles 'equals' sounds pretty straight-forward.
>
> If it helps, I have somewhere an Ant project that will unpack Solr and
> compile custom components against it. I could push that to github or
> something.
>
> Upayavira
>
> On Thu, Nov 12, 2015, at 07:59 AM, billnb...@gmail.com wrote:
> > fl=$b tells me it works. Or I can do a sort=$b asc
> >
> > The idea is to calculate a score but only include geo if it is not a
> > national search. Do we want to send in a parameter into the QT which
> > allows us to omit geo from national searches
> >
> >
> > Bill Bell
> > Sent from mobile
> >
> > > On Nov 11, 2015, at 1:15 AM, Upayavira  wrote:
> > >
> > > I concur with Jan - what does b= do?
> > >
> > > Also asking, how did you identify that it worked?
> > >
> > > Upayavira
> > >
> > >> On Wed, Nov 11, 2015, at 02:58 AM, William Bell wrote:
> > >> I was able to get it to work kinda with a map().
> > >>
> > >> http://localhost:8983/solr/select?q=*:*&radius=1&b=
> > >> <
> http://localhost:8983/solr/select?q=*:*&radius=national&b=if(equals($radius,%27national%27),0,geodist())
> >
> > >> map($radius,1,1,0,geodist())
> > >>
> > >> Where 1= National
> > >>
> > >> Do you have an example of a SearchComponent? It would be pretty easy
> to
> > >> copy map() and develop an equals() right?
> > >>
> > >> if(equals($radius, 'national'), 0, geodist())
> > >>
> > >> This would probably be useful for everyone.
> > >>
> > >> On Tue, Nov 10, 2015 at 4:05 PM, Jan Høydahl 
> > >> wrote:
> > >>
> > >>> Where is your “b” parameter used? I think that instead of trying to
> set a
> > >>> new “b” http param (which solr will not evaluate as a function), you
> should
> > >>> instead try to insert your function or switch qParser directly where
> the
> > >>> “b” param is used, e.g. in a bq or similar.
> > >>>
> > >>> A bit heavy weight, but you could of course write a custom
> SearchComponent
> > >>> to construct your “b” parameter...
> > >>>
> > >>> --
> > >>> Jan Høydahl, search solution architect
> > >>> Cominvent AS - www.cominvent.com
> > >>>
> >  10. nov. 2015 kl. 23.52 skrev William Bell :
> > 
> >  We are trying to look at a value, and change another value based on
> that.
> > 
> >  For example, for national search we want to pass in
> radius=national, and
> >  then set another variable equal to 0, else set the other variable =
> to
> >  geodist() calculation.
> > 
> >  We tried {!switch} but this only appears to work on fq/q. There is
> no
> >  function for constants for equals
> > >>>
> http://localhost:8983/solr/select?q=*:*&radius=national&b=if(equals($radius,'national'),0,geodist())
> > 
> >  This does not work:
> > 
> >  http://localhost:8983/solr/select?q=*:*&radius=national&b={!switch
> >  case.national=0 default=geodist() v=$radius}
> > 
> >  Ideas?
> > 
> > 
> > 
> >  --
> >  Bill Bell
> >  billnb...@gmail.com
> >  cell 720-256-8076
> > >>
> > >>
> > >> --
> > >> Bill Bell
> > >> billnb...@gmail.com
> > >> cell 720-256-8076
>


Re: Retrieving list of words for highlighting

2015-03-27 Thread simon
There's a JIRA ( https://issues.apache.org/jira/browse/SOLR-4722 )
 describing a highlighter which returns term positions rather than
snippets, which could then be mapped to  the matching words in the indexed
document (assuming that it's stored or that you have a copy elsewhere).

-Simon

On Wed, Mar 25, 2015 at 7:30 PM, Damien Dykman 
wrote:

> In Solr 5 (or 4), is there an easy way to retrieve the list of words to
> highlight?
>
> Use case: allow an external application to highlight the matching words
> of a matching document, rather than using the highlighted snippets
> returned by Solr.
>
> Thanks,
> Damien
>


Custom Function for date reformatting

2015-06-12 Thread simon
Has anyone written a Solr function which will reformat Solr's ISO8601 Date
fields and could be used to generate pseudo-fields in search results ?

I am converting existing applications that have baked-in assumptions that
dates are in the format -mm-dd to use Solr, and tracking down every
place where a date format conversion is needed is proving painful indeed ;=(

My thought is to write a custom function of the form
datereformatter(<date field>, <format string>), but I thought I'd
check if it's already been done or if someone can suggest a better approach.

regards

-Simon


Solr suddenly starts creating .cfs (compound) segments during indexing

2016-09-27 Thread simon
Our index builds take around 6 hours, and I've noticed recently that
segments created towards the end of the build (in the last hour or so)  use
the compound file format (.cfs). I assumed that this might be due to the
number of open files approaching a maximum, but both the hard and soft open
file limits for the Solr JVM process are set to 65536, so that doesn't seem
very likely. It's obviously not a problem, but I'm curious as to why this
might be happening.


Environment:
OS = Centos 7 Linux

Java:
java -version =>
openjdk version "1.8.0_45"
OpenJDK Runtime Environment (build 1.8.0_45-b13)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)

Solr 5.4 started with the bin/solr script: ps shows

java -server -Xms5g -Xmx5g -XX:NewRatio=3 -XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4
-XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark
-XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
-XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -Djetty.port=8983
-DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=EST
-Djetty.home=/home/srosenthal/defsolr/server
-Dsolr.solr.home=/home/srosenthal/defsolr/server/solr
-Dsolr.install.dir=/home/srosenthal/defsolr -Xss256k -jar start.jar
-XX:OnOutOfMemoryError=/home/srosenthal/defsolr/bin/oom_solr.sh 8983
/home/srosenthal/defsolr/server/logs --module=http

solrconfig.xml: basically the default with some minor tweaks in the
indexConfig section
5.0

 


200
1


  20
  60
  20




... everything else is default

Insights as to why this is happening would be welcome.
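
For what it's worth, compound-file use is governed by two knobs in
indexConfig: useCompoundFile (for newly flushed segments) and the merge
policy's noCFSRatio / maxCFSSegmentSizeMB (for merged segments, where
TieredMergePolicy defaults to using CFS for relatively small merges). A
sketch of forcing it off - untested here, using the 5.x element syntax:

  <useCompoundFile>false</useCompoundFile>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <double name="noCFSRatio">0.0</double>
  </mergePolicy>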

-Simon


Re: Can Solr find related terms in a document

2016-10-17 Thread simon
Do you already have a set of terms for which you want to find out their
co-occurrence, or are you trying to do data mining, looking in a
collection for terms which occur together more often than by chance?


On Sun, Oct 16, 2016 at 3:45 AM, Yangrui Guo  wrote:

> Hello
>
> I'm curious to know if Solr can correlate the occurrences of two terms.
> E.g. if "Bush administration" and "stupid mistake" often appear in the same
> article, then Solr will think that the two terms are related. Is there a
> way to achieve this?
>
> Yangrui
>


Re: Solr 6.6 UNLOAD core broken?

2017-06-09 Thread simon
I'm seeing the same behavior. The CoreAdminAPI call (as generated by the
Admin UI) looks correct, and the core.properties file is removed.

I don't see anything in the CHANGES.txt for this release which would imply
such a change in behavior, nor anything in the 6.6 reference guide, so it
looks like a bug.

-Simon

On Fri, Jun 9, 2017 at 5:14 AM, Andreas Hubold  wrote:

> Hi,
>
> I just tried to update from Solr 6.5.1 to Solr 6.6.0 and observed a
> changed behaviour with regard to unloading cores in Solr standalone mode.
>
> After unloading a core using the CoreAdmin API (e.g. via Admin UI), I
> still get search results for that core. It seems, the search request
> triggers automatic recreation of the core now. With Solr 6.5.1 search
> requests to unloaded cores were answered with 404 as expected.
>
> Can you confirm this being a bug in 6.6.0 or is this an intended change?
>
> Cheers,
> Andreas
>
>


Re: Phrase Exact Match with Margin of Error

2017-06-15 Thread simon
I think that's because the KeywordTokenizer by definition produces a single
token (not a phrase).

Perhaps you could create two fields via a copyField - the one you already
have (field1), and one tokenized using StandardTokenizer or
WhitespaceTokenizer (field2), which will produce a phrase with multiple
tokens. Then construct a query which searches both: field1 for an exact
match, and field2 using the ComplexPhraseQueryParser (use the localparams
syntax), and combine them, boosting field1 (the exact match).
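
A rough sketch of that, with illustrative field names (subject_exact uses
your existing KeywordTokenizer-based type, subject_tok a conventionally
tokenized one):

  <field name="subject_tok" type="text_general" indexed="true" stored="false"/>
  <copyField source="subject_exact" dest="subject_tok"/>

and then a query along the lines of:

  q=subject_exact:"bridge the gat between your skills and your goals"^10
    OR _query_:"{!complexphrase df=subject_tok}\"bridge the gat~2 between your skills and your goals\""

The first clause scores true exact matches highest; the second tolerates a
couple of edits per word via the fuzzy terms inside the complex phrase.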

HTH

-Simon

On Thu, Jun 15, 2017 at 1:20 PM, Max Bridgewater 
wrote:

> Thanks Susheel. The challenge is that if I search for the word "between"
> alone, I still get plenty of results. In a way I want the query to  match
> the document title exactly (up to a few characters) and the document title
> match the query exactly (up to a few characters). KeywordTokenizer allows
> that. But complexphrase does not seem to work with KeywordTokenizer.
>
> On Thu, Jun 15, 2017 at 10:23 AM, Susheel Kumar 
> wrote:
>
> > CompledPhraseQuery parser is what you need to look
> > https://cwiki.apache.org/confluence/display/solr/Other+
> > Parsers#OtherParsers-ComplexPhraseQueryParser.
> > See below for e.g.
> >
> >
> >
> > http://localhost:8983/solr/techproducts/select?
> debugQuery=on&indent=on&q=
> > manu:%22Bridge%20the%20gat~1%20between%20your%20skills%
> > 20and%20your%20goals%22&defType=complexphrase
> >
> > On Thu, Jun 15, 2017 at 5:59 AM, Max Bridgewater <
> > max.bridgewa...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I am trying to do phrase exact match. For this, I use
> > > KeywordTokenizerFactory. This basically does what I want to do. My
> field
> > > type is defined as follows:
> > >
> > >  > > positionIncrementGap="100">
> > >   
> > > 
> > > 
> > >   
> > >   
> > > 
> > > 
> > >   
> > > 
> > >
> > >
> > > In addition to this, I want to tolerate typos of two or three letters.
> I
> > > thought fuzzy search could allow me to accept this margin of error. But
> > > this doesn't seem to work.
> > >
> > > A typical query I would have is:
> > >
> > > q=subjet:"Bridge the gap between your skills and your goals"
> > >
> > > Now, in this query, if I replace gap with gat, I was hoping I could do
> > > something such as:
> > >
> > > q=subjet:"Bridge the gat between your skills and your goals"~0.8
> > >
> > > But this doesn't quite do what I am trying to achieve.
> > >
> > > Any suggestion?
> > >
> >
>


Re: How Solr knows the Cores it has on startup?

2017-09-12 Thread simon
It will look for cores based on the discovery of core.properties files. Full
details are at the link below.

https://lucene.apache.org/solr/guide/6_6/defining-core-properties.html#defining-core-properties

There is a gotcha in what you want to do in that on  an unload, the
core.properties file is deleted in current versions of Solr - so you'll
have to find a way (outside Solr) to copy it or re-create it.

What is the use case here ?

best

-Simon

On Tue, Sep 12, 2017 at 1:27 PM, Shashank Pedamallu 
wrote:

> Hi,
>
> I wanted to know how Solr picks up cores on startup. Basically, what I
> would like to try is to read cores created by one Solr instance from another
> Solr instance, i.e.,
>
>   *   I will have 2 Solr Instances SOLR1, SOLR2. Only SOLR1 is started.
>   *   I’m creating a core (Core1) using SOLR1.
>   *   After filling it to some capacity, I unload this core without
> deleting the data.
>   *   I would now start SOLR2 and would like to point SOLR2 to Core1.
> Can someone please share me the details on how I can achieve this?
>
> Thanks,
> Shashank
>


Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
Sounds as though an update request processor will do that, and also
eliminate the need to use the PatternReplaceFilterfactory downstream.

Take a look at the documentation in
https://lucene.apache.org/solr/guide/6_6/update-request-processors.html.
I'm thinking that the RegexReplaceProcessorFactory might work for this.
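
A minimal sketch of such a chain (the chain name and pattern are
illustrative; the character class targets the usual control characters other
than tab, CR and LF):

  <updateRequestProcessorChain name="strip-ctrl">
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldRegex">.*</str>
      <str name="pattern">[\x00-\x08\x0B\x0C\x0E-\x1F]</str>
      <str name="replacement"> </str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Note this only helps once the request has been parsed; if the bad bytes break
the XML/JSON parsing itself, the cleanup has to happen on the client side.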

best

-Simon

On Thu, Sep 14, 2017 at 1:46 PM, Arnold Bronley 
wrote:

> I know I can apply PatternReplaceFilterFactory to remove control characters
> from indexed value. However, is it possible to do similar thing for stored
> value? Because of some control characters included in indexing request,
> Solr throws Illegal Character Exception.
>


Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
@Arnold: are these non-UTF-8 control characters (which is what the Nutch
issue was about) or otherwise legal UTF-8 characters which Solr for some
reason is choking on?

If you could provide a full stack trace it would be really helpful.


On Thu, Sep 14, 2017 at 2:55 PM, Markus Jelsma 
wrote:

> Hello,
>
> You can not do this in Solr, you cannot even send non-character code
> points in the first place. For Apache Nutch we solved the problem by
> stripping those non-character code points from Strings before putting them
> in SolrDocument. Check the ticket, you can easily reuse the strip method.
>
> Perhaps it would be a good idea to move the method to SolrDocument or
> somewhere in SolrJ in the first place, so others don't have to bother with
> this problem.
>
> Regards,
> Markus
>
> https://issues.apache.org/jira/browse/NUTCH-1016
>
>
>
> -Original message-
> > From:Arnold Bronley 
> > Sent: Thursday 14th September 2017 19:46
> > To: solr-user@lucene.apache.org
> > Subject: How to remove control characters in stored value at Solr side
> >
> > I know I can apply PatternReplaceFilterFactory to remove control
> characters
> > from indexed value. However, is it possible to do similar thing for
> stored
> > value? Because of some control characters included in indexing request,
> > Solr throws Illegal Character Exception.
> >
>


Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
Looks as though the problem is in parsing some malformed XML, based on
what I'm seeing:

...
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
((CTRL-CHAR, code 11))
... ( char #11 is a vertical tab).

This should be fixed outside Solr, but if that is not practical and you
could live with dropping the offending document(s), then you might want to
investigate the TolerantUpdateProcessorFactory (Solr 6.1 or later).
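
Configuration-wise that is just an extra processor near the head of the
update chain; a sketch, with the error limit as an assumed value:

  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">10</int>
  </processor>

Bear in mind it tolerates per-document failures within a batch; a request
whose XML can't be parsed at all will still fail outright.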

-Simon

On Thu, Sep 14, 2017 at 3:56 PM, arnoldbronley 
wrote:

> Thanks for information. Here is the full stack trace. I thought to handle
> it
> from client side but client apps are not under my control and I don't have
> access to them.
>
> org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code
> 11))
>  at [row,col {unknown-source}]: [1,413]
> at org.apache.solr.handler.loader.XMLLoader.load(
> XMLLoader.java:179)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(
> UpdateRequestHandler.java:97)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(
> ContentStreamHandlerBase.java:68)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(
> RequestHandlerBase.java:153)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)
> at org.apache.solr.servlet.HttpSolrCall.execute(
> HttpSolrCall.java:654)
> at org.apache.solr.servlet.HttpSolrCall.call(
> HttpSolrCall.java:460)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:303)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:254)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1668)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(
> SecurityHandler.java:548)
> at
> org.eclipse.jetty.server.session.SessionHandler.
> doHandle(SessionHandler.java:226)
> at
> org.eclipse.jetty.server.handler.ContextHandler.
> doHandle(ContextHandler.java:1160)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
> at
> org.eclipse.jetty.server.session.SessionHandler.
> doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.
> doScope(ContextHandler.java:1092)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(
> ContextHandlerCollection.java:213)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.
> handle(HandlerCollection.java:119)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:518)
> at org.eclipse.jetty.server.HttpChannel.handle(
> HttpChannel.java:308)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(
> HttpConnection.java:244)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(
> AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(
> FillInterest.java:95)
> at
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(
> SelectChannelEndPoint.java:93)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.
> produceAndRun(ExecuteProduceConsume.java:246)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(
> ExecuteProduceConsume.java:156)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
> QueuedThreadPool.java:654)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(
> QueuedThreadPool.java:572)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
> ((CTRL-CHAR, code 11))
>  at [row,col {unknown-source}]: [1,413]
> at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(
> StreamScanner.java:674)
> at
> com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(
> BasicStreamReader.java:4576)
> at
> com.ctc.wstx.sr.BasicStreamReader.nextFromTree(
> BasicStreamReader.java:2881)
> at com.ctc.wstx.sr.BasicStreamReader.next(
> BasicStreamReader.java:1073)
> at org.apache.solr.handler.loader.XMLLoader.readDoc(
> XMLLoader.java:397)
> at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:249)
> at org.apache.solr.handler.loader.XMLLoader.load(
> XMLLoader.java:177)
> ... 32 more
>
>
>
>


Re: Upgrade path from 5.4.1

2017-11-02 Thread simon
Though see SOLR-11078, which reports significant query slowdowns
after converting *Trie to *Point fields in 7.1, compared with 6.4.2.

On Wed, Nov 1, 2017 at 9:06 PM, Yonik Seeley  wrote:

> On Wed, Nov 1, 2017 at 2:36 PM, Erick Erickson 
> wrote:
> > I _always_ prefer to reindex if possible. Additionally, as of Solr 7
> > all the numeric types are deprecated in favor of points-based types
> > which are faster on all fronts and use less memory.
>
> They are a good step forward in genera, and faster for range queries
> (and multiple-dimensions), but looking at the design I'd guess that
> they may be slower for exact-match queries?
> Has anyone tested this?
>
> -Yonik
>


Re: use mutiple ssd in solr cloud

2017-11-07 Thread simon
I don't think there's any way to do that within Solr. If you're using
Linux, the Logical Volume Manager can be used to create a single volume
from multiple devices (RAID), from which you can create partitions/file
systems as required. There may be equivalent Windows functionality - I
can't say.
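
A rough sketch of the LVM route on Linux (device names are illustrative; this
stripes one logical volume across two SSDs):

  pvcreate /dev/sdb /dev/sdc
  vgcreate solr_vg /dev/sdb /dev/sdc
  lvcreate -n solr_lv -l 100%FREE -i 2 solr_vg   # -i 2 = stripe across 2 devices
  mkfs.xfs /dev/solr_vg/solr_lv
  mount /dev/solr_vg/solr_lv /media/ssd

After that, pointing Solr's data directory at /media/ssd works as before,
with the combined capacity of both drives.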

best

-Simon

On Tue, Nov 7, 2017 at 1:44 AM, Amin Raeiszadeh 
wrote:

> Hi
> I want to use more than one SSD in each server of the Solr cluster, but I
> don't know how to set multiple drives in the solr.xml configuration.
> I set one drive path in solr.xml by:
> /media/ssd
> but I can't set more than one SSD.
> How should I do it?
> thanks.
>


Re: Highlighting, offsets -- external doc store

2016-11-29 Thread simon
You might want to take a look at
https://issues.apache.org/jira/browse/SOLR-4722
('highlighter which generates a list of query term positions'). We used it
a while back; it doesn't appear to have been updated for any Solr > 4.10.
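
On the offsets questions below: to have Solr hand back per-term
position/offset information through the term vector component, the field
needs to be indexed with term vectors enabled, e.g. (field name illustrative):

  <field name="body" type="text_general" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

Term vectors add noticeably to index size, and the text itself still has to
live somewhere (stored in Solr or in your external store) for the offsets to
be applied to anything.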

-Simon

On Tue, Nov 29, 2016 at 11:43 AM, John Bickerstaff  wrote:

> All,
>
> One of the questions I've been asked to answer / prove out is around the
> question of highlighting query matches in responses.
>
> BTW - One assumption I'm making is that highlighting is basically a
> function of storing offsets for terms / tokens at index time.  If that's
> not right, I'd be grateful for pointers in the right direction.
>
> My underlying need is to get highlighting on search term matches for
> returned documents.  I need to choose between doing this in Solr and using
> an external document store, so I'm interested in whether Solr can provide
> the doc store with the information necessary to identify which section(s)
> of the doc to highlight in a query response...
>
> A few questions:
>
> 1. This page doesn't say a lot about how things work - is there somewhere
> with more information on dealing with offsets and highlighting? On offsets
> and how they're handled?
> https://cwiki.apache.org/confluence/display/solr/Highlighting
>
> 2. Can I return offset information with a query response or is that
> internal only?  If yes, can I return offset info if I have NOT stored the
> data in Solr but indexed only?
>
> (Explanation: Currently my project is considering indexing only and storing
> the entire text elsewhere -- using Solr to return only doc ID's for
> searches.  If Solr could also return offsets, these could be used in
> processing the text stored elsewhere to provide highlighting)
>
> 3. Do I assume correctly that in order for Solr highlighting to work
> correctly, the text MUST also be stored in Solr (I.E. not indexed only, but
> stored=true)
>
> Many thanks...
>


Unexplainable indexing i/o errors

2017-03-27 Thread simon
I'm seeing an odd error during indexing for which I can't find any reason.

The relevant solr log entry:

2017-03-24 19:09:35.363 ERROR (commitScheduler-30-thread-1) [
x:build0324] o.a.s.u.CommitTracker auto commit
error...:java.io.EOFException: read past EOF:
 MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
 at
org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
...
Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
status indeterminate: remaining=0, please run checkindex for more details
(resource=
BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")))
 at
org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
 at
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:140)
 followed within a few seconds by

 2017-03-24 19:09:56.402 ERROR (commitScheduler-31-thread-1) [
x:build0324] o.a.s.u.CommitTracker auto commit
error...:org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
...
Caused by: java.io.EOFException: read past EOF:
MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
at
org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)

This error is repeated a few times as the indexing continued and further
autocommits were triggered.

I stopped the indexing process, made a backup snapshot of the index,
 restarted indexing at a checkpoint, and everything then completed without
further incidents

I ran checkIndex on the saved snapshot and it reported no errors
whatsoever. Operations on the complete index (including an optimize and
several query scripts) have all been error-free.

Some background:
 Solr information from the beginning of the checkindex output:
 ---
 Opening index @ /indexes/solrindexes/build0324.bad/index

Segments file=segments_9s numSegments=105 version=6.3.0
id=7m1ldieoje0m6sljp7xocbz9l userData={commitTimeMSec=1490400514324}
  1 of 105: name=_be maxDoc=1227144
version=6.3.0
id=7m1ldieoje0m6sljp7xocburb
codec=Lucene62
compound=false
numFiles=14
size (MB)=4,926.186
diagnostics = {os=Linux, java.vendor=Oracle Corporation,
java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.3.0,
mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_45-b13,
source=merge, mergeFactor=19, os.version=3.10.0-229.1.2.el7.x86_64,
timestamp=1490380905920}
no deletions
test: open reader.OK [took 0.176 sec]
test: check integrity.OK [took 37.399 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [49 fields] [took 0.000 sec]
test: field norms.OK [17 fields] [took 0.030 sec]
test: terms, freq, prox...OK [14568108 terms; 612537186 terms/docs
pairs; 801208966 tokens] [took 30.005 sec]
test: stored fields...OK [150164874 total field count; avg 122.4
fields per doc] [took 35.321 sec]
test: term vectorsOK [4804967 total term vector count; avg 3.9
term/freq vector fields per doc] [took 55.857 sec]
test: docvalues...OK [4 docvalues fields; 0 BINARY; 1 NUMERIC;
2 SORTED; 0 SORTED_NUMERIC; 1 SORTED_SET] [took 0.954 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]
  -

  The indexing process is a Python script (using the scorched Python
client) which spawns multiple instances of itself, in this case 6, so there
are definitely concurrent calls (to /update/json).

Solrconfig and the schema have not been changed for several months, during
which time many ingests have been done, and the documents which were being
indexed at the time of the error have been indexed before without problems,
so I don't think it's a data issue.

I saw the same error occur earlier in the day, and decided at that time to
delete the core and restart the Solr instance.

The server is an Amazon instance running CentOS 7. I checked the system
logs and didn't see any evidence of hardware errors

I'm puzzled as to why this would start happening out of the blue and I
can't find any particularly relevant posts to this forum or Stackexchange.
Anyone have an idea what's going on ?

-Simon


Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-28 Thread simon
You might want to take a look at the patch in
https://issues.apache.org/jira/browse/SOLR-4722 - 'Highlighter which
generates a list of query term position(s) for each item in a list of
documents, or returns null if highlighting is disabled.' I've used it  for
retrieving the term positions with no need for actual highlighting. The
patch is pretty old - I applied it to Solr 4.10 I think, so will probably
need some work for later releases.
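
On the /tvrh idea raised in the message below: if the field is indexed with
termVectors, termPositions and termOffsets enabled, a request along these
lines returns positions and offsets per term for the matching documents (core
and field names are illustrative):

  http://localhost:8983/solr/collection1/tvrh?q=text:cancer&tv.fl=text&tv.positions=true&tv.offsets=true

The catch is that it reports positions/offsets for every term in the field,
not just the query terms, so the client still has to intersect that list with
the terms it cares about.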

HTH

-Simon

On Tue, Mar 28, 2017 at 4:59 AM, forest_soup  wrote:

> Thanks Eric.
>
> Actually solr highlighting function does not meet my requirement. My
> requirement is not showing the highlighted words in snippets, but show them
> in the whole opening document. So I would like to get the term's
> position/offset info from solr. I went through the highlight feature, but
> found that exact info(position/offset) is not returned.
> If you know that info within highlighting feature, could you please point
> it
> out to me?
>
> The most promising way seems to be /tvrh and tv.offsets/tv.positions
> parameters. But I haven't tried it. Any comments on that one?
>
> Thanks!
>
>
>
>


Re: keywords not found - google like feature

2017-04-13 Thread simon
Regardless of the business case (which would be good to know) you might
want to try something along the lines of
http://stackoverflow.com/questions/25038080/how-can-i-tell-solr-to-return-the-hit-search-terms-per-document
- basically generate pseudo-fields using the exists() function query which
will return a boolean if the term is in a specific field.
I've used this for simple cases where it worked well, though I wouldn't
like to speculate on how well this scales if you have an edismax query
where you might need to generate multiple term/field combinations.
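
As a concrete sketch of that approach (field and term values are
illustrative):

  fl=id,score,hit_title:exists(query({!v='title:mistake'})),hit_body:exists(query({!v='body:mistake'}))

Each pseudo-field comes back true or false per document, which is enough for
the client to reconstruct a Google-style "missing: term" annotation.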

HTH

-Simon

On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch 
wrote:

> Are you asking visual representation or an actual feature. Because if
> all your keywords/clauses are optional (default SHOULD) then Solr
> automatically tries to match maximum number of them and then less and
> less. So, if all words do not match, it will return results that match
> less number of words.
>
> And words not-matched is effectively your strike-through negative
> space. You can probably recover that from debug info, though it will
> be not pretty and perhaps a bit slower.
>
> The real issue here is ranking. Does Google do something special with
> ranking when they do strike through. Do they do some grouping and
> ranking within groups, not just a global one?
>
> The biggest question is - of course - what is your business - as
> opposed to look-alike - objective. Because explaining your needs
> through a similarity with other product's secret implementation is a
> long way to get there. Too much precision loss in each explanation
> round.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 13 April 2017 at 20:49, Nilesh Kamani  wrote:
> > Hello All,
> >
> > When we search google, sometimes google returns results with mention of
> > keywords not found (mentioned as strike-through)
> >
> > Does Solr provide such feature ?
> >
> >
> > Thanks,
> > Nilesh Kamani
>


Indexing I/O errors and CorruptIndex messages

2017-04-26 Thread simon
reposting this as the problem described is happening again and there were
no responses to the original email. Anyone ?

I'm seeing an odd error during indexing for which I can't find any reason.

The relevant solr log entry:

2017-03-24 19:09:35.363 ERROR (commitScheduler-30-thread-1) [
x:build0324] o.a.s.u.CommitTracker auto commit
error...:java.io.EOFException: read past EOF:  MMapIndexInput(path="/
indexes/solrindexes/build0324/index/_4ku.fdx")
 at org.apache.lucene.store.ByteBufferIndexInput.readByte(
ByteBufferIndexInput.java:75)
...
Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
status indeterminate: remaining=0, please run checkindex for more details
(resource= BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/
solrindexes/build0324/index/_4ku.fdx")))
 at org.apache.lucene.codecs.CodecUtil.checkFooter(
CodecUtil.java:451)
 at org.apache.lucene.codecs.compressing.
CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:140)
 followed within a few seconds by

 2017-03-24 19:09:56.402 ERROR (commitScheduler-31-thread-1) [
x:build0324] o.a.s.u.CommitTracker auto commit
error...:org.apache.solr.common.SolrException:
Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
...
Caused by: java.io.EOFException: read past EOF:
MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
at org.apache.lucene.store.ByteBufferIndexInput.readByte(
ByteBufferIndexInput.java:75)

This error is repeated a few times as the indexing continued and further
autocommits were triggered.

I stopped the indexing process, made a backup snapshot of the index,
 restarted indexing at a checkpoint, and everything then completed without
further incidents

I ran checkIndex on the saved snapshot and it reported no errors
whatsoever. Operations on the complete index (including an optimize and
several query scripts) have all been error-free.

Some background:
 Solr information from the beginning of the checkindex output:
 ---
 Opening index @ /indexes/solrindexes/build0324.bad/index

Segments file=segments_9s numSegments=105 version=6.3.0
id=7m1ldieoje0m6sljp7xocbz9l userData={commitTimeMSec=1490400514324}
  1 of 105: name=_be maxDoc=1227144
version=6.3.0
id=7m1ldieoje0m6sljp7xocburb
codec=Lucene62
compound=false
numFiles=14
size (MB)=4,926.186
diagnostics = {os=Linux, java.vendor=Oracle Corporation,
java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.3.0,
mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_45-b13,
source=merge, mergeFactor=19, os.version=3.10.0-229.1.2.el7.x86_64,
timestamp=1490380905920}
no deletions
test: open reader.OK [took 0.176 sec]
test: check integrity.OK [took 37.399 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [49 fields] [took 0.000 sec]
test: field norms.OK [17 fields] [took 0.030 sec]
test: terms, freq, prox...OK [14568108 terms; 612537186 terms/docs
pairs; 801208966 tokens] [took 30.005 sec]
test: stored fields...OK [150164874 total field count; avg 122.4
fields per doc] [took 35.321 sec]
test: term vectorsOK [4804967 total term vector count; avg 3.9
term/freq vector fields per doc] [took 55.857 sec]
test: docvalues...OK [4 docvalues fields; 0 BINARY; 1 NUMERIC;
2 SORTED; 0 SORTED_NUMERIC; 1 SORTED_SET] [took 0.954 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]
  -

  The indexing process is a Python script (using the scorched Python
client) which spawns multiple instances of itself, in this case 6, so there
are definitely concurrent calls (to /update/json).

Solrconfig and the schema have not been changed for several months, during
which time many ingests have been done, and the documents which were being
indexed at the time of the error have been indexed before without problems,
so I don't think it's a data issue.

I saw the same error occur earlier in the day, and decided at that time to
delete the core and restart the Solr instance.

The server is an Amazon instance running CentOS 7. I checked the system
logs and didn't see any evidence of hardware errors

I'm puzzled as to why this would start happening out of the blue and I
can't find any particularly relevant posts to this forum or Stackexchange.
Anyone have an idea what's going on ?


Re: Indexing I/O errors and CorruptIndex messages

2017-04-27 Thread simon
Nope ... huge file system (600gb) only 50% full, and a complete index would
be 80gb max.

On Wed, Apr 26, 2017 at 4:04 PM, Erick Erickson 
wrote:

> Disk space issue? Lucene requires at least as much free disk space as
> your index size. Note that the disk full issue will be transient, IOW
> if you look now and have free space it still may have been all used up
> but had some space reclaimed.
>
> Best,
> Erick
>
> On Wed, Apr 26, 2017 at 12:02 PM, simon  wrote:
> > reposting this as the problem described is happening again and there were
> > no responses to the original email. Anyone ?
> > 
> > I'm seeing an odd error during indexing for which I can't find any
> reason.
> >
> > The relevant solr log entry:
> >
> > 2017-03-24 19:09:35.363 ERROR (commitScheduler-30-thread-1) [
> > x:build0324] o.a.s.u.CommitTracker auto commit
> > error...:java.io.EOFException: read past EOF:  MMapIndexInput(path="/
> > indexes/solrindexes/build0324/index/_4ku.fdx")
> >  at org.apache.lucene.store.ByteBufferIndexInput.readByte(
> > ByteBufferIndexInput.java:75)
> > ...
> > Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
> > status indeterminate: remaining=0, please run checkindex for more details
> > (resource= BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/
> > solrindexes/build0324/index/_4ku.fdx")))
> >  at org.apache.lucene.codecs.CodecUtil.checkFooter(
> > CodecUtil.java:451)
> >  at org.apache.lucene.codecs.compressing.
> > CompressingStoredFieldsReader.(CompressingStoredFieldsReader.
> java:140)
> >  followed within a few seconds by
> >
> >  2017-03-24 19:09:56.402 ERROR (commitScheduler-31-thread-1) [
> > x:build0324] o.a.s.u.CommitTracker auto commit
> > error...:org.apache.solr.common.SolrException:
> > Error opening new searcher
> > at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
> > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
> > ...
> > Caused by: java.io.EOFException: read past EOF:
> > MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
> > at org.apache.lucene.store.ByteBufferIndexInput.readByte(
> > ByteBufferIndexInput.java:75)
> >
> > This error is repeated a few times as the indexing continued and further
> > autocommits were triggered.
> >
> > I stopped the indexing process, made a backup snapshot of the index,
> >  restarted indexing at a checkpoint, and everything then completed
> without
> > further incidents
> >
> > I ran checkIndex on the saved snapshot and it reported no errors
> > whatsoever. Operations on the complete index (inclcuing an optimize and
> > several query scripts) have all been error-free.
> >
> > Some background:
> >  Solr information from the beginning of the checkindex output:
> >  ---
> >  Opening index @ /indexes/solrindexes/build0324.bad/index
> >
> > Segments file=segments_9s numSegments=105 version=6.3.0
> > id=7m1ldieoje0m6sljp7xocbz9l userData={commitTimeMSec=1490400514324}
> >   1 of 105: name=_be maxDoc=1227144
> > version=6.3.0
> > id=7m1ldieoje0m6sljp7xocburb
> > codec=Lucene62
> > compound=false
> > numFiles=14
> > size (MB)=4,926.186
> > diagnostics = {os=Linux, java.vendor=Oracle Corporation,
> > java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.3.0,
> > mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_45-
> b13,
> > source=merge, mergeFactor=19, os.version=3.10.0-229.1.2.el7.x86_64,
> > timestamp=1490380905920}
> > no deletions
> > test: open reader.OK [took 0.176 sec]
> > test: check integrity.OK [took 37.399 sec]
> > test: check live docs.OK [took 0.000 sec]
> > test: field infos.OK [49 fields] [took 0.000 sec]
> > test: field norms.OK [17 fields] [took 0.030 sec]
> > test: terms, freq, prox...OK [14568108 terms; 612537186 terms/docs
> > pairs; 801208966 tokens] [took 30.005 sec]
> > test: stored fields...OK [150164874 total field count; avg 122.4
> > fields per doc] [took 35.321 sec]
> > test: term vectorsOK [4804967 total term vector count; avg
> 3.9
> > term/freq vector fields per doc] [took 55.857 sec]
> > test: docvalues...OK [4 docvalues fields; 0 BINARY; 1
> NUMERIC;
> > 2 SORTED; 0 SORTED_NUMERIC; 1 SORTED_SET] [took 0.954 sec]
> > test: points..OK [0 fields, 0 poi

Re: Reload an unloaded core

2017-05-02 Thread simon
I ran into the exact same situation recently.  I unloaded from the browser
GUI which does not delete the data or instance dirs, but does delete
core.properties.  I couldn't find any API  either so I eventually manually
recreated core.properties and restarted Solr.

Would be nice if the core.properties file were to be renamed rather than
deleted and if there were a RESCAN action to scan for unloaded cores and
reload them.

On Tue, May 2, 2017 at 12:53 PM, Shashank Pedamallu 
wrote:

> Hi all,
>
> I want to unload a core from Solr without deleting data-dir or
> instance-dir. I’m performing some operations on the data-dir after this and
> then I would like to reload the core from the same data-dir. These are the
> things I tried:
>
>   1.  Reload api – throws an exception saying no such core exists.
>   2.  Create api – throws an exception saying a core with given name
> already exists.
>
> Can someone point me what api I could use to achieve this. Please note
> that, I’m working with Solr in Non-Cloud mode without Zookeeper,
> Collections, etc.
>
> Thanks in advance!
>
> Thanks,
> Shashank Pedamallu
>


Re: Reload an unloaded core

2017-05-02 Thread simon
the core.properties file definitely disappears if you use a configset, as in

#
#Written by CorePropertiesLocator
#Tue May 02 20:19:40 UTC 2017
name=testcore
dataDir=/indexes/solrindexes/testcore
configSet=myconf

Using a conf directory, as in

#Written by CorePropertiesLocator
#Tue May 02 20:30:44 UTC 2017
name=testcorewithconf
schema=conf/schema.xml
dataDir=/indexes/solrindexes/testcorewithconf

has the same behavior.

This is Solr 6.3.0 standalone, and I share your memory that at one point in
the distant past core.properties was renamed on an unload.

Probably worth submitting a JIRA

-Simon

On Tue, May 2, 2017 at 4:04 PM, Erick Erickson 
wrote:

> IIRC, the core.properties file _is_ renamed to
> core.properties.unloaded or something like that.
>
> Yeah, this is something of a pain. The inverse of "unload" is "create"
> but you have to know exactly how to create a core, and in SolrCloud
> mode that's...interesting. It's much safer to bring the Solr node
> down, do what you want then start it up, although not always possible.
>
> Best,
> Erick
>
> On Tue, May 2, 2017 at 10:55 AM, simon  wrote:
> > I ran into the exact same situation recently.  I unloaded from the
> browser
> > GUI which does not delete the data or instance dirs, but does delete
> > core.properties.  I couldn't find any API  either so I eventually
> manually
> > recreated core.properties and restarted Solr.
> >
> > Would be nice if the core.properties file were to be renamed rather than
> > deleted and if there were a RESCAN action to scan for unloaded cores and
> > reload them.
> >
> > On Tue, May 2, 2017 at 12:53 PM, Shashank Pedamallu <
> spedama...@vmware.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> I want to unload a core from Solr without deleting data-dir or
> >> instance-dir. I’m performing some operations on the data-dir after this
> and
> >> then I would like to reload the core from the same data-dir. These are
> the
> >> things I tried:
> >>
> >>   1.  Reload api – throws an exception saying no such core exists.
> >>   2.  Create api – throws an exception saying a core with given name
> >> already exists.
> >>
> >> Can someone point me what api I could use to achieve this. Please note
> >> that, I’m working with Solr in Non-Cloud mode without Zookeeper,
> >> Collections, etc.
> >>
> >> Thanks in advance!
> >>
> >> Thanks,
> >> Shashank Pedamallu
> >>
>


Re: Indexing I/O errors and CorruptIndex messages

2017-05-04 Thread simon
I've pretty much ruled out system/hardware issues - the AWS instance has
been rebooted,  and indexing to a core on a new and empty  disk/file system
fails in the same way with a CorruptIndexException.
I can  generally get indexing to complete by significantly dialing down the
number of indexer scripts running concurrently, but the duration goes up
proportionately.

-Simon


On Thu, Apr 27, 2017 at 9:26 AM, simon  wrote:

> Nope ... huge file system (600gb) only 50% full, and a complete index
> would be 80gb max.
>
> On Wed, Apr 26, 2017 at 4:04 PM, Erick Erickson 
> wrote:
>
>> Disk space issue? Lucene requires at least as much free disk space as
>> your index size. Note that the disk full issue will be transient, IOW
>> if you look now and have free space it still may have been all used up
>> but had some space reclaimed.
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 26, 2017 at 12:02 PM, simon  wrote:
>> > reposting this as the problem described is happening again and there
>> were
>> > no responses to the original email. Anyone ?
>> > 
>> > I'm seeing an odd error during indexing for which I can't find any
>> reason.
>> >
>> > The relevant solr log entry:
>> >
>> > 2017-03-24 19:09:35.363 ERROR (commitScheduler-30-thread-1) [
>> > x:build0324] o.a.s.u.CommitTracker auto commit
>> > error...:java.io.EOFException: read past EOF:
>> MMapIndexInput(path="/
>> > indexes/solrindexes/build0324/index/_4ku.fdx")
>> >  at org.apache.lucene.store.ByteBufferIndexInput.readByte(
>> > ByteBufferIndexInput.java:75)
>> > ...
>> > Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
>> > status indeterminate: remaining=0, please run checkindex for more
>> details
>> > (resource= BufferedChecksumIndexInput(MM
>> apIndexInput(path="/indexes/
>> > solrindexes/build0324/index/_4ku.fdx")))
>> >  at org.apache.lucene.codecs.CodecUtil.checkFooter(
>> > CodecUtil.java:451)
>> >  at org.apache.lucene.codecs.compressing.
>> > CompressingStoredFieldsReader.(CompressingStoredFields
>> Reader.java:140)
>> >  followed within a few seconds by
>> >
>> >  2017-03-24 19:09:56.402 ERROR (commitScheduler-31-thread-1) [
>> > x:build0324] o.a.s.u.CommitTracker auto commit
>> > error...:org.apache.solr.common.SolrException:
>> > Error opening new searcher
>> > at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:
>> 1820)
>> > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
>> > ...
>> > Caused by: java.io.EOFException: read past EOF:
>> > MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
>> > at org.apache.lucene.store.ByteBufferIndexInput.readByte(
>> > ByteBufferIndexInput.java:75)
>> >
>> > This error is repeated a few times as the indexing continued and further
>> > autocommits were triggered.
>> >
>> > I stopped the indexing process, made a backup snapshot of the index,
>> >  restarted indexing at a checkpoint, and everything then completed
>> without
>> > further incidents
>> >
>> > I ran checkIndex on the saved snapshot and it reported no errors
>> > whatsoever. Operations on the complete index (inclcuing an optimize and
>> > several query scripts) have all been error-free.
>> >
>> > Some background:
>> >  Solr information from the beginning of the checkindex output:
>> >  ---
>> >  Opening index @ /indexes/solrindexes/build0324.bad/index
>> >
>> > Segments file=segments_9s numSegments=105 version=6.3.0
>> > id=7m1ldieoje0m6sljp7xocbz9l userData={commitTimeMSec=1490400514324}
>> >   1 of 105: name=_be maxDoc=1227144
>> > version=6.3.0
>> > id=7m1ldieoje0m6sljp7xocburb
>> > codec=Lucene62
>> > compound=false
>> > numFiles=14
>> > size (MB)=4,926.186
>> > diagnostics = {os=Linux, java.vendor=Oracle Corporation,
>> > java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.3.0,
>> > mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_45-
>> b13,
>> > source=merge, mergeFactor=19, os.version=3.10.0-229.1.2.el7.x86_64,
>> > timestamp=1490380905920}
>> > no deletions
>> > test: open reader.OK [took 0.176 sec]
>> > test: check integrity.OK [took 37.399 sec

Re: SOLR | De-Duplication | Remove duplicate records based on their status

2017-05-31 Thread simon
Your updateRequestProcessorChain config  snippet specifies the "id" field
to generate a signature, but the sample data doesn't contain an "id" field
... check that out first.

-Simon

On Wed, May 31, 2017 at 12:06 PM, Lebin Sebastian 
wrote:

> Hello,
>
> I am indexing two different model with same data but different status.
>
> Eg:
> *Scenario -1*
> {Model: "", name: "abc", status: "T"}
> {Model: "", name: "abc", status: "A"}
>
> Expected Output
> {Model: "", name: "abc", status: "A"}
>
> *Scenario -2 *
> {Model: "", name: "abc", status: "A"}
> {Model: "", name: "abc", status: "T"}
>
> Expected Output
> {Model: "", name: "abc", status: "A"}
>
> *Scenario -3*
> {Model: "", name: "abc", status: "A"}
> {Model: "", name: "abc", status: "A"}
>
> Expected Output
> {Model: "", name: "abc", status: "A"} either one.
>
>
> *Scenario -4*
> {Model: "", name: "abc", status: "T"}
> {Model: "", name: "abc", status: "T"}
>
> Expected Output
> {Model: "", name: "abc", status: "T"} either one.
>
> .
>
> Scenario 3 & 4 are working as expected with current configuration which I
> have given below.
>
> For Scenario 1 & 2 output should be based on the status of the record.
>
> Please help me to fix scenario 1 & 2.
>
>
> *Solr version : 5.3*
>
> *Solrconfig.xml*
>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">dedupe</str>
>   </lst>
> </requestHandler>
>
> <updateRequestProcessorChain name="dedupe">
>   <processor class="solr.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">signature</str>
>     <bool name="overwriteDupes">true</bool>
>     <str name="fields">id</str>
>     <str name="signatureClass">solr.processor.Lookup3Signature</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
>
> Thanks,
>
> Lebin F
>


Re: Luke 4.7.0 released

2014-04-02 Thread simon
Also seeing this on Mac OS X.

java version = Java(TM) SE Runtime Environment (build 1.7.0_51-b13)


On Wed, Apr 2, 2014 at 11:01 AM, Joshua P  wrote:

> Hi there!
>
> I'm receiving the following errors when trying to run luke-with-deps.jar
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> Exception in thread "main"
> Exception: java.lang.OutOfMemoryError thrown from the
> UncaughtExceptionHandler in thread "main"
>
> Any ideas?
>
> On Monday, March 10, 2014 5:20:05 PM UTC-4, Dmitry Kan wrote:
>>
>> Hello!
>>
>> Luke 4.7.0 has been released. Download it here:
>>
>> https://github.com/DmitryKey/luke/releases/tag/4.7.0
>>
>> Release based on pull request of Petri Kivikangas (
>> https://github.com/DmitryKey/luke/pull/2) Kiitos, Petri!
>>
>> Tested against the solr-4.7.0 index.
>>
>> 1. Upgraded maven plugins.
>> 2. Added simple Windows launch script: In Windows, Luke can now be
>> launched easily by executing luke.bat. Script sets MaxPermSize to 512m
>> because Luke was found to crash on lower settings.
>>
>> Best regards,
>>
>> Dmitry Kan
>>
>> --
>> Blog: http://dmitrykan.blogspot.com
>> Twitter: http://twitter.com/dmitrykan
>>
>


Re: Luke 4.7.0 released

2014-04-03 Thread simon
adding that worked - thanks.


On Thu, Apr 3, 2014 at 4:18 AM, Dmitry Kan  wrote:

> Hi Joshua, Simon,
>
> do you pass the -XX:MaxPermSize=512m to your jvm?
>
> java -XX:MaxPermSize=512m -jar luke-with-deps.jar
>
> My java runtime environment is of the same version as Simon's: build
> 1.7.0_51-b13, run on ubuntu.
>
> Dmitry
>
>
> On Wed, Apr 2, 2014 at 6:54 PM, simon  wrote:
>
> > Also seeing this on Mac OS X.
> >
> > java version = Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> >
> >
> > On Wed, Apr 2, 2014 at 11:01 AM, Joshua P 
> wrote:
> >
> > > Hi there!
> > >
> > > I'm receiving the following errors when trying to run
> luke-with-deps.jar
> > >
> > > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
> > further
> > > details.
> > > Exception in thread "main"
> > > Exception: java.lang.OutOfMemoryError thrown from the
> > > UncaughtExceptionHandler in thread "main"
> > >
> > > Any ideas?
> > >
> > > On Monday, March 10, 2014 5:20:05 PM UTC-4, Dmitry Kan wrote:
> > >>
> > >> Hello!
> > >>
> > >> Luke 4.7.0 has been released. Download it here:
> > >>
> > >> https://github.com/DmitryKey/luke/releases/tag/4.7.0
> > >>
> > >> Release based on pull request of Petri Kivikangas (
> > >> https://github.com/DmitryKey/luke/pull/2) Kiitos, Petri!
> > >>
> > >> Tested against the solr-4.7.0 index.
> > >>
> > >> 1. Upgraded maven plugins.
> > >> 2. Added simple Windows launch script: In Windows, Luke can now be
> > >> launched easily by executing luke.bat. Script sets MaxPermSize to 512m
> > >> because Luke was found to crash on lower settings.
> > >>
> > >> Best regards,
> > >>
> > >> Dmitry Kan
> > >>
> > >> --
> > >> Blog: http://dmitrykan.blogspot.com
> > >> Twitter: http://twitter.com/dmitrykan
> > >>
> > >
> >
>
>
>
> --
> Dmitry
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
>


Duplicate Unique Key

2014-04-07 Thread Simon
Hi all,

I know someone has posted similar question before.  But my case is little
different as I don't have the schema set up issue mentioned in those posts
but still get duplicate records.

My unique key in schema is

<uniqueKey>id$</uniqueKey>

Search on Solr- admin UI:   id$:1

I got two documents
{
   "id$": "1",
   "_version_": 1464225014071951400,
"_root_": 1
},
{
"id$": "1",
"_version_": 1464236728284872700,
"_root_": 1
}

I use the SolrJ API to add documents.  My understanding is that the Solr uniqueKey is like a
database primary key. I am wondering how I could end up with two documents
with the same uniqueKey in the index.

Thanks,
Simon




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Duplicate Unique Key

2014-04-07 Thread Simon
Erick,

It's indeed quite odd.  After I triggered re-indexing of all documents (via
the normal process of the existing program), the duplication was gone.  It
cannot be reproduced easily, but it did occur occasionally, and that makes it a
frustrating task to troubleshoot.

Thanks,
Simon



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651p4129701.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Duplicate Unique Key

2014-04-08 Thread Simon
MergingIndex is not the case here, as I am not doing that.  Even though the
issue is gone for now, it is not a relief for me, as I am not sure how to
explain this to others (peers, boss and users).  I am thinking of implementing
a watchdog to check whether the total number of Solr documents exceeds the
number of items in the database; if it does, it will raise a flag so that I
can do something before getting complaints.
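
Roughly what I have in mind (a sketch only - the Solr URL, JDBC URL and table
name below are placeholders, not my real setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class IndexCountWatchdog {
    public static void main(String[] args) throws Exception {
        // ask Solr only for the total hit count, not the documents themselves
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        long solrCount = solr.query(q).getResults().getNumFound();

        // count the rows in the source table
        Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
        ResultSet rs = db.createStatement().executeQuery("SELECT COUNT(*) FROM items");
        rs.next();
        long dbCount = rs.getLong(1);

        if (solrCount > dbCount) {
            System.err.println("WARNING: Solr has " + solrCount + " docs but the DB has "
                    + dbCount + " rows - possible duplicate uniqueKeys");
        }
    }
}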





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651p4129894.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Export big extract from Solr to [My]SQL

2014-05-02 Thread simon
The cursor-based deep paging in 4.7+ works very well and the performance on
large extracts (for us, maybe  up to 100K documents) is excellent, though
it will obviously depend on the number and size of fields that you need to
 pull. I wrote a Perl module to do the extractions from Solr without
problems (and DBI takes care of  writing to a database).
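
In case it helps, the SolrJ version of the cursor loop is only a few lines; a
bare-bones sketch (the URL, page size and sort field are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorExtract {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000);
        q.setSort(SolrQuery.SortClause.asc("id"));   // the cursor needs a sort on the uniqueKey
        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                // write the document to MySQL (e.g. via JDBC) here
            }
            String next = rsp.getNextCursorMark();
            done = cursorMark.equals(next);          // unchanged mark means everything has been read
            cursorMark = next;
        }
    }
}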

I'm probably going to rewrite in Python since the final destination of many
of our extracts is Tableau,  which has  a Python API for creating TDEs
(Tableau data extracts)

regards

-Simon


On Fri, May 2, 2014 at 7:43 AM, Siegfried Goeschl  wrote:

> Hi Per,
>
> basically I see three options
>
> * use a lot of memory to scope with huge result sets
> * user result set paging
> * SOLR 4.7 supports cursors (https://issues.apache.org/
> jira/browse/SOLR-5463)
>
> Cheers,
>
> Siegfried Goeschl
>
>
> On 02.05.14 13:32, Per Steffensen wrote:
>
>> Hi
>>
>> I want to make extracts from my Solr to MySQL. Any tools around that can
>> help med perform such a task? I find a lot about data-import from SQL
>> when googling, but nothing about export/extract. It is not all of the
>> data in Solr I need to extract. It is only documents that full fill a
>> normal Solr query, but the number of documents fulfilling it will
>> (potentially) be huge.
>>
>> Regards, Per Steffensen
>>
>
>


Solr block join

2013-10-28 Thread Simon
Hi,

The block join feature introduced in Solr 4.5 is really helpful in solving
some of the issues in my project. I am able to get it working in simple
cases. However, I couldn't figure out how to use it in some more complex
cases and I could find very little reference about it.
1) How to return both parent document fields and child document fields in the
same result (in SolrJ)?
2) How to apply 'OR' to multiple child document types (searching for
documents that meet the conditions of either child document type 1 or child
document type 2)?
3) If result/sort/facet fields come from child documents, how do I define
them in the schema? What I can think of is to create a copyField for each of
them in the parent documents. Is there a better way?
4) Does block join work for multiple child levels, such as child and
grandchild documents, etc.?
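
For reference, the simple case mentioned above that does work for me looks
roughly like this (a SolrJ 4.5 sketch - the core URL and all field names here
are made up, not my actual schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BlockJoinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "p1");
        parent.addField("doc_type", "parent");
        parent.addField("title", "some parent");

        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "c1");
        child.addField("doc_type", "child");
        child.addField("name", "some child");
        parent.addChildDocument(child);      // parent and child are indexed as one block

        solr.add(parent);
        solr.commit();

        // return parents whose children match name:some
        SolrQuery q = new SolrQuery("{!parent which=\"doc_type:parent\"}name:some");
        System.out.println(solr.query(q).getResults());
    }
}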

Has anyone had similar issues and would like to share your solutions?

Thanks,
Simon



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-block-join-tp4098128.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr server requirements for 100+ million documents

2014-01-26 Thread simon
Erick's probably too modest to say so ;=) , but he wrote a great blog entry
on indexing with SolrJ -
http://searchhub.org/2012/02/14/indexing-with-solrj/ .  I took the guts of
the code in that blog and  easily customized it to write a very fast
indexer  (content from MySQL, I excised all the Tika code as I am not using
it).

You should replace StreamingUpdateSolrServer with ConcurrentUpdateSolrServer
and experiment to find the optimal number of threads to configure.
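
A minimal sketch of that (the URL, the queue size of 10000 and the 4 threads
are just starting points to experiment with; the loop stands in for reading
rows from your database):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 4);
        // in the real indexer the loop would read rows from Oracle/MySQL
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "row-" + i);
            doc.addField("title", "row number " + i);
            server.add(doc);
        }
        server.commit();
        server.shutdown();
    }
}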

-Simon


On Sun, Jan 26, 2014 at 11:28 AM, Erick Erickson wrote:

> 1> That's what I'd do. For incremental updates you might have to
> create a trigger on the main table and insert rows into another table
> that is then used to do the incremental updates. This is particularly
> relevant for deletes. Consider the case where you've ingested all your
> data then rows are deleted. Removing those same documents from Solr
> requires either a> re-indexing everything or b> getting all the docs
> in Solr and comparing them with the rows in the DB etc. This is
> expensive. c> recording the changes as above and just processing
> deletes from the "change table".
>
> 2> SolrJ is usually the most current. I don't know how much work
> SolrNet gets. However, under the covers it's just HTTP calls so since
> you have access in either to just adding HTTP parameters, you should
> be able to get the full functionality out of either. I _think_ that
> I'd go with whatever you're most comfortable with.
>
> Best,
> Erick
>
> On Sun, Jan 26, 2014 at 9:54 AM, Susheel Kumar
>  wrote:
> > Thank you Erick for your valuable inputs. Yes, we have to re-index data
> again & again. I'll look into possibility of tuning db access.
> >
> > On SolrJ and automating the indexing (incremental as well as one time) I
> want to get your opinion on below two points. We will be indexing separate
> sets of tables with similar data structure
> >
> > - Should we use SolrJ and write Java programs that can be scheduled to
> trigger indexing on demand/schedule based.
> >
> > - Is using SolrJ a better idea even for searching than using SolrNet? As
> our frontend is in .Net so we started using SolrNet but I am afraid down
> the road when we scale/support SolrCloud using SolrJ is better?
> >
> >
> > Thanks
> > Susheel
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Sunday, January 26, 2014 8:37 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr server requirements for 100+ million documents
> >
> > Dumping the raw data would probably be a good idea. I guarantee you'll
> be re-indexing the data several times as you change the schema to
> accommodate different requirements...
> >
> > But it may also be worth spending some time figuring out why the DB
> access is slow. Sometimes one can tune that.
> >
> > If you go the SolrJ route, you also have the possibility of setting up N
> clients to work simultaneously, sometimes that'll help.
> >
> > FWIW,
> > Erick
> >
> > On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar <
> susheel.ku...@thedigitalgroup.net> wrote:
> >> Hi Kranti,
> >>
> >> Attach are the solrconfig & schema xml for review. I did run indexing
> with just few fields (5-6 fields) in schema.xml & keeping the same db
> config but Indexing almost still taking similar time (average 1 million
> records 1 hr) which confirms that the bottleneck is in the data acquisition
> which in our case is oracle database. I am thinking to not use
> dataimporthandler / jdbc to get data from Oracle but to rather dump data
> somehow from oracle using SQL loader and then index it. Any thoughts?
> >>
> >> Thnx
> >>
> >> -Original Message-
> >> From: Kranti Parisa [mailto:kranti.par...@gmail.com]
> >> Sent: Saturday, January 25, 2014 12:08 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Solr server requirements for 100+ million documents
> >>
> >> can you post the complete solrconfig.xml file and schema.xml files to
> review all of your settings that would impact your indexing performance.
> >>
> >> Thanks,
> >> Kranti K. Parisa
> >> http://www.linkedin.com/in/krantiparisa
> >>
> >>
> >>
> >> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> susheel.ku...@thedigitalgroup.net> wrote:
> >>
> >>> Thanks, Svante. Your indexing speed using db seems to really fast.
> >>> Can you please provide some more detail on how you are indexing db

Re: Suggester on Dynamic fields

2014-10-22 Thread Simon
Hi Ramzi,

First thanks for your comments.

Regarding your line
 parameters.set("qt", /suggest")

My understanding is that "suggest" has to be a handler defined in
SolrConfig.xml. And the handler needs to have one of more defined (search)
component(s).  The search component syntax needs dictionary (or file) or a
indexed field to search on.   In my case, I am going with the indexed
fields.  

As I stated in my post, the issue I have is that there is a list of fields
that the user configures at run time (which are defined as dynamic fields).
Whenever the user configures a field to be used for autosuggestion, it
triggers indexing. Later on, when the user searches in the configured field,
he expects the auto-completion feature while he types. Since the fields are
configured at run time, I cannot define search components for those fields in
advance unless the search component syntax supports dynamic fields,
something like f_*_auto. But unfortunately I didn't see anything like that.

My first question is how to make dynamic fields work in SearchComponent
definitions. If this is impossible, the second question would be how to
create the handler on the fly when the user configures a field for
auto-completion.

Thanks,
Simon



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggester-on-Dynamic-fields-tp4165270p4165329.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: add documents to the slave

2011-08-30 Thread simon
That's basically it.

remove all /update URLs from the slave config

On Tue, Aug 30, 2011 at 8:34 AM, Miguel Valencia <
miguel.valen...@juntadeandalucia.es> wrote:

> Hi
>
>I've read that it's possible add documents to slave machine:
>
> http://wiki.apache.org/solr/**SolrReplication#What_if_I_add_**
> documents_to_the_slave_or_if_**slave_index_gets_corrupted.3F
>
> ¿Is there anyway to not allow add to documents to slave machine? for
> example, touch on configurations files to only allow handler "/select".
>
> Thanks.
>
>
>


Re: Document Size for Indexing

2011-08-30 Thread simon
what issues exactly ?

are you using 32 bit Java ? That will restrict the JVM heap size to 2GB max.

-Simon

On Tue, Aug 30, 2011 at 11:26 AM, Tirthankar Chatterjee <
tchatter...@commvault.com> wrote:

> Hi,
>
> I have a machine (win 2008R2) with 16GB RAM, I am having issue indexing
> 1/2GB files. How do we avoid creating a SOLRInputDocument or is there any
> way to directly use Lucene Index writer classes.
>
> What would be the best approach. We need some suggestions.
>
> Thanks,
> Tirthankar
>
>


Re: Document Size for Indexing

2011-08-31 Thread simon
So if I understand you, you are  using Tika /SolrJ together in a Solr client
process which talks to your Solr server ? What is the heap size ? Can you
give us a  stack trace from the OOM exception ?

-Simon

On Wed, Aug 31, 2011 at 10:58 AM, Tirthankar Chatterjee <
tchatter...@commvault.com> wrote:

> I am using 64 bit JVM and we are going out of memory in extraction phase
> where TIKA assigns the content after extracting to SOLRInputDocument in the
> pipeline which gets loaded in memory.
>
> We are using released 3.1 version of SOLR.
>
> Thanks,
> Tirthankar
>
> -Original Message-
> From: simon [mailto:mtnes...@gmail.com]
> Sent: Tuesday, August 30, 2011 1:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Document Size for Indexing
>
> what issues exactly ?
>
> are you using 32 bit Java ? That will restrict the JVM heap size to 2GB
> max.
>
> -Simon
>
> On Tue, Aug 30, 2011 at 11:26 AM, Tirthankar Chatterjee <
> tchatter...@commvault.com> wrote:
>
> > Hi,
> >
> > I have a machine (win 2008R2) with 16GB RAM, I am having issue
> > indexing 1/2GB files. How do we avoid creating a SOLRInputDocument or
> > is there any way to directly use Lucene Index writer classes.
> >
> > What would be the best approach. We need some suggestions.
> >
> > Thanks,
> > Tirthankar
> >
> >
>


Re: Solr Hangs

2011-09-02 Thread simon
That error has nothing to do with Solr - it looks as though you are trying
to start the JVM with a heap size that is too big for the available physical
memory.

-Simon

On Fri, Sep 2, 2011 at 2:15 AM, Rohit  wrote:

> Hi All,
>
>
>
> I am using Solr 3.0 and have 4 cores build into it with the following
> statistics,
>
>
>
> Core1 : numDocs : 69402640  maxDoc : 69404745
>
> Core2 : numDocs : 5669231  maxDoc : 5808496
>
> Core3 : numDocs : 6654951  maxDoc : 6654954
>
> Core4: numDocs : 138872  maxDoc : 185723
>
>
>
> The number of updates are very high into 2 of the cores, solr is running in
> tomcat with the following JAVA_OPTS values "-Xms2g -Xmx16g
> -XX:MaxPermSize=3072m -D64" .
>
>
>
> When Solr hangs and I try to restart it I am getting the following error,
> which does indicate that it's a memory problem, but how can overcome the
> problem of hanging.
>
>
>
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
>
>
>
>
>
> Regards,
>
> Rohit
>
> Mobile: +91-9901768202
>
> About Me:  <http://about.me/rohitg> http://about.me/rohitg
>
>
>
>


Re: Using SolrJ over HTTPS

2011-09-02 Thread simon
Not sure about the exact reason for the error. However, there's a related
email thread today with a code fragment that you might find useful -- see

http://www.lucidimagination.com/search/document/a553f89beb41e39a/how_to_use_solrj_self_signed_cert_ssl_basic_auth#a553f89beb41e39a

-Simon

On Fri, Sep 2, 2011 at 7:53 AM, Kissue Kissue  wrote:

> I am using SolrJ with Solr 3.3.0 over HTTPS and getting the following
> exception:
>
> 2011-09-02 12:42:08,111 ERROR
> [org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer] - error
> java.lang.Exception: Not Implemented
>
> Just wanted to find out if there is anything special i need to do in order
> to use solrJ over HTTPS?
>
> Thanks.
>


Re: java.lang.Exception: Not Implemented

2011-09-02 Thread simon
You need to give us more information. The code which throws this exception
will be most helpful.

-Simon

On Fri, Sep 2, 2011 at 5:43 AM, Kissue Kissue  wrote:

> Hi,
>
> I am using apache solr 3.3.0 with SolrJ on a linux box.
>
> I am getting the error below when indexing kicks in:
>
> 2011-09-02 10:35:01,617 ERROR
> [org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer] - error
> java.lang.Exception: Not Implemented
>
> Does anybody have any idea why this error maybe coming up?
>
> Thanks.
>


Re: Rollback to old index stored with SolrDeletionPolicy

2011-09-06 Thread simon
I don't know of any way in Solr land. (One of the reasons for being able to
keep > 1 commit was to be able to deal with NFS semantics whereby files are
immediately deleted in contrast to the normal *x behavior of not deleting
until the last file handl is closed; that way you avoid 'stale file handle'
nastiness).

That said, an API which allows you to open an IndexSearcher to a previous
commit, or indeed to a snapshot saved in another subdirectory would be a
really useful improvement. Maybe an extension to the CoreAdmin API ?

-Simon

On Tue, Sep 6, 2011 at 5:16 PM, Emmanuel Espina wrote:

> With SolrDeletionPolicy you can chose the number of "versions" of the index
> to store ( maxCommitsToKeep, it defaults to 1). Well, how can you revert to
> an arbitrary version that you have stored? Is there anything in Solr or in
> Lucene to pick the version of the index to load?
>
> Thank you
> Emmanuel
>


Re: StreamingUpdateSolrServer#handleError

2011-09-06 Thread simon
If you're batching the documents when you send them to Solr with the #add
method, you may be out of luck - Solr doesn't do a very good job of
reporting which document in a batch caused the failure.

If you reverted to CommonsHttpSolrServer and added a doc at a time there
wouldn't be any ambiguity, but that would be very slow indeed

-Simon

On Tue, Sep 6, 2011 at 12:58 PM, Leonardo Souza wrote:

> Hi Mark,
>
> The implementation is logging anyway, we have subclassed
> StreamingUpdateSolrServer and used handleError to log,  but inspecting the
> stack trace in in the handleError method
> does not give any clue about the document(s) that failed. We have a
> solution
> that uses Solr as backend for indexing and is important to us to keep track
> of failed and succeeded
> documents so we can take further actions. In the past we used the
> CommonsHttpSolrServer but switched to StreamingUpdateSolrServer for better
> performance.
>
> ERROR SBEPStreamingUpdateSolrServer handleError - Error.
> java.lang.Exception: undefined field field_str_idx_bugged
>
> undefined field field_str_idx_bugged
>
> request: http://localhost:48085/solr/coretest0-clone/update
> at
>
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:162)
> [solr-solrj-3.3.0.jar:3.3.0 1139785 - rmuir - 2011-06-26 09:25:01]
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> [na:1.6.0_23]
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> [na:1.6.0_23]
> at java.lang.Thread.run(Thread.java:662) [na:1.6.0_23]
>
>
> thanks..
>
> --
> Leonardo S Souza
>
>
>
>
> On Mon, Sep 5, 2011 at 7:58 PM, Mark Miller  wrote:
>
> > The default impl logs with slf4j - just setup logging properly and you
> will
> > see the results?
> >
> > Alternatively, you can subclass and impl that method however you'd like.
> >
> > On Sep 5, 2011, at 6:36 PM, Leonardo Souza wrote:
> >
> > > Hi,
> > >
> > > Inspecting StreamingUpdateSolrServer#handleError i can't see how to
> keep
> > > track of failures, i'd like to discover
> > > which documents failed during the request.
> > >
> > > thanks in advance!
> > >
> > > --
> > > Leonardo S Souza
> >
> > - Mark Miller
> > lucidimagination.com
> > 2011.lucene-eurocon.org | Oct 17-20 | Barcelona
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>


Re: Difference b/w SimplepostTool code and posting the file using SOLRJ

2011-09-30 Thread simon
Well, the Solr XML format is the only format which Solr's XML Update handler
knows about, and it's been baked into Solr from the beginning. That said,
there is now an XsltUpdateRequestHandler in trunk and 3.4, which allows the
specification of an XSLT transform to convert an arbitrary XML schema to
what Solr expects.

SolrJ would normally be used to construct Solr Documents from the parsed XML
input, whatever its format, and use the SolrJ API to ingest these documents
for indexing.
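
As a rough illustration of that approach (untested; the element names
"product", "id" and "name" and the Solr URL are made-up assumptions):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // parse the arbitrary XML file passed on the command line
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        NodeList products = dom.getElementsByTagName("product");
        for (int i = 0; i < products.getLength(); i++) {
            Element p = (Element) products.item(i);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", p.getAttribute("id"));
            doc.addField("name", p.getElementsByTagName("name").item(0).getTextContent());
            solr.add(doc);
        }
        solr.commit();
    }
}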

-Simon

On Fri, Sep 30, 2011 at 9:03 AM, kiran.bodigam wrote:

> We can post the documents from command line by running the post.jar file
> and
> giving the list of files *.xml to the solr to index the document.Here we
> are
> posting the document xml documents which has some unique format i would
> like
> to know what are the advantages that i get from this format?
>   hi  hello
> 
> for this approach i need to transform my original xml file.
> what if i post the original xml as file using SOLRJ code? I would like to
> know the difference b/w SimplepostTool code and posting the  file using
> SOLRJ?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Difference-b-w-SimplepostTool-code-and-posting-the-file-using-SOLRJ-tp3382297p3382297.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Do more fields cause more memory usage?

2011-09-30 Thread simon
More fields will cause some increase in memory use, but it's hard to
quantify it, and it will also depend on how much usage the queries can make
of Solr's caches, number of simultaneous queries. In general, the number of
query fields has more impact on performance.

 A potentially big memory sink, which will depend on the number of fields in
use, is in the Lucene FieldCache, which is used when result sets are to be
sorted.

If you haven't a;ready done so, take a look at
http://wiki.apache.org/solr/SolrPerformanceFactors which describes a lot of
the factors influencing performance and memory use.

-Simon

On Fri, Sep 30, 2011 at 8:27 AM, Pranav Prakash  wrote:

> How will the number of fields increase the amount of RAM usage in Solr 3.4?
> I have about 37 different fields because i've made every field for every
> language. I shall be doing dismax search all the time (or slightly less
> than
> all). So, I will have to send my query like
>
> *qf=title_en,title_ar,title_hi,title_fr... and so on.
>
> Will it cause significantly high memory usage than had there been a unified
> field with no langauge specific tokenizers?
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com>
> |
> Google <http://www.google.com/profiles/pranny>
>


Re: how to add search terms to output of wt=csv?

2011-10-14 Thread simon
There's an open issue -
https://issues.apache.org/jira/browse/SOLR-2731 which addresses adding
this kind of metadata to csv output. There's a patch
there which may be useful, and could probably be adapted if needed

-Simon

On Fri, Oct 14, 2011 at 4:37 PM, Fred Zimmerman wrote:

> Hi,
>
> I want to include the search query in the output of wt=csv (or a duplicate
> of it) so that the process that receives this output can do something with
> the search terms. How would I accomplish this?
>
> Fred
>


Re: Integrating Surround Query Parser

2011-12-02 Thread simon
Take a look at https://issues.apache.org/jira/browse/SOLR-2703, which
integrates the surround parser into Solr trunk. There's a dependency on a
Lucene patch which resolves a caching problem  (
https://issues.apache.org/jira/browse/LUCENE-2945 ) which also wasn't
backported to earlier versions of Lucene

I'm not sure how easily this would all backport to Solr 3.1, but you
could try....

best

-Simon

On Tue, Nov 22, 2011 at 1:05 AM, Rahul Mehta wrote:

> Hello,
>
> I want to Run surround query .
>
>
>   1. Downloading from
>   http://www.java2s.com/Code/Jar/JKL/Downloadlucenesurround241jar.htm
>   2. Moved the lucene-surround-2.4.1.jar  to /apache-solr-3.1.0/example/lib
>   3. Edit  the solrconfig.xml with
>  1. 
>   4. Restart Solr
>
> Got this error :
>
> org.apache.solr.common.SolrException: Error Instantiating
> QParserPlugin, org.apache.lucene.queryParser.surround.parser.QueryParser
> is not a org.apache.solr.search.QParserPlugin
>at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:425)
>
>
>
> --
> Thanks & Regards
>
> Rahul Mehta
>


Re: Integrating Surround Query Parser

2011-12-02 Thread simon
oops, didn't see all of the thread before I hit send. Good work, Erik

On Fri, Dec 2, 2011 at 5:21 PM, simon  wrote:

> Take a look at https://issues.apache.org/jira/browse/SOLR-2703, which
> integrates the surround parser into Solr trunk. There's a dependency on a
> Lucene patch which resolves a caching problem  (
> https://issues.apache.org/jira/browse/LUCENE-2945 ) which also wasn't
> backported to earlier versions of Lucene
>
> I'm not sure how easily this would all backport to Solr 3.1, but you
> could try
>
> best
>
> -Simon
>
>
> On Tue, Nov 22, 2011 at 1:05 AM, Rahul Mehta wrote:
>
>> Hello,
>>
>> I want to Run surround query .
>>
>>
>>   1. Downloading from
>>   http://www.java2s.com/Code/Jar/JKL/Downloadlucenesurround241jar.htm
>>   2. Moved the lucene-surround-2.4.1.jar  to
>> /apache-solr-3.1.0/example/lib
>>   3. Edit  the solrconfig.xml with
>>  1. 
>>   4. Restart Solr
>>
>> Got this error :
>>
>> org.apache.solr.common.SolrException: Error Instantiating
>> QParserPlugin, org.apache.lucene.queryParser.surround.parser.QueryParser
>> is not a org.apache.solr.search.QParserPlugin
>>at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:425)
>>
>>
>>
>> --
>> Thanks & Regards
>>
>> Rahul Mehta
>>
>
>


Using postCommit event to swap cores

2010-08-19 Thread simon
Hi there,

I have solr configured with 2 cores, "live" and "standby".  "Live" is
used to service search requests from our users.  "Standby" is used to
rebuild the index from scratch each night.  Currently I have the
postCommit hook set up to swap the two cores over as soon as the indexing
on "standby" is complete. 

It seems to work well on my development box, but I have not seen this
approach discussed elsewhere so I was wondering if I was missing
something here.

Feedback gratefully received!

Simon




Re: ExternalFileField best practices

2010-08-29 Thread simon
The extended dismax parser (see SOLR-1553) may do what you are looking for

 From its feature list..

'Supports the "boost" parameter.. like the dismax bf param, but multiplies
the function query instead of adding it in'

On Sun, Aug 29, 2010 at 12:27 AM, Andy  wrote:

> But isn't it the case that bf adds the boost value while {!boost} multiply
> the boost value? In my case I think a multiplication is more appropriate.
>
> So there's no way to use ExternalFileField in {!boost}?
>
> --- On Sat, 8/28/10, Lance Norskog  wrote:
>
> > From: Lance Norskog 
> > Subject: Re: ExternalFileField best practices
> > To: solr-user@lucene.apache.org
> > Date: Saturday, August 28, 2010, 11:55 PM
> > You want the boost function bf=
> > parameter.
> >
> > On Sat, Aug 28, 2010 at 5:32 PM, Andy 
> > wrote:
> > > Lance,
> > >
> > > Thanks for the response.
> > >
> > > Can I use an ExternalFileField as an input to a boost
> > query?
> > >
> > > For example, if I put the field "popularity" in an
> > ExternalFileField, can I still use "popularity" in a boosted
> > query such as:
> > >
> > > {!boost b=log(popularity)}foo
> > >
> > > The doc says ExternalFileField can only be used in
> > FunctionQuery. Does that include a boost query like {!boost
> > b=log(popularity)}?
> > >
> > >
> > > --- On Sat, 8/28/10, Lance Norskog 
> > wrote:
> > >
> > >> From: Lance Norskog 
> > >> Subject: Re: ExternalFileField best practices
> > >> To: solr-user@lucene.apache.org
> > >> Date: Saturday, August 28, 2010, 5:16 PM
> > >> The file is completely reloaded when
> > >> you commit or optimize. There is
> > >> no incremental update available. And, yes, this
> > could be a
> > >> scaling
> > >> problem.
> > >>
> > >> How you update it is completely external to Solr.
> > >>
> > >> On Sat, Aug 28, 2010 at 2:50 AM, Andy 
> > >> wrote:
> > >> > I'm interested in using ExternalFileField to
> > store a
> > >> field "popularity" that is being updated
> > frequently.
> > >> >
> > >> > However ExternalFileField seems to be a
> > pretty obscure
> > >> feature. Have a few questions:
> > >> >
> > >> > 1) Can anyone share your experience using
> > it?
> > >> >
> > >> > 2) What is the most efficient way to update
> > the
> > >> external file?
> > >> > For example, the file could look like:
> > >> >
> > >> > 1=12  // the document with uniqueKey 1
> > has a
> > >> popularity of 12//
> > >> > 2=4
> > >> > 3=45
> > >> > 5=78
> > >> >
> > >> > Now the popularity of document 1 is updated
> > to 13:
> > >> >
> > >> > - What is the best way to update the file to
> > reflect
> > >> the change? Isn't this an O(n) operation?
> > >> > - How to deal with concurrent updates to the
> > file by
> > >> multiple threads?
> > >> >
> > >> > Would this method of using an external file
> > scale?
>
>


Re: SPAN queries in solr

2012-11-23 Thread simon
take a look at SOLR-2703, which was committed for 4.0. It provides a Solr
wrapper for the surround query parser, which supports span queries.

On Fri, Nov 23, 2012 at 3:38 PM, Anirudha Jadhav  wrote:

> What is the best way to use span queries in solr ?
>
> I see https://issues.apache.org/jira/browse/SOLR-839 which enables the XML
> Query parser that supports span queries.
>
>
>
> --
> Anirudha P. Jadhav
>


Re: Bulk update via filter query

2011-05-04 Thread simon
That won't work. External file fields are currently only usable within
function queries, according to the Javadocs

On Wed, May 4, 2011 at 12:16 PM, Rih  wrote:

> This could work. Are there search/index performance drawbacks when using
> it?
>
>
> On Mon, May 2, 2011 at 6:22 PM, Ahmet Arslan  wrote:
>
> >
> >
> > Is there an efficient way to update multiple documents with common values
> > (e.g. color = white)? An example would be to mark all white-colored items
> > as
> > sold-out.
> >
> >
> >
> http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
> can be an option.
> >
> >
>


Re: DIH Scheduling

2011-06-23 Thread simon
The Wiki page describes a design for a scheduler, which has not been
committed to Solr yet (I checked). I did see a patch the other day
(see https://issues.apache.org/jira/browse/SOLR-2305) but it didn't
look well tested.

I think that you're basically stuck with something like cron at this
time. If your application is written in java, take a look at the
Quartz scheduler - http://www.quartz-scheduler.org/

-Simon


Re: response time for pdf indexing

2011-06-23 Thread simon
How long are the documents? Indexing a large document can be slow
(although 2 seconds is very slow indeed).

2011/6/22 Rode González (libnova) :
> Hi !
>
>
>
> We are using Zend Search based on Lucene. Our indexing pdf consultations
> take longer than 2 seconds.
>
> We want to change to solr to try to solve this problem.
>
> i. Can anyone tell me the response time for querys on pdf documents on solr?
>
>
> ii. Can anyone tell me some strategies to reduce this response time?
>
>
>
> Note: the pdf is not indexed in a simple way. The pdf is converted to text
> previously and then, indexed with some additional information needed.
>
> Thank you.
> ---
>
> Rode González
>
>


Re: Removing duplicate documents from search results

2011-06-23 Thread simon
have you checked out the deduplication process that's available at
indexing time ? This includes a fuzzy hash algorithm .

http://wiki.apache.org/solr/Deduplication

-Simon

On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash  wrote:
> This approach would definitely work if the two documents are *Exactly* the
> same. But this is very fragile. Even if one extra space has been added, the
> whole hash would change. What I am really looking for is some %age
> similarity between documents, and remove those documents which are more than
> 95% similar.
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> Google <http://www.google.com/profiles/pranny>
>
>
> On Thu, Jun 23, 2011 at 15:16, Omri Cohen  wrote:
>
>> What you need to do, is to calculate some HASH (using any message digest
>> algorithm you want, md5, sha-1 and so on), then do some reading on solr
>> field collapse capabilities. Should not be too complicated..
>>
>> *Omri Cohen*
>>
>>
>>
>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
>>
>>
>>
>>
>>
>>
>>
>> -- Forwarded message --
>> From: Pranav Prakash 
>> Date: Thu, Jun 23, 2011 at 12:26 PM
>> Subject: Removing duplicate documents from search results
>> To: solr-user@lucene.apache.org
>>
>>
>> How can I remove very similar documents from search results?
>>
>> My scenario is that there are documents in the index which are almost
>> similar (people submitting same stuff multiple times, sometimes different
>> people submitting same stuff). Now when a search is performed for
>> "keyword",
>> in the top N results, quite frequently, same document comes up multiple
>> times. I want to remove those duplicate (or possible duplicate) documents.
>> Very similar to what Google does when they say "In order to show you most
>> relevant result, duplicates have been removed". How can I achieve this
>> functionality using Solr? Does Solr has an implied or plugin which could
>> help me with it?
>>
>>
>> *Pranav Prakash*
>>
>> "temet nosce"
>>
>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com
>> >
>> |
>> Google <http://www.google.com/profiles/pranny>
>>
>


Re: Can Master push data to slave

2011-08-08 Thread simon
You could configure a PostCommit event listener on the master which
would send a HTTP fetchindex request to the slave you want to carry
out replication  - see
http://wiki.apache.org/solr/SolrReplication#HTTP_API
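
Something along these lines (a rough, untested sketch - the slave host and
port are assumptions), registered as a postCommit listener in the master's
solrconfig.xml, would issue that request:

import java.net.URL;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;

public class NotifySlaveListener implements SolrEventListener {
    public void init(NamedList args) {}

    public void postCommit() {
        try {
            // ReplicationHandler HTTP API: ask the slave to pull the latest index now
            new URL("http://slave-host:8983/solr/replication?command=fetchindex")
                    .openStream().close();
        } catch (Exception e) {
            e.printStackTrace();   // real code would log this properly
        }
    }

    public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {}
}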

But why do you want the master to push to the slave ?

-Simon

On Mon, Aug 8, 2011 at 5:26 PM, Markus Jelsma
 wrote:
> Hi,
>
>> Hi
>>
>> I am using Solr 1.4. and doing a replication process where my slave is
>> pulling data from Master. I have 2 questions
>>
>> a. Can Master push data to slave
>
> Not in current versions. Not sure about exotic patches for this.
>
>> b. How to make sure that lock file is not created while replication
>
> What do you mean?
>
>>
>> Please help
>>
>> thanks
>> Pawan
>


Re: Same id on two shards

2011-08-08 Thread simon
Only one should be returned, but it's non-deterministic. See
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

-Simon

On Sat, Aug 6, 2011 at 6:27 AM, Pooja Verlani  wrote:
> Hi,
>
> We have a multicore solr with 6 cores. We merge the results using shards
> parameter or distrib handler.
> I have a problem, I might post one document on one of the cores and then
> post it after some days on another core, as I have a time-sliced multicore
> setup!
>
> The question is if I retrieve a document which is posted on both the shards,
> will solr return me only one document or both. And if only one document will
> be return, which one?
>
> Regards,
> Pooja
>


Re: Same id on two shards

2011-08-08 Thread simon
I think the first one to respond is indeed the way it works, but
that's only deterministic up to a point (if your small index is in the
throes of a commit and everything required for a response happens to
be  cached on the larger shard ... who knows ?)

On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey  wrote:
> On 8/8/2011 4:07 PM, simon wrote:
>>
>> Only one should be returned, but it's non-deterministic. See
>>
>> http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
>
> I had heard it was based on which one responded first.  This is part of why
> we have a small index that contains the newest content and only distribute
> content to the other shards once a day.  The hope is that the small index
> (less than 1GB, fits into RAM on that virtual machine) will always respond
> faster than the other larger shards (over 18GB each).  Is this an incorrect
> assumption on our part?
>
> The build system does do everything it can to ensure that periods of overlap
> are limited to the time it takes to commit a change across all of the
> shards, which should amount to just a few seconds once a day.  There might
> be situations when the index gets out of whack and we have duplicate id
> values for a longer time period, but in practice it hasn't happened yet.
>
> Thanks,
> Shawn
>
>


Re: frange not working in query

2011-08-10 Thread simon
Could you tell us what you're trying to achieve with the range query ?
It's not clear.

-Simon

On Wed, Aug 10, 2011 at 5:57 AM, Amit Sawhney  wrote:
> Hi All,
>
> I am trying to sort the results on a unix timestamp using this query.
>
> http://url.com:8983/solr/db/select/?indent=on&version=2.1&q={!frange%20l=0.25}query($qq)&qq=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1
>
> When I run this query, it says 'no field name specified in query and no 
> defaultSearchField defined in schema.xml'
>
> As soon as I remove the frange query and run this, it starts working fine.
>
> http://url.com:8983/solr/db/select/?indent=on&version=2.1&q=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1
>
> Any pointers?
>
>
> Thanks,
> Amit


Re: frange not working in query

2011-08-10 Thread simon
I meant the frange query, of course

On Wed, Aug 10, 2011 at 10:21 AM, simon  wrote:
> Could you tell us what you're trying to achieve with the range query ?
> It's not clear.
>
> -Simon
>
> On Wed, Aug 10, 2011 at 5:57 AM, Amit Sawhney  wrote:
>> Hi All,
>>
>> I am trying to sort the results on a unix timestamp using this query.
>>
>> http://url.com:8983/solr/db/select/?indent=on&version=2.1&q={!frange%20l=0.25}query($qq)&qq=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1
>>
>> When I run this query, it says 'no field name specified in query and no 
>> defaultSearchField defined in schema.xml'
>>
>> As soon as I remove the frange query and run this, it starts working fine.
>>
>> http://url.com:8983/solr/db/select/?indent=on&version=2.1&q=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1
>>
>> Any pointers?
>>
>>
>> Thanks,
>> Amit
>


Re: paging size in SOLR

2011-08-10 Thread simon
Worth remembering there are some performance penalties with deep
paging, if you use the page-by-page approach. It may not be too much of a
problem if you really are only looking to retrieve 10K docs.
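
If you do go page by page, the SolrJ loop is straightforward; a sketch (the
URL, query and page size are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PagedFetch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int pageSize = 1000;
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(pageSize);
        long numFound = Long.MAX_VALUE;           // corrected after the first page
        for (int start = 0; start < numFound; start += pageSize) {
            q.setStart(start);
            QueryResponse rsp = solr.query(q);
            numFound = rsp.getResults().getNumFound();
            // process rsp.getResults() here; the cost rises as "start" gets deeper
        }
    }
}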

-Simon

On Wed, Aug 10, 2011 at 10:32 AM, Erick Erickson
 wrote:
> Well, if you really want to you can specify start=0 and rows=1 and
> get them all back at once.
>
> You can do page-by-page by incrementing the "start" parameter as you
> indicated.
>
> You can keep from re-executing the search by setting your queryResultCache
> appropriately, but this affects all searches so might be an issue.
>
> Best
> Erick
>
> On Wed, Aug 10, 2011 at 9:09 AM, jame vaalet  wrote:
>> hi,
>> i want to retrieve all the data from solr (say 10,000 ids ) and my page size
>> is 1000 .
>> how do i get back the data (pages) one after other ?do i have to increment
>> the "start" value each time by the page size from 0 and do the iteration ?
>> In this case am i querying the index 10 time instead of one or after first
>> query the result will be cached somewhere for the subsequent pages ?
>>
>>
>> JAME VAALET
>>
>


Re: Error loading a custom request handler in Solr 4.0

2011-08-10 Thread simon
The attachment isn't showing up (in gmail, at least). Can you inline
the relevant bits of code ?

On Wed, Aug 10, 2011 at 11:05 AM, Tom Mortimer  wrote:
> Hi,
> Apologies if this is really basic. I'm trying to learn how to create a
> custom request handler, so I wrote the minimal class (attached), compiled
> and jar'd it, and placed it in example/lib. I added this to solrconfig.xml:
>     
> When I started Solr with java -jar start.jar, I got this:
>     ...
>     SEVERE: java.lang.NoClassDefFoundError:
> org/apache/solr/handler/RequestHandlerBase
> at java.lang.ClassLoader.defineClass1(Native Method)
>         ...
> So I copied all the dist/*.jar files into lib and tried again. This time it
> seemed to start ok, but browsing to http://localhost:8983/solr/ displayed
> this:
>     org.apache.solr.common.SolrException: Error Instantiating Request
> Handler, FlaxTestHandler is not a org.apache.solr.request.SolrRequestHandler
>
>   at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:410) ...
>
> Any ideas?
> thanks,
> Tom
>


Re: Error loading a custom request handler in Solr 4.0

2011-08-10 Thread simon
It's working for me. Compiled, inserted in solr/lib, added the config
line to solrconfig.

  When I send a /flaxtest request I get:



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">16</int>
  </lst>
  <str name="FlaxTest">Hello!</str>
</response>


I was doing this within a core defined in solr.xml

-Simon

On Wed, Aug 10, 2011 at 11:46 AM, Tom Mortimer  wrote:
> Sure -
>
> import org.apache.solr.request.SolrQueryRequest;
> import org.apache.solr.response.SolrQueryResponse;
> import org.apache.solr.handler.RequestHandlerBase;
>
> public class FlaxTestHandler extends RequestHandlerBase {
>
>    public FlaxTestHandler() { }
>
>    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse
> rsp)
>        throws Exception
>    {
>        rsp.add("FlaxTest", "Hello!");
>    }
>
>    public String getDescription() { return "Flax"; }
>    public String getSourceId() { return "Flax"; }
>    public String getSource() { return "Flax"; }
>    public String getVersion() { return "Flax"; }
>
> }
>
>
>
> On 10 August 2011 16:43, simon  wrote:
>
>> Th attachment isn't showing up (in gmail, at least). Can you inline
>> the relevant bits of code ?
>>
>> On Wed, Aug 10, 2011 at 11:05 AM, Tom Mortimer  wrote:
>> > Hi,
>> > Apologies if this is really basic. I'm trying to learn how to create a
>> > custom request handler, so I wrote the minimal class (attached), compiled
>> > and jar'd it, and placed it in example/lib. I added this to
>> solrconfig.xml:
>> >     
>> > When I started Solr with java -jar start.jar, I got this:
>> >     ...
>> >     SEVERE: java.lang.NoClassDefFoundError:
>> > org/apache/solr/handler/RequestHandlerBase
>> > at java.lang.ClassLoader.defineClass1(Native Method)
>> >         ...
>> > So I copied all the dist/*.jar files into lib and tried again. This time
>> it
>> > seemed to start ok, but browsing to http://localhost:8983/solr/displayed
>> > this:
>> >     org.apache.solr.common.SolrException: Error Instantiating Request
>> > Handler, FlaxTestHandler is not a
>> org.apache.solr.request.SolrRequestHandler
>> >
>> >       at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:410)
>> ...
>> >
>> > Any ideas?
>> > thanks,
>> > Tom
>> >
>>
>


Re: Error loading a custom request handler in Solr 4.0

2011-08-10 Thread simon
This is in trunk (up to date). Compiler is 1.6.0_26

classpath was  
dist/apache-solr-solrj-4.0-SNAPSHOT.jar:dist/apache-solr-core-4.0-SNAPSHOT.jar
built from trunk just prior by 'ant dist'

I'd try again with a clean trunk .

-Simon

On Wed, Aug 10, 2011 at 1:20 PM, Tom Mortimer  wrote:
> Interesting.. is this in trunk (4.0)? Maybe I've broken mine somehow!
>
> What classpath did you use for compiling? And did you copy anything other
> than the new jar into lib/ ?
>
> thanks,
> Tom
>
>
> On 10 August 2011 18:07, simon  wrote:
>
>> It's working for me. Compiled, inserted in solr/lib, added the config
>> line to solrconfig.
>>
>>  when I send a /flaxtest request i get
>>
>> 
>> 
>> 0
>> 16
>> 
>> Hello!
>> 
>>
>> I was doing this within a core defined in solr.xml
>>
>> -Simon
>>
>> On Wed, Aug 10, 2011 at 11:46 AM, Tom Mortimer  wrote:
>> > Sure -
>> >
>> > import org.apache.solr.request.SolrQueryRequest;
>> > import org.apache.solr.response.SolrQueryResponse;
>> > import org.apache.solr.handler.RequestHandlerBase;
>> >
>> > public class FlaxTestHandler extends RequestHandlerBase {
>> >
>> >    public FlaxTestHandler() { }
>> >
>> >    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse
>> > rsp)
>> >        throws Exception
>> >    {
>> >        rsp.add("FlaxTest", "Hello!");
>> >    }
>> >
>> >    public String getDescription() { return "Flax"; }
>> >    public String getSourceId() { return "Flax"; }
>> >    public String getSource() { return "Flax"; }
>> >    public String getVersion() { return "Flax"; }
>> >
>> > }
>> >
>> >
>> >
>> > On 10 August 2011 16:43, simon  wrote:
>> >
>> >> Th attachment isn't showing up (in gmail, at least). Can you inline
>> >> the relevant bits of code ?
>> >>
>> >> On Wed, Aug 10, 2011 at 11:05 AM, Tom Mortimer  wrote:
>> >> > Hi,
>> >> > Apologies if this is really basic. I'm trying to learn how to create a
>> >> > custom request handler, so I wrote the minimal class (attached),
>> compiled
>> >> > and jar'd it, and placed it in example/lib. I added this to
>> >> solrconfig.xml:
>> >> >     
>> >> > When I started Solr with java -jar start.jar, I got this:
>> >> >     ...
>> >> >     SEVERE: java.lang.NoClassDefFoundError:
>> >> > org/apache/solr/handler/RequestHandlerBase
>> >> > at java.lang.ClassLoader.defineClass1(Native Method)
>> >> >         ...
>> >> > So I copied all the dist/*.jar files into lib and tried again. This
>> time
>> >> it
>> >> > seemed to start ok, but browsing to
>> http://localhost:8983/solr/displayed
>> >> > this:
>> >> >     org.apache.solr.common.SolrException: Error Instantiating Request
>> >> > Handler, FlaxTestHandler is not a
>> >> org.apache.solr.request.SolrRequestHandler
>> >> >
>> >> >       at
>> org.apache.solr.core.SolrCore.createInstance(SolrCore.java:410)
>> >> ...
>> >> >
>> >> > Any ideas?
>> >> > thanks,
>> >> > Tom
>> >> >
>> >>
>> >
>>
>


Re: query time problem

2011-08-10 Thread simon
Off the top of my head ...

Can you tell if GC is happening more frequently than usual/expected  ?

Is the index optimized - if not, how many segments ?

It's possible that one of the shards is behind a flaky network connection.

Is the 10s performance just for the Solr query or wallclock time at
the browser ?

You can monitor cache statistics from the admin console 'statistics' page

Are you seeing anything untoward in the solr logs ?

-Simon

On Wed, Aug 10, 2011 at 1:11 PM, Charles-Andre Martin
 wrote:
> Hi,
>
>
>
> I've noticed poor performance for my solr queries in the past few days.
>
>
>
> Queries of that type :
>
>
>
> http://server:5000/solr/select?q=story_search_field_en:(water boston) OR 
> story_search_field_fr:(water boston)&rows=350&start=0&sort=r_modify_date 
> desc&shards=shard1:5001/solr,shard2:5002/solr&fq=type:(cch_story OR 
> cch_published_story)
>
>
>
> Are slow (more than 10 seconds).
>
>
>
> I would like to know if someone knows how I could investigate the problem ? I 
> tried to specify the parameters &debugQuery=on&explainOther=on but this 
> doesn't help much.
>
>
>
> I also monitored the shards log. Sometimes, there is broken pipe in the 
> shards logs.
>
>
>
> Also, is there a way I could monitor the cache statistics ?
>
>
>
> For your information, every shards master and slaves computers have enough 
> RAM and disk space.
>
>
>
>
>
> Charles-André Martin
>
>
>
>
>
>


Re: Increasing the highlight snippet size

2011-08-10 Thread simon
an hl.fragsize of 1000 is problematical, as Solr parses that
parameter as a 32 bit int... that's several bits more.

-Simon

On Wed, Aug 10, 2011 at 4:59 PM, Sang Yum  wrote:
> Hi,
>
> I have been trying to increase the size of the highlight snippets using
> "hl.fragSize" parameter, without much success. It seems that hl.fragSize is
> not making any difference at all in terms of snippet size.
>
> For example, compare the following two set of query/results:
>
> http://10.1.1.51:8983/solr/select?q=%28bookCode%3abarglewargle+AND+content%3awriting+AND+id:6970%29&rows=1&sort=id+asc&fl=id%2cbookCode%2cnavPointId%2csectionTitle&hl=true&hl.fl=content&hl.snippets=100&hl.fragSize=10&hl.maxAnalyzedChars=-1&version=2.2
>
>  to class="werd"> write a
>
> http://10.1.1.51:8983/solr/select?q=%28bookCode%3abarglewargle+AND+content%3awriting+AND+id:6970%29&rows=1&sort=id+asc&fl=id%2cbookCode%2cnavPointId%2csectionTitle&hl=true&hl.fl=content&hl.snippets=100&hl.fragSize=1000&hl.maxAnalyzedChars=-1&version=2.2
>
>  to class="werd"> write a
>
> Because of our particular needs, the content has been "spanified", each word
> with its own span id. I do apply HTMLStrip during the index time.
>
> What I would like to do is to increase the size of snippet so that the
> highlighted snippets contain more surrounding words.
>
> Although hl.fragSize went from 10 to 1000, the result is the same.
> This leads me to believe that hl.fragSize might not be the correct parameter
> to achieve the effect i am looking for. If so, what parameter should I use?
>
> Thanks!
>


Re: Minimum score filter

2011-08-15 Thread simon
The absolute value of a relevance score doesn't have a lot of meaning and
the range of scores can vary a lot depending on any boost you may apply.
Even if you normalize them (say on a 1-100 scale where 100 is the max
relevance) you can't really draw any valid conclusions from those values.

It would help if you described exactly what problem you're trying to solve.

-Simon

On Mon, Aug 15, 2011 at 1:02 PM, Donald J. Organ IV
wrote:

> Is there a way to set a minimum score requirement so that matches below a
> given score are not return/included in facet counts.


Re: Update field value in the document based on value of another field in the document

2011-08-18 Thread simon
An  UpdateRequestProcessor would do the trick. Look at the (rather minimal)
documentation and code example in
http://wiki.apache.org/solr/UpdateRequestProcessor
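
To give a flavour of it, a rough (untested) sketch for your case might look
like the following - the field names follow your description, and the
hard-coded map entry stands in for the contents of your mapping text file:

import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class PopularityProcessorFactory extends UpdateRequestProcessorFactory {
    private static final Map<String, Integer> POPULARITY = new HashMap<String, Integer>();
    static {
        POPULARITY.put("example.com", 42);   // placeholder for the domain->popularity file
    }

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object url = doc.getFieldValue("URL");
                if (url != null) {
                    String domain = new URL(url.toString()).getHost();
                    Integer pop = POPULARITY.get(domain);
                    doc.setField("popularity", pop != null ? pop : 0);
                }
                super.processAdd(cmd);
            }
        };
    }
}

You would wire the factory into an updateRequestProcessorChain in
solrconfig.xml and point the CSV handler's update.chain at it.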

-Simon

On Thu, Aug 18, 2011 at 4:15 PM, bhawna singh  wrote:

> Hi All,
> I have a requirement to update a certain field value depending on the field
> value of another field.
> To elaborate-
> I have a field called 'popularity' and a field called 'URL'. I need to
> assign popularity value depending on the domain (URL) ( I have the
> popularity and domain mapping in a text file).
>
> I am using CSVRequestHandler to import the data.
>
> What are the suggested ways to achieve this.
> Your quick response is much appreciated.
>
> Thanks,
> Bhawna
>


Re: Custom FilterFactory is when called

2011-08-22 Thread simon
On Mon, Aug 22, 2011 at 5:34 AM, occurred <
schaubm...@infodienst-ausschreibungen.de> wrote:

> Hi,
>
> I've created my own custom FilterFactory or better to say rewritten an
> existing one:
> KeywordMarkerFilterFactory
> to:
> CachingKeywordMarkerFilterFactory
>
> It will/should reload the protwords every minute.
>
> But now I found out that this FilterFactory is only called a few times when
> me server startup, but then never again.
>
> Is there a config needed to have this FilterFactory called every time for
> indexing and quering?
>
That's correct. The factory will be called at startup once for each
occurrence in the schema. It will create the object containing the protected
words (e.g. CharArraySet protectedWords) at that time. The #create method
uses this object when a field is being tokenized..

Can you post your code ?

and - what problem are you trying to solve with the
CachingKeyworkMarkerFilter ?

FWIW, I've been looking at a more generalized way of tracking  changes in
protwords/stopwords/ etc and it's turning out to be quite complex.

-Simon

>
> cheers
> Charlie
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Custom-FilterFactory-is-when-called-tp3274503p3274503.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: not equals query in solr

2011-08-25 Thread simon
http://wiki.apache.org/solr/SolrQuerySyntax has answers for you.
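The short answer for your case: a field that has at least one indexed value
matches the open range query, so

  fq=state:[* TO *] AND city:[* TO *]

(or your q=(state:[* TO *] AND city:[* TO *])) keeps only documents where
both fields are populated, assuming the "blank" documents simply have no
value in those fields rather than an indexed empty string. Also note that a
purely negative clause has nothing to subtract from on its own, which is
why -state:"" by itself doesn't behave; you'd need to anchor it with a
match-all, e.g. *:* -state:foo.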

-Simon

On Thu, Aug 25, 2011 at 1:04 AM, Ranveer Kumar wrote:

> any help...
>
> On Wed, Aug 24, 2011 at 12:58 PM, Ranveer Kumar  >wrote:
>
> > Hi,
> >
> > is it right way to do :
> > q=(state:[* TO *] AND city:[* TO *])
> >
> > regards
> > Ranveer
> >
> >
> > On Wed, Aug 24, 2011 at 12:54 PM, Ranveer Kumar  >wrote:
> >
> >> Hi All,
> >>
> >> How to do negative query in solr. Following are the criteria :
> >> I have state and city field where I want to filter only those state and
> >> city which is not blank. something like: state NOT "" AND city NOT "".
> >> I tried -state:"" but its not working.
> >>
> >> Or suggest  me to do this in better way..
> >>
> >> regards
> >> Ranveer
> >>
> >>
> >>
> >>
> >
>


Re: Solr in a windows shared hosting environment

2011-08-25 Thread simon
That's not a question we can answer in this group - you need to take it up
with your hosting provider - they may already have it available.

On Thu, Aug 25, 2011 at 2:59 PM, Devora  wrote:

> Thank you!
>
> Since it's shared hosting, how do I install java?
>
> -Original Message-
> From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
> Sent: Thursday, August 25, 2011 4:34 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr in a windows shared hosting environment
>
> Yes, but since Solr is written in Java to run in a JEE container, you would
> host Solr in a web application server, either Jetty (which comes packaged),
> or something else (say, Tomcat or WebSphere or something like that).
>
> As a result, you aren't going to find anything that says how to run Solr
> under IIS because it doesn't run under IIS.
>
> It doesn't need IIS, though it can certainly coexist alongside IIS.  If you
> want the requests to go thru IIS you might need a plug-in in IIS to handle
> that (IBM's WebSphere has such a plugin).  If you don't need the requests
> to
> go thru IIS, then that isn't an issue.
>
> Hope that helps.
>
> JRJ
>
> -Original Message-
> From: Devora [mailto:devora...@gmail.com]
> Sent: Thursday, August 25, 2011 5:15 AM
> To: solr-user@lucene.apache.org
> Subject: Solr in a windows shared hosting environment
>
> Hi,
>
>
>
> Is it possible to install Solr in  a windows (IIS 7 or IIS 7.5)  shared
> hosting environment?
>
> If yes, where can I find instructions how to do that?
>
>
>
> Thank you!
>
>


Re: DIH importing

2011-08-26 Thread simon
It sounds as though you are optimizing the index after the delta import. If
you don't do that, then only new segments will be replicated and syncing
will be much faster.
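In the versions I've used, DIH optimizes by default unless you tell it not
to, so it's worth passing optimize=false explicitly when you kick off the
import, e.g.

  http://localhost:8983/solr/dataimport?command=delta-import&optimize=false

(core name/handler path adjusted to your setup). You can confirm by
watching the index directory - if every segment file changes after a delta,
something is still rewriting the whole index.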


On Fri, Aug 26, 2011 at 12:08 PM, Mark  wrote:

> We are currently delta-importing using DIH after which all of our servers
> have to download the full index (16G). This obviously puts quite a strain on
> our slaves while they are syncing over the index. Is there anyway not to
> sync over the whole index, but rather just the parts that have changed?
>
> We would like to get to the point where are no longer using DIH but rather
> we are constantly sending documents over HTTP to our master in realtime. We
> would then like our slaves to download these changes as soon as possible. Is
> something like this even possible?
>
> Thanks for you help
>


Re: New IndexSearcher and autowarming

2011-08-26 Thread simon
The multicore API (see http://wiki.apache.org/solr/CoreAdmin ) allows you to
swap, unload, and reload cores. That should allow you to do what you want.
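For example, once you have the two cores defined (names made up here):

  http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=ondeck

swaps the names atomically, so "live" now serves the index you just built
and warmed, and the previous live index is available as "ondeck" for the
next update cycle.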

-Simon

On Fri, Aug 26, 2011 at 11:13 AM, Mike Austin wrote:

> I would like to have the ability to keep requests from being slowed from
> new
> document adds and commits by having a separate index that gets updated.
> Basically a read-only and an updatable index. After the update index has
> finished updating with new adds and commits, I'd like to switch the update
> to the "live" read-only.  At the same time, it would be nice to have the
> old
> read-only index become "updated" with the now live read-only index before I
> start this update process again.
>
> 1. Index1 is live and read-only and doesn't get slowed by updates
> 2. Index2 is updated with Index1 and gets new adds and commits
> 3. Index2 gets cache warming
> 4. Index2 becomes the live index read-only index
> 5. Index1 gets synced with Index2 so that when these steps start again, the
> updating is happening on an updated index.
>
> I know that this is possible but can't find a simple tutorial on how to do
> this.  By the way, I'm using SolrNet in a windows environment.
>
> Thanks,
> Mike
>


Re: where should i keep the class files to perform scheduling?

2011-08-26 Thread simon
The built-in DIH scheduling was never implemented as far as I know - the
Wiki section is just a design proposal and explicitly says "Hasn't been
committed to SVN (published only here) "

On Windows, you can use the Task Scheduler to do the kinds of things that
cron does on Unix/Linux.
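For example, put the call in a small import.bat (so you don't have to fight
cmd quoting inside the scheduler):

  curl "http://localhost:8983/solr/db/dataimport?command=delta-import"

and then schedule it with something like

  schtasks /create /tn "SolrDeltaImport" /sc hourly /tr "C:\solr\import.bat"

(task name, schedule, paths and the handler URL are all just placeholders -
untested, but that's the general shape).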

-Simon

On Fri, Aug 26, 2011 at 9:21 AM, nagarjuna wrote:

> Thank u very much for ur reply Erick Erickson
>i am using solr 3.3.0 version
>  and i have no idea about the cron job i thought that it would be for unix
> but i am using windows
> and i would like to integrate my scheduling task with my solr server
>
> please give me the suggestion
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/where-should-i-keep-the-class-files-to-perform-scheduling-tp3286562p3286827.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Installing Solr on a shared hosting server?

2012-10-10 Thread simon
Some time back I used Dreamhost for a Solr-based project. It looks as though
all their offerings, including shared hosting, have Java support - see
http://wiki.dreamhost.com/What_We_Support. I was very happy with their
service and support.

-Simon

On Tue, Oct 9, 2012 at 10:44 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> Bluehost doesn't seem to support Java processes, so unfortunately the
> answer seems to be no.
>
> You might want to look into getting a Linode or some other similar VPS
> hosting. Solr needs RAM to function well, though, so you're not going
> to be able to go with the cheapest option.
>
> Michael Della Bitta
>
> 
> Appinions
> 18 East 41st Street, 2nd Floor
> New York, NY 10017-6271
>
> www.appinions.com
>
> Where Influence Isn’t a Game
>
>
> On Tue, Oct 9, 2012 at 9:27 AM, caiod  wrote:
> > I was wondering if I can install Solr on bluehost's shared hosting to
> use as
> > a website search, and also how do I do so? Thank you...
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Installing-Solr-on-a-shared-hosting-server-tp4012708.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: multi-core sharing synonym map

2012-10-12 Thread simon
I definitely haven't tried this ;=) but perhaps you could create your own
XXXSynonymFilterFactory  as a subclass of SynonymFilterFactory,  which
would allow you to share the synonym map across all cores - though I think
there would need to be a nasty global variable to hold a reference to it...

-Simon

On Fri, Oct 12, 2012 at 12:27 PM, Phil Hoy wrote:

> Hi,
>
> We have a multi-core set up with a fairly large synonym file, all cores
> share the same schema.xml and synonym file but when solr loads the cores,
> it loads multiple instances of the synonym map, this is a little wasteful
> of memory and lengthens the start-up time. Is there a way to get all cores
> to share the same map?
>
>
> Phil
>


Re: solr/jetty not working for anything other than localhost

2009-11-25 Thread simon
first, check what port 8983 is bound to - should be listening on all
interfaces

netstat -an |grep 8983

You should see

tcp0  0 0.0.0.0:8983  0.0.0.0:*   LISTEN

-Simon

On Wed, Nov 25, 2009 at 3:55 PM, Joel Nylund  wrote:

> Hi, if I try to use any other hostname jetty doesnt work, gives a blank
> page, if I telnet too the server/port it just disconnects.
>
> I tried editing the scripts.conf to change the hostname, that didnt seem to
> help.
>
> For example I tried editing my etc/hosts file and added:
>
> 127.0.0.1 solriscool
>
> then:
> ping solriscool
> PING solriscool (127.0.0.1): 56 data bytes
> 64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.055 ms
> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.095 ms
>
>
> sh-3.2# telnet solriscool 8983
> Trying 127.0.0.1...
> Connected to solriscool.
> Escape character is '^]'.
> GET / HTTP/1.1
> Connection closed by foreign host.
>
>
> telnet localhost 8983
> Trying ::1...
> Connected to localhost.
> Escape character is '^]'.
> GET /solr HTTP/1.1
> Host: localhost
>
> HTTP/1.1 302 Found
> Location: http://localhost/solr/
> Content-Length: 0
> Server: Jetty(6.1.3)
>
>
> any ideas?
>
> thanks
> Joel
>
>


Re: solr/jetty not working for anything other than localhost

2009-11-25 Thread simon
On Wed, Nov 25, 2009 at 5:27 PM, Joel Nylund  wrote:

> I see:
>
> tcp46  0  0  *.8983 *.*LISTEN
> tcp4   0  0  127.0.0.1.8983 *.*LISTEN
>

Not the same version of linux/netstat as mine, but I'd guess that the
second line is the key to the problem - looks as though TCP over IPv4 is
only listening on the localhost interface, which is a network configuration
issue.

what does the Solr log say after it's started - should be a line

 INFO:  Started SelectChannelConnector @ 0.0.0.0:8983
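If it does say 0.0.0.0 there but IPv4 is still stuck on loopback, one thing
worth checking is the connector definition in example/etc/jetty.xml - you
can pin it to all interfaces explicitly with something like (paraphrasing
from memory, your jetty.xml may differ slightly):

<New class="org.mortbay.jetty.nio.SelectChannelConnector">
  <Set name="host">0.0.0.0</Set>
  <Set name="port">8983</Set>
  ...
</New>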


-Simon


> thanks
> Joel
>
>
> On Nov 25, 2009, at 5:21 PM, simon wrote:
>
>  first, check what port 8983 is bound to - should be listening on all
>> interfaces
>>
>> netstat -an |grep 8983
>>
>> You should see
>>
>> tcp0  0 0.0.0.0:8983  0.0.0.0:*   LISTEN
>>
>> -Simon
>>
>> On Wed, Nov 25, 2009 at 3:55 PM, Joel Nylund  wrote:
>>
>>  Hi, if I try to use any other hostname jetty doesnt work, gives a blank
>>> page, if I telnet too the server/port it just disconnects.
>>>
>>> I tried editing the scripts.conf to change the hostname, that didnt seem
>>> to
>>> help.
>>>
>>> For example I tried editing my etc/hosts file and added:
>>>
>>> 127.0.0.1 solriscool
>>>
>>> then:
>>> ping solriscool
>>> PING solriscool (127.0.0.1): 56 data bytes
>>> 64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.055 ms
>>> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.095 ms
>>>
>>>
>>> sh-3.2# telnet solriscool 8983
>>> Trying 127.0.0.1...
>>> Connected to solriscool.
>>> Escape character is '^]'.
>>> GET / HTTP/1.1
>>> Connection closed by foreign host.
>>>
>>>
>>> telnet localhost 8983
>>> Trying ::1...
>>> Connected to localhost.
>>> Escape character is '^]'.
>>> GET /solr HTTP/1.1
>>> Host: localhost
>>>
>>> HTTP/1.1 302 Found
>>> Location: http://localhost/solr/
>>> Content-Length: 0
>>> Server: Jetty(6.1.3)
>>>
>>>
>>> any ideas?
>>>
>>> thanks
>>> Joel
>>>
>>>
>>>
>


Re: Cleaning up dirty OCR

2010-03-09 Thread simon
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir  wrote:

> > Can anyone suggest any practical solutions to removing some fraction of
> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.
>

In another life (and with another search engine) I also had to find a
solution to the dirty OCR problem. Fortunately only in English,
unfortunately a corpus containing many non-American/non-English names, so we
also had to be very conservative and reduce the number of false positives.

There wasn't any completely satisfactory solution; there were a large number
of two- and three-letter n-grams, so we were able to use a dictionary
approach to eliminate those (names tend to be longer). We also looked for
runs of punctuation, unlikely mixes of alpha/numeric/punctuation, and also
eliminated longer words which consisted of runs of not-occurring-in-English
bigrams.
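If it helps anyone, the general shape of that in Lucene/Solr terms is a
TokenFilter that silently drops suspicious tokens. A very rough sketch
(written against a newer Lucene API than was current back then, and the two
regexes are deliberately dumb placeholders rather than the rules we actually
used):

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DirtyOcrFilter extends TokenFilter {

  // Placeholder heuristics: a run of punctuation, or letters and digits
  // mixed in one token. Real rules need tuning (think "mp3", "i18n").
  private static final Pattern PUNCT_RUN = Pattern.compile("\\p{Punct}{2,}");
  private static final Pattern ALNUM_MIX =
      Pattern.compile(".*[\\p{L}].*[0-9].*|.*[0-9].*[\\p{L}].*");

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public DirtyOcrFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String term = termAtt.toString();
      if (!looksLikeOcrGarbage(term)) {
        return true; // keep this token
      }
      // otherwise drop it and move on to the next token
    }
    return false;
  }

  private boolean looksLikeOcrGarbage(String term) {
    return PUNCT_RUN.matcher(term).find() || ALNUM_MIX.matcher(term).matches();
  }
}

You'd wrap it in a small TokenFilterFactory to plug it into the analysis
chain from schema.xml.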

Hope this helps

-Simon

>
> --
>


Re: checksum failed (hardware problem?)

2018-09-26 Thread simon
I saw something like this a year ago, which I reported as a possible bug
(https://issues.apache.org/jira/browse/SOLR-10840, which has a full
description and stack traces).

This occurred very randomly on an AWS instance; moving the index directory
to a different file system did not fix the problem. Eventually I cloned our
environment to a new AWS instance, which proved to be the solution. Why, I
have no idea...

-Simon

On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar 
wrote:

> Got it. I'll have first hardware folks check and if they don't see/find
> anything suspicious then i'll return here.
>
> Wondering if any body has seen similar error and if they were able to
> confirm if it was hardware fault or so.
>
> Thnx
>
> On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson 
> wrote:
>
> > Mind you it could _still_ be Solr/Lucene, but let's check the hardware
> > first ;)
> > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar 
> > wrote:
> > >
> > > Hi Erick,
> > >
> > > Thanks so much for your reply.  I'll now look mostly into any possible
> > > hardware issues than Solr/Lucene.
> > >
> > > Thanks again.
> > >
> > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > > > There are several of reasons this would "suddenly" start appearing.
> > > > 1> Your disk went bad and some sector is no longer faithfully
> > > > recording the bits. In this case the checksum will be wrong
> > > > 2> You ran out of disk space sometime and the index was corrupted.
> > > > This isn't really a hardware problem.
> > > > 3> Your disk controller is going wonky and not reading reliably.
> > > >
> > > > The "possible hardware issue" message is to alert you that this is
> > > > highly unusual and you should at leasts consider doing integrity
> > > > checks on your disk before assuming it's a Solr/Lucene problem
> > > >
> > > > Best,
> > > > Erick
> > > > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar  >
> > > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I am still trying to understand the corrupt index exception we saw
> > in our
> > > > > logs. What does the hardware problem comment indicates here?  Does
> > that
> > > > > mean it caused most likely due to hardware issue?
> > > > >
> > > > > We never had this problem in last couple of months. The Solr is
> > 6.6.2 and
> > > > > ZK: 3.4.10.
> > > > >
> > > > > Please share your thoughts.
> > > > >
> > > > > Thanks,
> > > > > Susheel
> > > > >
> > > > > Caused by: org.apache.lucene.index.CorruptIndexException: checksum
> > > > > failed *(hardware
> > > > > problem?)* : expected=db243d1a actual=7a00d3d2
> > > > >
> > > >
> > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/
> app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > > > [slice=_i27s_Lucene50_0.tim])
> > > > >
> > > > > It suddenly started in the logs and before which there was no such
> > error.
> > > > > Searches & ingestions all seems to be working prior to that.
> > > > >
> > > > > 
> > > > >
> > > > > 2018-09-03 17:16:49.056 INFO  (qtp834133664-519872) [c:COLL
> s:shard1
> > > > > r:core_node1 x:COLL_shard1_replica1]
> > > > > o.a.s.u.p.StatelessScriptUpdateProcessorFactory
> > update-script#processAdd:
> > > > >
> > newid=G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-25520
> 08480_1-en_US
> > > > > 2018-09-03 17:16:49.057 ERROR (qtp834133664-519872) [c:COLL
> s:shard1
> > > > > r:core_node1 x:COLL_shard1_replica1] o.a.s.h.RequestHandlerBase
> > > > > org.apache.solr.common.SolrException: Exception writing document
> id
> > > > > G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_
> 1-en_US
> > to
> > > > the
> > > > > index; possible analysis error.
> > > > > at
> > > > >
> > > >
> > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpd
> ateHandler2.java:206)
> > > > > at
> > > > >
> > > >

Re: Solr search word NOT followed by another word

2018-02-12 Thread simon
Tim:

How up to date is the SOLR-5410 patch/zip in JIRA? Looking to use the
Span Query parser in 6.5.1, migrating to 7.x sometime soon.

Would love to see these committed !

-Simon

On Mon, Feb 12, 2018 at 10:41 AM, Allison, Timothy B. 
wrote:

> That requires a SpanNotQuery.  AFAIK, there is no way to do this with the
> current parsers included in Solr.
>
> My SpanQueryParser does cover this, and I'm hoping to port it to 7.x today
> or tomorrow.
>
> Syntax would be "Leonardo [da vinci]"!~0,1
>
> https://issues.apache.org/jira/browse/LUCENE-5205
>
> https://github.com/tballison/lucene-addons/tree/master/lucene-5205
>
> https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205
>
> With Solr wrapper: https://issues.apache.org/jira/browse/SOLR-5410
>
>
> -Original Message-
> From: ivan [mailto:i...@presstoday.com]
> Sent: Monday, February 12, 2018 6:00 AM
> To: solr-user@lucene.apache.org
> Subject: Solr search word NOT followed by another word
>
> What i'm trying to do is to only get results for "Leonardo" when is not
> followed by "da vinci".
> So any result containing "Leonardo" (not followed by "da vinci") is fine
> even if i have "Leonardo da vinci" in the result. I want to filter out only
> the results where i don't have "Leonardo" without "da vinci".
>
> Examples:
> "Leonardo abc abc abc"   OK
> "Leonardo da vinci abab"  KO
> "Leonardo is the name of Leonardo da Vinci"  OK
>
>
> I can't seem to find any way to do that using solr queries. I can't use
> regex (i have a tokenized text field) and any combination of boolean logic
> doesn't seem to work.
>
> Any help?
> Thanks
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Defining Document Transformers in Solr Configuration

2018-02-27 Thread simon
We do quite complex data pulls from a Solr index for subsequent analytics,
currently using a home-grown Python API. Queries might include  a handful
of pseudofields which this API rewrites to an aliased field invoking a
Document Transformer in the 'fl' parameter list.

For example 'numcites' is transformed to

'fl= ,numcites:[subquery]&numcites.fl=pmid&numcites.q={!terms
f=md_c_pmid v=$row.pmid}&numcites.rows=10&numcites.logParamsList=q',...'

What I'd ideally like to be able to do would be to have this transformation
defined in Solr configuration so that it's not tied to one particular
external API - defining a macro, if you will, so that you could supply
'fl=a,b,c,%numcites%,...' in the request and have Solr do the expansion.

Is there some way to do this that I've overlooked ? if not, I think it
would be a useful new feature.


-Simon


Re: Defining Document Transformers in Solr Configuration

2018-02-27 Thread simon
On Tue, Feb 27, 2018 at 5:34 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarel...@bloomberg.net> wrote:

> I don't think you can define docTrasformer in the SolrConfig at the
> moment, I agree it would be a cool feature.
>
> Maybe one possibility could be to use the update request processors [1],
> and precompute the fields at index time, it would be more expensive in disk
> and index time,  but then it would simplify the fl logic and also
> performance at query time.
>

Wouldn't work for my use case - we are taking a field which contains the
IDs of documents which are cited  by the current document and 'inverting'
this to compute the number of documents which cite the current document in
the subquery. Anything precomputed could change as the index is updated.

>
> Cheers,
> Diego
>
> [1] https://lucene.apache.org/solr/guide/6_6/update-request-
> processors.html
>
> From: solr-user@lucene.apache.org At: 02/27/18 20:21:08To:
> solr-user@lucene.apache.org
> Subject: Defining Document Transformers in Solr Configuration
>
> We do quite complex data pulls from a Solr index for subsequent analytics,
> currently using a home-grown Python API. Queries might include  a handful
> of pseudofields which this API rewrites to an aliased field invoking a
> Document Transformer in the 'fl' parameter list.
>
> For example 'numcites' is transformed to
>
> 'fl= ,numcites:[subquery]&numcites.fl=pmid&numcites.q={!terms
> f=md_c_pmid v=$row.pmid}&numcites.rows=10&numcites.logParamsList=q',...'
>
> What I'd ideally like to be able to do would be have this transformation
> defined in Solr configuration so that it's not tied  to one particular
> external API -  defining a macro, if you will, so that you could supply
> 'fl='a,b,c,%numcites%,...' in the request and have Solr do the expansion.
>
> Is there some way to do this that I've overlooked ? if not, I think it
> would be a useful new feature.
>
>
> -Simon
>
>
>


Re: Defining Document Transformers in Solr Configuration

2018-02-28 Thread simon
Thanks Mikhail:

I considered that, but not all queries would request that field, and there
are in fact a couple more similar DocTransformer-generated aliased fields
which we can optionally request, so it's not a general enough solution.

-Simon

On Wed, Feb 28, 2018 at 1:18 AM, Mikhail Khludnev  wrote:

> Hello, Simon.
>
> You can define a search handler where have 
> numcites:[subquery]&numcites.fl=pmid&numcites.q={!terms
> f=md_c_pmid v=$row.pmid}&numcites.rows=10&numcites.logParamsList=q
> 
> or something like that.
>
> On Tue, Feb 27, 2018 at 11:20 PM, simon  wrote:
>
> > We do quite complex data pulls from a Solr index for subsequent
> analytics,
> > currently using a home-grown Python API. Queries might include  a handful
> > of pseudofields which this API rewrites to an aliased field invoking a
> > Document Transformer in the 'fl' parameter list.
> >
> > For example 'numcites' is transformed to
> >
> > 'fl= ,numcites:[subquery]&numcites.fl=pmid&numcites.q={!terms
> > f=md_c_pmid v=$row.pmid}&numcites.rows=10&numcites.logParamsList=q',...'
> >
> > What I'd ideally like to be able to do would be have this transformation
> > defined in Solr configuration so that it's not tied  to one particular
> > external API -  defining a macro, if you will, so that you could supply
> > 'fl='a,b,c,%numcites%,...' in the request and have Solr do the expansion.
> >
> > Is there some way to do this that I've overlooked ? if not, I think it
> > would be a useful new feature.
> >
> >
> > -Simon
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: CURL command problem on Solr

2018-05-29 Thread simon
Could it be that the header should be 'Content-Type' (which is what I see
in the relevant RFC) rather than 'Content-type' as shown in your email? I
don't know if headers are case-sensitive, but it's worth checking.
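One other thought: since this is the Windows cmd shell, single quotes aren't
treated as quoting characters there, so the -H value may never reach curl as
a header - which would also explain curl falling back to
application/x-www-form-urlencoded. Worth retrying with double quotes
throughout, e.g.

  curl -XPUT "http://localhost:8983/solr/techproducts/schema/feature-store" --data-binary "@/path/myFeatures.json" -H "Content-Type: application/json"

(same command as yours, just requoted).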

-Simon

On Tue, May 29, 2018 at 11:02 AM, Roee Tarab  wrote:

> Hi ,
>
> I am having some troubles with pushing a features file to solr while
> building an LTR model. I'm trying to upload a JSON file on windows cmd
> executable from an already installed CURL folder, with the command:
>
> curl -XPUT 'http://localhost:8983/solr/techproducts/schema/feature-store'
> --data-binary "@/path/myFeatures.json" -H 'Content-type:application/json'.
>
> I am receiving the following error massage:
>
> {
>   "responseHeader":{
> "status":500,
> "QTime":7},
>   "error":{
> "msg":"Bad Request",
> "trace":"Bad Request (400) - Invalid content type
> application/x-www-form-urlencoded; only application/json is
> supported.\r\n\tat org.apache.solr.rest.RestManager$ManagedEndpoint.
> parseJsonFromRequestBody(RestManager.java:407)\r\n\tat
> org.apache.solr.rest.
> RestManager$ManagedEndpoint.put(RestManager.java:340) 
>
> This is definitely a technical issue, and I have not been able to overcome
> it for 2 days.
>
> Is there another option of uploading the file to our core? Is there
> something we are missing in our command?
>
> Thank you in advance for any help,
>


Re: Sorting and pagination in Solr json range facet

2018-07-11 Thread simon
Looking carefully at the documentation for JSON facets, it looks as though
the offset parameter is not supported for range facets, only for term
facets.  You'd have to do pagination in your application.
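One possible workaround, if daily_window is already truncated to the day: a
terms facet over that field does support sort/offset/limit, so something
like the following (untested sketch based on your example) keeps the sorting
and paging on the Solr side, at the cost of losing the gap/empty-bucket
behaviour of the range facet:

json.facet={
  daily_totals: {
    type: terms,
    field: daily_window,
    sort: "daily_total desc",
    offset: 0,
    limit: 30,
    facet: {
      daily_total: "sum(daily_views)"
    }
  }
}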

-Simon

On Tue, Jul 10, 2018 at 11:45 AM, Anil  wrote:

> HI Eric,
>
> i mean pagination is offset and limit for facet results. Basically i am
> trying to sort the daily totals (from json facet field) and apply offset,
> limit to the buckets.
>
> json.facet=
> {
> daily_totals: {
> type: range,
> field: daily_window,
> start : "2017-11-01T00:00:00Z",
> end : "2018-03-14T00:00:00Z",
> gap:"%+1DAY",
> sort: daily_total,
> mincount:1,
> facet: {
> daily_total: "sum(daily_views)"
> }
> }
> }
>
> please let me know if you have any questions. thanks.
>
> Regards,
> Anil
>
> On 10 July 2018 at 20:22, Erick Erickson  wrote:
>
> > What exactly do you mean by "pagination" here? Facets are computed over
> > the entire result set. That is, if the number of documents found for the
> > query
> > is 1,000,000, the facets are returned counted over all 1M docs, even if
> > your
> > rows parameter is 10. The same numbers will be returned for facets
> > regardless of the start and rows parameters.
> >
> > This feels like an XY problem, you're asking how to do X (paginate
> facets)
> > to solve problem Y, but haven't stated what Y is. What's the use-case
> here?
> >
> > Best,
> > Erick
> >
> >
> >
> > On Tue, Jul 10, 2018 at 5:36 AM, Anil  wrote:
> > > Hi,
> > >
> > > Good Morning.
> > >
> > > I am trying solr json facet features. sort, offset, limit fields are
> not
> > > working for Range facet.
> > >
> > > and could not find the support in the documentation. is there any way
> to
> > > achieve sort and pagination for Range facet ? please help.
> > >
> > > Documentation of range facet says -
> > >
> > > Parameters:
> > >
> > >- field – The numeric field or date field to produce range buckets
> > from
> > >- mincount – Minimum document count for the bucket to be included in
> > the
> > >response. Defaults to 0.
> > >- start – Lower bound of the ranges
> > >- end – Upper bound of the ranges
> > >- gap – Size of each range bucket produced
> > >- hardend – A boolean, which if true means that the last bucket will
> > end
> > >at “end” even if it is less than “gap” wide. If false, the last
> > bucket will
> > >be “gap” wide, which may extend past “end”.
> > >- other – This param indicates that in addition to the counts for
> each
> > >range constraint between facet.range.start and facet.range.end,
> counts
> > >should also be computed for…
> > >   - "before" all records with field values lower then lower bound
> of
> > >   the first range
> > >   - "after" all records with field values greater then the upper
> > bound
> > >   of the last range
> > >   - "between" all records with field values between the start and
> end
> > >   bounds of all ranges
> > >   - "none" compute none of this information
> > >   - "all" shortcut for before, between, and after
> > >- include – By default, the ranges used to compute range faceting
> > >between facet.range.start and facet.range.end are inclusive of their
> > lower
> > >bounds and exclusive of the upper bounds. The “before” range is
> > exclusive
> > >and the “after” range is inclusive. This default, equivalent to
> lower
> > >below, will not result in double counting at the boundaries. This
> > behavior
> > >can be modified by the facet.range.include param, which can be any
> > >combination of the following options…
> > >   - "lower" all gap based ranges include their lower bound
> > >   - "upper" all gap based ranges include their upper bound
> > >   - "edge" the first and last gap ranges include their edge bounds
> > (ie:
> > >   lower for the first one, upper for the last one) even if the
> > > corresponding
> > >   upper/lower option is not specified
> > >   - "outer" the “before” and “after” ranges will be inclusive of
> > their
> > >   bounds, even if the first or last ranges already include those
> > boundaries.
> > >   - "all" shorthand for lower, upper, edge, outer
> > >
> > >
> > >
> > >  Thanks,
> > > Anil
> >
>


Simple Sort Is Not Working In Solr 4.7?

2015-02-17 Thread Simon Cheng
Hi,

I don't know whether it is my setup or any other reasons. But the fact is
that a very simple sort is not working in my Solr 4.7 environment.

The query is very simple :
http://localhost:8983/solr/bibs/select?q=author:soros&fl=id,author,title&sort=title+asc&wt=xml&start=0&indent=true

And the output is NOT sorted according to title :



0
1

title asc
id,author,title
true
0
author:soros
xml




9018

Soros, George, 1930-


The alchemy of finance : reading the mind of the market / George Soros



15785

Soros, George, 1930-
Soros Foundations

Bosnia / by George Soros


16281

Soros, George, 1930-
Soros Foundations


Prospect for European disintegration / by George Soros



25807

Soros, George


Open society : reforming global capitalism / George Soros



27440
George Soros on globalization

Soros, George, 1930-



22254

Soros, George, 1930-


The crisis of global capitalism : open society endangered / George Soros



16914

Soros, George, 1930-
Soros Fund Management

The theory of reflexivity / by George Soros


17343

Financial turmoil in Europe and the United States : essays / George Soros


Soros, George, 1930-



15542

Soros, George, 1930-
Harvard Club of New York City


Nationalist dictatorships versus open society / by George Soros



15891

Soros, George


The new paradigm for financial markets : the credit crisis of 2008 and what
it means / George Soros





Thank you for the help in advance,
Simon.


Re: Simple Sort Is Not Working In Solr 4.7?

2015-02-17 Thread Simon Cheng
Hi Alex,

It's simply defined like this in the schema.xml :

   

and it is cloned to the other multi-valued field o_title :

   

Should I simply change the type to be "string" instead?

Thanks again,
Simon.


On Wed, Feb 18, 2015 at 12:00 PM, Alexandre Rafalovitch 
wrote:

> What's the field definition for your "title" field? Is it just string
> or are you doing some tokenizing?
>
> It should be a string or a single token cleaned up (e.g. lower-cased)
> using KeywordTokenizer. In the example schema, you will normally see
> the original field tokenized and the sort field separately with
> copyField connection. In latest Solr, docValues are also recommended
> for sort fields.
>
> Regards,
>Alex.
>


Re: Simple Sort Is Not Working In Solr 4.7?

2015-02-17 Thread Simon Cheng
Hi Alex,

It's okay after I added in a new field "s_title" in the schema and
re-indexed.

   
   

But how can I ignore the articles ("A", "An", "The") in the sorting. As you
can see from the below example :

http://localhost:8983/solr/bibs/select?q=singapore&fl=id,title&sort=s_title+asc&wt=xml&start=0&rows=20&indent=true



0
0

singapore
true
id,title
0
s_title asc
20
xml




36

5th SEACEN-Toronto Centre Leadership Seminar for Senior Management of
Central Banks on Financial System Oversight, 16-21 Oct 2005, Singapore



70

Anti-money laundering & counter-terrorism financing / Commercial Affairs
Dept



15

China's anti-secession law : a legal perspective / Zou, Keyuan



12

China's currency peg : firm in the eye of the storm / Calla Wiemer



22

China's politics in 2004 : dawn of the Hu Jintao era / Zheng Yongnian & Lye
Liang Fook



92

Goods and Services Tax Act [2005 ed.] (Chapter 117A)



13

Governing capacity in China : creating a contingent of qualified personnel
/ Kjeld Erik Brodsgaard



21
Health care marketization in urban China / Gu Xin


85
Lianhe Zaobao, Sunday


84

Singapore : vision of a global city / Jones Lang LaSalle



7

Singapore real estate investment trusts : leveraged value / Tony Darwell



96

Singapore's success : engineering economic growth / Henri Ghesquiere



23

The Chen-Soong meeting : the beginning of inter-party rapprochement in
Taiwan? / Raymond R. Wu



17

The Haw Par saga in the 1970s / project sponsor, Low Kwok Mun; team leader,
Sandy Ho; team members, Audrey Low ... et al



78
The New paper on Sunday


95

The little Red Dot : reflections by Singapore's diplomats / editors, Tommy
Koh, Chang Li Lin



52

[Press releases and articles on policy changes affecting the Singapore
property market] / compiled by the Information Resource Centre, Monetary
Authority of Singapore



dataq

Simon is testing Solr - This one is in English. Color of the Wind. 我是中国人 ,
БOΛbШ OЙ PYCCKO-KИTAЙCKИЙ CΛOBAPb , Français-Chinois






Re: Simple Sort Is Not Working In Solr 4.7?

2015-02-18 Thread Simon Cheng
Great help and thanks to you, Alex.


On Wed, Feb 18, 2015 at 2:48 PM, Alexandre Rafalovitch 
wrote:

> Like I mentioned before. You could use string type if you just want
> title it is. Or you can use a custom type to normalize the indexed
> value, as long as you end up with a single token.
>
> So, if you want to strip leading A/An/The, you can use
> KeywordTokenizer, combined with whatever post-processing you need. I
> would suggest LowerCase filter and perhaps Regex filter to strip off
> those leading articles. You may need to iterate a couple of times on
> that specific chain.
>
> The good news is that you can just make a couple of type definitions
> with different values/order, reload the index (from Cores screen of
> the Web Admin UI) and run some of your sample titles through those
> different definitions without having to reindex in the Analysis
> screen.
>
> Regards,
>   Alex.
>
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
> On 17 February 2015 at 22:36, Simon Cheng  wrote:
> > Hi Alex,
> >
> > It's okay after I added in a new field "s_title" in the schema and
> > re-indexed.
> >
> > > multiValued="false"/>
> >
> >
> > But how can I ignore the articles ("A", "An", "The") in the sorting. As
> you
> > can see from the below example :
>
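(A sort-friendly fieldType along the lines Alex describes might look roughly
like this - untested sketch, and the type/field names are made up:

<fieldType name="sort_text" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(a|an|the)\s+" replacement="" replace="first"/>
  </analyzer>
</fieldType>
<field name="s_title" type="sort_text" indexed="true" stored="false" multiValued="false"/>
<copyField source="title" dest="s_title"/>

The pattern runs after lower-casing, so "The ", "A " and "An " at the start
of a title are stripped before sorting.)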


Creating a collection/core on HDFS with SolrCloud

2015-02-25 Thread Simon Minery

Hello,

I'm trying to create a collection on HDFS with Solr 5.0.0.
I have my solrconfig.xml with the HDFS parameters, following the 
confluence guidelines.



When creating with the bin/solr script ("bin/solr create -c
collectionHDFS -d /my/conf/") I have this error:



failure":{"":"org.apache.solr.client.solrj.SolrServerException:IOException 
occured when talking to server at: https://192.168.200.32:8983/solr"}}



With the GUI on the SolrCloud server, I have this one:

Error CREATEing SolrCore 'collectionHDFS': Unable to create core 
[collectionHDFS] Caused by: hadoop.security.authentication set to: 
simple, not kerberos, but attempting to connect to HDFS via kerberos


On my /my/conf/solrconfig.xml, I have already double-checked that :
<bool name="solr.hdfs.security.kerberos.enabled">true</bool>
<str name="solr.hdfs.security.kerberos.keytabfile">/my/conf/solr.keytab</str>
<str name="solr.hdfs.security.kerberos.principal">solr/@CLUSTER.HADOOP</str>


and on Hadoop' core-site.xml, my hadoop.security.authentication 
parameter is set to Kerberos.

Am I missing something ?
Thank you very much for your input, have a great day.
Simon M.


solr.DictionaryCompoundWordTokenFilterFactory extracts words in string

2015-03-31 Thread Simon Martinelli
Hi,

I configured solr.DictionaryCompoundWordTokenFilterFactory using a
dictionary with the following content:

- lindor
- schlitten
- dorsch
- filet

I want to index the compound words

- dorschfilet
- lindorschlitten

dorschfilet is processed as expected

dorsch filet

but lindorschlitten is compound of

lindor and schlitten

but I get

lindor dorsch schlitten

so the filter is extracting dorsch but the word before (lin) and after
(litten) are not valid word parts.

Is there any better compound word filter for German?

Thanks, Simon


Re: Alphanumeric Wild card search

2015-04-02 Thread Simon Martinelli
Hi,

Have a look at the generated terms to see how they look.

Simon

On Thu, Apr 2, 2015 at 9:43 AM, Palagiri, Jayasankar <
jayashankar.palag...@honeywell.com> wrote:

> Hello Team,
>
> Below is my field type
>
>  positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   
> 
> 
> 
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
> 
>  protected="protwords.txt"/>
> 
>   
>   
> 
>  ignoreCase="true" expand="true"/>
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
> 
>
>
> And my field is
>
> 
>
> I have few docunets in my index
>
> Like 1234-305, 1234-308,1234-318.
>
> When I search Name:"1234-*" I get desired results, but when I search like
> Name:"123-3*" I get 0 results
>
> Can some one help to find what is wrong with my indexing?
>
> When I search
> Thanks and Regards,
> Jaya
>
>


How to trace error records during POST?

2015-04-07 Thread Simon Cheng
Good morning,

I used Solr 4.7 to post 186,745 XML files and 186,622 files have been
indexed. That means there are 123 XML files with errors. How can I trace
what these files are?

Thank you in advance,
Simon Cheng.


Metadata and HTML ending up in searchable text

2016-05-26 Thread Simon Blandford

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. It starts like this:

The JSON entry output of a search result shows the indexed text starting 
with...
body_txt_en: " stream_size 36499 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By"


And then once it gets to the actual text I get CSS class names appearing 
that were in  or  tags etc.
e.g. "the power of calibre3 silence calibre2 and", where 
"calibre3" etc are the CSS class names.


All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in 
searching for.


Steps to reproduce:

1) Solr installed by untaring solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document index using the following command
curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true"; 
-F 
"content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"


5) HTML document index using following command
curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true"; 
-F 
"content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"


6) Query using URL: 
http://localhost:8983/solr/mycore/select?q=especially&wt=json


Result:

For the txt file, I get the following JSON for the document...

{
id: "doc1",
attr_stream_size: [
"8107"
],
attr_x_parsed_by: [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"
],
attr_stream_content_type: [
"text/plain"
],
attr_stream_name: [
"UsingMailingLists.txt"
],
attr_stream_source_info: [
"content/UsingMailingLists.txt"
],
attr_content_encoding: [
"ISO-8859-1"
],
attr_content_type: [
"text/plain; charset=ISO-8859-1"
],
body_txt_en: " stream_size 8107 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By 
org.apache.tika.parser.txt.TXTParser stream_content_type text/plain 
stream_name UsingMailingLists.txt stream_source_info 
content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type 
text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] 
Solr_Wiki Login ** UsingMailingLists ** * FrontPage * 
RecentChanges...etc",

_version_: 1535398235801124900
}

For the HTML file,  I get the following JSON for the document...

{
id: "doc2",
attr_stream_size: [
"20440"
],
attr_x_parsed_by: [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"
],
attr_stream_content_type: [
"text/html"
],
attr_stream_name: [
"UsingMailingLists.html"
],
attr_stream_source_info: [
"content/UsingMailingLists.html"
],
attr_dc_title: [
"UsingMailingLists - Solr Wiki"
],
attr_content_encoding: [
"UTF-8"
],
attr_robots: [
"index,nofollow"
],
attr_title: [
"UsingMailingLists - Solr Wiki"
],
attr_content_type: [
"text/html; charset=utf-8"
],
body_txt_en: " stylesheet text/css utf-8 all 
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen 
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print 
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection 
/wiki/modernized/css/projection.css alternate Solr Wiki: 
UsingMailingLists 
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1 
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw Alternate print Print View 
/solr/UsingMailingLists?action=print Search /solr/FindPage Index 
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting 
stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser 
X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type 
text/html stream_name UsingMailingLists.html stream_source_info...etc",

_version_: 1535398408383103000
}





Re: Metadata and HTML ending up in searchable text

2016-05-27 Thread Simon Blandford

Hi Timothy,

Thanks for responding.

java -jar tika-app-1.13.jar -t 
"/home/user/Documents/library/UsingMailingLists.txt"
...gives a clean result with no CSS or other nasties in the output. So 
it looks like the latest version of tika itself is OK.


I was basing the test case on this doc page as closely as possible, 
including the prefix and content mapping.

https://wiki.apache.org/solr/ExtractingRequestHandler

From the same page, extractFormat=text only applies when extractOnly is 
true, which just shows the output from tika without indexing the 
document. Running it in "extractOnly" mode results in XML output.
The difference between selecting "text" or "xml" format is that the 
escaped document in the  tag is either the original HTML (xml 
mode) or stripped HTML (text mode). It seems some Javascript creeps into 
the text version. (See below)


Regards,
Simon

HTML mode sample:


0name="QTime">51<?xml 
version="1.0" encoding="UTF-8"?>

<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<link
rel="stylesheet" type="text/css" charset="utf-8" 
media="all" href="/wiki/modernized/css/common.css"/>

<link rel="stylesheet" type="text/css" charset="utf-8"
media="screen" href="/wiki/modernized/css/screen.css"/>
<link rel="stylesheet" type="text/css" charset="utf-8"
media="print" href="/wiki/modernized/css/print.css"/>...

TEXT mode (Blank lines stripped):

0name="QTime">47

UsingMailingLists - Solr Wiki
Search:
<!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//-->
Solr Wiki
Login





On 27/05/16 13:31, Allison, Timothy B. wrote:

I'm only minimally familiar with Solr Cell, but...

1) It looks like you aren't setting extractFormat=text.  According to [0]...the 
default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"?  This 
would strip out the attr_ fields so they wouldn't even be indexed...if you don't want 
them.

As for the HTML file, it looks like Tika is failing to strip out the style 
section.  Try running the file alone with tika-app: java -jar tika-app.jar -t 
inputfile.html.  If you are finding the noise there.  Please open an issue on 
our JIRA: https://issues.apache.org/jira/browse/tika


[0] 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika


-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. It starts like,

The JSON entry output of a search result shows the indexed text starting with...
body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser 
X-Parsed-By"

And then once it gets to the actual text I get CSS class names appearing that were in 
 or  tags etc.
e.g. "the power of calibre3 silence calibre2 and", where "calibre3" etc 
are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching 
for.

Steps to reproduce:

1) Solr installed by untaring solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document index using the following command curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true";
-F
"content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document index using following command curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true";
-F
"content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL:
http://localhost:8983/solr/mycore/select?q=especially&wt=json

Result:

For the txt file, I get the following JSON for the document...

{
  id: "doc1",
  attr_stream_size: [
  "8107"
  ],
  attr_x_parsed_by: [
  "org.apache.tika.parser.DefaultParser",
  "org.apache.tika.parser.txt.TXTParser"
  ],
  attr_stream_content_type: [
 

Re: Metadata and HTML ending up in searchable text

2016-05-31 Thread Simon Blandford

Hi Alex,

That sounds similar. I am puzzled by what I am seeing because it looks 
like a major bug and I am following the docs for curl as closely as 
possible, but hardly anyone else seems to have noticed it. To me it is a 
show-stopper.


If I convert the docs to txt with html2text first then I can sort-of 
live with the results, although I'd rather not have the metadata in the 
document, but at least the main text body doesn't have tag content in 
it, as it does with HTML source.


I just want to make sure I'm not missing something really obvious before 
submitting a bug report.


Regards,
Simon


On 27/05/16 20:22, Alexandre Rafalovitch wrote:

I think Solr's layer above Tika was merging in metadata and text all
together without a way (that I could see) to separate them.

That's all I remember of my examination of this issue when I run into
something similar. Not very helpful, I know.

Regards,
Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 27 May 2016 at 23:48, Simon Blandford  wrote:

Hi Timothy,

Thanks for responding.

java -jar tika-app-1.13.jar -t
"/home/user/Documents/library/UsingMailingLists.txt"
...gives a clean result with no CSS or other nasties in the output. So it
looks like the latest version of tika itself is OK.

I was basing the test case on this doc page as closely as possible,
including the prefix and content mapping.
https://wiki.apache.org/solr/ExtractingRequestHandler

 From the same page, extractFormat=text only applies when extractOnly is
true, which just shows the output from tika without indexing the document.
Running it in "extractOnly" mode resulting in a XML output. The difference
between selecting "text" or "xml" format is that the escaped document in the
 tag is either the original HTML (xml mode) or stripped HTML (text
mode). It seems some Javascript creeps into the text version. (See below)

Regards,
Simon

HTML mode sample:


051<?xml
version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<link
 rel="stylesheet" type="text/css" charset="utf-8" media="all"
href="/wiki/modernized/css/common.css"/>
 <link rel="stylesheet" type="text/css" charset="utf-8"
 media="screen" href="/wiki/modernized/css/screen.css"/>
 <link rel="stylesheet" type="text/css" charset="utf-8"
 media="print" href="/wiki/modernized/css/print.css"/>...

TEXT mode (Blank lines stripped):

047
UsingMailingLists - Solr Wiki
Search:
<!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//-->
Solr Wiki
Login






On 27/05/16 13:31, Allison, Timothy B. wrote:

I'm only minimally familiar with Solr Cell, but...

1) It looks like you aren't setting extractFormat=text.  According to
[0]...the default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"?
This would strip out the attr_ fields so they wouldn't even be indexed...if
you don't want them.

As for the HTML file, it looks like Tika is failing to strip out the style
section.  Try running the file alone with tika-app: java -jar tika-app.jar
-t inputfile.html.  If you are finding the noise there.  Please open an
issue on our JIRA: https://issues.apache.org/jira/browse/tika


[0]
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika


-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. It starts like,

The JSON entry output of a search result shows the indexed text starting
with...
body_txt_en: " stream_size 36499 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By"

And then once it gets to the actual text I get CSS class names appearing
that were in  or  tags etc.
e.g. "the power of calibre3 silence calibre2 and", where
"calibre3" etc are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in
searching for.

Steps to reproduce:

1) Solr installed by untaring solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT doc

Re: Metadata and HTML ending up in searchable text

2016-06-01 Thread Simon Blandford

Thanks Timothy,

Will give the DIH a try. I have submitted a bug report.
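For reference, the data-config I plan to start from looks roughly like this
(untested, adapted from the TikaEntityProcessor examples; paths and field
names are the ones from my earlier mails):

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="doc" processor="TikaEntityProcessor"
            url="/home/user/Documents/library/UsingMailingLists.html"
            format="text">
      <field column="text" name="body_txt_en"/>
    </entity>
  </document>
</dataConfig>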

Regards,
Simon

On 31/05/16 13:22, Allison, Timothy B. wrote:

  From the same page, extractFormat=text only applies when extractOnly
is true, which just shows the output from tika without indexing the document.

Y, sorry.  I just looked through the source code.  You're right.  If you use DIH 
(TikaEntityProcessor) instead of Solr Cell (ExtractingDocumentLoader), you should be able to set 
the handler type by setting the "format" attribute, and "text" is one option 
there.


I just want to make sure I'm not missing something really obvious before 
submitting a bug report.

I don't think you are.


  From the same page, extractFormat=text only applies when extractOnly
is true, which just shows the output from tika without indexing the document.
Running it in "extractOnly" mode resulting in a XML output. The
difference between selecting "text" or "xml" format is that the
escaped document in the  tag is either the original HTML
(xml mode) or stripped HTML (text mode). It seems some Javascript
creeps into the text version. (See below)

Regards,
Simon

HTML mode sample:
  051<?xml
version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<link
  rel="stylesheet" type="text/css" charset="utf-8" media="all"
href="/wiki/modernized/css/common.css"/>
  <link rel="stylesheet" type="text/css" charset="utf-8"
  media="screen" href="/wiki/modernized/css/screen.css"/>
  <link rel="stylesheet" type="text/css" charset="utf-8"
  media="print" href="/wiki/modernized/css/print.css"/>...

TEXT mode (Blank lines stripped):

047
UsingMailingLists - Solr Wiki
Search:
<!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none'; var e =
document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//-->
Solr Wiki
Login






On 27/05/16 13:31, Allison, Timothy B. wrote:

I'm only minimally familiar with Solr Cell, but...

1) It looks like you aren't setting extractFormat=text.  According
to [0]...the default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"?
This would strip out the attr_ fields so they wouldn't even be
indexed...if you don't want them.

As for the HTML file, it looks like Tika is failing to strip out the
style section.  Try running the file alone with tika-app: java -jar
tika-app.jar -t inputfile.html.  If you are finding the noise there.
Please open an issue on our JIRA:
https://issues.apache.org/jira/browse/tika


[0]
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with
+Solr+Cell+using+Apache+Tika


-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. It starts like,

The JSON entry output of a search result shows the indexed text
starting with...
body_txt_en: " stream_size 36499 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By"

And then once it gets to the actual text I get CSS class names
appearing that were in  or  tags etc.
e.g. "the power of calibre3 silence calibre2 and", where
"calibre3" etc are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in
searching for.

Steps to reproduce:

1) Solr installed by untaring solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document index using the following command curl
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true";
-F

"content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document index using following command curl
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true";
-F

"content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL:
http://localhost:8983/solr/mycore/select?q=especially&wt=json

Result:

For the txt file, I get the following JSON for the document...

{
id: "doc1",
attr_strea

Re: Metadata and HTML ending up in searchable text

2016-06-02 Thread Simon Blandford
I have investigated different Solr versions. I have found that 4.10.3 is 
the last version that completely strips the HTML to text as expected. 
4.10.4 starts introducing some HTML comments and Javascript and anything 
over 5.0 is full of mangled HTML and attribute artefacts such as 
"X-Parsed-By".


So for now the best solution for me is to just use 4.10.3, although I 
really miss the core and process management.


https://issues.apache.org/jira/browse/SOLR-9178

On 31/05/16 13:22, Allison, Timothy B. wrote:

  From the same page, extractFormat=text only applies when extractOnly
is true, which just shows the output from tika without indexing the document.

Y, sorry.  I just looked through the source code.  You're right.  If you use DIH 
(TikaEntityProcessor) instead of Solr Cell (ExtractingDocumentLoader), you should be able to set 
the handler type by setting the "format" attribute, and "text" is one option 
there.
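
Something along these lines in a DIH data-config.xml is what I have in mind
(paths and field names below are only placeholders, not tested against your
files):

  <dataConfig>
    <dataSource type="BinFileDataSource"/>
    <document>
      <!-- walk the directory and hand each file to Tika -->
      <entity name="files" processor="FileListEntityProcessor"
              baseDir="/home/user/Documents/library" fileName=".*\.(html|txt)"
              rootEntity="false" dataSource="null">
        <field column="fileAbsolutePath" name="id"/>
        <!-- format="text" asks Tika for plain text rather than XHTML -->
        <entity name="tika" processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}" format="text">
          <field column="text" name="body_txt_en"/>
        </entity>
      </entity>
    </document>
  </dataConfig>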


I just want to make sure I'm not missing something really obvious before 
submitting a bug report.

I don't think you are.


  From the same page, extractFormat=text only applies when extractOnly
is true, which just shows the output from tika without indexing the document.
Running it in "extractOnly" mode results in XML output. The
difference between selecting "text" or "xml" format is that the
escaped document in the  tag is either the original HTML
(xml mode) or stripped HTML (text mode). It seems some Javascript
creeps into the text version. (See below)
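
The requests behind the two samples were along these lines (same core and
file as in my original steps; only extractFormat changes between xml and
text):

curl "http://localhost:8983/solr/mycore/update/extract?extractOnly=true&extractFormat=text"
-F
"content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"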

Regards,
Simon

HTML mode sample:
  051<?xml
version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<link
  rel="stylesheet" type="text/css" charset="utf-8" media="all"
href="/wiki/modernized/css/common.css"/>
  <link rel="stylesheet" type="text/css" charset="utf-8"
  media="screen" href="/wiki/modernized/css/screen.css"/>
  <link rel="stylesheet" type="text/css" charset="utf-8"
  media="print" href="/wiki/modernized/css/print.css"/>...

TEXT mode (Blank lines stripped):

047
UsingMailingLists - Solr Wiki
Search:
<!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none'; var e =
document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//-->
Solr Wiki
Login






On 27/05/16 13:31, Allison, Timothy B. wrote:

I'm only minimally familiar with Solr Cell, but...

1) It looks like you aren't setting extractFormat=text.  According
to [0]...the default is xhtml which will include a bunch of the metadata.
2) Is there an attr_* dynamic field in your index with type="ignored"?
This would strip out the attr_ fields so they wouldn't even be
indexed...if you don't want them.

As for the HTML file, it looks like Tika is failing to strip out the
style section.  Try running the file alone with tika-app: java -jar
tika-app.jar -t inputfile.html.  If you find the noise there,
please open an issue on our JIRA:
https://issues.apache.org/jira/browse/tika


[0]
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with
+Solr+Cell+using+Apache+Tika


-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. It starts like this:

The JSON entry output of a search result shows the indexed text
starting with...
body_txt_en: " stream_size 36499 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By"

And then once it gets to the actual text I get CSS class names
appearing that were in  or  tags etc.
e.g. "the power of calibre3 silence calibre2 and", where
"calibre3" etc are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in
searching for.

Steps to reproduce:

1) Solr installed by untaring solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document indexed using the following command: curl
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true";
-F

"content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document indexed using the following command: curl
"http://localhost:8983/sol
