solr atomic updates stored="true", and copyField limitation

2013-09-19 Thread Tanguy Moal
Hello, I'm using solr 4.4. I have a solr core with a schema defining a bunch of different fields, and among them, a date field: - date: indexed and stored // the date used at search time In practice it's a TrieDateField but I think that's not relevant for the concern. It also has a multi

Re: Tokenization at query time

2013-08-12 Thread Tanguy Moal
Hello Andrea, I think you face a rather common issue involving keyword tokenization and query parsing in Lucene: The query parser splits the input query on white spaces, and then each token is analysed according to your configuration. So those queries with a whitespace won't behave as expected be

Re: Solr Suggest

2013-07-03 Thread Tanguy Moal
Hello Adrien, Looking quickly at your schema, I suspect that the suggestions field isn't populated, so the suggester dictionary is empty. How is input sent to that field ? Providing a few sample documents you are indexing could help understand what is going on. If you intended to copy content

Re: Easy question ? docs with empty geodata field

2012-10-19 Thread Tanguy Moal
Hello, Did you try q=-geodata:[* TO *] ? (Note the '-' (minus)) This reads as "documents without any value for field named geodata". Also if you plan to use this intensively, you'd better declare a boolean field telling if geodata are set or not and set a value to each doc, because the -field_nam
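
For readers who want to try the negated range query from the reply above, here is a minimal Python sketch that builds the request URL. The host, port and handler path are assumptions, and `missing_field_query` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode


def missing_field_query(field: str) -> str:
    """Return a query matching documents that have no value for `field`."""
    # The leading '-' negates the open-ended range [* TO *],
    # which on its own matches every document having some value.
    return f"-{field}:[* TO *]"


# Assumed location of the select handler; adjust to your deployment.
base_url = "http://localhost:8983/solr/select"
params = {"q": missing_field_query("geodata"), "rows": 10}
url = f"{base_url}?{urlencode(params)}"
print(url)
```

Letting `urlencode` do the escaping avoids hand-encoding the brackets, colon and spaces.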

Re: Copy Field Question

2012-10-15 Thread Tanguy Moal
Hello, I think you don't have that many tuning possibilities using only the schema.xml file. You will have to write some custom Java code (subclasses of UpdateRequestProcessor and UpdateRequestProcessorFactory), build a Java jar containing your custom code, put that jar in one of the paths declared

Re: How can I create about 100000 independent indexes in Solr?

2012-09-27 Thread Tanguy Moal
to do that... -- Tanguy 2012/9/26 韦震宇 > Hi, Tanguy > I would do as your suggestion. > Best Regards! > Monton > - Original Message - > From: "Tanguy Moal" > To: ; > Sent: Tuesday, September 25, 2012 11:05 PM > Subject: Re: How can I create about

Re: How can I create about 100000 independent indexes in Solr?

2012-09-25 Thread Tanguy Moal
That is an interesting issue... I was wondering if relying on dynamic fields could be an option... Something like : * : * customer : string * *_field_a1 : type_a * *_field_a2 : type_a * *_field_b1 : type_b * ... And then prefix each field with the customer name, so for customer1, indexed documents

Re: how to boosting fisrt substring in a string solr.

2012-09-13 Thread Tanguy Moal
Hi, Did you try issuing a query like : "+Yoga Teacher" (without the double-quotes) ? See http://lucene.apache.org/core/3_6_1/queryparsersyntax.html#Boolean operators for more details on Lucene's query parser syntax. Hope this helps, -- Tanguy 2012/9/13 veena rani > Hi , > > In solr, If i m

Re: terms component search

2012-09-06 Thread Tanguy Moal
Hi Peter, Yes if you want to do complex things in suggest mode, you'd better rely on the SearchComponent... For example, this blog post is a good read http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/ , if you have complex requirements on the searched fields. (Although y

Re: Document Processing

2012-09-06 Thread Tanguy Moal
If your interest is focusing on the real textual content of a web page, you could try this : JReadability (https://github.com/ifesdjeen/jReadability , Apache 2.0 license), which wraps JSoup (as Lance suggested) and applies a set of predefined rules to scrap crap (nav, headers, footers, ...) off of

Re: Faceting Facets

2012-09-03 Thread Tanguy Moal
I think it's not possible to combine pivots with facet queries, nor with facet ranges (or facet dates), please someone correct me if I'm wrong... I think only "standard" fields are "pivotable" :) That said, if you always use the same ranges for your DateTime field, you *could* have a "string" ver

Re: Hierarchical faceting and filter query exclusions

2012-08-30 Thread Tanguy Moal
You are correct, it doesn't work : Queries like : http://localhost:8983/solr/collection1/select?q=*:*&facet=on&facet.pivot={!ex=a_tag}field1,field2&facet.limit=5&rows=0&fq={!tag=a_tag}field3:"filter"; result in the following response : 400 1 on *:* 5 {!ex=a_tag}field1,field

Re: Querying top n of each facet value

2012-08-23 Thread Tanguy Moal
Hello Kiran, I think you can try turning grouping on (group "on") and asking solr to group on the "Category" field. Nevertheless, this will *not* ensure that groups are returned in facet count order. Nor will it ensure the mincount per group. Hope this helps, -- Tanguy 201

Re: termFrequncy off and still use fastvector highlighter?

2012-08-07 Thread Tanguy Moal
code into a jar and make that jar accessible to solr, see http://wiki.apache.org/solr/SolrPlugins for how to plug your custom code into Solr. The main drawback of that approach is that it will be activated for all queries and all fields... -- Tanguy 2012/8/7 Tanguy Moal > May be it wasn

Re: termFrequncy off and still use fastvector highlighter?

2012-08-07 Thread Tanguy Moal
Maybe it wasn't clear in my response, sorry! You can use a different field for searching (qf parameter for dismax) than the one for highlighting (hl.fl) : q="a phrase query"&qf="text_without_termFreqs"&hl=on&hl.fl="text_with_termFreqs". Scoring will be based on qf's fields only (i.e. those withou

Re: Stemming questions

2012-08-07 Thread Tanguy Moal
Dear Alexander, A few questions on stemming support in Solr 3.6.1: > - Can you do non-English stemming? > With solr, many languages are supported, see http://wiki.apache.org/solr/LanguageAnalysis - We're using solr.PorterStemFilterFactory on the "text_en" field type. We > will index a ton of PD

Re: termFrequncy off and still use fastvector highlighter?

2012-08-02 Thread Tanguy Moal
I think you could use a field without the term frequencies for searching; that will solve your relevancy issues. You can then have the exact same content in another field (using a copyField directive in your schema), with term frequencies and positions turned on, and use this particular field for h

Re: SOLR 3.4 GeoSpatial Query Returning distance

2012-08-02 Thread Tanguy Moal
Hi, I've not tested it myself but I think you can take advantage of Solr 4's pseudo fields, by adding something like : &fl=*,geodist(),score I think you could even pass several geodist() calls with different parameters if you want to have the distance wrt several POIs ^-^ SOLR 4 only. -- Ta

Re: Building a heat map from geo data in index

2012-06-11 Thread Tanguy Moal
> problem. > > On Mon, Jun 11, 2012 at 10:55 AM, Tanguy Moal > wrote: > > There is definitely something interesting to do around geohashes. > > > > I'm wondering how one could map the N by N requested tiles to a > range > > of geohashes. (Where the ga

Re: Building a heat map from geo data in index

2012-06-11 Thread Tanguy Moal
There is definitely something interesting to do around geohashes. I'm wondering how one could map the N by N requested tiles to a range of geohashes. (Where the gap would be a function of N). What I mean is that I don't know if a bijective function exists between tiles and geohash rang

Re: edismax and untokenized field

2012-06-11 Thread Tanguy Moal
Hello, I think you have to issue a phrase query in such a case because otherwise each "token" is searched independently in the merchant field : the query parser splits the query on spaces! Check the difference between debug outputs when you search for "Jones New York", you'd get what you expected.

Re: solr tokenizer not splitting unbreakable expressions

2012-05-22 Thread Tanguy Moal
Hello Elisabeth, Wouldn't it be simpler to have a custom component inside of the front-end to your search server that would transform a query like <> into <<"hotel de ville" paris>> (I.e. turning each occurrence of the sequence "hotel de ville" into a phrase query ) ? Concerning protections in

Re: Strategy for maintaining De-normalized indexes

2012-05-22 Thread Tanguy Moal
It all depends on the frequency at which you refresh your data, on your deployment (master/slave setup), ... Many things need to be taken into account! Did you face any performance issue while building your index? If you didn't, rebuilding it shouldn't be more problematic. -- Tanguy 2012/5/22 So

Re: Strategy for maintaining De-normalized indexes

2012-05-22 Thread Tanguy Moal
Hello, Can't the ID (uniqueKey) of the indexed documents (i.e. denormalized data) be a combination of the master product id and the child product id ? Therefore whenever you update your master product db entry, you simply need to reindex documents depending on the master product entry. You can ev
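
The composite uniqueKey idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `_` separator and the helper names `composite_id` and `ids_to_reindex` are hypothetical, and the separator must not occur in the master ids themselves.

```python
def composite_id(master_id: str, child_id: str, sep: str = "_") -> str:
    """Build a uniqueKey value from a master product id and a child product id."""
    return f"{master_id}{sep}{child_id}"


def ids_to_reindex(master_id: str, child_ids: list) -> list:
    """List the uniqueKey values to reindex when one master product changes."""
    return [composite_id(master_id, child_id) for child_id in child_ids]


# Example with made-up ids: master product "M42" has two child products.
print(ids_to_reindex("M42", ["a", "b"]))
```

Because the master id is the prefix of every key, all denormalized documents that depend on one master entry can be enumerated (or matched by prefix) when that entry changes.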

Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

2012-05-16 Thread Tanguy Moal
d > FrenchLightStemmer is the only one of them that does this arbitrary > duplicate sequence compression. (FinnishLightStemmer does repetition > compression too, but restricts the operation to chars 'k', 'p', and 't'.) > > Thanks, > Steve > >

Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

2012-05-16 Thread Tanguy Moal
Any idea, anyone ? I think this is important since this could produce weird results on collections with numbers mixed in text. From my understanding, there are a few options to address the issue : 1) Make *LightStemmer token type aware and don't try to stem on things that are not text (alpha/alp

Re: need help with getting exact matches to score higher

2012-05-15 Thread Tanguy Moal
Hello, From the response you pasted here, it looks like the field "itemNoExactMatchStr" never matched. Can you try matching in that field only and ensure you have matches ? Given the ^30 boost, you should have high scores on this field... Hope this helps, -- Tanguy 2012/5/15 geeky2 > Hello a

FrenchLightStemFilterFactory : normalizing tokens made of a single character repeated more than 5 times

2012-05-09 Thread Tanguy Moal
Dear list, I recently figured out that the FrenchLightStemFilterFactory performs some interestingly undocumented normalization on tokens... There's a norm() helper called for each produced token that performs, amongst other things, deletions on repeated characters... Only for tokens with mor

Re: Disseminate results from different sources

2012-03-21 Thread Tanguy Moal
Hello Franck, I've had the same issue in the past. I addressed that by adding a random value to each document. I use this value in the "bf" parameter, so that the random value alters more or less the documents' score. This results in a natural shuffling of documents which had the same score

Re: utf8 encoding for solr not working

2012-03-16 Thread Tanguy Moal
I think you're using PHP to request solr. You can ask solr to respond in several different formats (xml, json, php, ...), see http://wiki.apache.org/solr/QueryResponseWriter . Depending on how you connect to solr from php, you may want to use html_entity_decode before using mb_substr. -- Ta

Re: Query results

2012-03-16 Thread Tanguy Moal
That's because of the space. If you want to include the space in the search query (performing exact match), then use double quotes around your search terms : q=multiplex_name:"Agent Vinod" Online documentation : * http://wiki.apache.org/solr/SolrQuerySyntax * http://lucene.apache.org/core/ol
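
As a small illustration of the double-quote advice above, the following Python sketch wraps a value in a phrase query. `phrase_query` is a hypothetical helper; it escapes only backslashes and double quotes, which is an assumption about what is sufficient inside a quoted phrase.

```python
def phrase_query(field: str, phrase: str) -> str:
    """Return field:"phrase" so the whole phrase, spaces included, is matched."""
    # Escape backslashes first, then embedded double quotes.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


query = phrase_query("multiplex_name", "Agent Vinod")
print(query)
```

Without the quotes, the query parser splits on the space and only the first term is searched in the named field.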

Re: exact match with id field (represented as url) in solr 3.5

2012-03-16 Thread Tanguy Moal
Hello Roberto, Exact match needs extra " (double-quotes) surrounding the exact thing you want to query in the id field. Give a try to a query like this : id:"http://127.0.0.1:/my/personal/testuser/Personal Documents/cal9.pdf" See this wiki page :

Re: Can solr-langid(Solr3.5.0) detect multiple languages in one text?

2012-03-13 Thread Tanguy Moal
Hi all, I think that depending on the language detector implementation, things may vary... For Tika, it performs better with longer inputs than shorter ones (as it seems to depend on the probabilistic distribution of ngrams -- of different sizes -- to perform distance computations with precomput

Re: docBoost with "fq" search

2012-03-09 Thread Tanguy Moal
Hi Gian Marco, I don't know if it's possible to exploit documents' boost values from function queries (see http://wiki.apache.org/solr/FunctionQuery), but if you store your boost in a searchable numeric field, you could either : do q=*:* AND _val_:"your_boost_field" if you're using default

Re: indexing cpu utilization

2012-03-08 Thread Tanguy Moal
How are you sending documents to solr ? If you push solr input documents via HTTP (which is what SolrJ does), you could increase CPU consumption (and therefore reduce indexing time) by sending your update requests asynchronously, using multiple updating threads, to your single solr core. Some
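
A rough Python sketch of the "multiple updating threads" idea above, using only the standard library. Everything specific here is an assumption: the update URL, the JSON payload shape accepted by your Solr version, the batch size and the worker count. `post_batch` and `index_concurrently` are hypothetical names, not part of SolrJ or any Solr client.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size):
    """Yield consecutive slices of `docs` holding at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_batch(batch, url="http://localhost:8983/solr/update/json"):
    """Send one batch over HTTP. The URL and payload format are assumptions."""
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_concurrently(docs, send=post_batch, batch_size=500, workers=4):
    """Send batches from several threads; results come back in batch order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches(docs, batch_size)))
```

Passing the sender as a parameter makes it easy to swap in a real client or a stub; a commit would still have to be issued once all batches are in.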

SpatialSearch, geofilt and documents missing a value in sfield

2012-01-11 Thread Tanguy Moal
Dear ML, I'm performing some developments relying on spatial capabilities of solr. I'm using Solr 3.5, have been reading http://wiki.apache.org/solr/SpatialSearch#Spatial_Query_Parameters and have the basic behaviours I wanted working. I use geofilt on a latlong field, with geodist() in the b

Re: Optional filter queries

2012-01-04 Thread Tanguy Moal
get the results-set you expected. You might then want to sort on that field, and this time my previous answer could help ;-). Sorry for confusing you! On 04/01/2012 09:32, Tanguy Moal wrote: Hello, If the number stored is not in a string field, you will need solr >= 3.5 to perf

Re: Optional filter queries

2012-01-04 Thread Tanguy Moal
Hello, If the number stored is not in a string field, you will need solr >= 3.5 to perform what you want. Since solr 3.5 it's possible to set the attribute sortMissingLast or sortMissingFirst to true, within the field definition (an example is available in the schema.xml provided with solr 3

Re: solr keep old docs

2011-12-28 Thread Tanguy Moal
Hello Alexander, I don't know much about your requirements in terms of size and performance, but I've had a similar use case and found a pretty simple workaround. If your duplicate rate is not too high, you can have the SignatureProcessor generate a fingerprint of documents (you already did

Was:Re: hl.boundaryScanner and hl.bs.chars [off topic]

2011-12-28 Thread Tanguy Moal
Dear list, I'd like to bounce on that issue... IMHO, configuration parsing could be a little bit stricter... At least, what stands for a "severe" configuration error could be user-defined. Let me give some examples that are common errors and that don't trigger the "abortOnConfigurationError"

Re: Solr 3.5 | Highlighting

2011-12-22 Thread Tanguy Moal
On 21/12/2011 23:49, Koji Sekiguchi wrote: (11/12/21 22:28), Tanguy Moal wrote: Dear all, [...] I tried using both legacy highlighter and FVH but the same issue occurs. The issue only triggers when relying on hl.q. Thank you very much for any help, -- Tanguy Tanguy, Thank you for

Re: Solr - Mutivalue field search on different elements

2011-12-21 Thread Tanguy Moal
Hello, I think that the positionIncrementGap attribute of your field has to be changed to 0 (instead of 100 by default). (See http://lucene.472066.n3.nabble.com/positionIncrementGap-in-schema-xml-td488338.html ) Hope this helps, -- Tanguy On 21/12/2011 15:39, meghana wrote: Hi all, i

Solr 3.5 | Highlighting

2011-12-21 Thread Tanguy Moal
Dear all, I'm trying to get highlighting working, and I'm almost done, but that's not perfect yet... Basically my documents have a title and a description. I have two kinds of text fields : text : generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange

Re: Fwd: Reload core

2011-12-08 Thread Tanguy Moal
Hello, Usually, when such an error occurs, there are some good hints of what's wrong with your new configuration in solr logs. Depending on how you set up your solr instance and configured logging for solr (http://wiki.apache.org/solr/SolrLogging), log files may be located at different places.

Re: &(fq=field1:val1 AND field2:val2) VS &fq=field1:val1&fq=field2:val2 and filterCache

2011-12-01 Thread Tanguy Moal
Hello, Quoting http://wiki.apache.org/solr/SolrCaching#filterCache : The filter cache stores the results of any filter queries ("fq" parameters) that Solr is explicitly asked to execute. (Each filter is executed and cached separately. When it's time to use them to limit the number of results re
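
To make the two request shapes from the thread title concrete, here is a Python sketch that builds both query strings. The helper names are hypothetical. Per the wiki passage quoted above, each "fq" parameter is executed and cached separately, so the first form should yield two independent filterCache entries and the second a single entry for the conjunction.

```python
from urllib.parse import urlencode


def separate_filters(query, filters):
    """One fq parameter per filter: each one is cached on its own."""
    return urlencode([("q", query)] + [("fq", f) for f in filters])


def combined_filter(query, filters):
    """A single fq joining all filters with AND: cached as one entry."""
    return urlencode([("q", query), ("fq", " AND ".join(filters))])


filters = ["field1:val1", "field2:val2"]
print(separate_filters("*:*", filters))
print(combined_filter("*:*", filters))
```

Separate filters tend to be reused more often across requests, since a later query that shares only one of them can still hit the cache.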

Core reload vs servlet container restart

2011-11-10 Thread Tanguy Moal
Dear list, I've experienced a weird (unexpected?) behaviour concerning core reload on a master instance. My setup : master/slave on separate hosts. On the master, I update the schema.xml file, adding a dynamic field of type random sort field. I reload the master using core admin. The new f

Re: best way for sum of fields

2011-11-07 Thread Tanguy Moal
Hi again, Since you have a custom high availability solution over your solr instances, I can't help much I guess... :-) I usually rely on master/slave replication to separate index build and index search processes. The fact is that resource consumption at build time and search time are not

Re: best way for sum of fields

2011-11-07 Thread Tanguy Moal
Hi, If you only need to sum over "displayed" results, go with the post-processing of hits solution, that's fast and easy. If you sum over the whole data set (i.e. your sum is not query-dependent), have it computed at indexing time, depending on your indexing workflow. Otherwise, (sum over the
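
The first option above (post-processing the displayed hits) could look like this in Python. The response dictionary is made-up sample data shaped like a JSON response from Solr, and the field name `price` is an assumption.

```python
def sum_field(docs, field):
    """Sum a numeric field over the returned hits, skipping docs without it."""
    return sum(doc.get(field, 0) for doc in docs)


# Made-up sample mimicking the "response" section of a JSON result page.
response = {
    "response": {
        "docs": [
            {"id": "1", "price": 10.0},
            {"id": "2", "price": 2.5},
            {"id": "3"},  # no value for the field
        ]
    }
}

total = sum_field(response["response"]["docs"], "price")
print(total)
```

This only covers the page of results actually fetched, which is exactly the "displayed results" case.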

Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-06-01 Thread Tanguy Moal
overwritedupes to false and set the signature key to be the id. That way solr will manage updates? from the wiki http://wiki.apache.org/solr/Deduplication HTH Lee On 30 May 2011 08:32, Tanguy Moal wrote: Hello, Sorry for re-posting this but it seems my message got lost in the mailin

Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-30 Thread Tanguy Moal
yone a few hints on how to optimize the handling of index time deduplication ? More details on my setup and the state of my understanding are in my previous message here-after. Thank you very much in advance. Regards, Tanguy On 05/25/11 15:35, Tanguy Moal wrote: Dear list, I'm posting

Re: how can i index data in different documents

2011-05-26 Thread Tanguy Moal
Hi Romi, A simple way to do so is to define in your schema.xml the union of all the columns you need plus a "type" field to distinguish your entities. eg, In your DB table1 : - col1 : varchar - col2 : int - col3 : float table2 : - col1 : int - col2 : varchar - col3 : int - col4 : varchar in

Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-25 Thread Tanguy Moal
Dear list, I'm posting here after some unsuccessful investigations. In my setup I push documents to Solr using the StreamingUpdateSolrServer. I'm sending a comfortable initial amount of documents (~250M) and wished to perform overwriting of duplicated documents at index time, during the update

Re: Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread Tanguy Moal
Hello, Have you tried reading : http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function From that page I would try something like : http://host:port/solr/select?q=sony&sort=min(min(priceCash,priceCreditCard),priceCoupon)+asc&rows=10&indent=on&debugQuery=on Is that of any help ? -- Tanguy
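
Since min() is nested pairwise in the URL above, a small Python helper can generate the expression for any number of price fields. `nested_min` is a hypothetical helper and the base URL is an assumption; the field names are the ones from the reply.

```python
from urllib.parse import urlencode


def nested_min(fields):
    """Fold field names into nested two-argument min() calls."""
    expression = fields[0]
    for field in fields[1:]:
        expression = f"min({expression},{field})"
    return expression


sort_expression = nested_min(["priceCash", "priceCreditCard", "priceCoupon"])
params = {"q": "sony", "sort": f"{sort_expression} asc", "rows": 10}
# Assumed host and handler path; adjust to your deployment.
print("http://localhost:8983/solr/select?" + urlencode(params))
```

For the three fields above this produces the same min(min(priceCash,priceCreditCard),priceCoupon) expression as in the reply.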

Re: Are there any restrictions on what kind of how many fields you can use in Pivot Query? I get ClassCastException when I use some of my string fields, and don't when I use some other sting fields

2011-02-16 Thread Tanguy Moal
Hello Ravish, Erick, I'm facing the same issue with solr-trunk (as of r1071282) - Field configuration : positionIncrementGap="100"> - Schema configuration : In my test index, I have documents with sparse values : Some documents may or may not have a value for f1, f2 and/or f3 The

Re: Tuning StatsComponent

2011-01-10 Thread Tanguy Moal
Hello, You could try taking advantage of Solr's faceting feature : provided that you have the amount stored in the amount field and the currency stored in the currency field, try the following request : http://host:port/solr/select?q=YOUR_QUERY&stats=on&stats.field=amount&f.amount.stats.facet

Re: PHPSolrClient

2010-12-16 Thread Tanguy Moal
Hi Dennis, Not particular to the client you use (solr-php-client) for sending documents, think of update as an overwrite. This means that if you update a particular document, the previous version indexed is lost. Therefore, when updating a document, make sure that all the fields to be indexed and

Re: Google like search

2010-12-14 Thread Tanguy Moal
To do so, you have several possibilities, I don't know if there is a best one. It depends pretty much on the format of the input file(s), your affinities with a given programming language, some libraries you might need and the time you're ready to spend on this task. Consider having a look at SolrJ

Re: Google like search

2010-12-14 Thread Tanguy Moal
Satya, In fact the highlighter will select the relevant part of the whole text and return it with the matched terms highlighted. If you do so for a whole book, you will face the issue spotted by Dave (too long text). To address that issue, you have the possibility to split your book in chapters,

Re: Google like search

2010-12-14 Thread Tanguy Moal
Hi Satya, I think what you're looking for is called "highlighting" in the sense of "highlighting" the query terms in their matching context. You could start by googling "solr highlight", surely the first results will make sense. Solr's wiki results are usually a good entry point : http://wiki.apa

Re: Autosuggest terms which GOOGLE uses?

2010-12-08 Thread Tanguy Moal
Kind of : their suggestions are based on users queries with some filtering. You can have a little read there : http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=106230 They perform "little" filtering to remove offending content such as "hate speech, violence and pornography" (quot

Re: [Wildcard query] Weird behaviour

2010-12-03 Thread Tanguy Moal
this would be to see if you can index > your content in a way to avoid these expensive queries. But this is > just a suggestion, what you are doing should still work fine. > > On Fri, Dec 3, 2010 at 6:56 AM, Robert Muir wrote: >> On Fri, Dec 3, 2010 at 6:28 AM, Tanguy Mo

Re: "Virtual field", Statistics

2010-10-18 Thread Tanguy Moal
rskog : > Please add a JIRA issue requesting this. A bunch of things are not > supported for functions: returning as a field value, for example. > > On Thu, Oct 14, 2010 at 8:31 AM, Tanguy Moal wrote: >> Dear solr-user folks, >> >> I would like to use the stats modu

"Virtual field", Statistics

2010-10-14 Thread Tanguy Moal
Dear solr-user folks, I would like to use the stats module to perform very basic statistics (mean, min and max) which is actually working just fine. Nevertheless I found a little limitation that bothers me a tiny bit : how to perform the exact same statistics, but on the result of a function que