Re: Extracting excerpt from solr
Nevermind, I found a solution. I created an excerpt field in the schema.xml, then I used the copyField method with the maxChars parameter declared to copy the content into it with a limitation of the amount of characters that I wanted. Thanks anyways. -- View this message in context: http://lucene.472066.n3.nabble.com/Extracting-excerpt-from-solr-tp4049067p4049358.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr indexing binary files
Hi, I am new with Solr and I am extracting metadata from binary files through URLs stored in my database. I would like to know what fields are available for indexing from PDFs (the ones that would be initiated as in column=””). For example how would I extract something like file size, format or file type. I would also like to know how to create customized fields in Solr. How those metadata and text content are mapped into Solr schema? Would I have to declare that in the solrconfig.xml or do some more tweaking somewhere else? If someone has a code snippet that could show me it would be greatly appreciated. Thank you in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Hi Jack, thanks a lot for your reply. I did that . However, when I run Solr it gives me a bunch of errors. It actually displays the content of my files on my command line and shows some logs like this: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:468) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:350) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70) at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:234) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468) 15-Mar-2013 9:56:29 AM org.apache.solr.handler.dataimport.DocBuilder execute I do have an uniqueKey though. Any ideas what the problem might be? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4047690.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Hi Gora, thank you for your reply. I am not using any commands, I just go on the Solr dashboard, db > Dataimport and execute a full-import. *My schema.xml looks like this:* * * *My db-data-config.xml looks like this:* *In my solrconfig.xml I have this:* db-data-config.xml true metadata_ last_modified text size initials name subject company title comments words last_modified_by true Thank you for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4047702.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Sorry, Gora. It is ${fileSourcePaths.urlpath} actually. *My complete schema.xml is this:* id text -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4047778.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Hi Gora, Yes, my urlpath points to an url like that. I do not get why uncommenting the catch all dynamic field ("*") does not work for me. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4048542.html Sent from the Solr - User mailing list archive at Nabble.com.
Extracting excerpt from solr
Hi, I am using solr to index data from binary files using BinURLDataSource. I was wondering if anyone knows how to extract an excerpt of the indexed data during search. For example if someone made a search it would return 200 characters as a preview of the whole text content. I read online that hl would do the trick. I tried it even though I am not as interested in highlighting as I am in pulling the excerpt. However, so far I have not been able to make it work. I added a /browse requestHandler to my solrconfig.xml like this: explicit true [HIGHLIGHT] [/HIGHLIGHT] text title true colored 3 70 true 0.5 [-\w ,/\n\"']{20,200} I tried it in other requestHandlers as well without any success. Does anyone have some hints? Thanks In advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Extracting-excerpt-from-solr-tp4049067.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr incorrectly fetching elements that do not match conditions in where as all-null rows
Hello, Solr is trying to process non-existing child/nested entities. By non-existing I mean that they exist in DB but should not be at Solr side because they don't match the conditions in the query I use to fetch them. I have the below solr data configuration. The relationship between tables is complicated, but the point is that I need to fetch child/nested entities and perform some calculations at query time. My problem is that some products have onSite services that are not enabled. I would expect Solr from ignoring those elements because of the conditions in the query. If I turn debug on when importing, I can see that all fields are null. However, Solr still tries to process them, which results in invalid SQL queries because it replaces null fields with nothing. The problem seems to be related to the condition s.enabled=true in the query, because are rows with enabled=false that are causing problems (Solr interprets them as rows with all fields null). I get an invalid SQL query SELECT CONCAT( * (1 - percentage), ',', 'USD') AS fullReducedOnSitePrice FROM discounts WHERE companyId=65. How can I force Solr to ignore, as it should, those elements?
Re: Solr incorrectly fetching elements that do not match conditions in where as all-null rows
Thanks for the promp reply. h.enabled=true is a typo. It should be c.enabled=true, because the table companies also has a column called enabled. That part is working fine (it doesn't fetch companies with enabled=false). About the DB queries, I've taken, by turning Debug and Verbose on in the Dataimport tab, the queries that Solr is sending to DB, executed the same queries in my MySQL client. It clearly says '0 row(s) returned'. 2016-08-15 15:37 GMT+02:00 Alexandre Rafalovitch : > Solr (well DIH) just passes that query to the DB, so if you are > getting extra rows (not extra fields), than I would focus on the > database side of the situation. > > Specifically, I would confirm from the database logs what the sent > query actually looks like. > > Very specifically, in your very first entity, I see the condition > "h.enabled=true" where "h" does not match the table names in the FROM > statement. Perhaps, that's the problem? > > Regards, >Alex. > > Newsletter and resources for Solr beginners and intermediates: > http://www.solr-start.com/ > > > On 15 August 2016 at 23:27, Luis Sepúlveda wrote: > > Hello, > > > > Solr is trying to process non-existing child/nested entities. By > > non-existing I mean that they exist in DB but should not be at Solr side > > because they don't match the conditions in the query I use to fetch them. > > > > I have the below solr data configuration. The relationship between tables > > is complicated, but the point is that I need to fetch child/nested > entities > > and perform some calculations at query time. My problem is that some > > products have onSite services that are not enabled. I would expect Solr > > from ignoring those elements because of the conditions in the query. If I > > turn debug on when importing, I can see that all fields are null. > However, > > Solr still tries to process them, which results in invalid SQL queries > > because it replaces null fields with nothing. > > > > > > > > > query="SELECT s.serviceType, sl.language FROM services s > > LEFT JOIN serviceLanguages sl ON s.id=sl.serviceId WHERE > > companyId=${product.companyId} AND s.enabled=true"> > > > > > > > > > > > query="SELECT s.id, s.enabled, ${product.unitPrice} + > > (hourlyPrice * MIN(hours)) AS onSitePriceRaw, > CONCAT(${product.unitPrice} + > > (hourlyPrice * MIN(hours)), ',', '${product.currency}') AS onSitePrice > FROM > > services s LEFT JOIN serviceHourlyPrices shp ON s.id=shp.serviceId WHERE > > companyId=${product.companyId} AND s.enabled=true AND > s.serviceType='OS'"> > > > > > query="SELECT CONCAT(${onSite.onSitePriceRaw} * (1 - > > percentage), ',', '${product.currency}') AS fullReducedOnSitePrice FROM > > discounts WHERE companyId=${product.companyId} AND category='FULL'"> > > > column="fullReducedOnSitePrice"/> > > > > > query="SELECT CONCAT(${onSite.onSitePriceRaw} * (1 - > > percentage), ',', '${product.currency}') AS partialReducedOnSitePrice > FROM > > discounts WHERE companyId=${product.companyId} AND category='PARTIAL'"> > > > column="partialReducedOnSitePrice"/> > > > > > > > > > > The problem seems to be related to the condition s.enabled=true in the > > query, because are rows with enabled=false that are causing problems > (Solr > > interprets them as rows with all fields null). I get an invalid SQL query > > SELECT CONCAT( * (1 - percentage), ',', 'USD') AS fullReducedOnSitePrice > > FROM discounts WHERE companyId=65. > > > > How can I force Solr to ignore, as it should, those elements? >
Re: Solr incorrectly fetching elements that do not match conditions in where as all-null rows
I'm very sorry, but you're right. Using one of the queries from the query log, I get a 1 row(s) returned. So it itsn't a Solr issue. Thanks a lot Alexandre. 2016-08-15 16:17 GMT+02:00 Alexandre Rafalovitch : > Hmm. I would still take as truth the database logs as opposed to Solr > logs. Or at least network traces using something like Wireshark. > > Otherwise, you need some way to reduce your DIH query to the minimum > reproducible example. I am used to reading tech support emails and > even then I am not sure I can parse the significant configuration > aspects from the multiple parallel and nested entities. Can you reduce > this to the simplest (two level?) entity definition with a single > field and explain what you expected and what you are seeing. > > Regards, >Alex. > P.s. Solr DIH does have a gotcha with SQL import that it automagically > tries to match table column names to fields defined in schema and > populate them even if not explicitly declared. This does not match to > the way you describe the problem (your select statement still needs to > return those fields), but perhaps it interacts with something else to > trigger it. > > Newsletter and resources for Solr beginners and intermediates: > http://www.solr-start.com/ > > > On 15 August 2016 at 23:54, Luis Sepúlveda wrote: > > Thanks for the promp reply. > > > > h.enabled=true is a typo. It should be c.enabled=true, because the table > > companies also has a column called enabled. That part is working fine (it > > doesn't fetch companies with enabled=false). > > > > About the DB queries, I've taken, by turning Debug and Verbose on in the > > Dataimport tab, the queries that Solr is sending to DB, executed the same > > queries in my MySQL client. It clearly says '0 row(s) returned'. > > > > 2016-08-15 15:37 GMT+02:00 Alexandre Rafalovitch : > > > >> Solr (well DIH) just passes that query to the DB, so if you are > >> getting extra rows (not extra fields), than I would focus on the > >> database side of the situation. > >> > >> Specifically, I would confirm from the database logs what the sent > >> query actually looks like. > >> > >> Very specifically, in your very first entity, I see the condition > >> "h.enabled=true" where "h" does not match the table names in the FROM > >> statement. Perhaps, that's the problem? > >> > >> Regards, > >>Alex. > >> > >> Newsletter and resources for Solr beginners and intermediates: > >> http://www.solr-start.com/ > >> > >> > >> On 15 August 2016 at 23:27, Luis Sepúlveda wrote: > >> > Hello, > >> > > >> > Solr is trying to process non-existing child/nested entities. By > >> > non-existing I mean that they exist in DB but should not be at Solr > side > >> > because they don't match the conditions in the query I use to fetch > them. > >> > > >> > I have the below solr data configuration. The relationship between > tables > >> > is complicated, but the point is that I need to fetch child/nested > >> entities > >> > and perform some calculations at query time. My problem is that some > >> > products have onSite services that are not enabled. I would expect > Solr > >> > from ignoring those elements because of the conditions in the query. > If I > >> > turn debug on when importing, I can see that all fields are null. > >> However, > >> > Solr still tries to process them, which results in invalid SQL queries > >> > because it replaces null fields with nothing. 
> >> > > >> > > >> > > >> > >> > query="SELECT s.serviceType, sl.language FROM > services s > >> > LEFT JOIN serviceLanguages sl ON s.id=sl.serviceId WHERE > >> > companyId=${product.companyId} AND s.enabled=true"> > >> > > >> > > >> > > >> > > >> > >> > query="SELECT s.id, s.enabled, ${product.unitPrice} + > >> > (hourlyPrice * MIN(hours)) AS onSitePriceRaw, > >> CONCAT(${product.unitPrice} + > >> > (hourlyPrice * MIN(hours)), ',', '${product.currency}') AS onSitePrice > >> FROM > >> > services s LEFT JOIN serviceHourlyPrices shp ON s.id=shp.serviceId > WHERE > >> > companyId=${product.companyId} AND s.enabled=true AND >
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
Hi Salman, I was interested in something similar, take a look at the following thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCADSoL-i04aYrsOo2%3DGcaFqsQ3mViF%2Bhn24ArDtT%3D7kpALtVHzA%40mail.gmail.com%3E#archives I never followed through, however. -Luis On Mon, Mar 31, 2014 at 6:24 AM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Anyone? > > > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > With reference to this thread< > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E>I > wanted to know if there was any response to that or if Chris Harris > > himself can comment on what he ended up doing, that would be great! > > > > > > -- > > Regards, > > > > Salman Akram > > > > > > > -- > Regards, > > Salman Akram >
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
I got responses, but no easy solution to allow me to directly cancel a request. The responses did point to: - timeAllowed query parameter that returns partial results - https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter - A possible hack that I never followed through - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCANGii8eaSouePGxa7JfvOBhrnJUL++Ct4rQha2pxMefvaWhH=g...@mail.gmail.com%3E Maybe one of those will help you? If they do, make sure to report back! -Luis On Tue, Apr 1, 2014 at 3:13 AM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > So you too never got any response... > > > On Mon, Mar 31, 2014 at 6:57 PM, Luis Lebolo > wrote: > > > Hi Salman, > > > > I was interested in something similar, take a look at the following > thread: > > > > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCADSoL-i04aYrsOo2%3DGcaFqsQ3mViF%2Bhn24ArDtT%3D7kpALtVHzA%40mail.gmail.com%3E#archives > > > > I never followed through, however. > > > > -Luis > > > > > > On Mon, Mar 31, 2014 at 6:24 AM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > Anyone? > > > > > > > > > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram < > > > salman.ak...@northbaysolutions.net> wrote: > > > > > > > With reference to this thread< > > > > > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E > > >I > > > wanted to know if there was any response to that or if Chris Harris > > > > himself can comment on what he ended up doing, that would be great! > > > > > > > > > > > > -- > > > > Regards, > > > > > > > > Salman Akram > > > > > > > > > > > > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > > > -- > Regards, > > Salman Akram >
Re: Replication: slow first query after replication.
Hello, Shawn! I have seen that when disabling replication and executing queries the time responses are good. Interesting... I can't ser the solution, then, because slow replication tomes are needed to almost always get 'fresh' documents in slaves to search by, but this appareantly slows down first queries launched because of caches warm up. There must be a solution for this scenario - I think that it should be very common. Do you think that disabling caches will improve this? Thanks a lot! - Luis Cappa > El 05/11/2013, a las 23:29, Shawn Heisey escribió: > >> On 11/5/2013 10:16 AM, Luis Cappa Banda wrote: >> I have a master-slave replication (Solr 4.1 version) with a 30 seconds >> polling interval and continuously new documents are indexed, so after 30 >> seconds always new data must be replicated. My test index is not huge: just >> 5M documents. >> >> I have experimented that a simple "q=*:*" query appears to be very slow (up >> to 10 secs of QTime). After that first slow query the following "q=*:*" >> queries are much quicker. I feel that warming up caches after replication >> has something to say about this weird behavior, but maybe an index re-built >> is also involved. >> >> Question time: >> >> *1.* How can I warm up caches against? There exists any solrconfig.xml >> searcher to configure to be executed after replication events? >> >> *2. *My system needs to execute queries to the slaves continuously. If >> there exists any warm up way to reload caches, some queries will experience >> slow response times until reload has finished, isn't it? >> >> *3. *After a replication has done, does Solr execute any index rebuild >> operation that slow down query responses, or this poor performance is just >> due to caches? >> >> *4. *My system is always querying by the latest documents indexed (I'm >> filtering by document dates), and I don't use "fq" to execute that queries. >> In this scenario, do you recommend to disable caches? > > I suspect that you may be running into a situation where you don't have > enough OS disk cache for your index. When you replicate, the new data that > has just been replicated pushes existing data out of the cache. You run your > query that is slow, and the *Solr* caches (not the same thing as the OS disk > cache) get populated, making later queries fast. You should be able to > configure autowarming on your Solr caches to help with this, but be aware > that autowarming can be time-consuming, and if you have replications > happening potentially every 30 seconds, you may find that your autowarming is > taking more time than that. This can lead to other problems. > > If the amount of disk space taken up by those 5 million documents is > significantly larger than the amount of memory available on the server that > is not allocated directly to programs like Solr itself, then the only true > solution will be to add memory to the server. > > Thanks, > Shawn >
Re: Facet field query on subset of documents
Hi Erick, Thanks for the reply and sorry, my fault, wasn't clear enough. I was wondering if there was a way to remove terms that would always be zero (because the term came from a document that didn't match the filter query). Here's an example. I have a bunch of documents with fields 'manufacturer' and 'location'. If I set my filter query to "manufacturer = Sony" and all Sony documents had a value of 'Florida' for location, then I want 'Florida' NOT to show up in my facet field results. Instead, it shows up with a count of zero (and it'll always be zero because of my filter query). Using mincount = 1 doesn't solve my problem because I don't want it to hide zeroes that came from documents that actually pass my filter query. Does that make more sense? On Thu, Nov 21, 2013 at 4:36 PM, Erick Erickson wrote: > That's what faceting does. The facets are only tabulated > for documents that satisfy they query, including all of > the filter queries and anh other criteria. > > Otherwise, facet counts would be the same no matter > what the query was. > > Or I'm completely misunderstanding your question... > > Best, > Erick > > > On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo > wrote: > > > Hi All, > > > > Is it possible to perform a facet field query on a subset of documents > (the > > subset being defined via a filter query for instance)? > > > > I understand that facet pivoting might work, but it would require that > the > > subset be defined by some field hierarchy, e.g. manufacturer -> price > (then > > only look at the results for the manufacturer I'm interested in). > > > > What if I wanted to define a more complex subset (where the name starts > > with A but ends with Z and some other field is greater than 5 and yet > > another field is not 'x', etc.)? > > > > Ideally I would then define a "facet field constraining query" to include > > only terms from documents that pass this query. > > > > Thanks, > > Luis > > >
Facet field query on subset of documents
Hi All, Is it possible to perform a facet field query on a subset of documents (the subset being defined via a filter query for instance)? I understand that facet pivoting might work, but it would require that the subset be defined by some field hierarchy, e.g. manufacturer -> price (then only look at the results for the manufacturer I'm interested in). What if I wanted to define a more complex subset (where the name starts with A but ends with Z and some other field is greater than 5 and yet another field is not 'x', etc.)? Ideally I would then define a "facet field constraining query" to include only terms from documents that pass this query. Thanks, Luis
Cancel Solr query?
Hi All, Is it possible to cancel a Solr query/request currently in progress? Suppose the user starts searching for something (that takes a long time for Solr to process), then decides the modify the query. I can simply ignore the previous request and create a new request, but Solr is still processing the old request, correct? Is there any way to cancel that first request? Thanks, Luis
Problem querying large StrField?
Hi All, It seems that I can't query on a StrField with a large value (say 70k characters). I have a Solr document with a string type: and field: Note that it's stored, in case that matters. Across my documents, the length of the value in this StrField can be up to ~70k characters or more. The query I'm trying is 'someFieldName_1:*'. If someFieldName_1 has values with length < ~10k characters, then it works fine and I retrieve various documents with values in that field. However, if I query 'someFieldName_2:*' and someFieldName_2 has values with length ~60k, I don't get back any documents. Even though I *know* that many documents have a value in someFieldName_2. If I query *:* and add someFieldName_2 in the field list, I am able to see the (large) value in someFieldName_2. So is there some type of limit to the length of strings in StrField that I can query against? Thanks, Luis
Re: Problem querying large StrField?
Update: It seems I get the bad behavior (no documents returned) when the length of a value in the StrField is greater than or equal to 32,767 (2^15). Is this some type of bit overflow somewhere? On Wed, Feb 5, 2014 at 12:32 PM, Luis Lebolo wrote: > Hi All, > > It seems that I can't query on a StrField with a large value (say 70k > characters). I have a Solr document with a string type: > > > > and field: > > stored="true" /> > > Note that it's stored, in case that matters. > > Across my documents, the length of the value in this StrField can be up to > ~70k characters or more. > > The query I'm trying is 'someFieldName_1:*'. If someFieldName_1 has values > with length < ~10k characters, then it works fine and I retrieve various > documents with values in that field. > > However, if I query 'someFieldName_2:*' and someFieldName_2 has values > with length ~60k, I don't get back any documents. Even though I *know* that > many documents have a value in someFieldName_2. > > If I query *:* and add someFieldName_2 in the field list, I am able to see > the (large) value in someFieldName_2. > > So is there some type of limit to the length of strings in StrField that I > can query against? > > Thanks, > Luis >
Re: Problem querying large StrField?
Hi Yonik, Thanks for the response. Our use case is perhaps a little unusual. The actual domain is in bioinformatics, but I'll try to generalize. We have two types of entities, call them A's and B's. For a given pair of entities (a_i, b_j) we may or may not have an associated data value z. Standard many to many stuff in a DB. Users can select an arbitrary set of entities from A. What we'd then like to ask of Solr is: Which entities of type B have a data value for any of the A's I've selected. The way we've approached this to date is to index the set of B, such that each document has a multivalued field containing the id's of all entities A that have a data value. If I select a set of A (a1, a2, a5, a9), then I would query data availability across B as dataAvailabilityField:(a1 OR a2 OR a5 OR a9). The sets of A and B are fairly large (~10 - 30k). This was working ok, but our datasets have increased and now the giant OR is getting too slow. As an alternative approach, we developed a ValueParser plugin that took advantage of our ability to sort the list of entity id's and do some clever things, like binary searches and short circuits on the results. For this to work, we concatenated all the id's into a single comma delimited value. So the data availability field is now single valued, but has a term that looks like "a1,a3,a6,a7". Our function query then takes the list of A id's that we're interested in and searches the documents for ones that match any value. Worked great and quite fast when the id list was short enough. But then we tried it on the full data set and the indexed terms of id's are HUGE. I know it's a bit of an odd use case, but have you seen anything like this before? Do you have any thoughts on how we might better accomplish this functionality? Thanks! On Wed, Feb 5, 2014 at 1:42 PM, Yonik Seeley wrote: > On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo wrote: > > Update: It seems I get the bad behavior (no documents returned) when the > > length of a value in the StrField is greater than or equal to 32,767 > > (2^15). Is this some type of bit overflow somewhere? > > I believe that's the maximum size of an indexed token. > Can you share your use-case? Why are you trying to index such large > values as a single token? > > -Yonik > http://heliosearch.org - native off-heap filters and fieldcache for solr >
Re: SOLR online reference document - WIKI
This page never came up on any of my Google searches, so thanks for the heads up! Looks good. -Luis On Tue, Jun 25, 2013 at 12:32 PM, Learner wrote: > I just came across a wonderful online reference wiki for SOLR and thought > of > sharing it with the community.. > > > https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/SOLR-online-reference-document-WIKI-tp4073110.html > Sent from the Solr - User mailing list archive at Nabble.com. >
CachedSqlEntityProcessor not adding fields
Hi All, I'm trying to use CachedSqlEntityProcessor in one of my sub-entities, but the field never gets populated. I'm using Solr 4.4. The field is a multi-valued field: The relevant part of my data-config.xml looks like: ... Let me know if you need more info. Any ideas appreciated! Thanks, Luis
Re: CachedSqlEntityProcessor not adding fields
I'm noticing some very odd behavior using dataimport from the Admin UI. Whenever I limit the number of rows to 75 or below, the aliases field never gets populated. As soon as I increase the limit to 76 or more, the aliases field gets populated! What am I not understanding here? On Tue, Jul 30, 2013 at 11:04 AM, Luis Lebolo wrote: > Hi All, > > I'm trying to use CachedSqlEntityProcessor in one of my sub-entities, but > the field never gets populated. I'm using Solr 4.4. The field is a > multi-valued field: > > The relevant part of my data-config.xml looks like: > > > > > > > > > > > > > > > cacheKey="ALIAS_MODEL_ID" cacheLookup="model.MODEL_ID"> > > > > ... > > > > > Let me know if you need more info. Any ideas appreciated! > > Thanks, > Luis >
DataImportHandler rows parameter and performance
Hi All, I'm using the Admin UI dataimport page to load some documents into my index. There's a rows parameter that you can leave blank (to load all documents). When I change it to the maximum number of documents, the performance drops by a factor of 10. For example, I have 1627 root entities. If I fill in row with 1627, indexing occurs at about 10 docs per second. If I leave it blank, it occurs at about 1 doc per second. Thanks, Luis
Query on all dynamic fields or wildcard field query
Hi All, First I have to apologize and admit that I'm asking this question before doing any real research =( Was hoping for some preliminary help before I start this endeavor tomorrow. So here goes: Can I query for a value in multiple (wildcarded) fields? For example, if I have dynamic fields fieldName_someToken (e.g. fieldName_1, fieldName_2, fieldName_3), can I construct a query like fieldName_*:someValue? The query itself doesn't work, but is there a way to query numerous dynamic fields without explicitly listing them? Thanks, Luis
SolrException parsing error
Hi All, I'm using Solr 4.1 and am receiving an org.apache.solr.common.SolrException "parsing error" with root cause java.io.EOFException (see below for stack trace). The query I'm performing is long/complex and I wonder if its size is causing the issue? I am querying via POST through SolrJ. The query (fq) itself is ~20,000 characters long in the form of: fq=(mutation_prot_mt_1_1:2374 + OR + mutation_prot_mt_2_1:2374 + OR + mutation_prot_mt_3_1:2374 + ...) + OR + (mutation_prot_mt_1_2:2374 + OR + mutation_prot_mt_2_2:2374 + OR + mutation_prot_mt_3_2:2374+...) + OR + ... In short, I am querying for an ID throughout multiple dynamically created fields (mutation_prot_mt_#_#). Any thoughts on how to further debug? Thanks in advance, Luis -- SEVERE: Servlet.service() for servlet [X] in context with path [/x] threw exception [Request processing failed; nested exception is org.apache.solr.common.SolrException: parsing error] with root cause java.io.EOFException at org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:193) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:107) at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:387) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301) at x.x.x.x.x.x.someMethod(x.java:111) at x.x.x.x.x.x.otherMethod(x.java:222) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213) at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126) at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578) at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80) at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923) at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:778) at javax.servlet.http.HttpServlet.service(HttpServlet.java:621) at javax.servlet.http.HttpServlet.service(HttpServlet.java:722) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) at x.x.x.x.x.yetAnotherMethod(x.java:333) at 
org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:118) at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:84) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:113) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:113) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:146) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:54) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterCha
Re: Query Parser OR AND and NOT
What if you try city:(*:* -H*) OR zip:30* Sometimes Solr requires a list of documents to subtract from (think of "*:* -someQuery" converts to "all documents without someQuery"). You can also try looking at your query with debugQuery = true. -Luis On Mon, Apr 15, 2013 at 12:25 PM, Peter Schütt wrote: > Hallo, > > > Roman Chyla wrote in > news:caen8dywjrl+e3b0hpc9ntlmjtrkasrqlvkzhkqxopmlhhfn...@mail.gmail.com: > > > should be: -city:H* OR zip:30* > > > -city:H* OR zip:30* numFound:2520 > > gives the same wrong result. > > > Another Idea? > > Ciao > Peter Schütt > > >
Re: SolrException parsing error [Solved]
Sorry, spoke to soon. Turns out I was not sending the query via POST. Changing the method to POST solved the issue. Apologies for the spam! -Luis On Mon, Apr 15, 2013 at 11:47 AM, Luis Lebolo wrote: > Hi All, > > I'm using Solr 4.1 and am receiving an > org.apache.solr.common.SolrException "parsing error" with root cause > java.io.EOFException (see below for stack trace). The query I'm performing > is long/complex and I wonder if its size is causing the issue? > > I am querying via POST through SolrJ. The query (fq) itself is ~20,000 > characters long in the form of: > > fq=(mutation_prot_mt_1_1:2374 + OR + mutation_prot_mt_2_1:2374 + OR + > mutation_prot_mt_3_1:2374 + ...) + OR + (mutation_prot_mt_1_2:2374 + OR + > mutation_prot_mt_2_2:2374 + OR + mutation_prot_mt_3_2:2374+...) + OR + ... > > In short, I am querying for an ID throughout multiple dynamically created > fields (mutation_prot_mt_#_#). > > Any thoughts on how to further debug? > > Thanks in advance, > Luis > > -- > > SEVERE: Servlet.service() for servlet [X] in context with path [/x] threw > exception [Request processing failed; nested exception is > org.apache.solr.common.SolrException: parsing error] with root cause > java.io.EOFException > at > org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:193) > at > org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:107) > at > org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41) > at > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:387) > at > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > at > org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90) > at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301) > at x.x.x.x.x.x.someMethod(x.java:111) > at x.x.x.x.x.x.otherMethod(x.java:222) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at > org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213) > at > org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126) > at > org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96) > at > org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617) > at > org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578) > at > org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80) > at > org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923) > at > org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) > at > org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) > at > org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:778) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:621) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:722) > at > 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) > at x.x.x.x.x.yetAnotherMethod(x.java:333) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) > at > org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:118) > at > org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:84) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) > at > org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:113) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) > at > org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(Anon
Re: SolrException parsing error
Turns out I spoke too soon. I was *not* sending the query via POST. Changing the method to POST solved the issue for me (maybe I was hitting a GET limit somewhere?). -Luis On Tue, Apr 16, 2013 at 7:38 AM, Marc des Garets wrote: > Did you find anything? I have the same problem but it's on update requests > only. > > The error comes from the solrj client indeed. It is solrj logging this > error. There is nothing in solr itself and it does the update correctly. > It's fairly small simple documents being updated. > > > On 04/15/2013 07:49 PM, Shawn Heisey wrote: > >> On 4/15/2013 9:47 AM, Luis Lebolo wrote: >> >>> Hi All, >>> >>> I'm using Solr 4.1 and am receiving an org.apache.solr.common.** >>> SolrException >>> "parsing error" with root cause java.io.EOFException (see below for stack >>> trace). The query I'm performing is long/complex and I wonder if its size >>> is causing the issue? >>> >>> I am querying via POST through SolrJ. The query (fq) itself is ~20,000 >>> characters long in the form of: >>> >>> fq=(mutation_prot_mt_1_1:2374 + OR + mutation_prot_mt_2_1:2374 + OR + >>> mutation_prot_mt_3_1:2374 + ...) + OR + (mutation_prot_mt_1_2:2374 + OR + >>> mutation_prot_mt_2_2:2374 + OR + mutation_prot_mt_3_2:2374+...) + OR + >>> ... >>> >>> In short, I am querying for an ID throughout multiple dynamically created >>> fields (mutation_prot_mt_#_#). >>> >>> Any thoughts on how to further debug? >>> >>> Thanks in advance, >>> Luis >>> >>> --** >>> >>> SEVERE: Servlet.service() for servlet [X] in context with path [/x] threw >>> exception [Request processing failed; nested exception is >>> org.apache.solr.common.**SolrException: parsing error] with root cause >>> java.io.EOFException >>> at >>> org.apache.solr.common.util.**FastInputStream.readByte(**FastInputStream.java:193) >>> >>> at org.apache.solr.common.util.**JavaBinCodec.unmarshal(** >>> JavaBinCodec.java:107) >>> at >>> org.apache.solr.client.solrj.**impl.BinaryResponseParser.** >>> processResponse(**BinaryResponseParser.java:41) >>> at >>> org.apache.solr.client.solrj.**impl.HttpSolrServer.request(**HttpSolrServer.java:387) >>> >>> at >>> org.apache.solr.client.solrj.**impl.HttpSolrServer.request(**HttpSolrServer.java:181) >>> >>> at >>> org.apache.solr.client.solrj.**request.QueryRequest.process(**QueryRequest.java:90) >>> >>> at org.apache.solr.client.solrj.**SolrServer.query(SolrServer.** >>> java:301) >>> >> >> I am guessing that this log is coming from your SolrJ client, but That is >> not completely clear, so is it SolrJ or Solr that is logging this error? >> If it's SolrJ, do you see anything in the Solr log, and vice versa? >> >> This looks to me like a network problem, where something is dropping the >> connection before transfer is complete. It could be an unusual server-side >> config, OS problems, timeout settings in the SolrJ code, NIC >> drivers/firmware, bad cables, bad network hardware, etc. >> >> Thanks, >> Shawn >> >> >
SolrJ Custom RowMapper
Hi All, Does SolrJ have an option for a custom RowMapper or BeanPropertyRowMapper (I'm using Spring/JDBC terms). I know the QueryResponse has a getBeans method, but I would like to create my own mapping and plug it in. Any pointers? Thanks, Luis
SolrDocument getFieldNames() exclude dynamic fields?
Hi All, I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a query. When I use SolrDocument's getFieldNames(), I get back a list of fields that excludes dynamic fields (even though I know they are not empty). Is there a way to get a list of all fields for a given SolrDocument? Thanks, Luis
Re: SolrDocument getFieldNames() exclude dynamic fields?
Apologies, I wasn't storing these dynamic fields. On Fri, Apr 26, 2013 at 11:01 AM, Luis Lebolo wrote: > Hi All, > > I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a > query. When I use SolrDocument's getFieldNames(), I get back a list of > fields that excludes dynamic fields (even though I know they are not empty). > > Is there a way to get a list of all fields for a given SolrDocument? > > Thanks, > Luis >
Re: Add copyTo Field without re-indexing?
Hello. You can also develop an application by yourself that uses Solrj to retrieve all the documents from your índex, process and add all the new information (fields) desired and the index them into another Solr index. Its easy. Goodbye! El 16/09/2011, a las 17:39, "Olson, Ron" escribió: > Hi all- > > I have an 11 gig index that I realize I need to add another field to, but not > from the actual query using DIH, but via copyTo. > > Is there any way to re-parse an existing index, adding the new copyTo field, > without having to basically start all over again with DIH? > > Thanks, > > Ron > > DISCLAIMER: This electronic message, including any attachments, files or > documents, is intended only for the addressee and may contain CONFIDENTIAL, > PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended > recipient, you are hereby notified that any use, disclosure, copying or > distribution of this message or any of the information included in or with it > is unauthorized and strictly prohibited. If you have received this message > in error, please notify the sender immediately by reply e-mail and > permanently delete and destroy this message and its attachments, along with > any copies thereof. This message does not create any contractual obligation > on behalf of the sender or Law Bulletin Publishing Company. > Thank you.
Distributed search has problems with some field names
Hello all, I'm experimenting with the "Distributed Search" bits in the nightly builds and I'm facing a problem. I have on my schema.xml some dynamic fields defined like this: multiValued="true" /> When hitting a single shard the following query works fine: http:///select?q=*:*&fl=ts,$distinct_boxes But when I add the "&distrib=true" parameter I get a NullPointerException: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.returnFields(QueryComponent.java:1025) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:725) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:700) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:292) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1451) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) The "$" in "$distinct_boxes" appears to be the culprit somehow, the query: /select?q=*:*&fl=ts,distinct_boxes&distrib=true> works without errors, but of course doesn't retrieve the field I want. Funnily enough when requesting the uniqueKey field there are no errors: /select?q=*:*&fl=tid,ts,$distinct_boxes&distrib=true> But somehow the data from the field "$distinct_boxes" doesn't appear in the output. Is there some workaround? Using "fl=*" returns all the data from the fields that start with "$" but it severely increases the size of the response. -- Luis Neves
Re: Distributed search has problems with some field names
Hi, On 09/29/2011 03:10 PM, Erick Erickson wrote: I know I've seen other anomalies with odd characters in field names. In general, it's much safer to use only letters, numbers, and underscores. In fact, I even prefer lowercase letters. Since you're pretty sure those work, why not just use them? Yes, that's what I ended up doing, but it involved a reindex. I was trying to avoid that. Thanks! -- Luis Neves
r1201855 broke stats.facet on long fields
Hello, I've a "long" field defined in my schema: omitNorms="true" positionIncrementGap="0" /> Before r1201855 I could use "stats.facet=ts" which allowed me to have a timeseries of sorts, now I get an error: "Stats can only facet on single-valued fields, not: ts[long{class=org.apache.solr.schema.TrieLongField,analyzer=org.apache.solr.analysis.TokenizerChain,args={precisionStep=0, positionIncrementGap=0, omitNorms=true}}]" Is there any hope of having the old behavior back? Looking at the changed code I see this: if (facetFieldType.isTokenized() || facetFieldType.isMultiValued()) { throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Stats can only facet on single-valued fields, not: " + facetField + "[" + facetFieldType + "]"); } this seem to also "fix" SOLR-1782. -- Luis Neves
Re: r1201855 broke stats.facet on long fields
On 12/08/2011 11:16 PM, Chris Hostetter wrote: ...so if you don't have a version param, or your version param is "1.0" then that would explain this error I have the version param set to "1.4". (If that doens't fix the problem for you. It doesn't. > then i'm genuinely baffled, and please file a Jira bug with as much details as possible about your setup (ideally a fully usable solrconfig.xml+schema.xml that demonstrates your problem) because the StatsComponentTest most certainly already tests that stats can be computed on a multiValued="false" TrieLongField with precisionsStep="0") I will try to set up a reproducible test case. Thanks! -- Luis Neves.
Querying for ~2000 integers - better model?
Hello! First time poster so {insert ignorance disclaimer here ;)}. I'm building a web application backed by an Oracle database and we're using Lucene Solr to index various lists of "entities" (via DIH). We then harness Solr's faceting to allow the user to filter through their searches. One aspect we're having trouble modeling is the concept of data availability. A dataset will have a data value for various entity pairs. To generalize, say we have two entities: Apples and Oranges. Therefore, there's a data value for various Apple and Orange pairs (e.g. apple1 & orange5 have value 6.566). The question we want to model is "which Apples have data for a specific set of Oranges." The problem is that the list of Oranges can be ~2000. Our first (and albeit ugly) approach was to create a dataAvailability field in each Apple document. It's a multi-valued field that holds a list of Oranges (actually a list of Orange IDs) that have data for that specific Apple. Our facet query then becomes ...facet.query=dataAvailability:(1 OR 2 OR 4 OR 45 OR 200 OR ...)... For > 1000 Oranges, the query takes a long time to run the first time a user performs it (afterwards it gets cached so it runs fairly quickly). Any thoughts on how to speed this up? Is there a better model to use? One idea was to use the autowarming features. However, the list of Oranges will always be dynamically built by the user (and it's not feasible to autowarm all possible permutations of ~2000 Oranges =)). Hope the generalization isn't too stupid, and thanks in advance! Cheers, Luis
Re: Querying for ~2000 integers - better model?
Hi Mikhail, Thanks for the interest! The user selects various Oranges from the website. The list of Orange IDs then gets placed into a table in our database. For example, the user may want to search oranges from Florida (a state filter) planted a week ago (a data filter). We then display 600 Oranges that fit this query and the user says "select them all". We then store all 600 IDs in our database. For the data availability filter, we get the list of Orange IDs from the database first then use SolrJ to create the facet query. -Luis On Tue, Feb 5, 2013 at 12:03 PM, Mikhail Khludnev < mkhlud...@griddynamics.com> wrote: > Hello Luis, > > Your problem seems fairly obvious (hard to solve problem). > Where these set of orange id come from? Does an user enter thousand of > these ids into web-form? > > > On Tue, Feb 5, 2013 at 8:49 PM, Luis Lebolo wrote: > > > Hello! First time poster so {insert ignorance disclaimer here ;)}. > > > > I'm building a web application backed by an Oracle database and we're > using > > Lucene Solr to index various lists of "entities" (via DIH). We then > harness > > Solr's faceting to allow the user to filter through their searches. > > > > One aspect we're having trouble modeling is the concept of data > > availability. A dataset will have a data value for various entity pairs. > To > > generalize, say we have two entities: Apples and Oranges. Therefore, > > there's a data value for various Apple and Orange pairs (e.g. apple1 & > > orange5 have value 6.566). > > > > The question we want to model is "which Apples have data for a specific > set > > of Oranges." The problem is that the list of Oranges can be ~2000. > > > > Our first (and albeit ugly) approach was to create a dataAvailability > field > > in each Apple document. It's a multi-valued field that holds a list of > > Oranges (actually a list of Orange IDs) that have data for that specific > > Apple. > > > > Our facet query then becomes ...facet.query=dataAvailability:(1 OR 2 OR 4 > > OR 45 OR 200 OR ...)... > > > > For > 1000 Oranges, the query takes a long time to run the first time a > > user performs it (afterwards it gets cached so it runs fairly quickly). > Any > > thoughts on how to speed this up? Is there a better model to use? > > > > One idea was to use the autowarming features. However, the list of > Oranges > > will always be dynamically built by the user (and it's not feasible to > > autowarm all possible permutations of ~2000 Oranges =)). > > > > Hope the generalization isn't too stupid, and thanks in advance! > > > > Cheers, > > Luis > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > >
FunctionQuery does not work as advertised
Hello all. I have the need to include the result of a computed value in the search results of solr query and sort by that value. The documentation about FunctionQuery available at: <http://wiki.apache.org/solr/FunctionQuery> states that this is possible (see the "General Example" at the bottom), but I'm unable to make it work. Using solr1.3 and the included example application this is what I get: Query: http://localhost:8983/solr/select?q=id:SP2514N&fl=id,popularity,score> numFound=1 score=3.5649493 But for the Query: <http://localhost:8983/solr/select?q=id:SP2514N _val_:"pow(popularity,2)"&fl=id,popularity,score>, the expected results were: numFound=1 score=36 and what I get instead is: numFound=26 score=13.155498 (for doc with id=SP2514N) This is surprising for two reasons: -The score value is not the square of the "popularity" field. -The result set cardinality is altered by the use of the FunctionQuery and I was under the impression that functions changed the ordering of the results but had no effect on the actual number of matched documents. Is this a bug in solr or the is documentation at fault? Am I missing something? Is there any way to include a computed value in the search results and sort by it? Thanks in advance. -- /** * Luis Neves * @e-mail: luis.ne...@co.sapo.pt * @xmpp: lfs_ne...@sapo.pt * @web: <http://technotes.blogs.sapo.pt/> * @tlm: +351 962 057 656 */
OOM when autowarming is enabled
Hello all. We are having some issues with one of our Solr instances when autowarming is enabled. The index has about 2.2M documents and 2GB of size, so it's not particularly big. Solr runs with "-Xmx1024M -Xms1024M". We are constantly inserting and updating the index, about 20 new/updated documents per minute, with a commit every 10 minutes. These are our cache settings: autowarmCount="256"/> autowarmCount="256"/> autowarmCount="0"/> When the autowarming is disabled there are no OOM errors, but the first search after a commit takes ~10 seconds and that is too long. I've enabled the "-XX:+HeapDumpOnOutOfMemoryError" flag. If this happen again I will be able to produce a headdump for analysis... meanwhile is there any setting that we can tweak that is easier on the memory and still manages to make the first search after a commit return in a reasonable time? Thanks! -- Luis Neves StackTrace: Error during auto-warming of key:[EMAIL PROTECTED]:java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:104) at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:159) at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:165) at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:153) at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54) at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:429) at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:380) at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:383) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:350) at org.apache.solr.search.function.ReverseOrdFieldSource.getValues(ReverseOrdFieldSource.java:56) at org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:57) at org.apache.solr.search.function.LinearFloatFunction.getValues(LinearFloatFunction.java:49) at org.apache.solr.search.function.FunctionQuery$AllScorer.(FunctionQuery.java:100) at org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:78) at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:233) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143) at org.apache.lucene.search.Searcher.search(Searcher.java:118) at org.apache.lucene.search.Searcher.search(Searcher.java:97) at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:888) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805) at org.apache.solr.search.SolrIndexSearcher.access$1(SolrIndexSearcher.java:709) at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:251) at org.apache.solr.search.LRUCache.warm(LRUCache.java:193) at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1385) at org.apache.solr.core.SolrCore$1.call(SolrCore.java:488) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619)
Re: OOM when autowarming is enabled
Yonik Seeley wrote: On 7/25/07, Luis Neves <[EMAIL PROTECTED]> wrote: We are having some issues with one of our Solr instances when autowarming is enabled. The index has about 2.2M documents and 2GB of size, so it's not particularly big. Solr runs with "-Xmx1024M -Xms1024M". "Big" is relative to what you are trying to do (faceting, sorting, etc). Good point. We don't use faceting or sorting in this particular index. From the stack trace it looks like a function query is the last straw... it causes a FieldCache entry to be populated, just like sorting would. Depending on the number of unique terms in the field, and the number of fields you sort on or do function queries on, it can take quite a bit of memory. I see ... we use the DismaxQueryHandler and the bf parameter is set like: linear(recip(rord(EntryDate),1,1000,1000),11,0) The objective is to boost the documents by "freshness" ... this is probably the cause of the memory abuse since all the "EntryDate" values are unique. I will try to use something like: EntryDate:[* TO NOW/DAY-3MONTH]^1.5 Thanks!! -- Luis Neves
Re: OOM when autowarming is enabled
Luis Neves wrote: The objective is to boost the documents by "freshness" ... this is probably the cause of the memory abuse since all the "EntryDate" values are unique. I will try to use something like: EntryDate:[* TO NOW/DAY-3MONTH]^1.5 This turn out to be a bad idea ... for some reason using the BoostQuery instead of the BoostFunction slows the search to a crawl. -- Luis Neves
Re: OOM when autowarming is enabled
Yonik Seeley wrote: On 7/25/07, Luis Neves <[EMAIL PROTECTED]> wrote: This turn out to be a bad idea ... for some reason using the BoostQuery instead of the BoostFunction slows the search to a crawl. Dismax throws bq in with the main query, so it can't really be cached separately, so iterating over the number of terms in [* TO NOW/DAY-3MONTH] for each query is expensive. Ok. You could try lowering the resolution of EntryDate to lower the number of unique terms (but that would require reindexing). That would speed up a range query, or lower the memory usage of the FieldCache entry. Solr could also somehow be smarter about the FieldCache and only cache the ordinal and not the actual values (this could apply to sorting too). Lucene's FieldCache doesn't currently support that though, so it would require some hacking. If you didn't want date math, date faceting, or date ranges, you could simply store a date as a classic integer (number of seconds since epoch). function queries would still work on this, and the FieldCache would be 4 bytes per doc. I will do a combination of both, I will add a new int field to the index and use it to hold the number of weeks since epoch (week resolution is good enough for freshness in our case). Thanks! -- Luis Neves
help with dismax query handler syntax
Hello all, Using the standard query handler I can search for a term excluding a category and sort descending by price, e.g.: http://localhost/solr/select/?q=book+-Category:Adults;Price+desc&start=0&rows=10&fl=*,score I'm scratching my head on how to do the same with the Dismax query handler, can anyone point me in the right direction. Thanks! -- Luis Neves
Re: help with dismax query handler syntax
Nevermind, I got it ... Somehow I missed the javadoc. -- Luis Neves Luis Neves wrote: Hello all, Using the standard query handler I can search for a term excluding a category and sort descending by price, e.g.: http://localhost/solr/select/?q=book+-Category:Adults;Price+desc&start=0&rows=10&fl=*,score I'm scratching my head on how to do the same with the Dismax query handler, can anyone point me in the right direction. Thanks! -- Luis Neves
Varying the score acording to search word
Hello all. We have a product catalog that is searchable via Solr, by default we want to exclude results from the "Adult" category unless the search terms match a predetermined list of words. Example: Client searches for "doll", "doll" is not on the list -> we *don't want* to show him Adult results. Client searched for "aphrodisiac", "aphrodisiac" is on the list -> we *want* to show him Adult results. Did I made sense? FunctionQuery seems to be what I want, but it's not clear to me how to use it in this particular case. Can anyone point me in the right direction? Thanks! Luis Neves
Re: Varying the score acording to search word
Mental note: think before post ... this is a simple job for a Servlet filter. sorry for the noise. -- Luis Neves Luis Neves wrote: Hello all. We have a product catalog that is searchable via Solr, by default we want to exclude results from the "Adult" category unless the search terms match a predetermined list of words. Example: Client searches for "doll", "doll" is not on the list -> we *don't want* to show him Adult results. Client searched for "aphrodisiac", "aphrodisiac" is on the list -> we *want* to show him Adult results. Did I made sense? FunctionQuery seems to be what I want, but it's not clear to me how to use it in this particular case. Can anyone point me in the right direction? Thanks! Luis Neves
Re: result grouping?
Yonik Seeley wrote: On 1/3/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: thanks. Yes, the presentation layer could group results, but that is not practical if i want to show the first 20 results out of 200,000 matches. Nutch groups the results by site. Any idea how they do it? Good question. Off the top of my head, one could use a priority queue that can change it's size dynamically. One could increment a group count for each hit (like faceted search with the FieldCache) and if the group count exceeds "n", then you increment the size of the priority queue to allow an additional item to be collected to compensate. -Yonik You might as wheel say that I have to change the dilithium crystals in the flux capacitor :-) One of the reasons I like Solr so much is because I get impressive results without having to know Lucene, which is something that will have to change because I also need this feature. Not knowing much about the internal of Solr/Lucene I had a look at the Facet code in search of ideas, but from what I could see the facet counts are calculated after the Documents are added to the response, it seems to me that any kind of grouping has to be done before that... right? Could you explain in more detail where should I look? Can the TopFieldDocCollector/TopFieldDocs classes be used to this end? I'm immersing my self on Lucene but it will take some time. Side note: Over here, beside Solr, we also use the "FAST" search platform and they call this feature "Field collapsing": <http://www.fastsearch.com/glossary.aspx?m=48&amid=299> I like the syntax they use: "&collapseon=&collapsenum=N" -> Collapse, but keep N number of collapsed documents For some reason they can only collapse on numeric fields (int32). Regards, Luis Neves
Re: is search possible while indexing?
Rafeek Raja wrote: I am beginner to solr and lucene. Is search possible while indexing? Yes... that is just one of the cool features of Solr/Lucene. <http://incubator.apache.org/solr/features.html> -- Luis Neves
Re: result grouping?
Yonik Seeley wrote: There are still some things underspecified though. Let's take an example of collapseon=site, collapsenum=2 The list of un-collapsed matches and their relevancy scores (sort order) is: doc=51, site=A, score=100 doc=52, site=B, score=90 doc=53, site=C, score=80 doc=54, site=B, score=70 doc=55, site=D, score=60 doc=56, site=E, score=50 doc=57, site=B, score=40 doc=58, site=A, score=30 1) If I ask for the top 4 docs, should I get [51,52,53,54] or [51,52,54,53]. Are lower ranking docs moved up in the rankings to be in their higher ranking "group"? The docs move up the ranking. You should get [51,58,52,54] ... or one could make the case that you should get [51,58,52,54,53,55], to get the somewhat equivalent behaviour of a SQL "quota-query", in that case that case the "top 4" would not refer to the number of documents but the number of distinct values for the field you are collapsing. 2) If I ask for the top 3 docs, should I get [51,52,53] because those are the top 3 scoring docs, or should I get [51,58,52] because documents were first groups and then ranked (and 51 and 58 go together)? Another way of asking this is related to (1): should docs outside the "window" be moved up in the rankings to be in their higher ranking "group"? See above. 3) Should the number of documents in a "group" change the relevancy? Should site=B rank higher than site=A? I don't think so... don't know if that is what *should* be done, but that's not what FAST does. 4) Is the collapsing only in the returned results, or just within a page of results. If I ask for docs 4 through 7, should doc 57 be in that list or not? With "FAST" that is an option, the default behaviour is to remove the documents from the resultset and the 57 would not be on the list, but you can choose to not remove them and in that case they are presented last. Defining things to make sense while retaining the ability to page through the results seems to be the challenge. I'm beginning to think that this a little to complex for a first project with Lucene. In my particular case all I want is to group results by category (from a predetermined - and small - category list), I think I will just make a request by category and accept the latency. -- Luis Neves
XML querying
Hello. What I do now to index XML documents it's to use a Filter to strip the markup, this works but it's impossible to know where in the document is the match located. What would it take to make possible to specify a filter query that accepts xpath expressions?... something like: fq=xmlField:/book/content/text() This way only the "/book/content/" element was searched. Did I make sense? Is this possible? -- Luis Neves
Re: XML querying
Hi! Thorsten Scherler wrote: On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote: Hello. What I do now to index XML documents it's to use a Filter to strip the markup, this works but it's impossible to know where in the document is the match located. What would it take to make possible to specify a filter query that accepts xpath expressions?... something like: fq=xmlField:/book/content/text() This way only the "/book/content/" element was searched. Did I make sense? Is this possible? AFAIK short answer: no. The field is ALWAYS plain text. There is no xmlField type. ...but why don't you just add your text in multiple field when indexing. Instead of plain stripping the markup do above xpath on your document and create different fields. Like Makes sense? Yes, but I have documents with different schemas on the same "xml field", also, that way I would have to know the schema of the documents being indexed (which I don't). The schema I use is something like: Where each distinct DocumentType has its own schema. I could revise this approach to use an Solr instance for each DocumentType but I would have to find a way to "merge" results from the different instances because I also need to search across different DocumentTypes... I guess I'm SOL :-( -- Luis Neves
Re: XML querying
Hi, Thorsten Scherler wrote: On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote: I think you should explain your use case a wee bit more. What I do now to index XML documents it's to use a Filter to strip the markup, this works but it's impossible to know where in the document is the match located. why do you need to know where? Poorly phrased from my part. Ideally I want to apply "lucene filters" to the xml content. Something like what Nux does: <http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html> -- Luis Neves
Document "freshness" and Boost Functions
Hello, Reading the javadocs from the DisMaxRequestHandler I see that is possible to use "Boost Functions" to influence the score. How would that work in order to improve the score of recent documents? (I have a timestamp field in the schema)... I'm assuming it's possible (right?), but I can't figure out the syntax. -- Luis Neves
Increment field value
Hello all, We have a Solr/Lucene index for newspaper articles, those articles have associated comments. When searching for articles we want to present the number of comments per article. What we do now is to fetch from the DB the sum of comments for each articleId that Solr returns, but this is bringing the DB to its knees. We would like to store the number of comments in the Solr index to save the DB some work. Is it possible when updating a numeric field to increment the existing value instead of replacing it with a new value? The problem we are having is that we can't retrieve the number of comments increment it and update the index because the "actual" value might be uncommitted... is there any other alternative to this problem? Thanks in advance for any help. -- Luis Neves
Re: Increment field value
I forgot one little detail. The DB server is untouchable. I have "read-only" access to it. The database is a component of an big "enterprisy" CMS. The obvious solution of adding a "#Posts" field to the table updated with a trigger is not viable. We have a ticket open with the vendor, but they are not what we could call agile. -- Luis Neves Luis Neves wrote: Hello all, We have a Solr/Lucene index for newspaper articles, those articles have associated comments. When searching for articles we want to present the number of comments per article. What we do now is to fetch from the DB the sum of comments for each articleId that Solr returns, but this is bringing the DB to its knees. We would like to store the number of comments in the Solr index to save the DB some work. Is it possible when updating a numeric field to increment the existing value instead of replacing it with a new value? The problem we are having is that we can't retrieve the number of comments increment it and update the index because the "actual" value might be uncommitted... is there any other alternative to this problem? Thanks in advance for any help. -- Luis Neves
Parsing cluster result's docs
Hi, I have a Solr instance using the clustering component (with the Lingo algorithm) working perfectly. However when I get back the cluster results only the ID's of these come back with it. What is the easiest way to retrieve full documents instead? Should I parse these IDs into a new query to Solr, or is there some configuration I am missing to return full docs instead of IDs? If it matters, I am using Solr 4.10. Thanks.
Real-Time get and Dynamic Fields: possible bug.
Hi there, I have the following dynamicFields definition in my schema.xml: I' ve seen that when fetching documents with /select?q=id:whateverId, the results returned include both i18n* and *_facet fields filled. However, when using real-time request handler (/get?ids:whateverIds) the result fetched include only i18n* dynamic fields, but *_facet ones are not included. I have the impression during /get RequestHandler the server-side regular expression used when parsing fields and fields values to return documents with existing dynamic fields seems to be wrong. From the client side, I' ve checked that the class DocField.java that parses SolrDocument to Bean ones uses the following matcher: } else if (annotation.value().indexOf('*') >= 0) { // dynamic fields are annotated as @Field("categories_*") // if the field was annotated as a dynamic field, convert the name into a pattern // the wildcard (*) is supposed to be either a prefix or a suffix, hence the use of replaceFirst name = annotation.value().replaceFirst("\\*", "\\.*"); dynamicFieldNamePatternMatcher = Pattern.compile("^" + name + "$"); } else { name = annotation.value(); } So maybe a similar behavior from the server-side is wrong. That' s the only reason I find to understand why when using /select all fields are returned but when using /get those that matches *_facet regexp are not. If you can confirm that this is a bug (because maybe is the expected behavior, but after some years using Solr I think it is not) I can create the JIRA issue and debug it more deeply to apply a patch with the aim to help. Regards, -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
Ehem, *_target ---> *_facet. 2015-05-14 16:47 GMT+02:00 Luis Cappa Banda : > Hi Yonik, > > Yes, they are the target from copyFields in the schema.xml. This *_target > fields are suposed to be used in some specific searchable (thus, tokenized) > fields that in the future are candidates to be faceted to return some > stats. For example, imagine that you have a field storing a directory path > and you want to search by. Also, you may want to facet by the whole > directory path value (not just their terms). Thats why I' m storing both > field values: searchable and tokenized one, string and 'facet candidate' > one. > > What I do not understand is that both i18n* and *_target are dynamic, > indexed and stored values. The only difference is that *_target one is > multivalued. Does it have some sense? > > > Regards > > > - Luis Cappa > > 2015-05-14 16:42 GMT+02:00 Yonik Seeley : > >> Are the _facet fields the target of a copyField in the schema? >> Realtime get either gets the values from the transaction log (and if >> you didn't send it the values, they won't be there) or gets them from >> the index to try and reconstruct what was sent in. >> >> It's generally not recommended to have copyField targets "stored", or >> have a mix of explicitly set values and copyField values in the same >> field. >> >> -Yonik >> >> On Thu, May 14, 2015 at 7:17 AM, Luis Cappa Banda >> wrote: >> > Hi there, >> > >> > I have the following dynamicFields definition in my schema.xml: >> > >> > >> > >> > >> > > /> > indexed= >> > "true" stored="true" multiValued="true" /> >> > >> > >> > I' ve seen that when fetching documents with /select?q=id:whateverId, >> the >> > results returned include both i18n* and *_facet fields filled. However, >> > when using real-time request handler (/get?ids:whateverIds) the result >> > fetched include only i18n* dynamic fields, but *_facet ones are not >> > included. >> > >> > I have the impression during /get RequestHandler the server-side regular >> > expression used when parsing fields and fields values to return >> documents >> > with existing dynamic fields seems to be wrong. From the client side, >> I' ve >> > checked that the class DocField.java that parses SolrDocument to Bean >> ones >> > uses the following matcher: >> > >> > } else if (annotation.value().indexOf('*') >= 0) { // dynamic fields >> are >> > annotated as @Field("categories_*") >> > >> > // if the field was annotated as a dynamic field, convert the name into >> a >> > pattern >> > >> > // the wildcard (*) is supposed to be either a prefix or a suffix, hence >> > the use of replaceFirst >> > >> > name = annotation.value().replaceFirst("\\*", "\\.*"); >> > >> > dynamicFieldNamePatternMatcher = Pattern.compile("^" + name + "$"); >> > >> > } else { >> > >> > name = annotation.value(); >> > >> > } >> > >> > So maybe a similar behavior from the server-side is wrong. That' s the >> only >> > reason I find to understand why when using /select all fields are >> returned >> > but when using /get those that matches *_facet regexp are not. >> > >> > If you can confirm that this is a bug (because maybe is the expected >> > behavior, but after some years using Solr I think it is not) I can >> create >> > the JIRA issue and debug it more deeply to apply a patch with the aim to >> > help. >> > >> > >> > Regards, >> > >> > >> > -- >> > - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
Hi Yonik, Yes, they are the target from copyFields in the schema.xml. This *_target fields are suposed to be used in some specific searchable (thus, tokenized) fields that in the future are candidates to be faceted to return some stats. For example, imagine that you have a field storing a directory path and you want to search by. Also, you may want to facet by the whole directory path value (not just their terms). Thats why I' m storing both field values: searchable and tokenized one, string and 'facet candidate' one. What I do not understand is that both i18n* and *_target are dynamic, indexed and stored values. The only difference is that *_target one is multivalued. Does it have some sense? Regards - Luis Cappa 2015-05-14 16:42 GMT+02:00 Yonik Seeley : > Are the _facet fields the target of a copyField in the schema? > Realtime get either gets the values from the transaction log (and if > you didn't send it the values, they won't be there) or gets them from > the index to try and reconstruct what was sent in. > > It's generally not recommended to have copyField targets "stored", or > have a mix of explicitly set values and copyField values in the same > field. > > -Yonik > > On Thu, May 14, 2015 at 7:17 AM, Luis Cappa Banda > wrote: > > Hi there, > > > > I have the following dynamicFields definition in my schema.xml: > > > > > > > > > > > indexed= > > "true" stored="true" multiValued="true" /> > > > > > > I' ve seen that when fetching documents with /select?q=id:whateverId, the > > results returned include both i18n* and *_facet fields filled. However, > > when using real-time request handler (/get?ids:whateverIds) the result > > fetched include only i18n* dynamic fields, but *_facet ones are not > > included. > > > > I have the impression during /get RequestHandler the server-side regular > > expression used when parsing fields and fields values to return documents > > with existing dynamic fields seems to be wrong. From the client side, I' > ve > > checked that the class DocField.java that parses SolrDocument to Bean > ones > > uses the following matcher: > > > > } else if (annotation.value().indexOf('*') >= 0) { // dynamic fields are > > annotated as @Field("categories_*") > > > > // if the field was annotated as a dynamic field, convert the name into a > > pattern > > > > // the wildcard (*) is supposed to be either a prefix or a suffix, hence > > the use of replaceFirst > > > > name = annotation.value().replaceFirst("\\*", "\\.*"); > > > > dynamicFieldNamePatternMatcher = Pattern.compile("^" + name + "$"); > > > > } else { > > > > name = annotation.value(); > > > > } > > > > So maybe a similar behavior from the server-side is wrong. That' s the > only > > reason I find to understand why when using /select all fields are > returned > > but when using /get those that matches *_facet regexp are not. > > > > If you can confirm that this is a bug (because maybe is the expected > > behavior, but after some years using Solr I think it is not) I can create > > the JIRA issue and debug it more deeply to apply a patch with the aim to > > help. > > > > > > Regards, > > > > > > -- > > - Luis Cappa > -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
That is something I didin' t know, but I thought it was mandatory. I' ll try to explain step by step my (I think) logical way to understand it: - If a field is indexed, you can search by it. - When faceting, you have to index the field (because it can be tokenized and then you would like to facet by their terms). Then, you need to mark as indexed those fields you want to facet by. - If you mark as stored a field, you can return its value with the 'original value' it was stored. - If you facet, you are searching, counting terms and returning values and their counters. Thus, that "returning their values" step is what I thought where 'stored=true' was necessary. If you don' t mark as stored a field indexed and 'facetable', I was expecting to not be able to return their values, so faceting has no sense. Thats what I thought, of course. If it is not necessary, thats perfect: the lighter the data, the better, and one more thing I' ve learned, :-) Anyway, I think that the question is still open: both are dynamic fields, stored (it is not necessary, OK) and indexed. When applying real time requestHandler, i18n* dynamic fields are returned but those *_facet are not. However, when applying the default /select requestHandler and finding by the document id, both i18n* and *_facet fields are returned. You can try it with Solr 5.1, the version I' m currently using. The only differences between them are: - Regular expression: i18n* VS *_facet - Multivalued: *_facet are multivalued. Regards, - Luis Cappa 2015-05-14 18:32 GMT+02:00 Yonik Seeley : > On Thu, May 14, 2015 at 10:47 AM, Luis Cappa Banda > wrote: > > Hi Yonik, > > > > Yes, they are the target from copyFields in the schema.xml. This *_target > > fields are suposed to be used in some specific searchable (thus, > tokenized) > > fields that in the future are candidates to be faceted to return some > > stats. For example, imagine that you have a field storing a directory > path > > and you want to search by. Also, you may want to facet by the whole > > directory path value (not just their terms). Thats why I' m storing both > > field values: searchable and tokenized one, string and 'facet candidate' > > one. > > OK, but you don't need to *store* the values in _facet, right? > -Yonik > -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
Yep, but those dynamic fields had a field type "string", so the unique indexed therm will be the entire field value and the faceted terms counted will match with exactly with each field value. Thats why I was confused. Typically I use faceting with string non tokenized field values for simple stats and this kind of things. Do you think the behavior explained (I mean, ghost dynamic field values when using real-time request handler) can be a bug? I don' t mind investigating it this weekend and trying to patch it. 2015-05-14 18:59 GMT+02:00 Yonik Seeley : > On Thu, May 14, 2015 at 12:49 PM, Luis Cappa Banda > wrote: > > If you don' t mark as stored a field indexed and 'facetable', I was > > expecting to not be able to return their values, so faceting has no > sense. > > Faceting does not use or retrieve stored field values. The labels > faceting returns are from the indexed values. > > "If you want the value returned, it needs to be stored" only applies > to fields in the main document list (the fields that are retrieved for > the top ranked documents). > > -Yonik > -- - Luis Cappa
Re: Issue serving concurrent requests to SOLR on PROD
Hi there, Unfortunately I don' t agree with Shawn when he suggest to update server.xml configuration up to 1 in maxThreads. If Tomcat (due to the concurrent overload you' re suffering, the type of the queries you' re handling, etc.) cannot manage the requested queries what could happen is that Tomcat internal request queue fills and and Out of Memory may appear to say hello to you. Solr is multithreaded and Tomcat also it is, but those Tomcat threads are managed by an internal thread pool with a queue. What Tomcat does is to dispatch requests as much it cans over the web applications that are deployed in it (in this case, Solr). If Tomcat receives more requests that it can answer its internal queue starts to be filled. Those timeouts from the client side you explained seems to be due to Tomcat thread pool and its queue is starting to fill up. You can check it monitoring its memory and thread usage and I' m sure you' ll see how it grows correlated with the number of concurrent requests they receive. Then, for sure you' ll se a more or less horizontal line from memory usage and those timeouts will appear from the cliente side. Basically I think that our scenarios are: - Queries are slow. You should check and try to improve them, because maybe they are bad formed and that queries are destroying your performance. Also, check your index configuration (segments number, etc.). - Queries are OK, but you receive more queries that you can handle. Your configuration and everything is well done, but you are trying to consume more requests that you can dispatch and answer. If you cannot improve your queries, or your queries are OK but you receive more requests that the ones you can handle, the only solution you have is to scale horizontally and startup new Tomcat + Solrs from 4 to N nodes. Best, - Luis Cappa 2015-05-19 15:57 GMT+02:00 Michael Della Bitta : > Are you sure the requests are getting queued because the LB is detecting > that Solr won't handle them? > > The reason why I'm asking is I know that ELB doesn't handle bursts well. > The load balancer needs to "warm up," which essentially means it might be > underpowered at the beginning of a burst. It will spool up more resources > if the average load over the last minute is high. But for that minute it > will definitely not be able to handle a burst. > > If you're testing infrastructure using a benchmarking tool that doesn't > slowly ramp up traffic, you're definitely encountering this problem. > > Michael > > Jani, Vrushank > 2015-05-19 at 03:51 > > Hello, > > We have production SOLR deployed on AWS Cloud. We have currently 4 live > SOLR servers running on m3xlarge EC2 server instances behind ELB (Elastic > Load Balancer) on AWS cloud. We run Apache SOLR in Tomcat container which > is sitting behind Apache httpd. Apache httpd is using prefork mpm and the > request flows from ELB to Apache Httpd Server to Tomcat (via AJP). > > Last few days, we are seeing increase in the requests around 2 > requests minute hitting the LB. In effect we see ELB Surge Queue Length > continuously being around 100. > Surge Queue Length: represents the total number of request pending > submission to the instances, queued by the load balancer; > > This is causing latencies and time outs from Client applications. Our > first reaction was that we don't have enough max connections set either in > HTTPD or Tomcat. What we saw, the servers are very lightly loaded with very > low CPU and memory utilisation. 
Apache preform settings are as below on > each servers with keep-alive turned off. > > > StartServers 8 > MinSpareServers 5 > MaxSpareServers 20 > ServerLimit 256 > MaxClients 256 > MaxRequestsPerChild 4000 > > > > Tomcat server.xml has following settings. > > maxThreads="500" connectionTimeout="6"/> > For HTTPD – we see that there are lots of TIME_WAIT connections Apache > port around 7000+ but ESTABLISHED connections are around 20. > For Tomact – we see about 60 ESTABLISHED connections on tomcat AJP port. > > So the servers and connections doesn't look like fully utilised to the > capacity. There is no visible stress anywhere. However we still get > requests being queued up on LB because they can not be served from > underlying servers. > > Can you please help me resolving this issue? Can you see any apparent > problem here? Am I missing any configuration or settings for SOLR? > > Your help will be truly appreciated. > > Regards > VJ > > > > > > > Vrushank Jani [http://media.for.truelocal.com.au/signature/img/divider.png] > Senior Java Developer > T 02 8312 1625[http://media.for.truelocal.com.au/signature/img/di
Solr read-only mode with same datadir: commits are not working.
Hey guys, I've doing some tests sharing the same index between three Solr servers: *SolrA*: is allowed to both read and index. The index is stored in a NFS. It has its own configuration files. *SolrB and SolrC*: they can only read from the shared index and each one has their own configuration files. Solrconfig.xml has been changed with the following parameters: single When all servers startup they all work perfectly executing search operations. The problem appears when SolrA index new documents (commiting itself afther that indexation operation). If I manually execute a commit or a softCommit to SolrB or SolrC, they are not able to see the new documents added even if it is suposed to reopen a new searcher when a commit occurs. I have noticed that a commit operation in SolrA shows different segments (the newest ones) compared with the logs that SorlB/SolrC has after a commit. In other words, SolrA shows newer segments and SolrB/SolrC appears to see just the old ones. Is that normal? Any idea or suggestion to solve this? Thank you in advance, :-) Best regards, -- - Luis Cappa
Re: Solr read-only mode with same datadir: commits are not working.
I've seen that StandardDirectoryReader appears in the commit logs. Maybe this DirectoryReader type is caching somehow the old segments in SolrB and SolrC even if they have been commited previosly. If that's true, does exist any other DirectoyReader type (I don't know, SimpleDirectoryReader or FSDirectoyReader) that always read the current segments when a commit happens? 2014-03-12 11:35 GMT+01:00 Luis Cappa Banda : > Hey guys, > > I've doing some tests sharing the same index between three Solr servers: > > *SolrA*: is allowed to both read and index. The index is stored in a NFS. > It has its own configuration files. > *SolrB and SolrC*: they can only read from the shared index and each one > has their own configuration files. Solrconfig.xml has been changed with the > following parameters: > > single > > > When all servers startup they all work perfectly executing search > operations. The problem appears when SolrA index new documents (commiting > itself afther that indexation operation). If I manually execute a commit or > a softCommit to SolrB or SolrC, they are not able to see the new documents > added even if it is suposed to reopen a new searcher when a commit occurs. > > I have noticed that a commit operation in SolrA shows different segments > (the newest ones) compared with the logs that SorlB/SolrC has after a > commit. In other words, SolrA shows newer segments and SolrB/SolrC appears > to see just the old ones. > > Is that normal? Any idea or suggestion to solve this? > > Thank you in advance, :-) > > Best regards, > > -- > - Luis Cappa > -- - Luis Cappa
Re: Solr read-only mode with same datadir: commits are not working.
Hi again! I'm diving inside DirectUpdateHandler2 code and it seems that the problem is that when a commit, when core.openNewSercher(true,true) is called it returns a RefCounted with a new searcher reference that points to an old (probably cached somehow) data dir. I've tried with core.openNewSearcher(false, false) but it doesn't work. What I think that I need is simple: after a commit, SolrIndexSearcher must be reload with a recent index snapshot not using any NRT caching method or whatever. (...) synchronized (solrCoreState.getUpdateLock()) { if (ulog != null) ulog.preSoftCommit(cmd); if (cmd.openSearcher) { core.getSearcher(true, false, waitSearcher); } else { // force open a new realtime searcher so realtime-get and versioning code can see the latest * RefCounted searchHolder = core.openNewSearcher(true, true); * searchHolder.decref(); } if (ulog != null) ulog.postSoftCommit(cmd); } It seems that executing this a new SolrIndexSearcher is returned, but I don't know how to set that new SolrIndexSearcher to the SolrCore instance: * SolrIndexSearcher searcher = core.newSearcher("Last update searcher");* Does anybody knows if possible? Thanks in advance! Best, 2014-03-12 12:10 GMT+01:00 Luis Cappa Banda : > I've seen that StandardDirectoryReader appears in the commit logs. Maybe > this DirectoryReader type is caching somehow the old segments in SolrB and > SolrC even if they have been commited previosly. If that's true, does exist > any other DirectoyReader type (I don't know, SimpleDirectoryReader or > FSDirectoyReader) that always read the current segments when a commit > happens? > > > 2014-03-12 11:35 GMT+01:00 Luis Cappa Banda : > > Hey guys, >> >> I've doing some tests sharing the same index between three Solr servers: >> >> *SolrA*: is allowed to both read and index. The index is stored in a >> NFS. It has its own configuration files. >> *SolrB and SolrC*: they can only read from the shared index and each one >> has their own configuration files. Solrconfig.xml has been changed with the >> following parameters: >> >> single >> >> >> When all servers startup they all work perfectly executing search >> operations. The problem appears when SolrA index new documents (commiting >> itself afther that indexation operation). If I manually execute a commit or >> a softCommit to SolrB or SolrC, they are not able to see the new documents >> added even if it is suposed to reopen a new searcher when a commit occurs. >> >> I have noticed that a commit operation in SolrA shows different segments >> (the newest ones) compared with the logs that SorlB/SolrC has after a >> commit. In other words, SolrA shows newer segments and SolrB/SolrC appears >> to see just the old ones. >> >> Is that normal? Any idea or suggestion to solve this? >> >> Thank you in advance, :-) >> >> Best regards, >> >> -- >> - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
Spellcheck with Distributed Search (sharding).
Hello! I'be been trying to enable Spellchecking using sharding following the steps from the Wiki, but I failed, :-( What I do is: *Solrconfig.xml* <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> suggest org.apache.solr.spelling.suggest.Suggester org.apache.solr.spelling.suggest.tst.TSTLookup suggestion true <*requestHandler name="/suggest"* class="solr.SearchHandler"> suggestion true suggest 10 suggest *Note:* I have two shards (solr1 and solr2) and both have the same solrconfig.xml. Also, bot indexes were optimized to create the spellchecker indexes. *Query* solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data * * *Response* * * { - responseHeader: { - status: 404, - QTime: 12, - params: { - shards: "solr1:8080/events/data,solr2:8080/events/data", - shards.qt: "/suggestion", - q: "m", - wt: "json", - qt: "/suggestion" } }, - error: { - msg: "Server at http://solr1:8080/events/data returned non ok status:404, message:Not Found", - code: 404 } } More query syntaxes that I used and that doesn't work: http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> Any idea of what I'm doing wrong? Thank you very much in advance! Best regards, -- - Luis Cappa
Re: Spellcheck with Distributed Search (sharding).
More info: When executing the Query to a single Solr server it works: http://solr1:8080/events/data/suggest?q=m&wt=json<http://solrclusterd.buguroo.dev:8080/events/data/suggest?q=m&wt=json> { - responseHeader: { - status: 0, - QTime: 1 }, - response: { - numFound: 0, - start: 0, - docs: [ ] }, - spellcheck: { - suggestions: [ - "m", - { - numFound: 4, - startOffset: 0, - endOffset: 1, - suggestion: [ - "marca", - "marcacom", - "mis", - "mispelotas" ] } ] } } But when choosing the Request handler this way it doesn't: http://solr1:8080/events/data/select?*qt=/sugges*t&wt=json&q=*:*<http://solrclusterd.buguroo.dev:8080/events/data/select?qt=/suggest&wt=json&q=*:*> 2013/10/23 Luis Cappa Banda > Hello! > > I'be been trying to enable Spellchecking using sharding following the > steps from the Wiki, but I failed, :-( What I do is: > > *Solrconfig.xml* > > > <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> > > suggest > org.apache.solr.spelling.suggest.Suggester > name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup > suggestion > true > > > > > <*requestHandler name="/suggest"* class="solr.SearchHandler"> > > suggestion > true > suggest > 10 > > > suggest > > > > > *Note:* I have two shards (solr1 and solr2) and both have the same > solrconfig.xml. Also, bot indexes were optimized to create the spellchecker > indexes. > > *Query* > > > solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > > * > * > *Response* > * > * > { > >- responseHeader: >{ > - status: 404, > - QTime: 12, > - params: > { > - shards: "solr1:8080/events/data,solr2:8080/events/data", > - shards.qt: "/suggestion", > - q: "m", > - wt: "json", > - qt: "/suggestion" > } > }, >- error: >{ > - msg: "Server at http://solr1:8080/events/data returned non ok > status:404, message:Not Found", > - code: 404 > } > > } > > More query syntaxes that I used and that doesn't work: > > > http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> > > > http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> > > > Any idea of what I'm doing wrong? > > Thank you very much in advance! > > Best regards, > > -- > - Luis Cappa > -- - Luis Cappa
Re: Spellcheck with Distributed Search (sharding).
Any idea? 2013/10/23 Luis Cappa Banda > More info: > > When executing the Query to a single Solr server it works: > http://solr1:8080/events/data/suggest?q=m&wt=json<http://solrclusterd.buguroo.dev:8080/events/data/suggest?q=m&wt=json> > > { > >- responseHeader: >{ > - status: 0, > - QTime: 1 > }, >- response: >{ > - numFound: 0, > - start: 0, > - docs: [ ] > }, >- spellcheck: >{ > - suggestions: > [ > - "m", > - > { > - numFound: 4, > - startOffset: 0, > - endOffset: 1, > - suggestion: > [ >- "marca", >- "marcacom", >- "mis", >- "mispelotas" >] > } > ] > } > > } > > > But when choosing the Request handler this way it doesn't: > http://solr1:8080/events/data/select?*qt=/sugges*t&wt=json&q=*:*<http://solrclusterd.buguroo.dev:8080/events/data/select?qt=/suggest&wt=json&q=*:*> > > > > > 2013/10/23 Luis Cappa Banda > >> Hello! >> >> I'be been trying to enable Spellchecking using sharding following the >> steps from the Wiki, but I failed, :-( What I do is: >> >> *Solrconfig.xml* >> >> >> <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> >> >> suggest >> org.apache.solr.spelling.suggest.Suggester >> > name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup >> suggestion >> true >> >> >> >> >> <*requestHandler name="/suggest"* class="solr.SearchHandler"> >> >> suggestion >> true >> suggest >> 10 >> >> >> suggest >> >> >> >> >> *Note:* I have two shards (solr1 and solr2) and both have the same >> solrconfig.xml. Also, bot indexes were optimized to create the spellchecker >> indexes. >> >> *Query* >> >> >> solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data >> >> * >> * >> *Response* >> * >> * >> { >> >>- responseHeader: >>{ >> - status: 404, >> - QTime: 12, >> - params: >> { >> - shards: "solr1:8080/events/data,solr2:8080/events/data", >> - shards.qt: "/suggestion", >> - q: "m", >> - wt: "json", >> - qt: "/suggestion" >> } >> }, >>- error: >>{ >> - msg: "Server at http://solr1:8080/events/data returned non ok >> status:404, message:Not Found", >> - code: 404 >> } >> >> } >> >> More query syntaxes that I used and that doesn't work: >> >> >> http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> >> >> >> http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> >> >> >> Any idea of what I'm doing wrong? >> >> Thank you very much in advance! >> >> Best regards, >> >> -- >> - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
Re: Spellcheck with Distributed Search (sharding).
I'ts just a type error, sorry about that! The Request Handler is OK spelled and it doesn't work. 2013/10/24 Dyer, James > Is it that your request handler is named "/suggest" but you are setting > "shards.qt" to "/suggestion" ? > > James Dyer > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Luis Cappa Banda [mailto:luisca...@gmail.com] > Sent: Thursday, October 24, 2013 6:22 AM > To: solr-user@lucene.apache.org > Subject: Re: Spellcheck with Distributed Search (sharding). > > Any idea? > > > 2013/10/23 Luis Cappa Banda > > > More info: > > > > When executing the Query to a single Solr server it works: > > http://solr1:8080/events/data/suggest?q=m&wt=json< > http://solrclusterd.buguroo.dev:8080/events/data/suggest?q=m&wt=json> > > > > { > > > >- responseHeader: > >{ > > - status: 0, > > - QTime: 1 > > }, > >- response: > >{ > > - numFound: 0, > > - start: 0, > > - docs: [ ] > > }, > >- spellcheck: > >{ > > - suggestions: > > [ > > - "m", > > - > > { > > - numFound: 4, > > - startOffset: 0, > > - endOffset: 1, > > - suggestion: > > [ > >- "marca", > >- "marcacom", > >- "mis", > > - "mispelotas" > >] > > } > > ] > > } > > > > } > > > > > > But when choosing the Request handler this way it doesn't: > > http://solr1:8080/events/data/select?*qt=/sugges*t&wt=json&q=*:*< > http://solrclusterd.buguroo.dev:8080/events/data/select?qt=/suggest&wt=json&q=*:* > > > > > > > > > > > > 2013/10/23 Luis Cappa Banda > > > >> Hello! > >> > >> I'be been trying to enable Spellchecking using sharding following the > >> steps from the Wiki, but I failed, :-( What I do is: > >> > >> *Solrconfig.xml* > >> > >> > >> <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> > >> > >> suggest > >> org.apache.solr.spelling.suggest.Suggester > >> >> name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup > >> suggestion > >> true > >> > >> > >> > >> > >> <*requestHandler name="/suggest"* class="solr.SearchHandler"> > >> > >> suggestion > >> true > >> suggest > >> 10 > >> > >> > >> suggest > >> > >> > >> > >> > >> *Note:* I have two shards (solr1 and solr2) and both have the same > >> solrconfig.xml. Also, bot indexes were optimized to create the > spellchecker > >> indexes. 
> >> > >> *Query* > >> > >> > >> > solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > >> > >> * > >> * > >> *Response* > >> * > >> * > >> { > >> > >>- responseHeader: > >>{ > >> - status: 404, > >> - QTime: 12, > >> - params: > >> { > >> - shards: "solr1:8080/events/data,solr2:8080/events/data", > >> - shards.qt: "/suggestion", > >> - q: "m", > >> - wt: "json", > >> - qt: "/suggestion" > >> } > >> }, > >>- error: > >>{ > >> - msg: "Server at http://solr1:8080/events/data returned non ok > >> status:404, message:Not Found", > >> - code: 404 > >> } > >> > >> } > >> > >> More query syntaxes that I used and that doesn't work: > >> > >> > >> > http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > < > http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data > > > >> > >> > >> > http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > < > http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data > > > >> > >> > >> Any idea of what I'm doing wrong? > >> > >> Thank you very much in advance! > >> > >> Best regards, > >> > >> -- > >> - Luis Cappa > >> > > > > > > > > -- > > - Luis Cappa > > > > > > -- > - Luis Cappa > > -- - Luis Cappa
Replication: slow first query after replication.
Hi guys! I have a master-slave replication (Solr 4.1 version) with a 30 seconds polling interval and continuously new documents are indexed, so after 30 seconds always new data must be replicated. My test index is not huge: just 5M documents. I have experimented that a simple "q=*:*" query appears to be very slow (up to 10 secs of QTime). After that first slow query the following "q=*:*" queries are much quicker. I feel that warming up caches after replication has something to say about this weird behavior, but maybe an index re-built is also involved. Question time: *1.* How can I warm up caches against? There exists any solrconfig.xml searcher to configure to be executed after replication events? *2. *My system needs to execute queries to the slaves continuously. If there exists any warm up way to reload caches, some queries will experience slow response times until reload has finished, isn't it? *3. *After a replication has done, does Solr execute any index rebuild operation that slow down query responses, or this poor performance is just due to caches? *4. *My system is always querying by the latest documents indexed (I'm filtering by document dates), and I don't use "fq" to execute that queries. In this scenario, do you recommend to disable caches? Thank you very much in advance! Best, -- - Luis Cappa
Re: Replication: slow first query after replication.
Against --> again, :-) 2013/11/5 Luis Cappa Banda > Hi guys! > > I have a master-slave replication (Solr 4.1 version) with a 30 seconds > polling interval and continuously new documents are indexed, so after 30 > seconds always new data must be replicated. My test index is not huge: just > 5M documents. > > I have experimented that a simple "q=*:*" query appears to be very slow > (up to 10 secs of QTime). After that first slow query the following "q=*:*" > queries are much quicker. I feel that warming up caches after replication > has something to say about this weird behavior, but maybe an index re-built > is also involved. > > Question time: > > *1.* How can I warm up caches against? There exists any solrconfig.xml > searcher to configure to be executed after replication events? > > *2. *My system needs to execute queries to the slaves continuously. If > there exists any warm up way to reload caches, some queries will experience > slow response times until reload has finished, isn't it? > > *3. *After a replication has done, does Solr execute any index rebuild > operation that slow down query responses, or this poor performance is just > due to caches? > > *4. *My system is always querying by the latest documents indexed (I'm > filtering by document dates), and I don't use "fq" to execute that queries. > In this scenario, do you recommend to disable caches? > > Thank you very much in advance! > > Best, > > -- > - Luis Cappa > -- - Luis Cappa
Re: Is there any limit how many documents can be indexed by apache solr
Hello! Checkout also your application server logs. Maybe you're trying to index Documents with any syntax error and they are skipped. Regards, - Luis Cappa 2013/11/26 Alejandro Marqués Rodríguez > Hi, > > In lucene you are supossed to be able to index up to 274 billion documents > ( http://lucene.apache.org/core/3_0_3/fileformats.html#Limitations ), so > in > Solr should be something like that. Anyway the maximum number is quite > bigger than those 11.000 ;) > > Could it be that you are reusing IDs so the new documents overwrite the old > ones? > > > 2013/11/26 Kamal Palei > > > Dear All > > I am using Apache solr 3.6.2 with Drupal 7. > > Users keeps adding their profiles (resumes) and with cron task from > Drupal, > > documents get indexed. > > > > Recently I observed, after indexing around 11,000 documents, further > > documents are not getting indexed. > > > > Is there any configuration for max documents those can be indexed. > > > > Kindly help. > > > > Thanks > > kamal > > > > > > -- > Alejandro Marqués Rodríguez > > Paradigma Tecnológico > http://www.paradigmatecnologico.com > Avenida de Europa, 26. Ática 5. 3ª Planta > 28224 Pozuelo de Alarcón > Tel.: 91 352 59 42 > -- - Luis Cappa
Facet count mismatch.
Hello! I've installed a classical two shards Solr 4.5 topology without SolrCloud balancing with an HA proxy. I've got a *copyField* like this: * * Copied from this one: * * * * ** * * * * * * * * * * * * * * ** When faceting with *tagValues* field I've got a total count of 3: - facet_counts: { - facet_queries: { }, - facet_fields: { - tagsValues: [ - "sucks", - 3 ] }, - facet_dates: { }, - facet_ranges: { } } Bug when searching like this with *tagValues* the total number of documents is not three, but two: - params: { - facet: "true", - shards: "solr1.test:8081/comments/data,solr2.test:8080/comments/data", - facet.mincount: "1", - facet.sort: "count", - q: "tagsValues:"sucks"", - facet.limit: "-1", - facet.field: "tagsValues", - wt: "json" } Any idea of what's happening here? I'm confused, :-/ Regards, -- - Luis Cappa
Optimize and replication: some questions battery.
Hello! I've got a scenario where I index very frequently on master servers and replicate to slave servers with one-minute polling. The master indexes are growing fast and I would like to optimize them to improve search queries. However...

1. During an optimize operation, can master servers index new documents? I suppose that is not possible.

2. The optimize operation can take minutes, maybe hours... and that would affect the live/production environment because new documents wouldn't be indexed. Should I optimize each slave's index instead? What will happen with replication? Will slave servers "lose" the index identifiers that allow them to replicate delta documents from the master after optimizing? Will the next replication overwrite the optimized slave indexes?

Thank you very much in advance. Regards,

--
- Luis Cappa
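For reference, an explicit optimize is usually issued against the update handler; a sketch, with host, port and core name as placeholders:

http://master-host:8983/solr/core1/update?optimize=true&maxSegments=1&waitSearcher=false

maxSegments and waitSearcher are optional. As the follow-ups in this thread note, on a frequently updated index it is often better to skip optimize entirely and let the merge policy do the work.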
Re: Optimize and replication: some questions battery.
Hi Chris,

Thank you very much for your response! It was very instructive. I knew about some performance tips to improve search, and I configured a very low merge factor (2) to favor search operations over indexing ones. I don't have deep knowledge of the internal Lucene behavior here, but I thought that an optimize operation might somehow rebuild the index, checking and fixing corrupted segments, merging whatever still needs merging, etc., so that the "new" master index would be a better index to keep inserting data into frequently.

One last question: do you think that this kind of scenario, where I continuously index and replicate data, can corrupt the index? In the past I developed a simple tool using a Lucene class to check the index and alert me if it's corrupted, so if you think this scenario is dangerous maybe I can reuse that tool to prevent weird production situations.

Best,

- Luis Cappa

2014-02-05 Chris Hostetter:
> : I've got a scenario where I index very frequently on master servers and
> : replicate to slave servers with one minute polling. Master indexes are
> : growing fast and I would like to optimize indexes to improve search
> : queries. However...
>
> For a scenario where your index is changing that rapidly, you don't want to use the optimize command at all -- it's not going to improve the performance of anything...
>
> https://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations
>
> You may want to optimize an index in certain situations -- ie: if you build your index once, and then never modify it.
>
> If you have a rapidly changing index, rather than optimizing, you likely simply want to use a lower merge factor. Optimizing is very expensive, and if the index is constantly changing, the slight performance boost will not last long. The tradeoff is often not worth it for a non-static index.
>
> In a master-slave setup, sometimes you may also want to optimize on the master so that slaves serve from a single-segment index. This can greatly increase the time to replicate the index though, so it is often not desirable either.
>
> -Hoss
> http://www.lucidworks.com/

--
- Luis Cappa
Re: Optimize and replication: some questions battery.
Hi Toke! Thanks for answering.

That's it: I mention index corruption only as a precaution, not because I have actually noticed it. In some tests in the past I found that a mergeFactor of 2 improves search speed more than a little compared with common merge factors such as 10. Of course indexing speed is penalized, but my production architecture is based on task queues and workers that index into Solr, and I've developed a custom SolrCluster module that acts as a black box: from the outside it looks like a single Solr server, but internally it balances across N Solr master servers, deciding where to index, checking Solr server status (alive, dead), executing sharded search queries, etc. So that point is under control: if I need more indexing speed I can add new Solr masters and/or new worker modules to dequeue, process and execute index operations. My main worry was squeezing as much search speed as possible out of optimizing, mergeFactor tuning, cache setup, etc.

Thanks a lot!

2014-02-06 Toke Eskildsen:
> On Thu, 2014-02-06 at 10:22 +0100, Luis Cappa Banda wrote:
> > I knew some performance tips to improve search and I configured a very low merge factor (2) to boost search operations instead of indexation ones.
>
> That would give you a small search speed increase and a huge penalty on indexing speed (as it will perform large merges all the time) and replication speed (as all file data will be updated frequently instead of just a subset of them). Unless you are absolutely sure that you need the small search speed increase, you should set this to a higher number.
>
> > I haven't got a deep knowledge of internal Lucene behavior in this case, but I thought that somehow an optimization operation may rebuild the index checking and fixing corrupted segments,
>
> To my knowledge, there are no attempts to repair corrupted segments during merge. I hope you speak of corruption as a precaution and not because it is something that happens to your setup. If you have corrupted indexes at any time, you should investigate how that happens, instead of trying to repair them.
>
> > One last question: do you think that this kind of scenario where I continuously index and replicate data will corrupt the index?
>
> Lucene is used in a lot of places with massive updates. Aside from JVM-related bugs, it has proven to be very stable under these conditions. So no, the indexing will not corrupt anything.
>
> - Toke Eskildsen, State and University Library, Denmark

--
- Luis Cappa
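For reference, the merge factor discussed here lives in the indexConfig section of solrconfig.xml. A minimal Solr 4.x sketch; the value 10 is just the common default, not a recommendation for this particular setup:

<indexConfig>
  <mergeFactor>10</mergeFactor>
  <!-- equivalent, more explicit form using the default TieredMergePolicy:
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
  -->
</indexConfig>

A higher value means fewer, cheaper merges (and smaller replication deltas) at the cost of more segments to search, which is the trade-off Toke describes.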
solr/lucene 4.10 out of memory issues
hey guys,

I'm running a SolrCloud cluster consisting of five nodes. My largest index contains 2.5 million documents and occupies about 6 gigabytes of disk space. We recently switched to the latest Solr version (4.10) from version 4.4.1, which we ran successfully for about a year without any major issues. From the get-go we started having memory problems caused by the CMS old-generation heap usage filling up incrementally. It starts out with very low memory consumption and after 12 hours or so it ends up using all available heap space. We thought it could be one of the caches we had configured, so we reduced our main core's filter cache max size from 1024 to 512 elements. The only thing we accomplished was that the cluster ran for a longer time than before.

I generated several heap dumps and basically what is filling up the heap is Lucene's field cache. It gets bigger and bigger until it fills up all available memory.

My JVM memory settings are the following:

-Xms15g -Xmx15g -XX:PermSize=512m -XX:MaxPermSize=512m -XX:NewSize=5g -XX:MaxNewSize=5g
-XX:+UseParNewGC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC

What's weird to me is that we didn't have this problem before; I'm thinking this is some kind of memory leak present in the new Lucene. We ran our old cluster for several weeks at a time without having to redeploy because of config changes or other reasons. Was there some issue reported related to elevated memory consumption by the field cache?

any help would be greatly appreciated.

regards,

--
Luis Carlos Guerrero
about.me/luis.guerrero
Re: solr/lucene 4.10 out of memory issues
Thanks for the response, I've been working on solving some of the most evident issues and I also added your garbage collector parameters. First of all the Lucene field cache is being filled with some entries which are marked as 'insanity'. Some of these were related to a custom field that we use for our ranking. We fixed our custom plugin classes so that we wouldn't see any entries related to those fields there, but it seems there are other related problems with the field cache. Mainly the cache is being filled with these types of insanity entries: 'SUBREADER: Found caches for descendants of StandardDirectoryReader' They are all related to standard solr fields. Could it be that our current schemas and configs have some incorrect setting that is not compliant with this lucene version? I'll keep investigating the subject but if there is any additional information you can give me about these types of field cache insanity warnings it would be really helpful. On Thu, Sep 11, 2014 at 3:00 PM, Timothy Potter wrote: > Probably need to look at it running with a profiler to see what's up. > Here's a few additional flags that might help the GC work better for > you (which is not to say there isn't a leak somewhere): > > -XX:MaxTenuringThreshold=8 -XX:CMSInitiatingOccupancyFraction=40 > > This should lead to a nice up-and-down GC profile over time. > > On Thu, Sep 11, 2014 at 10:52 AM, Luis Carlos Guerrero > wrote: > > hey guys, > > > > I'm running a solrcloud cluster consisting of five nodes. My largest > index > > contains 2.5 million documents and occupies about 6 gigabytes of disk > > space. We recently switched to the latest solr version (4.10) from > version > > 4.4.1 which we ran successfully for about a year without any major > issues. > > From the get go we started having memory problems caused by the CMS old > > heap usage being filled up incrementally. It starts out with a very low > > memory consumption and after 12 hours or so it ends up using up all > > available heap space. We thought it could be one of the caches we had > > configured, so we reduced our main core filter cache max size from 1024 > to > > 512 elements. The only thing we accomplished was that the cluster ran > for a > > longer time than before. > > > > I generated several heapdumps and basically what is filling up the heap > is > > lucene's field cache. it gets bigger and bigger until it fills up all > > available memory. > > > > My jvm memory settings are the following: > > > > -Xms15g -Xmx15g -XX:PermSize=512m -XX:MaxPermSize=512m -XX:NewSize=5g > > -XX:MaxNewSize=5g > > -XX:+UseParNewGC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDateStamps > > -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError > -XX:+UseConcMarkSweepGC > > What's weird to me is that we didn't have this problem before, I'm > thinking > > this is some kind of memory leak issue present in the new lucene. We ran > > our old cluster for several weeks at a time without having to redeploy > > because of config changes or other reasons. Was there some issue reported > > related to elevated memory consumption by the field cache? > > > > any help would be greatly appreciated. > > > > regards, > > > > -- > > Luis Carlos Guerrero > > about.me/luis.guerrero > -- Luis Carlos Guerrero about.me/luis.guerrero
Re: solr/lucene 4.10 out of memory issues
I checked and these 'insanity' cached keys correspond to fields we use for both grouping and faceting. The same behavior is documented here: https://issues.apache.org/jira/browse/SOLR-4866, although I have single shards for every replica which the jira says is a setup which should not generate these issues. What I don't get is why the cluster was running fine with solr 4.4, although double checking I was using LUCENE_40 as the match version. If I use this match version in my current running 4.10 cluster will it make a difference, or will I experience more issues than if I just roll back to 4.4 with LUCENE_40 match version? The problem in the end is that the fieldcache grows unlimitedly. I'm thinking its because of the insanity entries but I'm not really sure. It seem like a really big problem to leave unattended or is the use case for faceting and grouping on the same field not that common? On Tue, Sep 16, 2014 at 11:06 AM, Luis Carlos Guerrero < lcguerreroc...@gmail.com> wrote: > Thanks for the response, I've been working on solving some of the most > evident issues and I also added your garbage collector parameters. First of > all the Lucene field cache is being filled with some entries which are > marked as 'insanity'. Some of these were related to a custom field that we > use for our ranking. We fixed our custom plugin classes so that we wouldn't > see any entries related to those fields there, but it seems there are other > related problems with the field cache. Mainly the cache is being filled > with these types of insanity entries: > > 'SUBREADER: Found caches for descendants of StandardDirectoryReader' > > They are all related to standard solr fields. Could it be that our current > schemas and configs have some incorrect setting that is not compliant with > this lucene version? I'll keep investigating the subject but if there is > any additional information you can give me about these types of field cache > insanity warnings it would be really helpful. > > On Thu, Sep 11, 2014 at 3:00 PM, Timothy Potter > wrote: > >> Probably need to look at it running with a profiler to see what's up. >> Here's a few additional flags that might help the GC work better for >> you (which is not to say there isn't a leak somewhere): >> >> -XX:MaxTenuringThreshold=8 -XX:CMSInitiatingOccupancyFraction=40 >> >> This should lead to a nice up-and-down GC profile over time. >> >> On Thu, Sep 11, 2014 at 10:52 AM, Luis Carlos Guerrero >> wrote: >> > hey guys, >> > >> > I'm running a solrcloud cluster consisting of five nodes. My largest >> index >> > contains 2.5 million documents and occupies about 6 gigabytes of disk >> > space. We recently switched to the latest solr version (4.10) from >> version >> > 4.4.1 which we ran successfully for about a year without any major >> issues. >> > From the get go we started having memory problems caused by the CMS old >> > heap usage being filled up incrementally. It starts out with a very low >> > memory consumption and after 12 hours or so it ends up using up all >> > available heap space. We thought it could be one of the caches we had >> > configured, so we reduced our main core filter cache max size from 1024 >> to >> > 512 elements. The only thing we accomplished was that the cluster ran >> for a >> > longer time than before. >> > >> > I generated several heapdumps and basically what is filling up the heap >> is >> > lucene's field cache. it gets bigger and bigger until it fills up all >> > available memory. 
>> > >> > My jvm memory settings are the following: >> > >> > -Xms15g -Xmx15g -XX:PermSize=512m -XX:MaxPermSize=512m -XX:NewSize=5g >> > -XX:MaxNewSize=5g >> > -XX:+UseParNewGC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDateStamps >> > -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError >> -XX:+UseConcMarkSweepGC >> > What's weird to me is that we didn't have this problem before, I'm >> thinking >> > this is some kind of memory leak issue present in the new lucene. We ran >> > our old cluster for several weeks at a time without having to redeploy >> > because of config changes or other reasons. Was there some issue >> reported >> > related to elevated memory consumption by the field cache? >> > >> > any help would be greatly appreciated. >> > >> > regards, >> > >> > -- >> > Luis Carlos Guerrero >> > about.me/luis.guerrero >> > > > > -- > Luis Carlos Guerrero > about.me/luis.guerrero > -- Luis Carlos Guerrero about.me/luis.guerrero
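Since the insanity entries come from faceting and grouping on the same fields, one commonly suggested mitigation (a sketch only; the field name is a placeholder and a full reindex is required after the change) is to declare those fields with docValues in schema.xml, so they are served from DocValues structures instead of the uninverted Lucene FieldCache:

<field name="category" type="string" indexed="true" stored="false" docValues="true"/>

In Solr 4.10, faceting and grouping on a docValues-backed string field should avoid populating the FieldCache/UnInvertedField entries that show up in those heap dumps, at the cost of some extra index size.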
Syllabification, readability metric
Hi, Does Lucene support syllabification of words out of the box? If so, is there support for Brazilian Portuguese? I'm trying to set up a readability score for short text descriptions and this would be really helpful. thanks, -- Luis Carlos Guerrero about.me/luis.guerrero
Delete in Solr based on foreign key (like SQL delete from … where id in (select id from…)
Given the following Solr data (the XML field markup was stripped from the archived message, leaving only the raw field values):

1008rs1cz0icl2pk 2014-10-07T14:18:29.784Z h60fmtybz0i7sx87 1481314421768716288
u42xyz1cz0i7sx87 h60fmtybz0i7sx87 1481314421768716288
u42xyz1cz0i7sx87 h60fmtybz0i7sx87 1481314421448900608

I would like to know how to *DELETE documents* like the above on the Solr console, or using a script, achieving the same result as issuing the following statement in SQL (assuming all of these columns existed in a table called x):

DELETE FROM x WHERE foreign_key_docid_s in (select docid_s from x where message_state_ts < '2014-10-05' and message_state_ts > '2014-10-01')

Basically, delete all derived documents whose foreign key is the same as the primary key, where the primary key is selected between 2 dates.

Question originally posted on stackoverflow.com -> http://stackoverflow.com/questions/26248372/delete-in-solr-based-on-foreign-key-like-sql-delete-from-where-id-in-selec
Re: Delete in Solr based on foreign key (like SQL delete from … where id in (select id from…)
Hi matthew, I'm more than glad getting the ids and deleting them in a separate query, if need be. But how do I do it? It's dozens of thousands of ids that I have to delete. What's the strategy to delete them? On Fri, Oct 10, 2014 at 4:16 AM, Matthew Nigl wrote: > I was going to say that the below should do what you are asking: > > {!join from=docid_s > to=foreign_key_docid_s}(message_state_ts:[* TO 2014-10-05T00:00:00Z} AND > message_state_ts:{2014-10-01T00:00:00Z TO *]) > > But I get the same response as in > https://issues.apache.org/jira/browse/SOLR-6357 > > I can't think of any other queries at the moment. You might consider using > the above query (which should work as a normal select query) to get the > IDs, then delete them in a separate query. > > > On 10 October 2014 07:31, Luis Festas Matos wrote: > > > Given the following Solr data: > > > > > > 1008rs1cz0icl2pk > > 2014-10-07T14:18:29.784Z > > h60fmtybz0i7sx87 > > 1481314421768716288 > > u42xyz1cz0i7sx87 > > h60fmtybz0i7sx87 > > 1481314421768716288 > >u42xyz1cz0i7sx87 > >h60fmtybz0i7sx87 > >1481314421448900608 > > > > I would like to know how to *DELETE documents* above on the Solr console > or > > using a script that achieves the same result as issuing the following > > statement in SQL (assuming all of these columns existed in a table > called x > > ): > > > > DELETE FROM x WHERE foreign_key_docid_s in (select docid_s from x > > where message_state_ts < '2014-10-05' and message_state_ts > > > '2014-10-01') > > > > Basically, delete all derived documents whose foreign key is the same as > > the primary key where the primary key is selected between 2 dates. > > > > Question originally posted on stackoverflow.com -> > > > > > http://stackoverflow.com/questions/26248372/delete-in-solr-based-on-foreign-key-like-sql-delete-from-where-id-in-selec > > >
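A common pattern for this is to page through the matching documents with a normal select and then delete the collected ids in batches. A sketch, assuming SolrJ 4.x; the join query is the one Matthew suggested, the collection URL is a placeholder, and "id" stands in for whatever the real uniqueKey field is:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class DeleteByForeignKey {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Matthew's join query, used as a plain select to collect the ids to delete.
        SolrQuery query = new SolrQuery("{!join from=docid_s to=foreign_key_docid_s}"
                + "message_state_ts:[* TO 2014-10-05T00:00:00Z} AND message_state_ts:{2014-10-01T00:00:00Z TO *]");
        query.setFields("id"); // placeholder: use the real uniqueKey field name
        query.setRows(1000);

        // 1. Page through the results and collect the ids.
        List<String> ids = new ArrayList<String>();
        int start = 0;
        while (true) {
            query.setStart(start);
            SolrDocumentList page = solr.query(query).getResults();
            if (page.isEmpty()) {
                break;
            }
            for (SolrDocument doc : page) {
                ids.add((String) doc.getFieldValue("id"));
            }
            start += page.size();
        }

        // 2. Delete them in batches of 1000, then commit once at the end.
        for (int i = 0; i < ids.size(); i += 1000) {
            solr.deleteById(ids.subList(i, Math.min(i + 1000, ids.size())));
        }
        solr.commit();
        solr.shutdown();
    }
}

Tens of thousands of ids fit comfortably in memory, which is why the sketch collects first and deletes afterwards instead of deleting while paging.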
Email regular expression.
Hello everyone!

Unfortunately I have to find all e-mail addresses that appear in a text field of each document. I've been reading for a while about how to use regexps in Solr, but after trying some of them they didn't work. I've noticed that the Lucene regexp syntax is sometimes very different from classic regexp syntax, so that may be the reason they didn't work for me, and maybe someone more expert can help me.

The query I'm trying is the following:

text:/[a-z0-9_\|-]+(\.[a-z0-9_\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]|)*\.([a-z]{2,4})/

Thank you very much in advance! Best regards,

--
- Luis Cappa
Re: Email regular expression.
Hello, Jack, Steve,

Thank you for your answers. I've never used UAX29URLEmailTokenizerFactory, but I read about it before trying regexp queries. As far as I know, UAX29URLEmailTokenizerFactory tokenizes the input text and recognizes patterns that match URLs, e-mails, etc. Reading the documentation I haven't found any way to keep just the e-mail patterns and not the URL ones, for example. I feel it would make sense to specify one or more patterns in a configuration file set on the tokenizer definition in schema.xml, but I found nothing.

I just want to retrieve those indexed documents where at least one e-mail appears inside the text. However, even using UAX29URLEmailTokenizerFactory, and supposing that I store that e-mail data in a field called 'emails' (I feel creative, hehe), a query like the following appears to be... dirty:

http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc

What do you think?

And Andy... I know many regexps to find e-mail patterns in a text - that wasn't my question, and of course there is no perfect one. However, Lucene regexp syntax is different from the classic regexp one, so it's not as easy as copy & paste any regexp and, voilà! E-mails everywhere.

Thank you very much in advance, Best regards,

2013/7/30 Jack Krupansky
> Just use the UAX29URLEmailTokenizerFactory, which recognizes email addresses.
>
> Any particular reason that you're trying to reinvent the wheel?
>
> -- Jack Krupansky
>
> -Original Message- From: Luis Cappa Banda
> Sent: Tuesday, July 30, 2013 10:53 AM
> To: solr-user@lucene.apache.org
> Subject: Email regular expression.
>
> Hello everyone!
>
> Unfortunately I have to search all E-mail addresses found in a text field from each document. I've been reading for a while how to use RegExp's in Solr, but after trying some of them they didn't work. I've noticed that Lucene RegExp syntax sometimes is very different from the classic RegExp syntax, so that may be the reason why they didn't work for me, and maybe someone more expert can help me.
>
> The syntax is the following:
>
> E-mail:
>
> text:/[a-z0-9_\|-]+(\.[a-z0-9_\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]|)*\.([a-z]{2,4})/
>
> Thank you very much in advance!
>
> Best regards,
>
> --
> - Luis Cappa
Re: Email regular expression.
Hello guys, Hey, I think I´ve found how to do this just adding a filter. Just for anyone´s curiosity: Anyway, I still need to do a query like the following to retrieve those documents with at least one E-mail detected: http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc And I don´t like it, to be honest, Regards, 2013/7/30 Luis Cappa Banda > Hello, Jack, Steve, > > Thank you for your answers. I´ve never used UAX29URLEmailTokenizerFactory, > but I´ve read about it before trying RegExp´s queries. As far as I know, > UAX29URLEmailTokenizerFactory > allows to tokenize an entry text value into patterns that match URLs, > E-mails, etc. Reading the documentation I haven´t found any way to select > just E-mail patterns, not URL ones, for example. I feel that it may have > sense to specify one or multiple patterns in a configuration file to be > setted during the Tokenizer definition in the schema.xml, but I found > nothing. > > I´ve just want to retrieve those documents indexed where they appear at > least one E-mail inside de text. However, even using > UAX29URLEmailTokenizerFactory, > and suposing that I store that E-mail data in a field called 'emails' (I > feel creative, hehe), a query like the following appears to be... dirty: > > http://localhost:8080/mysolr/select?q=emails:[* TO > *]&start=0&rows=10&sort=mydate desc > > What do you think about? > > And Andy... I know many RegExps to find E-mail patterns in a text - that > wasn´t my question, and of course there is no perfect one. However, Lucene > RegExp syntax is different from classic RegExp one, so is not as easy as > copy & paste any RegExps and, voilá! E-mails everywhere. > > Thank you very much in advance, > > Best regards, > > > > > > 2013/7/30 Jack Krupansky > >> Just use the UAX29URLEmailTokenizerFactory, which recognizes email >> addresses. >> >> Any particular reason that you're trying to reinvent the wheel? >> >> -- Jack Krupansky >> >> -Original Message- From: Luis Cappa Banda >> Sent: Tuesday, July 30, 2013 10:53 AM >> To: solr-user@lucene.apache.org >> Subject: Email regular expression. >> >> >> Hello everyone! >> >> Unfortunately I have to search all E-mail addresses found in a text field >> from each document. I've been reading for a while how to use RegExp's in >> Solr, but after trying some of them they didn't work. I've noticed that >> Lucene RegExp syntax sometimes is very different from the classic RegExp >> syntax, so that may be the reason why they didn't work for me, and maybe >> someone more expert can help me. >> >> The syntax is the following: >> >> *E-mail: * >> >> text:/[a-z0-9_\|-]+(\.[a-z0-9_**\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]** >> |)*\.([a-z]{2,4})/ >> >> Thank you very much in advance! >> >> Best regards, >> >> -- >> - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
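The filter configuration in the message above was stripped by the archive. A comparable setup (a sketch only, not necessarily what was actually used) keeps only the <EMAIL> tokens emitted by the tokenizer, using a whitelist TypeTokenFilter:

<fieldType name="emails_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <!-- email_types.txt lives in the core's conf directory and contains a single line: <EMAIL> -->
    <filter class="solr.TypeTokenFilterFactory" types="email_types.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="emails" type="emails_only" indexed="true" stored="false" multiValued="true"/>
<copyField source="text" dest="emails"/>

With something like this, the 'emails' field ends up containing only e-mail tokens, which answers the earlier question about selecting e-mail patterns but not URL ones.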
Re: Email regular expression.
I've tried this kind of query in the past and found the performance poor - incredibly slow, in fact. But that's just my experience; maybe someone can share another opinion with us.

2013/7/30 Raymond Wiker
> On Jul 30, 2013, at 22:05, Luis Cappa Banda wrote:
> > Anyway, I still need to do a query like the following to retrieve those documents with at least one E-mail detected:
> >
> > http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc
>
> Can't you just use emails:* ?

--
- Luis Cappa
Re: Email regular expression.
I've been re-reading older solr-user mailing list messages about this, and it seems that a query like 'field:*' means that internally all the indexed terms are checked one by one, even if some caches are filled for that field. That explains my poor performance in the past.

However, it may be possible to create a field called 'flagEmails' that is true whenever the 'emails' field gets populated via UAX29URLEmailTokenizerFactory. Has anyone implemented this kind of behavior at index time? Is it possible?

Regards,

2013/7/30 Luis Cappa Banda
> I've tried this kind of query in the past and found the performance poor - incredibly slow, in fact. But that's just my experience; maybe someone can share another opinion with us.
>
> 2013/7/30 Raymond Wiker
>> On Jul 30, 2013, at 22:05, Luis Cappa Banda wrote:
>> > Anyway, I still need to do a query like the following to retrieve those documents with at least one E-mail detected:
>> >
>> > http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc
>>
>> Can't you just use emails:* ?
>
> --
> - Luis Cappa

--
- Luis Cappa
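One way to get such an index-time flag is a small update-processor script. This is a sketch only: it assumes Solr 4.x with StatelessScriptUpdateProcessorFactory available, the regexp is deliberately simplistic, the field names 'text' and 'flagEmails' are the ones discussed above, and the chain/script names are invented:

<!-- solrconfig.xml: an update chain that runs a script before the normal update -->
<updateRequestProcessorChain name="flag-emails">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">flag-emails.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

// flag-emails.js (in the core's conf directory): set flagEmails when the raw text looks like it contains an e-mail
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var text = doc.getFieldValue("text");
  if (text != null && /\S+@\S+\.\S+/.test(String(text))) {
    doc.setField("flagEmails", true);
  }
}
function processDelete(cmd) { }
function processCommit(cmd) { }
function finish() { }

The chain also needs a boolean 'flagEmails' field in schema.xml and must be selected by the update handler (e.g. update.chain=flag-emails). Note the flag only says the raw text looked like it contained an e-mail; it is not derived from the tokenizer output.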
Re: Performance question on Spatial Search
Hey, David, I´ve been reading the thread and I think that is one of the most educative mail-threads I´ve read in Solr mailing list. Just for curiosity: internally for Solr, is it the same a query like "field:*" and "field:[* TO *]"? I think that it´s expected to receive the same number of numFound documents, but I would like to know the internal behavior of Solr. Best regards, - Luis Cappa 2013/7/30 Smiley, David W. > Steve, > The FieldCache and DocValues are irrelevant to this problem. Solr's > FilterCache is, and Lucene has no counterpart. Perhaps it would be cool > if Solr could look for expensive field:* usages when parsing its queries > and re-write them to use the FilterCache. That's quite doable, I think. > I just created an issue for it: > https://issues.apache.org/jira/browse/SOLR-5093but don't expect me to > work on it anytime soon ;-) > > > ~ David > > On 7/30/13 2:02 PM, "Steven Bower" wrote: > > >I am curious why the field:* walks the entire terms list.. could this be > >discovered from a field cache / docvalues? > > > >steve > > > > > >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower wrote: > > > >> Until I get the data refed I there was another field (a date field) that > >> was there and not when the geo field was/was not... i tried that field:* > >> and query times come down to 2.5s .. also just removing that filter > >>brings > >> the query down to 30ms.. so I'm very hopeful that with just a boolean > >>i'll > >> be down in that sub 100ms range.. > >> > >> steve > >> > >> > >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower > >>wrote: > >> > >>> Will give the boolean thing a shot... makes sense... > >>> > >>> > >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. > >>>wrote: > >>> > >>>> I see the problem ‹ it's +pp:*. It may look innocent but it's a > >>>> performance killer. What your telling Lucene to do is iterate over > >>>> *every* term in this index to find all documents that have this data. > >>>> Most fields are pretty slow to do that. Lucene/Solr does not have > >>>>some > >>>> kind of cache for this. Instead, you should index a new boolean field > >>>> indicating wether or not 'pp' is populated and then do a simple true > >>>> check > >>>> against that field. Another approach you could do right now without > >>>> reindexing is to simplify the last 2 clauses of your 3-clause boolean > >>>> query by using the "IsDisjointTo" predicate. But unfortunately Lucene > >>>> doesn't have a generic filter cache capability and so this predicate > >>>>has > >>>> no place to cache the whole-world query it does internally (each and > >>>> every > >>>> time it's used), so it will be slower than the boolean field I > >>>>suggested > >>>> you add. > >>>> > >>>> > >>>> Nevermind on LatLonType; it doesn't support JTS/Polygons. There is > >>>> something close called SpatialPointVectorFieldType that could be > >>>>modified > >>>> trivially but it doesn't support it now. 
> >>>> > >>>> ~ David > >>>> > >>>> On 7/30/13 11:32 AM, "Steven Bower" wrote: > >>>> > >>>> >#1 Here is my query: > >>>> > > >>>> >sort=vid asc > >>>> >start=0 > >>>> >rows=1000 > >>>> >defType=edismax > >>>> >q=*:* > >>>> >fq=recordType:"xxx" > >>>> >fq=vt:"X12B" AND > >>>> >fq=(cls:"3" OR cls:"8") > >>>> >fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z] > >>>> >fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR > >>>> >vid:89XXX48 > >>>> >OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR > >>>> vid:90XXX33 > >>>> >OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR > >>>> vid:90XXX44 > >>>> >OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR > >>>> vid:91XXX87 > >>>> >OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR > >>>
Re: Performance question on Spatial Search
Thank you very much, David. That was a great explanation! Regards, - Luis Cappa 2013/7/30 Smiley, David W. > Luis, > > field:* and field:[* TO *] are semantically equivalent -- they have the > same effect. But they internally work differently depending on the field > type. The field type has the chance to intercept the range query to do > something smart (FieldType.getRangeQuery(...)). Numeric/Date (trie) > fields have a reasonably quick implementation for such queries. Spatial > fields could be enhanced similarly but aren't (yet). So in general you > should avoid field:* in favor of field:[* TO *]. Perhaps Solr should > redirect a field:* to the FieldType's getRangeQuery method so that there > is no difference. Anyway, the official/best way to ask for all data in a > field (without cheating and indexing a boolean in a different field) is > field:[* TO *]. > > ~ David > > On 7/30/13 4:44 PM, "Luis Cappa Banda" wrote: > > >Hey, David, > > > >I´ve been reading the thread and I think that is one of the most educative > >mail-threads I´ve read in Solr mailing list. Just for curiosity: > >internally > >for Solr, is it the same a query like "field:*" and "field:[* TO *]"? I > >think that it´s expected to receive the same number of numFound documents, > >but I would like to know the internal behavior of Solr. > > > >Best regards, > > > >- Luis Cappa > > > > > >2013/7/30 Smiley, David W. > > > >> Steve, > >> The FieldCache and DocValues are irrelevant to this problem. Solr's > >> FilterCache is, and Lucene has no counterpart. Perhaps it would be cool > >> if Solr could look for expensive field:* usages when parsing its queries > >> and re-write them to use the FilterCache. That's quite doable, I think. > >> I just created an issue for it: > >> https://issues.apache.org/jira/browse/SOLR-5093but don't expect me > >>to > >> work on it anytime soon ;-) > >> > >> > >> ~ David > >> > >> On 7/30/13 2:02 PM, "Steven Bower" wrote: > >> > >> >I am curious why the field:* walks the entire terms list.. could this > >>be > >> >discovered from a field cache / docvalues? > >> > > >> >steve > >> > > >> > > >> >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower > >>wrote: > >> > > >> >> Until I get the data refed I there was another field (a date field) > >>that > >> >> was there and not when the geo field was/was not... i tried that > >>field:* > >> >> and query times come down to 2.5s .. also just removing that filter > >> >>brings > >> >> the query down to 30ms.. so I'm very hopeful that with just a boolean > >> >>i'll > >> >> be down in that sub 100ms range.. > >> >> > >> >> steve > >> >> > >> >> > >> >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower > >> >>wrote: > >> >> > >> >>> Will give the boolean thing a shot... makes sense... > >> >>> > >> >>> > >> >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. > >> >>>wrote: > >> >>> > >> >>>> I see the problem ‹ it's +pp:*. It may look innocent but it's a > >> >>>> performance killer. What your telling Lucene to do is iterate over > >> >>>> *every* term in this index to find all documents that have this > >>data. > >> >>>> Most fields are pretty slow to do that. Lucene/Solr does not have > >> >>>>some > >> >>>> kind of cache for this. Instead, you should index a new boolean > >>field > >> >>>> indicating wether or not 'pp' is populated and then do a simple > >>true > >> >>>> check > >> >>>> against that field. 
Another approach you could do right now > >>without > >> >>>> reindexing is to simplify the last 2 clauses of your 3-clause > >>boolean > >> >>>> query by using the "IsDisjointTo" predicate. But unfortunately > >>Lucene > >> >>>> doesn't have a generic filter cache capability and so this > >>predicate > >> >>>>has > >> >>>> no place to cache the whole-world query it
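To make the upshot of this thread concrete (a sketch; 'pp' is the field from the thread and 'has_pp' is an invented name for the suggested boolean flag):

q=pp:*            walks every indexed term in the field; avoid it
q=pp:[* TO *]     same result; lets the field type supply a smarter range implementation (numeric/date fields do, spatial does not yet)
fq=has_pp:true    cheapest option: an explicitly indexed boolean flag, cacheable in the filterCache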
EmbeddedSolrServer Solr 4.4.0 bug?
Hello guys,

Since I upgraded from 4.1.0 to 4.4.0 I've noticed that the way EmbeddedSolrServer is constructed has changed a little:

Solr 4.1.0 style:

CoreContainer coreContainer = new CoreContainer(solrHome, new File(solrHome + "/solr.xml"));
EmbeddedSolrServer localSolrServer = new EmbeddedSolrServer(coreContainer, core);

Solr 4.4.0 new style:

CoreContainer coreContainer = new CoreContainer(solrHome);
EmbeddedSolrServer localSolrServer = new EmbeddedSolrServer(coreContainer, core);

However, it's not working. I've got a solr.xml configuration file declaring the core (its XML was stripped from the archived message), and the resources appear to be loaded correctly:

2013-07-31 09:46:37,583 47889 [main] INFO org.apache.solr.core.ConfigSolr - Loading container configuration from /opt/solr/solr.xml

But when indexing into the core with coreName 'core', it throws an exception:

2013-07-31 09:50:49,409 5189 [main] ERROR com.buguroo.solr.index.WriteIndex - No such core: core

Either I am sleepy, which is possible, or there is some kind of bug here.

Best regards,

--
- Luis Cappa
Re: EmbeddedSolrServer Solr 4.4.0 bug?
Thank you very much, Alan. Now it's working! I agree with you: this kind of things should be documented at least in CHANGELOG.txt, because when upgrading from one version to another all should be compatible between versions, but this is not the case, thus people should be noticed of that. Regards, 2013/7/31 Alan Woodward > Hi Luis, > > You need to call coreContainer.load() after construction for it to load > the cores. Previously the CoreContainer(solrHome, configFile) constructor > also called load(), but this was the only constructor to do that. > > I probably need to put something in CHANGES.txt to point this out... > > Alan Woodward > www.flax.co.uk > > > On 31 Jul 2013, at 08:53, Luis Cappa Banda wrote: > > > Hello guys, > > > > Since I upgrade from 4.1.0 to 4.4.0 version I've noticed that > > EmbeddedSolrServer has changed a little the way of construction: > > > > *Solr 4.1.0 style:* > > > > CoreContainer coreContainer = new CoreContainer(*solrHome, new > > File(solrHome+"/solr.xml"*)); > > EmbeddedSolrServer localSolrServer = new > EmbeddedSolrServer(coreContainer, > > core); > > > > *Solr 4.4.0 new style: > > * > > > > CoreContainer coreContainer = new CoreContainer(*solrHome*); > > EmbeddedSolrServer localSolrServer = new > EmbeddedSolrServer(coreContainer, > > core); > > > > > > However, it's not working. I've got the following solr.xml configuration > > file: > > > > * > hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}" > > zkClientTimeout="${zkClientTimeout:15000}"> > > * > > ** > > ** > > ** > > > > > > And resources appears to be loaded correctly: > > > > *2013-07-31 09:46:37,583 47889 [main] INFO > org.apache.solr.core.ConfigSolr > > - Loading container configuration from /opt/solr/solr.xml* > > > > > > But when indexing into core with coreName 'core', it throws an Exception: > > > > *2013-07-31 09:50:49,409 5189 [main] ERROR > > com.buguroo.solr.index.WriteIndex - No such core: core* > > > > Or I am sleppy, something that's possible, or there is some kind of bug > > here. > > > > Best regards, > > > > -- > > - Luis Cappa > > -- - Luis Cappa
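For anyone landing on this thread, a minimal sketch of the 4.4-style construction with the extra load() call that Alan points out; the solr home path and core name are placeholders:

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        CoreContainer coreContainer = new CoreContainer("/opt/solr");
        coreContainer.load();   // required in 4.4+: the constructor no longer loads the cores

        EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "core");
        try {
            server.ping();      // index or query as usual from here
        } finally {
            server.shutdown();  // also shuts down the CoreContainer
        }
    }
}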
Re: Distributed MLT is slow
Is distributed MLT officially released or you are using a patch? El martes, 20 de agosto de 2013, Shawn Heisey escribió: > Before I file an issue on this, I wanted to bring it up here, so I can see > if there's something I'm overlooking. > > Distributed MLT is very very slow for me. I can make it work, but a QTime > of one to two minutes in production isn't acceptable. Sending a > non-distributed MLT request directly to a large shard takes about 1.5 > seconds. There are six large cold shards and one tiny hot shard. > > I used my dev server to gather some logs. This server is considerably > less powerful than my production servers, but has exactly the same data. > It's running a 4.5 snapshot with the patch from SOLR-5125. Unlike my > production servers, the dev server takes over four minutes for the > distributed MLT request. Slightly redacted logfile at this URL: > > https://dl.dropboxusercontent.**com/u/97770508/slow-mlt.log<https://dl.dropboxusercontent.com/u/97770508/slow-mlt.log> > > After I ran the query that you can see in the logfile, I restarted Solr on > my dev server and ran one of the slow subrequests directly to a shard. > Here's the debugQuery timing section from that request. QTime on it was > 56506: > > "QParser":"LuceneQParser", > "timing":{ > "time":56504.0, > "prepare":{ > "time":29.0, > "query":{ > "time":29.0}, > "facet":{ > "time":0.0}, > "mlt":{ > "time":0.0}, > "highlight":{ > "time":0.0}, > "stats":{ > "time":0.0}, > "spellcheck":{ > "time":0.0}, > "debug":{ > "time":0.0}}, > "process":{ > "time":56475.0, > "query":{ > "time":935.0}, > "facet":{ > "time":0.0}, > "mlt":{ > "time":55442.0}, > "highlight":{ > "time":0.0}, > "stats":{ > "time":0.0}, > "spellcheck":{ > "time":0.0}, > "debug":{ > "time":98.0} > > Is there anything for me to do other than file an issue? > > Thanks, > Shawn > -- - Luis Cappa
Re: SOLR Prevent solr of modifying fields when update doc
Hi, The uuid, that was been used like the id of a document, it's generated by solr using an updatechain. I just use the recommend method to generate uuid's. I think an atomic update is not suitable for me, because I want that solr indexes the feeds and not me. I don't want to send information to solr, I want that indexes it each 15 minutes, for example, and now it's doing that. Lance, I don't understand what you want to say with, software that I use to index. I just use solr. I have a configuration with two entities. One that selects my rss sources from a database and then the main entity that get information from an URL and processes it. Thank you all for the answers. Much appreciated On Saturday, August 24, 2013, Greg Preston wrote: > But there is an API for sending a delta over the wire, and server side it > does a read, overlay, delete, and insert. And only the fields you sent > will be changed. > > *Might require your unchanged fields to all be stored, though. > > > -Greg > > > On Fri, Aug 23, 2013 at 7:08 PM, Lance Norskog > > > wrote: > > > Solr does not by default generate unique IDs. It uses what you give as > > your unique field, usually called 'id'. > > > > What software do you use to index data from your RSS feeds? Maybe that is > > creating a new 'id' field? > > > > There is no partial update, Solr (Lucene) always rewrites the complete > > document. > > > > > > On 08/23/2013 09:03 AM, Greg Preston wrote: > > > >> Perhaps an atomic update that only changes the fields you want to > change? > >> > >> -Greg > >> > >> > >> On Fri, Aug 23, 2013 at 4:16 AM, Luís Portela Afonso > >> > wrote: > >> > >>> Hi thanks by the answer, but the uniqueId is generated by me. But when > >>> solr indexes and there is an update in a doc, it deletes the doc and > >>> creates a new one, so it generates a new UUID. > >>> It is not suitable for me, because i want that solr just updates some > >>> fields, because the UUID is the key that i use to map it to an user in > my > >>> database. > >>> > >>> Right now i'm using information that comes from the source and never > >>> chages, as my uniqueId, like for example the guid, that exists in some > rss > >>> feeds, or if it doesn't exists i use link. > >>> > >>> I think there is any simple solution for me, because for what i have > >>> read, when an update to a doc exists, SOLR deletes the old one and > create a > >>> new one, right? > >>> > >>> On Aug 23, 2013, at 12:07 PM, Erick Erickson > >>> > > > >>> wrote: > >>> > >>> Well, not much in the way of help because you can't do what you > want AFAIK. I don't think UUID is suitable for your use-case. Why not > use your ? > > Or generate something yourself... > > Best > Erick > > > On Thu, Aug 22, 2013 at 5:56 PM, Luís Portela Afonso < > meligalet...@gmail.com > > > wrote: > > Hi, > > > > How can i prevent solr from update some fields when updating a doc? > > The problem is, i have an uuid with the field name uuid, but it is > not > > an > > unique key. When a rss source updates a feed, solr will update the > doc > > with > > the same link but it generates a new uuid. This is not the desired > > because > > this id is used by me to relate feeds with an user. > > > > Can someone help me? > > > > Many Thanks > > > > > > -- Sent from Gmail Mobile
Re: SOLR Prevent solr of modifying fields when update doc
Hi, right now I'm using the link field that comes in any rss entry as my uniqueKey. That was the best solution that I found because in many updated documents, this was the only field that never changes. Now I'm facing another problem. When I want to search for a document with that id or link, because that is my uniqueKey, I'm not able to get an unique result. I can't successfully search for a field that is a URL on solr. I think that is because I'm encoding the URL that I'm searching for, but solr doesn't decodes it. Thanks for the concern and help On Saturday, August 24, 2013, Erick Erickson wrote: > bq: but the uniqueId is generated by me. But when solr indexes and there > is an update in a doc, it deletes the doc and creates a new one, so it > generates a new UUID. > > right, this is why I was saying that a UUID field may not fit your use > case. The _point_ of a UUID field is to generate a unique entry for every > added document, there's no concept of "only generate the UUID once per > indexed" which seems to be what you want. > > So I'd do something like just use the field rather than a > separate UUID field. That doesn't change by definition. What advantage do > you think you get from the UUID field over just using your > field? > > Best, > Erick > > > On Sat, Aug 24, 2013 at 6:26 AM, Luis Portela Afonso < > meligalet...@gmail.com > > wrote: > > > Hi, > > > > The uuid, that was been used like the id of a document, it's generated by > > solr using an updatechain. > > I just use the recommend method to generate uuid's. > > > > I think an atomic update is not suitable for me, because I want that solr > > indexes the feeds and not me. I don't want to send information to solr, I > > want that indexes it each 15 minutes, for example, and now it's doing > that. > > > > Lance, I don't understand what you want to say with, software that I use > to > > index. > > I just use solr. I have a configuration with two entities. One that > selects > > my rss sources from a database and then the main entity that get > > information from an URL and processes it. > > > > Thank you all for the answers. > > Much appreciated > > > > On Saturday, August 24, 2013, Greg Preston wrote: > > > > > But there is an API for sending a delta over the wire, and server side > it > > > does a read, overlay, delete, and insert. And only the fields you sent > > > will be changed. > > > > > > *Might require your unchanged fields to all be stored, though. > > > > > > > > > -Greg > > > > > > > > > On Fri, Aug 23, 2013 at 7:08 PM, Lance Norskog > > > > > > > > > wrote: > > > > > > > Solr does not by default generate unique IDs. It uses what you give > as > > > > your unique field, usually called 'id'. > > > > > > > > What software do you use to index data from your RSS feeds? Maybe > that > > is > > > > creating a new 'id' field? > > > > > > > > There is no partial update, Solr (Lucene) always rewrites the > complete > > > > document. > > > > > > > > > > > > On 08/23/2013 09:03 AM, Greg Preston wrote: > > > > > > > >> Perhaps an atomic update that only changes the fields you want to > > > change? > > > >> > > > >> -Greg > > > >> > > > >> > > > >> On Fri, Aug 23, 2013 at 4:16 AM, Luís Portela Afonso > > > >> > wrote: > > > >> > > > >>> Hi thanks by the answer, but the uniqueId is generated by me. But > > when > > > >>> solr indexes and there is an update in a doc, it deletes the doc > and > > > >>> creates a new one, so it generates a new UUID. 
> > > >>> It is not suitable for me, because i want that solr just updates > some > > > >>> fields, because the UUID is the key that i use to map it to an user > > in > > > my > > > >>> database. > > > >>> > > > >>> Right now i'm using information that comes from the source and > never > > > >>> chages, as my uniqueId, like for example the guid, that exists in > > some > > > rss > > > >>> feeds, or if it doesn't exists i use link. > > > >>> > > > >>> I think there is any simple solution for
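Regarding looking up a document whose uniqueKey is a URL: the usual trick is either the term query parser (which takes the value verbatim) or quoting the value for the lucene parser. A sketch; the field name 'id' and the URL are placeholders:

q={!term f=id}http://example.com/feed/item-1
q=id:"http://example.com/feed/item-1"

The whole q parameter still has to be URL-encoded by the HTTP client; the servlet container decodes it before Solr parses the query, so the value seen by the parser must match the indexed value exactly. From SolrJ, ClientUtils.escapeQueryChars(url) can be used to build the second form safely.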
Solr documents update on index
Hi, I'm having a problem when Solr indexes. It is updating documents that are already indexed. Is this normal behavior? If a document with the same key already exists, is it supposed to be updated? I was thinking it was supposed to update only if the information in the RSS feed had changed. Appreciate your help -- Sent from Gmail Mobile
Re: Data import
So I'm indexing RSS feeds. I'm running the Data Import Handler full-import command with a cron job. It runs every 15 minutes and indexes a lot of RSS feeds from many sources. From the cron job I do an HTTP request using curl to the address http://localhost:port/solr/core/dataimport/?command=full-import&clean=false

When it runs, if the RSS source has a feed that is already indexed in Solr, it updates the existing document. So even if the source has the same information as what is already indexed, the indexed copy gets updated anyway. I want to prevent that. Is that clearer? I can try to provide some examples.

Thanks

On Tuesday, September 10, 2013, Chris Hostetter wrote:
> : When i run "dataimport/?command=full-import&clean=false", solr add new
> : documents with the information. But if the same information already
> : exists with the same uniquekey, it replaces the existing document with a
> : new one.
> : It does not update the document, it creates a new one. It's that possible?
>
> I'm not certain that I'm understanding your question.
>
> It is possible using Atomic Updates, but you have to be explicit about what/how you want Solr to use the new information (ie: when to replace, when to add to a multivalued field, when to increment a numeric field, etc...)
>
> https://wiki.apache.org/solr/Atomic_Updates
>
> I don't think DIH has any straightforward syntax for letting you configure this easily, but as long as you put a "map" in each field (ie: via ScriptTransformer perhaps) containing a single "modifier => value" pair you want applied to that field, it should work.
>
> : I'm indexing rss feeds. I run the rss example that exists in the solr
> : examples, and i does that.
>
> Can you please be more specific about what you would like to see happen, so we can better understand what your actual goal is? It's really not clear if using Atomic Updates is the easiest way to achieve what you're after, or if I'm just completely misunderstanding your question...
>
> https://wiki.apache.org/solr/UsingMailingLists
>
> -Hoss

--
Sent from Gmail Mobile
Re: Data import
But with atomic updates I need to send the information myself, right? I want Solr to index it automatically, and it is doing that. Can you look at the Solr example in the source? There is an example in the example-DIH folder. Imagine that you hit the import URL every 15 minutes. If the same information is already indexed, Solr will update it, and by update I mean delete and index again. I just want Solr to simply discard the information if it is already indexed.

On Tuesday, September 10, 2013, Chris Hostetter wrote:
> : With cron job, I do a http request using curl, to the address
> : http://localhost:port/solr/core/dataimport/?command=full-import&clean=false
> :
> : When it runs, if the rss source has a feed that is already indexed on solr,
> : it updates the existing source.
> : So if the source has the same information of the destiny, it updates the
> : information on the destiny.
> :
> : I want to prevent that. Is that explicit? I may try to provide some
> : examples.
>
> Yes, specific examples would be helpful -- it's not really clear what it is that you want to prevent.
>
> Please note the URL I mentioned before and use it as a guideline for how much detail we need to understand what it is you are asking...
>
> : > Can you please be more specific about what you would like to see happen,
> : > we can better understand what your actual goal is? It's really not clear
>
> : > https://wiki.apache.org/solr/UsingMailingLists
>
> -Hoss

--
Sent from Gmail Mobile
Quick question about indexing with SolrJ.
Is it possible to index plain String JSON documents using SolrJ? I already know that annotated POJOs work fine, but I need a more flexible way to index data without any intermediate POJO. That's because when changing, adding or removing fields I don't want to keep modifying that POJO again and again. -- - Luis Cappa
Re: Quick question about indexing with SolrJ.
Hello, Jack.

I don't want to use POJOs, that's the main problem. I know that you can send HTTP POST requests with JSON data to index new documents, and I would like to do the same with SolrJ, that's all, but I can't find a way to do it, :-/ . What I would like to do is simply take a String with embedded JSON and add() it via an HttpSolrServer instance. Whether the JSON matches the Solr server's schema.xml or not would be a server-side problem, not a client-side one. I mean, I want a best-effort and more flexible way to index data, and using POJOs is not the way to do that: you have to change the Java class, compile it again and relaunch whatever process uses that class.

Regards,

- Luis Cappa

2013/5/13 Jack Krupansky
> Do your POJOs follow a simple flat data model that is 100% compatible with Solr?
>
> If so, maybe you can simply ingest them by setting the Content-type to "application/json" and maybe having to put some minimal wrapper around the raw JSON.
>
> But... if they DON'T follow a simple, flat data model, then YOU are going to have to transform their data into a format that does have a simple, flat data model.
>
> -- Jack Krupansky
>
> -Original Message- From: Luis Cappa Banda
> Sent: Monday, May 13, 2013 10:52 AM
> To: solr-user@lucene.apache.org
> Subject: Quick question about indexing with SolrJ.
>
> Is it possible to index plain String JSON documents using SolrJ? I already know annotating POJOs works fine, but I need a more flexible way to index data without any intermediate POJO.
>
> That's because when changing, adding or removing new fields I don't want to change continously that POJO again and again.
>
> --
> - Luis Cappa
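One flexible way to do this without a POJO (a sketch: it assumes flat JSON objects, uses Jackson only as an example JSON parser, and the Solr URL and field names are placeholders) is to parse the String into a Map and copy the entries into a SolrInputDocument:

import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonStringIndexer {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void index(HttpSolrServer server, String json) throws Exception {
        // Parse the raw JSON string into a generic map (no POJO involved).
        @SuppressWarnings("unchecked")
        Map<String, Object> fields = MAPPER.readValue(json, Map.class);

        // Copy every JSON property into the document; whether the field names
        // match schema.xml is left for the server to complain about.
        SolrInputDocument doc = new SolrInputDocument();
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            doc.addField(entry.getKey(), entry.getValue());
        }
        server.add(doc);
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/mysolr");
        index(server, "{\"id\":\"doc-1\",\"title\":\"Hello\",\"mydate\":\"2013-05-13T00:00:00Z\"}");
        server.commit();
        server.shutdown();
    }
}

For lists of documents or nested JSON a bit more mapping logic is needed, or the raw string can simply be POSTed to the /update/json handler with any HTTP client, bypassing SolrJ for the indexing path.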