Entity with multiple datasources

2012-02-16 Thread Radu Toev
Hello,

I created a data-config.xml file where I define a datasource and an entity
with 12 fields.
In my use case I have 2 databases with the same schema, so I want to
combine the 2 databases into one index.
I defined a second dataSource tag and duplicated the entity with its
fields (changing the name and the datasource).
What I'm expecting is to get around 7k results (I have around 6k in the
first db and 1k in the second). However, I'm getting a total of 2k.
Where could the problem be?

Thanks


Re: 'foruns' don't match 'forum' with NGramFilterFactory (or EdgeNGramFilterFactory)

2012-02-16 Thread Dirceu Vieira
Hi,

It's funny that if you try "fóruns" it matches:
http://bhakta.casadomato.org:8982/solr/select/?q=f%C3%B3runs&version=2.2&start=0&rows=10&indent=on
But when you try "foruns", it does not match.

Check this out...

http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3rum&qverbose=on&qval=foruns

See that stemming does not work for the word foruns.

Could it be because fórum is part of the PT dictionary but not forum?
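One thing that might be worth trying (just a sketch, not verified against your
schema): adding an ASCIIFoldingFilterFactory to both the index- and query-time
analyzers of the "text" field type, so that the accented indexed token "fóruns"
and the unaccented query "foruns" are folded to the same form before stemming
or n-gramming is applied:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- fold fóruns -> foruns on both index and query side -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- existing stemmer / (Edge)NGram filters would follow here -->
  </analyzer>
</fieldType>

It would not fix the singular/plural case by itself, but it removes the accent
mismatch.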

Regards,

2012/2/14 Bráulio Bhavamitra 

> Hello all,
>
> I'm experimenting with NGramFilterFactory and EdgeNGramFilterFactory.
>
> Both of them show a match in my Solr admin analysis, but when I query
> 'foruns'
> it doesn't find any 'forum'.
> analysis
>
> http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3runs&qverbose=on&qval=f%C3%B3runs
> search
>
> http://bhakta.casadomato.org:8982/solr/select/?q=foruns&version=2.2&start=0&rows=10&indent=on
>
> Anybody knows what's the problem?
>
> bráulio
>



-- 
Dirceu Vieira Júnior
---
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr


problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
Hi all,
I have a problem configuring PDF indexing from a directory in my Solr
with DIH:

with this data-config:

[data-config.xml stripped by the list archive; it uses a FileListEntityProcessor
over D:\gioconews_archivio\marzo2011 with fileName=".*pdf" and a second entity
that reads each file; a partial copy is quoted in Gora Mohanty's reply below]

I obtain this result (DIH status response, tags stripped by the archive):

  full-import
  idle
  0:0:2.44
  0
  43
  0
  2012-02-12 19:06:00
  Indexing failed. Rolled back all changes.
  2012-02-12 19:06:00

suggestions?
thank you
alessio


Best requestHandler for "typing error".

2012-02-16 Thread stockii
Hello.

Which request handler do you use to find typing errors, like "goolge" => do you
mean "google"?!

I want to combine my "EdgeNGram" autosuggestion with a clever autocorrection!



What do you use ?
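Not sure what others use, but the stock SpellCheckComponent with collation is
one option; a rough solrconfig.xml sketch (the field and handler names here are
just placeholders):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- field the spelling index is built from -->
    <str name="field">suggest_text</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/didyoumean" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">5</str>
    <!-- collate returns a rewritten query, e.g. "google" for "goolge" -->
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>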

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents other Cores < 200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-requestHandler-for-typing-error-tp3749576p3749576.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread Kashif Khan
I kept the old schema and solrconfig files, but there were some errors due to
which Solr was not loading. I don't know what those errors are. We have a few of
our own custom plugins developed against 1.4.1.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3749629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread Kashif Khan
We have both stored=true and stored=false fields in the schema, so we can't
reindex the way you suggested. We tried that earlier.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3749631.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Mikhail Khludnev
Please find my replies inline.

On Thu, Feb 16, 2012 at 10:30 AM, Alexey Verkhovsky <
alexey.verkhov...@gmail.com> wrote:

> Hi, all,
>
> I'm new here. Used Solr on a couple of projects before, but didn't need to
> dive deep into anything until now. These days, I'm doing a spike for a
> "yellow pages" type search server with the following technical
> requirements:
>
> ~10 mln listings in the database. A listing has a name, address,
> description, coordinates and a number of tags / filtering fields; no more
> than a kilobyte all told; i.e. theoretically the whole thing should fit in
> RAM without sharding. A typical query is either "all text matches on name
> and/or description within a bounded box", or "some combination of tag
> matches within a bounded box". Bounded boxes are 1 to 50 km wide, and
> contain up to 10^5 unfiltered listings (the average is more like 10^3).
> More than 50% of all the listings are in the frequently requested bounding
> boxes, however a vast majority of listings are almost never displayed
> (because they don't match the other filters).
>
> Data "never changes" (i.e., a daily batch update; rebuild of the entire
> index and restart of all search servers is feasible, as long as it takes
> minutes, not hours).

Everybody starts with a daily bounce but ends up with an UPDATED_AT column and
delta updates; just consider the urgent-content-fix use case. I don't think it's
worth relying on a daily bounce as a cornerstone of the architecture.


> This thing ideally should serve up to 10^3 requests
> per second on a small (as in, "less than 10 commodity boxes") cluster. In
> other words, a typical request should be CPU bound and take ~100-200 msec
> to process. Because of coordinates (that are almost never the same),
> caching of queries makes no sense;

You can use a grid of coordinates to reduce their entropy; if you filter by
bounding box, the argument is the bounding box, not the coordinates. Anyway, use
post-filtering and cache=false for such filters:
http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/
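For example, roughly (a sketch; "geopoint" is a made-up field name here, and
whether a given filter runs as a true post-filter depends on its implementation,
frange being the case documented in the post above):

fq={!bbox cache=false cost=100 sfield=geopoint pt=45.6789,-123.0123 d=25}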


> from what little I understand about
> Lucene internals, caching of filters probably doesn't make sense either.
>
But Solr does it: http://wiki.apache.org/solr/SolrCaching#filterCache

>
> After perusing documentation and some googling (but almost no source code
> exploring yet), I understand how the schema and the queries will look like,
> and now have to figure out a specific configuration that fits the
> performance/scalability requirements. Here is what I'm thinking:
>
> 1. Search server is an internal service that uses embedded Solr for the
> indexing part. RAMDirectoryFactory as index storage.
>
Bad idea. It's intended mostly for tests; the closest production-oriented
analogue is org.apache.lucene.store.instantiated.InstantiatedIndex.


> 2. All data is in some sort of persistent storage on a file system, and is
> loaded into the memory when a search server starts up.
>
AFAIK the state of the art is to use a file-based directory (MMap or whatever)
and rely on the Linux file-system RAM cache. Also, Solr (and partially Lucene)
caches some things on the heap itself:
http://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration.
So this is mostly done already.
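In solrconfig.xml that would be roughly (a sketch; the default
StandardDirectoryFactory already picks a sensible implementation per platform):

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>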


> 3. Data updates are handled as "update the persistent storage, start
> another cluster, load the world into RAM, flip the load balancer, kill the
> old cluster"
>
No again. Lucene has a pretty cool model of segments and generations designed
for incremental updates, and Solr does a lot to search the old generation and
warm up the new one simultaneously (it just takes some memory, you know, two
times). I don't think a manual A/B scheme is applicable. Anyway, you can (but
don't really need to) play with the replication facilities, e.g. disable traffic
for half of the nodes, push the new index to them, let them warm up, then enable
traffic (such machinery never works smoothly due to the number of moving parts).


> 4. Solr returns IDs with relevance scores; actual presentations of listings
> (as JSON documents) are constructed outside of Solr and cached in
> Memcached, as a mostly static content with a few templated bits, like
> <%=DISTANCE_TO(-123.0123, 45.6789) %>.
>
Using separate nodes to do the search and other nodes to stream the content
sounds good (it's mentioned in every book). It looks like besides the score you
can also return the distance to the user, i.e. there is no need for
<%=DISTANCE_TO(-123.0123, 45.6789) %>, just <%=doc.DISTANCE%>; see
http://wiki.apache.org/solr/SpatialSearch?#Returning_the_distance
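For illustration, one way that page describes is returning the distance as the
score, roughly (a sketch, parameter values invented; note the main query then
ranks by distance rather than relevance):

q={!func}geodist()&sfield=geopoint&pt=45.6789,-123.0123&sort=score asc&fl=*,score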



> 5. All Solr caching is switched off.
>
But why?



>
> Obviously, we are not the first people to do something like this with Solr,
> so I'm hoping for some collective wisdom on the following:
>
> Does this sounds like a feasible set of requirements in terms of
> performance and scalability for Solr? Are we on the right path to solving
> this problem well? If not, what should we be doing instead? What nasty
> technical/architectural gotchas are we probably missing at this stage?
>
> One particular 

Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
1. Do you see any errors / exceptions in the logs?
2. Could you have duplicates?

On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev  wrote:

> Hello,
>
> I created a data-config.xml file where I define a datasource and an entity
> with 12 fields.
> In my use case I have 2 databases with the same schema, so I want to
> combine in one index the 2 databases.
> I defined a second dataSource tag and duplicateed the entity with its
> field(changed the name and the datasource).
> What I'm expecting is to get around 7k results(I have around 6k in the
> first db and 1k in the second). However I'm getting a total of 2k.
> Where could be the problem?
>
> Thanks
>



-- 
Regards,

Dmitry Kan


Re: Spatial Search and faceting

2012-02-16 Thread Eric Grobler
Hi William,

Thanks for the feedback.

I will try the group query and see how the performance with 2 queries is.
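For reference, gluing the grouping suggestion onto the original parameters would
look roughly like this (an untested sketch using the field names from the first
message):

q=iphone
fq={!bbox}
sfield=geopoint
pt=49.594857,8.468614
d=50
group=true
group.field=city
group.limit=1
sort=geodist() asc
fl=id,city,geopoint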

Best Regards
Ericz

On Thu, Feb 16, 2012 at 4:06 AM, William Bell  wrote:

> One way to do it is to group by city and then sort=geodist() asc
>
> select?group=true&group.field=city&sort=geodist() desc&rows=10&fl=city
>
> It might require 2 calls to SOLR to get it the way you want.
>
> On Wed, Feb 15, 2012 at 5:51 PM, Eric Grobler 
> wrote:
> > Hi Solr community,
> >
> > I am doing a spatial search and then do a facet by city.
> > Is it possible to then sort the faceted cities by distance?
> >
> > We would like to display the hits per city, but sort them by distance.
> >
> > Thanks & Regards
> > Ericz
> >
> > q=iphone
> > fq={!bbox}
> > sfield=geopoint
> > pt=49.594857,8.468614
> > d=50
> > fl=id,description,city,geopoint
> >
> > facet=true
> > facet.field=city
> > f.city.facet.limit=10
> > f.city.facet.sort=count //geodist() asc
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


Realtime search with multi clients updating index simultaneously.

2012-02-16 Thread v_shan
I have a helpdesk application developed in PHP/MySQL. I want to implement
real-time full-text search and I have shortlisted Solr. The MySQL database will
store all the tickets and their updates, and that data will be imported to build
the Solr index. All search requests will be handled by Solr.

What I want is a real time search. The moment someone updates a ticket, it
should be available for search. 

As per my understanding of Solr, this is how I think the system will work. 
A user updates a ticket -> database record is modified -> a request is sent
to Solr server to modify corresponding document in index.

I have read a book on Solr and below questions are troubling me.
1. The book mentions that "commits are slow in Solr. Depending on the index
size, Solr's auto-warming
configuration, and Solr's cache state prior to committing, a commit can take
a non-trivial amount of time. Typically, it takes a few seconds, but it can
take
some number of minutes in extreme cases". If this is true, then how will I
know when the data will be available for search, and how can I implement
real-time search? Also, I don't want the ticket update operation to be slowed
down (by adding the extra step of updating the Solr index).
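One bounded-latency option that already exists in 3.5 is commitWithin on the
update request: it asks Solr to make the document searchable within the given
number of milliseconds without the client issuing (and waiting on) an explicit
commit. A rough sketch (field names invented):

<add commitWithin="5000">
  <doc>
    <field name="id">TICKET-123</field>
    <field name="subject">Printer jams on floor 3</field>
  </doc>
</add>

True soft commits only arrive with Solr 4.x / NRT, so a few seconds of latency
is about the best you can expect on 3.5.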

2. It is also mentioned that "there is no transaction isolation. This means
that if more than one Solr client
were to submit modifications and commit them at overlapping times, it is
possible for part of one client's set of changes to be committed before that
client told Solr to commit. This applies to rollback as well. If this is a
problem
for your architecture then consider using one client process responsible for
updating Solr."

Does it mean that due to lack of transactional commits, Solr can mess up the
updates when multiple people update the ticket simultaneously?

Now the question before me is: Is Solr fit in my case? If yes, How?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Realtime-search-with-multi-clients-updating-index-simultaneously-tp3749881p3749881.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
1. Nothing in the logs
2. No.

On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan  wrote:

> 1. Do you see any errors / exceptions in the logs?
> 2. Could you have duplicates?
>
> On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev  wrote:
>
> > Hello,
> >
> > I created a data-config.xml file where I define a datasource and an
> entity
> > with 12 fields.
> > In my use case I have 2 databases with the same schema, so I want to
> > combine in one index the 2 databases.
> > I defined a second dataSource tag and duplicateed the entity with its
> > field(changed the name and the datasource).
> > What I'm expecting is to get around 7k results(I have around 6k in the
> > first db and 1k in the second). However I'm getting a total of 2k.
> > Where could be the problem?
> >
> > Thanks
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
It sounds a bit, as if SOLR stopped processing data once it queried all
from the smaller dataset. That's why you have 2000. If you just have a
handler pointed to the bigger data set (6k), do you manage to get all 6k db
entries into solr?

On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:

> 1. Nothing in the logs
> 2. No.
>
> On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan  wrote:
>
> > 1. Do you see any errors / exceptions in the logs?
> > 2. Could you have duplicates?
> >
> > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev  wrote:
> >
> > > Hello,
> > >
> > > I created a data-config.xml file where I define a datasource and an
> > entity
> > > with 12 fields.
> > > In my use case I have 2 databases with the same schema, so I want to
> > > combine in one index the 2 databases.
> > > I defined a second dataSource tag and duplicateed the entity with its
> > > field(changed the name and the datasource).
> > > What I'm expecting is to get around 7k results(I have around 6k in the
> > > first db and 1k in the second). However I'm getting a total of 2k.
> > > Where could be the problem?
> > >
> > > Thanks
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
I tried running with just one datasource (the one that has 6k entries) and
it indexes them OK.
The same if I do the 1k database separately: it indexes OK.

On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan  wrote:

> It sounds a bit, as if SOLR stopped processing data once it queried all
> from the smaller dataset. That's why you have 2000. If you just have a
> handler pointed to the bigger data set (6k), do you manage to get all 6k db
> entries into solr?
>
> On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:
>
> > 1. Nothing in the logs
> > 2. No.
> >
> > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan 
> wrote:
> >
> > > 1. Do you see any errors / exceptions in the logs?
> > > 2. Could you have duplicates?
> > >
> > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> wrote:
> > >
> > > > Hello,
> > > >
> > > > I created a data-config.xml file where I define a datasource and an
> > > entity
> > > > with 12 fields.
> > > > In my use case I have 2 databases with the same schema, so I want to
> > > > combine in one index the 2 databases.
> > > > I defined a second dataSource tag and duplicateed the entity with its
> > > > field(changed the name and the datasource).
> > > > What I'm expecting is to get around 7k results(I have around 6k in
> the
> > > > first db and 1k in the second). However I'm getting a total of 2k.
> > > > Where could be the problem?
> > > >
> > > > Thanks
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
OK, maybe you can show the db-data-config.xml just in case?
Also, in schema.xml, does your uniqueKey correspond to the unique field in
the db?

On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:

> I tried running with just one datasource(the one that has 6k entries) and
> it indexes them ok.
> The same, if I do sepparately the 1k database. It indexes ok.
>
> On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan  wrote:
>
> > It sounds a bit, as if SOLR stopped processing data once it queried all
> > from the smaller dataset. That's why you have 2000. If you just have a
> > handler pointed to the bigger data set (6k), do you manage to get all 6k
> db
> > entries into solr?
> >
> > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:
> >
> > > 1. Nothing in the logs
> > > 2. No.
> > >
> > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan 
> > wrote:
> > >
> > > > 1. Do you see any errors / exceptions in the logs?
> > > > 2. Could you have duplicates?
> > > >
> > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I created a data-config.xml file where I define a datasource and an
> > > > entity
> > > > > with 12 fields.
> > > > > In my use case I have 2 databases with the same schema, so I want
> to
> > > > > combine in one index the 2 databases.
> > > > > I defined a second dataSource tag and duplicateed the entity with
> its
> > > > > field(changed the name and the datasource).
> > > > > What I'm expecting is to get around 7k results(I have around 6k in
> > the
> > > > > first db and 1k in the second). However I'm getting a total of 2k.
> > > > > Where could be the problem?
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Dmitry Kan
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev

[data-config.xml stripped by the list archive; it defines two SQL Server
dataSources ("s" and "p") and two entities, each selecting the same columns
from the Machine table of one database; the full queries are quoted in
Dmitry Kan's reply further down]
I've removed the connection params
The unique key is id.

On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan  wrote:

> OK, maybe you can show the db-data-config.xml just in case?
> Also in schema.xml, does you  correspond to the unique field in
> the db?
>
> On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:
>
> > I tried running with just one datasource(the one that has 6k entries) and
> > it indexes them ok.
> > The same, if I do sepparately the 1k database. It indexes ok.
> >
> > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> wrote:
> >
> > > It sounds a bit, as if SOLR stopped processing data once it queried all
> > > from the smaller dataset. That's why you have 2000. If you just have a
> > > handler pointed to the bigger data set (6k), do you manage to get all
> 6k
> > db
> > > entries into solr?
> > >
> > > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:
> > >
> > > > 1. Nothing in the logs
> > > > 2. No.
> > > >
> > > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan 
> > > wrote:
> > > >
> > > > > 1. Do you see any errors / exceptions in the logs?
> > > > > 2. Could you have duplicates?
> > > > >
> > > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I created a data-config.xml file where I define a datasource and
> an
> > > > > entity
> > > > > > with 12 fields.
> > > > > > In my use case I have 2 databases with the same schema, so I want
> > to
> > > > > > combine in one index the 2 databases.
> > > > > > I defined a second dataSource tag and duplicateed the entity with
> > its
> > > > > > field(changed the name and the datasource).
> > > > > > What I'm expecting is to get around 7k results(I have around 6k
> in
> > > the
> > > > > > first db and 1k in the second). However I'm getting a total of
> 2k.
> > > > > > Where could be the problem?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Dmitry Kan
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


How to loop through the DataImportHandler query results?

2012-02-16 Thread K, Baraneetharan

Hi Solr community,

I'm new to Solr and DataImportHandler. I have a requirement to fetch data
from a database table and pass it to Solr.

Part of existing data-config.xml and solr schema.xml are given below,

data-config.xml

[excerpt stripped by the list archive; it defines an entity whose query selects
from the adap table, with one field tag per column]

Schema.xml

[excerpt stripped by the list archive; it declares the corresponding Solr fields]

The table used in the query (adap) is often modified; the number of columns in
this table changes frequently. Hence we have to change the data-config.xml
whenever a field is added or deleted.
To avoid that, we don't want to mention the column names in the field tags, but
instead want to write a query that maps all the fields in the table to Solr
fields even if we don't know how many columns are in the table. I need a kind of
loop which runs through all the query results and maps them to Solr fields.

Please help me.

Regards,
Baranee


Re: SolrCloud Replication Question

2012-02-16 Thread Mark Miller

On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote:

>  Not sure if this is
> expected or not.

Nope - should be already resolved or will be today though.

- Mark Miller
lucidimagination.com

Re: SolrCloud Replication Question

2012-02-16 Thread Jamie Johnson
Ok, great.  Just wanted to make sure someone was aware.  Thanks for
looking into this.

On Thu, Feb 16, 2012 at 8:26 AM, Mark Miller  wrote:
>
> On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote:
>
>>  Not sure if this is
>> expected or not.
>
> Nope - should be already resolved or will be today though.
>
> - Mark Miller
> lucidimagination.com


PatternReplaceFilterFactory group

2012-02-16 Thread O. Klein
PatternReplaceFilterFactory has no option to select the group to replace.

Is there a reason for this, or could this be a nice feature? 
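As far as I know, the usual workaround is to capture the parts you want to keep
and echo them back with group references in the replacement string, e.g. (a
sketch):

<filter class="solr.PatternReplaceFilterFactory"
        pattern="([a-z]+)-(\d+)" replacement="$1" replace="all"/>

which keeps group 1 and drops group 2, so in effect the group is selected
through the replacement rather than through a dedicated option.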

--
View this message in context: 
http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-tp3750201p3750201.html
Sent from the Solr - User mailing list archive at Nabble.com.


custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello all:

We'd like to score the matching documents using a combination of SOLR's IR
score with another application-specific score that we store within the
documents themselves (i.e. a float field containing the app-specific
score). In particular, we'd like to calculate the final score doing some
operations with both numbers (i.e product, sqrt, ...)

According to what we know, there are two ways to do this in SOLR:

A) Sort by function [1]: We've tested an expression like
"sort=product(score, query_score)" in the SOLR query, where score is the
common SOLR IR score and query_score is our own precalculated score, but it
seems that SOLR can only do this with stored/indexed fields (and obviously
"score" is not stored/indexed).

B) Function queries: We've used _val_ and function queries like max, sqrt
and query, and we've obtained the desired results from a functional point
of view. However, our index is quite large (400M documents) and the
performance degrades heavily, given that function queries are AFAIK
matching all the documents.

I have two questions:

1) Apart from the two options I mentioned, is there any other (simple) way
to achieve this that we're not aware of?

2) If we have to choose the function queries path, would it be very
difficult to modify the actual implementation so that it doesn't match all
the documents, that is, to pass a query so that it only operates over the
documents matching the query?. Looking at the FunctionQuery.java source
code, there's a comment that says "// instead of matching all docs, we
could also embed a query. the score could either ignore the subscore, or
boost it", which is giving us some hope that maybe it's possible and even
desirable to go in this direction. If you can give us some directions about
how to go about this, we may be able to do the actual implementation.

BTW, we're using Lucene/SOLR trunk.

Thanks a lot for your help.
Carlos

[1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
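A third option that may be worth checking (a sketch, not verified on an index of
that size): the boost query parser, which multiplies the IR score of the wrapped
query by a function value and only evaluates that function for documents that
actually match the query:

q={!boost b=sqrt(query_score) v=$qq}&qq=the user query here

Because the function is computed per matching document during scoring, it should
avoid the match-all behaviour of a bare function query.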


Re: problem to indexing pdf directory

2012-02-16 Thread Gora Mohanty
On 16 February 2012 14:33, alessio crisantemi
 wrote:
> Hi all,
> I have a problem to configure a pdf indexing from a directory in my solr
> wit DIH:
>
> with this data-config
>
>
> 
>  
>  
>      name="tika-test"
>    processor="FileListEntityProcessor"
>    baseDir="D:\gioconews_archivio\marzo2011"
>    fileName=".*pdf"
>    recursive="true"
>    rootEntity="false"
>    dataSource="null"/>
>   url="D:\gioconews_archivio\marzo2011" format="text" >
>   
>   
>     
>     
>
>     
>     
>  
>  
> 
[...]

You should look in your Solr logs for more details about
the exception, but as things stand, the above setup will
not work for indexing PDF files. You need Tika. Searching
Google for "solr tika index pdf" turns up many possibilities,
e.g.,
http://www.abcseo.com/tech/search/integrating-solr-and-tika
http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/
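For what it's worth, a minimal sketch of the usual FileListEntityProcessor +
TikaEntityProcessor combination (untested; the path is taken from the original
mail, the field names must match your schema, and the dataimporthandler-extras
and Tika jars need to be on the classpath):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <!-- "text" is the body extracted by Tika -->
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>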

Regards,
Gora


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Mikhail Khludnev
Hi Baranee,

Some time ago I played with
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was
pretty good stuff.
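A rough, untested sketch of that idea: a ScriptTransformer that copies every
column the query returns into a dynamic field, so data-config.xml never has to
list the columns (this assumes a *_t dynamic field is declared in schema.xml,
and the uniqueKey column would still need an explicit mapping):

<dataConfig>
  <dataSource driver="..." url="..." user="..." password=""/>
  <script><![CDATA[
    function mapAllColumns(row) {
      var names = row.keySet().toArray();
      for (var i = 0; i < names.length; i++) {
        // copy each DB column into a dynamic Solr field, e.g. NAME -> name_t
        row.put(names[i].toLowerCase() + '_t', row.get(names[i]));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="adap" query="SELECT * FROM adap"
            transformer="script:mapAllColumns"/>
  </document>
</dataConfig>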

Regards


On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan wrote:

> To avoid that we don't want to mention the column names in the field tag ,
> but want to write a query to map all the fields in the table with solr
> fileds even if we don't know, how many columns are there in the table.  I
> need a kind of loop which runs through all the query results and map that
> with solr fileds.




-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: Solr soft commit feature

2012-02-16 Thread Nagendra Nagarajayya
The slaves will be able to replicate from the master as before, but not in NRT,
depending on your commit interval. The commit interval can be set higher for NRT
since it is not needed for searches except for consolidating the index changes
on the master, and can be an hour or even more. It may be easier to update the
slaves directly, as the update/query performance is high (replication in the
cloud in 4.0 also follows a similar paradigm, as the docs are sent across as a
whole to be replicated, so for now you may have to do this manually).


- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org


On 2/15/2012 8:35 AM, Dipti Srivastava wrote:

Hi Nagendra,

Certainly interesting! Would this work in a Master/slave setup where the
reads are from the slaves and all writes are to the master?

Regards,
Dipti Srivastava


On 2/15/12 5:40 AM, "Nagendra Nagarajayya"
wrote:


If you are looking for NRT functionality with Solr 3.5, you may want to
take a look at Solr 3.5 with RankingAlgorithm. This allows you to
add/update documents without a commit while being able to search
concurrently. The add/update performance to add 1m docs is about 5000
docs in about 498 ms  with one concurrent searcher. You can get more
information about Solr 3.5 with RankingAlgorithm from here:

http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote:

Hi All,
Is there a way to soft commit in the current released version of solr
3.5?

Regards,
Dipti Srivastava



Re: Can I rebuild an index and remove some fields?

2012-02-16 Thread Robert Stewart
I will test it with my big production indexes first, if it works I
will port to Java and add to contrib I think.

On Wed, Feb 15, 2012 at 10:03 PM, Li Li  wrote:
> great. I think you could make it a public tool. maybe others also need such
> functionality.
>
> On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart wrote:
>
>> I implemented an index shrinker and it works.  I reduced my test index
>> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
>> need anymore.  I'm actually using Lucene.Net for this project so code
>> is C# using Lucene.Net 2.9.2 API.  But basic idea is:
>>
>> Create an IndexReader wrapper that only enumerates the terms you want
>> to keep, and that removes terms from documents when returning
>> documents.
>>
>> Use the SegmentMerger to re-write each segment (where each segment is
>> wrapped by the wrapper class), writing new segment to a new directory.
>> Collect the SegmentInfos and do a commit in order to create a new
>> segments file in new index directory
>>
>> Done - you now have a shrunk index with specified terms removed.
>>
>> Implementation uses separate thread for each segment, so it re-writes
>> them in parallel.  Took about 15 minutes to do 770,000 doc index on my
>> macbook.
>>
>>
>> On Tue, Feb 14, 2012 at 10:12 PM, Li Li  wrote:
>> > I have roughly read the codes of 4.0 trunk. maybe it's feasible.
>> >    SegmentMerger.add(IndexReader) will add to be merged Readers
>> >    merge() will call
>> >      mergeTerms(segmentWriteState);
>> >      mergePerDoc(segmentWriteState);
>> >
>> >   mergeTerms() will construct fields from IndexReaders
>> >    for(int
>> > readerIndex=0;readerIndex> >      final MergeState.IndexReaderAndLiveDocs r =
>> > mergeState.readers.get(readerIndex);
>> >      final Fields f = r.reader.fields();
>> >      final int maxDoc = r.reader.maxDoc();
>> >      if (f != null) {
>> >        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
>> >        fields.add(f);
>> >      }
>> >      docBase += maxDoc;
>> >    }
>> >    So If you wrapper your IndexReader and override its fields() method,
>> > maybe it will work for merge terms.
>> >
>> >    for DocValues, it can also override AtomicReader.docValues(). just
>> > return null for fields you want to remove. maybe it should
>> > traverse CompositeReader's getSequentialSubReaders() and wrapper each
>> > AtomicReader
>> >
>> >    other things like term vectors norms are similar.
>> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart > >wrote:
>> >
>> >> I was thinking if I make a wrapper class that aggregates another
>> >> IndexReader and filter out terms I don't want anymore it might work.
>> And
>> >> then pass that wrapper into SegmentMerger.  I think if I filter out
>> terms
>> >> on GetFieldNames(...) and Terms(...) it might work.
>> >>
>> >> Something like:
>> >>
>> >> HashSet ignoredTerms=...;
>> >>
>> >> FilteringIndexReader wrapper=new FilterIndexReader(reader);
>> >>
>> >> SegmentMerger merger=new SegmentMerger(writer);
>> >>
>> >> merger.add(wrapper);
>> >>
>> >> merger.Merge();
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>> >>
>> >> > for method 2, delete is wrong. we can't delete terms.
>> >> >   you also should hack with the tii and tis file.
>> >> >
>> >> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li  wrote:
>> >> >
>> >> >> method1, dumping data
>> >> >> for stored fields, you can traverse the whole index and save it to
>> >> >> somewhere else.
>> >> >> for indexed but not stored fields, it may be more difficult.
>> >> >>    if the indexed and not stored field is not analyzed(fields such as
>> >> >> id), it's easy to get from FieldCache.StringIndex.
>> >> >>    But for analyzed fields, though theoretically it can be restored
>> from
>> >> >> term vector and term position, it's hard to recover from index.
>> >> >>
>> >> >> method 2, hack with metadata
>> >> >> 1. indexed fields
>> >> >>      delete by query, e.g. field:*
>> >> >> 2. stored fields
>> >> >>       because all fields are stored sequentially. it's not easy to
>> >> delete
>> >> >> some fields. this will not affect search speed. but if you want to
>> get
>> >> >> stored fields,  and the useless fields are very long, then it will
>> slow
>> >> >> down.
>> >> >>       also it's possible to hack with it. but need more effort to
>> >> >> understand the index file format  and traverse the fdt/fdx file.
>> >> >>
>> >>
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>> >> >>
>> >> >> this will give you some insight.
>> >> >>
>> >> >>
>> >> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <
>> bstewart...@gmail.com
>> >> >wrote:
>> >> >>
>> >> >>> Lets say I have a large index (100M docs, 1TB, split up between 10
>> >> >>> indexes).  And a bunch of the "stored" and "indexed" fields are not
>> >> used in
>> >> >>> search at all.  In order to save memory and disk, I'd like to
>> rebuild
>> >> that
>> >> >>> index *without* those fields, but I don't have o

Payload and exact search - 2

2012-02-16 Thread leonardo2
Hello,
I already posted this question, but for some reason it was attached to a
thread with a different topic.


Is it possible to perform an 'exact search' on a payload field?

I have to index text with auxiliary info for each word. In particular, each
word is associated with the bounding box containing it in the original PDF page
(it is used for highlighting the search terms in the PDF). I used the payload
to store that information.

In the schema.xml, the fieldType definition is: 

---
[fieldType definition stripped by the list archive]
---

while the field definition is: 

---
[field definition stripped by the list archive]
---

When indexing, the field 'words' contains a list of word|box as in the
following example: 

--- 
doc_id=example 
words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25
di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25} 
--- 

This solution works well except in the case of an exact (phrase) search. For
example, assuming the only indexed doc is the 'example' doc shown above, the
query words:"Comune di Bologna" returns no results.

Does someone know if it is possible to perform an 'exact search' on a payload
field?

Thanks in advance, 
Leonardo

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Payload-and-exact-search-2-tp3750355p3750355.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
I think the problem here is that you are actually trying to create separate
documents for two different tables, while your config is aiming to create
only one document. Here is one solution (not tried by me):

--
You can have multiple documents generated by the same data-config:

[example config stripped by the list archive]

It's the rootEntity="false" that makes the child entity a document.
--

Dmitry

On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:

> 
>   name="s"
> driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> url=""
> user=""
> password=""/>
>   name="p"
>  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> url=""
> user=""
> password=""/>
>  
>datasource="s"
> query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as m_delivery_date,
> m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as m_warranty,
> m.contract as m_contract,
>   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> m_c_code
>   FROM Machine AS m
>   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
>   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
>   LEFT JOIN Platform AS p ON m.fk_platform = p.id
>   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
>   LEFT JOIN Country AS c ON fk_country = c.id"
> readOnly="true"
> transformer="DateFormatTransformer">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   
>
>   datasource="p"
> query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as m_delivery_date,
> m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as m_warranty,
> m.contract as m_contract,
>   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> m_c_code
>   FROM Machine AS m
>   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
>   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
>   LEFT JOIN Platform AS p ON m.fk_platform = p.id
>   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
>   LEFT JOIN Country AS c ON fk_country = c.id"
> readOnly="true"
> transformer="DateFormatTransformer">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   
>  
> 
>
> I've removed the connection params
> The unique key is id.
>
> On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan  wrote:
>
> > OK, maybe you can show the db-data-config.xml just in case?
> > Also in schema.xml, does you  correspond to the unique field
> in
> > the db?
> >
> > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:
> >
> > > I tried running with just one datasource(the one that has 6k entries)
> and
> > > it indexes them ok.
> > > The same, if I do sepparately the 1k database. It indexes ok.
> > >
> > > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> > wrote:
> > >
> > > > It sounds a bit, as if SOLR stopped processing data once it queried
> all
> > > > from the smaller dataset. That's why you have 2000. If you just have
> a
> > > > handler pointed to the bigger data set (6k), do you manage to get all
> > 6k
> > > db
> > > > entries into solr?
> > > >
> > > > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev 
> wrote:
> > > >
> > > > > 1. Nothing in the logs
> > > > > 2. No.
> > > > >
> > > > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan  >
> > > > wrote:
> > > > >
> > > > > > 1. Do you see any errors / exceptions in the logs?
> > > > > > 2. Could you have duplicates?
> > > > > >
> > > > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I created a data-config.xml file where I define a datasource
> and
> > an
> > > > > > entity
> > > > > > > with 12 fields.
> > > > > > > In my use case I have 2 databases with the same schema, so I
> want
> > > to
> > > > > > > combine in one index the 2 databases.
> > > > > > > I defined a second dataSource tag and duplicateed the entity
> with
> > > its
> > > > > > > field(changed the name and the datasource).
> > > > > > > What I'm expecting is to get around 7k results(I have around 6k
> > in
> > > > the
> > > > > > > first db and 1k in the second). However I'm getting a total of
> > 2k.
> > > > > > > Where could be the problem?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Dmitry Kan
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Dmitry Kan
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
I'm not sure I follow.
The idea is to have only one document. Do the multiple documents have the
same structure then (different datasources), and if so, how are they actually
indexed?

Thanks.

On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan  wrote:

> I think the problem here is that initially you trying to create separate
> documents for two different tables, while your config is aiming to create
> only one document. Here there is one solution (not tried by me):
>
> --
> You can have multiple documents generated by the same data-config:
>
> 
>  
>  
>  
>  
>   
>   
>  
>   
>   
>  
>   
>  
> 
>
> It's the 'rootEntity="false" that makes the child entity a document.
> --
>
> Dmitry
>
> On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:
>
> > 
> >   > name="s"
> > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > url=""
> > user=""
> > password=""/>
> >   > name="p"
> >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > url=""
> > user=""
> > password=""/>
> >  
> > >datasource="s"
> > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> m_delivery_date,
> > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> m_warranty,
> > m.contract as m_contract,
> >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> > m_c_code
> >   FROM Machine AS m
> >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> >   LEFT JOIN Country AS c ON fk_country = c.id"
> > readOnly="true"
> > transformer="DateFormatTransformer">
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >   
> >
> >>datasource="p"
> > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> m_delivery_date,
> > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> m_warranty,
> > m.contract as m_contract,
> >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> > m_c_code
> >   FROM Machine AS m
> >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> >   LEFT JOIN Country AS c ON fk_country = c.id"
> > readOnly="true"
> > transformer="DateFormatTransformer">
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >   
> >  
> > 
> >
> > I've removed the connection params
> > The unique key is id.
> >
> > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> wrote:
> >
> > > OK, maybe you can show the db-data-config.xml just in case?
> > > Also in schema.xml, does you  correspond to the unique field
> > in
> > > the db?
> > >
> > > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:
> > >
> > > > I tried running with just one datasource(the one that has 6k entries)
> > and
> > > > it indexes them ok.
> > > > The same, if I do sepparately the 1k database. It indexes ok.
> > > >
> > > > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> > > wrote:
> > > >
> > > > > It sounds a bit, as if SOLR stopped processing data once it queried
> > all
> > > > > from the smaller dataset. That's why you have 2000. If you just
> have
> > a
> > > > > handler pointed to the bigger data set (6k), do you manage to get
> all
> > > 6k
> > > > db
> > > > > entries into solr?
> > > > >
> > > > > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev 
> > wrote:
> > > > >
> > > > > > 1. Nothing in the logs
> > > > > > 2. No.
> > > > > >
> > > > > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan <
> dmitry@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > 1. Do you see any errors / exceptions in the logs?
> > > > > > > 2. Could you have duplicates?
> > > > > > >
> > > > > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev <
> radut...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I created a data-config.xml file where I define a datasource
> > and
> > > an
> > > > > > > entity
> > > > > > > > with 12 fields.
> > > > > > > > In my use case I have 2 databases with the same schema, so I
> > want
> > > > to
> > > > > > > > combine in one index the 2 databases.
> > > > > > > > I defined a second dataSource tag and duplicateed the entity
> > with
> > > > its
> > > > > > > > field(changed the name and the datasource).
> > > 

Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
yes, but if I use TikaEntityProcessor the result of my full-import is:

  0
  1
  0
  Indexing failed. Rolled back all changes.

2012/2/16 alessio crisantemi 

> Hi all,
> I have a problem to configure a pdf indexing from a directory in my solr
> wit DIH:
>
> with this data-config
>
>
> 
>  
>  
>name="tika-test"
> processor="FileListEntityProcessor"
> baseDir="D:\gioconews_archivio\marzo2011"
> fileName=".*pdf"
> recursive="true"
> rootEntity="false"
> dataSource="null"/>
>url="D:\gioconews_archivio\marzo2011" format="text" >
>
>
>  
>  
>
>  
>  
>   
>  
> 
>
> I obtain this result:
>
>
>
>   full-import
>
>   idle
>
>   
>
> - 
>
>   0:0:2.44
>
>   0
>
>   43
>
>   0
>
>   2012-02-12 19:06:00
>
>   Indexing failed. Rolled back all changes.
>
>   2012-02-12 19:06:00
>   
>
>
> suggestions?
> thank you
> alessio
>


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
Each document in SOLR will correspond to one db record and since both
databases have the same schema, you can't index two records from two
databases into the same SOLR document.

So after indexing, you should have 7k different documents, each of which
holds data from a db record.

Also, one problem I see here is that since the record id in each table is
unique only within the table and (most probably) not globally, there will
be collisions. To avoid this, I would prepend the record id with some static
value, like: concat("t1", CONVERT(id, CHAR(8))).
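In the config quoted above that would mean changing the select list of each
entity along these lines (a sketch; SQL Server syntax assumed, since the driver
is the sqlserver JDBC driver):

query="SELECT 's_' + CONVERT(VARCHAR(16), m.id) AS id, m.serial AS m_machine_serial, ... FROM Machine AS m ..."

with a different prefix (e.g. 'p_') in the second entity, so the two databases
can never collide on the uniqueKey.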

Dmitry

On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev  wrote:

> I'm not sure I follow.
> The idea is to have only one document. Do the multiple documents have the
> same structure then(different datasources), and if so how are they actually
> indexed?
>
> Thanks.
>
> On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan  wrote:
>
> > I think the problem here is that initially you trying to create separate
> > documents for two different tables, while your config is aiming to create
> > only one document. Here there is one solution (not tried by me):
> >
> > --
> > You can have multiple documents generated by the same data-config:
> >
> > 
> >  
> >  
> >  
> >  
> >   
> >   
> >  
> >   
> >   
> >  
> >   
> >  
> > 
> >
> > It's the 'rootEntity="false" that makes the child entity a document.
> > --
> >
> > Dmitry
> >
> > On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:
> >
> > > 
> > >   > > name="s"
> > > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > url=""
> > > user=""
> > > password=""/>
> > >   > > name="p"
> > >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > url=""
> > > user=""
> > > password=""/>
> > >  
> > > > >datasource="s"
> > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > m_delivery_date,
> > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > m_warranty,
> > > m.contract as m_contract,
> > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> as
> > > m_c_code
> > >   FROM Machine AS m
> > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > readOnly="true"
> > > transformer="DateFormatTransformer">
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > >   
> > >
> > >> >datasource="p"
> > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > m_delivery_date,
> > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > m_warranty,
> > > m.contract as m_contract,
> > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> as
> > > m_c_code
> > >   FROM Machine AS m
> > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > readOnly="true"
> > > transformer="DateFormatTransformer">
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > >   
> > >  
> > > 
> > >
> > > I've removed the connection params
> > > The unique key is id.
> > >
> > > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> > wrote:
> > >
> > > > OK, maybe you can show the db-data-config.xml just in case?
> > > > Also in schema.xml, does you  correspond to the unique
> field
> > > in
> > > > the db?
> > > >
> > > > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev 
> wrote:
> > > >
> > > > > I tried running with just one datasource(the one that has 6k
> entries)
> > > and
> > > > > it indexes them ok.
> > > > > The same, if I do sepparately the 1k database. It indexes ok.
> > > > >
> > > > > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> > > > wrote:
> > > > >
> > > > > > It sounds a bit, as if SOLR stopped processing data once it
> queried
> > > all
> > > > > > from the smaller dataset. That's why you have 2000. If you just
> > have
> > > a
> > > > > > handler pointed to the bigger data set (6k), do you manage to get
> > all
> > > > 6k
> > > > > db
> > > > > > entries into solr?
> > > > > >
> > > > > > On Thu, Feb 16, 2012 at 1:46 PM

Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

could you show us what your Solr call looks like?

Regards,
Em

Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> Hello all:
> 
> We'd like to score the matching documents using a combination of SOLR's IR
> score with another application-specific score that we store within the
> documents themselves (i.e. a float field containing the app-specific
> score). In particular, we'd like to calculate the final score doing some
> operations with both numbers (i.e product, sqrt, ...)
> 
> According to what we know, there are two ways to do this in SOLR:
> 
> A) Sort by function [1]: We've tested an expression like
> "sort=product(score, query_score)" in the SOLR query, where score is the
> common SOLR IR score and query_score is our own precalculated score, but it
> seems that SOLR can only do this with stored/indexed fields (and obviously
> "score" is not stored/indexed).
> 
> B) Function queries: We've used _val_ and function queries like max, sqrt
> and query, and we've obtained the desired results from a functional point
> of view. However, our index is quite large (400M documents) and the
> performance degrades heavily, given that function queries are AFAIK
> matching all the documents.
> 
> I have two questions:
> 
> 1) Apart from the two options I mentioned, is there any other (simple) way
> to achieve this that we're not aware of?
> 
> 2) If we have to choose the function queries path, would it be very
> difficult to modify the actual implementation so that it doesn't match all
> the documents, that is, to pass a query so that it only operates over the
> documents matching the query?. Looking at the FunctionQuery.java source
> code, there's a comment that says "// instead of matching all docs, we
> could also embed a query. the score could either ignore the subscore, or
> boost it", which is giving us some hope that maybe it's possible and even
> desirable to go in this direction. If you can give us some directions about
> how to go about this, we may be able to do the actual implementation.
> 
> BTW, we're using Lucene/SOLR trunk.
> 
> Thanks a lot for your help.
> Carlos
> 
> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> 


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
Really good point on the ids, I completely overlooked that matter.
I will give it a try.
Thanks again.

On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan  wrote:

> Each document in SOLR will correspond to one db record and since both
> databases have the same schema, you can't index two records from two
> databases into the same SOLR document.
>
> So after indexing, you should have 7k different documents, each of which
> holds data from a db record.
>
> Also one problem I see here is that since the record id in each table is
> unique only within the table and (most probably) not globally, there will
> be collisions. To aviod this, I would prepend a record_id with some static
> value, like: concat("t1",  CONVERT(id, CHAR(8))).
>
> Dmitry
>
> On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev  wrote:
>
> > I'm not sure I follow.
> > The idea is to have only one document. Do the multiple documents have the
> > same structure then(different datasources), and if so how are they
> actually
> > indexed?
> >
> > Thanks.
> >
> > On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan 
> wrote:
> >
> > > I think the problem here is that initially you trying to create
> separate
> > > documents for two different tables, while your config is aiming to
> create
> > > only one document. Here there is one solution (not tried by me):
> > >
> > > --
> > > You can have multiple documents generated by the same data-config:
> > >
> > > 
> > >  
> > >  
> > >  
> > >  
> > >   
> > >   
> > >  
> > >   
> > >   
> > >  
> > >   
> > >  
> > > 
> > >
> > > It's the 'rootEntity="false" that makes the child entity a document.
> > > --
> > >
> > > Dmitry
> > >
> > > On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:
> > >
> > > > 
> > > >   > > > name="s"
> > > > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > url=""
> > > > user=""
> > > > password=""/>
> > > >   > > > name="p"
> > > >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > url=""
> > > > user=""
> > > > password=""/>
> > > >  
> > > > > > >datasource="s"
> > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > m_delivery_date,
> > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > m_warranty,
> > > > m.contract as m_contract,
> > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> > as
> > > > m_c_code
> > > >   FROM Machine AS m
> > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > readOnly="true"
> > > > transformer="DateFormatTransformer">
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > >   
> > > >
> > > >> > >datasource="p"
> > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > m_delivery_date,
> > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > m_warranty,
> > > > m.contract as m_contract,
> > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> > as
> > > > m_c_code
> > > >   FROM Machine AS m
> > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > readOnly="true"
> > > > transformer="DateFormatTransformer">
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > >   
> > > >  
> > > > 
> > > >
> > > > I've removed the connection params
> > > > The unique key is id.
> > > >
> > > > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> > > wrote:
> > > >
> > > > > OK, maybe you can show the db-data-config.xml just in case?
> > > > > Also in schema.xml, does your  correspond to the unique
> > field
> > > > in
> > > > > the db?
> > > > >
> > > > > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev 
> > wrote:
> > > > >
> > > > > > I tried running with just one datasource (the one that has 6k
> > entries)
> > > > and
> > > > > > it indexes them ok.
> > > > > > The same, if I do separately the

Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
no problem, hope it helps, you're welcome.

On Thu, Feb 16, 2012 at 5:03 PM, Radu Toev  wrote:

> Really good point on the ids, I completely overlooked that matter.
> I will give it a try.
> Thanks again.
>
> On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan  wrote:
>
> > Each document in SOLR will correspond to one db record and since both
> > databases have the same schema, you can't index two records from two
> > databases into the same SOLR document.
> >
> > So after indexing, you should have 7k different documents, each of which
> > holds data from a db record.
> >
> > Also one problem I see here is that since the record id in each table is
> > unique only within the table and (most probably) not globally, there will
> > be collisions. To avoid this, I would prepend a record_id with some
> static
> > value, like: concat("t1",  CONVERT(id, CHAR(8))).
> >
> > Dmitry
> >
> > On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev  wrote:
> >
> > > I'm not sure I follow.
> > > The idea is to have only one document. Do the multiple documents have
> the
> > > same structure then(different datasources), and if so how are they
> > actually
> > > indexed?
> > >
> > > Thanks.
> > >
> > > On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan 
> > wrote:
> > >
> > > > I think the problem here is that initially you trying to create
> > separate
> > > > documents for two different tables, while your config is aiming to
> > create
> > > > only one document. Here there is one solution (not tried by me):
> > > >
> > > > --
> > > > You can have multiple documents generated by the same data-config:
> > > >
> > > > 
> > > >  
> > > >  
> > > >  
> > > >  
> > > >   
> > > >   
> > > >  
> > > >   
> > > >   
> > > >  
> > > >   
> > > >  
> > > > 
> > > >
> > > > It's the 'rootEntity="false" that makes the child entity a document.
> > > > --
> > > >
> > > > Dmitry
> > > >
> > > > On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev 
> wrote:
> > > >
> > > > > 
> > > > >   > > > > name="s"
> > > > > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > > url=""
> > > > > user=""
> > > > > password=""/>
> > > > >   > > > > name="p"
> > > > >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > > url=""
> > > > > user=""
> > > > > password=""/>
> > > > >  
> > > > > > > > >datasource="s"
> > > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > > m_delivery_date,
> > > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > > m_warranty,
> > > > > m.contract as m_contract,
> > > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
> c.code
> > > as
> > > > > m_c_code
> > > > >   FROM Machine AS m
> > > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > > readOnly="true"
> > > > > transformer="DateFormatTransformer">
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >   
> > > > >
> > > > >> > > >datasource="p"
> > > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > > m_delivery_date,
> > > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > > m_warranty,
> > > > > m.contract as m_contract,
> > > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
> c.code
> > > as
> > > > > m_c_code
> > > > >   FROM Machine AS m
> > > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > > readOnly="true"
> > > > > transformer="DateFormatTransformer">
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >   
> > > > >  
> > > > > 
> > > > >
> > > > > I've removed the connection params
> > > > > The unique key is id.
> > > > >
> > > > > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> > > > wrote:
> > > > 

Frequent garbage collections after a day of operation

2012-02-16 Thread Matthias Käppler
Hey everyone,

we're running into some operational problems with our SOLR production
setup here and were wondering if anyone else is affected or has even
solved these problems before. We're running a vanilla SOLR 3.4.0 in
several Tomcat 6 instances, so nothing out of the ordinary, but after
a day or so of operation we see increased response times from SOLR, up
to 3 times higher on average. During this time we see increased CPU
load due to heavy garbage collection in the JVM, which bogs down the
whole system, so throughput decreases, naturally. When restarting
the slaves, everything goes back to normal, but that's more like a
brute force solution.

The thing is, we don't know what's causing this and we don't have that
much experience with Java stacks since we're for most parts a Rails
company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
seeing this, or can you think of a reason for this? Most of our
queries to SOLR involve the DismaxHandler and the spatial search query
components. We don't use any custom request handlers so far.

Thanks in advance,
-Matthias

-- 
Matthias Käppler
Lead Developer API & Mobile

Qype GmbH
Großer Burstah 50-52
20457 Hamburg
Telephone: +49 (0)40 - 219 019 2 - 160
Skype: m_kaeppler
Email: matth...@qype.com

Managing Director: Ian Brotherston
Amtsgericht Hamburg
HRB 95913

This e-mail and its attachments may contain confidential and/or
privileged information. If you are not the intended recipient (or have
received this e-mail in error) please notify the sender immediately
and destroy this e-mail and its attachments. Any unauthorized copying,
disclosure or distribution of this e-mail and  its attachments is
strictly forbidden. This notice also applies to future messages.


RE: PatternReplaceFilterFactory group

2012-02-16 Thread Steven A Rowe
Hi O.,

PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(), both 
of which take in a string that can include any or all groups using the syntax 
"$n", where n is the group number.  See the Matcher.appendReplacement() 
javadocs for an explanation of the functionality and syntax.
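
For example, the same "$n" semantics in plain java.util.regex (a quick
illustrative sketch, not taken from the filter's source; pattern and input
are made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupReplaceDemo {
    public static void main(String[] args) {
        // Group 1 captures the digits, group 2 the unit; "$1" keeps only group 1.
        Pattern p = Pattern.compile("(\\d+)(px)");
        Matcher m = p.matcher("width 100px height 50px");
        System.out.println(m.replaceAll("$1")); // prints: width 100 height 50
    }
}

The same "$1" string can go into the replacement attribute of
PatternReplaceFilterFactory.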


Steve

> -Original Message-
> From: O. Klein [mailto:kl...@octoweb.nl]
> Sent: Thursday, February 16, 2012 8:34 AM
> To: solr-user@lucene.apache.org
> Subject: PatternReplaceFilterFactory group
> 
> PatternReplaceFilterFactory has no option to select the group to replace.
> 
> Is there a reason for this, or could this be a nice feature?
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-
> tp3750201p3750201.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

The URL is quite large (w/ shards, ...), maybe it's best if I paste the
relevant parts.

Our "q" parameter is:

  
"q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",

The subqueries q8, q7, q4 and q3 are regular queries, for example:

"q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
(stopword_phrase:las AND stopword_phrase:de)"

We've executed the subqueries q3-q8 independently and they're very fast,
but when we introduce the function queries as described below, it all goes
10X slower.

Let me know if you need anything else.

Thanks
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 4:02 PM, Em  wrote:

> Hello carlos,
>
> could you show us how your Solr-call looks like?
>
> Regards,
> Em
>
> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> > Hello all:
> >
> > We'd like to score the matching documents using a combination of SOLR's
> IR
> > score with another application-specific score that we store within the
> > documents themselves (i.e. a float field containing the app-specific
> > score). In particular, we'd like to calculate the final score doing some
> > operations with both numbers (i.e product, sqrt, ...)
> >
> > According to what we know, there are two ways to do this in SOLR:
> >
> > A) Sort by function [1]: We've tested an expression like
> > "sort=product(score, query_score)" in the SOLR query, where score is the
> > common SOLR IR score and query_score is our own precalculated score, but
> it
> > seems that SOLR can only do this with stored/indexed fields (and
> obviously
> > "score" is not stored/indexed).
> >
> > B) Function queries: We've used _val_ and function queries like max, sqrt
> > and query, and we've obtained the desired results from a functional point
> > of view. However, our index is quite large (400M documents) and the
> > performance degrades heavily, given that function queries are AFAIK
> > matching all the documents.
> >
> > I have two questions:
> >
> > 1) Apart from the two options I mentioned, is there any other (simple)
> way
> > to achieve this that we're not aware of?
> >
> > 2) If we have to choose the function queries path, would it be very
> > difficult to modify the actual implementation so that it doesn't match
> all
> > the documents, that is, to pass a query so that it only operates over the
> > documents matching the query?. Looking at the FunctionQuery.java source
> > code, there's a comment that says "// instead of matching all docs, we
> > could also embed a query. the score could either ignore the subscore, or
> > boost it", which is giving us some hope that maybe it's possible and even
> > desirable to go in this direction. If you can give us some directions
> about
> > how to go about this, we may be able to do the actual implementation.
> >
> > BTW, we're using Lucene/SOLR trunk.
> >
> > Thanks a lot for your help.
> > Carlos
> >
> > [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> >
>


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Chantal Ackermann
If your script turns out too complex to maintain, and you are developing
in Java anyway, you could extend EntityProcessor and handle the data in
a custom way. I've done that to transform a datamart-like data structure
back into a row-based one.

Basically you override the method that gets the data in a Map and
transform it into a different Map which contains the fields as
understood by your schema.
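
Roughly along these lines -- an untested sketch against the 3.x DIH API,
with hypothetical column names ("FIELD_NAME"/"FIELD_VALUE") standing in
for the datamart layout:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class PivotingEntityProcessor extends SqlEntityProcessor {
    @Override
    public Map<String, Object> nextRow() {
        Map<String, Object> row = super.nextRow();
        if (row == null) return null;  // end of the result set
        // Turn key/value style rows into the flat field names the schema expects.
        Map<String, Object> out = new HashMap<String, Object>();
        out.put(String.valueOf(row.get("FIELD_NAME")).toLowerCase(),
                row.get("FIELD_VALUE"));
        return out;
    }
}

The class is then referenced via the processor attribute of the entity.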

Chantal


On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote:
> Hi Baranee,
> 
> Some time ago I played with
> http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was a
> pretty good stuff.
> 
> Regards
> 
> 
> On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan 
> wrote:
> 
> > To avoid that we don't want to mention the column names in the field tag,
> > but want to write a query to map all the fields in the table with solr
> > fields even if we don't know how many columns are there in the table.  I
> > need a kind of loop which runs through all the query results and map that
> > with solr fields.
> 
> 
> 
> 



Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Chantal Ackermann
Make sure your Tomcat instances are started each with a max heap size
that adds up to something a lot lower than the complete RAM of your
system.

Frequent Garbage collection means that your applications request more
RAM but your Java VM has no more resources, so it requires the Garbage
Collector to free memory so that the requested new objects can be
created. It's not indicating a memory leak unless you are running a
custom EntityProcessor in DIH that runs into an infinite loop and
creates huge amounts of schema fields. ;-)

Also - if you are doing hot deploys on Tomcat, you will have to restart
the Tomcat instance on a regular basis as hot deploys DO leak memory
after a while. (You might be seeing class undeploy messages in
catalina.out and later on OutOfMemory error messages.)

If this is not of any help you will probably have to provide a bit more
information on your Tomcat and SOLR configuration setup.

Chantal


On Thu, 2012-02-16 at 16:22 +0100, Matthias Käppler wrote:
> Hey everyone,
> 
> we're running into some operational problems with our SOLR production
> setup here and were wondering if anyone else is affected or has even
> solved these problems before. We're running a vanilla SOLR 3.4.0 in
> several Tomcat 6 instances, so nothing out of the ordinary, but after
> a day or so of operation we see increased response times from SOLR, up
> to 3 times higher on average. During this time we see increased CPU
> load due to heavy garbage collection in the JVM, which bogs down the
> whole system, so throughput decreases, naturally. When restarting
> the slaves, everything goes back to normal, but that's more like a
> brute force solution.
> 
> The thing is, we don't know what's causing this and we don't have that
> much experience with Java stacks since we're for most parts a Rails
> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
> seeing this, or can you think of a reason for this? Most of our
> queries to SOLR involve the DismaxHandler and the spatial search query
> components. We don't use any custom request handlers so far.
> 
> Thanks in advance,
> -Matthias
> 



RE: PatternReplaceFilterFactory group

2012-02-16 Thread O. Klein

steve_rowe wrote
> 
> Hi O.,
> 
> PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(),
> both of which take in a string that can include any or all groups using
> the syntax "$n", where n is the group number.  See the
> Matcher.appendReplacement() javadocs for an explanation of the
> functionality and syntax:
> ;
> 
> Steve
> 
>> -Original Message-
>> From: O. Klein [mailto:klein@]
>> Sent: Thursday, February 16, 2012 8:34 AM
>> To: solr-user@.apache
>> Subject: PatternReplaceFilterFactory group
>> 
>> PatternReplaceFilterFactory has no option to select the group to replace.
>> 
>> Is there a reason for this, or could this be a nice feature?
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-
>> tp3750201p3750201.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 

Thanks. I should get it working then.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-tp3750201p3750650.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is it possible to run deltaimport command with out delta query?

2012-02-16 Thread Shawn Heisey

On 2/15/2012 11:26 PM, nagarjuna wrote:

hi all..
   I am new to Solr. Can anybody explain to me the delta-import and
delta query? I also have the questions below:
1. Is it possible to run deltaimport without a deltaQuery?
2. Is it possible to write a delta query without having a last_modified column
in the database? If yes, please explain.


Assuming I understand what you're asking:

Define deltaImportQuery to be the same as query, then set deltaQuery to 
something that always returns some kind of value in the field you have 
designated as your primary key.  The data doesn't have to be relevant to 
anything at all, it just needs to return something for the primary key 
field.  Here's what I have in mine, my pk is did:


  deltaQuery="SELECT 1 AS did"

If you wish, you can completely ignore lastModified and track your own 
information about what data is new, then pass parameters via the 
dataimport handler URL to be used in your queries.  This is what both my 
query and deltaImportQuery are set to:


SELECT * FROM ${dataimporter.request.dataView}
WHERE (
  (
did > ${dataimporter.request.minDid}
AND did <= ${dataimporter.request.maxDid}
  )
  ${dataimporter.request.extraWhere}
) AND (crc32(did) % ${dataimporter.request.numShards})
  IN (${dataimporter.request.modVal})
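
A rough, untested SolrJ sketch of passing such custom parameters to the
handler (URL, handler path and values are illustrative, and it assumes the
default /select handler dispatches on qt):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

public class TriggerDeltaImport {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("qt", "/dataimport");       // route the request to the DIH handler
        p.set("command", "delta-import");
        p.set("clean", "false");
        p.set("minDid", "1000000");       // becomes ${dataimporter.request.minDid}
        p.set("maxDid", "1010000");       // becomes ${dataimporter.request.maxDid}
        server.query(p);
    }
}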

Thanks,
Shawn



Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
here the log:


org.apache.solr.handler.dataimport.DataImporter doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
a required attribute Processing Document # 1
 at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
 at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: start rollback
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: end_rollback
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter
doFullImport
Informazioni: Starting Full Import
feb 12, 2012 7:06:02 PM org.apache.solr.core.SolrCore execute
Informazioni: [] webapp=/solr path=/select
params={clean=false&commit=true&command=full-import&qt=/dataimport}
status=0 QTime=16
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
Informazioni: Read dataimport.properties
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter
doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
a required attribute Processing Document # 1
 at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
 at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:02 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: start rollback
feb 12, 2012 7:06:02 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: end_rollback
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
Informazioni: Pausing ProtocolHandler ["http-bio-8983"]
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
Informazioni: Pausing ProtocolHandler ["ajp-bio-8009"]
feb 12, 2012 7:06:42 PM org.apache.catalina.core.StandardService
stopInternal
Informazioni: Stopping service Catalina
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore close
Informazioni: []  CLOSING SolrCore org.apache.solr.core.SolrCore@7d1217
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore closeSearcher
Informazioni: [] Closing main searcher on request.
feb 12, 2012 7:06:42 PM org.apache.solr.search.SolrIndexSearcher close
Informazioni: Closing Searcher@19fabda main
 
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=2,evictions=0,size=2,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
feb 12, 2012 7:06:42 PM org.apache.solr.update.DirectUpdateHandler2 close
Informazioni: closing
DirectUpdateHandler2{commits=0,autocommits=0,optimizes=0,rollbacks=4,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
feb 12, 2012 7:06:42 PM org.apache.solr.update.DirectUpdateHandler2 close
Informazioni: closed
DirectUpdateHandler2{commits=0,autocommits=0,optimizes=0,rollbacks=4,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol stop
Informazioni: Stopping Protoco

Re: problem to indexing pdf directory

2012-02-16 Thread Gora Mohanty
On 16 February 2012 21:37, alessio crisantemi
 wrote:
> here the log:
>
>
> org.apache.solr.handler.dataimport.DataImporter doFullImport
> Grave: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
> a required attribute Processing Document # 1
[...]

The exception message above is pretty clear. You need to define a
baseDir attribute for the second entity.

However, even if you fix this, the setup will *not* work for indexing
PDFs. Did you read the URLs that I sent earlier?

Regards,
Gora


Re: Setting solrj server connection timeout

2012-02-16 Thread Shawn Heisey

On 2/3/2012 1:12 PM, Shawn Heisey wrote:
Is the following a reasonable approach to setting a connection timeout 
with SolrJ?


queryCore.getHttpClient().getHttpConnectionManager().getParams()
.setConnectionTimeout(15000);
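
Equivalently, CommonsHttpSolrServer exposes per-instance setters for both
timeouts -- a minimal sketch, the timeout values here are just examples:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TimeoutSetup {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setConnectionTimeout(15000); // ms to establish the TCP connection
        server.setSoTimeout(120000);        // ms to wait for data on the socket
    }
}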

Right now I have all my solr server objects sharing a single 
HttpClient that gets created using the multithreaded connection 
manager, where I set the timeout for all of them.  Now I will be 
letting each server object create its own HttpClient object, and using 
the above statement to set the timeout on each one individually.  
It'll use up a bunch more memory, as there are 56 server objects, but 
maybe it'll work better.  The total of 56 objects comes about from 7 
shards, a build core and a live core per shard, two complete index 
chains, and for each of those, one server object for access to 
CoreAdmin and another for the index.


The impetus for this, as it's possible I'm stating an XY problem: 
Currently I have an occasional problem where SolrJ connections throw 
an exception.  When it happens, nothing is logged in Solr.  My code is 
smart enough to notice the problem, send an email alert, and simply 
try again at the top of the next minute.  The simple explanation is 
that this is a Linux networking problem, but I never had any problem 
like this when I was using Perl with LWP to keep my index up to date.  
I sent a message to the list some time ago on this exception, but I 
never got a response that helped me figure it out.


Caused by: org.apache.solr.client.solrj.SolrServerException: 
java.net.SocketException: Connection reset


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)


at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)


at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)

at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)

... 3 more

Caused by: java.net.SocketException: Connection reset

at java.net.SocketInputStream.read(SocketInputStream.java:168)

at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

at java.io.BufferedInputStream.read(BufferedInputStream.java:237)

at 
org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)


at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)

at 
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)


at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)


at 
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)


at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)


at 
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)


at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)


at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)


at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)


at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)


... 7 more


No response in quite some time, so I'm bringing it up again.  I brought 
up the Exception issue before, and though I did get some responses, I 
didn't feel that I got an answer.


http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201112.mbox/%3c4eeaf6e5.9030...@elyograg.org%3E

Thanks,
Shawn



Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
Yes, I read it, but I don't know the cause.
What's more: I work on Windows, so I configured Tika and Solr manually
because I don't have Maven...

2012/2/16 Gora Mohanty 

> On 16 February 2012 21:37, alessio crisantemi
>  wrote:
> > here the log:
> >
> >
> > org.apache.solr.handler.dataimport.DataImporter doFullImport
> > Grave: Full Import failed
> > org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
> is
> > a required attribute Processing Document # 1
> [...]
>
> The exception message above is pretty clear. You need to define a
> baseDir attribute for the second entity.
>
> However, even if you fix this, the setup will *not* work for indexing
> PDFs. Did you read the URLs that I sent earlier?
>
> Regards,
> Gora
>


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread tamanjit.bin...@yahoo.co.in
There may be issues with your solrconfig. Kindly post the exception that you
are receiving.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3750937.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: is it possible to run deltaimport command with out delta query?

2012-02-16 Thread Dyer, James
There is a good example on how to do a delta update using 
"command=full-update&clean=false" on the wiki, here:  
http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta

This can be advantageous if you are updating a ton of data at once and do not 
want it executing as many queries to the database.  It also can be easier to 
maintain just 1 set of queries for both full and delta imports.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Thursday, February 16, 2012 10:04 AM
To: solr-user@lucene.apache.org
Subject: Re: is it possible to run deltaimport command with out delta query?

On 2/15/2012 11:26 PM, nagarjuna wrote:
> hi all..
> I am new to Solr. Can anybody explain to me the delta-import and
> delta query? I also have the questions below:
> 1. Is it possible to run deltaimport without a deltaQuery?
> 2. Is it possible to write a delta query without having a last_modified column
> in the database? If yes, please explain.

Assuming I understand what you're asking:

Define deltaImportQuery to be the same as query, then set deltaQuery to 
something that always returns some kind of value in the field you have 
designated as your primary key.  The data doesn't have to be relevant to 
anything at all, it just needs to return something for the primary key 
field.  Here's what I have in mine, my pk is did:

   deltaQuery="SELECT 1 AS did"

If you wish, you can completely ignore lastModified and track your own 
information about what data is new, then pass parameters via the 
dataimport handler URL to be used in your queries.  This is what both my 
query and deltaImportQuery are set to:

 SELECT * FROM ${dataimporter.request.dataView}
 WHERE (
   (
 did > ${dataimporter.request.minDid}
 AND did <= ${dataimporter.request.maxDid}
   )
   ${dataimporter.request.extraWhere}
 ) AND (crc32(did) % ${dataimporter.request.numShards})
   IN (${dataimporter.request.modVal})

Thanks,
Shawn



Re: Best requestHandler for "typing error".

2012-02-16 Thread tamanjit.bin...@yahoo.co.in
You can enable the spellcheck component and add it to your default request
handler.

This might be of use:
http://wiki.apache.org/solr/SpellCheckComponent

It can be used both for autosuggest and for "did you mean" suggestions.
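
A minimal SolrJ sketch of asking for suggestions at query time (it assumes a
spellcheck component is already wired into the handler; the URL and the
example term are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class DidYouMeanDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("goolge");
        q.set("spellcheck", "true");
        q.set("spellcheck.collate", "true"); // build a ready-to-use "did you mean" query
        QueryResponse rsp = server.query(q);
        SpellCheckResponse sc = rsp.getSpellCheckResponse();
        if (sc != null && !sc.isCorrectlySpelled()) {
            System.out.println("Did you mean: " + sc.getCollatedResult());
        }
    }
}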

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-requestHandler-for-typing-error-tp3749576p3750995.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr edismax clarification

2012-02-16 Thread Indika Tantrigoda
Hi All,

I am using the edismax search handler and I have some issues with the
search results. As I understand it, if the "defaultOperator" is set to OR, the
search query "The quick brown fox" is implicitly passed as "The OR quick OR
brown OR fox". However, if I search for The quick brown fox, I get fewer
results than when explicitly adding the OR. Another issue is that if I search
for The quick brown fox, other documents that contain the word fox are not in
the search results.
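
For reference, debugQuery=true shows how edismax actually expands the input,
which makes this easier to diagnose -- a quick SolrJ sketch (the URL and qf
field are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParsedQueryCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("The quick brown fox");
        q.set("defType", "edismax");
        q.set("qf", "text");          // field(s) to search, illustrative
        q.set("debugQuery", "true");  // ask Solr to echo the parsed query
        QueryResponse rsp = server.query(q);
        // The "parsedquery" entry shows exactly how the input was expanded.
        System.out.println(rsp.getDebugMap().get("parsedquery"));
    }
}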

Thanks.


copyField: multivalued field to joined singlevalue field

2012-02-16 Thread flyingeagle-de
Hello,

I want to copy all values from a multivalued field, joined together, into a
single-valued field.

Is there any way to do this using standard Solr features?

kind regards

--
View this message in context: 
http://lucene.472066.n3.nabble.com/copyField-multivalued-field-to-joined-singlevalue-field-tp3750857p3750857.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: copyField: multivalued field to joined singlevalue field

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 11:35 AM, flyingeagle-de
 wrote:
> Hello,
>
> I want to copy all values from a multivalued field, joined together, into a
> single-valued field.
>
> Is there any way to do this using standard Solr features?

There is not currently, but it certainly makes sense.

Anyone know of an open issue for this yet?  If not, we should create one!
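
Until there is built-in support, a custom UpdateRequestProcessor can do the
join at index time -- a rough, untested sketch (the class name, chain wiring
and field names "tags"/"tags_joined" are illustrative):

import java.io.IOException;
import java.util.Collection;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class JoinCopyFieldProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Collection<Object> vals = doc.getFieldValues("tags"); // multivalued source
                if (vals != null) {
                    StringBuilder joined = new StringBuilder();
                    for (Object v : vals) {
                        if (joined.length() > 0) joined.append(' ');
                        joined.append(v);
                    }
                    doc.setField("tags_joined", joined.toString()); // single-valued target
                }
                super.processAdd(cmd);
            }
        };
    }
}

The factory would be referenced from an updateRequestProcessorChain in
solrconfig.xml.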

-Yonik
lucidimagination.com


Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
I am attempting to execute a query with the following parameters

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=index
rows=10

When doing this I get the following exception

null  java.lang.ArrayIndexOutOfBoundsException

request: http://hostname:8983/solr/select
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

if I play with some of the parameters the query works as expected, i.e.

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=index
rows=0

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=count
rows=10

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.sort=index
rows=10


I am running on an old snapshot of Solr, but will try this on a new
version relatively soon.  Unfortunately I can't duplicate locally so
I'm a bit baffled by the error.

All of the shards have the field which we are faceting on


Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

well, you must take into account that you are executing up to 8 queries
per request instead of one query per request.

I am not totally sure about the details of the implementation of the
max-function-query, but I guess it first iterates over the results of
the first max-query, afterwards over the results of the second max-query
and so on. This is a much higher complexity than in the case of a normal
query.

I would suggest you to optimize your request. I don't think that this
particular function query is matching *all* docs. Instead I think it
just matches those docs specified by your inner-query (although I might
be wrong about that).

What are you trying to achieve by your request?

Regards,
Em

Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
> Hello Em:
> 
> The URL is quite large (w/ shards, ...), maybe it's best if I paste the
> relevant parts.
> 
> Our "q" parameter is:
> 
>   
> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",
> 
> The subqueries q8, q7, q4 and q3 are regular queries, for example:
> 
> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> (stopword_phrase:las AND stopword_phrase:de)"
> 
> We've executed the subqueries q3-q8 independently and they're very fast,
> but when we introduce the function queries as described below, it all goes
> 10X slower.
> 
> Let me know if you need anything else.
> 
> Thanks
> Carlos
> 
> 
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Thu, Feb 16, 2012 at 4:02 PM, Em  wrote:
> 
>> Hello carlos,
>>
>> could you show us how your Solr-call looks like?
>>
>> Regards,
>> Em
>>
>> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
>>> Hello all:
>>>
>>> We'd like to score the matching documents using a combination of SOLR's
>> IR
>>> score with another application-specific score that we store within the
>>> documents themselves (i.e. a float field containing the app-specific
>>> score). In particular, we'd like to calculate the final score doing some
>>> operations with both numbers (i.e product, sqrt, ...)
>>>
>>> According to what we know, there are two ways to do this in SOLR:
>>>
>>> A) Sort by function [1]: We've tested an expression like
>>> "sort=product(score, query_score)" in the SOLR query, where score is the
>>> common SOLR IR score and query_score is our own precalculated score, but
>> it
>>> seems that SOLR can only do this with stored/indexed fields (and
>> obviously
>>> "score" is not stored/indexed).
>>>
>>> B) Function queries: We've used _val_ and function queries like max, sqrt
>>> and query, and we've obtained the desired results from a functional point
>>> of view. However, our index is quite large (400M documents) and the
>>> performance degrades heavily, given that function queries are AFAIK
>>> matching all the documents.
>>>
>>> I have two questions:
>>>
>>> 1) Apart from the two options I mentioned, is there any other (simple)
>> way
>>> to achieve this that we're not aware of?
>>>
>>> 2) If we have to choose the function queries path, would it be very
>>> difficult to modify the actual implementation so that it doesn't match
>> all
>>> the documents, that is, to pass a query so that it only operates over the
>>> documents matching the query?. Looking at the FunctionQuery.java source
>>> code, there's a comment that says "// instead of matching all docs, we
>>> could also embed a query. the score could either ignore the subscore, or
>>> boost it", which is giving us some hope that maybe it's possible and even
>>> desirable to go in this direction. If you can give us some directions
>> about
>>> how to go about this, we may be able to do the actual implementation.
>>>
>>> BTW, we're using Lucene/SOLR trunk.
>>>
>>> Thanks a lot for your help.
>>> Carlos
>>>
>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
>>>
>>
> 


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
please ignore this, it has nothing to do with the faceting component.
I was able to disable a custom component that I had and it worked
perfectly fine.


On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
> I am attempting to execute a query with the following parameters
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=10
>
> When doing this I get the following exception
>
> null  java.lang.ArrayIndexOutOfBoundsException
>
> request: http://hostname:8983/solr/select
>        at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>        at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>        at 
> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>        at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>        at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
>
> if I play with some of the parameters the query works as expected, i.e.
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=0
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=count
> rows=10
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.sort=index
> rows=10
>
>
> I am running on an old snapshot of Solr, but will try this on a new
> version relatively soon.  Unfortunately I can't duplicate locally so
> I'm a bit baffled by the error.
>
> All of the shards have the field which we are faceting on


Re: Distributed Faceting Bug?

2012-02-16 Thread Em
Hi Jamie,

what version of Solr/SolrJ are you using?

Regards,
Em

Am 16.02.2012 18:42, schrieb Jamie Johnson:
> I am attempting to execute a query with the following parameters
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=10
> 
> When doing this I get the following exception
> 
> null  java.lang.ArrayIndexOutOfBoundsException
> 
> request: http://hostname:8983/solr/select
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>   at 
> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>   at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>   at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> 
> if I play with some of the parameters the query works as expected, i.e.
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=0
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=count
> rows=10
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.sort=index
> rows=10
> 
> 
> I am running on an old snapshot of Solr, but will try this on a new
> version relatively soon.  Unfortunately I can't duplicate locally so
> I'm a bit baffled by the error.
> 
> All of the shards have the field which we are faceting on
> 


Re: Distributed Faceting Bug?

2012-02-16 Thread Em
Hi Jamie,

Nice to hear.
Maybe you can share what kind of bug you ran into, so that other
developers with similarly buggy components can benefit from your
experience. :)

Regards,
Em

Am 16.02.2012 19:23, schrieb Jamie Johnson:
> please ignore this, it has nothing to do with the faceting component.
> I was able to disable a custom component that I had and it worked
> perfectly fine.
> 
> 
> On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
>> I am attempting to execute a query with the following parameters
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.limit=10
>> f.manu.facet.sort=index
>> rows=10
>>
>> When doing this I get the following exception
>>
>> null  java.lang.ArrayIndexOutOfBoundsException
>>
>> request: http://hostname:8983/solr/select
>>at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>>at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>at 
>> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>>at 
>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>>at 
>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>>at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>at java.lang.Thread.run(Thread.java:662)
>>
>> if I play with some of the parameters the query works as expected, i.e.
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.limit=10
>> f.manu.facet.sort=index
>> rows=0
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.limit=10
>> f.manu.facet.sort=count
>> rows=10
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.sort=index
>> rows=10
>>
>>
>> I am running on an old snapshot of Solr, but will try this on a new
>> version relatively soon.  Unfortunately I can't duplicate locally so
>> I'm a bit baffled by the error.
>>
>> All of the shards have the field which we are faceting on
> 


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Mikhail Khludnev
Chantal,

If you prefer Java, here is http://wiki.apache.org/solr/DIHCustomTransform
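
For the original requirement (mapping every column without listing each one
in the config), a transformer along these lines could work -- an untested
sketch, the class name is illustrative, and every target field still needs a
matching (dynamic) field in schema.xml:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class CopyAllColumnsTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        // Copy every column of the JDBC row to a field of the same lowercased name.
        Map<String, Object> out = new HashMap<String, Object>(row);
        for (Map.Entry<String, Object> e : row.entrySet()) {
            out.put(e.getKey().toLowerCase(), e.getValue());
        }
        return out;
    }
}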



On Thu, Feb 16, 2012 at 7:24 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> If your script turns out too complex to maintain, and you are developing
> in Java, anyway, you could extend EntityProcessor and handle the data in
> a custom way. I've done that to transform a datamart like data structure
> back into a row based one.
>
> Basically you override the method that gets the data in a Map and
> transform it into a different Map which contains the fields as
> understood by your schema.
>
> Chantal
>
>
> On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote:
> > Hi Baranee,
> >
> > Some time ago I played with
> > http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it
> was a
> > pretty good stuff.
> >
> > Regards
> >
> >
> > On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan <
> baraneethara...@hp.com>wrote:
> >
> > > To avoid that we don't want to mention the column names in the field
> tag ,
> > > but want to write a query to map all the fields in the table with solr
> > > fileds even if we don't know, how many columns are there in the table.
>  I
> > > need a kind of loop which runs through all the query results and map
> that
> > > with solr fileds.
> >
> >
> >
> >
>
>


-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

Thanks for your answer.

Yes, we initially also thought that the excessive increase in response time
was caused by the several queries being executed, and we did another test.
We executed one of the subqueries that I've shown to you directly in the
"q" parameter and then we tested this same subquery (only this one, without
the others) with the function query "query($q1)" in the "q" parameter.

Theoretically the times for these two queries should be more or less the
same, but the second one is several times slower than the first one. After
this observation we learned more about function queries and we learned from
the code and from some comments in the forums [1] that the FunctionQueries
are expected to match all documents.

We have some more tests on that matter: now we're moving from issuing this
large query through the SOLR interface to creating our own QueryParser. The
initial tests we've done in our QParser (that internally creates multiple
queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
getting very good response times and high quality answers. But when we've
tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
QueryValueSource that wraps the DisMaxQuery), then the times move from
10-20 msec to 200-300msec.
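
One alternative worth noting (an untested sketch, using Lucene 3.x package
names -- trunk has moved these classes around) is to wrap the
DisjunctionMaxQuery in a CustomScoreQuery, which only visits documents that
match the wrapped query:

import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class AppScoreWrapper {
    // Multiplies the IR score of the wrapped query by the stored per-document
    // query_score field, without iterating the whole index.
    public static Query wrap(Query userQuery) {
        FieldScoreQuery appScore =
            new FieldScoreQuery("query_score", FieldScoreQuery.Type.FLOAT);
        return new CustomScoreQuery(userQuery, appScore); // default: product of scores
    }
}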

Note that we're using early termination of queries (via a custom
collector), and therefore (as shown by the numbers I included above) even
if the query is very complex, we're getting very fast answers. The only
situation where the response time explodes is when we include a
FunctionQuery.

Re: your question of what we're trying to achieve ... We're implementing a
powerful query autocomplete system, and we use several fields to a) improve
performance on wildcard queries and b) have a very precise control over the
score.

Thanks a lot for your help,
Carlos

[1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 7:09 PM, Em  wrote:

> Hello Carlos,
>
> well, you must take into account that you are executing up to 8 queries
> per request instead of one query per request.
>
> I am not totally sure about the details of the implementation of the
> max-function-query, but I guess it first iterates over the results of
> the first max-query, afterwards over the results of the second max-query
> and so on. This is a much higher complexity than in the case of a normal
> query.
>
> I would suggest you to optimize your request. I don't think that this
> particular function query is matching *all* docs. Instead I think it
> just matches those docs specified by your inner-query (although I might
> be wrong about that).
>
> What are you trying to achieve by your request?
>
> Regards,
> Em
>
> Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > The URL is quite large (w/ shards, ...), maybe it's best if I paste the
> > relevant parts.
> >
> > Our "q" parameter is:
> >
> >
> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",
> >
> > The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >
> > "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> > wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> > (stopword_phrase:las AND stopword_phrase:de)"
> >
> > We've executed the subqueries q3-q8 independently and they're very fast,
> > but when we introduce the function queries as described below, it all
> goes
> > 10X slower.
> >
> > Let me know if you need anything else.
> >
> > Thanks
> > Carlos
> >
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >
> >
> > On Thu, Feb 16, 2012 at 4:02 PM, Em 
> wrote:
> >
> >> Hello carlos,
> >>
> >> could you show us how your Solr-call looks like?
> >>
> >> Regards,
> >> Em
> >>
> >> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> >>> Hello all:
> >>>
> >>> We'd like to score the matching documents using a combination of SOLR's
> >> IR
> >>> score with another application-specific score that we store within the
> >>> documents themselves (i.e. a float field containing the app-specific
> >>> score). In particular, we'd like to calculate the final score doing
> some
> >>> operations with both numbers (i.e product, sqrt, ...)
> >>>
> >>> According to what we know, there are two ways to do this in SOLR:
> >>>
> >>> A) Sort by function [1]: We've tested an expression like
> >>> "sort=product(score, query_score)" in the SOLR query, where score is
> the
> >>> common SOLR IR score and query_score is our own precalculated score,
> but
> >> it
> >>> seems that SOLR can only do this w

Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

> We have some more tests on that matter: now we're moving from issuing this
> large query through the SOLR interface to creating our own
QueryParser. The
> initial tests we've done in our QParser (that internally creates multiple
> queries and inserts them inside a DisjunctionMaxQuery) are very good,
we're
> getting very good response times and high quality answers. But when we've
> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> QueryValueSource that wraps the DisMaxQuery), then the times move from
> 10-20 msec to 200-300msec.
I reviewed the source code and yes, the FunctionQuery iterates over the
whole index, however... let's see!

In relation to the DisMaxQuery you create within your parser: What kind
of clause is the FunctionQuery and what kind of clause are your other
queries (MUST, SHOULD, MUST_NOT...)?

*I* would expect that with a shrinking set of matching documents to the
overall-query, the function query only checks those documents that are
guaranteed to be within the result set.

> Note that we're using early termination of queries (via a custom
> collector), and therefore (as shown by the numbers I included above) even
> if the query is very complex, we're getting very fast answers. The only
> situation where the response time explodes is when we include a
> FunctionQuery.
Could you give us some details about how/where did you plugin the
Collector, please?

Kind regards,
Em

Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
> Hello Em:
> 
> Thanks for your answer.
> 
> Yes, we initially also thought that the excessive increase in response time
> was caused by the several queries being executed, and we did another test.
> We executed one of the subqueries that I've shown to you directly in the
> "q" parameter and then we tested this same subquery (only this one, without
> the others) with the function query "query($q1)" in the "q" parameter.
> 
> Theoretically the times for these two queries should be more or less the
> same, but the second one is several times slower than the first one. After
> this observation we learned more about function queries and we learned from
> the code and from some comments in the forums [1] that the FunctionQueries
> are expected to match all documents.
> 
> We have some more tests on that matter: now we're moving from issuing this
> large query through the SOLR interface to creating our own QueryParser. The
> initial tests we've done in our QParser (that internally creates multiple
> queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
> getting very good response times and high quality answers. But when we've
> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> QueryValueSource that wraps the DisMaxQuery), then the times move from
> 10-20 msec to 200-300msec.
> 
> Note that we're using early termination of queries (via a custom
> collector), and therefore (as shown by the numbers I included above) even
> if the query is very complex, we're getting very fast answers. The only
> situation where the response time explodes is when we include a
> FunctionQuery.
> 
> Re: your question of what we're trying to achieve ... We're implementing a
> powerful query autocomplete system, and we use several fields to a) improve
> performance on wildcard queries and b) have a very precise control over the
> score.
> 
> Thanks a lot for your help,
> Carlos
> 
> [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
> 
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Thu, Feb 16, 2012 at 7:09 PM, Em  wrote:
> 
>> Hello Carlos,
>>
>> well, you must take into account that you are executing up to 8 queries
>> per request instead of one query per request.
>>
>> I am not totally sure about the details of the implementation of the
>> max-function-query, but I guess it first iterates over the results of
>> the first max-query, afterwards over the results of the second max-query
>> and so on. This is a much higher complexity than in the case of a normal
>> query.
>>
>> I would suggest you to optimize your request. I don't think that this
>> particular function query is matching *all* docs. Instead I think it
>> just matches those docs specified by your inner-query (although I might
>> be wrong about that).
>>
>> What are you trying to achieve by your request?
>>
>> Regards,
>> Em
>>
>> Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
>>> Hello Em:
>>>
>>> The URL is quite large (w/ shards, ...), maybe it's best if I paste the
>>> relevant parts.
>>>
>>> Our "q" parameter is:
>>>
>>>
>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",
>>>
>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
>>>
>>> "q7":"stopword_phrase:colomba

Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

1) Here's a printout of an example DisMax query (as you can see mostly MUST
terms except for some SHOULD terms used for boosting scores for stopwords)

((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) |
 (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) |
 (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) |
 (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))

2) The collector is inserted in the SolrIndexSearcher (replacing the
TimeLimitingCollector). We trigger it through the SOLR interface by passing
the timeAllowed parameter. We know this is a hack but AFAIK there's no
out-of-the-box way to specify custom collectors by now (
https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
part works perfectly as of now, so clearly this is not the problem.

3) Re: your sentence:

"*I* would expect that with a shrinking set of matching documents to
the overall-query, the function query only checks those documents that are
guaranteed to be within the result set."

Yes, I agree with this, but this snippet of code in FunctionQuery.java
seems to say otherwise:

// instead of matching all docs, we could also embed a query.
// the score could either ignore the subscore, or boost it.
// Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
// Boost:foo:myTerm^floatline("myFloatField",1.0,0.0f)
@Override
public int nextDoc() throws IOException {
  for(;;) {
++doc;
if (doc>=maxDoc) {
  return doc=NO_MORE_DOCS;
}
if (acceptDocs != null && !acceptDocs.get(doc)) continue;
return doc;
  }
}

It seems that the author also thought of maybe embedding a query in order
to restrict matches, but this doesn't seem to be in place as of now (or
maybe I'm not understanding how the whole thing works :) ).

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 8:09 PM, Em  wrote:

> Hello Carlos,
>
> > We have some more tests on that matter: now we're moving from issuing
> this
> > large query through the SOLR interface to creating our own
> QueryParser. The
> > initial tests we've done in our QParser (that internally creates multiple
> > queries and inserts them inside a DisjunctionMaxQuery) are very good,
> we're
> > getting very good response times and high quality answers. But when we've
> > tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> > QueryValueSource that wraps the DisMaxQuery), then the times move from
> > 10-20 msec to 200-300msec.
> I reviewed the sourcecode and yes, the FunctionQuery iterates over the
> whole index, however... let's see!
>
> In relation to the DisMaxQuery you create within your parser: What kind
> of clause is the FunctionQuery and what kind of clause are your other
> queries (MUST, SHOULD, MUST_NOT...)?
>
> *I* would expect that with a shrinking set of matching documents to the
> overall-query, the function query only checks those documents that are
> guaranteed to be within the result set.
>
> > Note that we're using early termination of queries (via a custom
> > collector), and therefore (as shown by the numbers I included above) even
> > if the query is very complex, we're getting very fast answers. The only
> > situation where the response time explodes is when we include a
> > FunctionQuery.
> Could you give us some details about how/where did you plugin the
> Collector, please?
>
> Kind regards,
> Em
>
> Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > Thanks for your answer.
> >
> > Yes, we initially also thought that the excessive increase in response
> time
> > was caused by the several queries being executed, and we did another
> test.
> > We executed one of the subqueries that I've shown to you directly in the
> > "q" parameter and then we tested this same subquery (only this one,
> without
> > the others) with the function query "query($q1)" in the "q" parameter.
> >
> > Theoretically the times for these two queries should be more or less the
> > same, but the second one is several times slower than the first one.
> After
> > this observation we learned more about function queries and we learned
> from
> > the code and from some comments in the forums [1] that the

Re: copyField: multivalued field to joined singlevalue field

2012-02-16 Thread Chris Hostetter

: > I want to copy all data from a multivalued field joined together in a single
: > valued field.
: >
: > Is there any opportunity to do this by using solr-standards?
: 
: There is not currently, but it certainly makes sense.

Part of it has just recently been commited to trunk actually...

https://issues.apache.org/jira/browse/SOLR-2802

https://builds.apache.org/job/Solr-trunk/javadoc/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html

...with that, it's easy to say "anytime multiple values are found for a 
single valued string field, join them together with a comma".
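
As an illustrative sketch of an update chain using it (the parameter names 
below are my own shorthand, so double check the javadoc linked above before 
copying anything):

  <updateRequestProcessorChain name="concat-single-valued">
    <processor class="solr.ConcatFieldUpdateProcessorFactory">
      <!-- illustrative selector and delimiter -->
      <str name="fieldName">title</str>
      <str name="delimiter">, </str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>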

the only piece that's missing is to "copy" from a source field in an 
(earlier) UpdateProcessor.  

There's a patch for this in SOLR-2599 but i haven't had a chance to look at 
it yet.




-Hoss


Re: Specify a cores roles through core add command

2012-02-16 Thread Mark Miller
https://issues.apache.org/jira/browse/SOLR-3138

On Feb 9, 2012, at 4:16 PM, Jamie Johnson wrote:

> per SOLR-2765 we can add roles to specific cores such that it's
> possible to give custom roles to solr instances, is it possible to
> specify this when adding a core through curl
> 'http://host:port/solr/admin/cores...'?
> 
> 
> https://issues.apache.org/jira/browse/SOLR-2765

- Mark Miller
lucidimagination.com













Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Alexey Verkhovsky
On Thu, Feb 16, 2012 at 3:37 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Everybody start from daily bounce, but end up with UPDATED_AT column and
> delta updates , just consider urgent content fix usecase. Don't think it's
> worth to rely on daily bounce as a cornerstone of architecture.
>

I'd be happy to avoid it, for all the obvious reasons.

I do know that performance of this type of services tends to be not that
great (as in "700 to 5000 msec"), and there should be ways to do it several
times faster than this.


> you can use grid of coordinates to reduce their entropy


I don't understand this statement. Can you elaborate, please?

Since my bounding boxes are small, one [premature optimization] idea could
be to divide Earth into 2x2 degree overlapping tiles at 1 degree step in
both directions (such that any bounding box fits within at least one of
them, and any location belongs to 4 of them), then use tileId=X as a cached
filter and geofilt as a post-filter. Is that along the lines of what you
are talking about?
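
(To make that concrete, I'm picturing request parameters roughly like the
following; the field names are invented and I haven't verified the
post-filter "cost" syntax against trunk:

  fq={!cache=true}tile_id:4217
  fq={!geofilt cache=false cost=200 sfield=location pt=45.15,-93.85 d=5}

i.e. the cheap tile filter gets cached and the exact geofilt runs last over
whatever survives.)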


> 
> > Lucene internals, caching of filters probably doesn't make sense either.
> > from what little I understand about
> But solr does it http://wiki.apache.org/solr/SolrCaching#filterCache
>

I didn't realize that multiple fq's in the same query were applied in
parallel as set intersections. In that case, the non-geography filters
should be cached (and added to the prewarming routine, I guess) even when
they are usually far less specific than the bounding box. Makes sense.


> > 1. Search server is an internal service that uses embedded Solr for the
> > indexing part. RAMDirectoryFactory as index storage.
> Bad idea. It's purposed mostly for tests, the closest purposed for
> production analogue is
> org.apache.lucene.store.instantiated.InstantiatedIndex
>
...

> AFAIK the state of art is use file directory (MMAP or whatever), rely on
> Linux file system RAM cache.
>

OK, I may as well start the spike from this angle, too. By the way, this is
precisely the kind of advice I was hoping for. Thanks a lot.

> 5. All Solr caching is switched off.

> But why?
>

Because (a) I shouldn't need to cache documents, if they are all in memory
anyway; (b) query caching will have abysmal hit/miss because of the spatial
component; and (c) I misunderstood how query filters work. So, now I'm
thinking a FastLFU query filter cache for non-geo filters.


> Btw, if you need multivalue geofield pls vote for SOLR-2155
>
Our data has one lon/lat pair per entity... so no, I don't need it. Or at
least haven't figured out that I do yet. :)

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
still digging ;)  Once I figure it out I'll be happy to share.

On Thu, Feb 16, 2012 at 1:32 PM, Em  wrote:
> Hi Jamie,
>
> nice to hear.
> Maybe you can share in what kind of bug you ran, so that other
> developers with similar bugish components can benefit from your
> experience. :)
>
> Regards,
> Em
>
> Am 16.02.2012 19:23, schrieb Jamie Johnson:
>> please ignore this, it has nothing to do with the faceting component.
>> I was able to disable a custom component that I had and it worked
>> perfectly fine.
>>
>>
>> On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
>>> I am attempting to execute a query with the following parameters
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.limit=10
>>> f.manu.facet.sort=index
>>> rows=10
>>>
>>> When doing this I get the following exception
>>>
>>> null  java.lang.ArrayIndexOutOfBoundsException
>>>
>>> request: http://hostname:8983/solr/select
>>>        at 
>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>>>        at 
>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>>        at 
>>> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>>>        at 
>>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>>>        at 
>>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>        at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>        at java.lang.Thread.run(Thread.java:662)
>>>
>>> if I play with some of the parameters the query works as expected, i.e.
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.limit=10
>>> f.manu.facet.sort=index
>>> rows=0
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.limit=10
>>> f.manu.facet.sort=count
>>> rows=10
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.sort=index
>>> rows=10
>>>
>>>
>>> I am running on an old snapshot of Solr, but will try this on a new
>>> version relatively soon.  Unfortunately I can't duplicate locally so
>>> I'm a bit baffled by the error.
>>>
>>> All of the shards have the field which we are faceting on
>>


Re: SolrCloud - issues running with embedded zookeeper ensemble

2012-02-16 Thread arin g
i have the same problem, it seems that there is a bug in SolrZkServer class
(parseProperties method), that doesn't work well when you have an external
zookeeper ensemble.

Thanks,
 arin

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-issues-running-with-embedded-zookeeper-ensemble-tp3694004p3751629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 3:03 PM, Alexey Verkhovsky
 wrote:
>> 5. All Solr caching is switched off.
>
>> But why?
>>
>
> Because (a) I shouldn't need to cache documents, if they are all in memory
> anyway;

You're making many assumptions about how Solr works internally.

One example of many:
  Solr streams documents (requests the stored fields right before they
are written to the response stream) to support returning any number of
documents.
If you highlight documents, the stored fields need to be retrieved
first.  When streaming those same documents later, Solr will retrieve
the stored fields again - relying on the fact that they should be
cached by the document cache since they were just used.

There are tons of examples of how things are architected to take
advantage of the caches - it pretty much never makes sense to outright
disable them.  If they take up too much memory, then just reduce the
size.
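
For example, something small in solrconfig.xml (the sizes here are just
placeholders):

  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>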

-Yonik
lucidimagination.com


Re: custom scoring

2012-02-16 Thread Chris Hostetter

: We'd like to score the matching documents using a combination of SOLR's IR
: score with another application-specific score that we store within the
: documents themselves (i.e. a float field containing the app-specific
: score). In particular, we'd like to calculate the final score doing some
: operations with both numbers (i.e product, sqrt, ...)

let's back up a minute.

if your ultimate goal is to have the final score of all documents be a 
simple multiplication of an indexed field ("query_score") against the 
score of your "base" query, that's fairely trivial use of the 
BoostQParser...

q={!boost f=query_score}your base query

...or to split it out using pram derefrencing...

q={!boost f=query_score v=$qq}
qq=your base query

: A) Sort by function [1]: We've tested an expression like
: "sort=product(score, query_score)" in the SOLR query, where score is the
: common SOLR IR score and query_score is our own precalculated score, but it
: seems that SOLR can only do this with stored/indexed fields (and obviously
: "score" is not stored/indexed).

you could do this by replacing "score" with the query whose score you 
want, which could be a ref back to "$q" -- but that's really only needed 
if you want the "scores" returned for each document to be differnt then the 
value used for sorting (ie: score comes from solr, sort value includes you 
query_score and the score from the main query -- or some completley diff 
query)
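
something along these lines (an untested sketch, assuming query_score is an 
indexed numeric field)...

  sort=product(query($q),query_score) desc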

based on what you've said, you don't need that and it would be 
unnecessary overhead.

: B) Function queries: We've used _val_ and function queries like max, sqrt
: and query, and we've obtained the desired results from a functional point
: of view. However, our index is quite large (400M documents) and the
: performance degrades heavily, given that function queries are AFAIK
: matching all the documents.

based on the examples you've given in your subsequent queries, it's not 
hard to see why...

> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),

wrapping queries in functions in queries can have that effect, because 
functions ultimatley match all documents -- even when that function wraps 
a query -- so your outermost query is still scoring every document in the 
index.

you want to do as much "pruning" with the query as possible, and only 
multiply by your boost function on matching docs, hence the 
purpose of the BoostQParser.

-Hoss


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
The issue appears to be that I put an empty array into the doc scores
instead of null in DocSlice.  DocSlice then just checks if scores is
null when hasScore is called which caused a further issue down the
line.  I'll follow up with anything else that I find along the way.

On Thu, Feb 16, 2012 at 3:05 PM, Jamie Johnson  wrote:
> still digging ;)  Once I figure it out I'll be happy to share.
>
> On Thu, Feb 16, 2012 at 1:32 PM, Em  wrote:
>> Hi Jamie,
>>
>> nice to hear.
>> Maybe you can share in what kind of bug you ran, so that other
>> developers with similar bugish components can benefit from your
>> experience. :)
>>
>> Regards,
>> Em
>>
>> Am 16.02.2012 19:23, schrieb Jamie Johnson:
>>> please ignore this, it has nothing to do with the faceting component.
>>> I was able to disable a custom component that I had and it worked
>>> perfectly fine.
>>>
>>>
>>> On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
 I am attempting to execute a query with the following parameters

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=10

 When doing this I get the following exception

 null  java.lang.ArrayIndexOutOfBoundsException

 request: http://hostname:8983/solr/select
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at 
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
        at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

 if I play with some of the parameters the query works as expected, i.e.

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=0

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=count
 rows=10

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.sort=index
 rows=10


 I am running on an old snapshot of Solr, but will try this on a new
 version relatively soon.  Unfortunately I can't duplicate locally so
 I'm a bit baffled by the error.

 All of the shards have the field which we are faceting on
>>>


Re: custom scoring

2012-02-16 Thread Robert Muir
On Thu, Feb 16, 2012 at 8:34 AM, Carlos Gonzalez-Cadenas
 wrote:
> Hello all:
>
> We'd like to score the matching documents using a combination of SOLR's IR
> score with another application-specific score that we store within the
> documents themselves (i.e. a float field containing the app-specific
> score). In particular, we'd like to calculate the final score doing some
> operations with both numbers (i.e product, sqrt, ...)
...
>
> 1) Apart from the two options I mentioned, is there any other (simple) way
> to achieve this that we're not aware of?
>

In general there is always a third option, that may or may not fit,
depending really upon how you are trying to model relevance and how
you want to integrate with scoring, and thats to tie in your factors
directly into Similarity (lucene's term weighting api). For example,
some people use index-time boosting, but in lucene index-time boost
really just means 'make the document appear shorter'. You might for
example, have other boosts that modify term-frequency before
normalization, or however you want to do it. Similarity is pluggable
into Solr via schema.xml.
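
For example, a top-level schema.xml element along the lines of (the class
name is obviously just an illustration):

  <similarity class="com.example.MyAppSimilarity"/>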

Since you are using trunk, this is a lot more flexible than previous
releases, e.g. you can access things from FieldCache, DocValues, or
even your own rapidly-changing float[] or whatever you want :) There
are also a lot more predefined models than just the vector space model
to work with if you find you can easily imagine your notion of
relevance in terms of an existing model.

-- 
lucidimagination.com


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Alexey Verkhovsky
On Thu, Feb 16, 2012 at 1:32 PM, Yonik Seeley wrote:

> Your're making many assumptions about how Solr works internally.
>

True that. If this spike turns into a project, digging through the source
code will come. Meantime, we have to start somewhere, and the default
configuration may not be the greatest starting point for this problem.

We don't need highlighting, and only need ids, scores and total number of
results out of Solr. Presentation of selected entities will have to include
some write-heavy data (from RDBMS and/or memcached), therefore won't be
Solr's business anyway.

From what you said, I guess it won't hurt to give it a small document
cache, just big enough to prevent streaming the same document twice within
the same query. Still don't have a reason to have a query cache - because
of lon/lat coming from the mobile devices, there are virtually no repeated
queries in our production logs. Or am I making a bad assumption here, too?

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 4:06 PM, Alexey Verkhovsky
 wrote:
> only need ids, scores and total number of results out of Solr. Presentation of
> selected entities will have to include some write-heavy data (from RDBMS
> and/or memcached), therefore won't be Solr's business anyway.

It depends on if you're going to be doing distributed search - there
may be some scenarios there where it's used, but in general the query
cache is the least useful.
The filterCache is useful in a ton of ways if you're doing faceting too.

-Yonik
lucidimagination.com


Re: SolrCloud - issues running with embedded zookeeper ensemble

2012-02-16 Thread Mark Miller

On Feb 16, 2012, at 2:53 PM, arin g wrote:

> i have the same problem, it seems that there is a bug in SolrZkServer class
> (parseProperties method), that doesn't work well when you have an external
> zookeeper ensemble.
> 

This issue was around using an embedded ensemble - an external ensemble makes 
SolrZkServer irrelevant.

What issue are you having? I just tried a basic test against an external 
ensemble.

What version are you using?

> Thanks,
> arin
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-issues-running-with-embedded-zookeeper-ensemble-tp3694004p3751629.html
> Sent from the Solr - User mailing list archive at Nabble.com.

- Mark Miller
lucidimagination.com













Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

I think we missunderstood eachother.

As an example:
BooleanQuery (
  clauses: (
 MustMatch(
   DisjunctionMaxQuery(
   TermQuery("stopword_field", "barcelona"),
   TermQuery("stopword_field", "hoteles")
   )
 ),
 ShouldMatch(
  FunctionQuery(
*please insert your function here*
 )
 )
  )
)

Explanation:
You construct an artificial BooleanQuery which wraps your user's query
as well as your function query.
Your user's query - in that case - is just a DisjunctionMaxQuery
consisting of two TermQueries.
In the real world you might construct another BooleanQuery around your
DisjunctionMaxQuery in order to have more flexibility.
However the interesting part of the given example is, that we specify
the user's query as a MustMatch-condition of the BooleanQuery and the
FunctionQuery just as a ShouldMatch.
Constructed that way, I am expecting the FunctionQuery only scores those
documents which fit the MustMatch-Condition.
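
In plain Lucene terms the shape would be roughly as follows (package names
for FunctionQuery/ValueSource moved between 3.x and trunk, and the
ValueSource is only a placeholder, so treat this as a sketch rather than
working code):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause.Occur;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.DisjunctionMaxQuery;
  import org.apache.lucene.search.TermQuery;

  DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
  userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
  userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));

  BooleanQuery combined = new BooleanQuery();
  // the user's query prunes the candidate set
  combined.add(userQuery, Occur.MUST);
  // the FunctionQuery should then only have to score the surviving docs;
  // yourValueSource stands in for whatever function you want to boost by
  combined.add(new FunctionQuery(yourValueSource), Occur.SHOULD);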

I conclude that from the fact that the FunctionQuery-class also has a
skipTo-method and I would expect that the scorer will use it to score
only matching documents (however I did not search where and how it might
get called).

If my conclusion is wrong than hopefully Robert Muir (as far as I can
see the author of that class) can tell us what was the intention by
constructing an every-time-match-all-function-query.

Can you validate whether your QueryParser constructs a query in the form
I drew above?

Regards,
Em

Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
> Hello Em:
> 
> 1) Here's a printout of an example DisMax query (as you can see mostly MUST
> terms except for some SHOULD terms used for boosting scores for stopwords)
> *
> *
> *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
> +stopword_phrase:barcelona
> stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short
> ened_phrase:barcelona stopword_shortened_phrase:en) | 
> (+stopword_phrase:hoteles
> +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor
> tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw
> ord_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
> +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
> stopword_phrase:en))*
> *
> *
> 2)* *The collector is inserted in the SolrIndexSearcher (replacing the
> TimeLimitingCollector). We trigger it through the SOLR interface by passing
> the timeAllowed parameter. We know this is a hack but AFAIK there's no
> out-of-the-box way to specify custom collectors by now (
> https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
> part works perfectly as of now, so clearly this is not the problem.
> 
> 3) Re: your sentence:
> *
> *
> **I* would expect that with a shrinking set of matching documents to
> the overall-query, the function query only checks those documents that are
> guaranteed to be within the result set.*
> *
> *
> Yes, I agree with this, but this snippet of code in FunctionQuery.java
> seems to say otherwise:
> 
> // instead of matching all docs, we could also embed a query.
> // the score could either ignore the subscore, or boost it.
> // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
> // Boost:foo:myTerm^floatline("myFloatField",1.0,0.0f)
> @Override
> public int nextDoc() throws IOException {
>   for(;;) {
> ++doc;
> if (doc>=maxDoc) {
>   return doc=NO_MORE_DOCS;
> }
> if (acceptDocs != null && !acceptDocs.get(doc)) continue;
> return doc;
>   }
> }
> 
> It seems that the author also thought of maybe embedding a query in order
> to restrict matches, but this doesn't seem to be in place as of now (or
> maybe I'm not understanding how the whole thing works :) ).
> 
> Thanks
> Carlos
> *
> *
> 
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Thu, Feb 16, 2012 at 8:09 PM, Em  wrote:
> 
>> Hello Carlos,
>>
>>> We have some more tests on that matter: now we're moving from issuing
>> this
>>> large query through the SOLR interface to creating our own
>> QueryParser. The
>>> initial tests we've done in our QParser (that internally creates multiple
>>> queries and inserts them inside a DisjunctionMaxQuery) are very good,
>> we're
>>> getting very good response times and high quality answers. But when we've
>>> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>> QueryValueSource that wraps th

files left open?

2012-02-16 Thread Paulo Magalhaes
Hi all,

I was loading a big (60 million docs) csv in solr 4 when something odd
happened.
I got a solr error in the log saying that it could not write the file.
du -s indicated I had used 30GB of the 50GB available, but df -k indicated
that the disk was 100% used.
du and df giving different results could be an indication that there are
file descriptors left open.
After a Solr bounce, df -k came down and agreed with du.
Has anyone seen anything like that?

Thanks,
Paulo.

environment;
Linux 2.6.18-238.19.1.el5.centos.
plus #1 SMP Mon Jul 18 10:05:09 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Java(TM) SE Runtime Environment (build 1.6.0_17-b04) Java HotSpot(TM)
64-Bit Server VM (build 14.3-b01, mixed mode)
apache-solr-4.0-2012-02-10_09-58-50
solr config is the one in the distribution package. i had my own schema.


Re: custom scoring

2012-02-16 Thread Em
I just modified some TestCases a little bit to see how the FunctionQuery
behaves.

Given that you got an index containing 14 docs, where 13 of them
contain the term "batman" and two contain the term "superman", a
search for

q=+text:superman _val_:"query($qq)"&qq=text:superman

Leads to two hits and the FunctionQuery has two iterations.

If you remove that little plus-symbol before "text:superman", it
wouldn't be a mustMatch-condition anymore and the whole query results
into 14 hits (default-operator is OR):

q=text:superman _val_:"query($qq)"&qq=text:superman

If both queries, the TermQuery and the FunctionQuery must match, it
would also result into two hits:

q=text:superman AND _val_:"query($qq)"&qq=text:superman

There is some behaviour that I currently don't understand (if 14 docs
match, the FunctionQuery's AllScorer iterates twice over the
0th and the 1st doc, and the reason for that seems to be the construction
of two AllScorers), but as far as I can see the performance of your
queries *should* increase if you construct your query as I explained in
my last eMail.

Kind regards,
Em

Am 16.02.2012 23:43, schrieb Em:
> Hello Carlos,
> 
> I think we missunderstood eachother.
> 
> As an example:
> BooleanQuery (
>   clauses: (
>  MustMatch(
>DisjunctionMaxQuery(
>TermQuery("stopword_field", "barcelona"),
>TermQuery("stopword_field", "hoteles")
>)
>  ),
>  ShouldMatch(
>   FunctionQuery(
> *please insert your function here*
>  )
>  )
>   )
> )
> 
> Explanation:
> You construct an artificial BooleanQuery which wraps your user's query
> as well as your function query.
> Your user's query - in that case - is just a DisjunctionMaxQuery
> consisting of two TermQueries.
> In the real world you might construct another BooleanQuery around your
> DisjunctionMaxQuery in order to have more flexibility.
> However the interesting part of the given example is, that we specify
> the user's query as a MustMatch-condition of the BooleanQuery and the
> FunctionQuery just as a ShouldMatch.
> Constructed that way, I am expecting the FunctionQuery only scores those
> documents which fit the MustMatch-Condition.
> 
> I conclude that from the fact that the FunctionQuery-class also has a
> skipTo-method and I would expect that the scorer will use it to score
> only matching documents (however I did not search where and how it might
> get called).
> 
> If my conclusion is wrong than hopefully Robert Muir (as far as I can
> see the author of that class) can tell us what was the intention by
> constructing an every-time-match-all-function-query.
> 
> Can you validate whether your QueryParser constructs a query in the form
> I drew above?
> 
> Regards,
> Em
> 
> Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
>> Hello Em:
>>
>> 1) Here's a printout of an example DisMax query (as you can see mostly MUST
>> terms except for some SHOULD terms used for boosting scores for stopwords)
>> *
>> *
>> *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>> +stopword_phrase:barcelona
>> stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short
>> ened_phrase:barcelona stopword_shortened_phrase:en) | 
>> (+stopword_phrase:hoteles
>> +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor
>> tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw
>> ord_phrase:barcelona stopword_phrase:en) | 
>> (+stopword_shortened_phrase:hoteles
>> +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
>> stopword_phrase:en))*
>> *
>> *
>> 2)* *The collector is inserted in the SolrIndexSearcher (replacing the
>> TimeLimitingCollector). We trigger it through the SOLR interface by passing
>> the timeAllowed parameter. We know this is a hack but AFAIK there's no
>> out-of-the-box way to specify custom collectors by now (
>> https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
>> part works perfectly as of now, so clearly this is not the problem.
>>
>> 3) Re: your sentence:
>> *
>> *
>> **I* would expect that with a shrinking set of matching documents to
>> the overall-query, the function query only checks those documents that are
>> guaranteed to be within the result set.*
>> *
>> *
>> Yes, I agree with this, but this snippet of code in FunctionQuery.java
>> seems to say otherwise:
>>
>> // instead of matching all docs, we could also embed a query.
>> // the score could either ignore the subscore, or boost it.
>> // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
>> // Boost:foo:myTerm^floatline("myFloatField",1.0,0.0f)
>> @Override
>> public int nextDoc() thr

Re: files left open?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 5:56 PM, Paulo Magalhaes
 wrote:
> I was loading a big (60 million docs) csv in solr 4 when something odd
> happened.
> I got a solr error in the log saying that it could not write the file.
> du -s indicated I had used 30Gb of a 50Gb available but df -k  indicated
> that the disk was I00% used.

You probably hit a big segment merge, which does require more disk
space temporarily.
The difference between du and df probably just indicates how they
internally work (du may just look at file sizes, and non-closed files
can register as smaller or 0 than the amount of disk space they
actually take up).
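
If you want to confirm it next time, listing deleted-but-still-open files
usually shows the culprit (where SOLR_PID is the Solr JVM's process id):

  lsof -p $SOLR_PID | grep -i deleted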

-Yonik
lucidimagination.com


Re: Setting solrj server connection timeout

2012-02-16 Thread Mark Miller
I'm not sure that timeout will help you here - I believe it's the timeout on
'creating' the connection.

Try setting the socket timeout (setSoTimeout) - that should let you try
sooner.
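
Something along these lines on each of your server objects (the values are
arbitrary):

  // CommonsHttpSolrServer server = ...;
  server.setConnectionTimeout(15000); // time allowed to establish the connection
  server.setSoTimeout(30000);         // socket read timeout while waiting for a response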

It looks like perhaps the server is timing out and closing the connection.

I guess all you can do is time out reasonably (if it takes too long to wait
for the exception) and retry.

On Fri, Feb 3, 2012 at 3:12 PM, Shawn Heisey  wrote:

> Is the following a reasonable approach to setting a connection timeout
> with SolrJ?
>
>queryCore.getHttpClient().**getHttpConnectionManager().**
> getParams()
>.setConnectionTimeout(15000);
>
> Right now I have all my solr server objects sharing a single HttpClient
> that gets created using the multithreaded connection manager, where I set
> the timeout for all of them.  Now I will be letting each server object
> create its own HttpClient object, and using the above statement to set the
> timeout on each one individually.  It'll use up a bunch more memory, as
> there are 56 server objects, but maybe it'll work better.  The total of 56
> objects comes about from 7 shards, a build core and a live core per shard,
> two complete index chains, and for each of those, one server object for
> access to CoreAdmin and another for the index.
>
> The impetus for this, as it's possible I'm stating an XY problem:
> Currently I have an occasional problem where SolrJ connections throw an
> exception.  When it happens, nothing is logged in Solr.  My code is smart
> enough to notice the problem, send an email alert, and simply try again at
> the top of the next minute.  The simple explanation is that this is a Linux
> networking problem, but I never had any problem like this when I was using
> Perl with LWP to keep my index up to date.  I sent a message to the list
> some time ago on this exception, but I never got a response that helped me
> figure it out.
>
> Caused by: org.apache.solr.client.solrj.**SolrServerException:
> java.net.SocketException: Connection reset
>
> at org.apache.solr.client.solrj.**impl.CommonsHttpSolrServer.**
> request(CommonsHttpSolrServer.**java:480)
>
> at org.apache.solr.client.solrj.**impl.CommonsHttpSolrServer.**
> request(CommonsHttpSolrServer.**java:246)
>
> at org.apache.solr.client.solrj.**request.QueryRequest.process(**
> QueryRequest.java:89)
>
> at org.apache.solr.client.solrj.**SolrServer.query(SolrServer.**java:276)
>
> at com.newscom.idxbuild.solr.**Core.getCount(Core.java:325)
>
> ... 3 more
>
> Caused by: java.net.SocketException: Connection reset
>
> at java.net.SocketInputStream.**read(SocketInputStream.java:**168)
>
> at java.io.BufferedInputStream.**fill(BufferedInputStream.java:**218)
>
> at java.io.BufferedInputStream.**read(BufferedInputStream.java:**237)
>
> at org.apache.commons.httpclient.**HttpParser.readRawLine(**
> HttpParser.java:78)
>
> at org.apache.commons.httpclient.**HttpParser.readLine(**
> HttpParser.java:106)
>
> at org.apache.commons.httpclient.**HttpConnection.readLine(**
> HttpConnection.java:1116)
>
> at org.apache.commons.httpclient.**MultiThreadedHttpConnectionMan**
> ager$HttpConnectionAdapter.**readLine(**MultiThreadedHttpConnectionMan**
> ager.java:1413)
>
> at org.apache.commons.httpclient.**HttpMethodBase.readStatusLine(**
> HttpMethodBase.java:1973)
>
> at org.apache.commons.httpclient.**HttpMethodBase.readResponse(**
> HttpMethodBase.java:1735)
>
> at org.apache.commons.httpclient.**HttpMethodBase.execute(**
> HttpMethodBase.java:1098)
>
> at org.apache.commons.httpclient.**HttpMethodDirector.**executeWithRetry(*
> *HttpMethodDirector.java:398)
>
> at org.apache.commons.httpclient.**HttpMethodDirector.**executeMethod(**
> HttpMethodDirector.java:171)
>
> at org.apache.commons.httpclient.**HttpClient.executeMethod(**
> HttpClient.java:397)
>
> at org.apache.commons.httpclient.**HttpClient.executeMethod(**
> HttpClient.java:323)
>
> at org.apache.solr.client.solrj.**impl.CommonsHttpSolrServer.**
> request(CommonsHttpSolrServer.**java:424)
>
> ... 7 more
>
>
> Thanks,
> Shawn
>
>


-- 
- Mark

http://www.lucidimagination.com


how to delta index linked entities in 3.5.0

2012-02-16 Thread AdamLane
The delta instructions from
https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
works for me in solr 1.4 but crashes in 3.5.0 (error: "deltaQuery has no
column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'"  issue:
https://issues.apache.org/jira/browse/SOLR-2907) 

Is there anyone out there that can confirm my bug?  Because I am new to solr
and hopefully I am just doing something wrong based on a misunderstanding of
the wiki.  Anyone successfully indexing the join of items and multiple
item_categories just like the wiki example that would be willing to share
their workaround or suggest a workaround?

Thanks,
Adam   


--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-delta-index-linked-entities-in-3-5-0-tp3752455p3752455.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Setting solrj server connection timeout

2012-02-16 Thread Shawn Heisey

On 2/16/2012 6:28 PM, Mark Miller wrote:

I'm not sure that timeout will help you here - I believe it's the timeout on
'creating' the connection.

Try setting the socket timeout (setSoTimeout) - that should let you try
sooner.

It looks like perhaps the server is timing out and closing the connection.

I guess all you can do is time out reasonably (if it takes too long to wait
for the exception) and retry.


When the timeout exception happens, it is happening within the same 
second as the beginning of the update cycle, which involves a lot of 
other things happening (such as talking to a database) before it even 
gets around to talking to Solr.  I do not have millisecond timestamps, 
but from what little I can tell, it's a handful of milliseconds from 
when SolrJ starts the request until the exception is logged.  It happens 
relatively rarely - no more than once every few days, usually less often 
than that.  I cannot reproduce it at will.  Nobody is doing any work on 
either Solr or the network when it happens.  Nothing is logged in the 
Solr server log or syslog at the OS level, the only mention of anything 
bad going on is in the log of my SolrJ application.


I never had this problem when my build system was written in Perl, using 
LWP to make HTTP requests with URLs that I constructed myself.  The perl 
system ran on CentOS 5 with Xen virtualization, now I'm running CentOS 6 
on the bare metal.  I'm using a bonded interface (for failover, not load 
balancing) comprised of two NICs plugged into separate switches.  When 
it was virtualized, the Xen host was also using an identically 
configured bonded interface, bridged to the guests, which used eth0.


The last time the error happened, which was on Feb 15th at 2:04 PM MST, 
the query that failed was 'did:(289800299 OR 289800157)', a very simple 
query against a tlong field.  The application tests for the existence of 
the did values that it is trying to delete before it issues the delete 
request.


I'm willing to look deeper into possible networking issues, but I am 
skeptical about that being the problem, and because there are no log 
messages to investigate, I have no idea how to proceed.  The application 
runs on one of four Solr servers, sometimes the error even happens when 
connecting to Solr on the same server it's running on, which takes the 
gigabit switches out of the equation.  If it's an actual networking 
problem, it's either in the hardware (Dell PowerEdge 2950 III, built-in 
NICs) or the CentOS 6 kernel.


At this point, I am thinking it's one of the following problems, in 
order of decreasing probability: 1) I am using SolrJ incorrectly. 2) 
There is a SolrJ problem that only appears under specific circumstances 
that happen to exist in my setup. 3) My hardware or OS software has an 
extremely intermittent problem.


What other info can I provide?

Thanks,
Shawn



Sort by the number of matching terms (coord value)

2012-02-16 Thread Nicholas Clark
Hi,

I'm looking for a way to sort results by the number of matching terms.
Being able to sort by the coord() value or by the overlap value that gets
passed into the coord() function would do the trick. Is there a way I can
expose those values to the sort function?

I'd appreciate any help that points me in the right direction. I'm OK with
making basic code modifications.

Thanks!

-Nick


Re: how to delta index linked entities in 3.5.0

2012-02-16 Thread Shawn Heisey

On 2/16/2012 6:31 PM, AdamLane wrote:

The delta instructions from
https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
works for me in solr 1.4 but crashes in 3.5.0 (error: "deltaQuery has no
column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'"  issue:
https://issues.apache.org/jira/browse/SOLR-2907)

Is there anyone out there that can confirm my bug?  Because I am new to solr
and hopefully I am just doing something wrong based on a misunderstanding of
the wiki.  Anyone successfully indexing the join of items and multiple
item_categories just like the wiki example that would be willing to share
their workaround or suggest a workaround?


I ran into something like this, possibly even this exact problem.

Things have been tightened up in 3.x.  All query results now need to 
have a field corresponding to what you've defined as pk, or it's 
considered an error.  I was not using the results from my deltaQuery, 
but I still had to adjust it so that it returned a field with the same 
name as my primary key.  You have defined more than one field for your 
pk, so I don't really know exactly what you'll have to do - perhaps you 
need to have both ITEM_ID and CATEGORY_ID fields in your query results.
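
As a sketch of what I mean for a single-column pk (the table and column
names are invented, and I have not tried this against your schema):

  <entity name="item" pk="ITEM_ID"
          query="SELECT * FROM item"
          deltaQuery="SELECT ITEM_ID FROM item
                      WHERE last_modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM item WHERE ITEM_ID='${dih.delta.ITEM_ID}'"/>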


Thanks,
Shawn



Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Li Li
You can fool the Lucene scoring function: override each function such as idf,
queryNorm, and lengthNorm and let them simply return 1.0f.
I don't know whether Lucene 4 will expose more details, but for 2.x/3.x Lucene can only
score by the vector space model and the formula can't be replaced by users.
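
A rough sketch of such a Similarity for 3.x (method signatures vary a bit
across 3.x releases, so adjust to whatever version you build against):

  import org.apache.lucene.index.FieldInvertState;
  import org.apache.lucene.search.DefaultSimilarity;

  public class CoordOnlySimilarity extends DefaultSimilarity {
    @Override public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
    @Override public float idf(int docFreq, int numDocs) { return 1.0f; }
    @Override public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    @Override public float computeNorm(String field, FieldInvertState state) { return 1.0f; }
    // coord(overlap, maxOverlap) is inherited unchanged, so the score is
    // dominated by how many of the query terms actually match
  }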

On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark  wrote:

> Hi,
>
> I'm looking for a way to sort results by the number of matching terms.
> Being able to sort by the coord() value or by the overlap value that gets
> passed into the coord() function would do the trick. Is there a way I can
> expose those values to the sort function?
>
> I'd appreciate any help that points me in the right direction. I'm OK with
> making basic code modifications.
>
> Thanks!
>
> -Nick
>


Improving proximity search performance

2012-02-16 Thread Bryan Loofbourrow
Here’s my use case. I expect to set up a Solr index that is approximately
1.4GB (this is a real number from the proof-of-concept using the real data,
which consists of about 10 million documents, many of significant size, and
making use of the FastVectorHighlighter to do highlighting on the body text
field, which is of course stored, and with termVectors, termPositions, and
termOffsets on).



I no longer have the proof-of-concept Solr core available (our live site
uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical
answer to this question: Will storing that extra information about the
location of terms help the performance of proximity searches?



A significant and important subset of my users make extensive use of
proximity searches. These sophisticated users have found that they are best
able to locate what they want by doing searches about THISWORD within 5
words of THATWORD, or much more sophisticated variants on that theme,
including plenty of booleans and wildcards. The problem I’m facing is
performance. Some of these searches, when common words are used, can take
many minutes, even with the index on an SSD.



The question is, how to improve the performance. It occurred to me as
possible that all of that term vector information, stored for the benefit
of the FastVectorHighlighter, might be a significant aid to the performance
of these searches.



First question: is that already the case? Will storing this extra
information automatically improve my proximity search performance?



Second question: If not, I’m very willing to dive into the code and come up
with a patch that would do this. Can someone with knowledge of the
internals comment on whether this is a plausible strategy for improving
performance, and, if so, give tips about the outlines of what a successful
approach to the problem might look like?



Third question: Any tips in general for improving the performance of these
proximity searches? I have explored the question of whether the customers
might be weaned off of them, and that does not appear to be an option.



Thanks,



-- Bryan Loofbourrow


RE: Frequent garbage collections after a day of operation

2012-02-16 Thread Bryan Loofbourrow
A couple of thoughts:

We wound up doing a bunch of tuning on the Java garbage collection.
However, the pattern we were seeing was periodic very extreme slowdowns,
because we were then using the default garbage collector, which blocks
when it has to do a major collection. This doesn't sound like your
problem, but it's something to be aware of.
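
If it ever does become your problem, the usual starting point is a
concurrent collector plus GC logging; the flags below are illustrative
rather than a recommendation for your particular heap:

  -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
  -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log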

One thing that could fit the pattern you describe would be Solr caches
filling up and getting you too close to your JVM or memory limit. For
example, if you have large documents, and have defined a large document
cache, that might do it.

I found it useful to point jconsole (free with the JDK) at my JVM, and
watch the pattern of memory usage. If the troughs at the bottom of the GC
cycles keep rising, you know you've got something that is continuing to
grab more memory and not let go of it. Now that our JVM is running
smoothly, we just see a sawtooth pattern, with the troughs approximately
level. When the system is under load, the frequency of the wave rises. Try
it and see what sort of pattern you're getting.

-- Bryan

> -Original Message-
> From: Matthias Käppler [mailto:matth...@qype.com]
> Sent: Thursday, February 16, 2012 7:23 AM
> To: solr-user@lucene.apache.org
> Subject: Frequent garbage collections after a day of operation
>
> Hey everyone,
>
> we're running into some operational problems with our SOLR production
> setup here and were wondering if anyone else is affected or has even
> solved these problems before. We're running a vanilla SOLR 3.4.0 in
> several Tomcat 6 instances, so nothing out of the ordinary, but after
> a day or so of operation we see increased response times from SOLR, up
> to 3 times increases on average. During this time we see increased CPU
> load due to heavy garbage collection in the JVM, which bogs down the
> the whole system, so throughput decreases, naturally. When restarting
> the slaves, everything goes back to normal, but that's more like a
> brute force solution.
>
> The thing is, we don't know what's causing this and we don't have that
> much experience with Java stacks since we're for most parts a Rails
> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
> seeing this, or can you think of a reason for this? Most of our
> queries to SOLR involve the DismaxHandler and the spatial search query
> components. We don't use any custom request handlers so far.
>
> Thanks in advance,
> -Matthias
>
> --
> Matthias Käppler
> Lead Developer API & Mobile
>
> Qype GmbH
> Großer Burstah 50-52
> 20457 Hamburg
> Telephone: +49 (0)40 - 219 019 2 - 160
> Skype: m_kaeppler
> Email: matth...@qype.com
>
> Managing Director: Ian Brotherston
> Amtsgericht Hamburg
> HRB 95913
>
> This e-mail and its attachments may contain confidential and/or
> privileged information. If you are not the intended recipient (or have
> received this e-mail in error) please notify the sender immediately
> and destroy this e-mail and its attachments. Any unauthorized copying,
> disclosure or distribution of this e-mail and  its attachments is
> strictly forbidden. This notice also applies to future messages.


Re: distributed deletes working?

2012-02-16 Thread Mark Miller
Yup - deletes are fine.


On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson  wrote:

> With solr-2358 being committed to trunk do deletes and updates get
> distributed/routed like adds do? Also when a down shard comes back up are
> the deletes/updates forwarded as well? Reading the jira I believe the
> answer is yes, I just want to verify before bringing the latest into my
> environment.
>



-- 
- Mark

http://www.lucidimagination.com


Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Nicholas Clark
I want to leave the score intact so I can sort by matching term frequency
and then by score. I don't think I can do that if I modify all the
similarity functions, but I think your solution would have worked otherwise.

It would be great if there was a way I could expose this information
through a function query (similar to the new relevance functions in version
4.0). I'll have to see if I can figure out how those functions work.

-Nick


On Thu, Feb 16, 2012 at 6:58 PM, Li Li  wrote:

> you can fool the lucene scoring fuction. override each function such as idf
> queryNorm lengthNorm and let them simply return 1.0f.
> I don't lucene 4 will expose more details. but for 2.x/3.x, lucene can only
> score by vector space model and the formula can't be replaced by users.
>
> On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark 
> wrote:
>
> > Hi,
> >
> > I'm looking for a way to sort results by the number of matching terms.
> > Being able to sort by the coord() value or by the overlap value that gets
> > passed into the coord() function would do the trick. Is there a way I can
> > expose those values to the sort function?
> >
> > I'd appreciate any help that points me in the right direction. I'm OK
> with
> > making basic code modifications.
> >
> > Thanks!
> >
> > -Nick
> >
>


Ranking based on number of matches in a multivalued field?

2012-02-16 Thread Steven Ou
So suppose I have a multivalued field for categories. Let's say we have 3
items with these categories:

Item 1: category ids [1,2,5,7,9]
Item 2: category ids [4,8,9]
Item 3: category ids [1,4,9]

I now run a filter query for any of the following category ids [1,4,9]. I
should get all of them back as results because they all include at least
one category which I'm querying.

Now, how do I order it based on the number of matching categories?? In this
case, I would like Item 3 (matched all [1,4,9]) to be ranked higher,
followed by Item 2 (matched [4,9]) and Item 1 (matched [1,9]). Is there a
way I can boost documents based on the number of matches?

I don't want an "absolute" rank where Item 3 is definitely the first
result, but rather a way to boost Item 3's score higher than that of Item 1
and 2 so that it's more likely to show up higher (depending on the query
string).

Thanks!
--
Steven Ou | 歐偉凡

*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880


UpdateRequestHandler coding

2012-02-16 Thread Lance Norskog
If I want to write a complex UpdateRequestHandler should I do it on
trunk or the 3.x branch? The criteria are a stable, debugged,
full-featured environment.

-- 
Lance Norskog
goks...@gmail.com


Re: Size of suggest dictionary

2012-02-16 Thread Mike Hugo
Thanks Em!

What if we use a threshold value in the suggest configuration, like 

  <float name="threshold">0.005</float>

I assume the dictionary size will then be smaller than the total number of 
distinct terms, is there anyway to determine what that size is?

Thanks,

Mike


On Wednesday, February 15, 2012 at 4:39 PM, Em wrote:

> Hello Mike,
> 
> have a look at Solr's Schema Browser. Click on "FIELDS", select "label"
> and have a look at the number of distinct (term-)values.
> 
> Regards,
> Em
> 
> 
> Am 15.02.2012 23:07, schrieb Mike Hugo:
> > Hello,
> > 
> > We're building an auto suggest component based on the "label" field of
> > documents. Is there a way to see how many terms are in the dictionary, or
> > how much memory it's taking up? I looked on the statistics page but didn't
> > find anything obvious.
> > 
> > Thanks in advance,
> > 
> > Mike
> > 
> > ps- here's the config:
> > 
> > <searchComponent name="..." class="solr.SpellCheckComponent">
> >   <lst name="spellchecker">
> >     <str name="name">suggestlabel</str>
> >     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
> >     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
> >     <str name="field">label</str>
> >     <str name="buildOnCommit">true</str>
> >   </lst>
> > </searchComponent>
> > 
> > <requestHandler name="..."
> > class="org.apache.solr.handler.component.SearchHandler">
> >   <lst name="defaults">
> >     <str name="spellcheck">true</str>
> >     <str name="spellcheck.dictionary">suggestlabel</str>
> >     <str name="spellcheck.count">10</str>
> >   </lst>
> >   <arr name="components">
> >     <str>suggestlabel</str>
> >   </arr>
> > </requestHandler>
> > 
> 
> 
> 
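
Not an answer from the thread, but one rough way to estimate the
post-threshold dictionary size: the Suggester's HighFrequencyDictionary
keeps terms whose document frequency is at least roughly threshold times the
number of documents, so the TermsComponent can approximate the count.
Assuming a /terms handler wired to the TermsComponent (the example
solrconfig ships one), a hypothetical index of about 200,000 documents, and
threshold=0.005 (i.e. a mincount of about 1,000), with host and core as
placeholders:

    http://localhost:8983/solr/terms?terms.fl=label&terms.mincount=1000&terms.limit=-1&omitHeader=true

Counting the terms returned gives an estimate of the dictionary size.
terms.limit=-1 removes the default cap of 10, so the response can be large
on a field with many distinct values.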




Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Jason Rutherglen
> One thing that could fit the pattern you describe would be Solr caches
> filling up and getting you too close to your JVM or memory limit

This [uncommitted] issue would solve that problem by allowing the GC
to collect caches that become too large, though in practice, the cache
setting would need to be fairly large for an OOM to occur from them:
https://issues.apache.org/jira/browse/SOLR-1513

On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow
 wrote:
> A couple of thoughts:
>
> We wound up doing a bunch of tuning on the Java garbage collection.
> However, the pattern we were seeing was periodic very extreme slowdowns,
> because we were then using the default garbage collector, which blocks
> when it has to do a major collection. This doesn't sound like your
> problem, but it's something to be aware of.
>
> One thing that could fit the pattern you describe would be Solr caches
> filling up and getting you too close to your JVM or memory limit. For
> example, if you have large documents, and have defined a large document
> cache, that might do it.
>
> I found it useful to point jconsole (free with the JDK) at my JVM, and
> watch the pattern of memory usage. If the troughs at the bottom of the GC
> cycles keep rising, you know you've got something that is continuing to
> grab more memory and not let go of it. Now that our JVM is running
> smoothly, we just see a sawtooth pattern, with the troughs approximately
> level. When the system is under load, the frequency of the wave rises. Try
> it and see what sort of pattern you're getting.
>
> -- Bryan
>
>> -Original Message-
>> From: Matthias Käppler [mailto:matth...@qype.com]
>> Sent: Thursday, February 16, 2012 7:23 AM
>> To: solr-user@lucene.apache.org
>> Subject: Frequent garbage collections after a day of operation
>>
>> Hey everyone,
>>
>> we're running into some operational problems with our SOLR production
>> setup here and were wondering if anyone else is affected or has even
>> solved these problems before. We're running a vanilla SOLR 3.4.0 in
>> several Tomcat 6 instances, so nothing out of the ordinary, but after
>> a day or so of operation we see response times from SOLR increase by up
>> to 3x on average. During this time we see increased CPU
>> load due to heavy garbage collection in the JVM, which bogs down the
>> whole system, so throughput decreases, naturally. When restarting
>> the slaves, everything goes back to normal, but that's more like a
>> brute force solution.
>>
>> The thing is, we don't know what's causing this and we don't have that
>> much experience with Java stacks since we're for most parts a Rails
>> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
>> seeing this, or can you think of a reason for this? Most of our
>> queries to SOLR involve the DismaxHandler and the spatial search query
>> components. We don't use any custom request handlers so far.
>>
>> Thanks in advance,
>> -Matthias
>>
>> --
>> Matthias Käppler
>> Lead Developer API & Mobile
>>
>> Qype GmbH
>> Großer Burstah 50-52
>> 20457 Hamburg
>> Telephone: +49 (0)40 - 219 019 2 - 160
>> Skype: m_kaeppler
>> Email: matth...@qype.com
>>
>> Managing Director: Ian Brotherston
>> Amtsgericht Hamburg
>> HRB 95913
>>
>> This e-mail and its attachments may contain confidential and/or
>> privileged information. If you are not the intended recipient (or have
>> received this e-mail in error) please notify the sender immediately
>> and destroy this e-mail and its attachments. Any unauthorized copying,
>> disclosure or distribution of this e-mail and  its attachments is
>> strictly forbidden. This notice also applies to future messages.
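
As a concrete starting point for the GC tuning and monitoring Bryan
describes above (a sketch only, assuming a Sun/Oracle JDK 6 under Tomcat;
heap sizes and the log path are placeholders to adjust per machine),
switching the slaves to the concurrent collector and turning on GC logging
makes it much easier to see whether the old-generation troughs keep rising:

    -Xms2g -Xmx2g
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
    -XX:+CMSParallelRemarkEnabled
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    -Xloggc:/var/log/tomcat6/gc.log

These options would typically go into JAVA_OPTS or CATALINA_OPTS for the
Tomcat instances. With the log (or jconsole, or jstat -gcutil <pid> 5s) you
can check whether the heap low-water mark after each full collection keeps
creeping up, which would point at something holding on to memory (for
example oversized Solr caches) rather than at the collector itself.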