Re: Ideas

2015-09-21 Thread DVT
Hi Bill, the classical way would be to have a reverse proxy in front of the application that catches such cases. A decent reverse proxy or even application firewall router will allow you to define limits on bandwidth and sessions per time unit. Some even recognize specific denial-of-service patte

Re: OT: is Heliosearch discontinued?

2015-11-26 Thread DVT
https://github.com/Heliosearch/heliosearch Last commit a year ago... that tells me something :-) heliosearch.com and heliosearch.org go to standard GoDaddy pages. Heliosearch was a fork that has apparently been dormant for a year already. Cheers, --Jürgen On 26.11.2015 14:26, Bernd Fehlin

Re: Unstructured/Structured data for indexing

2015-12-09 Thread DVT
Subin, Only the envelope is structured. What's inside the individual fields of the structure may be single values (possibly considered structured meta-data) or unstructured (like free text or other fields with informal semantics). Even if you pass a 5-hour video as a major case of unstructured d

Re: solr error

2016-08-01 Thread DVT
Abhishek, given the vast amount of information you write, I suspect this is not an HTTP error code (those are three digits, and the ones starting with 200 actually indicate a success), but rather a libcurl error code. Check against this list to find out whether that's an explanation: https://curl

Re: Deploying multiple ZooKeeper ensemble on a single machine

2015-04-08 Thread Jürgen Wagner (DVT)
To be precise: create one zoo.cfg for each of the instances. One config file for all is a bad idea. In each config file, use the same server.X lines, but use a unique clientPort. As you will also have separate data directories, I would recommend having one root directory .../zookeeper where you c
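As a rough sketch of the advice above (one config file per instance, identical server.X lines, a unique clientPort and a separate data directory each), the files could be generated like this. The root directory, the port numbers and the tickTime/initLimit/syncLimit values are illustrative assumptions, not taken from the original message.

```python
def server_lines(instance_count):
    # Same server.X lines in every config file. On a single machine the
    # quorum and election ports must differ per instance (assumed ports).
    return [
        f"server.{i}=localhost:{2888 + i}:{3888 + i}"
        for i in range(1, instance_count + 1)
    ]


def zoo_cfg(instance, instance_count, root="/tmp/zookeeper"):
    # root is a hypothetical directory; each instance gets its own dataDir
    # below it and its own clientPort (2181, 2182, ... assumed here).
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        f"dataDir={root}/instance{instance}/data",
        f"clientPort={2180 + instance}",
    ]
    lines += server_lines(instance_count)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"# zoo.cfg for instance {n}")
        print(zoo_cfg(n, 3))
```

Each instance additionally needs a myid file in its data directory containing its own server number.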

Re: Replication for SolrCloud

2015-04-18 Thread Jürgen Wagner (DVT)
Replication on the storage layer will provide a reliable storage for the index and other data of Solr. In particular, this replication does not guarantee your index files are consistent at any time as there may be intermediate states that are only partially replicated. Replication is only a converg

Re: How can I set shard members?

2014-09-02 Thread Jürgen Wagner (DVT)
Hello, have you tried the "createNodeSet" option of collection/shard creation and the "node" option of replica creation in Solr 4.9.0+? As you're just testing, I would strongly recommend going to the latest version. https://cwiki.apache.org/confluence/display/solr/Collections+API This is useful
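A minimal sketch of what a Collections API call with "createNodeSet" might look like when built by hand. The base URL, collection name and node names are made-up examples; node names are assumed to use the host:port_context form that SolrCloud shows for live nodes.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, nodes, replication_factor=1):
    # createNodeSet takes a comma-separated list of node names and
    # restricts the new collection to those nodes.
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "createNodeSet": ",".join(nodes),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical hosts and collection name, for illustration only.
    print(create_collection_url(
        "http://localhost:8983/solr",
        "test_collection",
        2,
        ["host1:8983_solr", "host2:8983_solr"],
    ))
```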

Re: Create collection dynamically in my program

2014-09-03 Thread Jürgen Wagner (DVT)
Hello Xinwu, does it change anything if you use an underline instead of the dash in the collection name? What is the result of the call? Any status or error message? Did you actually feed data into the collection? Cheers, --Jürgen On 03.09.2014 11:21, xinwu wrote: > Hi , all: > I crea

FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Hello all, as the migration from FAST to Solr is a relevant topic for several of our customers, there is one issue that does not seem to be addressed by Lucene/Solr: document vectors FAST-style. These document vectors are used to form metrics of similarity, i.e., they may be used as a "semantic f

Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
/solr/The+Term+Vector+Component > And just to show some impressive search functionality of the wiki: ;) > https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors > > Cheers, > Jim > > > 2014-09-05 9:44 GMT+02:00 &

Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Thanks for posting this. I was just about to send off a message of similar content :-) Important to add: - In FAST ESP, you could have more than one such docvector associated with a document, in order to reflect different metrics. - Term weights in docvectors are document-relative, not absolute.

Re: Performance of Unsorted Queries

2014-09-16 Thread Jürgen Wagner (DVT)
Depending on the size of the individual records returned, I'd use a decent size window (to minimize network and marshalling/unmarshalling overhead) of maybe 1000-1 items sorted by id, and use that in combination with cursorMark. That will be easier on the server side in terms of garbage collect
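The cursorMark approach can be sketched as a simple loop. The fetch argument is a hypothetical callable that sends the given parameters to Solr's select handler and returns the parsed JSON response; the window size of 1000 and the assumption that id is the unique key are illustrative.

```python
def iterate_all(fetch, query="*:*", rows=1000, sort="id asc"):
    # cursorMark requires a sort that includes the unique key field.
    cursor = "*"  # initial cursor value
    while True:
        response = fetch({
            "q": query,
            "sort": sort,
            "rows": rows,
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # An unchanged cursor means there are no more results.
            break
        cursor = next_cursor
```

Because the function is a generator, the client only ever holds one window of results in memory at a time.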

Re: Access solr cloud via ssh tunnel?

2014-09-16 Thread Jürgen Wagner (DVT)
In a test scenario, I used stunnel for connections between some zookeeper observers and the central ensemble, as well as between a SolrJ 4.9.0 client and the central zookeepers. This is entirely transparent modulo performance penalties due to network latency and ssl overhead. I finally ended up wit

Re: Frequent recovery of nodes in SolrCloud

2014-10-16 Thread Jürgen Wagner (DVT)
Hello, you have one shard and 11 replicas? Hmm... - Why do you have to keep two nodes on some machines? - Physical hardware or virtual machines? - What is the size of this index? - Is this all on a local network or are there links with potential outages or failures in between? - What is the query l

Re: issue in launching SolrCloud windows/cygwin

2014-10-19 Thread Jürgen Wagner (DVT)
Hello Anurag, the CRLF problem with Cygwin can be cured by running the scripts all through this filter: tr -d '\r' < $script > $script.new ; mv $script.new $script with $script holding the path of the script to be massaged. Generally, however, I would advise using the standard scripts only fo

Re: CoreAdminRequest in SolrCloud

2014-10-20 Thread Jürgen Wagner (DVT)
Hello Nabil, isn't that what should be expected? Cores are local to nodes, so you only get the core status from the node you're asking. Cluster status refers to the entire SolrCloud cluster, so you will get the status over all collection/nodes/shards[=cores]. Check the Core Admin REST interface f

Re: CoreAdminRequest in SolrCloud

2014-10-20 Thread Jürgen Wagner (DVT)
s you can see, I'm not using direct connection to node. It's a CloudServer. > Do you have example to how to get Cluster status from solrJ. > > Regards, > Nabil. > > > Le Lundi 20 octobre 2014 13h44, Jürgen Wagner (DVT) > a écrit : > > > > Hello N

Re: Indexing documents/files for production use

2014-10-28 Thread Jürgen Wagner (DVT)
Hello Olivier, for real production use, you won't really want to use any toys like post.jar or curl. You want a decent connector to whatever data source there is, that fetches data, possibly massages it a bit, and then feeds it into Solr - by means of SolrJ or directly into the web service of Sol

Re: Consul instead of ZooKeeper anyone?

2014-11-01 Thread Jürgen Wagner (DVT)
Hello Greg, Consul and Zookeeper are quite similar in their offering with respect to what SolrCloud needs. Service discovery, watches on distributed cluster state, updates of configuration could all be handled through Consul. Plus, Consul does offer built-in capabilities for multi-datacenter sc

Re: Consul instead of ZooKeeper anyone?

2014-11-04 Thread Jürgen Wagner (DVT)
Hello Greg, we run Zookeeper not on dedicated Zookeeper machines, but rather on admin nodes in search application clusters (that makes two instances), plus on at least one more node that does not have much load (e.g., a crawling node). Also, as long as you don't stuff too much data into Zookeeper

Re: Best Practices for open source pipeline/connectors

2014-11-04 Thread Jürgen Wagner (DVT)
Hello Dan, ManifoldCF is a connector framework, not a processing framework. Therefore, you may try your own lightweight connectors (which usually are not really rocket science and may take less time to write than time to configure a super-generic connector of some sort), any connector out there (

Re: One ZooKeeper and many Solr clouds

2014-11-14 Thread Jürgen Wagner (DVT)
Hello Enrico, you may use the chroot feature of Zookeeper to root the different SolrCloud instances differently. Instead of zoohost1:2181, you can use zoohost1:2181/cluster1 as the Zookeeper location. Unless there is a load issue with high rates of updates and other data traffic, a single Zookeep
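For illustration, a small helper that builds such a Zookeeper location string. With several Zookeeper hosts, the chroot suffix is appended once, after the last host. The host and cluster names follow the example in the message; the helper itself is hypothetical.

```python
def zk_host(hosts, chroot):
    # The chroot path is appended once, after the full host list.
    if not chroot.startswith("/"):
        chroot = "/" + chroot
    return ",".join(hosts) + chroot


if __name__ == "__main__":
    print(zk_host(["zoohost1:2181"], "cluster1"))
    print(zk_host(["zoohost1:2181", "zoohost2:2181"], "/cluster2"))
```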

Re: Restrict search to subset (a list of approx 40,000 ids from an external service) of corpus

2014-11-14 Thread Jürgen Wagner (DVT)
Hi guy, there's not much of a search operation here. Why not store the documents in a key/value store and simply fetch them by matching ids? Another approach: as there is no query, you could easily partition the set of ids and fetch the results in multiple batches. The maximum number of clause
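The batching idea could look roughly like this. The batch size of 1000 is an assumption chosen to stay below Solr's default maxBooleanClauses limit of 1024, and the ids are assumed to contain no characters that need escaping in the query syntax.

```python
def id_batches(ids, batch_size=1000):
    # Split the id list into consecutive slices of at most batch_size.
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def id_query(batch, field="id"):
    # One boolean query per batch; assumes ids need no escaping.
    return f"{field}:(" + " OR ".join(batch) + ")"


if __name__ == "__main__":
    ids = [f"doc{i}" for i in range(40000)]
    queries = [id_query(batch) for batch in id_batches(ids)]
    print(len(queries), "queries")
```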

Re: Solr HTTP client authentication

2014-11-17 Thread Jürgen Wagner (DVT)
Why rely on the default http client? Why not create one with HttpClients.custom() .setDefaultSocketConfig(socketConfig) .setDefaultRequestConfig(requestConfig) .setSSLSocketFactory(sslsf) .build(); that has the SSLConnectionSocketFactory property set up with an SSL

Re: Hardware requirement for 500 million documents

2015-01-04 Thread Jürgen Wagner (DVT)
Hi Ali, the sizing is not just determined by the number of indexed documents (and even less by the number of concurrent users). - Document volume (number of documents, amount of text data to be indexed with each document, number and types of fields, the cardinality of fields) guide you to the n

Re: PDF search functionality using Solr

2015-01-06 Thread Jürgen Wagner (DVT)
Hello, no matter which search platform you will use, this will pose two challenges: - The size of the documents will render search less and less useful as the likelihood of matches increases with document size. So, without a proper semantic extraction (e.g., using decent NER or relationship extr

Re: Frequent deletions

2015-01-11 Thread Jürgen Wagner (DVT)
Maybe you should consider creating different generations of indexes and not keep everything in one index. If the likelihood of documents being deleted is rather high in, e.g., the first week or so, you could have one index for the high-probability of deletion documents (the fresh ones) and a second

Re: Using SolrCloud to implement a kind of federated search

2015-01-20 Thread Jürgen Wagner (DVT)
Hello Charlie, theoretically, things may work as you describe them. A few big HOWEVERs exist as far as I can see: 1. Attributes: as different organisations may use different schemata (document attributes), the consolidation of results from multiple sources may present a problem. This may not ari