Hi Ali,

the sizing is not determined by the number of indexed documents alone (and even less by the number of concurrent users).

- Document volume (number of documents, amount of text data to be
  indexed with each document, number and types of fields, the
  cardinality of fields) guides you to the number of primary shards or
  collections you want to have in your environment.
- Query volume determines the replication factor needed to achieve
  acceptable response times.
- The amount of concurrency (e.g., do you have primarily insertions of
  new documents and then queries, or is there also a significant
  deletion process running in parallel; partial updates count as
  deletion+insertion) and the frequency of required index updates also
  influence the sizing.
- Usually, processing (document to text, extractions, enrichment, ...)
  will be handled outside Solr (and has to be taken into account for
  the hardware scaling of the entire platform).

Some figures you may want to know before tackling this project:

- Are there different types of documents (e.g., text, media, data) to
  be handled that yield different amounts of text for indexing (e.g.,
  plain text ~100%, HTML ~90%, Microsoft Word ~15%, PDF ~10%, ...)?
- What are the size distributions (possibly broken down by these types
  of documents)?
- What is the expected update frequency? Can you do incremental
  crawling?
- What types of attributes and facets are you planning to have for
  these documents?
- How fresh an index do you need?
- Is this concurrent indexing and querying, or will indexing happen,
  e.g., at night, while users query the platform during the day?
- What are the typical types of queries issued by users?
- Will you have to take security into account (possibly leading to
  large Boolean expressions added to queries to filter by entitlement
  groups)?

This will point you in a first direction. Then run a prototype to
measure representative figures for scaling and make your estimates.

Best regards,
--Jürgen

On 04.01.2015 15:36, Ali Nazemian wrote:
> Hi,
> I was wondering what is the hardware requirement for indexing 500 million
> documents in Solr? Suppose maximum number of concurrent users in peak time
> would be 20.
> Thank you very much.
>

--
Kind regards
i.A. Jürgen Wagner
Head of Competence Center "Intelligence" & Senior Cloud Consultant
Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
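
P.S. To make the arithmetic behind a first estimate concrete, here is a minimal Python sketch. It applies the approximate text shares mentioned above (plain text ~100%, HTML ~90%, Microsoft Word ~15%, PDF ~10%) to a document mix. Everything else is an assumption for illustration only: the function names, the mix, the average file sizes, and in particular the documents-per-shard figure, which is a placeholder and not a recommendation. The result is the volume of text handed to the indexer, not the size of the index on disk.

```python
import math

# Approximate share of a file's size that ends up as indexable text,
# taken from the figures quoted in the message above.
TEXT_RATIO = {"text": 1.00, "html": 0.90, "word": 0.15, "pdf": 0.10}


def indexable_text_bytes(mix):
    """mix maps document type -> (document count, average file size in bytes)."""
    return sum(count * avg_size * TEXT_RATIO[doc_type]
               for doc_type, (count, avg_size) in mix.items())


def primary_shards(total_docs, docs_per_shard):
    """Number of shards needed if each holds at most docs_per_shard documents."""
    return math.ceil(total_docs / docs_per_shard)


# Hypothetical mix adding up to 500 million documents; the sizes are made up.
mix = {
    "text": (100_000_000, 4_000),
    "html": (200_000_000, 30_000),
    "word": (100_000_000, 200_000),
    "pdf": (100_000_000, 500_000),
}

total_docs = sum(count for count, _ in mix.values())
text_bytes = indexable_text_bytes(mix)
print(f"{total_docs:,} documents, ~{text_bytes / 1e12:.1f} TB of indexable text")

# docs_per_shard is a placeholder; replace it with what the prototype measures.
print("primary shards:", primary_shards(total_docs, docs_per_shard=50_000_000))
```

The useful part is not the numbers but the shape of the calculation: once the prototype gives you a measured documents-per-shard figure and real size distributions, you plug them in here and get a first shard count to validate.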