Re: Can Apache Solr Handle TeraByte Large Data

2015-08-05 Thread Mugeesh Husain
Thank you, Upayavira. I think I have done all these things using SolrJ, which was useful before starting development of the project. I hope I will not run into any issues using SolrJ; I have already got a lot of useful material on it. Thanks, Mugeesh Husain

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-05 Thread Mugeesh Husain
@Mikhail Using the DataImportHandler, if I define my baseDir as D:/work/folder, will it also work for sub-folders, and sub-folders of sub-folders, etc.?
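For reference, DIH's FileListEntityProcessor exposes a recursive attribute for exactly this. A minimal data-config.xml sketch (the path, file pattern, and field mapping are illustrative assumptions, not from this thread):

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:/work/folder" fileName=".*\.pdf"
            recursive="true" rootEntity="true">
      <!-- 'file' is one of the implicit fields FileListEntityProcessor emits -->
      <field column="file" name="id"/>
    </entity>
  </document>
</dataConfig>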

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-05 Thread Upayavira
Post your docs in sets of 1000. Create a List<SolrInputDocument> docs, add 1000 docs to it, then call client.add(docs); clear the list and repeat until your 40m are indexed. Upayavira
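A runnable sketch of that batching loop (SolrJ 5.x; the core URL, ID scheme, and document count are assumptions for illustration):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
        List<SolrInputDocument> docs = new ArrayList<>();
        for (int i = 0; i < 40_000_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);        // in practice, derive this from the filename
            docs.add(doc);
            if (docs.size() == 1000) {             // send every 1000 docs, then reuse the list
                client.add(docs);
                docs.clear();
            }
        }
        if (!docs.isEmpty()) client.add(docs);     // send the remainder
        client.commit();
        client.close();
    }
}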

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-05 Thread Mugeesh Husain
The filesystem holds about 40 million documents, so it will iterate 40 million times. Could SolrJ fail to handle a loop of 40m iterations? (Before indexing I have to split values out of each filename and perform some operations, then index to Solr.) Will it index continuously through all 40m iterations, or do I have to sleep in between some iterations?

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-05 Thread Upayavira
If you are using Java, you will likely find SolrJ the best way - it uses serialised Java objects to communicate with Solr, so you don't need to worry about the wire format. Just use code similar to that earlier in the thread. No XML, no CSV, just simple Java code. Upayavira

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-05 Thread Mugeesh Husain
@Upayavira Thanks, these things are most useful for my understanding. I am thinking I will create XML or CSV files from my requirement using Java, then index them via HTTP POST or bin/post. I am not using DIH because I didn't find any link or idea for how to split the data and add it to Solr one by one.

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-04 Thread Mikhail Khludnev
On Tue, Aug 4, 2015 at 8:10 PM, Mugeesh Husain wrote: > Thank you Erik, I would prefer XML files instead of CSV. > For my requirement, if I want to use DIH for indexing, how could I split > these operations or include Java code in DIH? Here is my favorite way to tweak data in DIH: https://

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-04 Thread Upayavira
On Tue, Aug 4, 2015, at 06:13 PM, Mugeesh Husain wrote: > @Upayavira if I use SolrJ for indexing, will autoCommit or autoSoftCommit > work in the case of SolrJ? There are two ways to get content into Solr: * push it in via an HTTP POST - this is what SolrJ uses, what bin/post uses, and everything else that sends documents in * pull it in from Solr itself, which is what the DataImportHandler does

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-04 Thread Mugeesh Husain
@Upayavira If I use SolrJ for indexing, will autoCommit or autoSoftCommit work in the case of SolrJ?

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-04 Thread Mugeesh Husain
Thank you Erik, I would prefer XML files instead of CSV. For my requirement, if I want to use DIH for indexing, how could I split these operations or include Java code in DIH? I have googled but have not found this type of requirement. Please provide me any link for it, or some suggestion on how to do it.

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-04 Thread Erik Hatcher
If you have data that only consists of id (full filename) and filename (indexed, tokenized), 40M of those will fit comfortably into a single shard provided enough RAM to operate. I know SolrJ is tossed out there a lot as a/the way to index - but if you've got a directory tree of files and want to index just the file names, a simple crawler posting them to Solr would suffice.

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-04 Thread Mugeesh Husain
Thanks @Alexandre, Erickson and Hatcher. I will generate an MD5 ID from the filename using Java. I can do it nicely with SolrJ because I am a Java developer. Apart from this, the question remains that the data is too large; I think it will have to be broken into multiple shards (cores) for multi-core indexing.
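A small sketch of deriving an MD5 hex ID from a filename with the standard JDK (the filename is the example from this thread; zero-padding to 32 hex chars is a design choice so every ID has the same width):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Id {
    public static String md5Hex(String filename) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(filename.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("ARIA_SSN10_0007_LOCATION_129.pdf"));
    }
}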

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-04 Thread Upayavira
Yes, you are right - generally autocommit is the better way. If you are doing a one-off indexing run, then a manual commit may well be the best option, but in the general case autocommit is better. Upayavira

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Konstantin Gribov
Upayavira, manual commit isn't good advice, especially with small batches or single documents, is it? I mostly see recommendations to use autoCommit+autoSoftCommit instead of manual commits.
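For reference, autoCommit and autoSoftCommit are configured server-side in solrconfig.xml. A sketch of the combination recommended here (the intervals are illustrative assumptions, not values from this thread):

<!-- Hard commit: flush to disk regularly without opening a new searcher -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- Soft commit: make new documents visible to searches more frequently -->
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>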

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Upayavira
SolrJ is just a "SolrClient". In pseudocode, you say: SolrClient client = new SolrClient("http://localhost:8983/solr/whatever"); List<SolrInputDocument> docs = new ArrayList<>(); SolrInputDocument doc = new SolrInputDocument(); doc.addField("id", "abc123"); doc.addField("some-text-field", "I like it when the sun shines");
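A self-contained completion of that pseudocode (SolrJ 5.x; the final add/commit steps are an assumption about how the truncated message continued):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndex {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/whatever");
        List<SolrInputDocument> docs = new ArrayList<>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "abc123");
        doc.addField("some-text-field", "I like it when the sun shines");
        docs.add(doc);
        client.add(docs);   // send the batch to Solr
        client.commit();    // one-off manual commit; see the autoCommit discussion above
        client.close();
    }
}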

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
Well, if it is just file names, I'd probably use a SolrJ client, maybe with Java 8. Read file names, split each name into parts with regular expressions, stuff the parts into different fields and send them to Solr. Java 8 has FileSystem walkers, etc., to make it easier. You could do it with DIH, but it would be more awkward.
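A sketch of that approach - Java 8's Files.walk plus an underscore split, batched into SolrJ (the directory, core URL, and *_s dynamic field names are assumptions):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FileNameIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
        List<SolrInputDocument> docs = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(Paths.get("D:/work/folder"))) {
            for (Path p : (Iterable<Path>) paths.filter(Files::isRegularFile)::iterator) {
                String name = p.getFileName().toString();       // e.g. ARIA_SSN10_0007_LOCATION_129.pdf
                String[] parts = name.replaceAll("\\.pdf$", "").split("_");
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", name);                       // or an MD5 of the name, as discussed above
                for (int i = 0; i < parts.length; i++) {
                    doc.addField("part_" + i + "_s", parts[i]); // dynamic string fields (assumption)
                }
                docs.add(doc);
                if (docs.size() == 1000) { client.add(docs); docs.clear(); }
            }
        }
        if (!docs.isEmpty()) client.add(docs);
        client.commit();
        client.close();
    }
}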

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
@Alexandre No, I don't need the content of the files. I am repeating my requirement: I have 40 million files stored in a filesystem, with filenames saved like ARIA_SSN10_0007_LOCATION_129.pdf. I just split out all the values from the filename only; these values I have to index.

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
@Erik Hatcher You mean I have to use SolrJ for indexing, right? Can SolrJ handle the large amount of data I mentioned in my previous post? If I use DIH, then how will I split the values from the filename, etc.? I want to start my development in the right direction; that is why I am a little confused.

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
Just to reconfirm, are you indexing file content? Because if you are, you need to be aware that most PDFs do not extract well, as they do not have text flow preserved. If you are indexing PDF files, I would run a sample through Tika directly (that's what Solr uses under the covers anyway) and see what the extracted text looks like.
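A minimal sketch of that check using Tika's AutoDetectParser directly (the file path is illustrative; -1 disables the default content-length limit):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        }
        System.out.println(handler.toString());  // inspect how well the text flow survived
    }
}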

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erick Erickson
Ahhh, listen to Hatcher if you're not indexing the _contents_ of the files, just the filenames. Erick

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erik Hatcher
Most definitely yes, given your criteria below. If you don't care for the text to be parsed and indexed within the files, a simple file system crawler that just got the directory listings and posted the file names, split as you'd like, to Solr would suffice, it sounds like. — Erik Hatcher

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erick Erickson
I'd go with SolrJ personally. For a terabyte of data that (I'm inferring) are PDF files and the like (aka "semi-structured documents"), you'll need to have Tika parse out the data you need to index. And doing that through posting or DIH puts all the analysis on the Solr servers, which will work, but doing the Tika parsing on the client side with SolrJ spreads that load out.

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
Hi Alexandre, I have 40 million files stored in a filesystem, with each filename saved like ARIA_SSN10_0007_LOCATION_129.pdf. 1.) I have to split all the underscore-separated values out of each filename, and these values have to be indexed into Solr (see the sketch after this message). 2.) I do not need the file contents (text) to be indexed.
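The splitting itself is a one-liner in Java; a tiny sketch using the filename above:

public class SplitDemo {
    public static void main(String[] args) {
        String name = "ARIA_SSN10_0007_LOCATION_129.pdf";
        // Drop the extension, then split on underscores
        String[] parts = name.substring(0, name.lastIndexOf('.')).split("_");
        for (String part : parts) {
            System.out.println(part);  // ARIA, SSN10, 0007, LOCATION, 129
        }
    }
}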

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
That's still a VERY open question. The answer is yes, but the details depend on the shape and source of your data, and the searches you are anticipating. Is this a lot of entries with a small number of fields, or a - relatively - small number of entries with huge field counts? Do you need to store/retrieve the content as well as search it?

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
Hi, I am new to Solr development and have the same requirement. With the help of googling I have already gained some knowledge, such as how many shards have to be created for such an amount of data. I want some suggestions: there are so many methods to do indexing, such as DIH, HTTP post, and SolrJ. Please suggest which I should use.

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-16 Thread Memory Makers
I've been toying with the idea of setting up an experiment to index a large document set 1+ TB -- any thoughts on an open data set that one could use for this purpose? Thanks.

RE: Can Apache Solr Handle TeraByte Large Data

2012-01-16 Thread Burton-West, Tom
Hello, Searching real-time sounds difficult with that amount of data. With large documents, 3 million documents, and 5TB of data, the index will be very large. With indexes that large, your performance will probably be I/O bound. Do you plan on allowing phrase or proximity searches? If so, the position data will add considerably to the index size.

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-16 Thread Otis Gospodnetic
Hello, > From: mustafozbek > All documents that we use are rich text documents and we parse them with Tika. We need to search real time. Because of the real-time requirement, you'll need to use an unreleased/dev version of Solr.

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-14 Thread Otis Gospodnetic
Hello, inline. - Original Message - > From: mustafozbek > I am an Apache Solr user of about a year. I used Solr for simple search tools, but now I want to use Solr with 5TB of data. I assume that the 5TB of data will be 7TB when Solr indexes it, according to the filters that I use.

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-13 Thread Robert Stewart
Any idea how many documents your 5TB data contains? Certain features such as faceting depend more on the # of total documents than on the actual size of the data. I have tested approx. 1 TB (100 million documents) running on a single machine (40 cores, 128 GB RAM), using distributed search across 10 shards.

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-13 Thread darren
Maybe also have a look at these links. http://www.hathitrust.org/blogs/large-scale-search/performance-5-million-volumes http://www.hathitrust.org/blogs/large-scale-search

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-13 Thread Daniel Brügge
Hi, it's definitely a problem to store 5TB in Solr without using sharding. I try to split the data over Solr instances so that the index will fit in memory on the server. I ran into trouble with a Solr instance using a 50G index. Daniel