Re: Is there a way to create multiple <document> using DIH and access the data pertaining to a particular <document>?
(10/11/11 1:57), bbarani wrote:
> Hi,
>
> I have a peculiar situation where we are trying to use SOLR for indexing
> multiple tables (there is no relation between these tables). We are
> trying to use the SOLR index instead of the source tables, and hence we
> are trying to create the SOLR index to mirror the source tables.
>
> There are 3 tables which need to be indexed: table 1, table 2, and
> table 3.
>
> I am trying to index each table in a separate doc tag with a different
> doc tag name, and the tables share some common field names.

Barani,

You cannot have multiple documents in a data-config, but you can have
multiple entities in a document. And if your tables 1, 2, and 3 come from
different dataSources, you can have multiple data sources in a
data-config. In that case, use the dataSource attribute of the entity
element to refer to the name of the dataSource.

Koji
--
http://www.rondhuit.com/en/
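A sketch of the kind of data-config Koji describes, with multiple named
dataSources and one entity per table (all driver, connection, table, and
field names here are invented for illustration):

  <dataConfig>
    <dataSource name="ds1" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://host1/db1" user="u" password="p"/>
    <dataSource name="ds2" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://host2/db2" user="u" password="p"/>
    <document>
      <!-- each entity names its source via the dataSource attribute -->
      <entity name="table1" dataSource="ds1"
              query="select id, name from table1"/>
      <entity name="table2" dataSource="ds2"
              query="select id, name from table2"/>
    </document>
  </dataConfig>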
Re: Is there a way to create multiple <document> using DIH and access the data pertaining to a particular <document>?
Just curious, do these tables have the same schema, like a set of shards
would? If not, how do you map them to the index?

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a
better idea to learn from others’ mistakes, so you do not have to make
them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
EARTH has a Right To Life, otherwise we all die.

----- Original Message -----
From: Koji Sekiguchi
To: solr-user@lucene.apache.org
Sent: Sat, December 18, 2010 5:19:08 AM
Subject: Re: Is there a way to create multiple <document> using DIH and
access the data pertaining to a particular <document>?

> You cannot have multiple documents in a data-config, but you can have
> multiple entities in a document. [...]
RE: Memory use during merges (OOM)
Thanks Robert,

We will try the termIndexInterval as a workaround. I have also opened a
JIRA issue: https://issues.apache.org/jira/browse/SOLR-2290. Hope I found
the right sections of the Lucene code. I'm just now in the process of
looking at the Solr IndexReaderFactory, SolrIndexWriter, and
SolrIndexConfig, trying to better understand how solrconfig.xml gets
instantiated and how it affects the readers and writers.

Tom

From: Robert Muir [rcm...@gmail.com]

On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom wrote:
>>> Your setting isn't being applied to the reader IW uses during
>>> merging... it's only for readers Solr opens from directories
>>> explicitly. I think you should open a jira issue!
>
> Do I understand correctly that this setting in theory could be applied
> to the reader IW uses during merging but is not currently being applied?

Yes. I'm not really sure (especially given the "name=") if you can have,
or it was planned to have, multiple IR factories in Solr, e.g. a separate
one for spellchecking. So I'm not sure if we should (hackishly) steal this
parameter from the IR factory (it is common to all IR factories, not just
StandardIndexReaderFactory) and apply it to IW... but we could at least
expose the divisor param separately to the IW config so you have some way
of setting it.

>   <indexReaderFactory name="IndexReaderFactory"
>       class="org.apache.solr.core.StandardIndexReaderFactory">
>     <int name="setTermIndexDivisor">8</int>
>   </indexReaderFactory>
>
> I understand the tradeoffs for doing this during searching, but not the
> trade-offs for doing this during merging. Is the use during merging
> similar to the use during searching? i.e. Some process has to look up
> data for a particular term, as opposed to having to iterate through all
> the terms? (Haven't yet dug into the merging/indexing code.)

It needs it for applying deletes... As a workaround (if you are
reindexing), maybe instead of using a terms index divisor of 8 you could
set the terms index interval to 1024 (8 * 128)? This will solve your
merging problem, and have the same perf characteristics of divisor=8,
except you can't "go back down" like you can with the divisor without
reindexing with a smaller interval... If you've already tested that
performance with the divisor of 8 is acceptable, or in your case maybe
necessary, it sort of makes sense to 'bake it in' by setting your divisor
back to 1 and your interval to 1024 instead.
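For reference, a sketch of where that interval workaround would go in
solrconfig.xml (placement follows the stock example config of that era;
the value is the one from the thread):

  <indexDefaults>
    <!-- default is 128; 1024 = 8 * 128 bakes in roughly the memory
         profile of divisor=8, but cannot be lowered again without
         reindexing -->
    <termIndexInterval>1024</termIndexInterval>
  </indexDefaults>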
Re: how to config DataImport Scheduling
I think it should work with any version of Solr, because it works over the
URL (see the config file). Pay attention to this point: "Successfully
tested on Apache Tomcat v6 (should work on any other servlet container)".

From: Ahmet Arslan
To: solr-user@lucene.apache.org
Sent: Fri, December 17, 2010 3:22:37 AM
Subject: Re: how to config DataImport Scheduling

> I also have the same problem. I configured the dataimport.properties
> file as shown in
> http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example
> but no change occurs. Can anyone help me?

What version of Solr are you using? This seems to be a new feature, so it
won't work on Solr 1.4.1.
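For readers who cannot reach the wiki page: the scheduler described there
is driven by a dataimport.properties along these lines. This is a sketch
reconstructed from the wiki example; the values (cores, host, interval)
are illustrative only:

  # 1 - sync active; anything else - inactive
  syncEnabled=1
  # which cores to schedule (leave empty for a single-core setup)
  syncCores=core1,core2
  # Solr server name or IP address, port, and webapp name
  server=localhost
  port=8983
  webapp=solr
  # URL params passed to the import handler
  params=/select?qt=/dataimport&command=delta-import&clean=false&commit=true
  # schedule interval in minutes
  interval=30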
Re: Is there a way to create multiple <document> using DIH and access the data pertaining to a particular <document>?
You can have multiple documents generated by the same data-config. It's
the rootEntity="false" attribute that makes each child entity generate
documents.

On Sat, Dec 18, 2010 at 7:43 AM, Dennis Gearon wrote:
> Just curious, do these tables have the same schema, like a set of
> shards would? If not, how do you map them to the index?

--
Lance Norskog
goks...@gmail.com
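A sketch of a data-config along these lines, with rootEntity="false" on a
wrapper entity (all table, field, and connection names are invented):

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://host/db"/>
    <document>
      <!-- rootEntity="false": rows from the child entities, not this
           wrapper, become Solr documents -->
      <entity name="tables" rootEntity="false" query="select 1">
        <entity name="table1" query="select id, name from table1"/>
        <entity name="table2" query="select id, name from table2"/>
        <entity name="table3" query="select id, name from table3"/>
      </entity>
    </document>
  </dataConfig>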
old index files not deleted on slave
I have set up index replication (triggered on optimize). The problem I am
having is that the old index files are not being deleted on the slave.
After each replication, I can see the old files still hanging around as
well as the files that have just been pulled. This causes the data
directory size to increase by the index size on every replication until
the disk fills up.

Checking the logs, I see the following error:

SEVERE: SnapPull failed
org.apache.solr.common.SolrException: Index fetch failed :
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1065)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:954)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
        at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
        ... 11 more

lsof reveals that the file is still opened by the java process.

I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup is
pretty vanilla. The OS is Linux, the indexes are on local directories,
write permissions look OK, and nothing is unusual in the config (default
deletion policy, etc.).
Contents of the index data dir:

master:
-rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
-rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
-rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
-rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
-rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
-rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
-rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01:12 _24e.fdt
-rw-rw-r-- 1 feeddo feeddo 483M Dec 18 01:23 _24e.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 18 01:23 _24e.tii
-rw-rw-r-- 1 feeddo feeddo 146M Dec 18 01:23 _24e.prx
-rw-rw-r-- 1 feeddo feeddo 283M Dec 18 01:23 _24e.frq
-rw-rw-r-- 1 feeddo feeddo  311 Dec 18 01:24 segments_1xz
-rw-rw-r-- 1 feeddo feeddo  23M Dec 18 01:24 _24e.nrm
-rw-rw-r-- 1 feeddo feeddo  191 Dec 18 13:15 _25z.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 18 13:16 _25z.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 13:16 _25z.fdt
-rw-rw-r-- 1 feeddo feeddo 484M Dec 18 13:35 _25z.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 18 13:35 _25z.tii
-rw-rw-r-- 1 feeddo feeddo 146M Dec 18 13:35 _25z.prx
-rw-rw-r-- 1 feeddo feeddo 284M Dec 18 13:35 _25z.frq
-rw-rw-r-- 1 feeddo feeddo   20 Dec 18 13:35 segments.gen
-rw-rw-r-- 1 feeddo feeddo  311 Dec 18 13:35 segments_1y1
-rw-rw-r-- 1 feeddo feeddo  23M Dec 18 13:35 _25z.nrm

slave:
-rw-rw-r-- 1 feeddo feeddo   20 Dec 13 17:54 segments.gen
-rw-rw-r-- 1 feeddo feeddo  191 Dec 15 01:07 _1mk.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 15 01:08 _1mk.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 15 01:08 _1mk.fdt
-rw-rw-r-- 1 feeddo feeddo 476M Dec 15 01:18 _1mk.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 15 01:18 _1mk.tii
-rw-rw-r-- 1 feeddo feeddo 144M Dec 15 01:18 _1mk.prx
-rw-rw-r-- 1 feeddo feeddo 278M Dec 15 01:18 _1mk.frq
-rw-rw-r-- 1 feeddo feeddo  312 Dec 15 01:18 segments_1kj
-rw-rw-r-- 1 feeddo feeddo  23M Dec 15 01:18 _1mk.nrm
-rw-rw-r-- 1 feeddo feeddo
Re: Is there a way to create multiple <document> using DIH and access the data pertaining to a particular <document>?
And, a use case: Tika blows up on some files, but we still want the other
data like file name etc., and an empty text field. So: both documents get
the same unique id. If the Tika auto-parser reads the PDF and the PDF
works, the second document overwrites the first. If the PDF blows up, the
second document is skipped and the first document goes in. Ugly, yes, but
a testament to the maturity of DIH that it had enough tools to work around
a Tika weakness.

Oh, and the AutoParser does not work: SOLR-2116:
https://issues.apache.org/jira/browse/SOLR-2116

In my previous example, the innermost entities below should be <entity>,
not <document>. Sorry for any confusion.

On Sat, Dec 18, 2010 at 4:22 PM, Lance Norskog wrote:
> You can have multiple documents generated by the same data-config.
> It's the rootEntity="false" attribute that makes each child entity
> generate documents.

--
Lance Norskog
goks...@gmail.com
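A rough sketch of the shape such a data-config might take. Every name,
path, and attribute choice here is invented, and version-specific details
(for example the TikaEntityProcessor attributes) may need adjusting:

  <dataConfig>
    <dataSource name="bin" type="BinFileDataSource"/>
    <document>
      <!-- pass 1: one document per file, metadata only (the real config
           would also add an empty text field) -->
      <entity name="meta" processor="FileListEntityProcessor"
              baseDir="/data/docs" fileName=".*\.pdf" rootEntity="true"
              transformer="TemplateTransformer">
        <field column="id" template="${meta.fileAbsolutePath}"/>
      </entity>
      <!-- pass 2: same ids. If Tika parses the PDF, this document
           overwrites the pass-1 document; if Tika blows up,
           onError="skip" drops the row and the pass-1 document
           survives. -->
      <entity name="files" processor="FileListEntityProcessor"
              baseDir="/data/docs" fileName=".*\.pdf" rootEntity="false"
              transformer="TemplateTransformer">
        <field column="id" template="${files.fileAbsolutePath}"/>
        <entity name="tika" processor="TikaEntityProcessor"
                dataSource="bin" url="${files.fileAbsolutePath}"
                format="text" onError="skip">
          <field column="text" name="text"/>
        </entity>
      </entity>
    </document>
  </dataConfig>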
Re: old index files not deleted on slave
This could be a quirk of the native locking feature. What's the file
system? Can you fsck it?

If this error keeps happening, please file a JIRA issue. It should not
happen. Add the text above, and also your solrconfigs if you can.

One thing you could try is to change from the native locking policy to
the simple locking policy, but only on the slave.

On Sat, Dec 18, 2010 at 4:44 PM, feedly team wrote:
> I have set up index replication (triggered on optimize). The problem I
> am having is that the old index files are not being deleted on the
> slave. [...]
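A sketch of that lockType switch as it might look in the slave's
solrconfig.xml (placement follows the stock example config):

  <mainIndex>
    <!-- "native" maps to NativeFSLockFactory; "simple" uses a plain
         lock file via SimpleFSLockFactory -->
    <lockType>simple</lockType>
  </mainIndex>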
DIH for sharded database?
I have a table that is broken up into many virtual shards. So basically I
have N identical tables:

Document1
Document2
.
.
Document36

Currently these tables all live in the same database, but in the future
they may be moved to different servers to scale out if the need arises.

Is there any way to configure a DIH for these tables so that it will
automatically loop through the 36 identical tables and pull data out for
indexing? Something like (pseudocode):

for (i = 1; i <= 36; i++) {
    ## retrieve data from the table Document{$i} and index the data
}

What's the best way to handle a situation like this?

Thanks
Re: DIH for sharded database?
You can have a file with 1, 2, 3... on separate lines. There is a
line-by-line file reader (LineEntityProcessor) that can pull these in as
separate drivers. Inside that entity, the JDBC URL has to be altered with
the incoming numbers. I don't know if this will work. It may also work
for single-threaded DIH but not with multiple threads. (Ignore this for
Solr 1.4; it has no threads feature.)

On Sat, Dec 18, 2010 at 6:20 PM, Andy wrote:
> I have a table that is broken up into many virtual shards. So basically
> I have N identical tables: Document1, Document2, ... Document36.
> Is there any way to configure a DIH for these tables so that it will
> automatically loop through the 36 identical tables and pull data out
> for indexing? What's the best way to handle a situation like this?

--
Lance Norskog
goks...@gmail.com
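A sketch of that approach (file, table, and connection names invented).
LineEntityProcessor reads shards.txt through a FileDataSource and exposes
each line as ${shard.rawLine}, which here is substituted into the child
entity's SQL query (the variant the follow-up message asks about) rather
than into the JDBC URL:

  <dataConfig>
    <dataSource name="db" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb"/>
    <dataSource name="shardfile" type="FileDataSource"/>
    <document>
      <!-- shards.txt holds one shard number per line: 1, 2, ... 36 -->
      <entity name="shard" dataSource="shardfile"
              processor="LineEntityProcessor" url="shards.txt"
              rootEntity="false">
        <entity name="doc" dataSource="db"
                query="select * from Document${shard.rawLine}"/>
      </entity>
    </document>
  </dataConfig>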
Re: DIH for sharded database?
--- On Sat, 12/18/10, Lance Norskog wrote:
> You can have a file with 1, 2, 3 on separate lines. There is a
> line-by-line file reader that can pull these in as separate drivers.
> Inside that entity, the JDBC URL has to be altered with the incoming
> numbers. I don't know if this will work.

I'm not sure I understand. How will altering the JDBC URL change the name
of the table it is importing data from? Wouldn't I need to change the
actual SQL query itself?

"select * from Document1"
"select * from Document2"
...
"select * from Document36"