Sorry for the duplicate here, I had to subscribe first... From: Walt Rolle Sent: Tuesday, April 28, 2015 11:24 AM To: 'Brad Rix'; MarkLogic Developer Discussion; Cynthia Jiang Cc: Scott Fowler; '[email protected]' Subject: RE: [MarkLogic Dev General] SQL Query Performance and Tableau Connection
Hi Brad, Thanks for the tip. Also it looks like it's going to make sense for us just to store the primary address at the patient level anyway (so no joins). I wanted to mention some conclusions we arrived at after talking to Sara. * If we want to use SQL at all, then we can't have repeating rows in our document structure, or at least need to name the first one in the collection something different to separate it. * Getting SQL to work in our situation is about supporting ODBC connectivity options such as ODBC to Tableau. We can simplify the data model by de-normalizing a bit, such as sticking to just primary address and active eligibility. After that, we supply Tableau with our own specific SQL statement to limit the columns and number of "tables". Group by / order by / functions are best left to Tableau afterwards, so our source queries are simple. * MarkLogic lends itself better to document search, which would use things like XQuery or Javascript to screen the documents and return patient documents that match the criteria. Because we broke apart our model to support SQL, we take a performance hit here in having to do joins or store data in duplicate locations. The other option would be to keep the large hierarchical documents and ETL a subset off to a data mart, if the customer wants to do relational things against it. Because the customer has a lot of existing code and internal capability around the traditional SQL Server stack, we're thinking of having our model driven towards supporting SQL and NoSQL techniques simultaneously - at least as the first step of the roadmap. In the future, it would not be much effort to generate new documents that are complete hierarchies of patient history. To do this we need to separate objects by grain. 1. Patient (with primary address and active eligibility) 2. Claim Header (consider storing diagnosis codes as repeating rows in one field, it's essentially "factless") 3. Claim Revenue (this is line item detail with dollar amounts) 4. Claim Procedure (another type of line item detail with dollar amounts) 5. Care Provider (specialty may need to be a field level collection like diagnosis) 6. Facility We'll have to revisit these decisions when we get more time to spend on looking at the analytics use cases to make sure we always choose the right tool for the job. As far as deep amounts of storage for terabytes of data, MarkLogic seems to be a good landing place for the variety of incoming claims files, either before or after we standardize them or both. Thanks, Walt Walter Rolle, PMP Project Manager [email protected]<mailto:[email protected]> o: (770) 884.4512 | m: (678) 722.1549 | f: (770) 884.4512 2475 NorthWinds Parkway, Suite 200 Alpharetta, GA 30009-4807 [in]<http://www.linkedin.com/in/wrolle>[cid:[email protected]]<http://twitter.com/rdadevteam>[cid:[email protected]]<http://www.facebook.com/pages/RDA-Corporation/90541234444> www.rdacorp.com<http://www.rdacorp.com/> From: Brad Rix [mailto:[email protected]] Sent: Tuesday, April 28, 2015 10:45 AM To: MarkLogic Developer Discussion; Cynthia Jiang Cc: Scott Fowler; Walt Rolle Subject: RE: [MarkLogic Dev General] SQL Query Performance and Tableau Connection One comment about your specific query. You have a "left outer join" against the patientaddress table. This can actually be written as an "inner join" with the same results since you are restricting the right side of the outer join with the where clause. You can try your query as an inner join and see if that improves your performance. The relational database engine probably optimized the query by re-writing it as an inner join. If you did not have the "a.State='LA'", then it would have to remain an outer join. The left outer join might be greatly effecting performance. Since you are also using group by and order by, the total amount of results might also effect the performance. select c.Patientid, ClaimNumber, FirstName, LastName, sum (PharmacyTotalCost) as PharmacyTotalCost from claim c inner join patient p on c.patientid = p.patientid left outer join patientaddress a on p.patientid = a.patientid where a.[State] = 'LA' group by c.PatientId, ClaimNumber, FirstName, LastName order by SUM(PharmacyTotalCost) desc limit 10 Note: You can see a reference in mysql as an example about when outer joins can be re-written to inner joins: https://dev.mysql.com/doc/refman/5.0/en/outer-join-simplification.html Brad Rix | Technical Lead O: +1 303 542 2172 | M: +1 303 915 2771 Skype: Brad.Rix | Google Talk: [email protected]<mailto:[email protected]> From: [email protected]<mailto:[email protected]> [mailto:[email protected]] On Behalf Of Sara Mazer Sent: Monday, April 27, 2015 6:30 PM To: MarkLogic Developer Discussion; [email protected]<mailto:[email protected]> Cc: Scott Fowler; Walt Rolle Subject: Re: [MarkLogic Dev General] SQL Query Performance and Tableau Connection Hi Cynthia, My name is Sara Mazer, and I am a solutions consultant in the DC area. I've worked a lot with Tableau and MarkLogic and would like to set up some time to discuss your evaluation and give you some best practices so that you can get the best performance possible. I was the person who worked with Tableau to add MarkLogic as a named connection and supported their certification of MarkLogic, and I think I can help you. I have a too many best practices to go over by email, so I'd like to speak with you and walk you through setting up databases for successful analytics through WebEx. Would that be possible? I have a meeting from 9:30-11:00 but can move any other meeting I have tomorrow to suit your schedule, just let me know a good time for you. In general, I plan on: Recommending the best data model for MarkLogic of this type of data (hint: it doesn't look like what would be in an RDBMs) Showing you a demo of patient and claims data with Tableau and review how it was done Discussing the best practices of working with Tableau (minimize the data coming back from MarkLogic by using custom SQL for charts, using MATCH keyword, having Tableau do grouping/sorting instead of in SQL) Reviewing common connection issues between MarkLogic and Tableau and how to avoid them (most likely you are corrupting your SQL views:a best practice is to have a unique schema database for each MarkLogic database) Reviewing performance of XQuery/JavaScript vs. SQL in MarkLogic Best regards, Sara Mazer sara dot mazer "at" marklogic dot com This e-mail and any accompanying attachments are confidential. The information is intended solely for the use of the individual to whom it is addressed. Any review, disclosure, copying, distribution, or use of this e-mail communication by others is strictly prohibited. If you are not the intended recipient, please notify us immediately by returning this message to the sender and delete all copies. Thank you for your cooperation. From: [email protected]<mailto:[email protected]> [mailto:[email protected]] On Behalf Of Cynthia Jiang Sent: Monday, April 27, 2015 4:56 PM To: [email protected]<mailto:[email protected]> Cc: Walt Rolle Subject: [MarkLogic Dev General] SQL Query Performance and Tableau Connection Hello, We have been testing MarkLogic 8 downloaded from your website for a POC. Our environment is very straight forward. A single VM (windows 7 Enterprise - 64bit) with 16GM RAM and 2 Intel (E5-2690 v2) 3GHz processors. We have around 4 SQL tables that we modeled in Json format so each row ends up becoming a document, and here are some of the stats: Patient : 8 fields / 20,924 documents PatientAddress: 5 fields / 36,193 documents Claim: less than 30 fields / 335,349 documents ClaimDiagnosis: 2 fields / 8,676 documents We created element range index for all the fields, and created 4 views that each represent a SQL table. Then we were able to use SQL to query the data and get the result in the query console. But the performance is not nearly as what we expected, we thought it would work faster than SQL environment. A simple query like the one below took 30 seconds to run, and the same query took 3 seconds to run with a larger data set in a SQL environment with 6GB memory and 4 processors. select c.Patientid, ClaimNumber, FirstName, LastName, sum (PharmacyTotalCost) as PharmacyTotalCost from claim c inner join patient p on c.patientid = p.patientid left outer join patientaddress a on p.patientid = a.patientid where a.[State] = 'LA' group by c.PatientId, ClaimNumber, FirstName, LastName order by SUM(PharmacyTotalCost) desc limit 10 We also tested out the connection between MarkLogic and Tableau 9 trial version through MarkLogic 64 bit ODBC connector on the same VM. The connection is not stable, got lost very frequently, and sometimes it won't stop running for 20 minutes after adding one dimension and one simple measure the columns and rows shelves. We are under a very tight timeline and have to report our findings back to our client within 24 hours. With the current performance, we will not be able to recommend any MarkLogic product to our client. Please help us if there is anything that we could do to improve the SQL query performance and data connectivity between Tableau and MarkLogic. Also, for the query as the simple example above, is there any XQuery or Javascript Query that we can use that could produce the similar results? We cannot find any documented examples. Any help will be really appreciated! Thank you very much, Cynthia Jiang RDA Corpration
_______________________________________________ General mailing list [email protected] Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
