I'm far from an expert on stats, but I think what you are saying is that when 
you compare Baseline with Version 3, the p-value isn't as good as for 
versions 1 and 2.  I'm not 100% sure you are meant to compare p-values like 
that, but I'll let someone else comment on that!

            total   incorrect   correct   % correct
baseline      898         708       190       21.2%
version_1     898         688       210       23.4%
version_2     898         680       218       24.3%
version_3    1021         790       231       22.6%
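The % correct column can be reproduced directly from the counts. A quick 
sketch in Python (the original analysis code wasn't posted, so this is just 
arithmetic on the table above):

```python
# Recompute % correct from the (total, correct) counts in the table above.
data = {
    "baseline":  (898, 190),
    "version_1": (898, 210),
    "version_2": (898, 218),
    "version_3": (1021, 231),
}
for name, (total, correct) in data.items():
    print(f"{name:10s} {100 * correct / total:5.1f}% correct")
```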

>
> Here, the p value for version_3 (when compared with the baseline) seems to
> make no sense whatsoever. It shouldn't be larger than the other two p
> values, the increase in correct answers (that is what counts!) is bigger
> after all.
>
No, it's not the raw numbers; it's the proportion of correct answers that counts.

I've added a % correct column to your data - does that make it clearer?  Only 
22.6% of version 3's answers were correct, so the difference in percentage 
terms is smaller than versions 1 and 2 produced.  From my naive perspective 
I'd want to test for a difference between each result and baseline, and then 
v1 & v2, v1 & v3, v2 & v3 (you may tell me those are unsound things to test - 
in which case don't test them).  You'd then need to determine a threshold for 
accepting that the result is significant (say p < 0.05).  I'd contend that the 
test should be two-tailed - results could be better or worse.

You should also develop a hypothesis.  Let me create some for you:


A.
H1: version1 of the software is better than baseline
(H0: version 1 is no better than baseline)

B.
H1: version2 of the software is better than version 1
(H0: version 2 is no better than version 1)

C.
H1: version3 of the software is better than version 2
(H0: version 3 is no better than version 2)

Now look at your results and p-values and work out whether H1 or H0 applies in 
each case.  You could develop further variants (D: version 3 is better than 
baseline).

Finally - remember to consider the 'clinical significance' as well as the 
statistical significance.  I'd have hoped a software change might have 
increased correct answers to, say, 40%.  And remember also that a p-value 
threshold of 0.05 implies a false-positive rate of 1 in 20.

>
> Any idea what's going on here? I thought the sample size should have no
> impact on the results?
>
Erm.. sample size always has an influence on results.  If you show a 
difference in 100 samples, you would expect a larger p-value from virtually 
any statistical test you chose than if you show that same difference in 1000 
samples.  You have a bigger sample but a smaller overall difference, so in 
effect you can be less sure that the change is not down to chance.  (Purist 
statisticians will likely challenge that definition.)
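You can see the sample-size effect directly: hold the observed proportions 
fixed (a hypothetical 20% vs 25% correct) and vary n.  Same sketch test as 
above, stdlib only:

```python
import math

def two_prop_p(c1, n1, c2, n2):
    """Two-tailed p-value for a two-proportion z-test (normal approximation)."""
    p1, p2 = c1 / n1, c2 / n2
    pp = (c1 + c2) / (n1 + n2)                # pooled proportion under H0
    se = math.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
    return math.erfc(abs(p2 - p1) / se / math.sqrt(2))

# Identical proportions (20% vs 25% correct), two different sample sizes:
small = two_prop_p(20, 100, 25, 100)          # n = 100 per group
large = two_prop_p(200, 1000, 250, 1000)      # n = 1000 per group
print(small, large)  # the larger sample gives the much smaller p-value
```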



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
