There are two branches of statistics. Descriptive: used in studies to describe relationships. Deterministic: mathematical processes that use descriptive properties and ranges of known variables as minimum and maximum parameters in solving for unknown variables.
Descriptive Statistics:
Descriptive statistics, of which multivariate analysis is the most advanced form, involves linear regression and hypothesis testing, with the logic and understanding grounded in science. You obtain results by seeing whether you have sufficient evidence to reject the null hypothesis, the null hypothesis most typically being that there is no relationship between the independent and dependent variables tested. The conclusion is stated as whether there is or is not enough evidence to reject the null hypothesis, which then suggests further, more refined thinking and testing.
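As a minimal sketch of that logic, the hypothetical example below fits a simple linear regression and tests the null hypothesis that the slope is zero, i.e., that there is no relationship between the independent and dependent variable. The data and the 0.05 threshold are illustrative assumptions, not taken from this text.

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours of training (independent) vs. output per shift (dependent)
training_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
output = np.array([52, 55, 53, 60, 62, 61, 66, 68])

# Fit a simple linear regression; the p-value tests the null hypothesis
# that the slope is zero (no relationship between the variables).
result = stats.linregress(training_hours, output)

alpha = 0.05  # illustrative significance level
print(f"slope = {result.slope:.2f}, p-value = {result.pvalue:.4f}")
if result.pvalue < alpha:
    print("Sufficient evidence to reject the null hypothesis of no relationship.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```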
Multivariate analysis of variance comprises the most thorough and insightful scientific statistical methods yet designed, invented, and enhanced to facilitate understanding of relationships, complex relationships, and relationship dynamics: interaction, nesting, and factor relationships, e.g., is the dynamic driven by race, by poverty, or by education level? Using linear regression as its base and isolating the variance attributed to a treatment or factor versus the mean squared error, these methods may reveal that education level is the driving factor. That does not mean one group, sex, or race is not more adversely impacted; rather, it presents an understanding that can guide rectification if the situation is deemed a problem.
Multivariate analysis of variance calculates the variance in the dependent variable associated with a given factor and compares it to the mean squared error, a measure of random error, i.e., error that is not accounted for by the model. If the mean squared error is large, the model is a weak fit and should be re-examined, as there may be something else of greater significance at work when trying to understand the dependent variable.
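As a hedged illustration of that comparison, the sketch below runs a one-way analysis of variance on three invented treatment groups; the F statistic is the ratio of the variance attributed to the factor to the mean squared error within groups. The group labels and values are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical productivity scores under three training levels (the factor)
low = np.array([54, 51, 58, 55, 53])
medium = np.array([60, 57, 63, 59, 61])
high = np.array([66, 70, 64, 68, 67])

# One-way ANOVA: F = (variance attributed to the factor) / (mean squared error)
f_stat, p_value = stats.f_oneway(low, medium, high)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A large F (and small p) means the factor accounts for much more variation
# than the residual, unexplained (random) error.
```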
Indeed, despite better working conditions or more training, differences in assembly line productivity may in reality be primarily due to a loose or broken bolt! With science, statistics can confirm a relationship, e.g., the more highly trained line has better performance, but true causality must be grounded in reality, e.g., when the broken bolt is fixed, the differences go away.
The solution process, and the testing of it, is a separate matter; the understanding gained should suggest parameters and things to consider.
The mean, median, mode, and standard deviation convey a great deal about your data; knowing what each one means and which is most applicable or helpful is important.
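As a small assumed example, the sketch below computes these four summary statistics for an invented sample using Python's standard library.

```python
import statistics

# Hypothetical sample of measurements
data = [62, 65, 65, 68, 70, 71, 74, 78]

print("mean   :", statistics.mean(data))
print("median :", statistics.median(data))
print("mode   :", statistics.mode(data))    # most frequent value (65)
print("std dev:", statistics.stdev(data))   # sample standard deviation
```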
If the variances within two groups are sufficiently large, distinguishing between the two groups may be inappropriate.
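As a hedged sketch of this point, the example below compares two invented groups whose means differ slightly but whose within-group variances are large relative to that difference; the data are simulated assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups: the means differ a little, but the spread within
# each group is large relative to that difference.
group_a = rng.normal(loc=100, scale=40, size=30)
group_b = rng.normal(loc=108, scale=40, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# If the p-value is large, the within-group variance is too great to justify
# treating the two groups as distinct.
```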
Plotting residuals is key, because results can be driven by outliers, i.e., extremes. Take average income: if most people earn 60-80K and one person earns 3 million, the mean may come out around 400K, and that figure is driven by the 3 million. This case is simple and straightforward; if you are looking at a chemical's impact on someone, it may be more complex.
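As a hedged illustration, the sketch below uses invented incomes in the ballpark the text describes and shows how the single outlier drags the mean far above the median.

```python
import statistics

# Hypothetical incomes: most people earn 60-80K, one person earns 3 million
incomes = [62_000, 65_000, 70_000, 72_000, 75_000, 78_000, 80_000, 3_000_000]

print("mean  :", statistics.mean(incomes))    # pulled to roughly 438K by the outlier
print("median:", statistics.median(incomes))  # stays near the typical earner (73.5K)
```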
Sample size is not the most important variable, though it does play a role in determining the degrees of freedom in a model. Technically you can have one degree of freedom with a sample size of two, and therefore information. The fit of the model, the degree to which it accounts for variation, in other words variance, is the most important part.
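As a hedged sketch of fit in this sense, the example below reports the proportion of variance in an invented dependent variable that a simple regression accounts for; the data are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

result = stats.linregress(x, y)
r_squared = result.rvalue ** 2  # proportion of variation accounted for by the model
print(f"R^2 = {r_squared:.3f}")  # near 1.0 means a strong fit; near 0 means a weak one
```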
Big data and mining data: reviewing all medical records and then presenting a conclusion or inferring causality is not scientifically valid and does not promote understanding in reality. The term for this is data snooping, and the burden of proof, a scientific and statistical term, is much higher. Mining data is not objective: you are looking for relationships rather than testing to see whether there is a relationship, which is why it undermines understanding. It can be done in some instances, for example in manufacturing, to suggest areas in which to then apply multivariate methods, i.e., to generate hypotheses. It can readily lead to circular logic and even wishful thinking. Correlations are plentiful, but they do not mean causality, and this is lost on many who practice data snooping. Data snooping is not a scientific approach to understanding factors, as there is no scientific thought or hypothesis involved. It is only suggestive and should be seen as exploratory in nature.
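As a hedged illustration of why snooped correlations are so plentiful, the sketch below generates purely random, unrelated variables and still finds a handful that correlate "significantly" with a random outcome at the usual 0.05 threshold; everything in it is simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A purely random "outcome" and 200 purely random, unrelated candidate predictors
outcome = rng.normal(size=50)
candidates = rng.normal(size=(200, 50))

# Snoop through every candidate looking for a "significant" correlation
spurious = sum(1 for c in candidates if stats.pearsonr(c, outcome)[1] < 0.05)

print(f"{spurious} of 200 unrelated variables correlate at p < 0.05")
# Roughly 5% pass by chance alone; none of these correlations mean anything.
```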
Deterministic Statistics:
Deterministic statistics utilize linear equations and matrices to solve for unknown variables: known variables and their ranges are simulated as parameters, and minimizing and maximizing driving equations produce ranges for, and solutions to, the unknown variables. The approach was invented by NASA to solve for unknown variables in the space program, and it is now grossly misapplied, and commonly so, by the majority of today's stock analysts, who use software to "predict" stock price ranges in lieu of understanding an industry, a company's products, and its management. This gross abuse allows for gaming and a tendency toward systematic overvaluation, particularly given the "stops" used when models are "minimized", with no such provisions when they are "maximized".
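The description above reads like constrained optimization; purely as a hedged sketch, and not as the specific method the text has in mind, the example below bounds an unknown quantity by minimizing and then maximizing a linear objective over assumed ranges for the known variables. All names, coefficients, and bounds are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Unknown quantity of interest: 3*x1 + 5*x2, where the known variables x1 and x2
# are only known to lie within given ranges (the parameters).
objective = np.array([3.0, 5.0])
bounds = [(2.0, 6.0), (1.0, 4.0)]  # hypothetical ranges for x1 and x2

# Minimizing gives the low end of the range for the unknown quantity.
low = linprog(c=objective, bounds=bounds)

# Maximizing (by minimizing the negated objective) gives the high end.
high = linprog(c=-objective, bounds=bounds)

print(f"range for the unknown quantity: {low.fun:.1f} to {-high.fun:.1f}")
```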
Today's Wall Street analysts all predict stock prices using similar software, which in reality is a form of price fixing: these models direct machines to trade on algorithms, stocks are traded at lightning speed all day long, and a machine can buy and sell the same stock multiple times a day. This literally DRIVES the market and has little to do with company performance.
Money is so concentrated with institutional investors that they can make money off of money and no longer need to charge fees for accounts or for trades. This defies reason and is a manifestation of a system out of control. The nearly identical slopes of Amazon, Tesla, and GameStop exemplify this. Machine trading as practiced can lead to gross overvaluation of a company and of the market as a whole.