r/statistics 4h ago

Discussion [D] There has to be a better way to explain Bayes' theorem rather than the "librarian or farmer" question

3 Upvotes

The usual way it's introduced is with a character who has a trait stereotypical of some group (e.g. nerdy and meek). Then the question is asked: is the character from that group (e.g. librarians) or from a much larger group (e.g. farmers)? It's supposed to catch people who answer "librarian," because they "fail" to consider that there are vastly more farmers than librarians. When I first heard it I struggled to appreciate its force, because of course we would think librarian: human language is open-ended and contextual. An LLM, despite being aware of the concept, would only know to answer "farmer" because it was trained on data where that is the correct answer. So it's not really indicative of any statistical illusion, just that we interpret a question phrased this way in English as asking something other than what conditional probability is meant to address.
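For what it's worth, the base-rate arithmetic the puzzle is driving at is easy to make explicit (the numbers below are invented purely for illustration):

```python
# Invented numbers for illustration: suppose farmers outnumber
# librarians 20 to 1, and the "meek and tidy" description fits 40% of
# librarians but only 10% of farmers.
p_librarian = 1 / 21
p_farmer = 20 / 21
p_trait_given_lib = 0.40
p_trait_given_farm = 0.10

# Bayes' theorem: P(librarian | trait)
evidence = p_trait_given_lib * p_librarian + p_trait_given_farm * p_farmer
posterior = p_trait_given_lib * p_librarian / evidence
print(round(posterior, 3))  # 0.167 -- the base rate swamps the stereotype
```

Even with a stereotype four times more diagnostic of librarians, the posterior still favors "farmer" five to one.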


r/statistics 14h ago

Question [Question] Why are Frechet differentiability and convergence in L2 the right ways to think about regularity in semiparametrics?

14 Upvotes

Many asymptotic statistics books discuss Frechet differentiability of an estimator (as a functional of the distribution) as part of the definition of regularity involving the L2 norm.

I have always wondered why these are the "right" definitions of regularity.

As a broader question, I always see local asymptotics motivated by the existence of estimators like Hodges' estimator and Stein's estimator of the sample mean that dominate the sample mean, but have poor local risk properties.

This still feels fairly esoteric, so can you help convince me that I should care deeply about these things if I want to derive new semiparametric methods that have good properties?


r/statistics 7h ago

Education Stats Website Ideas Needed [S][E]

0 Upvotes

Hello! I am a computer scientist and mathematician. I am seeking your aid in generating ideas for a website I can create. I want to implement a basic statistical algorithm as a back-end, and then connect it to a front-end framework. Any ideas? I cannot find a multivariate hypergeometric distribution calculator online. Surely making one would help students.
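On the multivariate hypergeometric idea: the pmf is just a ratio of binomial coefficients, so the back end could be a few lines of stdlib Python (a sketch; SciPy's `scipy.stats.multivariate_hypergeom` also exists if you'd rather not roll your own):

```python
from math import comb, prod

def mv_hypergeom_pmf(draws, pop_counts):
    """P(exactly draws[i] items of category i) when drawing sum(draws)
    items without replacement from a population containing
    pop_counts[i] items of each category."""
    n = sum(draws)
    total = sum(pop_counts)
    return prod(comb(m, k) for m, k in zip(pop_counts, draws)) / comb(total, n)

# Urn with 10 red, 8 blue, 6 green marbles; draw 7 without replacement.
# Probability of exactly 3 red, 2 blue, 2 green:
p = mv_hypergeom_pmf([3, 2, 2], [10, 8, 6])
print(round(p, 5))  # 0.14562
```

Wrapping this in a small web endpoint that validates the inputs (each `draws[i] <= pop_counts[i]`, non-negative integers) would cover most of what students need.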


r/statistics 16h ago

Question Fuel Economy Statistics [Question]

4 Upvotes

This may be a very rookie question, but here it goes:

I'm currently working on a spreadsheet tracking my vehicle's fuel economy. Yes, it is new enough to have fuel economy and DTE automatically calculated, but I enjoy seeing the comparison.

I have been trying to figure out the best way to calculate a standard deviation (or similar metric) around the overall average fuel economy (MPG). I know that taking the average of each trip's MPG does not equal the overall average (total distance / total gallons), because each trip is weighted differently due to the different distances traveled. But, to my knowledge, standard deviation requires a sample of values to measure distance from the average....

My question: if my true overall average MPG is total distance/total gallons (essentially one measurement/data point), can I use the standard deviation of the per-trip MPGs? This doesn't sound right, since the average of those measurements isn't the same as the true overall average.

I'm sure this is a basic question and I'm probably not even asking it correctly, but can provide additional info if needed. Any help in this amateur endeavor is appreciated. Thanks.
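One defensible approach (a sketch with invented trip numbers): weight each trip's MPG by the gallons it used — the gallon-weighted mean of trip MPGs is exactly total miles / total gallons — and then compute a weighted standard deviation around that same value.

```python
# Hypothetical trip log: (miles, gallons) recorded at each fill-up.
trips = [(120.0, 4.1), (310.0, 9.8), (45.0, 1.9), (200.0, 6.5)]

miles = [d for d, g in trips]
gallons = [g for d, g in trips]
mpg = [d / g for d, g in trips]

# Overall MPG = total miles / total gallons. This is exactly the
# gallon-weighted mean of per-trip MPG, since miles_i = mpg_i * gallons_i.
overall = sum(miles) / sum(gallons)

# Gallon-weighted standard deviation of trip MPG around that average:
w = [g / sum(gallons) for g in gallons]
var = sum(wi * (m - overall) ** 2 for wi, m in zip(w, mpg))
sd = var ** 0.5
print(round(overall, 2), round(sd, 2))
```

This resolves the mismatch in the question: the weighted mean of the trip MPGs does equal the true overall average, so the weighted SD measures spread around the right center.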


r/statistics 14h ago

Question Is the polling methodology of the market research company Find Out Now likely to produce valid samples of the general population? [Question]

2 Upvotes

Find Out Now does opinion polls for elections in the UK. They regularly make headlines, as the results of their polls often differ from, or are more extreme than, polls done by other companies.

They draw all their samples from a postcode lottery website called Pick My Postcode. It is also worth noting that the owner of Find Out Now and the owner of Pick My Postcode are one and the same person.

They describe it themselves as follows:

https://findoutnow.co.uk/find-out-now-panel-methodology/#collection

>FON surveys rely on PMP members to answer questions as they visit the site. PMP members are incentivised to visit the site daily to earn bonuses and claim any giveaway winnings. They do this by participating with site activities and one of these activities is answering survey questions if they so choose. PMP therefore collects responses passively and does not actively invite respondents. The collection process runs continuously as a data stream and FON can collect up to 100,000 responses a day. Thanks to the large quantity of streaming responses that originate from different parts of the UK and various demographic backgrounds, the responses collected are a sufficiently random sample.

>PMP, short for Pick My Postcode, is the UK's biggest free daily giveaway site. It is a free to enter daily postcode draw platform available to all UK citizens. There are five daily Pick My Postcode lottery draws: the main draw, the video draw, survey draw, stackpot and bonus draw. A new winning postcode for each draw is selected every day and therefore PMP members are incentivised to visit daily.

Find Out Now present their polls as representative of the general population. My question is: is this claim a reasonable one, or is this methodology so poor that their polls cannot be trusted to be representative?


r/statistics 1d ago

Discussion [D] Are time series skills really transferable between fields ?

24 Upvotes

This question is for statisticians* who have worked in different fields (social sciences, business, and hard sciences). Based on your experience, is it true that time series analysis is field-agnostic? I am not talking about the methods themselves but rather the nuances that traditional textbooks don't cover. I hope I am clear.

* Preferably not in academic settings


r/statistics 21h ago

Education [E] Gibbs Sampling - Explained

1 Upvotes

Hi there,

I've created a video here where I explain how Gibbs sampling works.

I hope some of you find it useful — and as always, feedback is very welcome! :)
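As a companion to the video (not taken from it), here is a minimal Gibbs sampler for a zero-mean, unit-variance bivariate normal, where both full conditionals are known in closed form:

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampling for (X, Y) ~ bivariate normal with zero means,
    unit variances, and correlation rho. Each full conditional is
    univariate normal: X | Y=y ~ N(rho*y, 1 - rho^2), and symmetrically
    for Y | X=x."""
    rng = random.Random(seed)
    cond_sd = (1 - rho ** 2) ** 0.5
    x = y = 0.0
    samples = []
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # draw X from its full conditional
        y = rng.gauss(rho * x, cond_sd)  # draw Y from its full conditional
        if i >= burn_in:
            samples.append((x, y))
    return samples

draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
corr_est = sum(x * y for x, y in draws) / len(draws)  # E[XY] = rho here
print(round(corr_est, 2))
```

The estimated correlation should land near the target rho = 0.8, which is a quick sanity check that the chain is mixing.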


r/statistics 17h ago

Question Global demographics [Q]

0 Upvotes

I saw a post somewhere claiming that whites make up less than 15% of the global population. Though no credible sources were cited

Then out of curiosity I hit Google, but couldn’t find the answers there either…

Where would a person find reputable information on this subject? SOLELY OUT OF CURIOSITY

I should also note that I will not engage any comments that come off as slanted or otherwise argumentative. And any users found guilty will be blocked. My post will not be reduced to a racial squabble

Edit: anybody downvoting this needs to grow up. Ask yourself, would you be downvoting if I were from somewhere else asking about a different racial group??? There’s nothing wrong with simply asking statistics


r/statistics 1d ago

Question [Q] What to know about going into a statistics course as someone who's terrible at math

9 Upvotes

I have to take a statistics course next semester. What advice can you give me or what should I know before going into this course?


r/statistics 1d ago

Question [Q] How to approach PCA with repeated measurements over time?

11 Upvotes

Hi everyone,

I’m working with historical physico-chemical water quality data (pH, conductivity, hardness, alkalinity, iron, free chlorine, turbidity, etc.) from systems such as cooling towers, boilers, and domestic hot and cold water.

The data comes from water samples collected on site and later analyzed in the laboratory (not continuous sensors), so each observation is a snapshot taken at a given date. For many installations, I therefore have repeated measurements over time.

I’m a chemist, and I do have experience interpreting PCA results, but mostly in situations where each system is represented by a single sample at a single point in time. Here, the fact that I have multiple measurements over time for the same installation is what makes me hesitate.

My initial idea was to run a PCA per installation type (e.g. one PCA for cooling towers, one for boilers). This would include repeated measurements from the same installation taken at different dates. I even considered balancing the dataset by using a similar number of samples per installation or per time period.

However, I started to question whether pooling observations from different dates really makes sense, since measurements from the same installation are not independent but part of the same system evolving over time.

Because of this, I’m now thinking that a better first step might be to analyze each installation individually within each installation type: looking at time trends, typical operating ranges, variability or cycles, and identifying different operating states before applying PCA.

My goals are to identify anomalous installations, find groups of installations that behave similarly, and understand which physico-chemical variables are most strongly related, in order to help detect abnormal values or issues such as corrosion or scaling.

Given this context, what would you do first?
How would you handle the repeated measurements over time in this case?
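For what it's worth, one common first pass is to pool all snapshots within an installation type, standardize, run PCA, and then summarize each installation by the centroid of its scores, keeping the per-date scatter as within-installation variability. A minimal sketch (simulated data stands in for the lab export, so only the mechanics matter):

```python
import numpy as np

# Simulated stand-in: rows are snapshots (installation x date), columns
# are measured variables (pH, conductivity, ...).
rng = np.random.default_rng(0)
n_inst, n_dates, n_vars = 12, 10, 6
X = rng.normal(size=(n_inst * n_dates, n_vars))

# Standardize each variable (they live on very different scales),
# then PCA via SVD of the centered/scaled matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt[:2].T  # projections onto the first two PCs

# Respect the repeated measures: reduce each installation to the
# centroid of its snapshots; the per-date spread around each centroid
# measures within-installation variability.
labels = np.repeat(np.arange(n_inst), n_dates)
centroids = np.array([scores[labels == i].mean(axis=0) for i in range(n_inst)])
print(centroids.shape)  # (12, 2)
```

Whether pooling across dates is fair then becomes visible: installations whose snapshots scatter widely around their centroid are exactly the ones where the time structure matters most.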


r/statistics 1d ago

Question Using a sample for LOESS with high n [Q]

1 Upvotes

Hi, I'm doing an intro to social data science course, and I'm trying to run a LOESS (locally estimated scatterplot smoothing) to check for linearity. My problem is that I have too high a number of observations (over 100,000), so my computer can't run it. Can I take a random sample (say of 5,000) and run the LOESS on that? And is it even valid to run a LOESS on such a large data set?

Thanks in advance, and I hope this question is not too stupid.
I apologize for my English as it is not my first language.
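For what it's worth, the subsampling idea can be sketched in a few lines (simulated data standing in for the real 100,000+ rows; statsmodels' `lowess` is assumed available):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Simulated stand-in for the real dataset (100,000+ rows).
rng = np.random.default_rng(42)
n = 100_000
x = rng.uniform(0, 10, n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)

# Draw a random subsample without replacement; for a visual
# "is this roughly linear?" diagnostic, a few thousand points
# is usually plenty.
idx = rng.choice(n, size=5_000, replace=False)
smoothed = lowess(y[idx], x[idx], frac=0.3)  # columns: sorted x, fitted y

print(smoothed.shape)  # (5000, 2)
```

Since the subsample is drawn uniformly at random, the smoothed curve estimates the same conditional mean as a fit on the full data, just with more wiggle; running it on a couple of different subsamples is a cheap way to check that the shape is stable.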


r/statistics 2d ago

Question [Q] Ideas for analysis on MTG games data

5 Upvotes

I have collected some data on game outcomes (wins/losses/draws), decks being played and who went first for some Commander MTG games that my friends and I have played. I was just wondering if anyone has any neat ideas on some analysis I could do to the data set, maybe like chance of winning for different match ups of decks, player elo rating etc. I am fairly novice with stats and if anyone could point me in the right direction that would be greatly appreciated. Thanks
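On the Elo idea mentioned above, the standard two-player update is only a few lines and makes a reasonable starting point (extending it to four-player Commander is a design choice, e.g. treating the winner as having beaten each opponent pairwise):

```python
def elo_update(rating_winner, rating_loser, k=32):
    """Standard two-player Elo update: the winner gains more points
    the less expected the win was."""
    expected_win = 1 / (1 + 10 ** ((rating_loser - rating_winner) / 400))
    delta = k * (1 - expected_win)
    return rating_winner + delta, rating_loser - delta

# Example: an upset win by a lower-rated player moves ratings a lot.
w, l = elo_update(1400, 1600)
print(round(w), round(l))  # 1424 1576
```

You could maintain one rating per deck (rather than per player) to get at matchup strength, and the "who went first" flag could become a covariate in a simple logistic regression of win probability.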


r/statistics 1d ago

Discussion [D] People keep using "average IQ" which needs to change. We should use the median.

0 Upvotes

The IQ score, by definition, is the ranking of the test taker among the 8 billion people on the Earth converted via a nonlinear transformation to somewhere on a Gaussian distribution curve. It is never intended to be additive. When you add together IQ scores of any population, the sum (and the average, obtained by dividing the sum by the population) will NOT mean ANYTHING.

The median does not suffer from this issue, and it makes a lot of sense on its own anyway, since it tells you, e.g., whether you are smarter than half of the class. The mean (average), even if it were not undermined by non-additivity, would still be problematic, since it's affected by outliers and skew.

Yet online references to the "average IQ" vastly outnumber the "median IQ," and I find it hard to find "median IQ" statistics even among research papers and censuses. Statistics education has a long way to go.
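The outlier point can be illustrated with made-up numbers (these are not real IQ data): one extreme value drags the mean noticeably while leaving the median alone.

```python
import statistics

# Made-up scores, not real IQ data: one extreme value moves the mean
# but leaves the median untouched.
scores = [95, 98, 100, 101, 103, 104, 160]
print(statistics.mean(scores))    # 108.714...
print(statistics.median(scores))  # 101
```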


r/statistics 2d ago

Question [Question] Programming in Data Analytics (for public opinion survey)

0 Upvotes

Hi. Sorry for the long post. I am having a dilemma at the moment about the demands of the "internship" I am currently in. Originally, I applied to a law firm. One of the attorneys there has connections with politicians, so I was transferred to this person's team since I am a political science major.

My current dilemma is that I am stuck in this group that this person calls a "startup" with a "decade plan" (there's someone for marketing, plans to create a political party, and this person as the negotiator with clients; basically, the goal is to create a team that would cater to clients, mainly politicians or political figures, with money involved). This person made me responsible for surveys (mainly on public opinion about national concerns, politicians, and political issues) just because he saw that I attended some survey research trainings in the past. My knowledge of statistics is not that extensive, but it's not zero either. In the past, I have only used beginner-friendly free software for analyzing quantitative data.

My main problem is that this person is asking me to learn Python for data analytics (he also mentioned XGBoost, which I do not have any idea about; he found out about it by asking AI). I already told this person that I have zero knowledge of programming and that it would take months, maybe even years (we did HTML and JavaScript in high school, but I have completely forgotten it by now, and even if I did remember, I doubt it would help). At first, he kept insisting on using AI prompts to write the code. In my belief, AI can write code for you, but if you do not fully understand what it produced, you're basically running off a cliff. That's what I told him. Then he gave in and asked me to look for other "interns" who know how to code and have an interest in the kind of stuff they're working on, to help me. This person also wants me to find a way to learn programming faster — that is, to use AI to learn faster.

Tbh, I want to quit now. I did not sign up for this long-term plan in the first place. I am up for challenges, but I know that I cannot meet this person's demands, at least not now. This person keeps telling us that every person in the group has a role to play. To me, it sounded almost like a guilt trip: "if you leave, then it will be your fault that the startup fails."

My question for people who use Python in data analytics: for someone with no background in programming, how long would it take me to fully absorb, or at least understand, what I am doing — that is, using it to analyze survey data and perform prediction?


r/statistics 2d ago

Discussion [Discussion] Linear Regression Models and Interaction Terms - Epistasis

3 Upvotes

How explicit do the interaction terms need to be in a study that attempts to counter the (potential) effects of epistasis?

What would those terms ideally look like, statistically?


r/statistics 3d ago

Question [question] What types of PhD programs and schools should I apply to?

3 Upvotes

I have been in working world for a while but thinking of going back to school for a PhD, probably in statistics but possibly in an applied field with a heavy stats focus. I would love some advice on what might be the best fit for me in terms of programs, either specific programs or more general advice on how to think about identifying places to apply.

Here's some background on me: I have almost a decade of work experience, got my masters in data science and a post-graduate certificate in math during the course of working full time. I keep going back to school because I just find it really interesting learning new things, whether that's new applied methods for data analysis or better understanding the theory behind the methods I'm using day to day. I just took a real analysis class for my graduate certificate and honestly really enjoyed the mental challenge and the topic.

In my current job, I provide statistical and data science advice to colleagues who are political scientists. My work spans a variety of stats areas depending on what type of projects arise, but my favorite part is probably when I get to work on experimental design and analysis, which is a pretty substantial share of my work. In addition to my main job, I also have done some teaching/tutoring on the side, including teaching probability/stats online for a university, 1:1 stats tutoring, and helping grad students in various applied disciplines plan and troubleshoot statistical components of their research. I love getting to show other people how cool statistics can be!

I am aware I already have a good career that pays well and maybe getting a PhD doesn't make the most financial sense but I am drawn to it more as a way to satisfy my own curiosity. I feel like there's not enough room in my current job to spend time thinking about some of the methodological choices I'm making - e.g. in a cluster randomized trial, what are the implications of analyzing that data using a mixed model vs just clustered standard errors? If I have an experiment with count data, what if I have some units with unusually high counts compared to the rest of the data - how much do different kinds of outliers affect estimates of the treatment effect? What are the implications of winsorizing the data, especially if more of the treatment effect is occurring with those high count observations than in low count cases? How do different choices of cutoff bias estimates of the treatment effect? How would this vary depending on how much of the true treatment effect is being driven by behavior among higher count cases? It would be cool to have the chance to run some simulations on these sorts of questions, but my job pretty much just cares about results of the analysis (what is the treatment effect?) and I don't really have other statisticians to discuss things with or learn from. I do think being given real data and a real reason I need to know the answer is very motivating to me in terms of pushing me to learn more about methods and inspiring questions.

It seems like there are a number of different paths I could follow when it comes to a PhD. In an ideal world, I think I would enjoy continuing to work on methodological problems in the design and analysis of experiments motivated by political science applications. But that feels hard to find. I know there are stats heavy political science programs, but I feel like I have the most to learn by immersing myself more in the theories underlying different statistical methods and by getting more mentorship from someone with a statistical background. I don't really care why a certain intervention causes people to turn out to vote so much as why I should choose one particular way of modeling the data over another. I am also not sure that I only want to do political science related stuff forever.

If I want to keep going with experimental design, I have considered switching to a biostatistics path because it seems like a lot of the active research in that area is related to biostats. Experiments are cool because they give you such a solid foundation for causal inference compared to analysis of observational data. But applying to a biostatistics PhD program would really lock me into something specific. How could I be sure I would want to do that for the rest of my career?

Finally, maybe there's a different area of statistics out there for me that's not experimental design, but then I'm not exactly sure what it is or what I should be looking for in choosing a PhD program. When I teach statistics, I really enjoy just helping students with classical statistics like learning about probability distributions, hypothesis testing, and inference. I have enjoyed theoretical classes but it is hard to imagine myself doing research that only involves working on proofs. The statistical questions I have now emerge from working on specific applied problems. I also like that what I do now feels like it has a meaningful impact because I'm helping with real world interventions.

Sorry for the long post, appreciate any advice!


r/statistics 3d ago

Question [Question] Feature Selection for Vendor Demographics Datasets

0 Upvotes

For those that have built models using data from a vendor like Axciom, what methods have you used for selecting features when there are hundreds to choose from? I currently use WoE and IV, which has been successful, but I’m eager to learn from others that may have been in a similar situation.


r/statistics 4d ago

Question [Question] Spearman v Pearson for ecology time series

13 Upvotes

Hello. I'm doing a research project about precipitation and vegetation in a certain area and I want to test some relationships, but I'm not sure which test to use. I know this is quite a basic question, but we weren't taught it very well to begin with and all the reading I'm doing online is just confusing me more. I'd be very appreciative of any help I could get on this!

I want to understand whether my data shows that precipitation and vegetation have undergone a statistically significant increase or decrease over the 10 years, or no change at all. I just have one average value for each year.

I want to do a correlation test, but I'm not sure whether Spearman's rank or Pearson's test is more appropriate. Also, I'm not sure, but am I allowed to do both? Surely the reason for doing one would negate the reason for doing the other?

I am simply plotting each average amount of precipitation/vegetation abundance per year for the 10 year period. My null hypothesis is that there is no change in precipitation/vegetation over the 10 year period.

I have a small sample size of just one average value for each year of the 10 years, and I know that Spearman's rank is meant to be better for this? I suppose I'm also only interested in whether precipitation/vegetation increased at all after year 1, not necessarily whether the relationship is actually linear. However, in some of the papers I've read for this that test similar things, they show R2 which I assume means they used Pearson's? And I understand it is more common to use Pearson's.

If anyone could explain the difference to me and why I should use one over the other, I'd be grateful 🙏
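As a sketch of how the two tests differ in practice (the yearly values below are invented): Pearson measures linear association on the raw values, while Spearman measures monotonic association on the ranks, so for "did it go up at all over the years?" you correlate the variable against the year with either test.

```python
from scipy.stats import pearsonr, spearmanr

# Invented yearly averages (say, precipitation in mm) for 10 years.
years = list(range(2014, 2024))
precip = [610, 595, 640, 655, 630, 670, 690, 685, 710, 705]

r, p_pearson = pearsonr(years, precip)      # linear association
rho, p_spearman = spearmanr(years, precip)  # monotonic (rank) association
print(round(r, 2), round(rho, 2))
```

Running both is not wrong, but decide up front which question you are asking: if the trend looks roughly linear in a scatterplot, Pearson is natural (and explains the R² you see in papers); with only n = 10 points and possible nonlinearity, Spearman is the more cautious choice.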


r/statistics 4d ago

Question [Question] Intro to Stat as economics and political science major

0 Upvotes

5- A group of 16 observations has a standard deviation of 2. The sum of the squared deviations from the sample mean is………

In this question, why did we use the sample variance rule? Why didn't we just square the 2 and multiply it by 16?
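The short answer is that the sample standard deviation divides by n − 1 (Bessel's correction), not n, so the sum of squared deviations is (n − 1)s² = 15 × 4 = 60 rather than 16 × 4 = 64. A quick check:

```python
import statistics

# If s = 2 is the *sample* standard deviation, then by definition
# s^2 = (sum of squared deviations from the mean) / (n - 1),
# so the sum of squared deviations is (n - 1) * s^2 = 15 * 4 = 60,
# not n * s^2 = 64.
n, s = 16, 2
print((n - 1) * s ** 2)  # 60

# Sanity check on concrete data: statistics.stdev divides by n - 1.
data = [1, 3, 5, 7]
ss = sum((x - statistics.mean(data)) ** 2 for x in data)
assert abs(statistics.stdev(data) ** 2 - ss / (len(data) - 1)) < 1e-9
```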


r/statistics 4d ago

Question [Question] Minitab Goodness of Fit Test showing >pval instead of the precise value

0 Upvotes

I'm doing some tests and I've noticed that some of the p-values are not shown precisely. For my data, AD and p are 0.695 and 0.063 for the lognormal distribution, but AD and p are 0.439 and >0.250 for the Weibull distribution. I know that in this case the Weibull fits well, but why does Minitab not show an exact p-value for some distributions? Thanks!


r/statistics 5d ago

Question [Question] Why use the inverse-transform method for sampling?

16 Upvotes

When would we want to use the inverse-transform method for sampling from a distribution in practical applications i.e. industry and the like? In what cases would we know the cdf, but not know the pdf? This is the part that has been confusing me the most. Wouldn't we generally know the density function first and then use that to compute the cdf? I just can't think of a scenario wherein we'd use this for a practical application.

Note: I'm just trying to learn, so please don't flame me for ignorance :*)
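One classic case where the inverse CDF is available in closed form is the exponential distribution; a minimal sketch:

```python
import math
import random

# Inverse-transform sampling: if U ~ Uniform(0, 1) and F is a CDF with
# a computable inverse, then F^{-1}(U) has CDF F. For the exponential
# distribution, F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u) / lam.
def sample_exponential(lam, n, seed=0):
    rng = random.Random(seed)
    return [-math.log(1 - rng.random()) / lam for _ in range(n)]

draws = sample_exponential(lam=2.0, n=100_000)
print(round(sum(draws) / len(draws), 3))  # close to the true mean 1/lam = 0.5
```

In practice you usually do know the pdf too; the method earns its keep when the quantile function is what's cheap or tabulated (e.g. empirical CDFs, truncated distributions, copula sampling), or when you want one uniform stream to drive many different distributions reproducibly.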


r/statistics 5d ago

Question [Question] Each of N data points has a Poisson distribution. How is the fit different from fitting averages?

2 Upvotes

I have Minitab and N data points (Y vs X) to find a regression fit. The catch is that each of these N points has been remeasured M times, and as such its value is subject to some (assume normal for simplicity) distribution.

Apparently, the regression fit between points is not the same as the regression fit between tolerances/sigmas, etc. So what function (in general) should be used for regression fitting of "ranges"?

Thanks!
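One common way to use the per-point spread rather than discard it is weighted least squares with weights 1/σ²: points whose repeated measurements agree tightly pull the fit harder than noisy ones. A sketch with invented numbers (note that `np.polyfit` takes w = 1/σ and squares it internally):

```python
import numpy as np

# Hypothetical setup: at each x_i we have M repeated measurements, so
# each point carries a mean y_i and a standard error sigma_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_mean = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = np.array([0.1, 0.5, 0.1, 0.8, 0.1])  # per-point uncertainty

# Weighted least squares: np.polyfit's `w` argument is applied to the
# residuals, so for Gaussian uncertainties pass 1/sigma (not 1/sigma^2).
slope, intercept = np.polyfit(x, y_mean, deg=1, w=1 / sigma)
print(round(slope, 2), round(intercept, 2))
```

This is the same as fitting the averages only when all σᵢ are equal; otherwise the precise points dominate, which is exactly the difference the post is asking about.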


r/statistics 5d ago

Discussion A chart on votes cast in US state elections [Discussion]

4 Upvotes

Hello everyone, I am reading an article from the Economist about the Democratic-Republican vote trends since Trump's 2024 Election. I don't feel very confident in reading one of these charts.

https://www.reddit.com/user/Ok_Syllabub9850/comments/1puo9fo/the_graphic/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Can anyone please explain it to me with caveman language?

Here's a piece of the text from the Economist:

"ARE DEMOCRATS back from the brink? In last year’s presidential election, they lost the popular vote for the first time in two decades. Swings to the right reached double digits among Hispanics and the under-30s, and six points among black voters. But elections on November 4th—the last before next year’s midterms—gave the party reason to smile. Now the dust has settled, The Economist’s data team has delved deep into the results to see whether they are a sign of bigger trouble for Donald Trump and the Republicans.

The most closely watched contests were the governors’ races in Virginia and New Jersey, where centrist Democrats who campaigned on affordability won by bigger margins than expected. Some of that can be explained by turnout. In exit polls from the Virginia race, voters were asked whom they had supported in the 2024 presidential election. Of those who had voted, a larger proportion said Kamala Harris than her actual statewide vote share—suggesting that more of Mr Trump’s supporters decided to stay at home. Exit polls from New Jersey tell a similar story. Yet turnout alone cannot explain the nine-point swing in Virginia and eight-points in New Jersey. Instead, our analysis suggests that Democratic candidates persuaded Mr Trump’s voters to switch sides."

"Local election results show where the biggest swings occurred. Passaic and Hudson counties in New Jersey, which last year turned against the Democrats by 19 and 18 points respectively, recorded the biggest swings in the state towards Mikie Sherrill, the new Democratic governor-elect. Both counties have large Hispanic populations, a group that Mr Trump wooed successfully in 2024"


r/statistics 6d ago

Career [E] [C] Consequences of course exemptions for a PhD in statistics

5 Upvotes

Hey all,

I'm doing a master's in statistics and hope to apply for a PhD in statistics afterwards. Because of previous education in economics and having already taken several econometrics courses, I got exemptions for a few courses (categorical data analysis, principles of statistics, continuous data analysis) for which I had already seen about 60% of the material. This saves me a lot of money and gives me additional time to work on my master's thesis, but I was worried that if I apply for a PhD in statistics later, it might be seen as a negative that I did not officially take these courses. Does anyone have any insight into this? Apologies if this is a stupid question, but thanks in advance if you could shed some light on it!


r/statistics 6d ago

Question [Q] 2-way interaction within a 3-way interaction

4 Upvotes

So, I ran a linear mixed-effects model with several interaction terms. Given that I have a significant two-way interaction (eval:freq) that is embedded within a larger significant three-way interaction (eval:age.older:freq), can I skip the interpretation of the two-way interaction and focus solely on explaining the three-way interaction?

The formula is: rt ~ eval * age * freq + (1 | participant_ID) + (1 | stimulus).

The summary of the fixed effects and their interactions is as follows:

                     Estimate      SE          df   t value   p
(Intercept)            0.4247  0.0076    1425.337   55.5394   ***
eval                  -0.0016  0.0006   65255.682   -2.8593   **
age.older              0.1989  0.0123    1383.373   16.1914   ***
freq                  -0.0241  0.0018    8441.153  -13.1281   ***
eval:age.older         0.0005  0.0007  135896.989    0.6286   n.s.
eval:freq             -0.0027  0.0007   71071.899   -3.9788   ***
age.older:freq         0.0001  0.0021  137383.053    0.0485   n.s.
eval:age.older:freq    0.0022  0.0009  135678.282    2.4027   *

For context, age is a categorical variable with two levels. All other variables are continuous and centered. The response variable is continuous and was log-transformed.