Hello everyone,
I am attempting to analyze a dataset containing extremely skewed data, namely a word frequency list from a large Dutch-language corpus.
The problem is that around 30% of the words in the corpus occur only once, while the most frequent words occur tens of millions of times.
I have applied a log10 transformation to reduce the skewness, yet the transformed data still violates the assumption of normality.
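For illustration, here is a minimal Python sketch of this kind of check. It uses synthetic Zipf-like counts rather than the actual corpus; the constants `TOP_COUNT` and `N_WORDS`, the list `counts`, and the helper `skewness` are all assumptions made up for the example, chosen only so that roughly 30% of the toy words occur once.

```python
import math
import statistics


def skewness(values):
    """Population skewness: third central moment divided by the sd cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Synthetic Zipf-like counts (NOT real corpus data): the word at rank r
# occurs roughly TOP_COUNT / r times, with a floor of one occurrence.
TOP_COUNT = 10_000  # assumed count of the most frequent word
N_WORDS = 9_523     # assumed vocabulary size
counts = [max(1, round(TOP_COUNT / r)) for r in range(1, N_WORDS + 1)]

# The log10 transformation described in the post.
log_counts = [math.log10(c) for c in counts]

hapax_share = sum(1 for c in counts if c == 1) / len(counts)
skew_raw = skewness(counts)
skew_log = skewness(log_counts)

print(f"share of words used once: {hapax_share:.2f}")
print(f"skewness of raw counts:   {skew_raw:.2f}")
print(f"skewness of log10 counts: {skew_log:.2f}")
print(f"log10 mean: {statistics.fmean(log_counts):.2f}")
print(f"log10 sd:   {statistics.pstdev(log_counts):.2f}")
```

On this toy data the log10 transformation lowers the skewness considerably, but the result stays right-skewed, because every word that occurs once maps to exactly 0 on the log scale and forms a large spike there.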
I need a usable mean and standard deviation.
Does anyone have tips or ideas on how to proceed?
Thank you in advance.
------------------------------
Stijn Euverman
------------------------------