Extremely skewed data

    Posted Fri March 29, 2024 12:51 PM

    Hello everyone,

    I am attempting to analyze a dataset containing extremely skewed data, namely a frequency list of a large Dutch language corpus.

    The problem is that around 30% of the words in the corpus occur only once, while the most frequent words occur tens of millions of times.

    I have applied a log10 transformation to reduce some of the skewness, yet the data still violate the assumption of normality.
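
    To make the situation concrete, here is a minimal Python sketch of the kind of check described above. It does not use the actual corpus: the Zipf-like counts, the function names (synthetic_counts, skewness) and the parameter values are all assumptions chosen only so that roughly 30% of the words occur once and the top word occurs ten million times. It compares the skewness of the raw counts with that of the log10-transformed counts.

```python
import math
import statistics


def synthetic_counts(n_types=3200, top_count=10_000_000):
    """Hypothetical Zipf-like word counts (an assumption, NOT the real corpus)."""
    return [max(1, int(top_count / rank ** 2)) for rank in range(1, n_types + 1)]


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


counts = synthetic_counts()
log_counts = [math.log10(c) for c in counts]

# Share of words that occur exactly once in the synthetic list.
hapax_share = sum(c == 1 for c in counts) / len(counts)

print(f"share of words occurring once: {hapax_share:.2%}")
print(f"skewness of raw counts:        {skewness(counts):.2f}")
print(f"skewness of log10 counts:      {skewness(log_counts):.2f}")
print(f"mean of log10 counts:          {statistics.fmean(log_counts):.3f}")
print(f"std. dev. of log10 counts:     {statistics.pstdev(log_counts):.3f}")
```

    In this synthetic example the log10 transformation lowers the skewness considerably, but the transformed values remain right-skewed and pile up at zero (the words that occur once), which is the pattern described in this post.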

    I need a usable mean and standard deviation.

    Does anyone have tips or ideas on how to proceed?

    Thank you in advance.

    Stijn Euverman