PPS Sampling in Python
Context
During my internship here in East India, I was asked to sample booths from a constituency using the PPS sampling method. The catch was that I was expected to do it manually. Now, I am a big fan of automating the boring things (thanks to Al Sweigart ;) ). So I wrote a small Python program to do the sampling, and for the benefit of all, I am sharing the code here.
What is PPS?
In PPS (probability proportional to size) sampling, the probability of selecting an element is proportional to its size. This improves the accuracy of population estimates, because the larger elements have the greatest impact on those estimates.
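To make "proportional to its size" concrete, here is a minimal sketch. The three booth sizes are made up for illustration and are not from my data; the point is only that the chance of a single draw landing on a booth equals that booth's share of the total.

```python
import numpy as np

# Hypothetical booth sizes (number of voters); not from the original data.
sizes = np.array([100, 300, 600])

# Under PPS, the chance that a single draw picks a booth is its share of the total.
draw_prob = sizes / sizes.sum()

for size, p in zip(sizes, draw_prob):
    print(f"size={size}: probability={p:.2f}")
```

The booth with 600 voters is six times as likely to be drawn as the booth with 100 voters.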
This example from Wikipedia clearly illustrates the concept of PPS sampling.
Suppose we have six schools with populations of 150, 180, 200, 220, 260, and 490 students respectively (total 1500 students), and we want to use student population as the basis for a PPS sample of size three. To do this, we could allocate the first school numbers 1 to 150, the second school 151 to 330 (= 150 + 180), the third school 331 to 530, and so on to the last school (1011 to 1500). We then generate a random start between 1 and 500 (equal to 1500/3) and count through the school populations by multiples of 500. If our random start was 137, we would select the schools which have been allocated numbers 137, 637, and 1137, i.e. the first, fourth, and sixth schools.
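The Wikipedia example can be reproduced in a few lines of NumPy. This is only a sketch of the idea: the variable names are my own, and the start value is fixed at 137 to match the example, whereas in practice it would be drawn at random between 1 and the interval.

```python
import numpy as np

populations = np.array([150, 180, 200, 220, 260, 490])
sample_size = 3

cum_totals = populations.cumsum()          # 150, 330, 530, 750, 1010, 1500
interval = cum_totals[-1] // sample_size   # 1500 // 3 = 500
start = 137                                # fixed to match the example; normally random in 1..interval

# One selection point per sample: 137, 637, 1137
points = start + interval * np.arange(sample_size)

# For each point, searchsorted gives the index of the first cumulative total >= that point,
# which is the school whose allocated range contains the point.
school_idx = np.searchsorted(cum_totals, points)

print(school_idx + 1)   # 1-based school numbers: [1 4 6]
```

The result is the first, fourth, and sixth schools, the same as in the quoted example.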
The Code
import numpy as np
import pandas as pd

df = pd.read_csv('filename.csv')
total = df['Total'].sum()
sample_size = 81  # number of samples to be selected
interval_width = int(total / sample_size)
df['cum_sum'] = df['Total'].cumsum()
num = interval_width  # can also be a random number from 1 to interval_width, as in the example
# one selection point per sample: num, num + interval_width, ...
sampled_series = num + interval_width * np.arange(sample_size)
cum_array = np.asarray(df['cum_sum'])
# the heart of the code: index of the first cumulative total >= each selection point
idx = np.searchsorted(cum_array, sampled_series)
# a booth larger than interval_width can be hit more than once; it is kept only once
ndf = df.iloc[np.unique(idx)]
ndf = ndf.drop(columns='cum_sum')  # so that the new file doesn't have the cum_sum column
ndf.to_csv('sampled_file.csv', index=False)
Explanation
- Importing pandas for reading the CSV and numpy for performing array operations.
- Finding the total of all people in all the booths.
- Deciding the sample size. We chose 33% of the total, which was 81.
- Calculating interval width, and generating a column of cumulative frequencies.
- Making a new series of selection points, which starts from num and goes up in steps of interval_width.
- Performing the searchsorted operation on the sampled series, to find the booth whose cumulative range contains each selection point.
- Making a new dataframe with only the sampled booths.
- Deleting the cumulative frequency column from the new frame.
- Exporting the CSV file.
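The steps above can also be wrapped in a reusable function with a random start. This is a minimal sketch, not the exact program I ran: the function name pps_systematic_sample, its parameters, and the school labels A to F are all hypothetical, and it assumes the size column holds positive integers whose total is at least sample_size.

```python
import numpy as np
import pandas as pd


def pps_systematic_sample(df, size_col, sample_size, start=None, seed=None):
    """Return the rows of df chosen by systematic PPS sampling on size_col."""
    cum_totals = df[size_col].cumsum().to_numpy()
    interval = int(cum_totals[-1] // sample_size)
    if start is None:
        # random start between 1 and interval, inclusive
        rng = np.random.default_rng(seed)
        start = int(rng.integers(1, interval + 1))
    # one selection point per sample
    points = start + interval * np.arange(sample_size)
    idx = np.searchsorted(cum_totals, points)
    # a unit larger than the interval can be hit more than once; keep it once
    return df.iloc[np.unique(idx)]


# The six schools from the Wikipedia example, with made-up labels.
schools = pd.DataFrame({
    "school": ["A", "B", "C", "D", "E", "F"],
    "Total": [150, 180, 200, 220, 260, 490],
})
print(pps_systematic_sample(schools, "Total", 3, start=137)["school"].tolist())
```

Passing start=137 reproduces the Wikipedia example; leaving start out draws a fresh random start each time, and seed makes that draw repeatable.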
The explanation is not easy to follow on its own. But if you read the example first and then look at the code, everything becomes easy to understand.
Conclusion
Using this, one can perform PPS sampling easily on a small or a large dataset. I thank the people at Stack Overflow who helped me write this program.