Python Automation
Here’s how I used Python and QGIS to find where 80% of India’s camel population lives.
I wrote a Python script that pulled animal population data from the 2019 Livestock Census on India’s Open Data Platform and used it to identify the top 18 districts where 80% of India’s camel population resides. Read on to find out how.
“Hey Aayush, I need to know where to focus my camel-based livelihoods intervention for low-income rural women in India,” requested a friend of mine from Rajasthan. She knew that Rajasthan’s desert districts would be the answer, but she needed numbers from official sources to back up her intuition. This is where I entered the picture. Here are the steps I took to find out this information for her.
Step 1
Go to the Government of India’s Open Data Platform at data.gov.in and browse the data sheets for the 20th Livestock Census (2019), published by the Department of Animal Husbandry and Dairying, Ministry of Fisheries, Animal Husbandry and Dairying, available at data.gov.in/catalog/20th-livestock-census.
Step 2
Open the browser’s developer tools, find the network request that lists all the files, and copy it as cURL so you can access all the files in one go. Had I not done this, I would have had to click on each file individually, fill out a form, solve a captcha, and then download each CSV manually. This saved me quite a bit of time.
Step 3
Paste the cURL command into Postman and export the corresponding Python requests code for fetching and parsing the response.
Step 4
I used Python to parse the response and collected all the file URLs using this code.
import os

import pandas as pd
import requests

# Endpoint that lists every resource in the 20th Livestock Census catalog
url = "https://data.gov.in/backend/dmspublic/v1/resources?filters[catalog_reference]=6885101&offset=0&limit=35&sort[changed]=desc&filters[domain_visibility]=4"

# Headers exported from Postman; the cookie is session-specific,
# so refresh it from your own browser if the request fails
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,de;q=0.7',
    'Connection': 'keep-alive',
    'Cookie': 'fontSize=67.5; citrix_ns_id=AAA7TRx1ZjuD0ksAAAAAADuMGtjGAxHPGX4gOzVglnj-t-2_KYp3QS5pOwB3wsrGOw==jyF1Zg==9zCLz_Tsia4CNE6H2-pAKy8Ou1w=; citrix_ns_id=AAA7TRx1ZjuD0ksAAAAAADuMGtjGAxHPGX4gOzVglnj-t-2_KYp3QS5pOwB3wsrGOw==uiR1Zg==JTC1HaNqvL2oNi2kwWYolcsi_TU=',
    'Referer': 'https://data.gov.in/catalog/20th-livestock-census',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'dnt': '1',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
    'sec-gpc': '1'
}

response = requests.get(url, headers=headers)
rjson = response.json()
rows = rjson['data']['rows']

# Each row holds the path of one data file; prepend the scheme to get a full URL
urls = []
for row in rows:
    file_url = "https://" + row['datafile'][0]
    print(file_url)
    urls.append(file_url)
The final output is the list of URLs from which the CSVs can be downloaded.
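As an aside, the offset and limit query parameters suggest the listing is paginated. This catalog fits in a single page of 35, but for a larger catalog you could page through it with something like the following sketch (assuming the API returns an empty rows list once the resources are exhausted, which I have not verified):

def fetch_all_rows(headers, page_size=35):
    # Hypothetical pagination helper: keep advancing the offset
    # until the API returns no more rows
    all_rows = []
    offset = 0
    while True:
        page_url = (
            "https://data.gov.in/backend/dmspublic/v1/resources"
            f"?filters[catalog_reference]=6885101&offset={offset}"
            f"&limit={page_size}&sort[changed]=desc"
            "&filters[domain_visibility]=4"
        )
        batch = requests.get(page_url, headers=headers).json()['data']['rows']
        if not batch:  # assumed: an empty page means we have seen everything
            break
        all_rows.extend(batch)
        offset += page_size
    return all_rows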
Step 5
I downloaded all the files and concatenated them into one final dataframe. I then sorted the districts by camel population and kept adding districts until their cumulative share reached the 80% cutoff.
save_directory = './csv_files/'
if not os.path.exists(save_directory):
    os.makedirs(save_directory)

def download_file(url, save_directory):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Check if the request was successful
        file_name = os.path.join(save_directory, url.split('/')[-1])
        with open(file_name, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded: {file_name}")
    except requests.exceptions.RequestException as e:
        print(f"Failed to download {url}: {e}")

for url in urls:
    download_file(url, save_directory)

# Read every downloaded CSV and collect the union of all column names
csv_directory = save_directory
dataframes = []
all_headers = set()
for filename in os.listdir(csv_directory):
    if filename.endswith('.csv'):
        file_path = os.path.join(csv_directory, filename)
        df = pd.read_csv(file_path)
        dataframes.append(df)
        all_headers.update(df.columns)

# Check if all dataframes have exactly the same headers
headers_match = all(set(df.columns) == all_headers for df in dataframes)
if headers_match:
    combined_df = pd.concat(dataframes, ignore_index=True)
else:
    # Align every dataframe to the full set of columns before concatenating
    combined_df = pd.DataFrame(columns=sorted(all_headers))
    for df in dataframes:
        df = df.reindex(columns=combined_df.columns)  # Ensure all columns are present
        combined_df = pd.concat([combined_df, df], ignore_index=True)

combined_csv_path = os.path.join(csv_directory, 'combined.csv')
combined_df.to_csv(combined_csv_path, index=False)
print(f"Combined CSV saved to {combined_csv_path}")

# Sort districts by camel population and compute each district's
# cumulative share of the national total
combined_df['camel'] = pd.to_numeric(combined_df['camel'], errors='coerce')
df_sorted = combined_df.sort_values(by='camel', ascending=False).reset_index(drop=True)
df_sorted['cumulative_sum'] = df_sorted['camel'].cumsum()
total_camels = df_sorted['camel'].sum()
df_sorted['cumulative_percentage'] = df_sorted['cumulative_sum'] / total_camels

# Keep the top districts whose cumulative share stays within the 80% cutoff
districts_80_percent = df_sorted[df_sorted['cumulative_percentage'] <= 0.8]
print(districts_80_percent[['state_name', 'district_name', 'camel', 'cumulative_sum', 'cumulative_percentage']])
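As a quick sanity check, you can print how many districts made the cut and what share of the national herd they cover:

# Verify the selection: the districts together should cover
# just under 80% of all camels in India
print(f"{len(districts_80_percent)} districts cover "
      f"{districts_80_percent['camel'].sum() / total_camels:.1%} of India's camels")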
Step 6
I downloaded a GeoJSON file of India’s districts, created a filter to keep only these 18 districts, plotted the result in QGIS, and set the background layer to Google Satellite to get the final output.
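For reference, the district filter can be done in a few lines of Python before loading the file into QGIS. This is a minimal sketch, assuming a hypothetical input file india_districts.geojson whose features store the district name in a 'district' property; the actual file name and property key depend on your source:

import json

# Districts selected in Step 5; uppercase both sides so minor
# capitalisation differences between datasets don't break the match
target_districts = set(districts_80_percent['district_name'].str.upper())

with open('india_districts.geojson') as f:  # hypothetical file name
    geo = json.load(f)

# Keep only the features whose district name is in the target set
geo['features'] = [
    feature for feature in geo['features']
    if feature['properties'].get('district', '').upper() in target_districts
]

with open('camel_districts.geojson', 'w') as f:
    json.dump(geo, f)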
If you want to know more about any step, feel free to reach out to me. This is just a tiny example of what becomes possible when decisions are grounded in openly available data. Enjoy!