```python
# import
import networkx as nx
from inspect import getmembers

# fetching the name of all drawing related members of NetworkX
for x in getmembers(nx):
    if 'draw' in x[0]:
        print(x)
```
Upload Python package to PyPI

```shell
# clean the previous build files
python setup.py clean --all

# build the new distribution files
python setup.py sdist bdist_wheel

# upload the latest version to pypi
twine upload --skip-existing dist/*
```
Python virtual environment
```shell
# create the virtual environment in the current directory
python -m venv projectnamevenv

# pick one of the below based on your OS (default: Linux)
# activate the virtual environment - Linux
source projectnamevenv/bin/activate
# activate the virtual environment - Windows
# .\projectnamevenv\Scripts\activate.bat
```
Where is my Python installed?
To know the exact location where the Python distribution is installed, follow the steps suggested here.
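As a quick sketch of the same idea, you can also ask the interpreter itself; `sys.executable` is the standard-library way to get the absolute path of the running Python binary:

```python
# import
import sys

# print the absolute path of the currently running Python interpreter
print(sys.executable)
```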
Extract file paths using glob

glob is a very efficient way to extract relevant files or folders using Python. A few examples are shown below.
```python
# import
from glob import glob

# Ex 1: fetch all files within a directory
glob("../data/01_raw/CoAID/*")

# Ex 2: fetch all files within a directory with a pattern 'News*COVID-19.csv'
glob("../data/01_raw/CoAID/folder_1/News*COVID-19.csv")

# Ex 3: fetch all files within multiple directories which
# follow a pattern 'News*COVID-19.csv' (recursive=True lets '**' span subdirectories)
glob("../data/01_raw/CoAID/**/News*COVID-19.csv", recursive=True)
```
Increase the pandas column width in Jupyter Lab or Notebook

Most of the time, text in a dataframe column gets truncated when displayed. One way to handle this is to increase the max width of all columns in the dataframe, as shown below.
```python
import pandas as pd
pd.set_option('max_colwidth', 100)  # increase 100 to add more space for bigger text
```
Parse date and time from string
There are basically 2 ways to do this: (1) Trust the machine 🤖: for the majority of 'famous' date writing styles, you can use the dateparser package, which automatically extracts the date and parses it into datetime format; (2) Trust yourself 💪: if you already know the format, use datetime.strptime from the standard library with an explicit format string.
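Below is a minimal sketch of both options; the date strings are just illustrative examples:

```python
# import
from datetime import datetime
import dateparser

# Option 1: let dateparser detect the format automatically
print(dateparser.parse("January 12, 2012 10:00 PM"))

# Option 2: parse with an explicit, known format using the standard library
print(datetime.strptime("2012-01-12 22:00", "%Y-%m-%d %H:%M"))
```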
Bulk insert data into MongoDB

While pymongo provides the insert_many function for bulk inserts, it breaks in case of a duplicate key. We can handle this with the following function, which in the worst case behaves similar to insert_one, but shines otherwise.
```python
# import
import pymongo

# function
def insert_many_wrapper(df, col):
    """bulk insert docs into MongoDB while handling duplicate docs

    Parameters
        - df (pandas.DataFrame): one row per doc, with an `_id` column
        - col (pymongo.collection.Collection): pymongo collection object in which insertion is to be done
    """
    # make a copy and reset index
    df = df.copy().reset_index(drop=True)
    # vars
    all_not_inserted = True
    duplicate_count = 0
    ttl_docs = df.shape[0]
    # iterate till all insertions are done (or skipped in case of duplicates)
    while all_not_inserted and not df.empty:
        # try insertion
        try:
            col.insert_many(df.to_dict(orient='records'), ordered=True)
            all_not_inserted = False
        except pymongo.errors.BulkWriteError as e:
            # an ordered insert stops at the first duplicate;
            # drop everything up to (and including) it and retry with the rest
            id_till_inserted = e.details['writeErrors'][0]['keyValue']['_id']
            index_in_df = df[df['_id'] == id_till_inserted].index[0]
            print(f"Duplicate id: {id_till_inserted}, Current index: {index_in_df}")
            df = df.loc[index_in_df + 1:, :]
            duplicate_count += 1
    # final status
    print(f"Total docs: {ttl_docs}, Inserted: {ttl_docs - duplicate_count}, Duplicates found: {duplicate_count}")
```
Search top StackExchange questions
Stack Exchange exposes several API endpoints to access the questions, answers, or posts from their website.

A simple implementation to search and download the latest (from yesterday) and top-voted questions is shown below. For more such API endpoints, consult their official doc.
"""Request StackExchange API to get the top 10 most voted questions and their answer from yesterday"""importrequestsimportjsonimportdatetimeimporttime# Get the current datetoday=datetime.date.today()yesterday=today-datetime.timedelta(days=1)# Get the current timenow=datetime.datetime.now()# Get the time of yesterdayyesterday_time=now.replace(day=yesterday.day,month=yesterday.month,year=yesterday.year)# Convert the time to epoch timeyesterday_epoch=int(time.mktime(yesterday_time.timetuple()))# Get the time of todaytoday_time=now.replace(day=today.day,month=today.month,year=today.year)# Convert the time to epoch timetoday_epoch=int(time.mktime(today_time.timetuple()))# Get the top 10 most voted questions and their answer from yesterdayurl="https://api.stackexchange.com/2.2/questions?pagesize=10&fromdate="+ \
str(yesterday_epoch)+"&todate="+str(today_epoch)+ \
"&order=desc&sort=votes&site=stackoverflow"# Get the response from the APIresponse=requests.get(url)# Convert the response to JSONdata=response.json()# Print the dataprint(json.dumps(data,indent=4))
Export complete data from ElasticSearch
Due to several memory and efficiency related limitations, it is non-trivial to export complete data from an ElasticSearch database.

That said, it is not impossible. Below is a scan-based implementation that does the same for a dummy test_news index.
```python
# import
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
from tqdm import tqdm

# config
index_name = 'test_news'
db_ip = 'http://localhost:9200'

# connect to elasticsearch
es = Elasticsearch([db_ip])

# fetch all data from elasticsearch
scroll = scan(es, index=index_name, query={"query": {"match_all": {}}})
data = []
for res in tqdm(scroll):
    data.append(res['_source'])

# convert to pandas dataframe and export as csv
pd.DataFrame(data).to_csv("news_dump.csv", index=False)
```
Convert python literals from string
While doing I/O with databases or config files, we may get some literals (e.g., a list) in the form of a string, wherein they maintain their structure but not their type. We can use the ast package to convert them back to their correct type.
Quoting the documentation, "With ast.literal_eval you can safely evaluate an expression node or a string containing a Python literal or container display. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, booleans, and None."
```python
# import
import ast

# list literal in string format
list_as_string = '["a", "b"]'
# convert
list_as_list = ast.literal_eval(list_as_string)
# Output: ["a", "b"]
```
Plotly visualization on Pandas dataframe
If you want to visualize your pandas dataframe using the plotly package, there is no need to use the package explicitly. It can be done right from the pandas dataframe object, with just a couple of lines of code as shown below:
```python
# import
import pandas as pd

# set the backend plotting option
pd.options.plotting.backend = "plotly"

# do a normal plot! (`result` is any tabular data with 'size' and 'mean' fields)
pd.DataFrame(result).plot(x='size', y='mean')
```
Conda cheat sheet
Conda is an open-source, cross-platform, language-agnostic package manager and environment management system. It comes in multiple varieties:

- Miniconda: a minimalistic package with Python, conda, and some base packages.
- Anaconda: a bigger package with everything in Miniconda plus around 150 high-quality packages.
While the complete documentation can be accessed from here, some important snippets are:
```shell
# list all supported python versions
conda search python

# create a new global conda environment (with new python version)
# note, py39 is the name of the env
conda create -n py39 python=3.9

# create a new local conda environment
# (under venv folder in current directory and with new python version)
conda create -p ./venv python=3.9

# list all of the environments
conda info --envs

# activate an environment
conda activate py39  # where py39 is the name of the env

# deactivate the current environment
conda deactivate

# delete an environment
conda env remove -n py39
```
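Two related commands that often come in handy (not part of the list above, but standard conda features):

```shell
# export the active environment (with all its packages) to a file
conda env export > environment.yml

# recreate the environment from that file
conda env create -f environment.yml
```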
Requirement files
A requirements file is a collection of packages you want to install for a project. A sample file is shown below:
```text
# file name: requirements.txt
package-one==1.9.4
git+https://github.com/path/to/package-two@41b95ec#egg=package-two
package-three==1.0.1
package-four
```
- Note the three ways of defining packages: (1) with a version number, (2) with a GitHub source, and (3) without a version number (installs the latest). Once done, you can install all these packages in one go with pip install -r requirements.txt, as shown below.
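As a usage sketch (the pip freeze line is an optional extra; it is one common way to generate such a file from an existing environment):

```shell
# install every package listed in the file
pip install -r requirements.txt

# (optional) snapshot the current environment into a requirements file
pip freeze > requirements.txt
```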
Reading .numbers file
.numbers is a proprietary file format of Apple's Numbers application. It is a spreadsheet format used to store data in tables. To process and load the data from a .numbers file, we can use the numbers_parser package. Below is an example of how to read the data from multiple .numbers files and combine them into one file.
```python
# import
import os
import glob
import pandas as pd
from tqdm import tqdm
from numbers_parser import Document

# get the current working directory (the original snippet assumed `cwd` was already defined)
cwd = os.getcwd()
# get the list of files in the directory using glob
numbers_files = glob.glob(cwd + '/*.numbers')

# combine the .numbers files into one dataframe
combined_df = []
for file in tqdm(numbers_files):
    doc = Document(file)
    sheets = doc.sheets
    tables = sheets[0].tables
    data = tables[0].rows(values_only=True)
    df = pd.DataFrame(data[1:], columns=data[0])
    df['file'] = file
    # append the dataframe to the combined list
    combined_df.append(df)

# save the combined file
combined_df = pd.concat(combined_df)
combined_df.to_csv('combined_numbers_new.csv', index=False)
```
Pandas Groupby Function
Pandas can be utilised for fast analysis of categorical data using groupby. Let's have a look.
```python
# import
import numpy as np
import pandas as pd

# load a dummy df
df = pd.read_csv('dummy.csv')
# example below
## Name | Gender | Salary
## Ravi | Male   | $20,000
## Sita | Female | $40,000
## Kito | Female | $11,000

# perform groupby to get average salary per gender
## Option 1
df.groupby(['Gender']).agg({'Salary': [np.mean]})
## Option 2
df.groupby(['Gender']).mean()
## Option 3
df.groupby(['Gender']).apply(lambda x: x['Salary'].mean())
```
Save and Load from Pickle
Pickle can be used to efficiently store and load Python objects and data. Refer to StackOverflow.
```python
# import
import pickle

# create data
a = {'a': 1, 'b': [1, 2, 3, 4]}

# save pickle
with open('filename.pickle', 'wb') as handle:
    pickle.dump(a, handle, protocol=pickle.HIGHEST_PROTOCOL)

# load pickle
with open('filename.pickle', 'rb') as handle:
    b = pickle.load(handle)

# check
assert a == b
```
Download YouTube video

A YouTube video can be downloaded using the pytube package. Here is an example.
```python
# import
from pytube import YouTube

## var: link to download
video_url = "https://www.youtube.com/watch?v=JP41nYZfekE"

# create instance
yt = YouTube(video_url)

# download the highest-resolution progressive mp4 stream
abs_video_path = yt.streams.filter(progressive=True, file_extension='mp4') \
                           .order_by('resolution').desc().first().download()

## print(f"Video downloaded at {abs_video_path}")
```
Machine Translation
EasyNMT lets you perform state-of-the-art machine translation with just 3 lines of python code!
It supports translation between 150+ languages and automatic language detection for 170+ languages. Pre-trained machine translation models are auto-downloaded and can perform sentence and document translations!
```python
# import
from easynmt import EasyNMT

# load model
model = EasyNMT('opus-mt')

# translate a single sentence to German
print(model.translate('This is a sentence we want to translate to German', target_lang='de'))
## Output: Dies ist ein Satz, den wir ins Deutsche übersetzen wollen
```
Pandas read excel file
While pandas is quite famous for CSV analysis, it can be used to read and process Excel files as well. Here are some snippets,
```python
# import
import pandas as pd

# if you just want to read one sheet (by default it reads the first one)
df = pd.read_excel("file.xlsx", sheet_name="Page1")

# if you want to get the names of sheets and do more selective reading
excel_data = pd.ExcelFile("file.xlsx")
# get the sheet names
print(excel_data.sheet_names)
# read one sheet (decide using the last print result)
sheet_name = '..'
df = excel_data.parse(sheet_name)
```
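One more pattern, not in the original snippet but standard pandas behavior: pass sheet_name=None to load every sheet at once as a dict keyed by sheet name.

```python
# import
import pandas as pd

# load all sheets in one call; returns {sheet_name: dataframe}
all_sheets = pd.read_excel("file.xlsx", sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(name, sheet_df.shape)
```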
Send Slack Messages
One of the easiest ways to send Slack messages is via a unique Incoming Webhook.

Basically, you need to create a Slack App, register an incoming webhook with the app, and whenever you want to post a message, just send a payload to the webhook. For more details on the setup, you can refer to the official page.

Once done, you just need to send the message as shown below,
```python
# import requests (needed to connect with webhook)
import requests

# func
def send_message_to_slack(message):
    # set the webhook
    webhook_url = "...enter incoming webhook url here..."
    # modify the message payload
    payload = '{"text": "%s"}' % message
    # send the message
    response = requests.post(webhook_url, payload)

# test
send_message_to_slack("test")
```
Colab Snippets
Google Colab is the go-to place for many data scientists and machine learning engineers who are looking to perform quick analysis or training for free. Below are some snippets that can be useful in Colab.
If you are getting NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968 or a similar error when trying to run !pip install or similar CLI commands in Google Colab, you can fix it by running the following command first. But note, this might break some imports, so make sure to import all the required packages before running it.
```python
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# now import
# !import ...
```
Asyncio Gather
asyncio.gather is a powerful function in Python's asyncio module that allows you to run multiple coroutines concurrently and collect the results. Here is an example of how to use asyncio.gather to process a list of inputs concurrently and maintain order.
```python
# import
import asyncio
import random

# func to process input
async def process_input(input_value):
    # generate a random sleep time between 1 and 5 seconds
    sleep_time = random.uniform(1, 5)
    print(f"Processing {input_value}. Will take {sleep_time:.2f} seconds.")
    # simulate some processing time
    await asyncio.sleep(sleep_time)
    return f"Processed: {input_value}"

async def main():
    # list of inputs to process
    inputs = ["A", "B", "C", "D", "E"]
    # create a list of coroutines to run
    tasks = [process_input(input_value) for input_value in inputs]
    # use asyncio.gather to run the coroutines concurrently and maintain order
    results = await asyncio.gather(*tasks)
    # print the results
    for input_value, result in zip(inputs, results):
        print(f"Input: {input_value} -> {result}")

# run the main function
if __name__ == "__main__":
    asyncio.run(main())
```
Once you run the above code, here is an example of the output you might see (the sleep times are random, so the exact numbers will differ on each run):
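```text
Processing A. Will take 3.11 seconds.
Processing B. Will take 1.93 seconds.
Processing C. Will take 4.67 seconds.
Processing D. Will take 2.58 seconds.
Processing E. Will take 1.20 seconds.
Input: A -> Processed: A
Input: B -> Processed: B
Input: C -> Processed: C
Input: D -> Processed: D
Input: E -> Processed: E
```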
As you can see from the output, the inputs are processed concurrently, and the results are collected in the order of the inputs. This is true even if the processing times are different for each input.