Community#

A summary of the demographic information of the NumPy survey respondents.

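import numpy as np
import matplotlib.pyplot as plt
from myst_nb import glue
# Assumption: gluval lives in the same helper module as flatten (imported further below)
from numpy_survey_results.utils import gluval
# Assumption: capitalize and strip are the element-wise string helpers from numpy.char
from numpy.char import capitalize, strip

fname = "data/2021/numpy_survey_results.tsv"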
column_names = [
    'age', 'gender', 'lang', 'lang_other', 'country', 'degree', 'degree_other',
    'field_of_study', 'field_other', 'role', 'role_other', 'version', 'version_other',
    'primary_use', 'share_code', 'programming_exp', 'numpy_exp', 'use_freq', 'components',
    'use_c_ext', 'prog_lang', 'prog_lang_other', 'surv2020'
]
demographics_dtype = np.dtype({
    "names": column_names,
    "formats": ['<U600'] * len(column_names),
})

data = np.loadtxt(
    fname, delimiter='\t', skiprows=3, dtype=demographics_dtype, 
    usecols=list(range(11, 33)) + [90], comments=None, encoding='UTF-16'
)

glue('2021_num_respondents', data.shape[0], display=False)
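
The loading step can be tried without the survey file. The sketch below is an illustration only: the three-row table and its `id`, `age`, and `gender` columns are made up, and it reads from an in-memory string rather than a UTF-16 file. It shows how `usecols` selects columns and how a structured dtype makes each column addressable by name.

```python
import io

import numpy as np

# Made-up stand-in for the survey export: one header row, tab-separated fields
tsv = (
    "id\tage\tgender\n"
    "1\t25-34\tMale\n"
    "2\t\tFemale\n"
    "3\t35-44\tPrefer not to answer\n"
)

names = ["age", "gender"]
toy_dtype = np.dtype({"names": names, "formats": ["<U600"] * len(names)})

toy = np.loadtxt(
    io.StringIO(tsv),
    delimiter="\t",
    skiprows=1,        # skip the header row
    dtype=toy_dtype,
    usecols=[1, 2],    # leave out the id column
    comments=None,     # treat '#' as ordinary text
)

print(toy.shape[0])    # number of rows read
print(toy["age"])      # skipped questions come back as empty strings
```

Because every field is stored as text, unanswered questions appear as empty strings, which is why the cells that follow filter on `!= ''`.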

Demographics#

Age#

Of the 522 survey respondents, 456 (87%) shared their age.

The majority of respondents are in the age groups 25-34 and 35-44. Very few respondents are older than 55, and even fewer are younger than 18.

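# Keep only answered fields (ignore empty fields and "Prefer not to answer")
keep = np.logical_and(data['age'] != '', data['age'] != 'Prefer not to answer')
age = data['age'][keep]
labels, cnts = np.unique(age, return_counts=True)
# np.unique sorts the labels as strings; move the last label to the front
# so that the age groups appear in ascending order
ind = np.array([5, 0, 1, 2, 3, 4])
labels, cnts = labels[ind], cnts[ind]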

fig, ax = plt.subplots(figsize=(12, 8))
ax.bar(
  np.arange(len(labels)),
  100 * cnts / age.shape[0], 
  tick_label=labels,
)
ax.set_ylabel('Percentage of Respondents')
ax.set_xlabel('Age Group')
ax.set_title("Age Distribution of Survey Respondents");
fig.tight_layout()

glue('2021_num_age_respondents', gluval(age.shape[0], data.shape[0]), display=False)
[Figure: bar chart "Age Distribution of Survey Respondents", showing the percentage of respondents in each age group]
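
The counting pattern used in this cell can be sketched on a handful of made-up answers. The `count_with_percent` helper is hypothetical; it only imitates the "456 (87%)" style of the text above and is not necessarily what `gluval` does.

```python
import numpy as np

# Made-up answers; '' marks a skipped question
answers = np.array(["25-34", "", "35-44", "25-34", "Prefer not to answer", "18-24"])

keep = np.logical_and(answers != "", answers != "Prefer not to answer")
age = answers[keep]
labels, cnts = np.unique(age, return_counts=True)
percent = 100 * cnts / age.shape[0]


def count_with_percent(n, total):
    """Hypothetical helper: format a count together with its share of the total."""
    return f"{n} ({100 * n / total:.0f}%)"


for lbl, pct in zip(labels, percent):
    print(f"{lbl}: {pct:.0f}%")
print(count_with_percent(age.shape[0], answers.shape[0]))
print(count_with_percent(456, 522))
```

Note that the bar heights are percentages of the respondents who answered the question, while the headline figure is a share of all respondents.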

Gender#

Of the 522 survey respondents, 453 (87%) shared their gender.

A majority of respondents identify as male. Only about 11% of respondents identify as female.

# Keep only answered fields (ignore empty fields and "Prefer not to answer")
keep = np.logical_and(data['gender'] != '', data['gender'] != 'Prefer not to answer')
gender = data['gender'][keep]
labels, cnts = np.unique(gender, return_counts=True)

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(cnts, labels=labels, autopct='%1.1f%%')
ax.set_title("Gender Distribution");
fig.tight_layout()

glue('2021_num_gender', gluval(gender.shape[0], data.shape[0]), display=False)
[Figure: pie chart "Gender Distribution"]

Language Preference#

Of the 522 respondents, 497 (95%) shared their preferred language.

Over 67% of respondents reported English as their preferred language.

# Ignore empty fields
lang = data['lang'][data['lang'] != '']
# Self-reported language
lang_other = data['lang_other'][data['lang_other'] != '']
lang_other = capitalize(lang_other)
lang = np.concatenate((lang, lang_other))
labels, cnts = np.unique(lang, return_counts=True)
cnts = 100 * cnts / cnts.sum()
I = np.argsort(cnts)[::-1]
labels, cnts = labels[I], cnts[I]

# Create a summary table
with open('_generated/language_preference_table.md', 'w') as of:
    of.write('| **Language** | **Preferred by % of Respondents** |\n')
    of.write('|--------------|-----------------------------------|\n')
    for lbl, percent in zip(labels, cnts):
        of.write(f'| {lbl} | {percent:1.1f} |\n')

glue('2021_num_lang_pref', gluval(lang.shape[0], data.shape[0]), display=False)

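| **Language** | **Preferred by % of Respondents** |
|--------------|-----------------------------------|
| English | 67.2 |
| Other | 6.6 |
| Spanish | 6.2 |
| French | 5.4 |
| Japanese | 5.0 |
| Russian | 1.8 |
| German | 1.4 |
| Mandarin | 1.0 |
| Portuguese | 0.6 |
| Polish | 0.6 |
| Swedish | 0.4 |
| Czech | 0.4 |
| Italian | 0.4 |
| Turkish | 0.4 |
| Chinise | 0.2 |
| 中文 | 0.2 |
| Chinese | 0.2 |
| Bulgarian | 0.2 |
| Allemand | 0.2 |
| Hungarian | 0.2 |
| Dutch | 0.2 |
| Hebrew | 0.2 |
| Hindi | 0.2 |
| Indonesian | 0.2 |
| Korean | 0.2 |
| “norwegian “ | 0.2 |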
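
The way fixed-choice answers and free-text write-ins are merged into one ranked table can be sketched as follows. The answers are made up, and `np.char.strip` and `np.char.capitalize` are used on the assumption that they match the `capitalize` helper called in the cell above. Trimming and capitalizing merges simple variants such as "german" and "German ", but misspellings or translations of the same language name would still be counted separately.

```python
import numpy as np

# Made-up answers: a fixed-choice column and a free-text "other" column
lang = np.array(["English", "English", "English", "Spanish"])
lang_other = np.array(["german", "German ", "polish"])

# Trim whitespace and capitalize so simple spelling variants are counted together
lang_other = np.char.capitalize(np.char.strip(lang_other))
combined = np.concatenate((lang, lang_other))

labels, cnts = np.unique(combined, return_counts=True)
percent = 100 * cnts / cnts.sum()
order = np.argsort(percent)[::-1]   # most common first
labels, percent = labels[order], percent[order]

rows = [f"| {lbl} | {pct:1.1f} |" for lbl, pct in zip(labels, percent)]
print("\n".join(rows))
```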

Country of Residence#

Of the 522 respondents, 440 (84%) shared their current country of residence. The survey saw respondents from 54 countries in all. A quarter of respondents reside in the United States.

The following chart shows the relative number of respondents from the ~10 countries with the largest number of participants. For privacy reasons, countries with 10 or fewer respondents are not included in the figure and are instead listed in the subsequent table.

# Preprocess data
country = data['country'][data['country'] != '']
country = strip(country)
# Distribution
labels, cnts = np.unique(country, return_counts=True)
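# Privacy filter: only countries with more than num_resp respondents
# are shown individually; the rest are grouped under "Other"
num_resp = 10
cutoff = (cnts > num_resp)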
plabels = np.concatenate((labels[cutoff], ['Other']))
pcnts = np.concatenate((cnts[cutoff], [cnts[~cutoff].sum()]))

# Plot
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(pcnts, labels=plabels, autopct='%1.1f%%')
ax.set_title("Country of Residence");
fig.tight_layout()

# Map countries to continents
import pycountry_convert as pc
cont_code_to_cont_name = {
    'NA': 'North America',
    'SA': 'South America',
    'AS': 'Asia',
    'EU': 'Europe',
    'AF': 'Africa',
    'OC': 'Oceania',
}
def country_to_continent(country_name):
    cc = pc.country_name_to_country_alpha2(country_name)
    cont_code = pc.country_alpha2_to_continent_code(cc)
    return cont_code_to_cont_name[cont_code]
c2c = np.vectorize(country_to_continent, otypes='U')

# Organize countries below the privacy cutoff by their continent
remaining_countries = labels[~cutoff]
continents = c2c(remaining_countries)
with open('_generated/countries_by_continent.md', 'w') as of:
    of.write('|  |  |\n')
    of.write('|---------------|-------------|\n')
    for continent in np.unique(continents):
        clist = remaining_countries[continents == continent]
        of.write(f"| **{continent}:** | {', '.join(clist)} |\n")

glue('2021_num_unique_countries', len(labels), display=False)
glue(
    '2021_num_country_respondents',
    gluval(country.shape[0], data.shape[0]),
    display=False
)
[Figure: pie chart "Country of Residence"]

| Continent | Countries |
|---------------|-------------|
| **Africa:** | Egypt, Ghana, Kenya, Nigeria, South Africa, Zimbabwe |
| **Asia:** | Bangladesh, China, Indonesia, Israel, Malaysia, Pakistan, Singapore, South Korea, Turkey |
| **Europe:** | Austria, Belgium, Bulgaria, Denmark, Finland, Hungary, Iceland, Ireland, Italy, Latvia, Norway, Poland, Russia, Slovakia, Spain, Sweden, Switzerland, Ukraine |
| **North America:** | Canada, Cuba, Mexico |
| **Oceania:** | Australia, Vanuatu |
| **South America:** | Argentina, Chile, Colombia, Ecuador, Paraguay, Peru, Venezuela |
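
The privacy cutoff can be illustrated on made-up counts. In this sketch the country names are fictional and the `continent_of` dictionary is a hypothetical stand-in for the `pycountry_convert` lookup used in the cell above.

```python
import numpy as np

# Made-up respondent counts per (fictional) country
labels = np.array(["Atlantis", "Freedonia", "Ruritania", "Sylvania"])
cnts = np.array([40, 3, 12, 7])
# Hypothetical country-to-continent lookup, standing in for pycountry_convert
continent_of = {"Freedonia": "Europe", "Sylvania": "Europe"}

num_resp = 10
cutoff = cnts > num_resp   # True for countries that are shown individually
plabels = np.concatenate((labels[cutoff], ["Other"]))
pcnts = np.concatenate((cnts[cutoff], [cnts[~cutoff].sum()]))

# Countries at or below the cutoff are listed by continent, without counts
remaining = labels[~cutoff]
by_continent = {}
for name in remaining:
    by_continent.setdefault(continent_of[name], []).append(str(name))

print(dict(zip(plabels.tolist(), pcnts.tolist())))
print(by_continent)
```

Small countries still contribute to the chart through the aggregated "Other" slice, but their individual counts are never displayed.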

Education#

440 (84%) of respondents shared their education history, which ranges from pre-high-school graduation through the doctorate level and includes many other specialist degrees.

Generally, respondents are highly educated. Nine out of ten have at least a Bachelor’s degree and one in three holds a PhD.

The following figure summarizes the distribution for the highest degrees obtained by respondents.

degree = data['degree'][data['degree'] != '']
labels, cnts = np.unique(degree, return_counts=True)

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(cnts, labels=labels, autopct='%1.1f%%', labeldistance=None)
ax.legend()
ax.set_title("Highest Level of Education");
fig.tight_layout()

glue('2021_num_education', gluval(degree.shape[0], data.shape[0]), display=False)
[Figure: pie chart "Highest Level of Education"]

Job Roles#

205 (47%) of the 435 respondents who shared their occupation identify as a Data scientist, Scientist (in academia), or Software engineer.

role = data['role'][data['role'] != '']
labels, cnts = np.unique(role, return_counts=True)

# Sort results by number of selections
inds = np.argsort(cnts)
labels, cnts = labels[inds], cnts[inds]

fig, ax = plt.subplots(figsize=(12, 8))
ax.barh(np.arange(len(cnts)), cnts, align='center')
ax.set_yticks(np.arange(len(cnts)))
ax.set_yticklabels(labels)
ax.set_xlabel("Number of Respondents")
ax.set_title("Current Role");
fig.tight_layout()

glue('2021_num_occupation', role.shape[0], display=False)
glue(
    '2021_num_top_3_categories',
    gluval(cnts[-3:].sum(), role.shape[0]),
    display=False,
)
glue('2021_top_3_categories', f"{labels[-3]}, {labels[-2]}, or {labels[-1]}", display=False)
[Figure: horizontal bar chart "Current Role", showing the number of respondents per role]
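
The "top three roles" sentence is derived by sorting the counts and taking the last three entries. A minimal sketch with made-up role counts:

```python
import numpy as np

# Made-up role counts
labels = np.array(["Student", "Data scientist", "Manager", "Software engineer", "Scientist"])
cnts = np.array([15, 40, 5, 30, 25])

inds = np.argsort(cnts)   # ascending by count
labels, cnts = labels[inds], cnts[inds]

top3 = cnts[-3:].sum()    # the three largest categories sit at the end
total = cnts.sum()
summary = f"{top3} ({100 * top3 / total:.0f}%) of {total}"
names = f"{labels[-3]}, {labels[-2]}, or {labels[-1]}"
print(summary)
print(names)
```

Sorting in ascending order also suits `barh`, which draws the first entry at the bottom, so the largest category ends up at the top of the chart.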
# Ignore empty fields and "Not sure" answers
drop = np.logical_and(data['surv2020'] != '', data['surv2020'] != 'Not sure')
surv2020 = data['surv2020'][drop]
labels, cnts = np.unique(surv2020, return_counts=True)

glue('2021_num_surv2020_respondents', gluval(surv2020.shape[0], data.shape[0]), display=False)
# np.unique sorts the labels, so index 1 is assumed to be "Yes" (after "No")
yes_percent = 100 * cnts[1] / cnts.sum()
glue('2021_yes_percent', f"{yes_percent:1.1f}", display=False)

Experience and Usage#

Programming Experience#

68% of respondents have significant programming experience (5+ years), with veterans (10+ years) forming the largest group. Interestingly, when it comes to using NumPy, noticeably more respondents identify as beginners than as experienced users.

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# Ascending order for figure
ind = np.array([-1, 0, 2, 3, 1])
for exp_data, ax in zip(('programming_exp', 'numpy_exp'), axes):
    # Analysis
    prog_exp = data[exp_data][data[exp_data] != '']
    labels, cnts = np.unique(prog_exp, return_counts=True)
    cnts = 100 * cnts / cnts.sum()
    labels, cnts = labels[ind], cnts[ind]
    # Generate text on general programming experience
    glue(f'2021_{exp_data}_5plus_years', f"{cnts[-2:].sum():2.0f}%", display=False)
    # Plotting
    ax.bar(np.arange(len(cnts)), cnts)
    ax.set_xticks(np.arange(len(cnts)))
    ax.set_xticklabels(labels)
axes[0].set_title('General Programming Experience')
axes[0].set_ylabel('Percentage of Respondents')
axes[1].set_title('Experience with NumPy');
fig.autofmt_xdate();
fig.tight_layout();
[Figure: two bar charts, "General Programming Experience" and "Experience with NumPy", showing the percentage of respondents per experience level]
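
The hard-coded index array in this cell undoes the string sort applied by `np.unique`. The sketch below assumes a set of answer labels that is consistent with that index array; the actual survey labels and the counts are not shown in this page, so both are hypothetical here.

```python
import numpy as np

# Hypothetical answer labels with made-up counts
answers = np.repeat(
    ["Less than 1 year", "1-2 years", "3-5 years", "5-10 years", "10+ years"],
    [10, 10, 15, 25, 40],
)

labels, cnts = np.unique(answers, return_counts=True)
print(labels)   # sorted as strings, not by duration

pct = 100 * cnts / cnts.sum()
ind = np.array([-1, 0, 2, 3, 1])   # reorder into ascending experience
labels, pct = labels[ind], pct[ind]

five_plus = f"{pct[-2:].sum():2.0f}%"   # share in the two most experienced groups
print(labels)
print(five_plus)
```

String sorting places "10+ years" before "3-5 years" and puts "Less than 1 year" last, which is why the last element is moved to the front and index 1 is moved to the end.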

Programming Languages#

401 (77%) of survey participants shared their experience with other programming languages. 64% of respondents are familiar with C / C++, and 40% with Matlab.

pl = data['prog_lang'][data['prog_lang'] != '']
num_respondents = len(pl)
glue('2021_num_proglang_respondents', gluval(len(pl), data.shape[0]), display=False)

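# Flatten & remove 'Other' write-in option. The option label itself contains a
# comma, so it splits into two trailing entries; the [:-2] slice below drops them.
other = 'Other (please specify, using commas to separate individual entries)'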
apl = []
for row in pl:
    if 'Other' in row:
        row = ','.join(row.split(',')[:-2])
        if len(row) < 1:
            continue
    apl.extend(row.split(','))
labels, cnts = np.unique(apl, return_counts=True)
cnts = 100 * cnts / num_respondents
I = np.argsort(cnts)
labels, cnts = labels[I], cnts[I]

fig, ax = plt.subplots(figsize=(12, 8))
ax.barh(np.arange(len(cnts)), cnts, align='center')
ax.set_yticks(np.arange(len(cnts)))
ax.set_yticklabels(labels)
ax.set_xlabel("Percentage of Respondents")
ax.set_title("Programming Language Familiarity")
fig.tight_layout()

# Highlight two most popular
# (str() keeps the np.str_(...) repr out of the rendered text)
glue('2021_num_top_lang', f"{cnts[-1]:2.0f}%", display=False)
glue('2021_top_lang', str(labels[-1]), display=False)
glue('2021_num_2nd_lang', f"{cnts[-2]:2.0f}%", display=False)
glue('2021_second_lang', str(labels[-2]), display=False)
[Figure: horizontal bar chart "Programming Language Familiarity", showing the percentage of respondents per language]
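
The flattening of multi-select answers can be traced on three made-up rows. Each respondent's selections are stored as one comma-joined string, and the "Other" option label itself contains a comma, which is why the last two pieces are dropped when it is present.

```python
import numpy as np

other = "Other (please specify, using commas to separate individual entries)"
# Made-up multi-select answers, one comma-joined string per respondent
pl = np.array([
    "C / C++,Matlab",
    "C / C++," + other,
    other,
])
num_respondents = len(pl)

apl = []
for row in pl:
    if "Other" in row:
        # The option label contains a comma, so it occupies the last two pieces
        row = ",".join(row.split(",")[:-2])
        if len(row) < 1:
            continue   # the respondent selected only "Other"
    apl.extend(row.split(","))

labels, cnts = np.unique(apl, return_counts=True)
percent = 100 * cnts / num_respondents
print(dict(zip(labels.tolist(), percent.round(1).tolist())))
```

Because the denominator is the number of respondents rather than the number of selections, the percentages in the chart need not sum to 100.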

84 (21%) of respondents reported familiarity with computer languages other than those listed above. Of these, Rust was the most popular, with 17 (4%) respondents using it. A listing of the other reported languages can be found below.

['"' '"cobol' '"tcl' '-' 'amigae' 'asembler mc68k' 'asembler x86'
 'assembly' 'awk' 'bash' 'basic' 'clojure' 'cython' 'dart' 'dpl' 'elixir'
 'elm' 'f#' 'forth' 'go' 'golang' 'haskel' 'haskell' 'idl' 'kotlin' 'lisp'
 'llvm ir' 'logo' 'lua' 'mathematica' 'matlab' 'mysql' 'nim' 'nim lang'
 'nix' 'none' 'none except python' 'objective-c' 'ocaml' 'pascal' 'perl'
 'php' 'prolog' 'python' 'rust' 'scala' 'scheme' 'shell' 'sql' 'stata'
 'swift' 'systemverilog' 't-sql' 'tcl' 'tcl/tk' 'tex' 'typescript' 'vba'
 'verilog' 'visual basic' 'visual foxpro' 'what does familiar mean?'
 'wolfram']
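
The cell that produced this listing is not shown here. As a rough sketch of how such a list could be built from free-text answers, assuming comma-separated write-ins that are trimmed and lowercased (the answers below are made up):

```python
import numpy as np

# Made-up free-text answers; '' marks a respondent who wrote nothing
write_ins = np.array(["Rust, Go", "rust", " Haskell ,golang", ""])

entries = []
for row in write_ins[write_ins != ""]:
    # Split on commas, then trim whitespace and lowercase each entry
    entries.extend(part.strip().lower() for part in row.split(","))

labels, cnts = np.unique(entries, return_counts=True)
most_common = str(labels[np.argmax(cnts)])
print(labels)
print(most_common, cnts.max())
```

Simple normalization of this kind does not merge synonyms or misspellings, so pairs such as 'go' and 'golang' or 'haskel' and 'haskell' remain separate entries, as in the listing above.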

Code Sharing#

418 (80%) of survey participants shared information on how many others they typically share code with. The most common answer was 1-2 people.

from numpy_survey_results.utils import flatten

share_code = data['share_code'][data['share_code'] != '']
labels, cnts = np.unique(flatten(share_code), return_counts=True)

# Sort categories in ascending order (i.e. "0", "1-2", "3-5", "5-10", "10+")
ind = np.array([0, 1, 3, 4, 2])
labels, cnts = labels[ind], cnts[ind]

fig, ax = plt.subplots(figsize=(12, 8))
ax.bar(
    np.arange(len(labels)),
    100 * cnts / share_code.shape[0], 
    tick_label=labels,
)
ax.set_ylabel('Percentage of Respondents')
ax.set_xlabel('Number of people you typically share code with')
fig.tight_layout()

# Highlights most popular
glue('2021_top_share', str(labels[np.argmax(cnts)]), display=False)

# Number who answered question
glue(
    '2021_num_share_code',
    gluval(share_code.shape[0], data.shape[0]),
    display=False
)
[Figure: bar chart of the percentage of respondents by number of people they typically share code with]
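
The cell above wraps the most common label in `str()` before gluing it. A short sketch of why that matters, using made-up counts: indexing a NumPy string array yields a NumPy string scalar, and under NumPy 2.x the `repr` of such a scalar includes an `np.str_(...)` wrapper that can leak into rendered text.

```python
import numpy as np

labels = np.array(["0", "1-2", "3-5"])
cnts = np.array([5, 20, 10])    # made-up counts

top = labels[np.argmax(cnts)]   # a NumPy string scalar, not a plain str
print(type(top).__name__)
print(str(top))                 # plain text, safe to embed in prose
```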