
Hi there,

The Data Analytics Practice Committee (DAPC), the Young Data Analytics Working Group (YDAWG), and the Actuaries Institute are pleased to bring the latest in the world of data analytics to your inbox and to share some of our recent work with you.

In this edition of the Data Analytics Newsletter, we cover exciting research from Google on using machine learning to improve code completion, another regulator expressing concerns about commercial data collection, and a real-life example of how DALL·E 2 can help you pick images for your writing and save you some time. Let’s dive in!

Cutting Edge News

ML-enhanced code completion improves developer productivity

The increasing complexity of code poses a key challenge to productivity in software engineering. Code completion has been an essential tool that has helped mitigate this complexity in integrated development environments (IDEs).

Recent research has demonstrated that large language models enable longer and more complex code suggestions than the conventional rule-based semantic engines. However, the question of how code completion powered by machine learning (ML) impacts developer productivity, beyond perceived productivity and accepted suggestions, remains open.

In this article, the researchers describe how they combined ML with a rule-based semantic engine (SE) to develop a novel transformer-based hybrid semantic ML code completion tool, now available to internal Google developers.

 

RStudio is becoming Posit

RStudio has been rebranded as Posit. The change reflects the company’s vision of expanding beyond R to other languages (e.g. Python), for which the RStudio name would be too restrictive. The company clarified that the name change does not signify a shift away from R or a belief that Python is somehow supplanting R for data science, as some argue on social media and elsewhere. Read the full announcement.

 

AI predicts shape of nearly every known protein

Researchers have used AlphaFold — the revolutionary artificial intelligence (AI) network — to predict the structures of more than 200 million proteins from some one million species, covering almost every known protein on the planet.

The data dump is freely available on a database set up by DeepMind, the London-based AI company owned by Google that developed AlphaFold, together with the European Molecular Biology Laboratory’s European Bioinformatics Institute, an intergovernmental organisation. Find out more.

Regulation

What the FTC’s scrutiny of data collection and security may mean

Echoing the European Union’s concerns about the collection of public data for commercial purposes (covered in our June newsletter), the US Federal Trade Commission (FTC) recently weighed in on data privacy and how businesses handle data collection.

The FTC said it would explore introducing rules on what it calls 'commercial surveillance', meaning the collection and analysis of, and commercial profit from, data gathered from and about the public. The FTC also claimed that the massive scale of such surveillance increases the risks of data breaches and manipulation. Find out more.

Technical

Why do tree-based models still outperform deep learning on tabular data?

 

While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear, and tree-based models often still come out ahead. To understand this gap, the authors conducted an empirical investigation into the differing inductive biases of tree-based models and neural networks (NNs). Their findings point to three challenges that should guide researchers aiming to build tabular-specific NNs. Such a model should:

  1. Be robust to uninformative features
  2. Preserve the orientation of the data
  3. Be able to easily learn irregular functions.

Find out more.
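
For readers who would like to see the second challenge in action, here is a toy sketch of our own (it is not the authors' experiment). It assumes a synthetic two-feature dataset whose label depends on one feature only, and compares the best single axis-aligned split, the building block of a decision tree, before and after rotating the data by 45 degrees. The helper names `best_stump_accuracy` and `rotate` are made up for this illustration.

```python
import math
import random


def best_stump_accuracy(points, labels):
    """Best accuracy of any single axis-aligned threshold split (a depth-1 tree)."""
    n = len(points)
    best = 0.0
    for axis in range(2):
        # Try every observed value on this axis as a candidate threshold.
        for threshold in sorted({p[axis] for p in points}):
            correct = sum((p[axis] > threshold) == y for p, y in zip(points, labels))
            # A stump may predict either side as the positive class.
            best = max(best, correct / n, 1 - correct / n)
    return best


def rotate(points, degrees):
    """Rotate 2-D points about the origin, mixing the two features together."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Synthetic data (an assumption for illustration): the label depends only on
# the first feature, so the decision boundary is aligned with an axis.
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
labels = [x > 0 for x, _ in points]

original_accuracy = best_stump_accuracy(points, labels)
rotated_accuracy = best_stump_accuracy(rotate(points, 45), labels)
print("original:", original_accuracy, "rotated:", rotated_accuracy)
```

On the original data a single split separates the classes perfectly. After the rotation the same information is spread across both features, so no single axis-aligned split can do so. Tabular columns usually carry individual meaning, which is why the orientation of the data is worth preserving.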

 

We replaced all our blog thumbnails using DALL·E 2 for $45. Here’s what we learned

 

Blog posts with images get more engagement from readers. However, authors might not have time to hand-pick images, especially for technical topics. Can AI-generated images from DALL·E 2 (an AI we have mentioned in several prior newsletters) make better blog thumbnails, do it cheaper, and generally just be more fun? The answer is yes, and this is how to do it.

 

AI Coding with CodeRL: Toward mastering program synthesis with deep reinforcement learning

 

CodeRL is a new framework for program synthesis through holistic integration of pre-trained language models and deep reinforcement learning. By utilising unit test feedback as part of model training and inference, and integrating with an improved CodeT5 model, CodeRL achieves state-of-the-art results on competition-level programming tasks.

In this article, the research team details the technical aspects of CodeRL and some of the use cases in which CodeRL could improve software development efficiency and accessibility.
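
To give a feel for what "unit test feedback" could look like as a training signal, here is a minimal sketch of our own. It is not CodeRL's implementation: the function name `unit_test_reward`, the convention that a candidate program defines `solve`, and the four reward values are all assumptions chosen for illustration. The idea is simply that a generated program is compiled and run against unit tests, and the outcome is turned into a number that a reinforcement learning loop could optimise.

```python
# Illustrative reward values (assumptions, not taken from the article).
REWARD_COMPILE_ERROR = -1.0
REWARD_RUNTIME_ERROR = -0.6
REWARD_FAILED_TEST = -0.3
REWARD_ALL_PASSED = 1.0


def unit_test_reward(source, tests, func_name="solve"):
    """Score a candidate program by running it against (args, expected) tests.

    Warning: exec() runs arbitrary code. A real system would sandbox this.
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return REWARD_COMPILE_ERROR
    namespace = {}
    try:
        exec(code, namespace)
        func = namespace[func_name]
        results = [func(*args) == expected for args, expected in tests]
    except Exception:
        # Any error raised while defining or calling the function.
        return REWARD_RUNTIME_ERROR
    return REWARD_ALL_PASSED if all(results) else REWARD_FAILED_TEST


# Hypothetical task: add two numbers.
tests = [((1, 2), 3), ((0, 0), 0)]
candidates = {
    "correct": "def solve(a, b):\n    return a + b\n",
    "wrong": "def solve(a, b):\n    return a - b\n",
    "crash": "def solve(a, b):\n    return a / 0\n",
    "broken": "def solve(a, b) return a + b\n",
}
rewards = {name: unit_test_reward(src, tests) for name, src in candidates.items()}
print(rewards)
```

Grading the outcomes (does not compile, crashes, runs but fails a test, passes everything) gives the model a richer signal than a simple pass or fail, which is one plausible reason to use unit tests in this way.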

Humor

Do data-driven companies actually win?

Imagine you are a venture capital fund manager looking at five nearly identical companies with similar products and operating models. The only differences between them are the management styles and priorities of their executives. Some rely on intuition built from prior experience, some focus on operational discipline, and some focus on data.

Which company do you choose to back? Which one is more likely to be successful in the long term? In this rather humorous thought experiment, the author discusses what data-driven really means for a company when there are a number of attributes and potential trade-offs to consider.

Editor's note

We hope you enjoyed this month’s content. It’s exciting to see the continuous development in the ML-powered code auto-completion space. This type of beneficial research will help improve the accessibility of data analytics and coding in general. As usual, if you come across any interesting reading over the month, please reach out and drop us a line!

Check out Actuaries Digital for more data-related material and the DAPC microsite for great learning resources and past editions of the Data Analytics Newsletter.

Jacky Poon, Henry Ma, Grant Lian and Dan Wang
Editors, Data Analytics Newsletter

Disclaimer: The Institute wishes it to be understood that any opinions put forward in this publication are not necessarily those of the Institute.

Actuaries Institute
Level 2, 50 Carrington Street
Sydney NSW 2000, Australia
t +61 (0) 2 9239 6100

This email may contain privileged and/or confidential information. If you are not the intended recipient, please delete the email and notify the Actuaries Institute immediately on +61 (0) 2 9239 6100 or by return email. You must not disseminate, copy or take any action in reliance on the email. Neither any privilege nor confidentiality in the contents of this email is waived, lost or destroyed by reason that it has been transmitted other than to the intended addressee. If you send an email to us (including any emails addressed to a staff email address) the information in your email (including any ‘Personal Information’ as defined in the Privacy Act 1988 (Cth)) may be retained on our systems in accordance with our Privacy Policy and applicable data retention procedures.

 