Predicting the stock market using neural networks in python

Stock Market and Modern Portfolio analysis

Stock market prediction has always been the philosopher’s stone for those who seek to turn lead into gold. It has been elusive for most of us; however, as we enter the age of information, the tables seem to have turned.

Getting Started

When I say get rich, I mean rich in data; after all, data is the new currency. Before we begin, let’s make one thing clear: we cannot predict with 100% accuracy what the price is going to be, and even 90% confidence is unattainable. For this article, I am referencing the DataCamp course Machine Learning for Finance in Python.

Exploratory Data Analysis:

The first step is to sanity-check the data. Here we generally look at the columns and their data types.

import matplotlib.pyplot as plt

print(lng_df.head())  # examine the LNG DataFrame
print(spy_df.head())  # examine the SPY DataFrame
# Plot the Adj_Close columns for SPY and LNG
spy_df['Adj_Close'].plot(label='SPY', legend=True)
lng_df['Adj_Close'].plot(label='LNG', legend=True, secondary_y=True)
plt.show()  # show the plot
plt.clf()  # clear the plot space

Handling Anomalies:

To handle anomalies, we generally use scatter plots and box plots to locate the suspect data points, then replace them with the mean, the median, or a forward fill (ffill) or backward fill (bfill). Since this is time-series data, we won’t be checking for anomalies here.
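As a minimal sketch of the replacement step described above, the snippet below flags outliers with the 1.5 × IQR rule (the same rule a box plot uses for its whiskers) and forward-fills them. The toy prices and the helper name replace_outliers_ffill are hypothetical and are not part of the course material.

```python
import pandas as pd


def replace_outliers_ffill(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mask points outside the k * IQR fences and forward-fill them."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Points outside the fences become NaN, then take the previous valid value
    masked = series.where((series >= lower) & (series <= upper))
    return masked.ffill()


# Hypothetical toy prices with one obviously bad tick (1000.0)
prices = pd.Series([10.0, 10.2, 10.1, 1000.0, 10.3, 10.4])
cleaned = replace_outliers_ffill(prices)
print(cleaned.tolist())
```

Forward fill is used here because, in a time series, the previous observation is usually a more sensible stand-in than a global mean or median.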

Feature Engineering

Feature engineering creates new variables from the existing dataset that correlate strongly with the target variable. Here we generate two sets of features: percent changes and technical indicators.

pct_change is the percent change in a value relative to the previous value.

# Create 5-day % changes of Adj_Close for the current day, and 5 days in the future
lng_df['5d_future_close'] = lng_df['Adj_Close'].shift(-5)
lng_df['5d_close_future_pct'] = lng_df['5d_future_close'].pct_change(5)
lng_df['5d_close_pct'] = lng_df['Adj_Close'].pct_change(5)
# Calculate the correlation matrix between the 5d close percentage changes (current and future)
corr = lng_df[['5d_close_pct', '5d_close_future_pct']].corr()
print(corr)
# Scatter the current 5-day percent change vs the future 5-day percent change
plt.scatter(lng_df['5d_close_pct'], lng_df['5d_close_future_pct'])
plt.show()
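To see what shift() and pct_change() do to row alignment, here is a small sketch on made-up prices. It uses a 2-day horizon instead of 5 to keep the table short; the column names prefixed with 2d_ and the toy values are hypothetical. Passing fill_method=None is my own choice so that the missing future prices at the end stay missing rather than being filled forward.

```python
import pandas as pd

# Hypothetical toy prices growing 10% per day
toy = pd.DataFrame({'Adj_Close': [100.0, 110.0, 121.0, 133.1, 146.41]})

# Price two rows ahead, aligned to the current row
toy['2d_future_close'] = toy['Adj_Close'].shift(-2)
# Change from today to two days ahead: (future - today) / today
toy['2d_close_future_pct'] = toy['2d_future_close'].pct_change(2, fill_method=None)
# Change from two days ago to today
toy['2d_close_pct'] = toy['Adj_Close'].pct_change(2)

print(toy)
```

Only the middle row has both a past and a future change, which is why the real dataset needs its NaN rows dropped before the features and targets can be used for modelling.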

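Another common technical indicator is the relative strength index (RSI). It is defined by:

$$\mathrm{RSI} = 100 - \frac{100}{1 + \mathrm{RS}}, \qquad \mathrm{RS} = \frac{\text{average gain over } n \text{ periods}}{\text{average loss over } n \text{ periods}}$$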

import talib

feature_names = ['5d_close_pct']  # a list of the feature names for later

# Create moving averages and rsi for timeperiods of 14, 30, 50, and 200
for n in [14, 30, 50, 200]:
    # Create the moving average indicator and divide by Adj_Close
    lng_df['ma' + str(n)] = talib.SMA(lng_df['Adj_Close'].values,
                                      timeperiod=n) / lng_df['Adj_Close']
    # Create the RSI indicator
    lng_df['rsi' + str(n)] = talib.RSI(lng_df['Adj_Close'].values, timeperiod=n)

    # Add rsi and moving average to the feature name list
    feature_names = feature_names + ['ma' + str(n), 'rsi' + str(n)]
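For intuition about what the RSI measures, the sketch below computes it in plain pandas using the standard definition RSI = 100 - 100 / (1 + RS), with RS taken as the ratio of simple n-period averages of gains and losses. The helper simple_rsi is hypothetical, and its values will not necessarily match talib.RSI, which smooths the averages differently.

```python
import pandas as pd


def simple_rsi(close: pd.Series, n: int = 14) -> pd.Series:
    """RSI from plain n-period averages of gains and losses."""
    delta = close.diff()
    gains = delta.clip(lower=0)      # positive moves, zeros elsewhere
    losses = -delta.clip(upper=0)    # magnitude of negative moves
    avg_gain = gains.rolling(n).mean()
    avg_loss = losses.rolling(n).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


# Hypothetical toy closes: three up moves and one down move of equal size
close = pd.Series([10.0, 11.0, 10.0, 11.0, 12.0])
rsi = simple_rsi(close, n=4)
print(rsi.iloc[-1])
```

With three gains of 1 and one loss of 1 in the window, RS is 3, so the last RSI value lands at 75: the closer to 100, the more the recent moves have been gains.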



Neural nets can capture complex interactions between variables but are difficult to set up and understand. Recently, they have been beating human experts in many fields, including image recognition and gaming (check out AlphaGo) — so they have great potential to perform well.

To build our nets we’ll use the keras library. This is a high-level API that allows us to quickly make neural nets, yet still exercise a lot of control over the design. The first thing we’ll do is create almost the simplest net possible: a 3-layer net that takes our inputs and predicts a single value. Much like the sklearn models, keras models have a .fit() method that takes arguments of (features, targets).

import tensorflow as tf
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import r2_score

# Create loss function: squared error, multiplied by a penalty when the
# prediction has the wrong sign (the wrong direction of the move)
def sign_penalty(y_true, y_pred):
    penalty = 100.
    loss = tf.where(tf.less(y_true * y_pred, 0),
                    penalty * tf.square(y_true - y_pred),
                    tf.square(y_true - y_pred))
    return tf.reduce_mean(loss, axis=-1)

# Create the model
model_1 = Sequential()
model_1.add(Dense(100, input_dim=scaled_train_features.shape[1], activation='relu'))
model_1.add(Dense(20, activation='relu'))
model_1.add(Dense(1, activation='linear'))

# Fit the model (the loss function must be defined before it is used here)
model_1.compile(optimizer='adam', loss=sign_penalty)
history = model_1.fit(scaled_train_features, train_targets, epochs=25)

# Plot the losses from the fit
plt.plot(history.history['loss'])
# Use the last loss as the title
plt.title('loss:' + str(round(history.history['loss'][-1], 6)))
plt.show()

# Calculate R^2 score
train_preds = model_1.predict(scaled_train_features)
test_preds = model_1.predict(scaled_test_features)
print(r2_score(train_targets, train_preds))
print(r2_score(test_targets, test_preds))

# Plot predictions vs actual
plt.scatter(train_preds, train_targets, label='train')
plt.scatter(test_preds, test_targets, label='test')
plt.legend()
plt.show()
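To make the effect of the custom loss concrete, here is a NumPy mirror of sign_penalty evaluated on two hand-picked predictions. The function name sign_penalty_np and the sample values are hypothetical; the sketch only illustrates the arithmetic and is not a drop-in replacement for the TensorFlow version.

```python
import numpy as np


def sign_penalty_np(y_true, y_pred, penalty=100.0):
    """Squared error, scaled by `penalty` wherever the signs disagree."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sq_err = np.square(y_true - y_pred)
    wrong_direction = (y_true * y_pred) < 0
    return np.mean(np.where(wrong_direction, penalty * sq_err, sq_err))


# Same absolute error (0.02), but only the second prediction gets the direction wrong
same_sign = sign_penalty_np([0.01], [0.03])
opposite_sign = sign_penalty_np([0.01], [-0.01])
print(same_sign, opposite_sign)
```

The second loss is 100 times the first even though both predictions miss by the same amount, which pushes the network to get the direction of the move right before it refines the magnitude.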

Modern Portfolio Theory

Modern portfolio theory (MPT) is a theory on how risk-averse investors can construct portfolios to optimize or maximize expected return based on a given level of market risk, emphasizing that risk is an inherent part of higher reward. According to the theory, it’s possible to construct an “efficient frontier” of optimal portfolios offering the maximum possible expected return for a given level of risk.

Our first step towards calculating modern portfolio theory (MPT) portfolios is to get daily and monthly returns. Eventually, we’re going to get the best portfolios of each month based on the Sharpe ratio.

The Sharpe ratio was developed by Nobel laureate William F. Sharpe and is used to help investors understand the return of an investment compared to its risk. The ratio is the average return earned in excess of the risk-free rate per unit of volatility or total risk. Subtracting the risk-free rate from the mean return allows an investor to better isolate the profits associated with risk-taking activities. Generally, the greater the value of the Sharpe ratio, the more attractive the risk-adjusted return.
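As a small illustration of that definition, the sketch below computes a Sharpe ratio for two made-up return series with the same mean but different volatility. The function name sharpe_ratio, the default risk-free rate of 0, and the sample returns are assumptions for the example only.

```python
import numpy as np


def sharpe_ratio(returns, risk_free_rate=0.0):
    """Mean excess return divided by the sample standard deviation of returns."""
    returns = np.asarray(returns, dtype=float)
    excess = returns - risk_free_rate
    return excess.mean() / returns.std(ddof=1)


# Hypothetical monthly returns: both average 1.5%, but one is much bumpier
steady = [0.01, 0.02, 0.01, 0.02]
bumpy = [0.06, -0.03, 0.05, -0.02]
print(sharpe_ratio(steady), sharpe_ratio(bumpy))
```

The steadier series scores higher because it delivers the same average return with less volatility.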

The easiest way to do this is to put all our stock prices into one DataFrame, then to resample them to the daily and monthly time frames. We need daily price changes to calculate volatility, which we will use as our measure of risk.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Join the price data into one DataFrame
# (assumes each DataFrame holds one price column named after its ticker)
full_df = pd.concat([lng_df, spy_df, smlv_df], axis=1).dropna()
# Resample the full dataframe to monthly timeframe
monthly_df = full_df.resample('BMS').first()
# Calculate daily returns of stocks
returns_daily = full_df.pct_change()
# Calculate monthly returns of the stocks
returns_monthly = monthly_df.pct_change().dropna()

# Daily covariance of stocks (for each monthly period)
covariances = {}
rtd_idx = returns_daily.index
for i in returns_monthly.index:
    # Mask daily returns for each month and year, and calculate covariance
    mask = (rtd_idx.month == i.month) & (rtd_idx.year == i.year)
    # Use the mask to get daily returns for the current month and year of monthly returns index
    covariances[i] = returns_daily[mask].cov()

portfolio_returns, portfolio_volatility, portfolio_weights = {}, {}, {}
# Get portfolio performances at each month
for date in sorted(covariances.keys()):
    cov = covariances[date]
    for portfolio in range(10):
        weights = np.random.random(3)
        weights /= np.sum(weights)  # /= divides weights by their sum to normalize
        returns = np.dot(weights, returns_monthly.loc[date])
        volatility = np.sqrt(np.dot(weights.T, np.dot(cov, weights)))
        portfolio_returns.setdefault(date, []).append(returns)
        portfolio_volatility.setdefault(date, []).append(volatility)
        portfolio_weights.setdefault(date, []).append(weights)

# Get latest date of available data
date = sorted(covariances.keys())[-1]
# Plot efficient frontier
# warning: this can take at least 10s for the plot to execute...
plt.scatter(x=portfolio_volatility[date], y=portfolio_returns[date], alpha=.1)
plt.show()
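Each random portfolio is scored by its weighted return and by its volatility, the square root of w^T Σ w, where w is the weight vector and Σ is the covariance matrix of returns. The two-asset numbers below are made up purely to show the arithmetic.

```python
import numpy as np

# Hypothetical two-asset example
weights = np.array([0.6, 0.4])
asset_returns = np.array([0.02, 0.01])   # one month's return per asset
cov = np.array([[0.04, 0.00],
                [0.00, 0.01]])           # covariance matrix of returns

# Weighted sum of the asset returns
portfolio_return = np.dot(weights, asset_returns)
# sqrt(w^T . cov . w)
portfolio_vol = np.sqrt(np.dot(weights.T, np.dot(cov, weights)))
print(portfolio_return, portfolio_vol)
```

With zero covariance between the assets, the portfolio variance is just 0.6² × 0.04 + 0.4² × 0.01 = 0.016; a non-zero off-diagonal term would raise or lower it depending on its sign.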

Get best Sharpe ratios

We need to find the “ideal” portfolios for each date so we can use them as targets for machine learning. We’ll loop through each date in portfolio_returns, then loop through the portfolios we generated with portfolio_returns[date]. We’ll then calculate the Sharpe ratio, which is the return divided by volatility (assuming a risk-free return of 0).

We use enumerate() to loop through the returns for the current date (portfolio_returns[date]) and keep track of the index with i. Then we use the current date and current index to get the volatility of each portfolio with portfolio_volatility[date][i]. Finally, we get the index of the best Sharpe ratio for each date using np.argmax(). We’ll use this index to get the ideal portfolio weights soon.

# Empty dictionaries for sharpe ratios and best sharpe indexes by date
sharpe_ratio, max_sharpe_idxs = {}, {}
# Loop through dates and get sharpe ratio for each portfolio
for date in portfolio_returns.keys():
    for i, ret in enumerate(portfolio_returns[date]):
        # Divide returns by the volatility for the date and index, i
        sharpe_ratio.setdefault(date, []).append(ret / portfolio_volatility[date][i])
    # Get the index of the best sharpe ratio for each date
    max_sharpe_idxs[date] = np.argmax(sharpe_ratio[date])

# Calculate exponentially-weighted moving average of daily returns
ewma_daily = returns_daily.ewm(span=30).mean()
# Resample daily returns to first business day of the month with the first day for that month
ewma_monthly = ewma_daily.resample('BMS').first()
# Shift ewma for the month by 1 month forward so we can use it as a feature for future predictions
ewma_monthly = ewma_monthly.shift(1).dropna()

Make features and targets

To use machine learning to pick the best portfolio, we need to generate features and targets. Our features were created in the previous step: the exponentially weighted moving averages of daily returns. Our targets will be the weights of the best portfolios, the ones with the highest Sharpe ratio for each month.

targets, features = [], []
# Create features from price history and targets as ideal portfolio
for date, ewma in ewma_monthly.iterrows():
    # Get the index of the best sharpe ratio
    best_idx = max_sharpe_idxs[date]
    targets.append(portfolio_weights[date][best_idx])  # ideal weights for the month
    features.append(ewma)  # add ewma to features

targets = np.array(targets)
features = np.array(features)

# Get most recent (current) returns and volatility
date = sorted(covariances.keys())[-1]
cur_returns = portfolio_returns[date]
cur_volatility = portfolio_volatility[date]
# Plot efficient frontier with sharpe as point
plt.scatter(x=cur_volatility, y=cur_returns, alpha=0.1, color='blue')
best_idx = max_sharpe_idxs[date]
# Place an orange "X" on the point with the best Sharpe ratio
plt.scatter(x=cur_volatility[best_idx], y=cur_returns[best_idx], marker='x', color='orange')
plt.show()

The X marks the portfolio with the best Sharpe ratio.

Make predictions with a random forest

To fit a machine learning model to predict ideal portfolios, we need to create train and test sets for evaluating performance. We take our features and targets arrays and split them in time order based on a train_size we set. Often the train size may be around 70-90% of our data.

from sklearn.ensemble import RandomForestRegressor

# Make train and test features
train_size = int(0.85 * features.shape[0])
train_features = features[:train_size]
test_features = features[train_size:]
train_targets = targets[:train_size]
test_targets = targets[train_size:]

# Fit the model and check scores on train and test
rfr = RandomForestRegressor(n_estimators=300, random_state=42)
rfr.fit(train_features, train_targets)
print(rfr.score(train_features, train_targets))
print(rfr.score(test_features, test_targets))

# Get predictions from model on train and test
train_predictions = rfr.predict(train_features)
test_predictions = rfr.predict(test_features)

# Calculate and plot returns from our RF predictions and the SPY returns
test_returns = np.sum(returns_monthly.iloc[train_size:] * test_predictions, axis=1)
plt.plot(test_returns, label='algo')
plt.plot(returns_monthly['SPY'].iloc[train_size:], label='SPY')
plt.legend()
plt.show()

Evaluate returns

Let’s now see how our portfolio selection would perform compared with just investing in the SPY. We’ll do this to see if our predictions are promising, despite the low R^2 value.

We will set a starting value for our investment of Rs. 1,000, then loop through the returns from our predictions as well as from SPY. We’ll use the monthly returns from our portfolio selection and SPY and apply them to our starting cash balance. From this, we will get a month-by-month picture of how our investment is doing, and we can see how our predictions did overall vs the SPY. Next, we can plot our portfolio from our predictions and compare it to SPY.

# Calculate the effect of our portfolio selection on a hypothetical starting investment of 1,000
cash = 1000
algo_cash, spy_cash = [cash], [cash]  # set equal starting cash amounts
for r in test_returns:
    cash *= 1 + r
    algo_cash.append(cash)  # record the balance after each month

# Calculate performance for SPY
cash = 1000  # reset cash amount
for r in returns_monthly['SPY'].iloc[train_size:]:
    cash *= 1 + r
    spy_cash.append(cash)  # record the balance after each month

# Plot the algo_cash and spy_cash to compare overall returns
plt.plot(algo_cash, label='algo')
plt.plot(spy_cash, label='SPY')
plt.legend()  # show the legend
plt.show()
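The month-by-month compounding in the loops above can also be written as a cumulative product. The helper growth_curve and the sample returns below are hypothetical; this is just a compact equivalent under the same assumption that each month’s return is applied to the full running balance.

```python
import numpy as np


def growth_curve(start_cash, period_returns):
    """Balance after each period, with the starting balance first."""
    factors = 1 + np.asarray(period_returns, dtype=float)
    return np.concatenate([[start_cash], start_cash * np.cumprod(factors)])


# Hypothetical monthly returns: +10%, -10%, +5%
curve = growth_curve(1000, [0.10, -0.10, 0.05])
print(curve)
```

Note that a 10% gain followed by a 10% loss leaves the balance below where it started, which is why the returns are compounded rather than summed.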

The algorithm doesn’t take into account event-based changes such as an election or a sudden increase in petrol prices. Other datasets we could look into include shipment movements and the number of vehicles sold. We need to consider various macro factors to improve our model.

