PIC18F65J10 is not responding after some days.

Postby amit12 » Wed Jul 16, 2014 4:09 am

In my design I have interfaced the PIC18F65J10 with an FPGA through SPI and with a GPS module through a UART interface.
The PIC performs the following tasks:
1. A Timer0 ISR runs every 1 msec.
2. Every second, the GPS module sends a packet of about 100 bytes to the PIC, which is received through UART interrupts.
3. In addition, the FPGA requests the time from the PIC every 60 sec over SPI, and the PIC responds.

All of the above operations work fine for some days. Then suddenly, after 5 to 6 days of operation (since power-up), the PIC stops responding to the FPGA. As a check, I am blinking an LED from Timer0, and that LED also stops blinking. So the PIC has gone into a bad state, and the FPGA reports the status "command execution failed".

Can anyone help me, please? Why does the PIC show this kind of behavior after 4 to 5 days? What is going wrong?

Regards,
Amit Nischal
amit12
 
Posts: 2
Joined: Wed Jul 16, 2014 4:07 am
PIC experience: Professional 2-5 years with MCHP products

Re: PIC18F65J10 is not responding after some days.

Postby Joseph Watson » Wed Jul 16, 2014 6:46 am

Hi Amit,

I am sure that you do not really expect that somebody is going to be able to tell you what is wrong with your system with only this meager information as a guide. You will have to supply much more information even to have someone get started with your issue. In the meantime, I have a few questions that you should be asking yourself.

Have you measured the time period until failure? Is it consistently the same length of time? At one point, you say it is 5 to 6 days. Then you say 4 to 5 days. Does it actually vary? Does it vary that much or does it just seem like it? Has it ever failed in a much shorter or longer period of time?

Are there software counters in your application that might take several days to overflow?
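
To illustrate the kind of counter meant here: a free-running tick counter eventually wraps back to zero, and a deadline test written as `now >= start + timeout` gives the wrong answer around that moment. The sketch below is plain C with hypothetical names (`tick_t`, `timeout_expired_naive`, `timeout_expired_safe`); it is not taken from the code under discussion. A 32-bit counter incremented every 1 ms wraps after roughly 49.7 days, and a faster tick or a narrower counter wraps sooner.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t tick_t;   /* hypothetical free-running tick counter type */

/* Fragile: start + timeout can wrap past zero, so the comparison
   may report "expired" long before the timeout has really elapsed. */
bool timeout_expired_naive(tick_t now, tick_t start, tick_t timeout)
{
    return now >= (tick_t)(start + timeout);
}

/* Wrap-safe: unsigned subtraction gives the elapsed ticks modulo 2^32,
   which stays correct across a single wrap of the counter. */
bool timeout_expired_safe(tick_t now, tick_t start, tick_t timeout)
{
    return (tick_t)(now - start) >= timeout;
}
```

If the firmware compares timestamps anywhere, checking those comparisons against the wrap case is a cheap first step.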

There are many possible sources of this problem. Does the PIC chip have proper decoupling capacitors? Does it have a stable power supply? Is it possible there are electrical noise spikes getting into your system somewhere such as electromagnetic impulses? Do the failures ever coincide with other external events such as thunderstorms or the starting or stopping of heavy industrial equipment? Have you ever been able to cause a failure by doing something?

Are there operator controls interfaced to the system? You did not mention any. Might failures be taking place when operator controls are used or when external sensors such as limit switches are operated?

You said there is an LED that you are blinking in response to Timer 0. Does the LED always stop in the On or in the Off state or does it vary?

Have you conducted any tests to help lead you to the problem? What tests did you perform? What were the results? Have you done anything to narrow down the cause of the problem?

Have you tried stopping the GPS data to see if the system still fails? Are there other aspects of your system that you could stop or run at different speeds or with different data to see what effect it causes?

Can you duplicate the problem on a second system or is there only one system available for testing?

How about having various key locations in your program set a value into some externally visible register? Then when it stops, looking at the contents of the register will at least give you some idea of where the program was when the system failed. Repeat the test several times and see if it is always failing in the same place or a variety of places.
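
A minimal sketch of that idea in C follows. Everything named here is an assumption for illustration: `debug_port` stands in for whatever spare output latch is available on the board, and the checkpoint IDs are made up. The point is only that each key location writes a distinct value just before doing its work, so the last value left on the pins shows roughly where execution stopped.

```c
#include <stdint.h>

/* Stand-in for a spare output latch wired to test points or LEDs.
   On real hardware this would be the port's latch register; the
   name used here is hypothetical. */
volatile uint8_t debug_port;

/* Hypothetical checkpoint IDs, one per key location in the program. */
enum checkpoint_id {
    CP_MAIN_LOOP   = 0x01,
    CP_GPS_RX_ISR  = 0x02,
    CP_SPI_REQUEST = 0x03,
    CP_TIMER0_ISR  = 0x04
};

/* Record the most recently reached checkpoint where a scope, logic
   analyzer, or LED bank can still read it after the firmware stops. */
static inline void mark_checkpoint(enum checkpoint_id id)
{
    debug_port = (uint8_t)id;
}
```

One caveat: if checkpoints are written from both interrupt and main-line code, the last value seen may belong to either, so it can help to give interrupts their own pins.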

Clearly, one distinct possibility is that the problem is data dependent. That is, the failure may only occur when certain combinations of data show up in your system and with just the right timing to trigger the failure.

Of course, there are far more ways to test a system. I am sure that others will have good questions for you and good tests to try as well.

I recommend that you go gather some information that will help you to locate the problem. Be as creative about devising tests as you are about creating the system in the first place.

Good luck.
NCR once refused to hire me because I was too short. I'm still waiting on my growth spurt.
Joseph Watson
 
Posts: 49
Joined: Sat May 31, 2014 8:06 pm
Location: Ohio, USA
PIC experience: Experienced Hobbyist

Re: PIC18F65J10 is not responding after some days.

Postby amit12 » Wed Jul 16, 2014 11:48 am

Hi Joseph,

Have you measured the time period until failure? Is it consistently the same length of time? At one point, you say it is 5 to 6 days. Then you say 4 to 5 days. Does it actually vary? Does it vary that much or does it just seem like it? Has it ever failed in a much shorter or longer period of time?
Ans: No, it never fails in a shorter duration. I have checked this on two circuits. In one device the PIC went bad after 4 days, and in the other it went bad after about 6 days. So the time to failure is not consistent.

Are there software counters in your application that might take several days to overflow?
Ans: I didn't understand this. What does "software counter" refer to?

There are many possible sources of this problem. Does the PIC chip have proper decoupling capacitors? Does it have a stable power supply? Is it possible there are electrical noise spikes getting into your system somewhere such as electromagnetic impulses? Do the failures ever coincide with other external events such as thunderstorms or the starting or stopping of heavy industrial equipment? Have you ever been able to cause a failure by doing something?
Ans: None of the conditions mentioned above are playing any role here.

Are there operator controls interfaced to the system? You did not mention any. Might failures be taking place when operator controls are used or when external sensors such as limit switches are operated?
Ans: The only control is from the FPGA, which sends commands to the PIC for the various time sources (internal RTC, GPS, etc.).

You said there is an LED that you are blinking in response to Timer 0. Does the LED always stop in the On or in the Off state or does it vary?
Ans: The LED always stops in the Off state; it does not vary. I have run into this situation 4 times.

Have you conducted any tests to help lead you to the problem? What tests did you perform? What were the results? Have you done anything to narrow down the cause of the problem?
Ans: So far I have only conducted tests like these, with different timer periods. If you have a better idea, please suggest it.

Have you tried stopping the GPS data to see if the system still fails? Are there other aspects of your system that you could stop or run at different speeds or with different data to see what effect it causes?
Ans: As per our design, we can't stop the GPS because GPS is the time source. I am now running a test with a timer period of 5 msec, so that the PIC is less loaded. Earlier the timer period was 1 msec.

Can you duplicate the problem on a second system or is there only one system available for testing?
Ans: Yes, it happens on many systems. I have observed it on 4 devices.

How about having various key locations in your program set a value into some externally visible register? Then when it stops, looking at the contents of the register will at least give you some idea of where the program was when the system failed. Repeat the test several times and see if it is always failing in the same place or a variety of places.
Ans: I am thinking about performing this test and about what logic I could apply.
amit12
 
Posts: 2
Joined: Wed Jul 16, 2014 4:07 am
PIC experience: Professional 2-5 years with MCHP products

Re: PIC18F65J10 is not responding after some days.

Postby Tom Maier » Wed Jul 16, 2014 1:35 pm

All the advice Joseph gave you is good.

The LED you are toggling is an old trick called a "heartbeat". It is a cheap and simple troubleshooting device that can help you see if the processor is locked up. If this LED toggle is in a timer interrupt then it should never stop as long as your software never turns off the interrupt.

So... if you have done your heartbeat LED properly, the lack of the LED indicates that the processor has stalled or jumped code, ending up in an infinite loop. This is often an electrical problem.

What I mean by having the heartbeat done properly is that no matter what, even if your main line code gets locked in a loop, the heartbeat should still be toggling. Its purpose is to indicate the micro is running code.
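
As a rough sketch of such a heartbeat in C (names and numbers are assumptions, not the original poster's code): `heartbeat_led` stands in for the LED output pin, and the half period of 500 ticks assumes a 1 ms timer tick and a 1 Hz blink. The function is meant to be called once per tick from the timer interrupt handler.

```c
#include <stdint.h>
#include <stdbool.h>

#define HEARTBEAT_HALF_PERIOD_TICKS 500u  /* assumed: 1 ms tick, 1 Hz blink */

volatile bool heartbeat_led;       /* stand-in for the LED output pin */
static uint16_t heartbeat_ticks;   /* ticks since the last toggle */

/* Call once per timer tick from the timer interrupt handler.
   It contains no loops and no waiting, so it cannot stall the
   interrupt by itself. */
void heartbeat_tick(void)
{
    if (++heartbeat_ticks >= HEARTBEAT_HALF_PERIOD_TICKS) {
        heartbeat_ticks = 0;
        heartbeat_led = !heartbeat_led;
    }
}
```

Keeping the heartbeat this small is deliberate: if anything else in the same interrupt handler can block, the LED stops for that reason and no longer tells you whether the processor itself is alive.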

Try stripping your software down to just the heartbeat and seeing if it still locks up. That eliminates the possibility of a lockup caused by communication contention between the micro and other devices. If a simple heartbeat locks up, then it really points to an electrical problem.

The most common source of the type of electrical problem you are describing is a marginal violation of some parameter in the datasheet, such as:

1.) power droops
2.) power spikes
3.) too much current through an output pin
4.) too much total package current draw
5.) exposure to a high electromagnetic field
6.) bad solder joint
7.) oscillator circuit is marginal
8.) design is operating below the minimum required voltage (increased clock speed requires higher voltage on most parts)

While you have a couple of test circuits running with just a heartbeat, you can stay busy by getting out a scope and checking your power supply.

Are you using BOR and/or the watchdog? They should not be used as the "big fix", so you should still chase this problem down before implementing the fail-safe devices.

What is your voltage and clock speed?

Does that chip use a Vcore pin, and if so, what cap do you have on that pin?

Do you have a 0.1 uF cap right close to the micro's Vdd pin?

What are you using for an oscillator?
Tom Maier
Verified identity
 
Posts: 179
Joined: Mon May 26, 2014 2:37 pm
PIC experience: Professional 5+ years with MCHP products

Re: PIC18F65J10 is not responding after some days.

Postby Tom Maier » Wed Jul 16, 2014 2:47 pm

In my first post I was stressing that it is likely an electrical problem, but since the timer 0 is not only doing the heartbeat, but also handling the communications, then a contention in the communications might also cause this lockup. That is why you need to have a test of using ONLY the heartbeat, to see which problem this is.

A contention is when you have a protocol that can become locked up in an infinite loop due to a hole in the algorithm of the communication, such as infinitely waiting for something that never happens (which would be really bad in a timer interrupt, right?). Sometimes it is caused by a variable in the algorithm that is way out of range from what you anticipated, and it crashes the algorithm.
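
One common way to close that kind of hole is to bound every wait. The sketch below is generic C with hypothetical names (`ready_fn`, `wait_ready_bounded`, and the two example status functions); it does not reflect the actual SPI or UART code in this design. Instead of spinning forever on a status flag, the loop gives up after a fixed number of polls and tells the caller.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical status check, for example "has a byte arrived?" */
typedef bool (*ready_fn)(void);

/* Poll is_ready() at most max_polls times.
   Returns true if the event arrived, false if we gave up. */
bool wait_ready_bounded(ready_fn is_ready, uint16_t max_polls)
{
    for (uint16_t i = 0; i < max_polls; i++) {
        if (is_ready()) {
            return true;
        }
    }
    return false;   /* caller should count the timeout and recover */
}

/* Example status functions for trying the helper on a PC. */
static bool never_ready(void) { return false; }

static uint8_t polls_seen;
static bool ready_on_third_poll(void) { return ++polls_seen >= 3u; }
```

With `never_ready` the helper returns false after `max_polls` attempts instead of hanging, which is the behavior you want when the other device goes quiet. Even a bounded wait is best kept out of an interrupt handler where possible.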

So divide and conquer... does your algorithm have you locked up or is it electrical? Need to test and find out.

Staring at the same code over and over often doesn't reveal the bug.
Tom Maier
Verified identity
 
Posts: 179
Joined: Mon May 26, 2014 2:37 pm
PIC experience: Professional 5+ years with MCHP products

Re: PIC18F65J10 is not responding after some days.

Postby jtemples » Wed Jul 16, 2014 6:04 pm

Leaving the unit running on the debugger until it fails might be informative.
jtemples
Verified identity
 
Posts: 195
Joined: Sun May 25, 2014 2:23 am
Location: The 805
PIC experience: Professional 5+ years with MCHP products

